We define and advance data and environments to push the AI frontier
building benchmarks and collaborating with
Featured research
Vision and impact
We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.
Benchmarking & Evaluation
Build benchmarks that define and advance the AI frontier
Scaling Subject Matter Expertise
Define how subject matter experts encode their knowledge into data
RL, Training, & Data Valuation
Drive dataset development based on feedback from RL and model training
Community and open science
Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants
Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Bench Talks

Reading Group
Deep research expertise
Technical advisors and distinguished affiliates
Browse research blogs and academic papers
Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.
Snorkel AI contributes specialized datasets to Hazy Research’s “Intelligence-per-Watt” study, advancing how efficiently AI turns energy into intelligence.
We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.
The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but they quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark…
A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs. Large language models (LLMs) are showing remarkable results in solving complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. Their spatial reasoning capabilities, however, are less understood, even though such reasoning underlies many everyday tasks. We…
Snorkel’s “Trusted Scale” philosophy. Welcome to Part 4 of Snorkel AI’s rubric series. In previous posts, we explored how rubrics enable structured evaluation (Part 1), the spectrum of rubric types and use cases (Part 2), and the science behind designing and validating them (Part 3). In this latest installment, we pull back the curtain on how Snorkel puts these principles…
In recent months, there has been increasing interest in the area of multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic is particularly interesting and raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is…
Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…
Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In…
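To make the parsing point concrete, here is a minimal sketch of how two parsers can score the same responses differently. It is not the method from the research above: the parser functions, the sample responses, and the gold labels are all hypothetical, chosen only to illustrate a multiple-choice setup with options A to D.

```python
import re

# Hypothetical example: both parsers and all data below are illustrative
# assumptions, not taken from any Snorkel AI benchmark or codebase.

def parse_strict(response):
    """Accept only a response that is exactly one option letter A-D."""
    text = response.strip()
    return text if re.fullmatch(r"[A-D]", text) else None

def parse_lenient(response):
    """Prefer the letter after 'answer is'; else the last standalone A-D."""
    match = re.search(r"answer is\s*\(?([A-D])\)?", response, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else None

def accuracy(parser, responses, gold):
    """Fraction of responses whose parsed answer equals the gold label."""
    hits = sum(parser(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)

# Same model outputs, same gold labels; only the parser changes.
RESPONSES = ["B", "The answer is (C).", "I would pick D here.", "A"]
GOLD = ["B", "C", "D", "A"]

strict_acc = accuracy(parse_strict, RESPONSES, GOLD)
lenient_acc = accuracy(parse_lenient, RESPONSES, GOLD)
print(f"strict parser:  {strict_acc:.2f}")
print(f"lenient parser: {lenient_acc:.2f}")
```

On this toy data the strict parser rejects the two verbose responses, so the measured accuracy differs between parsers even though the underlying answers are identical. A lenient parser has its own failure modes, such as picking up a stray letter from the surrounding text, so neither choice is neutral.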