From cutting-edge research to enterprise and frontier impact
Deep research roots
Featured benchmarks
Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks.
These are just a few of our featured benchmarks—new ones are added regularly, so check back often to see the latest from our research team.
Agentic Coding
SnorkelUnderwrite
Finance Reasoning
Leaderboards
Challenging benchmarks for models and agents
Snorkel benchmarks are built with human expertise to test models on realistic tasks ranging from coding to financial analysis, healthcare, and more. For example, our SnorkelUnderwrite benchmark includes multi-turn agentic tasks specific to the insurance industry.
Rubrics
Aligning human expertise and automated evaluation
We investigate how to scalably develop rubrics that comprehensively cover the desired agentic capabilities and can be assessed reliably by both human experts and AI judges.
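To make the idea concrete, here is a minimal sketch of rubric-based scoring, where each criterion is a check that either a human expert or an AI judge could apply. All names and criteria are illustrative assumptions, not Snorkel's actual implementation.

```python
# Minimal rubric-scoring sketch (illustrative only).
# Each criterion is a yes/no check; a response's score is the
# fraction of criteria it satisfies, so the same rubric can be
# applied by a human grader or an automated judge.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # True if the response satisfies it

def score_response(response: str, rubric: list[Criterion]) -> float:
    """Return the fraction of rubric criteria the response passes."""
    passed = sum(c.check(response) for c in rubric)
    return passed / len(rubric)

# Toy rubric for a financial-reasoning answer.
rubric = [
    Criterion("cites a figure", lambda r: any(ch.isdigit() for ch in r)),
    Criterion("states a conclusion", lambda r: "therefore" in r.lower()),
]

print(score_response("Revenue rose 12%; therefore the outlook is positive.", rubric))
```

In practice the `check` functions would be replaced by expert annotations or calls to an LLM judge, but the aggregation step stays the same.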
RL Environments
Environments give agents a fully realized simulation
As tool calling and open-ended application requirements outgrow simple test frameworks, agent validation must use techniques that reproduce real-world variability. For example, our contributions to Terminal-Bench (tbench.ai) include containerized simulation environments.
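The core pattern behind such environments can be sketched as a reset/step loop in which the agent acts and the environment verifies task completion. The class, task, and reward logic below are illustrative assumptions, not the actual Terminal-Bench API.

```python
# Toy step-based environment sketch (illustrative only).
# The agent issues shell-like commands; the environment tracks state
# and rewards the agent only when the target artifact exists, mirroring
# how containerized benchmarks verify outcomes rather than transcripts.
from dataclasses import dataclass, field

@dataclass
class ShellTaskEnv:
    """Task: create a target file by issuing commands."""
    target: str = "results.txt"
    files: set = field(default_factory=set)

    def reset(self) -> str:
        self.files = set()
        return f"Task: create {self.target}"

    def step(self, command: str) -> tuple[str, float, bool]:
        """Apply one agent command; return (observation, reward, done)."""
        if command.startswith("touch "):
            self.files.add(command.split(" ", 1)[1])
        done = self.target in self.files
        reward = 1.0 if done else 0.0
        return f"files: {sorted(self.files)}", reward, done

env = ShellTaskEnv()
obs = env.reset()
obs, reward, done = env.step("touch results.txt")
print(reward, done)
```

A real environment would run inside a container with actual tools and stochastic state, but the agent-facing contract (observe, act, receive outcome-based reward) is the same.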
Browse blog posts and 100+ peer-reviewed academic papers
Parsing Isn’t Neutral: Why Evaluation Choices Matter
Shrinking the Generation-Verification Gap with Weak Verifiers
Data quality and rubrics: how to build trust in your models
Theoretical Physics Benchmark (TPBench): A Dataset and Study of AI Reasoning Capabilities in Theoretical Physics
WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks
The ALCHEmist: Automated Labeling 500x CHEaper Than LLM Data Annotators
Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models
Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory
Building FinQA: An Open RL Environment for Financial Reasoning Agents
How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

Backed by a $3M commitment, the Open Benchmarks Grants program, in partnership with Hugging Face, Prime Intellect, Together AI, Factory HQ, and Harbor, funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.
Applications are accepted on a rolling basis, starting March 1st.