Continual Learning Bench by Berkeley & Snorkel

We define and advance data and environments to push the AI frontier

Built on 10+ years of pioneering research in data-centric AI, including 250+ publications and benchmarks.

Browse research library

from the lab

Featured research

Research Paper

Accepted to MLSys

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Benchmarking Agents in Insurance Underwriting Environments

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Continual Learning Bench
Co-published with Berkeley

Terminal-Bench 2.0 (+3.0)
Co-authored with Laude Institute

BigLaw Bench: Research
Co-released with Harvey

SlopCode Bench
Co-released with UW-Madison

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Rapid Data Creation with Weak Supervision
Best of VLDB 2017

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Learn more

Bench Talks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Watch the latest episode

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

DEEP RESEARCH Expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist, Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder, Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Blog

Why Coding Agents Need Better Data, Evals, and Environments

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 06, 2026 • Justin Bauer

Blog

Benchmarks should shape the frontier, not just measure it

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. As the best benchmarks have driven how the field allocates research effort, the bar for what counts as useful has risen. Here, we share what’s now table stakes for useful benchmarks, and what separates…

Apr 06, 2026 • Vincent Sunn Chen

Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

To kick off our inaugural Benchtalks, a series dedicated to the researchers building these measurement toolkits, Snorkel AI co-founder Vincent Sunn Chen sat down with Alex Shaw, Founding MTS at Laude Institute and co-creator of Terminal-Bench and Harbor. Highlights More on Terminal-Bench: See the leaderboard and the catalog of tasks at tbench.ai. Explore Harbor: Learn how to scale your agent…

Mar 31, 2026 • Vincent Sunn Chen

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

TL;DR: We built FinQA — a financial question-answering environment with 290 expert-curated questions across 22 public companies, now available on OpenEnv. Agents use MCP tools to discover schemas, write constrained SQL queries, and answer multi-step questions from real SEC 10-K filings. Most open-source models struggle with this kind of multi-step tool use, and even frontier closed-source models, while more accurate,…

Mar 30, 2026 • Bhavishya Pohani

Blog

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

The Snorkel research team collaborated with the rLLM team at UC Berkeley on the Agentica project, using their open-source rLLM framework to fine-tune Qwen3-4B-Instruct-2507, delivering a model that beats Qwen3-235B-A22B on Snorkel AI’s expert-curated financial benchmarks – at 1/60th the size. A full breakdown of the results is published on the rLLM blog. The key insight? Just focus on…

Feb 17, 2026 • Chris Glaze

Blog

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants. Today, AI is marked by a growing asymmetry: the excitement around agentic AI is real, backed by quantitative progress on model cards and genuine leaps forward, especially in coding. But ask individuals or enterprises where they feel ready to deploy agentic automation in high-stakes, domain-specific settings outside of coding… and you will…

Feb 11, 2026 • Vincent Sunn Chen

Research Paper

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each…

Jan 30, 2026 • Snorkel Team

Blog

Part V: Future direction and emerging trends

Explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.

Dec 05, 2025 • Justin Bauer

Blog

Chat with the Terminal-Bench team

Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.

Nov 19, 2025 • Kobie Crawford, Fred Sala

Let’s research together

Join our team of leading researchers and help shape the future of AI.

View all careers
