Continual Learning Bench by Berkeley & Snorkel

We define and advance data and environments to push the AI frontier

Built on 10+ years of pioneering research in data-centric AI, including 250+ publications and benchmarks.

Browse research library

from the lab

Featured research

Research Paper

Accepted to MLSys

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Benchmark

Open Benchmarks Grants

Benchmarking Agents in Insurance Underwriting Environments

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

Benchmarking & Evaluation

Build benchmarks that define and advance the AI frontier

featured work

Continual Learning Bench
Co-published with Berkeley

Terminal-Bench 2.0 (+3.0)
Co-authored with Laude Institute

BigLaw Bench: Research
Co-released with Harvey

SlopCode Bench
Co-released with UW-Madison

Scaling Subject Matter Expertise

Define how subject matter experts encode their knowledge into data

featured work

Weak-to-Strong Generalization Through Data-Centric Lens
ICLR 2025

Rapid Data Creation with Weak Supervision
Best of VLDB 2017

RL, Training, & Data Valuation

Drive dataset development based on feedback from RL and model training

featured work

Learning from Less: Effectiveness of RLVR in Low Data and Compute Regimes
MLSys 2026

4B FinQA Model Outperforms 235B Model with the Right Data
Co-authored with Berkeley

RIFT: A Rubric Failure Mode Taxonomy and Automated Diagnostics
ICLR Workshop 2026

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.

Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.

Learn more

Bench Talks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Watch the latest episode

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

Deep research expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University

Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University

Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup

Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist

Snorkel AI

Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder

Snorkel AI

Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION

Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University

Professor of Computer Science

Yu Su

Ohio State University

Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face

Machine Learning Engineer

PUBLICATIONS

Browse research blogs and academic papers

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning,...

Research Paper

Accepted to MLSys 2026

Jul 02, 2026 • Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma

Blog

Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 29, 2026 • Snorkel Team

Blog

Agentic AI Evaluation: Closing the Gap with Better Benchmarks and Data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to…

Jun 22, 2026 • Snorkel Team

Blog

Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 20, 2026 • Vincent Sunn Chen

Blog

Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 18, 2026 • Chris Glaze

Blog

Cua-Bench: benchmarking computer-use agents on professional software

TL;DR: We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers:

1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness…

Jun 11, 2026 • Zhengyang (Jason) Qi, Armin Parchami

Blog

The Art and Science of Building Benchmarks That Shape the Field

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents in realistic,…

Jun 08, 2026 • Snorkel Team

Blog

Benchtalks #2: The Future of Coding Benchmarks with John Yang (SWE-Bench, ProgramBench)

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

Highlights

More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com.

More from John Yang: Publications and writing at john-b-yang.github.io.

Snorkel…

May 21, 2026 • Vincent Sunn Chen

Blog

Why Coding Agents Need Better Data, Evals, and Environments

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 06, 2026 • Justin Bauer

Let’s research together

Join our team of leading researchers and help shape the future of AI.

View all careers
