We define and advance data and environments to push the AI frontier

Built on 10+ years of pioneering research in data-centric AI,
including 250+ publications and benchmarks.

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.


Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.


Bench Talks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

Deep research expertise

Technical advisors and distinguished affiliates

Stephen Bach

Brown University
Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University
Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup
Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist, Snorkel AI
Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder, Snorkel AI
Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION
Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University
Professor of Computer Science

Yu Su

Ohio State University
Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face
Machine Learning Engineer
PUBLICATIONS

Browse research blogs and academic papers

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning,...
Research Paper
Accepted to MLSys 2026

Jul 02, 2026

Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma

Agents’ Last Exam: AI Benchmarking for Real Work
Blog

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 29, 2026
Agentic AI Evaluation: Closing the Gap with Better Benchmarks and Data
Blog

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to…

Jun 22, 2026
Benchtalks #3: We taught AI everything except how to learn
Blog

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 20, 2026
Continual learning and evaluating how AI agents learn across sequences of tasks
Blog

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 18, 2026
Cua-Bench: benchmarking computer-use agents on professional software
Blog

TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —…

The Art and Science of Building Benchmarks That Shape the Field
Blog

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents in realistic,…

Jun 08, 2026
Benchtalks #2: The Future of Coding Benchmarks with John Yang (SWE-Bench, ProgramBench)
Blog

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench. Highlights More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com. More from John Yang: Publications and writing at john-b-yang.github.io. Snorkel…

May 21, 2026
Why Coding Agents Need Better Data, Evals, and Environments
Blog

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 06, 2026

Let’s research together

Join our team of leading researchers and help shape the future of AI.