We define and advance data and environments to push the AI frontier

Built on 10+ years of pioneering research in data-centric AI,
including 250+ publications and benchmarks.

key research areas

Vision and impact

We help labs advance frontier models by working with domain experts to design and build complex, realistic datasets that drive model performance.

initiatives

Community and open science

Open benchmarks, conversations, and research for real-world AI performance.


Open Benchmarks Grants

Backed by a $3M commitment, the program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI systems are built and evaluated.


Bench Talks

Our podcast series at the intersection of AI evaluation, data quality, and real-world impact.

Reading Group

A recurring forum for researchers and practitioners to explore the latest frontier developments in AI while building meaningful connections within the community.

Deep research expertise

Technical advisors and distinguished affiliates


Stephen Bach

Brown University
Eliot Horowitz Assistant Professor, Computer Science Department

Jason Fries

Stanford University
Assistant Professor of Biomedical Data Science and of Medicine

Jared Dunnmon

Co-Founder & Chief Scientist, Stealth Startup
Prev. Dir. of AI at DIU

Fred Sala

Chief Scientist, Snorkel AI
Assistant Professor @ University of Wisconsin-Madison

Chris Ré

Co-Founder, Snorkel AI
Professor @ Stanford University

Ludwig Schmidt

Stanford University · LAION
Stanford researcher and LAION collaborator

Karthik Narasimhan

Princeton University
Professor of Computer Science

Yu Su

Ohio State University
Associate Professor of Computer Science and Engineering

Lewis Tunstall

Hugging Face
Machine Learning Engineer
PUBLICATIONS

Browse research blogs and academic papers

Chat with the Terminal-Bench team
Blog

Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.

Nov 19, 2025
Intelligence Per Watt: A New Metric for AI’s Future
Blog

Snorkel AI contributes specialized datasets to Hazy Research’s “Intelligence-per-Watt” study, advancing how efficiently AI turns energy into intelligence.

Nov 12, 2025
Snorkeling in RL environments
Blog

We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.

Nov 04, 2025
Automating Benchmark Design
Research Paper
Accepted to ICLR 2026

The rapid progress and widespread deployment of LLMs and LLM-powered agents has outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark…

Oct 30, 2025
Amanda Dsouza, Harit Vishwakarma, Zhengyang Qi, Justin Bauer, Derek Pham, Thomas Walshe, Armin Parchami, Frederic Sala, Paroma Varma
Introducing SnorkelSpatial
Blog

A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs. Large language models (LLMs) are showing remarkable results on solving complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. On the other hand, their spatial reasoning capabilities are less understood, even though such reasoning underlies many everyday tasks. We…

Oct 24, 2025
Scaling Trust: Rubrics in Snorkel’s Quality Process
Blog

Snorkel’s “Trusted Scale” philosophy. Welcome to Part 4 of Snorkel AI’s rubric series. In previous posts, we explored how rubrics enable structured evaluation (Part 1), the spectrum of rubric types and use cases (Part 2), and the science behind designing and validating them (Part 3). In this latest installment, we pull back the curtain on how Snorkel puts these principles…

Oct 16, 2025
Evaluating Multi-Agent Systems in Enterprise Tool Use
Blog

In recent months, there has been increasing interest in the area of multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic is particularly interesting and raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is…

Oct 09, 2025
Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark
Blog

Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing…

Sep 30, 2025
Parsing Isn’t Neutral: Why Evaluation Choices Matter
Blog

Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In…

Sep 26, 2025

Let’s research together

Join our team of leading researchers and help shape the future of AI.