Resource library

Explore our complete library of resources including blogs, benchmarks, research papers and more.

Blog

Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark

Kobie Crawford

September 30, 2025

Blog

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

Vincent Sunn Chen

February 11, 2026

Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

Vincent Sunn Chen

March 31, 2026

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

Bhavishya Pohani

March 30, 2026

Blog

The science of rubric design

Charles Dickens

September 11, 2025

Blog

The Standard for Agents You Can Trust: Lessons from the Federal Front Lines

In the first installment of Agentic in Action — a series about real AI deployments, not demos — Snorkel AI’s Kevin Olivieri sat down with three people who have spent their careers where trust isn’t optional: Chris Sniffen, Federal Applied AI Lead at Snorkel AI; John Hickey, President of August Schell; and Mike Baca, CIO of August Schell. The conversation focused on…

Jun 03, 2026

Snorkel Team

Blog

Benchtalks #2: The Future of Coding Benchmarks with John Yang (SWE-Bench, ProgramBench)

For our second Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with John Yang, a Stanford PhD student and creator of the SWE-bench franchise, SWE-smith, CodeClash, and most recently ProgramBench.

Highlights

More on ProgramBench: See the benchmark and the upcoming leaderboard at programbench.com.

More from John Yang: Publications and writing at john-b-yang.github.io. Snorkel…

May 21, 2026

Vincent Sunn Chen

Blog

Building AI-native systems for federal infrastructure with Rezaur Rahman

Christopher Sniffen recently sat down with Rezaur Rahman — CIO / CISO / CAIO at the Advisory Council on Historic Preservation — for a conversation on what it actually takes to build frontier AI for federal infrastructure. They get into the limits of frontier models on geospatial reasoning, mechanistic interpretability for applied AI, the trick that makes vision models useful…

May 14, 2026

Snorkel Team

Blog

Why Coding Agents Need Better Data, Evals, and Environments

Coding agents have moved from tab-complete to teammate. They autonomously inspect repositories, edit files, run commands, diagnose failures, and work through multi-step engineering tasks. That creates a harder reliability problem. A model that only suggests code is easy for a human to evaluate. A coding agent refactoring your repository and testing its own changes is much harder to supervise –…

May 06, 2026

Justin Bauer

Blog

Benchmarks should shape the frontier, not just measure it

Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. As the best benchmarks have driven how the field allocates research effort, the bar for what counts as useful has risen. Here, we share what’s now table stakes for useful benchmarks, and what separates…

Apr 06, 2026

Vincent Sunn Chen

Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

To kick off our inaugural Benchtalks, a series dedicated to the researchers building these measurement toolkits, Snorkel AI co-founder Vincent Sunn Chen sat down with Alex Shaw, Founding MTS at Laude Institute and co-creator of Terminal-Bench and Harbor.

Highlights

More on Terminal-Bench: See the leaderboard and the catalog of tasks at tbench.ai.

Explore Harbor: Learn how to scale your agent…

Mar 31, 2026

Vincent Sunn Chen

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

TL;DR: We built FinQA — a financial question-answering environment with 290 expert-curated questions across 22 public companies, now available on OpenEnv. Agents use MCP tools to discover schemas, write constrained SQL queries, and answer multi-step questions from real SEC 10-K filings. Most open-source models struggle with this kind of multi-step tool use, and even frontier closed-source models, while more accurate,…

Mar 30, 2026

Bhavishya Pohani

Blog

How Tool Discipline Let a 4B Model Outsmart a 235B Giant on Financial Tasks

The Snorkel research team collaborated with the rLLM team at UC Berkeley on the Agentica project, using their open-source rLLM framework to fine-tune Qwen3-4B-Instruct-2507, delivering a model that beats Qwen3-235B-A22B on Snorkel AI’s expert-curated financial benchmarks – at 1/60th the size. A full breakdown of the results is published on the rLLM blog. The key insight? Just focus on…

Feb 17, 2026

Chris Glaze

Blog

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

Today, AI is marked by a growing asymmetry: the excitement around agentic AI is real, backed by quantitative progress on model cards and genuine leaps forward, especially in coding. But ask individuals or enterprises where they feel ready to deploy agentic automation in high-stakes, domain-specific settings outside of coding… and you will…

Feb 11, 2026

Vincent Sunn Chen
