resources

Resource library

Explore our complete library of resources including blogs, benchmarks, research papers and more.

Blog

Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark

Announcing a $3M commitment to launch Open Benchmarks Grants

Kobie Crawford

September 30, 2025

Blog

Closing the Evaluation Gap in Agentic AI

Announcing a $3M commitment to launch Open Benchmarks Grants

Vincent Sunn Chen

February 11, 2026

Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

Announcing a $3M commitment to launch Open Benchmarks Grants

Vincent Sunn Chen

March 31, 2026

Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

Announcing a $3M commitment to launch Open Benchmarks Grants

Bhavishya Pohani

March 30, 2026

Blog

The science of rubric design

Announcing a $3M commitment to launch Open Benchmarks Grants

Charles Dickens

September 11, 2025

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning,...

Research Paper

Accepted to MLSys 2026

NEW

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Jul 02, 2026 •

Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma

Learn more about Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Agents’ Last Exam: AI Benchmarking for Real Work

Blog

NEW

Agents’ Last Exam: AI Benchmarking for Real Work

At our latest Snorkel AI Reading Group, Yiyou Sun and David (Xinyang) Han (UC Berkeley, Center for Responsible and Decentralized Intelligence) presented Agents’ Last Exam (ALE) — a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. ALE is a collaboration between Berkeley RDI, Snorkel AI, and 300+ expert contributors across 55 professional subfields. ALE asks a deceptively simple question: can…

Jun 29, 2026 •

Snorkel Team

Learn more about Agents’ Last Exam: AI Benchmarking for Real Work

Case study

From hours to seconds on CLO contract review with 94% end user acceptance

A top 10 US bank manages CLO portfolios totaling billions in assets, each governed by contracts up to 500 pages.

Jun 24, 2026 •

Snorkel Team

Learn more about From hours to seconds on CLO contract review with 94% end user acceptance

Case study

Conversational, decision-grade responses in 15 seconds

A global media intelligence firm analyzes hundreds of millions of sources daily – from public news, social, and broadcast to proprietary analyst-curated databases – to help large enterprise clients manage communications, reputation, and strategic decision-making. Their competitive advantage is the layer on top of publicly available data: in-house human editorial teams, proprietary scoring and analytics frameworks, and years of analyst judgment refined into decision-grade intelligence. When a crisis signal is building or a competitor’s narrative is gaining traction, speed and accuracy matter enormously. Historically, getting an answer meant waiting for a human analyst to manually aggregate across those sources: a process measured in hours, not seconds.

Jun 24, 2026 •

Snorkel Team

Learn more about Conversational, decision-grade responses in 15 seconds

Blog

Agentic AI Evaluation: Closing the Gap with Better Benchmarks and Data

Alex Ratner, co-founder and CEO of Snorkel AI, spoke at @Scale: Systems & Reliability about one of the most underappreciated problems in AI deployment: our ability to measure agents has been outpaced — arguably for the first time in the history of the field — by our ability to build them. The talk digs into what it actually takes to…

Jun 22, 2026 •

Snorkel Team

Learn more about Agentic AI Evaluation: Closing the Gap with Better Benchmarks and Data

Blog

Benchtalks #3: We taught AI everything except how to learn

For our third Benchtalks, the series dedicated to the researchers building the measurement toolkits that frontier labs hill-climb on, Snorkel AI co-founder Vincent Sunn Chen sat down with Parth Asawa, a PhD student at UC Berkeley advised by Matei Zaharia and Joey Gonzalez. Parth leads research on continual learning and is the creator of Continual Learning Bench, developed in collaboration…

Jun 20, 2026 •

Vincent Sunn Chen

Learn more about Benchtalks #3: We taught AI everything except how to learn

Blog

Continual learning and evaluating how AI agents learn across sequences of tasks

Most agent benchmarks evaluate each task as an independent episode. The agent receives a task, produces an answer, gets scored, and moves on. The next task starts as if the previous one never happened. That setup misses a core requirement for deployed agents. A coding agent, research assistant, data analyst, or workplace assistant should improve as it works across repeated…

Jun 18, 2026 •

Chris Glaze

Learn more about Continual learning and evaluating how AI agents learn across sequences of tasks

Blog

Cua-Bench: benchmarking computer-use agents on professional software

TL;DR We built a benchmark of 25 expert-authored KiCad schematic-editing tasks and ran a frontier computer-use agent against them. The headline numbers: 1. Why build a computer-use benchmark for electrical engineering? Most computer-use benchmarks today live in the same handful of apps: web browsers, file managers, generic productivity suites. Those evaluations are useful, but they share a structural weakness —…

Jun 11, 2026 •

Zhengyang (Jason) Qi , Armin Parchami

Learn more about Cua-Bench: benchmarking computer-use agents on professional software

Blog

The Art and Science of Building Benchmarks That Shape the Field

Vincent Sunn Chen spoke at AI Engineer London about what it actually takes to build benchmarks that move the field forward, not just measure it. The throughline is an asymmetry that keeps showing up across deployments and the 150+ proposals reviewed for the Open Benchmarks Grants: agent capabilities are climbing fast, but the ability to measure those agents in realistic,…

Jun 08, 2026 •

Snorkel Team

Learn more about The Art and Science of Building Benchmarks That Shape the Field

1 2 … 64

Join our newsletter

For expert advice, the latest research, and exclusive events.

By submitting this form, I acknowledge I will receive email updates from Snorkel AI, and I agree to the Terms of Use and acknowledge that my information will be used in accordance with the Privacy Policy.

Resource library

Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark

Closing the Evaluation Gap in Agentic AI

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

Building FinQA: An Open RL Environment for Financial Reasoning Agents

The science of rubric design

Join our newsletter

How do you want to work with Snorkel?