Resource library

Explore our complete library of resources including blogs, benchmarks, research papers and more.
Blog

Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark

Announcing a $3M commitment to launch Open Benchmarks Grants
September 30, 2025
Blog

Closing the Evaluation Gap in Agentic AI

February 11, 2026
Blog

Benchtalks #1: Alex Shaw (Terminal-Bench, Harbor) – Building the Benchmark Factory

March 31, 2026
Blog

Building FinQA: An Open RL Environment for Financial Reasoning Agents

March 30, 2026
Blog

The science of rubric design

September 11, 2025
Case study
DIU Enhances Decision-Making Resilience with Snorkel AI

Strategic dominance in the Indo-Pacific relies on the ability to track and coordinate friendly forces (“blue objects”) with absolute precision. To maintain operational awareness in dynamic and contested environments, the Department of War identified a requirement for adaptable, dual-use technologies that enhance logistics and decision-making resilience.

Jan 21, 2026
Snorkel Team
Blog
SlopCodeBench: Measuring Code Erosion as Agents Iterate

SlopCodeBench reveals how AI coding agents degrade code quality over time—measuring “slop,” technical debt, and architectural erosion across iterations.

Jan 20, 2026
Blog
Introducing the Snorkel Agentic Coding Benchmark

Today, we’re sharing details about the Snorkel Agentic Coding benchmark—a comprehensive evaluation suite designed to test whether agents can handle the full complexity of software engineering work.

Jan 08, 2026
Case study
From Stalled Pilot to $43M Annual ROI: Top 5 Global Telco Achieves 95% Accuracy with Snorkel AI

This Top 5 Global Telco aimed to evolve its internal billing co-pilot into a customer-facing chatbot capable of serving its global customer base. However, the project stalled at 54% accuracy due to data blind spots and reasoning errors that frustrated efforts to launch.

Dec 11, 2025
Snorkel Team
Blog
2026: The year of environments

Our NeurIPS 2025 retrospective, by the Snorkel AI team. We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger, and we couldn’t be more grateful to the research community for pushing these ideas forward. The evolution we’ve witnessed: When we first brought Snorkel AI research to NeurIPS back in 2019,…

Dec 10, 2025
Blog
Part V: Future direction and emerging trends

Explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.

Dec 05, 2025
Blog
The Self-Critique Paradox: Why AI Verification Fails Where It’s Needed Most

TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines. The Promise vs. The Reality…

Nov 26, 2025
Blog
Chat with the Terminal-Bench team

Snorkel Chief Scientist Fred Sala and Kobie Crawford chat with the Terminal-Bench team to unpack the design behind Terminal-Bench 2.0 and the new Harbor framework.

Nov 19, 2025
Blog
Intelligence Per Watt: A New Metric for AI’s Future

Snorkel AI contributes specialized datasets to Hazy Research’s “Intelligence-per-Watt” study, advancing how efficiently AI turns energy into intelligence.

Nov 12, 2025
