Building better enterprise AI: incorporating expert feedback in system development
Enterprises that aim to build valuable GenAI applications must view them at the systems level. LLMs are just one part of an ecosystem.
January 30, 2024
•
Chris Glaze
AI data development: a guide for data science projects
What is AI data development? AI data development includes any action taken to convert raw information into a format useful to AI.
November 13, 2024
•
Matt Casey
LLM evaluation in enterprise applications: a new era in ML
Learn about the obstacles faced by data scientists in LLM evaluation and discover effective strategies for overcoming them.
November 25, 2024
•
Matt Casey
All articles on Data development
SlopCodeBench: Measuring Code Erosion as Agents Iterate
SlopCodeBench reveals how AI coding agents degrade code quality over time—measuring “slop,” technical debt, and architectural erosion across iterations.
January 20, 2026
•
Kobie Crawford
Introducing the Snorkel Agentic Coding Benchmark
Today, we’re sharing details about the Snorkel Agentic Coding Benchmark, a comprehensive evaluation suite designed to test whether agents can handle the full complexity of software engineering work.
January 8, 2026
•
Kobie Crawford
2026: The year of environments
Our NeurIPS 2025 retrospective. We just returned from NeurIPS 2025, and we’re still processing everything we saw. The energy around data-centric AI has never been stronger, and we couldn’t be more grateful to the research community for pushing these ideas forward. The evolution we’ve witnessed: when we first brought Snorkel AI research to NeurIPS back in 2019,
December 10, 2025
•
Snorkel Team
Part V: Future direction and emerging trends
Explores how rubrics support agentic, multi-turn, tool-using, multimodal, and code-generating AI systems, and how they evolve with AI feedback and ensemble evaluation.
December 5, 2025
•
Justin Bauer
The Self-Critique Paradox: Why AI Verification Fails Where It’s Needed Most
TL;DR: We stress-tested the “generate → criticize → improve” loop on 50 visual reasoning tasks. The results were counterintuitive: self-critique acts as a corrosive agent on high-performance tasks, turning 98% accuracy into 57%. Yet, for tasks where models fail completely, it works like magic. This difficulty-dependent behavior poses a critical, hidden risk for RLFT pipelines.
November 26, 2025
•
Armin Parchami
Terminal-Bench 2.0: Raising the bar for AI agent evaluation
Terminal-Bench 2.0 launches today, marking a major leap in AI agent evaluation. Snorkel AI contributed key research and task design to this release.
November 7, 2025
•
Kobie Crawford
Snorkeling in RL environments
We unpack what makes a high-quality RL environment for LLMs and show how we build realistic, enterprise-grade environments at Snorkel AI.
November 4, 2025
•
Armin Parchami
Introducing SnorkelSpatial
A procedurally generated and programmatically verified benchmark for evaluating spatial reasoning capabilities in LLMs. Large language models (LLMs) are showing remarkable results on solving complex reasoning problems across domains, from mathematical proofs and logical puzzles to graduate-level science and engineering questions. On the other hand, their spatial reasoning capabilities are less understood, even though such reasoning underlies many everyday tasks. We
October 24, 2025
•
Harit Vishwakarma
Scaling Trust: Rubrics in Snorkel’s Quality Process
Snorkel’s “Trusted Scale” philosophy. Welcome to Part 4 of Snorkel AI’s rubric series. In previous posts, we explored how rubrics enable structured evaluation (Part 1), the spectrum of rubric types and use cases (Part 2), and the science behind designing and validating them (Part 3). In this latest installment, we pull back the curtain on how Snorkel puts these principles
October 16, 2025
•
Derek Pham
Evaluating Multi-Agent Systems in Enterprise Tool Use
In recent months, interest has grown in multi-agent systems and how they can solve more complex tasks than a single agent could accomplish on its own. The topic raises several questions and ideas to consider: Anthropic’s blog post about how they architected a multi-agent deep research system is
October 9, 2025
•
Bhavishya Pohani
Evaluating Coding Agent Capabilities with Terminal-Bench: Snorkel’s Role in Building the Next Generation Benchmark
Terminal-Bench, developed through a collaboration between Stanford University and Laude Institute, has quickly become the gold standard benchmark for evaluating AI agent capabilities in a command line environment. This comprehensive evaluation framework measures how effectively AI agents can perform complex, real-world tasks within terminal environments. At Snorkel AI, we’re excited to share that we’re one of the top collaborators contributing
Behind every AI benchmark is a hidden choice: how to read the model’s answers. That choice—parsing—can quietly tilt results more than the model itself. Parsing is where we take an AI system’s raw response and extract the “answer” we use for scoring. It sounds mechanical, but as our research shows, the choice of parser can dramatically change measured accuracy. In
September 26, 2025
•
Justin Bauer
The science of rubric design
Part 3 of our rubric series explains the science of rubric design. We show why rubrics should be treated like models—structured, measured, and iterated—to maximize objective alignment and inter-rater agreement. Learn how to choose hierarchy and scale points, track agreement (IAA) and LLMAJ alignment, and refine with domain experts, with examples like PaperBench and HealthBench.
September 11, 2025
•
Chris Glaze, Charles Dickens
The right tool for the job: An A-Z of rubrics
Rubrics turn fuzzy “good vs. bad” into measurable criteria for GenAI. In Part 2, we map what to measure (granularity and dataset-level vs instance-specific), where to measure (process vs outcome), and how to measure (humans, LLM-as-judge, code, reward models)—with examples like HHH, FLASK, HealthBench, and PaperBench.
September 2, 2025
•
Tom Walshe, Armin Parchami
Data quality and rubrics: how to build trust in your models
Rubrics aren’t just for evaluation—they’re a blueprint for better data annotation. In this post, we explore how structured rubrics enable scalable, high-quality labeling and evaluation of GenAI systems. Learn how Snorkel and leading labs use rubrics to align human and automated judgment and accelerate trusted AI development.