We build the data that pushes the frontier
Snorkel helps AI labs develop specialized training data and environments that set their models and agents apart.
Proud to partner with top frontier AI and research teams
Frontier models break at the edges. We build for that.
Most data pipelines are built for volume, not difficulty. Frontier models fail on distributional gaps in specialized domains, benchmark blind spots, and tasks where correctness is hard to define. Snorkel is built specifically for these problems.
Founded out of Stanford AI Lab, we've been shaping and benchmarking frontier AI for nearly a decade.
What we've found
New
RL Research
RLVR in Low Data & Compute Regimes
Better data beats more compute — measured across low-resource settings.
Evaluation Research
RIFT: Rubric Failure Mode Taxonomy
A diagnostic framework for when AI evaluation rubrics break down.
Domain Agents
Benchmarking Agents in Insurance Underwriting
Environment-first benchmarking for agents in a genuinely high-stakes domain.
Collaborations
Agentic Coding
Terminal-Bench 2.0
Real terminal tasks — exposing where today's coding agents fail.
Code Quality
SlopCodeBench
Generic code evals miss sloppy code. This measures what they ignore.
Research Collaboration
rLLM
A 4B model outperforming a 235B model on financial tasks, via domain-specific expert data.
Legal AI
Harvey BigLaw Bench
Expert data for the hardest agentic legal research benchmark. Built with Harvey AI.
The Frontier AI Data Lab
Data development for the frontier
Snorkel partners with frontier AI teams to build the data, evaluation systems, and environments that improve models where generic coverage runs out.
Snorkel Data Series
Ready-to-use, curriculum-structured datasets for the task areas where frontier models are pushing hardest, with rubrics, reviewer guidance, difficulty tiers, and eval slices built in.
Custom data development
When off-the-shelf coverage runs out, we build bespoke datasets, evals, and benchmark expansions for the exact failure surface you need to close.
Specialized agents
Data
Expert Demonstrations & Reasoning
Human solution traces
Reasoning traces
SME Q&A rationales
Workflow demos and decision workflows
Tool-use demos
Preference Labels & Rankings
Patch/draft/report quality ranking
Trajectory QA
Risk/safety/style calibration
Helpful/harmless ranking
Grounding & style
Rubrics & Verifiable Outcomes
Unit tests / compile
Deterministic graders
Citation correctness
Numerical consistency / scorable math & science
Long-horizon tasks
Environments
Standard & Custom Environments
Repo + CLI tools
Browser/GUI harness
Multi-step/stateful workflows
Simulated environments
Your tools, codebase, corpus, data & permissions
DATA DEVELOPMENT
Good data is a set of design choices
Most data quality problems are design problems. Ambiguous task definitions produce inconsistent labels. Uncalibrated reviewers introduce systematic bias. Missing provenance makes failure analysis guesswork. Snorkel's proprietary process is built around the decisions that determine whether training data actually drives model improvement.
CUSTOM AGENTS
Specialized agents grounded in expert data
The same data development system we use to improve frontier models powers our specialized agents. That means agents are evaluated against task-specific rubrics and programmatic checks, not generic benchmarks, and refined through the same adjudication and provenance practices used in production model development.
Built for specialized workflows and high-consequence decisions, not generic copilots
Evaluation on environment-grounded tasks with programmatic pass/fail criteria
Same rigor used to train frontier-class models, applied to your enterprise deployment
PUBLISHED RESEARCH
Research that shapes the work
Every dataset, benchmark, and environment we create is the output of active research, co-developed and peer-reviewed with leading academic teams and frontier labs.





