Benchmarks should shape the frontier, not just measure it
Since launching the Open Benchmarks Grants, we’ve received more than 100 applications from academic groups and industry labs spanning a wide range of domains and capabilities. Here, we share the new bar for useful benchmarks: what is now table stakes, and what separates the benchmarks that shape the frontier from those that merely measure it.
Useful benchmarks are, first and foremost, effective measuring sticks
- Rigorously validated tasks: The individual tasks are high quality (e.g. real-world complexity, well-structured instructions, verifiable solutions), as validated by real domain experts. GPQA introduced new adversarial quality control mechanisms[1] to ensure that tasks were not only well-posed, but also tractable for other experts to solve.
- Fine-grained distributional diversity: The benchmark defines a clear taxonomy for its domain and distributes tasks across it deliberately, so results can be sliced into actionable signals. MMLU constructed an ambitious taxonomy of 57 academic subjects (across STEM, humanities, and professional domains).[2]
- Robust eval methodology: Metrics go beyond raw accuracy, capturing cost, latency, reasoning quality, or whatever dimensions actually matter for real-world use of the capability. The benchmark measures what it claims to[3], and the methodology is reproducible and transparent. TAU-bench measures both task completion and adherence to policy constraints, i.e. a model that confidently books the wrong flight still fails.[4]
- Model headroom: The benchmark is unsaturated. It exposes real soft spots in model capabilities and creates meaningful signals for where model developers should focus next. Released just this week, ARC-AGI has frontier models scoring below 1% on tasks that are solvable by humans.[5]
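To make the eval-methodology point concrete, here is a minimal sketch of the kind of dual metric described above, where an episode only counts as a success if the task is completed *and* no policy constraint is violated. All names (`EpisodeResult`, `score`) are hypothetical illustrations, not TAU-bench’s actual scoring code.

```python
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    """Outcome of one agent episode (hypothetical structure)."""
    goal_achieved: bool      # did the agent complete the task?
    policy_violations: int   # e.g. booked a flight the policy forbids

def score(results: list[EpisodeResult]) -> dict[str, float]:
    """Report raw completion alongside a stricter joint metric:
    an episode passes only if the goal is met AND no policy was violated."""
    n = len(results)
    completed = sum(r.goal_achieved for r in results)
    compliant = sum(r.goal_achieved and r.policy_violations == 0
                    for r in results)
    return {
        "task_completion": completed / n,
        "strict_success": compliant / n,  # the headline metric
    }
```

The gap between `task_completion` and `strict_success` is itself a signal: it isolates the episodes where the model did the task but broke the rules.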
Lasting benchmarks push the frontier
- A thesis on the frontier: The benchmark defines a new subspace of capabilities for the frontier or revisits a previous research question with new assumptions. The most ambitious benchmarks have a thesis on where the world is going: Terminal-Bench was a bet on the CLI – not only for coding agents, but for general-purpose computer use.[6]
- Concrete research roadmaps: The benchmark produces new research roadmaps, inspiring new attacks on important research problems, including follow-on benchmarks and new methods that advance the field. SWE-Bench spawned a whole family of benchmarks (e.g. Lite, Verified, Multilingual, Multimodal), and its evolution has shaped how teams build their coding agents.[7]
- Researcher UX: The benchmark builders are committed to the “researcher experience”. This means the benchmark is simple to run models/agents against, simple to contribute to and extend, and simple to adapt supervision/reward signals for RL/tuning. HELM pioneered a standardized and modular harness for reproducible evals[8]; Terminal-Bench 2.0 shipped with Harbor, which has become de facto tooling for teams building agents.[9]
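A good “researcher experience” usually comes down to a small, stable interface between tasks and agents. The sketch below shows one way such a harness contract could look; the `Task`/`Agent` protocols and `evaluate` function are hypothetical illustrations, not the actual APIs of HELM or Harbor.

```python
from typing import Protocol

class Task(Protocol):
    """Anything with a prompt and a programmatic verifier."""
    task_id: str
    def prompt(self) -> str: ...
    def verify(self, output: str) -> bool: ...

class Agent(Protocol):
    """Anything that maps a prompt to an output string."""
    def run(self, prompt: str) -> str: ...

def evaluate(agent: Agent, tasks: list[Task]) -> float:
    """Run the agent on every task and return the pass rate."""
    passed = sum(task.verify(agent.run(task.prompt())) for task in tasks)
    return passed / len(tasks)
```

Because tasks and agents only meet through this narrow interface, contributors can add new tasks, swap in new agents, or repurpose `verify` as a reward signal for RL without touching the rest of the harness.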
Benchmarks are one of the highest leverage ways to not just measure AI, but define and advance the field. Individual researchers and small teams have enormous agency to direct how the community thinks about key research questions.
We’re excited to keep partnering with the best benchmark builders. Please apply and/or reach out (benchmarks.snorkel.ai)!
1. https://arxiv.org/abs/2311.12022
2. https://arxiv.org/pdf/2009.03300
3. https://openreview.net/forum?id=mdA5lVvNcU
4. https://arxiv.org/abs/2406.12045
5. https://arcprize.org/arc-agi/3
6. https://arxiv.org/abs/2601.11868
7. https://arxiv.org/abs/2310.06770
8. https://arxiv.org/abs/2211.09110
9. https://harborframework.com/