Benchmarks for what frontier AI hasn't solved
Senior SWE-Bench
A benchmark for evaluating coding agents on senior-level engineering work: building features from realistic instructions, diagnosing bugs that require runtime investigation, and shipping code that follows existing codebase conventions.


OSWorld 2.0
Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.
Agents’ Last Exam
Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.
Continual Learning Bench
Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.
SlopCode Bench
Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.
Terminal-Bench 2.1
Terminal agent evaluation led by Stanford University and Laude Institute. v2.1 fixes 28 tasks from 2.0 and introduces continuous validation.
Agentic Coding
A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.
Open Benchmarks Grants
Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.
Featured Collaborations
Computer tasks
Natural sciences
Legal AI
Harvey BigLaw Bench: Research
Expert data powering the hardest benchmark for agentic legal research
Legal AI
JudgmentBench
Compares rubric-based and preference-based evaluation for judging output quality.
Three core dimensions where today's benchmarks fall short



