Terminal-Bench 2.1
A revision of Terminal-Bench 2.0 that fixes 28 of 89 tasks and introduces continuous validation for agentic benchmarks.
Overview
Terminal-Bench 2.1 fixes issues in 28 of the 89 tasks from Terminal-Bench 2.0. The task issues fell into three categories: external dependencies that changed after the benchmark was built, resource budgets that were too tight for valid solutions to finish, and tasks where the instructions did not match the tests.
After these changes, no task is unsolved in Terminal-Bench 2.1. The release also introduces continuous validation for agentic benchmarks.
Leaderboard
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
| 1 | Codex CLI | GPT-5.5 | 2026-05-01 | OpenAI | OpenAI | 83.4% ± 2.2 |
| 2 | Claude Code | Claude Opus 4.8 | 2026-05-29 | Anthropic | Anthropic | 78.9% ± 2.5 |
| 3 | Terminus 2 | GPT-5.5 | 2026-05-01 | Terminal-Bench | OpenAI | 78.2% ± 2.4 |
| 4 | Terminus 2 | Claude Opus 4.8 | 2026-05-29 | Terminal-Bench | Anthropic | 74.6% ± 2.4 |
| 5 | Terminus 2 | Gemini 3 Pro | 2026-05-01 | Terminal-Bench | | 74.4% ± 2.6 |
| 6 | Gemini CLI | Gemini 3.1 Pro | 2026-05-05 | | | 70.7% ± 2.9 |
| 7 | Terminus 2 | Gemini 3.1 Pro | 2026-05-05 | Terminal-Bench | | 70.3% ± 2.9 |
| 8 | Claude Code | Claude Opus 4.7 | 2026-05-01 | Anthropic | Anthropic | 69.7% ± 2.7 |
| 9 | Gemini CLI | Gemini 3 Pro | 2026-05-02 | | | 66.3% ± 2.7 |
| 10 | Terminus 2 | Claude Opus 4.7 | 2026-05-01 | Terminal-Bench | Anthropic | 66.1% ± 2.7 |
| 11 | Claude Code | GLM 5.1 | 2026-05-02 | Anthropic | Z-AI | 58.7% ± 2.4 |
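When reading the leaderboard, it can help to check whether neighboring entries are actually separated by their reported ± values. The page does not say what the ± figure represents, so the sketch below simply assumes it is the half-width of a symmetric interval around the accuracy. The function names are hypothetical, and interval overlap is a rough reading aid, not a significance test.

```python
# (rank, agent, model, accuracy %, reported ± value), copied from the leaderboard.
LEADERBOARD = [
    (1, "Codex CLI", "GPT-5.5", 83.4, 2.2),
    (2, "Claude Code", "Claude Opus 4.8", 78.9, 2.5),
    (3, "Terminus 2", "GPT-5.5", 78.2, 2.4),
    (4, "Terminus 2", "Claude Opus 4.8", 74.6, 2.4),
    (5, "Terminus 2", "Gemini 3 Pro", 74.4, 2.6),
    (6, "Gemini CLI", "Gemini 3.1 Pro", 70.7, 2.9),
    (7, "Terminus 2", "Gemini 3.1 Pro", 70.3, 2.9),
    (8, "Claude Code", "Claude Opus 4.7", 69.7, 2.7),
    (9, "Gemini CLI", "Gemini 3 Pro", 66.3, 2.7),
    (10, "Terminus 2", "Claude Opus 4.7", 66.1, 2.7),
    (11, "Claude Code", "GLM 5.1", 58.7, 2.4),
]


def intervals_overlap(a, b):
    """True if [accuracy - margin, accuracy + margin] of entries a and b intersect.

    Assumption: the ± value is a symmetric half-width around the accuracy.
    """
    a_lo, a_hi = a[3] - a[4], a[3] + a[4]
    b_lo, b_hi = b[3] - b[4], b[3] + b[4]
    return a_lo <= b_hi and b_lo <= a_hi


def separated_neighbors(entries):
    """Return (rank, rank) pairs of adjacent entries whose intervals do NOT overlap."""
    return [
        (upper[0], lower[0])
        for upper, lower in zip(entries, entries[1:])
        if not intervals_overlap(upper, lower)
    ]


if __name__ == "__main__":
    print(separated_neighbors(LEADERBOARD))
```

Under that assumption, the only adjacent pair with non-overlapping intervals is ranks 10 and 11; every other neighboring pair overlaps, so small rank differences near the top should be read with caution.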
TB 2.0 vs 2.1 across representative pairs
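Average accuracy on TB 2.0 and TB 2.1 for 14 representative agent–model pairs. Twelve of the 14 pairs improved on 2.1; the two that did not (GPT-5.4 and GPT-5.4 mini, both with Terminus 2) dropped by less than one percentage point. The largest gain came from Claude Code with Opus 4.6, which improved by 12.1 percentage points.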
| Model | Agent | TB 2.0 | TB 2.1 | Difference (pp) |
|---|---|---|---|---|
| GPT-5.3-Codex | Codex CLI | 73.3% | 79.1% | +5.8 |
| GPT-5.4 | Codex CLI | 76% | 77.3% | +1.3 |
| Gemini 3.1 Pro | Terminus 2 | 63% | 70.7% | +7.7 |
| Opus 4.6 | Claude Code | 58% | 70.1% | +12.1 |
| GPT-5.3-Codex | Terminus 2 | 64.7% | 68.5% | +3.8 |
| Gemini 3.1 Pro | Gemini CLI | 61.3% | 67.1% | +5.8 |
| GPT-5.4 mini | Codex CLI | 57.8% | 66.1% | +8.3 |
| Opus 4.6 | Terminus 2 | 62.9% | 63.8% | +0.9 |
| Sonnet 4.6 | Claude Code | 51.9% | 58.5% | +6.6 |
| Gemini 3 Flash | Gemini CLI | 47.4% | 56.9% | +9.5 |
| GPT-5.4 | Terminus 2 | 55.1% | 54.8% | -0.3 |
| Gemini 3 Flash | Terminus 2 | 51.7% | 54.2% | +2.5 |
| Sonnet 4.6 | Terminus 2 | 48% | 51.5% | +3.5 |
| GPT-5.4 mini | Terminus 2 | 37.8% | 36.9% | -0.9 |
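The Difference column is the TB 2.1 accuracy minus the TB 2.0 accuracy, in percentage points. As a minimal sketch, the snippet below recomputes it from the table values and tallies how many pairs improved. The variable and function names are illustrative only and are not part of any Terminal-Bench tooling.

```python
# (model, agent, TB 2.0 accuracy %, TB 2.1 accuracy %), copied from the table.
PAIRS = [
    ("GPT-5.3-Codex", "Codex CLI", 73.3, 79.1),
    ("GPT-5.4", "Codex CLI", 76.0, 77.3),
    ("Gemini 3.1 Pro", "Terminus 2", 63.0, 70.7),
    ("Opus 4.6", "Claude Code", 58.0, 70.1),
    ("GPT-5.3-Codex", "Terminus 2", 64.7, 68.5),
    ("Gemini 3.1 Pro", "Gemini CLI", 61.3, 67.1),
    ("GPT-5.4 mini", "Codex CLI", 57.8, 66.1),
    ("Opus 4.6", "Terminus 2", 62.9, 63.8),
    ("Sonnet 4.6", "Claude Code", 51.9, 58.5),
    ("Gemini 3 Flash", "Gemini CLI", 47.4, 56.9),
    ("GPT-5.4", "Terminus 2", 55.1, 54.8),
    ("Gemini 3 Flash", "Terminus 2", 51.7, 54.2),
    ("Sonnet 4.6", "Terminus 2", 48.0, 51.5),
    ("GPT-5.4 mini", "Terminus 2", 37.8, 36.9),
]


def differences(pairs):
    """Return (model, agent, TB 2.1 minus TB 2.0) rounded to one decimal place."""
    return [(m, a, round(new - old, 1)) for m, a, old, new in pairs]


deltas = differences(PAIRS)
improved = [d for d in deltas if d[2] > 0]      # pairs that gained on 2.1
regressed = [d for d in deltas if d[2] < 0]     # pairs that lost on 2.1
largest = max(deltas, key=lambda d: d[2])       # biggest single gain
mean_delta = sum(d[2] for d in deltas) / len(deltas)  # unweighted mean change

if __name__ == "__main__":
    print(len(improved), len(regressed), largest, round(mean_delta, 1))
```

The recomputed values match the Difference column row for row, with 12 pairs improving and 2 regressing.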
What changed
The 28 modified tasks fell into three categories of issues identified through community feedback and continuous validation.
1. External dependencies (9 tasks). TB 2.0 pinned Docker images for reproducibility, but tasks with internet access still relied on external resources that changed over time.
2. Resource mismatches (8 tasks). Resource budgets (CPU, memory, time) were too small, so valid approaches, including the oracle solutions, could not finish consistently.
3. Misspecification. Tasks where the instructions did not match the tests. For example, the query-optimize tests expected Spark SQL while the instructions asked for PostgreSQL; the task was rewritten to use PostgreSQL consistently.
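The resource-mismatch category suggests one simple validation check: rerun each task's oracle solution several times and flag any task where it does not pass every time. The page does not describe how continuous validation is implemented, so the following is only a sketch under that assumption. `OracleRuns`, `flag_unstable`, and the task names and counts are all hypothetical placeholders, not Terminal-Bench APIs or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleRuns:
    """Outcome of repeated oracle-solution runs for one task (hypothetical record)."""
    task: str
    passes: int
    attempts: int

    @property
    def pass_rate(self) -> float:
        # Treat a task with no attempts as failing so it gets flagged for review.
        return self.passes / self.attempts if self.attempts else 0.0


def flag_unstable(runs, min_rate=1.0):
    """Return sorted names of tasks whose oracle pass rate is below min_rate."""
    return sorted(r.task for r in runs if r.pass_rate < min_rate)


# Placeholder data for illustration only; these are not real benchmark tasks.
HYPOTHETICAL_RUNS = [
    OracleRuns("task-a", passes=5, attempts=5),  # stable oracle
    OracleRuns("task-b", passes=3, attempts=5),  # flaky, e.g. a tight time budget
    OracleRuns("task-c", passes=0, attempts=5),  # broken, e.g. a changed dependency
]

if __name__ == "__main__":
    print(flag_unstable(HYPOTHETICAL_RUNS))
```

Requiring a pass rate of 1.0 is the strictest choice; a real pipeline might tolerate occasional infrastructure failures by lowering `min_rate`.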
Per-task pass rate changes
Change in pass rate for each of the 28 modified tasks. Sixteen tasks improved, one was unchanged, and eleven declined. Several previously unsolved tasks now have nonzero pass rates. The largest gains came from tasks whose failures were caused by environment drift, resource mismatches, or misspecification.
| Task | Difference (pp) |
|---|---|
| polyglot-c-py | +84.3 |
| polyglot-rust-c | +74.3 |
| caffe-cifar-10 | +64.3 |
| torch-tensor-parallelism | +62.8 |
| adaptive-rejection-sampler | +35.7 |
| mteb-retrieve | +31.4 |
| build-pmars | +24.3 |
| install-windows-3.11 | +20.0 |
| compile-compcert | +17.1 |
| mteb-leaderboard | +12.9 |
| rstan-to-pystan | +11.4 |
| extract-moves-from-video | +10.0 |
| crack-7z-hash | +5.8 |
| filter-js-from-html | +4.3 |
| configure-git-webserver | +4.2 |
| sam-cell-seg | +1.4 |
| gpt2-codegolf | 0.0 |
| financial-document-processor | -1.5 |
| make-doom-for-mips | -2.8 |
| hf-model-inference | -2.9 |
| build-pov-ray | -4.3 |
| torch-pipeline-parallelism | -4.3 |
| train-fasttext | -5.7 |
| protein-assembly | -7.1 |
| query-optimize | -7.1 |
| fix-git | -7.2 |
| mcmc-sampling-stan | -10.0 |
| overfull-hbox | -18.6 |

