Terminal-Bench 2.0
Terminal-Bench is a joint project between Stanford University and Laude Institute. The original benchmark passed 1,000 GitHub stars and drew contributions from nearly 100 developers worldwide. The 2.0 release then raised the bar with 89 carefully curated tasks designed to keep frontier-model performance under the 50% ceiling.
Each task runs in a unique Docker container with a human-written oracle solution and tests that verify the final container state. The 2.0 release dropped easier tasks (like the original "Hello World" debugger), eliminated unreproducible items (like the YouTube-download task affected by changing anti-bot protections), and tightened specifications so that near-100% performance is attainable for sufficiently capable agents.
Leaderboard
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|---|---|---|---|---|---|---|
Inside the leaderboard
How tight is the top 10?
Each entry on Terminal-Bench is reported with a 95% confidence interval. Visualized on a single axis, the top-10 intervals overlap heavily: the rank order is real, but the gaps are smaller than they look.
| Rank | Agent | Model | Accuracy | 95% CI |
|---|---|---|---|---|
| 1 | NexAU-AHE | GPT-5.5 | 84.7% | ±2.1 |
| 2 | LemonHarness | Multiple | 84.5% | ±2.6 |
| 3 | Capy | GPT-5.5 | 83.1% | ±2.1 |
| 4 | Codex CLI | GPT-5.5 | 82.2% | ±2.2 |
| 5 | Polaris | Multiple | 82.2% | ±2.8 |
| 6 | WOZCODE | Claude Opus 4.7 | 80.2% | ±2.1 |
| 7 | TongAgents | Gemini 3.1 Pro | 80.2% | ±2.6 |
| 8 | LemonHarness | Multiple | 79.9% | ±3 |
| 9 | SageAgent | GPT-5.3-Codex | 78.4% | ±2.2 |
| 10 | Droid | GPT-5.3-Codex | 77.3% | ±2.2 |
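To make the overlap concrete, here is a minimal Python sketch that rebuilds each entry's interval from the accuracy and ± figures in the top-10 list and checks which intervals intersect. The names `TOP10`, `interval`, `overlaps`, and `ranks_overlapping` are illustrative, not part of any Terminal-Bench tooling, and the sketch assumes each ± value is a symmetric half-width in percentage points. Overlapping intervals are only a rough visual guide, not a formal significance test.

```python
# (agent, model, accuracy %, assumed symmetric 95% CI half-width in points),
# copied from the top-10 list above.
TOP10 = [
    ("NexAU-AHE", "GPT-5.5", 84.7, 2.1),
    ("LemonHarness", "Multiple", 84.5, 2.6),
    ("Capy", "GPT-5.5", 83.1, 2.1),
    ("Codex CLI", "GPT-5.5", 82.2, 2.2),
    ("Polaris", "Multiple", 82.2, 2.8),
    ("WOZCODE", "Claude Opus 4.7", 80.2, 2.1),
    ("TongAgents", "Gemini 3.1 Pro", 80.2, 2.6),
    ("LemonHarness", "Multiple", 79.9, 3.0),
    ("SageAgent", "GPT-5.3-Codex", 78.4, 2.2),
    ("Droid", "GPT-5.3-Codex", 77.3, 2.2),
]


def interval(entry):
    """Return the (low, high) bounds of an entry's confidence interval."""
    _, _, accuracy, half_width = entry
    return (accuracy - half_width, accuracy + half_width)


def overlaps(a, b):
    """True if the intervals of entries a and b share at least one point."""
    low_a, high_a = interval(a)
    low_b, high_b = interval(b)
    return low_a <= high_b and low_b <= high_a


def ranks_overlapping(rank):
    """List the ranks (1-based) whose interval overlaps the given rank's."""
    reference = TOP10[rank - 1]
    return [
        i + 1
        for i, entry in enumerate(TOP10)
        if i != rank - 1 and overlaps(reference, entry)
    ]


if __name__ == "__main__":
    print(ranks_overlapping(1))  # [2, 3, 4, 5, 7, 8]
```

Under these assumptions, the top entry's interval (82.6% to 86.8%) intersects six of the other nine, and every pair of adjacent ranks overlaps.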
Problem catalog · All 89 problems
Methodology
From the blog
Acknowledgments



