Archived

Terminal-Bench 2.1

A revision of Terminal-Bench 2.0 that fixes 28 of 89 tasks and introduces continuous validation for agentic benchmarks.

Overview

Terminal-Bench 2.1 fixes issues in 28 of the 89 tasks from Terminal-Bench 2.0. The task issues fell into three categories: external dependencies that changed after the benchmark was built, resource budgets that were too tight for valid solutions to finish, and tasks where the instructions did not match the tests.

After these changes, no task is unsolved in Terminal-Bench 2.1. The release also introduces continuous validation for agentic benchmarks. A per-task breakdown and discussion are in PR #53.

Leaderboard

Rank	Agent	Model	Date	Agent Org	Model Org	Accuracy
1	Claude Code	Claude 5 Fable	2026-06-07	Anthropic	Anthropic	83.8% ±1.2
2	Codex CLI	GPT-5.5	2026-05-01	OpenAI	OpenAI	83.1% ±1.1
3	Terminus 2	Claude 5 Fable	2026-06-05	Terminal-Bench	Anthropic	80.4% ±1.2
4	Cursor CLI	Grok 4.5	2026-07-09	Cursor	xAI	79.3% ±1.5
5	Claude Code	Claude Opus 4.8	2026-07-09	Anthropic	Anthropic	78.9% ±1.3
6	Codex CLI	GPT-5.6 Terra	2026-07-11	OpenAI	OpenAI	78.4% ±1.3
7	Terminus 2	GPT-5.5	2026-05-01	Terminal-Bench	OpenAI	78% ±1.2
8	mini-SWE-agent	Muse Spark 1.1	2026-07-09	Princeton	Meta	76.2% ±1.2
9	Codex CLI	GPT-5.6 Luna	2026-07-11	OpenAI	OpenAI	75.7% ±1.3
10	Claude Code	Claude Sonnet 5	2026-07-09	Anthropic	Anthropic	74.6% ±1.6
11	Terminus 2	Gemini 3 Pro	2026-05-01	Terminal-Bench	Google	73.9% ±1.3
12	Claude Code	Claude Opus 4.7	2026-05-01	Anthropic	Anthropic	68.9% ±1.4
13	Terminus 2	Claude Opus 4.7	2026-05-01	Terminal-Bench	Anthropic	66.1% ±1.4
14	Gemini CLI	Gemini 3 Pro	2026-05-01	Google	Google	65.8% ±1.4
15	Gemini CLI	Gemini 3.1 Pro	2026-05-05	Google	Google	65.8% ±1.7
16	Terminus 2	Gemini 3.1 Pro	2026-05-05	Terminal-Bench	Google	65.6% ±1.7
17	Claude Code	GLM 5.1	2026-05-01	Anthropic	Z-AI	58.7% ±1.2

TB 2.0 vs 2.1 across representative pairs

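Average accuracy across 14 representative agent–model pairs. Twelve of the 14 pairs improved on 2.1; the two that did not (GPT-5.4 and GPT-5.4 mini, both on Terminus 2) dropped by less than one percentage point. The largest gain came from Claude Code with Opus 4.6, which improved by 12.1 percentage points.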

Model	Agent	TB 2.0	TB 2.1	Difference (pp)
GPT-5.3-Codex	Codex CLI	73.3%	79.1%	+5.8
GPT-5.4	Codex CLI	76.0%	77.3%	+1.3
Gemini 3.1 Pro	Terminus 2	63.0%	70.7%	+7.7
Opus 4.6	Claude Code	58.0%	70.1%	+12.1
GPT-5.3-Codex	Terminus 2	64.7%	68.5%	+3.8
Gemini 3.1 Pro	Gemini CLI	61.3%	67.1%	+5.8
GPT-5.4 mini	Codex CLI	57.8%	66.1%	+8.3
Opus 4.6	Terminus 2	62.9%	63.8%	+0.9
Sonnet 4.6	Claude Code	51.9%	58.5%	+6.6
Gemini 3 Flash	Gemini CLI	47.4%	56.9%	+9.5
GPT-5.4	Terminus 2	55.1%	54.8%	-0.3
Gemini 3 Flash	Terminus 2	51.7%	54.2%	+2.5
Sonnet 4.6	Terminus 2	48.0%	51.5%	+3.5
GPT-5.4 mini	Terminus 2	37.8%	36.9%	-0.9
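
The comparison above is plain subtraction of accuracies, so it is easy to re-derive. The following is a minimal Python sketch that recomputes the per-pair differences in percentage points from the table and summarizes them. The names `PAIRS` and `summarize` are illustrative only and are not part of any Terminal-Bench tooling.

```python
# (model, agent, TB 2.0 accuracy %, TB 2.1 accuracy %), copied from the table above.
PAIRS = [
    ("GPT-5.3-Codex", "Codex CLI", 73.3, 79.1),
    ("GPT-5.4", "Codex CLI", 76.0, 77.3),
    ("Gemini 3.1 Pro", "Terminus 2", 63.0, 70.7),
    ("Opus 4.6", "Claude Code", 58.0, 70.1),
    ("GPT-5.3-Codex", "Terminus 2", 64.7, 68.5),
    ("Gemini 3.1 Pro", "Gemini CLI", 61.3, 67.1),
    ("GPT-5.4 mini", "Codex CLI", 57.8, 66.1),
    ("Opus 4.6", "Terminus 2", 62.9, 63.8),
    ("Sonnet 4.6", "Claude Code", 51.9, 58.5),
    ("Gemini 3 Flash", "Gemini CLI", 47.4, 56.9),
    ("GPT-5.4", "Terminus 2", 55.1, 54.8),
    ("Gemini 3 Flash", "Terminus 2", 51.7, 54.2),
    ("Sonnet 4.6", "Terminus 2", 48.0, 51.5),
    ("GPT-5.4 mini", "Terminus 2", 37.8, 36.9),
]


def summarize(pairs):
    """Return how many pairs improved, the mean change, and the largest gain."""
    # Round to one decimal to match the precision reported in the table.
    diffs = [round(new - old, 1) for _, _, old, new in pairs]
    best = max(pairs, key=lambda p: p[3] - p[2])
    return {
        "total": len(diffs),
        "improved": sum(d > 0 for d in diffs),
        "mean_diff_pp": sum(diffs) / len(diffs),
        "largest_gain": (best[0], best[1], round(best[3] - best[2], 1)),
    }


if __name__ == "__main__":
    print(summarize(PAIRS))
```

Differences are reported in percentage points (absolute change in accuracy), not as relative change.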

What changed

The 28 modified tasks fell into three categories of issues identified through community feedback and continuous validation.

External dependencies

9 tasks. TB 2.0 pinned Docker images for reproducibility, but tasks with internet access introduced external dependencies that changed over time.
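
The page does not describe how drift in external dependencies was detected or fixed. As an illustration only, one common way to notice that a downloaded artifact has changed is to compare it against a checksum recorded when the task was built. The sketch below assumes that approach; `sha256_of` and `has_drifted` are hypothetical helper names, not Terminal-Bench functions.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def has_drifted(data: bytes, pinned_sha256: str) -> bool:
    """True if the fetched bytes no longer match the checksum recorded at build time."""
    return sha256_of(data) != pinned_sha256.strip().lower()


if __name__ == "__main__":
    # Hypothetical artifact contents; a real check would hash the downloaded file.
    pinned = sha256_of(b"artifact contents at build time")
    print(has_drifted(b"artifact contents at build time", pinned))  # unchanged artifact
    print(has_drifted(b"artifact contents today", pinned))          # changed artifact
```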

Resource mismatches

8 tasks. Resource budgets (CPU, memory, time) were too tight, so valid approaches, including the oracle solutions, could not finish consistently.
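
A check for this kind of problem can be sketched as follows: run each task's oracle solution several times under the task's resource budget and flag any task whose oracle does not pass every time. This is a hedged illustration of the idea, not the validation pipeline used for the release; the function name, the input shape, and the task names in the usage example are all assumptions.

```python
def flag_unreliable(oracle_runs, min_pass_rate=1.0):
    """Return the sorted names of tasks whose oracle pass rate is below the threshold.

    oracle_runs maps a task name to a list of booleans, one per oracle run
    (True means the oracle solution passed the tests within the budget).
    Tasks with no recorded runs are flagged too, since nothing confirms them.
    """
    flagged = []
    for task, results in oracle_runs.items():
        if not results:
            flagged.append(task)
            continue
        pass_rate = sum(results) / len(results)
        if pass_rate < min_pass_rate:
            flagged.append(task)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical results for three made-up tasks.
    runs = {
        "task-a": [True, True, True],
        "task-b": [True, False, True],  # oracle timed out once
        "task-c": [],                   # never validated
    }
    print(flag_unreliable(runs))
```

Requiring the oracle to pass on every run is a deliberately strict default here; a looser `min_pass_rate` would tolerate some flakiness at the cost of noisier scores.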

Misspecification

Tasks where the instructions did not match the tests. For example, the query-optimize tests expected Spark SQL while the instructions asked for PostgreSQL; the task was rewritten to use PostgreSQL consistently.

Per-task pass rate changes

Changes in pass rate across the 28 modified tasks, in percentage points. Pass rates rose on 16 tasks, were unchanged on 1, and fell on 11. Several previously unsolved tasks now have nonzero pass rates. The largest gains came from tasks whose failures were caused by environment drift, resource mismatches, or misspecification.

Task	Difference (pp)
polyglot-c-py	+84.3
polyglot-rust-c	+74.3
caffe-cifar-10	+64.3
torch-tensor-parallelism	+62.8
adaptive-rejection-sampler	+35.7
mteb-retrieve	+31.4
build-pmars	+24.3
install-windows-3.11	+20.0
compile-compcert	+17.1
mteb-leaderboard	+12.9
rstan-to-pystan	+11.4
extract-moves-from-video	+10.0
crack-7z-hash	+5.8
filter-js-from-html	+4.3
configure-git-webserver	+4.2
sam-cell-seg	+1.4
gpt2-codegolf	±0.0
financial-document-processor	-1.5
make-doom-for-mips	-2.8
hf-model-inference	-2.9
build-pov-ray	-4.3
torch-pipeline-parallelism	-4.3
train-fasttext	-5.7
protein-assembly	-7.1
query-optimize	-7.1
fix-git	-7.2
mcmc-sampling-stan	-10.0
overfull-hbox	-18.6
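
To read the table above at a glance, the per-task changes can be bucketed by direction. The sketch below copies the values from the table and counts improved, unchanged, and regressed tasks; `CHANGES` and `bucket_counts` are illustrative names and not part of the benchmark's tooling.

```python
# Change in pass rate per modified task, in percentage points (from the table above).
CHANGES = {
    "polyglot-c-py": 84.3, "polyglot-rust-c": 74.3, "caffe-cifar-10": 64.3,
    "torch-tensor-parallelism": 62.8, "adaptive-rejection-sampler": 35.7,
    "mteb-retrieve": 31.4, "build-pmars": 24.3, "install-windows-3.11": 20.0,
    "compile-compcert": 17.1, "mteb-leaderboard": 12.9, "rstan-to-pystan": 11.4,
    "extract-moves-from-video": 10.0, "crack-7z-hash": 5.8,
    "filter-js-from-html": 4.3, "configure-git-webserver": 4.2,
    "sam-cell-seg": 1.4, "gpt2-codegolf": 0.0,
    "financial-document-processor": -1.5, "make-doom-for-mips": -2.8,
    "hf-model-inference": -2.9, "build-pov-ray": -4.3,
    "torch-pipeline-parallelism": -4.3, "train-fasttext": -5.7,
    "protein-assembly": -7.1, "query-optimize": -7.1, "fix-git": -7.2,
    "mcmc-sampling-stan": -10.0, "overfull-hbox": -18.6,
}


def bucket_counts(changes):
    """Count tasks whose pass rate rose, stayed the same, or fell."""
    values = list(changes.values())
    return {
        "improved": sum(v > 0 for v in values),
        "unchanged": sum(v == 0 for v in values),
        "regressed": sum(v < 0 for v in values),
        "mean_change_pp": sum(values) / len(values),
    }


if __name__ == "__main__":
    print(bucket_counts(CHANGES))
```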