Archived

Terminal-Bench 2.0

A benchmark for terminal agents featuring 89 hard, human-verified tasks in containerized environments. Agents are scored by task resolution rate, and the tasks were built by a community of nearly 100 developers.


Overview

Terminal-Bench is a joint project between Stanford University and Laude Institute. The original benchmark passed 1,000 GitHub stars and drew contributions from nearly 100 developers worldwide before the 2.0 release raised the bar with 89 carefully curated tasks designed to keep frontier-model performance under the 50% ceiling.

Each task runs in a unique Docker container with a human-written oracle solution and tests that verify the final container state. The 2.0 release dropped easier tasks (like the original "Hello World" debugger), eliminated unreproducible items (like the YouTube-download task affected by changing anti-bot protections), and tightened specifications so that near-100% performance is attainable for sufficiently capable agents.
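A minimal sketch of what verifying the final state means in practice. The function name `verify_final_state`, the file `result.txt`, and the expected content are all hypothetical and are not taken from any Terminal-Bench task; a temporary directory stands in for the container filesystem. The point is only that the check inspects what is left behind, not how the agent got there.

```python
import tempfile
from pathlib import Path


def verify_final_state(workdir: Path) -> bool:
    """Hypothetical check: pass only if result.txt exists and contains '42'.

    Only the end state of the directory is inspected. Commands the agent ran
    and any intermediate files it created or deleted play no role.
    """
    target = workdir / "result.txt"
    return target.is_file() and target.read_text().strip() == "42"


# A temporary directory stands in for the task's container filesystem.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)

    # Before the "agent" acts, the expected artifact is missing.
    before = verify_final_state(workdir)

    # Simulate an agent that takes a detour through a scratch file.
    scratch = workdir / "scratch.txt"
    scratch.write_text("intermediate work\n")
    (workdir / "result.txt").write_text("42\n")
    scratch.unlink()

    # The detour is invisible to the check; only the final state counts.
    after = verify_final_state(workdir)

print(before, after)
```

Under this design, two agents that reach the same final state by different routes receive the same verdict.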

Leaderboard

Rank	Agent	Model	Date	Agent Org	Model Org	Accuracy

Inside the leaderboard

All 142 published submissions on Terminal-Bench 2.0 at a glance: how scores distribute, where the median sits, and which providers cluster at the top.

Published Entries	142
Median Score	44%
Below 30%	28.2%
70% or Above	15.5%

[Figure: Score distribution · count of submissions per 10-point bucket, from 0-10 through 80-90. Axis: task resolution rate.]

Submissions cluster most heavily in the 50-70% band; only 7 cross 80%.

Provider concentration in top 30

Tagged by backbone model, not by agent harness

Provider	Entries in top 30	Share
OpenAI	13 / 30	43.3%
Anthropic	9 / 30	30.0%
Multiple	4 / 30	13.3%
Google	4 / 30	13.3%
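The shares above follow directly from the counts. A short sketch of the arithmetic, using the four counts as given; the names `top30_counts` and `share_percent` are illustrative only.

```python
# Top-30 entries per backbone-model provider, copied from the figures above.
top30_counts = {"OpenAI": 13, "Anthropic": 9, "Multiple": 4, "Google": 4}

total = sum(top30_counts.values())


def share_percent(count: int, total: int) -> float:
    """Return count / total as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {provider: share_percent(n, total) for provider, n in top30_counts.items()}

for provider, pct in shares.items():
    print(f"{provider}: {top30_counts[provider]} / {total} = {pct}%")
```

The counts sum to 30, so the four groups account for every entry in the top 30.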

How tight is the top 10?

Each entry on Terminal-Bench is reported with a 95% confidence interval. Visualized on a single axis, the top 10 intervals overlap heavily: the rank order is real, but the gaps are smaller than they look.

Rank	Agent	Model	Accuracy (±95% CI)
1	NexAU-AHE	GPT-5.5	84.7% ±2.1
2	LemonHarness	Multiple	84.5% ±2.6
3	Capy	GPT-5.5	83.1% ±2.1
4	Codex CLI	GPT-5.5	82.2% ±2.2
5	Polaris	Multiple	82.2% ±2.8
6	WOZCODE	Claude Opus 4.7	80.2% ±2.1
7	TongAgents	Gemini 3.1 Pro	80.2% ±2.6
8	LemonHarness	Multiple	79.9% ±3
9	SageAgent	GPT-5.3-Codex	78.4% ±2.2
10	Droid	GPT-5.3-Codex	77.3% ±2.2

The top 10 span 7.4 percentage points (77.3% to 84.7%), and every pair of adjacent 95% confidence intervals overlaps. The extremes do not: rank 1 (84.7% ±2.1) bottoms out at 82.6%, while rank 10 (77.3% ±2.2) tops out at 79.5%.
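The overlap can be checked from the table itself. The sketch below treats each entry as the interval accuracy ± half-width and tests pairs for overlap. Interval overlap is used here only as a rough reading aid, not as a formal significance test, and the helper names are illustrative.

```python
# (rank, agent, accuracy %, 95% CI half-width) copied from the top-10 table.
TOP10 = [
    (1, "NexAU-AHE", 84.7, 2.1),
    (2, "LemonHarness", 84.5, 2.6),
    (3, "Capy", 83.1, 2.1),
    (4, "Codex CLI", 82.2, 2.2),
    (5, "Polaris", 82.2, 2.8),
    (6, "WOZCODE", 80.2, 2.1),
    (7, "TongAgents", 80.2, 2.6),
    (8, "LemonHarness", 79.9, 3.0),
    (9, "SageAgent", 78.4, 2.2),
    (10, "Droid", 77.3, 2.2),
]


def interval(accuracy, half_width):
    """Return the (low, high) bounds of accuracy ± half_width."""
    return (accuracy - half_width, accuracy + half_width)


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


intervals = {rank: interval(acc, hw) for rank, _, acc, hw in TOP10}

# Do neighbouring ranks always overlap?
adjacent_all_overlap = all(
    overlaps(intervals[r], intervals[r + 1]) for r in range(1, 10)
)

# Which lower-ranked entries overlap with the leader's interval?
overlaps_with_leader = [r for r in range(2, 11) if overlaps(intervals[1], intervals[r])]

# Spread between the best and tenth-best point estimates.
spread = round(TOP10[0][2] - TOP10[-1][2], 1)

print(adjacent_all_overlap, overlaps_with_leader, spread)
```

On the table's numbers, every adjacent pair overlaps, and the leader's interval overlaps with ranks 2, 3, 4, 5, 7, and 8 but not with ranks 6, 9, or 10.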

Problem catalog · All 89 problems

All 89 tasks span 16 categories and three difficulty tiers. Each task runs in its own Docker container with a human-written oracle solution.

Categories: software-engineering, system-administration, scientific-computing, security, data-science, debugging, file-operations, model-training, mathematics, data-processing, machine-learning, games, personal-assistant, optimization, data-querying, video-processing

Difficulty: easy, medium, hard

adaptive-rejection-sampler

break-filter-js-from-html


Methodology

Metric

Task resolution rate, reported per submission with a 95% confidence interval. Tests verify the final container state only, not agent commands or intermediate steps.
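A minimal sketch of how a resolution rate and a 95% interval could be computed from pass/fail outcomes. This page does not say how Terminal-Bench derives its intervals, so the normal (Wald) approximation used here is an assumption, as are the function name and the trial counts in the example, which are hypothetical rather than real leaderboard data.

```python
import math


def resolution_rate_with_ci(outcomes, z=1.96):
    """Return (rate %, CI half-width %) for a list of pass/fail outcomes.

    Uses the normal approximation to the binomial, which is an assumption
    here and not a documented Terminal-Bench method. z=1.96 corresponds to
    a 95% interval.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    p = sum(1 for passed in outcomes if passed) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * half_width


# Hypothetical run: 445 trials, of which 356 resolved the task.
example_outcomes = [True] * 356 + [False] * 89
rate, half_width = resolution_rate_with_ci(example_outcomes)
print(f"{rate:.1f}% ±{half_width:.1f}")
```

The half-width shrinks with the square root of the number of trials, so narrowing an interval by half takes roughly four times as many trials.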

Environment

Fully containerized Docker environment. Each task includes a unique image with relevant packages and files pre-installed, plus a time limit.

Verification

Every task in 2.0 was reviewed for reproducibility, specification quality, and solvability. Tasks that were unreproducible (like the original YouTube-download task) or gated by arbitrary thresholds were removed.

Agents

Submissions pair a backbone model with an agent scaffold (Codex CLI, ForgeCode, Terminus 2, Mini-SWE-Agent, Claude Code, and others). Each Agent + Model combination is its own leaderboard row.
