Released May 2, 2026
Open Benchmarks Grants

Terminal-Bench 2.1

A revision of Terminal-Bench 2.0 that fixes 28 of 89 tasks and introduces continuous validation for agentic benchmarks.
Overview

Terminal-Bench 2.1 fixes issues in 28 of the 89 tasks from Terminal-Bench 2.0. The task issues fell into three categories: external dependencies that changed after the benchmark was built, resource budgets that were too tight for valid solutions to finish, and tasks where the instructions did not match the tests.

After these changes, no task is unsolved in Terminal-Bench 2.1. The release also introduces continuous validation for agentic benchmarks. 

Leaderboard

Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy
1 | Codex CLI | GPT-5.5 | 2026-05-01 | OpenAI | OpenAI | 83.4% ±2.2
2 | Claude Code | Claude Opus 4.8 | 2026-05-29 | Anthropic | Anthropic | 78.9% ±2.5
3 | Terminus 2 | GPT-5.5 | 2026-05-01 | Terminal-Bench | OpenAI | 78.2% ±2.4
4 | Terminus 2 | Claude Opus 4.8 | 2026-05-29 | Terminal-Bench | Anthropic | 74.6% ±2.4
5 | Terminus 2 | Gemini 3 Pro | 2026-05-01 | Terminal-Bench | Google | 74.4% ±2.6
6 | Gemini CLI | Gemini 3.1 Pro | 2026-05-05 | Google | Google | 70.7% ±2.9
7 | Terminus 2 | Gemini 3.1 Pro | 2026-05-05 | Terminal-Bench | Google | 70.3% ±2.9
8 | Claude Code | Claude Opus 4.7 | 2026-05-01 | Anthropic | Anthropic | 69.7% ±2.7
9 | Gemini CLI | Gemini 3 Pro | 2026-05-02 | Google | Google | 66.3% ±2.7
10 | Terminus 2 | Claude Opus 4.7 | 2026-05-01 | Terminal-Bench | Anthropic | 66.1% ±2.7
11 | Claude Code | GLM 5.1 | 2026-05-02 | Anthropic | Z-AI | 58.7% ±2.4
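
When reading the leaderboard, it can help to check whether the reported ranges of neighbouring entries overlap. The page does not define what the ± value represents, so the sketch below simply assumes it is the half-width of a symmetric interval around the accuracy. The function names `intervals_overlap` and `adjacent_overlaps` are illustrative, not part of any Terminal-Bench tooling.

```python
# Leaderboard rows copied from the table above:
# (agent, model, accuracy in %, reported ± value in percentage points).
LEADERBOARD = [
    ("Codex CLI", "GPT-5.5", 83.4, 2.2),
    ("Claude Code", "Claude Opus 4.8", 78.9, 2.5),
    ("Terminus 2", "GPT-5.5", 78.2, 2.4),
    ("Terminus 2", "Claude Opus 4.8", 74.6, 2.4),
    ("Terminus 2", "Gemini 3 Pro", 74.4, 2.6),
    ("Gemini CLI", "Gemini 3.1 Pro", 70.7, 2.9),
    ("Terminus 2", "Gemini 3.1 Pro", 70.3, 2.9),
    ("Claude Code", "Claude Opus 4.7", 69.7, 2.7),
    ("Gemini CLI", "Gemini 3 Pro", 66.3, 2.7),
    ("Terminus 2", "Claude Opus 4.7", 66.1, 2.7),
    ("Claude Code", "GLM 5.1", 58.7, 2.4),
]


def intervals_overlap(row_a, row_b):
    """Return True if the two rows' ranges intersect.

    Assumption (not stated in the source): each range is
    [accuracy - half_width, accuracy + half_width].
    """
    _, _, acc_a, hw_a = row_a
    _, _, acc_b, hw_b = row_b
    return acc_a - hw_a <= acc_b + hw_b and acc_b - hw_b <= acc_a + hw_a


def adjacent_overlaps(rows):
    """List (rank, next_rank) pairs whose ranges overlap, using 1-based ranks."""
    return [
        (i + 1, i + 2)
        for i in range(len(rows) - 1)
        if intervals_overlap(rows[i], rows[i + 1])
    ]


if __name__ == "__main__":
    for upper, lower in adjacent_overlaps(LEADERBOARD):
        print(f"ranks {upper} and {lower} overlap")
```

Under that assumption, every pair of adjacent ranks overlaps except ranks 10 and 11, which suggests that small gaps between neighbouring entries should be read with caution.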

TB 2.0 vs 2.1 across representative pairs

Each figure is the average accuracy for one of 14 representative agent–model pairs. Twelve of the 14 pairs improved on 2.1. The largest gain came from Claude Code with Opus 4.6, which improved by 12.1 percentage points.
Model | Agent | TB 2.0 | TB 2.1 | Difference (percentage points)
GPT-5.3-Codex | Codex CLI | 73.3% | 79.1% | +5.8
GPT-5.4 | Codex CLI | 76.0% | 77.3% | +1.3
Gemini 3.1 Pro | Terminus 2 | 63.0% | 70.7% | +7.7
Opus 4.6 | Claude Code | 58.0% | 70.1% | +12.1
GPT-5.3-Codex | Terminus 2 | 64.7% | 68.5% | +3.8
Gemini 3.1 Pro | Gemini CLI | 61.3% | 67.1% | +5.8
GPT-5.4 mini | Codex CLI | 57.8% | 66.1% | +8.3
Opus 4.6 | Terminus 2 | 62.9% | 63.8% | +0.9
Sonnet 4.6 | Claude Code | 51.9% | 58.5% | +6.6
Gemini 3 Flash | Gemini CLI | 47.4% | 56.9% | +9.5
GPT-5.4 | Terminus 2 | 55.1% | 54.8% | -0.3
Gemini 3 Flash | Terminus 2 | 51.7% | 54.2% | +2.5
Sonnet 4.6 | Terminus 2 | 48.0% | 51.5% | +3.5
GPT-5.4 mini | Terminus 2 | 37.8% | 36.9% | -0.9
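
The Difference column is a subtraction of two percentages, so it is measured in percentage points rather than percent. As a minimal sketch, the arithmetic behind the table can be reproduced from the TB 2.0 and TB 2.1 figures above. The helper names `deltas` and `summarize` are illustrative only.

```python
# Rows copied from the table above: (model, agent, TB 2.0 %, TB 2.1 %).
PAIRS = [
    ("GPT-5.3-Codex", "Codex CLI", 73.3, 79.1),
    ("GPT-5.4", "Codex CLI", 76.0, 77.3),
    ("Gemini 3.1 Pro", "Terminus 2", 63.0, 70.7),
    ("Opus 4.6", "Claude Code", 58.0, 70.1),
    ("GPT-5.3-Codex", "Terminus 2", 64.7, 68.5),
    ("Gemini 3.1 Pro", "Gemini CLI", 61.3, 67.1),
    ("GPT-5.4 mini", "Codex CLI", 57.8, 66.1),
    ("Opus 4.6", "Terminus 2", 62.9, 63.8),
    ("Sonnet 4.6", "Claude Code", 51.9, 58.5),
    ("Gemini 3 Flash", "Gemini CLI", 47.4, 56.9),
    ("GPT-5.4", "Terminus 2", 55.1, 54.8),
    ("Gemini 3 Flash", "Terminus 2", 51.7, 54.2),
    ("Sonnet 4.6", "Terminus 2", 48.0, 51.5),
    ("GPT-5.4 mini", "Terminus 2", 37.8, 36.9),
]


def deltas(pairs):
    """Return (model, agent, TB 2.1 - TB 2.0) in percentage points."""
    return [(model, agent, round(new - old, 1)) for model, agent, old, new in pairs]


def summarize(pairs):
    """Return (number of improved pairs, row with the largest gain, mean gain)."""
    rows = deltas(pairs)
    improved = sum(1 for _, _, diff in rows if diff > 0)
    largest = max(rows, key=lambda row: row[2])
    mean_gain = sum(diff for _, _, diff in rows) / len(rows)
    return improved, largest, mean_gain


if __name__ == "__main__":
    improved, largest, mean_gain = summarize(PAIRS)
    print(f"{improved} of {len(PAIRS)} pairs improved")
    print(f"largest gain: {largest[1]} with {largest[0]} ({largest[2]:+.1f} pp)")
    print(f"mean change: {mean_gain:+.2f} pp")
```

Rounding each difference to one decimal place matches the precision of the published figures and avoids floating-point noise such as a tiny non-zero remainder.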

What changed

The 28 modified tasks fell into three categories of issues identified through community feedback and continuous validation.
01 External dependencies (9 tasks). TB 2.0 pinned Docker images for reproducibility, but tasks with internet access introduced external dependencies that changed over time.
02 Resource mismatches (8 tasks). Insufficient resource budgets (CPU, memory, time) meant valid approaches, including oracle solutions, could not finish consistently.
03 Misspecification. Tasks where the instructions did not match the tests. Example: the query-optimize tests expected Spark SQL while the instructions asked for PostgreSQL; the task was rewritten to use PostgreSQL consistently.
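
The page does not describe how continuous validation is implemented. One plausible check, suggested by the resource-mismatch category above, is to run each task's oracle solution several times and flag any task where it does not pass consistently. The sketch below illustrates that idea only. The `run_oracle` callable, the function names, and the default thresholds are all hypothetical, and a real harness would supply its own runner.

```python
from typing import Callable, Dict, Iterable


def oracle_pass_rate(run_oracle: Callable[[str], bool], task: str, attempts: int) -> float:
    """Run the oracle solution `attempts` times and return the fraction that passed.

    `run_oracle` is a hypothetical callable that executes the oracle solution
    for `task` in its environment and returns True when the tests pass.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    passes = sum(1 for _ in range(attempts) if run_oracle(task))
    return passes / attempts


def flag_unreliable_tasks(
    run_oracle: Callable[[str], bool],
    tasks: Iterable[str],
    attempts: int = 5,      # illustrative default, not taken from the source
    min_rate: float = 1.0,  # an oracle solution is expected to pass every time
) -> Dict[str, float]:
    """Return {task: pass rate} for tasks whose oracle pass rate is below `min_rate`."""
    flagged = {}
    for task in tasks:
        rate = oracle_pass_rate(run_oracle, task, attempts)
        if rate < min_rate:
            flagged[task] = rate
    return flagged
```

A task flagged this way is a candidate for review rather than an automatic fix: the cause could be a changed external dependency, a budget that is too tight, or a mismatch between instructions and tests.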

Per-task pass rate changes

Changes in pass rate across the 28 modified tasks. Several previously unsolved tasks now have nonzero pass rates. Largest gains came from tasks whose failures were caused by environment drift, resource mismatches, or misspecification.

Acknowledgments

Led by Stanford and Laude Institute, with TB 2.1 lead Kelly Buchanan.

