OSWorld 2.0
A benchmark for evaluating computer-use agents on long-horizon, real-world workflows: 108 authentic tasks across 31 self-hosted web environments and professional desktop applications, with partial-credit scoring (avg 27.25 checkpoints per task).


Overview
Computer-use agents are increasingly deployed on multi-hour professional workflows, but most benchmarks evaluate them on short desktop tasks that finish in under 30 steps. OSWorld 2.0 reframes the problem around long-horizon work, sourced from realistic end-to-end workflows that a skilled human typically takes over an hour to complete.
Headline finding: At the 500-step budget, no system completes more than 20% of tasks end-to-end. Claude Opus 4.8 leads at 19.81% binary completion and 54.19% partial score; partial scores cluster in the 20–55% range across all evaluated systems, meaning frontier agents make meaningful progress but rarely finish.
At a glance
108 long-horizon tasks
31 self-hosted web environments and professional desktop applications
69.6% take humans over 1 hour
>250 average agent steps
27.25 average scoring checkpoints
The 108 tasks span seven professional domains and 21 sub-categories, covering research, creative production, engineering, personal services, business and finance, administration and compliance, and healthcare workflows. Tasks map to occupation families and SOC major groups, with a wage-bill-based GDP proxy estimating economic coverage. The largest shares come from document preparation, software and databases, and finance and operations analysis, with a long tail of additional professional activities.
Leaderboard
500-step budget
| Model | Effort | Approach | Binary | Partial | Est. Cost |
|---|---|---|---|---|---|
| claude-opus-4-8 | max | Batched tool | 20.6% | 54.8% | n/a |
| claude-opus-4-8 | max | Standard | 18.52% | 49.33% | n/a |
| claude-opus-4-7 | max | Batched tool | 18.2% | 48.91% | n/a |
| claude-opus-4-7 | max | Standard | 13.9% | 49.1% | $3.87K |
| gpt-5-5 | xhigh | Batch tool | 13% | 49.5% | $2.75K |
| claude-sonnet-4-6 | medium | Standard | 9.3% | 33.9% | $1.55K |
| claude-sonnet-4-6 | max | Standard | 8.3% | 41.5% | $2.41K |
| minimax-m3 | enabled | Standard | 4.6% | 22.3% | $258.78 |
| kimi-2-6 | enabled | Standard | 4.6% | 22.1% | $708 |
| qwen-3-7-plus | thinking | Standard | 2.8% | 21.5% | $411.56 |
300-step budget
| Model | Effort | Approach | Binary | Partial | Est. Cost |
|---|---|---|---|---|---|
| gpt-5-5 | xhigh | Batch tool | 13% | 49.5% | $2.75K |
| claude-opus-4-7 | max | Standard | 13% | 39.8% | $2.47K |
| claude-sonnet-4-6 | medium | Standard | 8.3% | 29.4% | $990 |
| claude-sonnet-4-6 | max | Standard | 6.5% | 35.8% | $1.72K |
| kimi-2-6 | enabled | Standard | 4.6% | 14.4% | $604 |
| minimax-m3 | enabled | Standard | 3.7% | 16.6% | $182.03 |
| qwen-3-7-plus | thinking | Standard | 1.9% | 16.6% | $403.33 |
150-step budget
| Model | Effort | Approach | Binary | Partial | Est. Cost |
|---|---|---|---|---|---|
| gpt-5-5 | xhigh | Batch tool | 13% | 46.7% | $1.88K |
| claude-opus-4-7 | max | Standard | 4.6% | 20.3% | $1.03K |
| claude-sonnet-4-6 | max | Standard | 4.6% | 20% | $800 |
| claude-sonnet-4-6 | medium | Standard | 4.6% | 14.2% | $410 |
| minimax-m3 | enabled | Standard | 1.9% | 8.2% | $86.82 |
| kimi-2-6 | enabled | Standard | 1.9% | 7.1% | $336 |
Performance drops as tasks get longer
Completion collapses with horizon.
At the 500-step budget, the top evaluated agent (Claude Opus 4.8 with batched tool) reaches 19.81% binary completion and 54.19% partial score. Cut the budget to 300 steps and the top leaderboard entry reaches 13.0% binary; at 150 steps, the top binary score is still 13.0% but its partial score drops to 46.7%. On the longest workflows in the corpus (tasks well past the 1.6-hour median human operation time), top frontier agents approach near-zero binary completion regardless of step budget. Partial progress is real; reliable completion is not.
OSWorld 1.0 vs 2.0
OSWorld 2.0 is a substantial expansion of the original OSWorld evaluation: tasks span far more agent steps, cross more applications, run inside reproducible self-hosted environments, and use partial credit instead of binary completion alone.
Methodology
Metrics
Submissions are scored at 150 / 300 / 500 agent-step budgets. The 500-step budget mirrors realistic long-horizon work; the 150-step budget surfaces efficiency.
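As a rough illustration of how partial credit and step budgets interact, the sketch below scores a single trajectory under each budget. It rests on assumptions this page does not spell out: that the partial score is the unweighted fraction of checkpoints satisfied within the budget, and that binary completion means every checkpoint was satisfied. The `score` function, the `checkpoint_steps` format, and the sample data are all hypothetical.

```python
from typing import Optional

BUDGETS = (150, 300, 500)  # agent-step budgets named in the text


def score(checkpoint_steps: list[Optional[int]], budget: int) -> tuple[float, bool]:
    """Return (partial, binary) for one task under a step budget.

    checkpoint_steps[i] is the step at which checkpoint i was first
    satisfied, or None if it was never satisfied (hypothetical format).
    """
    if not checkpoint_steps:
        raise ValueError("a task needs at least one checkpoint")
    passed = sum(1 for s in checkpoint_steps if s is not None and s <= budget)
    partial = passed / len(checkpoint_steps)  # assumed: unweighted fraction
    binary = passed == len(checkpoint_steps)  # assumed: all checkpoints required
    return partial, binary


# Hypothetical trajectory: four checkpoints, one never reached.
example = [40, 120, 310, None]
results = {b: score(example, b) for b in BUDGETS}
for b, (partial, binary) in results.items():
    print(f"budget={b}: partial={partial:.2f}, binary={binary}")
```

Under these assumptions a trajectory can gain partial credit as the budget grows while its binary result stays at failure, which is the pattern the leaderboard shows in aggregate.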
Safety audit
A separate audit pipeline runs 8 diagnostic checks on each trajectory. Safety reports are scored independently from task completion.
What the results show
Three patterns recur across the evaluated agents.
Higher scores require disproportionately more tokens.
Crossing the 50% partial-score threshold requires an order of magnitude more tokens than reaching 25%. Token cost grows faster than score.
Task horizon remains a hard limit.
Binary completion collapses as task length grows. On the longest workflows in the corpus, top frontier agents approach near-zero end-to-end completion regardless of step budget.
Agents are weak at recovering and maintaining hidden state.
When tasks require tracking unobserved or evolving context across steps, agents lose track — repeating earlier work, missing updates, or executing from stale plans.
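The first pattern can be roughly checked against the leaderboard. The sketch below divides estimated cost by partial score for several rows of the first leaderboard table that report a cost. It uses dollars as an assumed stand-in for tokens, since the page reports cost rather than token counts, and the "K" figures are rounded, so the ratios are approximate. The variable names are illustrative.

```python
# (model label, partial score in %, estimated cost in USD), copied from
# rows of the first leaderboard table; "K" values are rounded in the source.
rows = [
    ("claude-opus-4-7 (max, Standard)", 49.1, 3870.00),
    ("claude-sonnet-4-6 (max)", 41.5, 2410.00),
    ("claude-sonnet-4-6 (medium)", 33.9, 1550.00),
    ("qwen-3-7-plus", 21.5, 411.56),
    ("minimax-m3", 22.3, 258.78),
]

# Dollars spent per percentage point of partial score (assumed proxy metric).
cost_per_point = {name: cost / partial for name, partial, cost in rows}

for name, value in sorted(cost_per_point.items(), key=lambda kv: kv[1]):
    print(f"{name:35s} ${value:7.2f} per partial-score point")
```

On these rows the systems with higher partial scores also pay more per point, which is consistent with cost growing faster than score.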
Acknowledgments
OSWorld 2.0 is developed by XLANG Lab, with contributions from Snorkel AI researchers Zhengyang Qi (Jason), Vincent Sunn Chen, and Frederic Sala. Snorkel AI is the research and data partner on this project.