LEADERBOARDS

Benchmarks for what frontier AI hasn't solved

Our ability to measure AI has been outpaced by our ability to develop it. We close that evaluation gap with benchmarks built around the tasks today's agents still break down on.
New

Senior SWE-Bench

A benchmark for evaluating coding agents on senior-level engineering work: building features from realistic instructions, diagnosing bugs that require runtime investigation, and shipping code that follows existing codebase conventions.

Tasteful Solve Rate
The top-performing frontier models fail to complete tasks with senior-level correctness and taste over 75% of the time.
1. Claude Sonnet 5: 33.3%
Reward hacking (e.g. GitHub searches) detected; 26 tasks removed from score
2. Claude Opus 4.8: 24%
3. GPT-5.5: 18%
4. Claude Sonnet 4.6: 16.2%
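The leaderboard note says flagged tasks were removed from the score, but the page does not say how the adjusted rate is computed. One plausible reading, assumed here, is that flagged tasks are dropped from both the numerator and the denominator. A minimal Python sketch of that reading follows; `TaskResult`, `adjusted_solve_rate`, and the sample data are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    solved: bool
    flagged: bool = False  # hypothetical marker for suspected reward hacking


def adjusted_solve_rate(results):
    """Solve rate over unflagged tasks only (an assumed scoring rule)."""
    kept = [r for r in results if not r.flagged]
    if not kept:
        raise ValueError("no unflagged tasks left to score")
    return sum(r.solved for r in kept) / len(kept)


# Made-up data: four tasks, one of them flagged and therefore excluded.
results = [
    TaskResult("t1", solved=True),
    TaskResult("t2", solved=False),
    TaskResult("t3", solved=True, flagged=True),
    TaskResult("t4", solved=False),
]
rate = adjusted_solve_rate(results)
print(f"{rate:.1%}")  # 1 solved out of 3 kept tasks: 33.3%
```

Dropping a flagged task entirely, rather than counting it as a failure, keeps the score from rewarding or punishing the suspect run; a stricter policy would count it as unsolved instead.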
New
Open Benchmarks Grants

OSWorld 2.0

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.

By binary accuracy (300 steps)
1. gpt-5-5 · xhigh: 13%
2. Claude Opus 4.7 · max: 13%
3. Claude Sonnet 4.6 · medium: 8.3%
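The page does not define binary accuracy. A common reading, assumed here, is all-or-nothing scoring: a task counts only if its outcome check passes within the step budget, with no partial credit. A minimal Python sketch under that assumption follows; `Run`, `binary_accuracy`, and the sample runs are hypothetical.

```python
from typing import NamedTuple


class Run(NamedTuple):
    task_id: str
    steps_used: int
    outcome_ok: bool  # result of the task's verifiable outcome check


def binary_accuracy(runs, max_steps=300):
    """Fraction of tasks that pass outright within the step budget."""
    if not runs:
        raise ValueError("no runs to score")
    passed = sum(1 for r in runs if r.outcome_ok and r.steps_used <= max_steps)
    return passed / len(runs)


# Made-up data: only task "a" passes inside the 300-step budget.
sample_runs = [
    Run("a", 120, True),
    Run("b", 300, False),
    Run("c", 310, True),  # correct outcome, but over budget, so it fails
    Run("d", 45, False),
]
score = binary_accuracy(sample_runs)
print(f"{score:.0%}")  # 1 of 4 tasks pass: 25%
```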
Open Benchmarks Grants

Agents’ Last Exam

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.

By Binary Accuracy
1. Codex · GPT-5.5: 24%
2. ALE Claw · GPT-5.5: 23%
3. Claude Code · Claude-Fable-5: 22%
Open Benchmarks Grants

Continual Learning Bench

Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.

Top Systems (Agg. Reward)
1. ICL · Claude Sonnet 4.6: +0.223
2. ICL · GPT-5.4: +0.201
3. Claude Code · Claude Sonnet 4.6: +0.190
Open Benchmarks Grants

SlopCode Bench

Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.

Top Models by Iso Solve
1. GPT-5.5: 28.06%
2. GPT-5.3-Codex: 26.02%
3. GPT-5.4: 23.47%
Open Benchmarks Grants

Terminal-Bench 2.1

Terminal agent evaluation led by Stanford University and Laude Institute. v2.1 fixes 28 tasks from 2.0 and introduces continuous validation.

Top Submissions
1. Codex CLI · GPT-5.5: 83.4%
2. Claude Code · Claude 5 Fable: 83.1%
3. Terminus 2 · Claude 5 Fable: 80.4%

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Top Models
1. Claude Opus 4.6: 65.2%
2. Claude Opus 4.5: 58.0%
3. Claude Sonnet 4.5: 57.6%

IN DEVELOPMENT

Open Benchmarks Grants

Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.


Looking ahead

Three core dimensions where today's benchmarks fall short

Benchmarks must close the gap between what we measure and what agents actually encounter. Our work focuses on three dimensions where today’s evaluations break down.
01. Environment complexity
How dynamic is the operating environment? Real systems are far more complex than the environments in today's benchmarks.
02. Autonomy horizon
How independently can the agent operate before reliability breaks down?
03. Output complexity
How sophisticated is the deliverable that agents must produce?

For models that need to be right. Not just good enough.

Snorkel expert data-as-a-service

Featured leaderboards

Exclusive to Snorkel, these benchmarks are designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks. These are just a few of our featured benchmarks; new ones are added regularly, so check back often to see the latest from our research team.

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.
Claude Opus 4.5: 58%
Claude Sonnet 4.5: 57.6%
Gemini 3 Pro Preview: 51.6%
gpt-5.2: 49.4%
gpt-5: 45.2%
Kimi-K2-Thinking: 36.8%
Devstral 2: 33.2%
Grok 4.1 Fast: 25.2%
Qwen 3 Coder 480B: 18.8%
Mistral Large 3: 13.8%

Finance Reasoning

A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions that require tool calling and planning.
Grok 4: 53.1%
GPT-5.4: 52%
Claude Sonnet 3.7: 51.89%
gpt-5: 51%
Claude Sonnet 4: 49.37%
Claude Opus 4: 48.1%
Gemini 3 Pro: 46.84%
gpt-5-mini: 46.8%
o4 mini: 45.57%
Claude Opus 4.1: 45.56%
GPT-4.1: 44.3%
o3: 43.04%
Grok 3: 41.8%
Grok 4 Fast Reasoning: 40.51%
NVIDIA Nemotron Super 49B v1.5: 35.443%
Kimi-K2-Thinking: 35%
Gemini 2.5 Pro: 34.6%
Nova Premier: 34.17%
Gemini 2.5 Flash: 32%
gpt-oss-120b: 31.6%
o3-mini: 30.37%
gpt-5-nano: 26.6%
Qwen 3 235B: 17.7%
Magistral Medium: 13.92%
Nova Pro: 12.65%
Mistral Large: 10.12%

SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.
GPT-5.4: 91%
Claude Opus 4.1: 86.3%
gpt-5: 83.33%
Grok 4: 83.3%
Grok 4 Fast Reasoning: 81.33%
Grok 3: 78%
o4 mini: 78%
Claude Opus 4: 77%
o3: 77%
Claude Sonnet 3.7: 74.6%
Claude Sonnet 4: 72.3%
gpt-5-mini: 71.67%
Kimi-K2-Thinking: 71.3%
GPT-4.1: 70.6%
Gemini 2.5 Flash: 61%
Nova Premier: 57%
Gemini 2.5 Pro: 56.3%
Nova Pro: 52.3%
gpt-5-nano: 47%
Llama 3.3 70B: 46.3%
Llama 4 Maverick: 46.3%
Llama 4 Scout: 44.3%
o3-mini: 44.3%
Nova Lite: 40%
Mistral Large: 38.3%
Codestral: 34%
Nova Micro: 31%
gpt-oss-120b: 30%
Magistral Medium: 29.3%
Command R+: 25.7%
Qwen 3 235B: 21.3%
Llama 3.1 405B: 20%
Command R: 15.3%

SnorkelSequences

A procedurally-generated and expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.
gpt-5: 77.6%
gpt-5-mini: 77.6%
gpt-5-nano: 72%
GPT-5.4: 71.6%
o3-mini: 71.2%
Gemini 2.5 Flash: 70.8%
Claude Sonnet 4: 70.4%
Grok 4 Fast Reasoning: 70.2%
o4 mini: 68.8%
NVIDIA Nemotron Super 49B v1.5: 66.8%
Gemini 2.5 Pro: 66%
Claude Opus 4: 65.6%
o3: 65.2%
Grok 4: 63.2%
Llama 4 Maverick: 62%
Nova Premier: 51.8%
Llama 4 Scout: 48.4%
Claude Sonnet 3.7: 47.6%
Magistral Medium: 47.6%
NVIDIA Nemotron Super 49B: 44.8%
Nova Pro: 41.2%
Nova Lite: 40%
Grok 3: 39.2%
Llama 3.3 70B: 38.8%
Mistral Large: 38.8%
Codestral: 38.4%
GPT-4.1: 36.8%
Nvidia 70B Instruct: 36.4%
Kimi-K2-Thinking: 36%
Llama 3.1 405B: 35.2%
Nova Micro: 33.6%
Qwen 3 235B: 28%

SnorkelSpatial

A procedurally-generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.
GPT-5.4: 99%
Grok 4 Fast Reasoning: 84.85%
o3: 76.67%
gpt-5: 73.94%
gpt-oss-120b: 52.73%
gpt-5-mini: 45.45%
Claude Opus 4.1: 45.15%
Magistral Medium 1.2: 44.24%
Claude Opus 4: 40.3%
o3-mini: 37.88%
Claude Sonnet 4: 33.33%
gpt-5-nano: 26.67%
Claude Sonnet 3.7: 21.52%
Gemini 2.5 Flash: 18.79%
Llama 4 Scout: 15.45%
Gemini 2.5 Pro: 15.15%
gpt-5-chat: 14.85%
Mistral Large: 14.85%
o4 mini: 14.85%
GPT-4.1: 14.55%
Llama 3.3 70B: 14.55%
Mistral Medium 3.1: 14.55%
Nova Micro: 14.55%
Command R+: 14.24%
Nova Premier: 14.24%
Qwen 3 235B: 13.94%
Codestral: 13.64%
Nova Lite: 13.33%
Grok 3: 12.73%
Magistral Medium: 12.42%
Llama 4 Maverick: 12.12%
Nova Pro: 12.12%
Command R: 11.82%

SnorkelWordle

A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.
gpt-5: 94%
Grok 4: 93%
o3: 92.9%
o4 mini: 91.9%
Gemini 3 Pro: 91%
gpt-5-mini: 91%
o3-mini: 90%
Grok 4 Fast Reasoning: 88%
Claude Opus 4: 85.6%
Kimi-K2-Thinking: 85%
Claude Sonnet 4: 83%
gpt-oss-120b: 81.6%
gpt-5-nano: 79%
Gemini 2.5 Pro: 74%
Grok 3: 71%
Claude Sonnet 3.7: 68%
gpt-oss-20b: 65.9%
GPT-4.1: 62%
Gemini 2.5 Flash: 61.9%
Kimi-K2: 54%
Llama 3.3 70B: 10.2%

SnorkelGraph

A procedurally-generated and expert-verified benchmark for evaluating the mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.
GPT-5.4: 84.5%
Grok 4 Fast Reasoning: 75%
o4 mini: 75%
gpt-5-mini: 72.5%
gpt-5: 72%
o3: 71.5%
o3-mini: 71%
Claude Opus 4: 64.5%
Grok 3: 64%
GPT-4.1: 63%
gpt-5-nano: 62.5%
Qwen 3 235B: 61.5%
Grok 4: 61%
Claude Sonnet 4: 58%
Gemini 2.5 Pro: 58%
Gemini 2.5 Flash: 55%
Magistral Medium: 53.5%
Claude Sonnet 3.7: 50%
Nova Premier: 34.5%
Llama 4 Maverick: 34%
Mistral Large: 30%
NVIDIA Nemotron Super 49B: 29%
Nova Pro: 28%
Llama 4 Scout: 26%
Codestral: 24.5%
Llama 3.3 70B: 23.5%
Nvidia 70B Instruct: 22.5%
Llama 3.1 405B: 20.5%
Nova Lite: 19%
Nova Micro: 17.5%
Command R+: 15%
Command-Light: 10.5%
Command: 10%

SnorkelFinance

A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.
gpt-5: 81%
o3: 81%
Gemini 3 Pro: 80.34%
Claude Opus 4.1: 80.3%
gpt-5-mini: 79.3%
Claude Opus 4: 78.3%
Claude Sonnet 3.7: 77.9%
Claude Sonnet 4: 76.6%
o4 mini: 76.6%
Grok 4: 74.04%
Grok 4 Fast Reasoning: 73.45%
Kimi-K2-Thinking: 71.7%
gpt-oss-120b: 66.6%
Grok 3: 65.86%
o3-mini: 63.79%
GPT-4.1: 62.7%
Nova Premier: 62.06%
Gemini 2.5 Pro: 60.6%
Gemini 2.5 Flash: 53.1%
Qwen 3 235B: 51.37%
gpt-5-nano: 50%
NVIDIA Nemotron Super 49B v1.5: 44%
Nova Pro: 40.34%
Codestral: 27.6%
Nova Lite: 16.89%
Magistral Medium: 16.2%
Nova Micro: 14.48%
Mistral Large: 13.4%