LEADERBOARDS

Benchmarks for what frontier AI hasn't solved

Our ability to measure AI has been outpaced by our ability to develop it. We close that evaluation gap with benchmarks built around the tasks today's agents still break down on.

New

Senior SWE-Bench

A benchmark for evaluating coding agents on senior-level engineering work: building features from realistic instructions, diagnosing bugs that require runtime investigation, and shipping code that follows existing codebase conventions.

Tasteful Solve Rate

Even the top-performing frontier models fail, more than 75% of the time, to complete tasks with senior-level correctness and taste.

1. Claude Sonnet 5: 33.3%

Reward hacking (e.g., GitHub searches) detected; 26 tasks removed from the score.
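
The exclusion rule noted above (tasks flagged for reward hacking are dropped before the score is computed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual scoring code: the `TaskResult` fields, the `solve_rate` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One graded task. All field names here are hypothetical."""
    task_id: str
    solved: bool
    flagged: bool = False  # assumed marker: reward hacking was detected


def solve_rate(results: list[TaskResult]) -> float:
    """Fraction of solved tasks among those not flagged for reward hacking."""
    scored = [r for r in results if not r.flagged]
    if not scored:
        raise ValueError("no scored tasks remain after exclusions")
    return sum(r.solved for r in scored) / len(scored)


# Made-up sample data: five tasks, one of them flagged and therefore excluded.
results = [
    TaskResult("t1", solved=True),
    TaskResult("t2", solved=False),
    TaskResult("t3", solved=True),
    TaskResult("t4", solved=False),
    TaskResult("t5", solved=True, flagged=True),
]

print(f"{solve_rate(results):.1%}")  # 2 solved of 4 scored tasks: 50.0%
```

Removing flagged tasks from both the numerator and the denominator means a flagged "solve" neither helps nor hurts the reported rate; other treatments, such as counting flagged tasks as failures, would give a lower number.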

Open Benchmarks Grants

OSWorld 2.0

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.

By binary accuracy (300 steps)

Claude Opus 4.7 · max: 13%
3. Claude Sonnet 4.6 · medium: 8.3%

Open Benchmarks Grants

Agents’ Last Exam

Long-horizon professional workflows with verifiable outcomes across 55 sub-industries. 147 public tasks of a 1,500+ task corpus, sourced and validated by 300+ industry experts.

Claude Code · Claude-Fable-5

22%

Open Benchmarks Grants

Continual Learning Bench

Evaluates whether AI systems improve from prior experience across sequential, stateful tasks, measuring real in-context learning, not just raw capability.

Top Systems (Agg. Reward)

1

ICL · Claude Sonnet 4.6

Claude Code · Claude Sonnet 4.6

+0.190

Open Benchmarks Grants

SlopCode Bench

Measures code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat), and verbosity under realistic repo conditions.

Top Models by Iso Solve

Open Benchmarks Grants

Terminal-Bench 2.1

Terminal agent evaluation led by Stanford University and Laude Institute. v2.1 fixes 28 tasks from 2.0 and introduces continuous validation.

Claude Code · Claude 5 Fable: 83.1%
3. Terminus 2 · Claude 5 Fable: 80.4%

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

IN DEVELOPMENT

Open Benchmarks Grants

Backed by a $3M commitment, our Open Benchmarks Grants program funds open-source datasets, benchmarks, and evaluation artifacts that shape how frontier AI is built and evaluated.

Featured collaborations

Computer tasks

terminal-bench 3.0

Real terminal tasks that expose where today's coding agents fail.

Natural sciences

terminal-bench-science

Generic code evals miss sloppy code. This measures what they ignore.

Legal AI

Harvey BigLaw Bench: Research

Expert data powering the hardest benchmark for agentic legal research

Legal AI

JudgmentBench

Compares rubric-based and preference-based evaluation for judging output quality.

Looking ahead

Three core dimensions where today's benchmarks fall short

Benchmarks must close the gap between what we measure and what agents actually encounter. Our work focuses on three dimensions where today’s evaluations break down.

01. Environment complexity
How dynamic is the operating environment? Real systems are far more complex than today's benchmarks.

02. Autonomy horizon
How independently can the agent operate before reliability breaks down?

03. Output complexity
How sophisticated is the deliverable agents must produce?

For models that need to be right. Not just good enough.

Snorkel expert data-as-a-service

Featured leaderboards

Exclusive to Snorkel, these benchmarks are designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks. These are a few of our featured benchmarks; new ones are added regularly.

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Claude Opus 4.5: 58%
Claude Sonnet 4.5: 57.6%
Gemini 3 Pro Preview: 51.6%
gpt-5.2: 49.4%
gpt-5: 45.2%
Kimi-K2-Thinking: 36.8%
Devstral 2: 33.2%
Grok 4.1 Fast: 25.2%
Qwen 3 Coder 480B: 18.8%
Mistral Large 3: 13.8%

Finance Reasoning

A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions that require tool calling and planning.

Grok 4: 53.1%
GPT-5.4: 52%
Claude Sonnet 3.7: 51.89%
gpt-5: 51%
Claude Sonnet 4: 49.37%
Claude Opus 4: 48.1%
Gemini 3 Pro: 46.84%
gpt-5-mini: 46.8%
o4 mini: 45.57%
Claude Opus 4.1: 45.56%
GPT-4.1: 44.3%
o3: 43.04%
Grok 3: 41.8%
Grok 4 Fast Reasoning: 40.51%
NVIDIA Nemotron Super 49B v1.5: 35.443%
Kimi-K2-Thinking: 35%
Gemini 2.5 Pro: 34.6%
Nova Premier: 34.17%
Gemini 2.5 Flash: 32%
gpt-oss-120b: 31.6%
o3-mini: 30.37%
gpt-5-nano: 26.6%
Qwen 3 235B: 17.7%
Magistral Medium: 13.92%
Nova Pro: 12.65%
Mistral Large: 10.12%

SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.

GPT-5.4: 91%
Claude Opus 4.1: 86.3%
gpt-5: 83.33%
Grok 4: 83.3%
Grok 4 Fast Reasoning: 81.33%
Grok 3: 78%
o4 mini: 78%
Claude Opus 4: 77%
o3: 77%
Claude Sonnet 3.7: 74.6%
Claude Sonnet 4: 72.3%
gpt-5-mini: 71.67%
Kimi-K2-Thinking: 71.3%
GPT-4.1: 70.6%
Gemini 2.5 Flash: 61%
Nova Premier: 57%
Gemini 2.5 Pro: 56.3%
Nova Pro: 52.3%
gpt-5-nano: 47%
Llama 3.3 70B: 46.3%
Llama 4 Maverick: 46.3%
Llama 4 Scout: 44.3%
o3-mini: 44.3%
Nova Lite: 40%
Mistral Large: 38.3%
Codestral: 34%
Nova Micro: 31%
gpt-oss-120b: 30%
Magistral Medium: 29.3%
Command R+: 25.7%
Qwen 3 235B: 21.3%
Llama 3.1 405B: 20%
Command R: 15.3%

SnorkelSequences

A procedurally generated, expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.

gpt-5: 77.6%
gpt-5-mini: 77.6%
gpt-5-nano: 72%
GPT-5.4: 71.6%
o3-mini: 71.2%
Gemini 2.5 Flash: 70.8%
Claude Sonnet 4: 70.4%
Grok 4 Fast Reasoning: 70.2%
o4 mini: 68.8%
NVIDIA Nemotron Super 49B v1.5: 66.8%
Gemini 2.5 Pro: 66%
Claude Opus 4: 65.6%
o3: 65.2%
Grok 4: 63.2%
Llama 4 Maverick: 62%
Nova Premier: 51.8%
Llama 4 Scout: 48.4%
Claude Sonnet 3.7: 47.6%
Magistral Medium: 47.6%
Nvidia nemotron super 49B: 44.8%
Nova Pro: 41.2%
Nova Lite: 40%
Grok 3: 39.2%
Llama 3.3 70B: 38.8%
Mistral Large: 38.8%
Codestral: 38.4%
GPT-4.1: 36.8%
Nvidia 70B Instruct: 36.4%
Kimi-K2-Thinking: 36%
Llama 3.1 405B: 35.2%
Nova Micro: 33.6%
Qwen 3 235B: 28%

SnorkelSpatial

A procedurally generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.

GPT-5.4: 99%
Grok 4 Fast Reasoning: 84.85%
o3: 76.67%
gpt-5: 73.94%
gpt-oss-120b: 52.73%
gpt-5-mini: 45.45%
Claude Opus 4.1: 45.15%
Magistral Medium 1.2: 44.24%
Claude Opus 4: 40.3%
o3-mini: 37.88%
Claude Sonnet 4: 33.33%
gpt-5-nano: 26.67%
Claude Sonnet 3.7: 21.52%
Gemini 2.5 Flash: 18.79%
Llama 4 Scout: 15.45%
Gemini 2.5 Pro: 15.15%
gpt-5-chat: 14.85%
Mistral Large: 14.85%
o4 mini: 14.85%
GPT-4.1: 14.55%
Llama 3.3 70B: 14.55%
Mistral Medium 3.1: 14.55%
Nova Micro: 14.55%
Command R+: 14.24%
Nova Premier: 14.24%
Qwen 3 235B: 13.94%
Codestral: 13.64%
Nova Lite: 13.33%
Grok 3: 12.73%
Magistral Medium: 12.42%
Llama 4 Maverick: 12.12%
Nova Pro: 12.12%
Command R: 11.82%

SnorkelWordle

A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.

gpt-5: 94%
Grok 4: 93%
o3: 92.9%
o4 mini: 91.9%
Gemini 3 Pro: 91%
gpt-5-mini: 91%
o3-mini: 90%
Grok 4 Fast Reasoning: 88%
Claude Opus 4: 85.6%
Kimi-K2-Thinking: 85%
Claude Sonnet 4: 83%
gpt-oss-120b: 81.6%
gpt-5-nano: 79%
Gemini 2.5 Pro: 74%
Grok 3: 71%
Claude Sonnet 3.7: 68%
gpt-oss-20b: 65.9%
GPT-4.1: 62%
Gemini 2.5 Flash: 61.9%
Kimi-K2: 54%
Llama 3.3 70B: 10.2%

SnorkelGraph

A procedurally generated, expert-verified benchmark for evaluating the mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.

GPT-5.4: 84.5%
Grok 4 Fast Reasoning: 75%
o4 mini: 75%
gpt-5-mini: 72.5%
gpt-5: 72%
o3: 71.5%
o3-mini: 71%
Claude Opus 4: 64.5%
Grok 3: 64%
GPT-4.1: 63%
gpt-5-nano: 62.5%
Qwen 3 235B: 61.5%
Grok 4: 61%
Claude Sonnet 4: 58%
Gemini 2.5 Pro: 58%
Gemini 2.5 Flash: 55%
Magistral Medium: 53.5%
Claude Sonnet 3.7: 50%
Nova Premier: 34.5%
Llama 4 Maverick: 34%
Mistral Large: 30%
Nvidia nemotron super 49B: 29%
Nova Pro: 28%
Llama 4 Scout: 26%
Codestral: 24.5%
Llama 3.3 70B: 23.5%
Nvidia 70B Instruct: 22.5%
Llama 3.1 405B: 20.5%
Nova Lite: 19%
Nova Micro: 17.5%
Command R+: 15%
Command-Light: 10.5%
Command: 10%

SnorkelFinance

A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.

gpt-5: 81%
o3: 81%
Gemini 3 Pro: 80.34%
Claude Opus 4.1: 80.3%
gpt-5-mini: 79.3%
Claude Opus 4: 78.3%
Claude Sonnet 3.7: 77.9%
Claude Sonnet 4: 76.6%
o4 mini: 76.6%
Grok 4: 74.04%
Grok 4 Fast Reasoning: 73.45%
Kimi-K2-Thinking: 71.7%
gpt-oss-120b: 66.6%
Grok 3: 65.86%
o3-mini: 63.79%
GPT-4.1: 62.7%
Nova Premier: 62.06%
Gemini 2.5 Pro: 60.6%
Gemini 2.5 Flash: 53.1%
Qwen 3 235B: 51.37%
gpt-5-nano: 50%
NVIDIA Nemotron Super 49B v1.5: 44%
Nova Pro: 40.34%
Codestral: 27.6%
Nova Lite: 16.89%
Magistral Medium: 16.2%
Nova Micro: 14.48%
Mistral Large: 13.4%
