Leaderboard

Snorkel expert data-as-a-service

Featured leaderboards

Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks. These are just a few of our featured benchmarks—new ones are added regularly, so check back often to see the latest from our research team.

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Claude Opus 4.5: 58%
Claude Sonnet 4.5: 57.6%
Gemini 3 Pro Preview: 51.6%
gpt-5.2: 49.4%
gpt-5: 45.2%
Kimi-K2-Thinking: 36.8%
Devstral 2: 33.2%
Grok 4.1 Fast: 25.2%
Qwen 3 Coder 480B: 18.8%
Mistral Large 3: 13.8%

SnorkelUnderwrite

An expert-verified frontier benchmark with multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.

Claude Opus 4.1: 86.3%
gpt-5: 83.33%
Grok 4: 83.3%
Grok 4 Fast Reasoning: 81.33%
Grok 3: 78%
o4 mini: 78%
Claude Opus 4: 77%
o3: 77%
Claude Sonnet 3.7: 74.6%
Claude Sonnet 4: 72.3%
gpt-5-mini: 71.67%
Kimi-K2-Thinking: 71.3%
GPT-4.1: 70.6%
Gemini 2.5 Flash: 61%
Nova Premier: 57%
Gemini 2.5 Pro: 56.3%
Nova Pro: 52.3%
gpt-5-nano: 47%
Llama 3.3 70B: 46.3%
Llama 4 Maverick: 46.3%
Llama 4 Scout: 44.3%
o3-mini: 44.3%
Nova Lite: 40%
Mistral Large: 38.3%
Codestral: 34%
Nova Micro: 31%
gpt-oss-120b: 30%
Magistral Medium: 29.3%
Command R+: 25.7%
Qwen 3 235B: 21.3%
Llama 3.1 405B: 20%
Command R: 15.3%

Finance Reasoning

A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions that require tool calling and planning.

Grok 4: 53.1%
Claude Sonnet 3.7: 51.89%
gpt-5: 51%
Claude Sonnet 4: 49.37%
Claude Opus 4: 48.1%
Gemini 3 Pro: 46.84%
gpt-5-mini: 46.8%
o4 mini: 45.57%
Claude Opus 4.1: 45.56%
GPT-4.1: 44.3%
o3: 43.04%
Grok 3: 41.8%
Grok 4 Fast Reasoning: 40.51%
NVIDIA Nemotron Super 49B v1.5: 35.443%
Kimi-K2-Thinking: 35%
Gemini 2.5 Pro: 34.6%
Nova Premier: 34.17%
Gemini 2.5 Flash: 32%
gpt-oss-120b: 31.6%
o3-mini: 30.37%
gpt-5-nano: 26.6%
Qwen 3 235B: 17.7%
Magistral Medium: 13.92%
Nova Pro: 12.65%
Mistral Large: 10.12%

SnorkelSequences

A procedurally-generated and expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.

gpt-5: 77.6%
gpt-5-mini: 77.6%
gpt-5-nano: 72%
o3-mini: 71.2%
Gemini 2.5 Flash: 70.8%
Claude Sonnet 4: 70.4%
Grok 4 Fast Reasoning: 70.2%
o4 mini: 68.8%
NVIDIA Nemotron Super 49B v1.5: 66.8%
Gemini 2.5 Pro: 66%
Claude Opus 4: 65.6%
o3: 65.2%
Grok 4: 63.2%
Llama 4 Maverick: 62%
Nova Premier: 51.8%
Llama 4 Scout: 48.4%
Claude Sonnet 3.7: 47.6%
Magistral Medium: 47.6%
NVIDIA Nemotron Super 49B: 44.8%
Nova Pro: 41.2%
Nova Lite: 40%
Grok 3: 39.2%
Llama 3.3 70B: 38.8%
Mistral Large: 38.8%
Codestral: 38.4%
GPT-4.1: 36.8%
Nvidia 70B Instruct: 36.4%
Kimi-K2-Thinking: 36%
Llama 3.1 405B: 35.2%
Nova Micro: 33.6%
Qwen 3 235B: 28%

SnorkelSpatial

A procedurally-generated benchmark for evaluating allocentric and egocentric spatial reasoning capabilities in LLMs.

Grok 4 Fast Reasoning: 84.85%
o3: 76.67%
gpt-5: 73.94%
gpt-oss-120b: 52.73%
gpt-5-mini: 45.45%
Claude Opus 4.1: 45.15%
Magistral Medium 1.2: 44.24%
Claude Opus 4: 40.3%
o3-mini: 37.88%
Claude Sonnet 4: 33.33%
gpt-5-nano: 26.67%
Claude Sonnet 3.7: 21.52%
Gemini 2.5 Flash: 18.79%
Llama 4 Scout: 15.45%
Gemini 2.5 Pro: 15.15%
gpt-5-chat: 14.85%
Mistral Large: 14.85%
o4 mini: 14.85%
GPT-4.1: 14.55%
Llama 3.3 70B: 14.55%
Mistral Medium 3.1: 14.55%
Nova Micro: 14.55%
Command R+: 14.24%
Nova Premier: 14.24%
Qwen 3 235B: 13.94%
Codestral: 13.64%
Nova Lite: 13.33%
Grok 3: 12.73%
Magistral Medium: 12.42%
Llama 4 Maverick: 12.12%
Nova Pro: 12.12%
Command R: 11.82%

SnorkelWordle

A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.

gpt-5: 94%
Grok 4: 93%
o3: 92.9%
o4 mini: 91.9%
Gemini 3 Pro: 91%
gpt-5-mini: 91%
o3-mini: 90%
Grok 4 Fast Reasoning: 88%
Claude Opus 4: 85.6%
Kimi-K2-Thinking: 85%
Claude Sonnet 4: 83%
gpt-oss-120b: 81.6%
gpt-5-nano: 79%
Gemini 2.5 Pro: 74%
Grok 3: 71%
Claude Sonnet 3.7: 68%
gpt-oss-20b: 65.9%
GPT-4.1: 62%
Gemini 2.5 Flash: 61.9%
Kimi-K2: 54%
Llama 3.3 70B: 10.2%

SnorkelGraph

A procedurally-generated and expert-verified benchmark for evaluating mathematical and spatial reasoning capabilities of LLMs through graph reasoning problems.

Grok 4 Fast Reasoning: 75%
o4 mini: 75%
gpt-5-mini: 72.5%
gpt-5: 72%
o3: 71.5%
o3-mini: 71%
Claude Opus 4: 64.5%
Grok 3: 64%
GPT-4.1: 63%
gpt-5-nano: 62.5%
Qwen 3 235B: 61.5%
Grok 4: 61%
Claude Sonnet 4: 58%
Gemini 2.5 Pro: 58%
Gemini 2.5 Flash: 55%
Magistral Medium: 53.5%
Claude Sonnet 3.7: 50%
Nova Premier: 34.5%
Llama 4 Maverick: 34%
Mistral Large: 30%
NVIDIA Nemotron Super 49B: 29%
Nova Pro: 28%
Llama 4 Scout: 26%
Codestral: 24.5%
Llama 3.3 70B: 23.5%
Nvidia 70B Instruct: 22.5%
Llama 3.1 405B: 20.5%
Nova Lite: 19%
Nova Micro: 17.5%
Command R+: 15%
Command-Light: 10.5%
Command: 10%

SnorkelFinance

A benchmark of expert-verified financial QA created from financial reports for evaluating AI agents on tool-calling and reasoning capabilities.

gpt-5: 81%
o3: 81%
Gemini 3 Pro: 80.34%
Claude Opus 4.1: 80.3%
gpt-5-mini: 79.3%
Claude Opus 4: 78.3%
Claude Sonnet 3.7: 77.9%
Claude Sonnet 4: 76.6%
o4 mini: 76.6%
Grok 4: 74.04%
Grok 4 Fast Reasoning: 73.45%
Kimi-K2-Thinking: 71.7%
gpt-oss-120b: 66.6%
Grok 3: 65.86%
o3-mini: 63.79%
GPT-4.1: 62.7%
Nova Premier: 62.06%
Gemini 2.5 Pro: 60.6%
Gemini 2.5 Flash: 53.1%
Qwen 3 235B: 51.37%
gpt-5-nano: 50%
NVIDIA Nemotron Super 49B v1.5: 44%
Nova Pro: 40.34%
Codestral: 27.6%
Nova Lite: 16.89%
Magistral Medium: 16.2%
Nova Micro: 14.48%
Mistral Large: 13.4%

Performance per Dollar (interactive chart; axes: Model, Cost)

Model Value Comparison (interactive model-versus-model comparison)

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality expert data.
