Snorkel Expert Data-as-a-Service

Featured leaderboards

Exclusive to Snorkel, these benchmarks are meticulously designed and validated by subject matter experts to probe frontier AI models on demanding, specialized tasks. These are just a few of our featured benchmarks. New ones are added regularly, so check back often to see the latest from our research team.

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.
Claude Opus 4.5: 58%
Claude Sonnet 4.5: 57.6%
Gemini 3 Pro Preview: 51.6%
gpt-5.2: 49.4%
gpt-5: 45.2%
Kimi-K2-Thinking: 36.8%
Devstral 2: 33.2%
Grok 4.1 Fast: 25.2%
Qwen 3 Coder 480B: 18.8%
Mistral Large 3: 13.8%
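For intuition, the sketch below shows the general shape of a pass/fail agentic-coding harness: the model proposes a solution, and the task counts as solved only if hidden unit tests pass. The `call_model` stub, file layout, and pytest-based grading are illustrative assumptions, not Snorkel's actual harness.

```python
# A minimal sketch of a pass/fail agentic-coding harness, assuming tasks are
# graded by hidden unit tests. `call_model` is a hypothetical stand-in for a
# real LLM API call; Snorkel's actual harness design is not public.
import subprocess
import sys
import tempfile
from pathlib import Path

def call_model(task: str) -> str:
    """Hypothetical model call; a real harness would query an LLM here."""
    return "def add(a, b):\n    return a + b\n"  # canned output for the demo

def score_task(task: str, test_code: str) -> bool:
    """Write the model's solution beside the hidden tests and run pytest."""
    solution = call_model(task)
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "solution.py").write_text(solution)
        Path(workdir, "test_solution.py").write_text(test_code)
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q"],
            cwd=workdir, capture_output=True,
        )
    return result.returncode == 0  # solved only if every hidden test passes

tests = "from solution import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"
print(score_task("Implement add(a, b).", tests))  # -> True (with pytest installed)
```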

SnorkelUnderwrite

An expert-verified frontier benchmark built on multi-turn conversations, focused on agentic reasoning and tool use in commercial underwriting settings.
Claude Opus 4.1: 86.3%
gpt-5: 83.33%
Grok 4: 83.3%
Grok 4 Fast Reasoning: 81.33%
Grok 3: 78%
o4-mini: 78%
Claude Opus 4: 77%
o3: 77%
Claude Sonnet 3.7: 74.6%
Claude Sonnet 4: 72.3%
gpt-5-mini: 71.67%
Kimi-K2-Thinking: 71.3%
GPT-4.1: 70.6%
Gemini 2.5 Flash: 61%
Nova Premier: 57%
Gemini 2.5 Pro: 56.3%
Nova Pro: 52.3%
gpt-5-nano: 47%
Llama 3.3 70B: 46.3%
Llama 4 Maverick: 46.3%
Llama 4 Scout: 44.3%
o3-mini: 44.3%
Nova Lite: 40%
Mistral Large: 38.3%
Codestral: 34%
Nova Micro: 31%
gpt-oss-120b: 30%
Magistral Medium: 29.3%
Command R+: 25.7%
Qwen 3 235B: 21.3%
Llama 3.1 405B: 20%
Command R: 15.3%
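To make "multi-turn" concrete, here is a minimal sketch of how such a conversation might be replayed and graded on its final decision. The message format, `call_model` stub, and sample case are assumptions for illustration, not Snorkel's protocol.

```python
# A minimal sketch of multi-turn grading, assuming each case ends with an
# underwriting decision to check against an expert label.
def call_model(messages: list[dict]) -> str:
    """Hypothetical model call; a real harness would query an LLM here."""
    return "Based on the loss history, I would decline this risk."

def run_case(turns: list[str], expected_decision: str) -> bool:
    messages = [{"role": "system",
                 "content": "You are a commercial underwriting assistant."}]
    reply = ""
    for turn in turns:                                 # replay broker turns in order
        messages.append({"role": "user", "content": turn})
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
    return expected_decision.lower() in reply.lower()  # grade only the final turn

case = ["A restaurant seeks property coverage.",
        "It has had three kitchen fires in two years. Quote or decline?"]
print(run_case(case, expected_decision="decline"))  # -> True
```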

Finance Reasoning

A benchmark co-created with Snorkel's financial expert network to test agents on financial reasoning questions through tool-calling and planning.
Grok 4: 53.1%
Claude Sonnet 3.7: 51.89%
gpt-5: 51%
Claude Sonnet 4: 49.37%
Claude Opus 4: 48.1%
Gemini 3 Pro: 46.84%
gpt-5-mini: 46.8%
o4-mini: 45.57%
Claude Opus 4.1: 45.56%
GPT-4.1: 44.3%
o3: 43.04%
Grok 3: 41.8%
Grok 4 Fast Reasoning: 40.51%
NVIDIA Nemotron Super 49B v1.5: 35.443%
Kimi-K2-Thinking: 35%
Gemini 2.5 Pro: 34.6%
Nova Premier: 34.17%
Gemini 2.5 Flash: 32%
gpt-oss-120b: 31.6%
o3-mini: 30.37%
gpt-5-nano: 26.6%
Qwen 3 235B: 17.7%
Magistral Medium: 13.92%
Nova Pro: 12.65%
Mistral Large: 10.12%
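The tool-calling pattern such a benchmark exercises can be sketched as a registry of callable tools plus a dispatch step that routes model-emitted calls to real functions. The tool names, toy prices, and `dispatch` helper below are hypothetical, not Snorkel's interface.

```python
# Illustrative only: a minimal tool registry and dispatch loop of the kind a
# tool-calling finance benchmark implies. All data here is toy data.
PRICES = {("ACME", "2024-12-31"): 104.2, ("ACME", "2023-12-29"): 88.0}

def get_price(ticker: str, date: str) -> float:
    return PRICES[(ticker, date)]

def year_over_year_return(ticker: str, start: str, end: str) -> float:
    return get_price(ticker, end) / get_price(ticker, start) - 1.0

TOOLS = {"get_price": get_price, "year_over_year_return": year_over_year_return}

def dispatch(name: str, **kwargs):
    """Route a model-emitted tool call to the matching Python function."""
    return TOOLS[name](**kwargs)

# A model that plans "compute ACME's 2024 return" would emit a call like:
print(round(dispatch("year_over_year_return",
                     ticker="ACME", start="2023-12-29", end="2024-12-31"), 4))
# -> 0.1841
```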

SnorkelSequences

A procedurally generated, expert-verified benchmark for evaluating mathematical reasoning and compositional capabilities in LLMs.
gpt-5: 77.6%
gpt-5-mini: 77.6%
gpt-5-nano: 72%
o3-mini: 71.2%
Gemini 2.5 Flash: 70.8%
Claude Sonnet 4: 70.4%
Grok 4 Fast Reasoning: 70.2%
o4-mini: 68.8%
NVIDIA Nemotron Super 49B v1.5: 66.8%
Gemini 2.5 Pro: 66%
Claude Opus 4: 65.6%
o3: 65.2%
Grok 4: 63.2%
Llama 4 Maverick: 62%
Nova Premier: 51.8%
Llama 4 Scout: 48.4%
Claude Sonnet 3.7: 47.6%
Magistral Medium: 47.6%
NVIDIA Nemotron Super 49B: 44.8%
Nova Pro: 41.2%
Nova Lite: 40%
Grok 3: 39.2%
Llama 3.3 70B: 38.8%
Mistral Large: 38.8%
Codestral: 38.4%
GPT-4.1: 36.8%
NVIDIA 70B Instruct: 36.4%
Kimi-K2-Thinking: 36%
Llama 3.1 405B: 35.2%
Nova Micro: 33.6%
Qwen 3 235B: 28%
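Procedural generation is what makes answers exactly checkable: every item is built from known rules, so the ground truth comes for free. The specific rule family below is an assumption for illustration; the real item distributions are not public.

```python
# A minimal sketch of procedural generation with an exact, checkable answer.
import random

def make_item(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    a, b, c = rng.randint(2, 4), rng.randint(1, 9), rng.randint(2, 5)
    x = rng.randint(1, 6)
    terms = []
    for i in range(6):
        x = a * x + b if i % 2 == 0 else c * x   # alternate two composed rules
        terms.append(x)
    question = (f"A sequence alternates two rules: multiply by {a} and add {b}, "
                f"then multiply by {c}. Starting terms: {terms[:-1]}. Next term?")
    return question, terms[-1]                    # ground truth by construction

question, gold = make_item(seed=42)
print(question)
print("gold:", gold)
```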

SnorkelSpatial

A procedurally generated benchmark for evaluating allocentric (world-centered) and egocentric (viewer-centered) spatial reasoning capabilities in LLMs.
Grok 4 Fast Reasoning: 84.85%
o3: 76.67%
gpt-5: 73.94%
gpt-oss-120b: 52.73%
gpt-5-mini: 45.45%
Claude Opus 4.1: 45.15%
Magistral Medium 1.2: 44.24%
Claude Opus 4: 40.3%
o3-mini: 37.88%
Claude Sonnet 4: 33.33%
gpt-5-nano: 26.67%
Claude Sonnet 3.7: 21.52%
Gemini 2.5 Flash: 18.79%
Llama 4 Scout: 15.45%
Gemini 2.5 Pro: 15.15%
gpt-5-chat: 14.85%
Mistral Large: 14.85%
o4-mini: 14.85%
GPT-4.1: 14.55%
Llama 3.3 70B: 14.55%
Mistral Medium 3.1: 14.55%
Nova Micro: 14.55%
Command R+: 14.24%
Nova Premier: 14.24%
Qwen 3 235B: 13.94%
Codestral: 13.64%
Nova Lite: 13.33%
Grok 3: 12.73%
Magistral Medium: 12.42%
Llama 4 Maverick: 12.12%
Nova Pro: 12.12%
Command R: 11.82%
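The allocentric/egocentric distinction is easy to see in code: egocentric commands (turn left, go forward) must be resolved against an allocentric frame (compass directions on a grid). The grid task below is an illustrative assumption, not an actual benchmark item.

```python
# A minimal sketch: egocentric commands resolved in allocentric coordinates.
HEADINGS = ["N", "E", "S", "W"]                       # allocentric compass frame
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def walk(commands: list[str]) -> tuple[int, int]:
    """Follow left/right/forward commands and return the final grid position."""
    x, y, h = 0, 0, 0                                 # start at origin facing north
    for cmd in commands:
        if cmd == "left":
            h = (h - 1) % 4                           # egocentric turn
        elif cmd == "right":
            h = (h + 1) % 4
        elif cmd == "forward":
            dx, dy = MOVES[HEADINGS[h]]               # step in the allocentric frame
            x, y = x + dx, y + dy
    return x, y

# "Face north, go forward, turn right, go forward twice. Where are you?"
print(walk(["forward", "right", "forward", "forward"]))  # -> (2, 1)
```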

SnorkelWordle

A benchmark designed to evaluate linguistic reasoning and instruction-following capabilities in language models through the iterative and constrained gameplay of Wordle.
gpt-5: 94%
Grok 4: 93%
o3: 92.9%
o4-mini: 91.9%
Gemini 3 Pro: 91%
gpt-5-mini: 91%
o3-mini: 90%
Grok 4 Fast Reasoning: 88%
Claude Opus 4: 85.6%
Kimi-K2-Thinking: 85%
Claude Sonnet 4: 83%
gpt-oss-120b: 81.6%
gpt-5-nano: 79%
Gemini 2.5 Pro: 74%
Grok 3: 71%
Claude Sonnet 3.7: 68%
gpt-oss-20b: 65.9%
GPT-4.1: 62%
Gemini 2.5 Flash: 61.9%
Kimi-K2: 54%
Llama 3.3 70B: 10.2%
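Wordle's feedback rule is the constraint the gameplay loop enforces at every turn: a model must integrate green/yellow/gray signals across guesses. The sketch below implements the standard feedback rule (including the official game's handling of repeated letters); only the harness around it is left out.

```python
# Standard Wordle feedback: greens are marked first so duplicate letters
# consume the answer's letter budget before yellows are assigned.
from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Return G (green), Y (yellow), or - (gray) for each letter of the guess."""
    marks = ["-"] * len(guess)
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "G"                  # right letter, right position
    for i, g in enumerate(guess):
        if marks[i] == "-" and remaining[g] > 0:
            marks[i] = "Y"                  # right letter, wrong position
            remaining[g] -= 1
    return "".join(marks)

print(feedback("crane", "caner"))  # -> "GYYYY"
```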

SnorkelGraph

A procedurally generated, expert-verified benchmark for evaluating the mathematical and spatial reasoning capabilities of LLMs through graph problems.
Grok 4 Fast Reasoning: 75%
o4-mini: 75%
gpt-5-mini: 72.5%
gpt-5: 72%
o3: 71.5%
o3-mini: 71%
Claude Opus 4: 64.5%
Grok 3: 64%
GPT-4.1: 63%
gpt-5-nano: 62.5%
Qwen 3 235B: 61.5%
Grok 4: 61%
Claude Sonnet 4: 58%
Gemini 2.5 Pro: 58%
Gemini 2.5 Flash: 55%
Magistral Medium: 53.5%
Claude Sonnet 3.7: 50%
Nova Premier: 34.5%
Llama 4 Maverick: 34%
Mistral Large: 30%
NVIDIA Nemotron Super 49B: 29%
Nova Pro: 28%
Llama 4 Scout: 26%
Codestral: 24.5%
Llama 3.3 70B: 23.5%
NVIDIA 70B Instruct: 22.5%
Llama 3.1 405B: 20.5%
Nova Lite: 19%
Nova Micro: 17.5%
Command R+: 15%
Command-Light: 10.5%
Command: 10%
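Because items are procedurally generated, each question ships with an exact answer computed by a classical algorithm (here, breadth-first search for shortest paths). The random-graph recipe below is an illustrative assumption, not the benchmark's actual generator.

```python
# A minimal sketch: generate a random graph, then compute the ground-truth
# shortest-path length with BFS so the item is machine-checkable.
import random
from collections import deque

def make_item(seed: int, n: int = 8):
    rng = random.Random(seed)
    edges = {(u, v) for u in range(n) for v in range(u + 1, n)
             if rng.random() < 0.3}
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist, frontier = {0: 0}, deque([0])      # BFS from node 0
    while frontier:
        u = frontier.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                frontier.append(w)
    question = f"Edges: {sorted(edges)}. Shortest path length from 0 to {n - 1}?"
    return question, dist.get(n - 1)         # None if unreachable

question, gold = make_item(seed=3)
print(question)
print("gold:", gold)
```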

SnorkelFinance

A benchmark of expert-verified financial QA built from financial reports, used to evaluate AI agents on tool-calling and reasoning.
gpt-5: 81%
o3: 81%
Gemini 3 Pro: 80.34%
Claude Opus 4.1: 80.3%
gpt-5-mini: 79.3%
Claude Opus 4: 78.3%
Claude Sonnet 3.7: 77.9%
Claude Sonnet 4: 76.6%
o4-mini: 76.6%
Grok 4: 74.04%
Grok 4 Fast Reasoning: 73.45%
Kimi-K2-Thinking: 71.7%
gpt-oss-120b: 66.6%
Grok 3: 65.86%
o3-mini: 63.79%
GPT-4.1: 62.7%
Nova Premier: 62.06%
Gemini 2.5 Pro: 60.6%
Gemini 2.5 Flash: 53.1%
Qwen 3 235B: 51.37%
gpt-5-nano: 50%
NVIDIA Nemotron Super 49B v1.5: 44%
Nova Pro: 40.34%
Codestral: 27.6%
Nova Lite: 16.89%
Magistral Medium: 16.2%
Nova Micro: 14.48%
Mistral Large: 13.4%
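Grading numeric answers drawn from financial reports typically needs tolerant matching, since units, commas, and rounding vary across model outputs. The extraction regex and 1% relative tolerance below are assumptions for illustration, not Snorkel's grading spec.

```python
# A minimal sketch of tolerant numeric grading for report-derived QA.
import re

def extract_number(text: str):
    """Pull the last number (allowing $ and commas) out of a model answer."""
    matches = re.findall(r"-?\$?([\d,]+(?:\.\d+)?)", text)
    return float(matches[-1].replace(",", "")) if matches else None

def grade(answer_text: str, gold: float, rel_tol: float = 0.01) -> bool:
    """Accept the answer if it is within rel_tol of the gold value."""
    value = extract_number(answer_text)
    return value is not None and abs(value - gold) <= rel_tol * abs(gold)

print(grade("Net revenue grew to $1,234.5 million.", 1234.5))  # -> True
```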

Performance per Dollar

[Interactive chart: model performance plotted against cost, per model.]

Model Value Comparison

[Interactive tool: select two models to compare side by side.]

Snorkel Expert Data-as-a-Service

Accelerate the evaluation and development of frontier AI models with a scalable, white-glove service that provides model development teams with high-quality expert data.