Snorkel AI

May 20, 2026

We evaluated [Model Name] across 1,247 enterprise-grade agentic coding tasks. Here’s the full breakdown — including where targeted training data could close the gap.

Performance overview

[Model Name] scored XX.X% on Terminal-Bench+. But the aggregate number masks meaningful variation across task categories.

BENCHMARK 1: 00.0% (+0.0 vs previous)

BENCHMARK 2: 00.0% (+0.0 vs previous)

BENCHMARK 3: 00.0% (Detail)

BENCHMARK 4: 00.0% (Detail)

Scores by category

Terminal-Bench+ covers 8 categories of agentic coding tasks. Here’s the breakdown:

Where it excels

KEY INSIGHT

One non-numeric idea worth spotlighting inline — the thing the reader should leave with, in one sentence.

Describe the top-performing categories and what drives the strong results.

Where it falls short

The lowest-scoring categories represent the biggest opportunity for improvement through targeted training data.

Category name

39.4%

Description of what this category tests and why the model underperforms here.

Common failure mode: Describe the specific pattern of failure observed during evaluation.

Category name

39.4%

Description of what this category tests and why the model underperforms here.

Common failure mode: Describe the specific pattern of failure observed during evaluation.

The data opportunity

OUR TAKE

A clear, opinionated conclusion that tells the reader what to do with this information.

Snorkel’s Agentic Coding Data Series includes expert-curated training examples specifically designed for these weak spots:

Sample data preview

$ example.sh

$ snorkel eval --task terminal-bench+ --model frontier-v1
→ loading 1,247 tasks from Agentic Coding Data Series
→ running multi-step CLI evaluations…

Accuracy: 38.2% (± 1.4)
Pass@1: 0.41 Pass@5: 0.63
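The preview reports Pass@1, Pass@5, and an accuracy with a ± margin, but it does not say how those figures are computed. As a rough sketch only, the snippet below uses the widely used unbiased pass@k estimator and a plain binomial standard error. Both are assumptions about the method, not a description of what the `snorkel eval` tool does. The function names and the per-task results are hypothetical and exist only for illustration.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no passing sample; math.comb returns 0
    # when k exceeds n - c, which gives pass@k = 1 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def binomial_standard_error(accuracy: float, num_tasks: int) -> float:
    """Standard error of a pass rate measured over num_tasks independent tasks."""
    return sqrt(accuracy * (1.0 - accuracy) / num_tasks)


if __name__ == "__main__":
    # Hypothetical per-task results: (samples drawn, samples that passed).
    results = [(10, 4), (10, 0), (10, 10), (10, 1)]
    for k in (1, 5):
        mean = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
        print(f"pass@{k}: {mean:.3f}")

    # Figures taken from the sample preview: 38.2% accuracy over 1,247 tasks.
    se = binomial_standard_error(0.382, 1247)
    print(f"standard error: {100 * se:.1f} points")
```

Under these assumptions, 38.2% accuracy over 1,247 tasks gives a standard error of about 1.4 percentage points. That is consistent with the ± 1.4 shown in the preview, although the source does not state that the margin was derived this way.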

< 40%

Placeholder stat — swap in the number that anchors the piece.
