May 20, 2026

We evaluated [Model Name] across 1,247 enterprise-grade agentic coding tasks. Here’s the full breakdown — including where targeted training data could close the gap.


Performance overview

[Model Name] scored XX.X% on Terminal-Bench+. But the aggregate number masks meaningful variation across task categories.

BENCHMARK 1

00.0%

+0.0 vs previous

BENCHMARK 2

00.0%

+0.0 vs previous

BENCHMARK 3

00.0%

Detail

BENCHMARK 4

00.0%

Detail

Scores by category

Terminal-Bench+ covers 8 categories of agentic coding tasks. Here’s the breakdown:

Category name 78.2%
Category name 52.3%
Category name 31.2%

Where it excels

KEY INSIGHT

One non-numeric idea worth spotlighting inline — the thing the reader should leave with, in one sentence.

Describe the top-performing categories and what drives the strong results.

Where it falls short

The lowest-scoring categories represent the biggest opportunity for improvement through targeted training data.

Category name

39.4%

Description of what this category tests and why the model underperforms here.

Common failure mode: Describe the specific pattern of failure observed during evaluation.

Category name

39.4%

Description of what this category tests and why the model underperforms here.

Common failure mode: Describe the specific pattern of failure observed during evaluation.

The data opportunity

OUR TAKE

A clear, opinionated conclusion that tells the reader what to do with this information.

Snorkel’s Agentic Coding Data Series includes expert-curated training examples specifically designed for these weak spots:

Sample data preview

$ example.sh

$ snorkel eval --task terminal-bench+ --model frontier-v1
→ loading 1,247 tasks from Agentic Coding Data Series
→ running multi-step CLI evaluations…

Accuracy: 38.2% (± 1.4)
Pass@1: 0.41   Pass@5: 0.63
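
The sample output does not say how the ± 1.4 margin is derived. One plausible reading, assumed here rather than confirmed by the source, is a one-standard-error binomial margin over the 1,247 tasks, treating each task as an independent pass/fail trial. The helper name `binomial_standard_error` is hypothetical and used only for illustration. A minimal sketch of that arithmetic:

```python
import math


def binomial_standard_error(accuracy: float, n_tasks: int) -> float:
    """Standard error of a pass rate, assuming independent pass/fail tasks."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_tasks)


# Values taken from the sample output: 38.2% accuracy over 1,247 tasks.
se = binomial_standard_error(0.382, 1247)
print(f"standard error: {100 * se:.1f} percentage points")  # prints 1.4
```

Under this assumption the margin shrinks with the square root of the task count, so per-category scores computed on a fraction of the 1,247 tasks would carry noticeably wider margins than the aggregate.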

< 40%

Placeholder stat — swap in the number that anchors the piece.

Request sample data

Get a sample dataset tailored to your model’s gaps

We’ll prepare a 200-example sample from the datasets most relevant to your model’s weak spots. Typically delivered within 48 hours.

Want more like this? Talk to our team about how Snorkel builds frontier AI data.
