We evaluated [Model Name] across 1,247 enterprise-grade agentic coding tasks. Here’s the full breakdown, including where targeted training data could close the gap.
BENCHMARK 1
00.0%
+0.0 vs previous
BENCHMARK 2
00.0%
+0.0 vs previous
BENCHMARK 3
00.0%
Detail
BENCHMARK 4
00.0%
Detail
Terminal-Bench+ covers 8 categories of agentic coding tasks. Here’s the breakdown:
KEY INSIGHT
One non-numeric idea worth spotlighting inline — the thing the reader should leave with, in one sentence.
Describe the top-performing categories and what drives the strong results.
The lowest-scoring categories represent the biggest opportunity for improvement through targeted training data.
Category name
39.4%
Description of what this category tests and why the model underperforms here.
Common failure mode: Describe the specific pattern of failure observed during evaluation.
Category name
39.4%
Description of what this category tests and why the model underperforms here.
Common failure mode: Describe the specific pattern of failure observed during evaluation.
OUR TAKE
A clear, opinionated conclusion that tells the reader what to do with this information.
Snorkel’s Agentic Coding Data Series includes expert-curated training examples specifically designed for these weak spots:
< 40%
Placeholder stat — swap in the number that anchors the piece.
We’ll prepare a 200-example sample from the datasets most relevant to your model’s weak spots. Samples are typically delivered within 48 hours.