Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.

Overview

The Snorkel Agentic Coding benchmark comprises 100 multi-step coding tasks, evenly distributed across four difficulty tiers, designed to evaluate models across a diverse range of capabilities germane to real-world software engineering work.

Drawing on insights from our contributions to the Terminal-Bench project, our Agentic Coding tasks evaluate agents in fully sandboxed execution environments. Each task is paired with a human-validated reference solution, comprehensive unit tests, and scoring rubrics that assess both final outputs and the agent's trajectory.

Leaderboard

Rank	Model	Score
1	Claude Opus 4.6	65.2%
2	Claude Opus 4.5	58%
3	Claude Sonnet 4.5	57.6%
4	Gemini 3 Pro Preview	51.6%
5	gpt-5.2	49.4%
6	gpt-5	45.2%
7	Kimi-K2-Thinking	36.8%
8	Devstral 2	33.2%
9	Grok 4.1 Fast	25.2%
10	Qwen 3 Coder 480B	18.8%
11	Mistral Large 3	13.8%

Sample Task

Incident Commander: Payments Canary Rollback

You are the on-call SRE responding to a spike in 5xx errors after a canary rollout of payments-api. All observability artifacts, runbooks, and config files are already packaged inside the container. No network access is allowed.

Requirements

1. Diagnose the incident using logs, metrics, and traces to identify the root cause and blast radius.
2. Execute mitigation per the runbook (/app/runbooks/payments-canary-rollback.md):
   - Update /app/config/service_state.json as specified in the runbook
   - Only modify payments-api - do not change any other services
   - Only change fields specified in the runbook - preserve all other fields exactly
   - Create any required configuration files specified in the runbook
3. Produce output files documenting your diagnosis, actions, and validation.

Inputs

/app/data/logs/app.log — application logs
/app/data/metrics/before.json — pre-incident metrics
/app/data/metrics/after.json — post-mitigation metrics
/app/data/traces/trace_sample.json — distributed traces
/app/data/deployments/ — deployment manifests
/app/runbooks/payments-canary-rollback.md — mitigation runbook
/app/config/service_state.json — service configuration

Outputs

All outputs must be created exactly as specified below.

`/app/outputs/incident_summary.md`

Markdown document with these sections: Overview, Root Cause, Blast Radius, Evidence, Actions Taken, Validation, Next Steps.

Cite full artifact paths used in analysis
Include pod, DB host, and failure mode in root cause
Include the primary trace ID (the first payments-api canary error trace with DB connection issues, by timestamp), rollback_token, and correlation ID
List affected transaction IDs and count
Include the incident time window (start and end timestamps of payments-api canary DB errors)
Confirm unaffected services explicitly - check all other services in service_state.json (excluding payments-api) and confirm they are unaffected
Show before/after metrics with deltas
Include an "SLO Budget" section containing the words "SLO" and "budget", plus the calculated remaining budget value (monthly_budget_pct minus consumed_pct, rounded to 2 decimal places)
Use "disable" or "disabled" when describing the canary action

`/app/outputs/actions.json`

JSON array with exactly FIVE action objects:

{"action": "disable_canary", "target": "payments-api", "status": "...", "details": "...", "evidence": "...", "rollback_token": "...", "correlation_id": "..."}
- evidence must reference app.log and include the primary trace ID
- correlation_id must be extracted from the relevant log entries
{"action": "create_alert_suppression", "target": "payments-api", "status": "...", "details": "..."}
{"action": "create_followup_ticket", "target": "payments-api", "status": "...", "details": "...", "priority": "...", "assigned_team": "..."}
- priority and assigned_team must be determined per the runbook
{"action": "notify_stakeholders", "target": "payments-api", "status": "...", "details": "...", "channel": "...", "escalation_level": "..."}
- channel and escalation_level must be determined per the runbook
{"action": "update_deployment_status", "target": "payments-api", "status": "...", "details": "...", "previous_status": "...", "new_status": "...", "canary_version": "..."}
- previous_status: the status from the deployment manifest before rollback
- new_status: "rolled_back"
- canary_version: the version from the deployment manifest
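
The shape described above can be checked mechanically before submitting. The following is a minimal sketch, not the benchmark's actual test suite (which is not shown here): the helper name `validate_actions` and the key tables are hypothetical, and the rule that each of the five action names appears exactly once is an assumption read from the list above.

```python
# Keys every action object carries, per the examples above.
COMMON_KEYS = {"action", "target", "status", "details"}

# Extra keys per action name, transcribed from the field lists above.
EXTRA_KEYS = {
    "disable_canary": {"evidence", "rollback_token", "correlation_id"},
    "create_alert_suppression": set(),
    "create_followup_ticket": {"priority", "assigned_team"},
    "notify_stakeholders": {"channel", "escalation_level"},
    "update_deployment_status": {"previous_status", "new_status", "canary_version"},
}


def validate_actions(actions):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(actions, list) or len(actions) != 5:
        return ["expected a JSON array with exactly five action objects"]
    problems = []
    names = [a.get("action") if isinstance(a, dict) else None for a in actions]
    # Assumption: each listed action name appears exactly once.
    for expected in EXTRA_KEYS:
        if names.count(expected) != 1:
            problems.append(f"expected exactly one '{expected}' action")
    for entry, name in zip(actions, names):
        if not isinstance(name, str) or name not in EXTRA_KEYS:
            problems.append(f"unexpected entry: {name!r}")
            continue
        missing = (COMMON_KEYS | EXTRA_KEYS[name]) - entry.keys()
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        if entry.get("target") != "payments-api":
            problems.append(f"{name}: target must be payments-api")
        if name == "update_deployment_status" and entry.get("new_status") != "rolled_back":
            problems.append("update_deployment_status: new_status must be rolled_back")
    return problems
```

A caller would typically load `/app/outputs/actions.json` with `json.load` and print whatever `validate_actions` returns. The sketch checks structure only; it does not verify that values such as `priority` or `channel` match the runbook.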

`/app/outputs/postcheck.json`

JSON object with fields: error_rate_pct, p99_latency_ms, validation_passed, notes, affected_txn_count, impact_pct, incident_window, slo_budget, mttr_seconds, error_velocity

error_rate_pct and p99_latency_ms: values from after-metrics for payments-api
validation_passed: boolean based on runbook thresholds
notes: must contain "after.json", the literal metric names ("error_rate_pct", "p99_latency_ms"), and their threshold values ("5" and "400")
affected_txn_count: count of unique transactions from payments-api canary DB error log entries
impact_pct: (affected_txn_count / total_requests_from_before_metrics) * 100, rounded to 2 decimal places
incident_window: object with start and end fields containing the first and last payments-api canary DB error timestamps from logs
slo_budget: object with these exact fields:
- monthly_budget_pct: 0.1 (representing 99.9% availability SLO)
- consumed_pct: the error_rate_pct from before-metrics (during incident)
- remaining_pct: monthly_budget_pct minus consumed_pct
mttr_seconds: Mean Time To Recovery - seconds between first canary DB error and last canary DB error (parse timestamps and compute difference)
error_velocity: errors per minute during incident - affected_txn_count divided by (mttr_seconds / 60), rounded to 2 decimal places
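
To make the arithmetic behind the derived fields concrete, here is a small worked sketch. The function name `derive_metrics` and all input values in the usage note are hypothetical; only the formulas come from the field descriptions above. The fallback for a zero-length window is an assumption to avoid division by zero.

```python
from datetime import datetime


def derive_metrics(start, end, affected_txn_count, total_requests,
                   consumed_pct, monthly_budget_pct=0.1):
    """Compute the derived postcheck fields from already-extracted inputs."""
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    start_dt = datetime.fromisoformat(start.replace("Z", "+00:00"))
    end_dt = datetime.fromisoformat(end.replace("Z", "+00:00"))
    mttr_seconds = int((end_dt - start_dt).total_seconds())

    # Errors per minute; fall back to the raw count for a zero-length window.
    if mttr_seconds > 0:
        error_velocity = round(affected_txn_count / (mttr_seconds / 60), 2)
    else:
        error_velocity = float(affected_txn_count)

    return {
        "mttr_seconds": mttr_seconds,
        "error_velocity": error_velocity,
        "impact_pct": round(affected_txn_count / total_requests * 100, 2),
        "remaining_pct": round(monthly_budget_pct - consumed_pct, 2),
    }
```

With made-up inputs of a four-minute window, 12 affected transactions, 1000 total requests, and a 2.5% error rate during the incident, this gives `mttr_seconds=240`, `error_velocity=3.0`, `impact_pct=1.2`, and `remaining_pct=-2.4`. The remaining budget goes negative whenever the consumed percentage exceeds the 0.1% monthly budget.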

`/app/config/alert_suppression.json`

JSON object with fields:

service: "payments-api"
suppression_window_minutes: 30
reason: description of the incident
created_by: "incident-commander"
expires_at: ISO 8601 timestamp, computed as incident start time (first canary error) + 30 minutes
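
The `expires_at` rule above can be sketched as follows. The helper name `suppression_expiry` is hypothetical, and the sketch assumes the log timestamps are ISO 8601 strings that carry either a trailing `Z` or an explicit UTC offset.

```python
from datetime import datetime, timedelta, timezone


def suppression_expiry(incident_start, window_minutes=30):
    """Return incident start + window as an ISO 8601 UTC timestamp."""
    # Replace the trailing "Z" so fromisoformat also works on Python < 3.11.
    start = datetime.fromisoformat(incident_start.replace("Z", "+00:00"))
    expiry = start + timedelta(minutes=window_minutes)
    # Normalize to UTC before formatting with a literal "Z" suffix.
    return expiry.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For a made-up start of `2024-01-01T23:45:00Z`, the result is `2024-01-02T00:15:00Z`, which shows that the date rolls over correctly at midnight.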

`/app/data/deployments/canary_manifest.json`

Update the deployment manifest:

Find the payments-api deployment entry
Change its status field from "active" to "rolled_back"
Preserve all other fields and entries exactly

Reference solution

#!/bin/bash
set -euo pipefail

cd /app
mkdir -p /app/outputs

python3 - <<'PY'
import json
import pathlib
import re
import textwrap
from datetime import datetime, timedelta

base = pathlib.Path("/app")

before_metrics = json.loads((base / "data/metrics/before.json").read_text())
after_metrics = json.loads((base / "data/metrics/after.json").read_text())
payments_before = before_metrics["payments-api"]
payments_after = after_metrics["payments-api"]

traces_data = json.loads((base / "data/traces/trace_sample.json").read_text())

payments_canary_errors = []
for t in traces_data.get("traces", []):
    if t.get("root_service") != "payments-api" or t.get("outcome") != "error":
        continue
    for span in t.get("spans", []):
        tags = span.get("tags", {})
        if tags.get("deployment") == "canary" and tags.get("db.error"):
            payments_canary_errors.append(t)
            break

payments_canary_errors.sort(key=lambda x: x.get("timestamp", ""))
primary_trace = payments_canary_errors[0] if payments_canary_errors else None
primary_trace_id = primary_trace["trace_id"] if primary_trace else ""

canary_pod = None
db_host = None
if primary_trace:
    for span in primary_trace.get("spans", []):
        if span.get("tags", {}).get("pod"):
            canary_pod = span["tags"]["pod"]
        if span.get("tags", {}).get("host"):
            db_host = span["tags"]["host"]

logs = (base / "data/logs/app.log").read_text()

deployments_path = base / "data/deployments/canary_manifest.json"
deployments = json.loads(deployments_path.read_text())
rollback_token = None
canary_version = None
previous_status = None
for dep in deployments.get("deployments", []):
    if dep.get("service") == "payments-api" and dep.get("status") == "active":
        rollback_token = dep.get("rollback_token")
        canary_version = dep.get("version")
        previous_status = dep.get("status")
        dep["status"] = "rolled_back"
        break

deployments_path.write_text(json.dumps(deployments, indent=2))

affected_txns = set()
error_timestamps = []
correlation_id = None
for line in logs.split("\n"):
    if "payments-api" in line and "ERROR" in line and "[txn:" in line:
        if "canary pod" in line and "db connection refused" in line:
            match = re.search(r'\[txn:(pay-\d+)\]', line)
            if match:
                affected_txns.add(match.group(1))
            timestamp = line.split()[0]
            error_timestamps.append(timestamp)
            corr_match = re.search(r'\[corr:(INC-\d{8}-\d{4})\]', line)
            if corr_match and correlation_id is None:
                correlation_id = corr_match.group(1)

affected_txns = sorted(list(affected_txns))
affected_txn_count = len(affected_txns)

incident_window = {
    "start": min(error_timestamps) if error_timestamps else "",
    "end": max(error_timestamps) if error_timestamps else ""
}

start_time = datetime.fromisoformat(incident_window["start"].replace("Z", "+00:00"))
end_time = datetime.fromisoformat(incident_window["end"].replace("Z", "+00:00"))
mttr_seconds = int((end_time - start_time).total_seconds())

if mttr_seconds > 0:
    error_velocity = round(affected_txn_count / (mttr_seconds / 60), 2)
else:
    error_velocity = float(affected_txn_count)

total_requests = payments_before["requests"]
impact_pct = round((affected_txn_count / total_requests) * 100, 2)

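# Round the deltas so float subtraction noise does not leak into the summary text.
error_rate_delta = round(payments_after["error_rate_pct"] - payments_before["error_rate_pct"], 2)
p99_delta = round(payments_after["p99_latency_ms"] - payments_before["p99_latency_ms"], 2)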

state_path = base / "config/service_state.json"
state = json.loads(state_path.read_text())
payments_owner = state.get("payments-api", {}).get("owner", "payments-team")
payments_state = state.get("payments-api", {})
payments_state["canary_enabled"] = False
payments_state["traffic_split"] = {"stable": 100, "canary": 0}
state["payments-api"] = payments_state
state_path.write_text(json.dumps(state, indent=2))

incident_time = datetime.fromisoformat(primary_trace["timestamp"].replace("Z", "+00:00"))
expires_time = incident_time + timedelta(minutes=30)
expires_at = expires_time.strftime("%Y-%m-%dT%H:%M:%SZ")

alert_suppression = {
    "service": "payments-api",
    "suppression_window_minutes": 30,
    "reason": "Canary rollback due to DB connection failures on canary pod",
    "created_by": "incident-commander",
    "expires_at": expires_at
}
(base / "config/alert_suppression.json").write_text(json.dumps(alert_suppression, indent=2))

priority = "P1" if impact_pct > 1 else "P2"
channel = "pagerduty" if priority == "P1" else "slack"
escalation_level = "L2" if impact_pct > 5 else "L1"

actions = [
    {
        "action": "disable_canary",
        "target": "payments-api",
        "status": "applied",
        "details": "Set traffic_split to stable=100 and canary=0 per runbook",
        "evidence": f"/app/data/logs/app.log; trace_id={primary_trace_id}",
        "rollback_token": rollback_token,
        "correlation_id": correlation_id,
    },
    {
        "action": "create_alert_suppression",
        "target": "payments-api",
        "status": "applied",
        "details": "Created alert suppression for 30 minutes to prevent false positives during recovery",
    },
    {
        "action": "create_followup_ticket",
        "target": "payments-api",
        "status": "applied",
        "details": f"Post-incident review ticket for canary DB connection failures; impact: {impact_pct}% of requests",
        "priority": priority,
        "assigned_team": payments_owner,
    },
    {
        "action": "notify_stakeholders",
        "target": "payments-api",
        "status": "applied",
        "details": f"Notified stakeholders about payments-api canary rollback via {channel}",
        "channel": channel,
        "escalation_level": escalation_level,
    },
    {
        "action": "update_deployment_status",
        "target": "payments-api",
        "status": "applied",
        "details": f"Updated deployment status from {previous_status} to rolled_back",
        "previous_status": previous_status,
        "new_status": "rolled_back",
        "canary_version": canary_version,
    }
]
(base / "outputs/actions.json").write_text(json.dumps(actions, indent=2))

validation_passed = payments_after["error_rate_pct"] < 5 and payments_after["p99_latency_ms"] < 400
postcheck = {
    "error_rate_pct": payments_after["error_rate_pct"],
    "p99_latency_ms": payments_after["p99_latency_ms"],
    "validation_passed": validation_passed,
    "notes": "Values from /app/data/metrics/after.json for payments-api; thresholds: error_rate_pct < 5 and p99_latency_ms < 400",
    "affected_txn_count": affected_txn_count,
    "impact_pct": impact_pct,
    "incident_window": incident_window,
    "slo_budget": {
        "monthly_budget_pct": 0.1,
        "consumed_pct": payments_before["error_rate_pct"],
        "remaining_pct": round(0.1 - payments_before["error_rate_pct"], 2),
    },
    "mttr_seconds": mttr_seconds,
    "error_velocity": error_velocity,
}
(base / "outputs/postcheck.json").write_text(json.dumps(postcheck, indent=2))

summary = textwrap.dedent(
    f"""\
# Incident Summary

## Overview
Spike in payments-api 5xx errors after canary rollout; mitigated by rollback to stable.
Rollback token: {rollback_token}
Correlation ID: {correlation_id}

Incident window: {incident_window["start"]} to {incident_window["end"]}
MTTR: {mttr_seconds} seconds
Error velocity: {error_velocity} errors/minute

## Root Cause
Payments-api canary pod `{canary_pod}` experienced DB connection failures to `{db_host}`.
The canary build could not establish connections, triggering circuit breaker after {affected_txn_count} consecutive failures.

Before incident: error_rate_pct={payments_before["error_rate_pct"]}%, p99_latency_ms={payments_before["p99_latency_ms"]}
After mitigation: error_rate_pct={payments_after["error_rate_pct"]}%, p99_latency_ms={payments_after["p99_latency_ms"]}
Delta: error_rate_pct={error_rate_delta}% (Δ{error_rate_delta}%), p99_latency_ms={p99_delta}ms (Δ{p99_delta}ms)

Primary trace ID: {primary_trace_id}

## Blast Radius
Impact limited to payments-api requests routed to canary (30% traffic).
Affected transactions: {affected_txn_count} ({", ".join(affected_txns)})
Impact percentage: {impact_pct}% of total requests

Unaffected services:
- checkout: No correlated errors detected during incident window
- inventory-api: Normal operation, lock timeout was unrelated to payments incident
- notification-api: No alerts triggered
- user-api: Canary operating normally, validation error was user input issue (unrelated)
- audit-api: No data integrity issues detected

## Evidence
- Logs: /app/data/logs/app.log
- Metrics (before): /app/data/metrics/before.json
- Metrics (after): /app/data/metrics/after.json
- Trace sample: /app/data/traces/trace_sample.json
- Runbook: /app/runbooks/payments-canary-rollback.md
- Deployment manifest: /app/data/deployments/canary_manifest.json

## Actions Taken
- Executed rollback: disable_canary on payments-api
- traffic_split set to stable=100, canary=0
- Created alert suppression for payments-api (30 minute window)
- Created follow-up ticket ({priority}) assigned to {payments_owner}
- Notified stakeholders via {channel} (escalation level: {escalation_level})
- Updated deployment status to rolled_back (was: {previous_status})
- Rollback token: {rollback_token}

## Validation
Post-mitigation metrics:
- error_rate_pct: {payments_after["error_rate_pct"]}% (threshold < 5%) ✓
- p99_latency_ms: {payments_after["p99_latency_ms"]}ms (threshold < 400ms) ✓
- validation_passed: {validation_passed}

Metrics show error_rate_pct: {payments_before["error_rate_pct"]}% → {payments_after["error_rate_pct"]}% (Δ{error_rate_delta}%)
Metrics show p99_latency_ms: {payments_before["p99_latency_ms"]}ms → {payments_after["p99_latency_ms"]}ms (Δ{p99_delta}ms)

SLO Budget Analysis:
- Monthly error budget: 0.1% (99.9% availability target)
- Consumed during incident: {payments_before["error_rate_pct"]}%
- Remaining budget: {round(0.1 - payments_before["error_rate_pct"], 2)}%

## Next Steps
- Keep canary disabled until a fixed build is available
- Investigate DB connection pool settings on canary pods
- Add connection retry/backoff tuning
- Monitor error budget for payments-api for 24h
- Post-incident review scheduled ({priority} ticket created)
"""
)
(base / "outputs/incident_summary.md").write_text(summary)
PY

Positive criteria

These checks reward the agent for correct investigation, mitigation, validation, and incident response actions.

Criterion	Points
Agent reads runbook file before taking mitigation actions	+3
Agent uses grep with targeted patterns to extract payments-api canary DB errors from logs and shows matching output	+2
Agent uses sed or head/tail to preview key data files (logs, metrics, traces) before processing	+1
Agent parses logs to extract incident window timestamps (start/end) and shows calculated values	+2
Agent parses traces to identify primary trace ID and shows trace analysis output	+2
Agent parses before.json and after.json to extract payments-api metrics and shows metric values	+2
Agent modifies service_state.json correctly per runbook (sets canary_enabled=false or traffic_split) and verifies changes with cat	+5
Agent updates canary_manifest.json status field from "active" to "rolled_back" and verifies with sed/cat	+3
Agent creates alert_suppression.json with correct structure and verifies output	+2
Agent generates all four required output files (incident_summary.md, actions.json, postcheck.json, alert_suppression.json)	+5
Agent verifies produced output files by displaying content with sed/cat commands	+2
Agent extracts affected transaction IDs from logs using regex patterns and shows count	+2
Agent computes derived metrics (impact_pct, mttr_seconds, error_velocity, slo_budget) using Python or similar	+3
Agent recovers from errors in data parsing or file operations and successfully retries	+2
Agent uses Python for complex JSON/log parsing with visible output showing extracted data	+2

Negative criteria

These checks apply penalties for unsafe actions, missed validation, or failure to follow the runbook.

Criterion	Points
Agent modifies files outside /app workspace without task-related justification	-3
Agent repeats the same failing grep or parsing command three or more times without modification	-1
Agent claims task completion while verification commands show missing or malformed output files	-5
Agent modifies service_state.json fields for services other than payments-api	-5
Agent skips reading runbook before making configuration changes	-3
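
As an illustration of how the point values above could add up, here is a minimal sketch. The page does not state how rubric points are aggregated or combined with unit-test results, so the plain sum below is an assumption, and the short criterion keys are hypothetical labels for the criteria listed above.

```python
# Points per positive criterion (hypothetical keys, values from the rubric above).
POSITIVE = {
    "reads_runbook_first": 3,
    "greps_canary_db_errors": 2,
    "previews_data_files": 1,
    "extracts_incident_window": 2,
    "identifies_primary_trace": 2,
    "extracts_metrics": 2,
    "updates_service_state": 5,
    "updates_canary_manifest": 3,
    "creates_alert_suppression": 2,
    "generates_all_outputs": 5,
    "verifies_outputs": 2,
    "extracts_txn_ids": 2,
    "computes_derived_metrics": 3,
    "recovers_from_errors": 2,
    "uses_python_for_parsing": 2,
}

# Penalties per negative criterion (hypothetical keys, values from the rubric above).
NEGATIVE = {
    "modifies_outside_app": -3,
    "repeats_failing_command": -1,
    "false_completion_claim": -5,
    "modifies_other_services": -5,
    "skips_runbook": -3,
}


def rubric_score(satisfied, triggered):
    """Sum points for satisfied positive criteria and triggered penalties."""
    unknown = (set(satisfied) - POSITIVE.keys()) | (set(triggered) - NEGATIVE.keys())
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return sum(POSITIVE[c] for c in set(satisfied)) + sum(NEGATIVE[c] for c in set(triggered))
```

Under this assumption, a trajectory that satisfies every positive criterion and triggers no penalty scores 38, and one that only triggers every penalty scores -17.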

├── environment/                       # Docker container contents
│   ├── Dockerfile
│   │
│   ├── config/
│   │   └── service_state.json         # Service configs (6 microservices)
│   │
│   ├── data/
│   │   ├── logs/
│   │   │   └── app.log                # Application logs (~70 lines)
│   │   │
│   │   ├── metrics/
│   │   │   ├── before.json            # Pre-incident metrics
│   │   │   └── after.json             # Post-mitigation metrics
│   │   │
│   │   ├── traces/
│   │   │   └── trace_sample.json      # Distributed traces (18 spans)
│   │   │
│   │   └── deployments/
│   │       └── canary_manifest.json   # Deployment manifest
│   │
│   ├── runbooks/
│   │   └── payments-canary-rollback.md  # SRE incident runbook
│   │
│   └── outputs/                       # Agent writes outputs here
│       └── .gitkeep

Methodology

METRIC

Pass@5, evaluated through the Harbor evaluation harness.
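
The page does not spell out how Pass@5 is computed or aggregated across tasks, and the Harbor harness's own implementation is not shown here. As a hedged illustration only, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts per task and c the number that passed. The function names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n attempts, c of them passing."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k attempts failed, every size-k sample contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(passes_per_task, n, k):
    """Average pass@k over tasks, as a percentage (assumed aggregation)."""
    return 100.0 * sum(pass_at_k(n, c, k) for c in passes_per_task) / len(passes_per_task)
```

When n equals k, as with five attempts and k = 5, the estimator reduces to 1.0 if any attempt passed and 0.0 otherwise.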

TIMEOUT

Each task has a specific timeout limit, with an absolute maximum of 30 minutes for both agent and verifier.

ENVIRONMENT

Fully sandboxed execution. Each sample is accompanied by all data and dependencies required, a test suite, a rubric for human and LLM evaluators, and a golden solution.

Difficulty Tiers

Four levels, evenly distributed across 100 tasks. Calibrated to provide signal from efficiency-optimized to frontier-level models.

Behind the benchmark

The current version of the benchmark spans a wide range of task categories: typical software engineering tasks, advanced ML and data analytics, and build and dependency management. It tests agents on long-horizon planning, task tracking, evaluating and executing their own solutions, and recovering from errors and incorrect earlier steps.

Our benchmark is built to challenge even the most advanced frontier models. Tasks are constructed with experts in the loop, who confirm that every challenge is solvable in the environment in which it runs and verify the reliability of all dependencies. We have calibrated the tasks to deliver a range of difficulties, providing meaningful feedback for agents and models across the cost/performance spectrum, from those pursuing Pareto-optimal results to those delivering truly frontier-level capabilities.

From the blog

Data development

Introducing the Snorkel Agentic Coding Benchmark

Today, we’re sharing details about the Snorkel Agentic Coding benchmark—a comprehensive evaluation suite designed to test whether agents can handle...

Kobie Crawford

January 8, 2026
