Released April 8, 2026

Agentic Coding

A benchmark for evaluating AI models on complex, real-world coding tasks that require multi-step reasoning, tool use, and autonomous problem-solving.
Overview

SlopCodeBench (SCBench) is a benchmark designed to evaluate coding agents the way real software actually gets built: through repeated requirement changes and extensions. Instead of treating the spec as a one-shot oracle, each task is a sequence of checkpoints where an agent implements an initial version, then extends its own solution multiple times as new requirements arrive.

The v1.0 release includes 36 problems with 196 total checkpoints, evaluated in a black-box setting where only a CLI or API contract is given. The benchmark prescribes no architecture, function signatures, or module boundaries, so early design decisions can meaningfully help or hurt later work.

Leaderboard

Rank  Model                 Score
1     Claude Opus 4.5       58
2     Claude Sonnet 4.5     57.6
3     Gemini 3 Pro Preview  51.6
4     gpt-5.2               49.4
5     gpt-5                 45.2
6     Kimi-K2-Thinking      36.8
7     Devstral 2            33.2
8     Grok 4.1 Fast         25.2
9     Qwen 3 Coder 480B     18.8
10    Mistral Large 3       13.8

Sample Task

Incident Commander: Payments Canary Rollback

You are the on-call SRE responding to a spike in 5xx errors after a canary rollout of payments-api. All observability artifacts, runbooks, and config files are already packaged inside the container. No network access is allowed.

Requirements

  1. Diagnose the incident using logs, metrics, and traces to identify the root cause and blast radius.

  2. Execute mitigation per the runbook (/app/runbooks/payments-canary-rollback.md):

    • Update /app/config/service_state.json as specified in the runbook

    • Only modify payments-api - do not change any other services

    • Only change fields specified in the runbook - preserve all other fields exactly

    • Create any required configuration files specified in the runbook

  3. Produce output files documenting your diagnosis, actions, and validation.

Inputs

  • /app/data/logs/app.log — application logs

  • /app/data/metrics/before.json — pre-incident metrics

  • /app/data/metrics/after.json — post-mitigation metrics

  • /app/data/traces/trace_sample.json — distributed traces

  • /app/data/deployments/ — deployment manifests

  • /app/runbooks/payments-canary-rollback.md — mitigation runbook

  • /app/config/service_state.json — service configuration

Outputs

All outputs must be created exactly as specified below.

/app/outputs/incident_summary.md

Markdown document with these sections: Overview, Root Cause, Blast Radius, Evidence, Actions Taken, Validation, Next Steps.

  • Cite full artifact paths used in analysis

  • Include pod, DB host, and failure mode in root cause

  • Include the primary trace ID (the first payments-api canary error trace with DB connection issues, by timestamp), rollback_token, and correlation ID

  • List affected transaction IDs and count

  • Include the incident time window (start and end timestamps of payments-api canary DB errors)

  • Confirm unaffected services explicitly - check all other services in service_state.json (excluding payments-api) and confirm they are unaffected

  • Show before/after metrics with deltas

  • Include an "SLO Budget" section containing the words "SLO" and "budget", plus the calculated remaining budget value (monthly_budget_pct minus consumed_pct, rounded to 2 decimal places)

  • Use "disable" or "disabled" when describing the canary action

/app/outputs/actions.json

JSON array with exactly FIVE action objects:

  1. {"action": "disable_canary", "target": "payments-api", "status": "...", "details": "...", "evidence": "...", "rollback_token": "...", "correlation_id": "..."}

    • evidence must reference app.log and include the primary trace ID

    • correlation_id must be extracted from the relevant log entries

  2. {"action": "create_alert_suppression", "target": "payments-api", "status": "...", "details": "..."}

  3. {"action": "create_followup_ticket", "target": "payments-api", "status": "...", "details": "...", "priority": "...", "assigned_team": "..."}

    • priority and assigned_team must be determined per the runbook

  4. {"action": "notify_stakeholders", "target": "payments-api", "status": "...", "details": "...", "channel": "...", "escalation_level": "..."}

    • channel and escalation_level must be determined per the runbook

  5. {"action": "update_deployment_status", "target": "payments-api", "status": "...", "details": "...", "previous_status": "...", "new_status": "...", "canary_version": "..."}

    • previous_status: the status from the deployment manifest before rollback

    • new_status: "rolled_back"

    • canary_version: the version from the deployment manifest
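
The shape of actions.json described above can be checked before submission. The following is a minimal sketch, not the benchmark's verifier: the function name `validate_actions` is hypothetical, and it assumes actions are matched by name, since the task statement does not say whether the verifier enforces their order.

```python
# Fields every action object carries, per the task statement.
COMMON_KEYS = {"action", "target", "status", "details"}

# Extra fields required for each of the five actions.
EXTRA_KEYS = {
    "disable_canary": {"evidence", "rollback_token", "correlation_id"},
    "create_alert_suppression": set(),
    "create_followup_ticket": {"priority", "assigned_team"},
    "notify_stakeholders": {"channel", "escalation_level"},
    "update_deployment_status": {"previous_status", "new_status", "canary_version"},
}


def validate_actions(actions):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    if len(actions) != len(EXTRA_KEYS):
        problems.append(f"expected {len(EXTRA_KEYS)} actions, found {len(actions)}")

    # Assumption: look actions up by name rather than by position.
    by_name = {a.get("action"): a for a in actions}
    for name, extra in EXTRA_KEYS.items():
        action = by_name.get(name)
        if action is None:
            problems.append(f"missing action: {name}")
            continue
        missing = (COMMON_KEYS | extra) - set(action)
        if missing:
            problems.append(f"{name}: missing fields {sorted(missing)}")
        if action.get("target") != "payments-api":
            problems.append(f"{name}: target must be payments-api")

    # new_status is the only field with a fixed literal value in the spec.
    update = by_name.get("update_deployment_status", {})
    if update and update.get("new_status") != "rolled_back":
        problems.append("update_deployment_status: new_status must be rolled_back")
    return problems
```

This only checks structure. Whether evidence references app.log, or whether priority and channel match the runbook, still has to be verified against the artifacts themselves.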

/app/outputs/postcheck.json

JSON object with fields: error_rate_pct, p99_latency_ms, validation_passed, notes, affected_txn_count, impact_pct, incident_window, slo_budget, mttr_seconds, error_velocity

  • error_rate_pct and p99_latency_ms: values from after-metrics for payments-api

  • validation_passed: boolean based on runbook thresholds

  • notes: must contain "after.json", the literal metric names ("error_rate_pct", "p99_latency_ms"), and their threshold values ("5" and "400")

  • affected_txn_count: count of unique transactions from payments-api canary DB error log entries

  • impact_pct: (affected_txn_count / total_requests_from_before_metrics) * 100, rounded to 2 decimal places

  • incident_window: object with start and end fields containing the first and last payments-api canary DB error timestamps from logs

  • slo_budget: object with these exact fields:

    • monthly_budget_pct: 0.1 (representing 99.9% availability SLO)

    • consumed_pct: the error_rate_pct from before-metrics (during incident)

    • remaining_pct: monthly_budget_pct minus consumed_pct

  • mttr_seconds: Mean Time To Recovery - seconds between first canary DB error and last canary DB error (parse timestamps and compute difference)

  • error_velocity: errors per minute during incident - affected_txn_count divided by (mttr_seconds / 60), rounded to 2 decimal places
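
The derived fields above reduce to a few lines of arithmetic. Here is a minimal sketch, assuming the log timestamps are ISO 8601 strings (the task statement does not show the log format); the function and parameter names are hypothetical.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    # Assumption: ISO 8601 timestamps. A trailing "Z" is rewritten so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def derive_postcheck_fields(first_error_ts, last_error_ts, affected_txn_count,
                            total_requests, consumed_pct, monthly_budget_pct=0.1):
    """Compute the derived postcheck.json fields from already-extracted inputs."""
    mttr_seconds = (parse_ts(last_error_ts) - parse_ts(first_error_ts)).total_seconds()
    if mttr_seconds <= 0:
        # The task statement does not cover a zero-length window;
        # raising here is an assumption of this sketch.
        raise ValueError("incident window must have positive duration")

    return {
        "affected_txn_count": affected_txn_count,
        "impact_pct": round(affected_txn_count / total_requests * 100, 2),
        "incident_window": {"start": first_error_ts, "end": last_error_ts},
        "slo_budget": {
            "monthly_budget_pct": monthly_budget_pct,
            "consumed_pct": consumed_pct,
            # Rounding to 2 places follows the rule stated for the summary;
            # applying it in the JSON as well is an assumption.
            "remaining_pct": round(monthly_budget_pct - consumed_pct, 2),
        },
        "mttr_seconds": mttr_seconds,
        "error_velocity": round(affected_txn_count / (mttr_seconds / 60), 2),
    }
```

For illustration, with made-up inputs of 12 affected transactions out of 4800 requests over a 5-minute window, this gives an impact_pct of 0.25 and an error_velocity of 2.4 errors per minute. Note that remaining_pct goes negative whenever consumed_pct exceeds the 0.1 budget.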

/app/config/alert_suppression.json

JSON object with fields:

  • service: "payments-api"

  • suppression_window_minutes: 30

  • reason: description of the incident

  • created_by: "incident-commander"

  • expires_at: ISO 8601 timestamp, computed as incident start time (first canary error) + 30 minutes
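
The expires_at computation can be sketched as follows. This assumes the incident start is available as an ISO 8601 string; the helper name `build_alert_suppression` is hypothetical, and the exact timestamp format the verifier expects (for example "Z" versus "+00:00") is not stated in the task.

```python
from datetime import datetime, timedelta


def build_alert_suppression(incident_start: str, reason: str) -> dict:
    """Build the alert_suppression.json object from the incident start time."""
    window_minutes = 30  # fixed by the task statement
    # Rewrite a trailing "Z" so fromisoformat accepts it on Python < 3.11.
    start = datetime.fromisoformat(incident_start.replace("Z", "+00:00"))
    expires_at = start + timedelta(minutes=window_minutes)
    return {
        "service": "payments-api",
        "suppression_window_minutes": window_minutes,
        "reason": reason,
        "created_by": "incident-commander",
        # isoformat() keeps the input's UTC offset, rendered as "+00:00".
        "expires_at": expires_at.isoformat(),
    }
```

The expiry is anchored to the first canary error, not to the time the file is written, so the result is deterministic for a given log.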

/app/data/deployments/canary_manifest.json

Update the deployment manifest:

  • Find the payments-api deployment entry

  • Change its status field from "active" to "rolled_back"

  • Preserve all other fields and entries exactly
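
A targeted update like this is easiest to get right by changing one field in place and leaving everything else untouched. The sketch below assumes a hypothetical manifest layout, a top-level "deployments" list whose entries have "service" and "status" keys. The real schema is not shown in the task statement and must be read from the file.

```python
import copy


def roll_back_deployment(manifest: dict, service: str = "payments-api") -> dict:
    """Return a copy of the manifest with one service's status rolled back."""
    updated = copy.deepcopy(manifest)  # leave the caller's object unmodified
    # Assumption: entries live under "deployments" and carry "service"/"status".
    for entry in updated["deployments"]:
        if entry.get("service") == service and entry.get("status") == "active":
            entry["status"] = "rolled_back"  # the only field that changes
    return updated
```

Because Python dictionaries preserve insertion order, loading the file with `json.load`, applying this function, and writing it back with `json.dump` keeps the key order of every entry. Matching the original indentation is a separate concern.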

Methodology

METRIC
Pass@5, evaluated through the Harbor evaluation harness.

TIMEOUT
Each task has a specific timeout limit, with an absolute maximum of 30 minutes for both agent and verifier.

ENVIRONMENT
Fully sandboxed execution. Each sample is accompanied by all data and dependencies required, a test suite, a rubric for human and LLM evaluators, and a golden solution.

DIFFICULTY TIERS
Four levels, evenly distributed across 100 tasks. Calibrated to provide signal from efficiency-optimized to frontier-level models.
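
The page does not say how Pass@5 is computed. As an illustration only, here is the commonly used unbiased pass@k estimator, which gives the probability that at least one of k attempts drawn from n recorded attempts passes when c of them passed. Whether the Harbor harness uses this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts, of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n attempts contains no passing attempt.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

When exactly five attempts are run per task (n = k = 5), this reduces to 1.0 if any attempt passed and 0.0 otherwise. A benchmark score would then be the mean over tasks.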

Behind the benchmark

The current version of the benchmark spans a wide range of task categories: typical software engineering tasks, advanced ML and data analytics, and build and dependency management. It tests agents on long-horizon planning, task tracking, evaluating and executing their own solutions, and recovering from errors and incorrect earlier steps.

Our benchmark is built to challenge even the most advanced frontier models. Tasks are constructed with experts in the loop, who confirm that every challenge is solvable in the environment in which it runs and verify the reliability of all dependencies. We have calibrated the tasks to span a range of difficulties, providing meaningful feedback for agents and models across the cost/performance spectrum, from those pursuing Pareto-optimal results to those delivering frontier-level capabilities.
