Agentic Coding
SlopCodeBench (SCBench) is a benchmark designed to evaluate coding agents the way real software actually gets built: through repeated requirement changes and extensions. Instead of treating the spec as a one-shot oracle, each task is a sequence of checkpoints where an agent implements an initial version, then extends its own solution multiple times as new requirements arrive.
The v1.0 release includes 36 problems with 196 total checkpoints, evaluated in a black-box setting where only a CLI or API contract is given. No architecture, function signatures, or module boundaries are prescribed, so early design decisions can meaningfully help or hurt later work.
Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.5 | 58 |
| 2 | Claude Sonnet 4.5 | 57.6 |
| 3 | Gemini 3 Pro Preview | 51.6 |
| 4 | gpt-5.2 | 49.4 |
| 5 | gpt-5 | 45.2 |
| 6 | Kimi-K2-Thinking | 36.8 |
| 7 | Devstral 2 | 33.2 |
| 8 | Grok 4.1 Fast | 25.2 |
| 9 | Qwen 3 Coder 480B | 18.8 |
| 10 | Mistral Large 3 | 13.8 |
Sample Task
Incident Commander: Payments Canary Rollback
You are the on-call SRE responding to a spike in 5xx errors after a canary rollout of payments-api. All observability artifacts, runbooks, and config files are already packaged inside the container. No network access is allowed.
Requirements
- Diagnose the incident using logs, metrics, and traces to identify the root cause and blast radius.
- Execute mitigation per the runbook (`/app/runbooks/payments-canary-rollback.md`):
  - Update `/app/config/service_state.json` as specified in the runbook
  - Only modify `payments-api` - do not change any other services
  - Only change fields specified in the runbook - preserve all other fields exactly
  - Create any required configuration files specified in the runbook
- Produce output files documenting your diagnosis, actions, and validation.
Inputs
- `/app/data/logs/app.log`: application logs
- `/app/data/metrics/before.json`: pre-incident metrics
- `/app/data/metrics/after.json`: post-mitigation metrics
- `/app/data/traces/trace_sample.json`: distributed traces
- `/app/data/deployments/`: deployment manifests
- `/app/runbooks/payments-canary-rollback.md`: mitigation runbook
- `/app/config/service_state.json`: service configuration
Outputs
All outputs must be created exactly as specified below.
/app/outputs/incident_summary.md
Markdown document with these sections: Overview, Root Cause, Blast Radius, Evidence, Actions Taken, Validation, Next Steps.
- Cite full artifact paths used in analysis
- Include pod, DB host, and failure mode in root cause
- Include the primary trace ID (the first payments-api canary error trace with DB connection issues, by timestamp), rollback_token, and correlation ID
- List affected transaction IDs and count
- Include the incident time window (start and end timestamps of payments-api canary DB errors)
- Confirm unaffected services explicitly - check all other services in service_state.json (excluding payments-api) and confirm they are unaffected
- Show before/after metrics with deltas
- Include an "SLO Budget" section containing the words "SLO" and "budget", plus the calculated remaining budget value (monthly_budget_pct minus consumed_pct, rounded to 2 decimal places)
- Use "disable" or "disabled" when describing the canary action
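The structural requirements above lend themselves to a quick self-check before submitting. The following is a minimal sketch, assuming each required section appears as a markdown `#` heading; the function name `summary_gaps` and the heading-based matching are illustrative assumptions, not the benchmark's actual grader.

```python
# Section names come from the task text; the heading-based check is an assumption.
REQUIRED_SECTIONS = [
    "Overview", "Root Cause", "Blast Radius", "Evidence",
    "Actions Taken", "Validation", "Next Steps", "SLO Budget",
]


def summary_gaps(markdown_text: str) -> list:
    """Return a list of missing sections or required words (empty if none)."""
    # Collect heading titles, ignoring the number of leading '#' characters.
    headings = {
        line.lstrip("#").strip().lower()
        for line in markdown_text.splitlines()
        if line.startswith("#")
    }
    gaps = [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
    lowered = markdown_text.lower()
    # "disabled" contains "disable", so one substring test covers both forms.
    if "disable" not in lowered:
        gaps.append('wording: "disable" or "disabled"')
    if "budget" not in lowered:
        gaps.append('wording: "budget"')
    return gaps
```

A check like this only covers structure and wording; the cited paths, IDs, and metric values still have to be verified against the artifacts themselves.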
/app/outputs/actions.json
JSON array with exactly FIVE action objects:
- `{"action": "disable_canary", "target": "payments-api", "status": "...", "details": "...", "evidence": "...", "rollback_token": "...", "correlation_id": "..."}`
  - `evidence` must reference `app.log` and include the primary trace ID
  - `correlation_id` must be extracted from the relevant log entries
- `{"action": "create_alert_suppression", "target": "payments-api", "status": "...", "details": "..."}`
- `{"action": "create_followup_ticket", "target": "payments-api", "status": "...", "details": "...", "priority": "...", "assigned_team": "..."}`
  - `priority` and `assigned_team` must be determined per the runbook
- `{"action": "notify_stakeholders", "target": "payments-api", "status": "...", "details": "...", "channel": "...", "escalation_level": "..."}`
  - `channel` and `escalation_level` must be determined per the runbook
- `{"action": "update_deployment_status", "target": "payments-api", "status": "...", "details": "...", "previous_status": "...", "new_status": "...", "canary_version": "..."}`
  - `previous_status`: the status from the deployment manifest before rollback
  - `new_status`: "rolled_back"
  - `canary_version`: the version from the deployment manifest
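The shape of the five action objects above can be checked mechanically. The sketch below uses only the key names listed in the task; the helper name `check_actions` and the exact checks are illustrative assumptions, and the real grader may check more (for example, the content of `evidence`).

```python
# Keys every action object carries, per the templates in the task text.
COMMON_KEYS = {"action", "target", "status", "details"}

# Extra keys required per action type, per the templates in the task text.
EXTRA_KEYS = {
    "disable_canary": {"evidence", "rollback_token", "correlation_id"},
    "create_alert_suppression": set(),
    "create_followup_ticket": {"priority", "assigned_team"},
    "notify_stakeholders": {"channel", "escalation_level"},
    "update_deployment_status": {"previous_status", "new_status", "canary_version"},
}


def check_actions(actions) -> list:
    """Return human-readable problems found in a parsed actions.json value."""
    if not isinstance(actions, list) or len(actions) != 5:
        return ["expected a JSON array with exactly five action objects"]
    problems = []
    for name, extra in EXTRA_KEYS.items():
        matches = [a for a in actions if isinstance(a, dict) and a.get("action") == name]
        if len(matches) != 1:
            problems.append(f"{name}: expected exactly one entry, found {len(matches)}")
            continue
        entry = matches[0]
        missing = sorted((COMMON_KEYS | extra).difference(entry))
        if missing:
            problems.append(f"{name}: missing keys {missing}")
        if entry.get("target") != "payments-api":
            problems.append(f"{name}: target must be payments-api")
    status_update = [a for a in actions if isinstance(a, dict)
                     and a.get("action") == "update_deployment_status"]
    if status_update and status_update[0].get("new_status") != "rolled_back":
        problems.append('update_deployment_status: new_status must be "rolled_back"')
    return problems
```

Running it on `json.load(open("/app/outputs/actions.json"))` before finishing gives an early warning about missing keys, though it says nothing about whether the values match the runbook.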
/app/outputs/postcheck.json
JSON object with fields: error_rate_pct, p99_latency_ms, validation_passed, notes, affected_txn_count, impact_pct, incident_window, slo_budget, mttr_seconds, error_velocity
- `error_rate_pct` and `p99_latency_ms`: values from after-metrics for payments-api
- `validation_passed`: boolean based on runbook thresholds
- `notes`: must contain "after.json", the literal metric names ("error_rate_pct", "p99_latency_ms"), and their threshold values ("5" and "400")
- `affected_txn_count`: count of unique transactions from payments-api canary DB error log entries
- `impact_pct`: (affected_txn_count / total_requests_from_before_metrics) * 100, rounded to 2 decimal places
- `incident_window`: object with `start` and `end` fields containing the first and last payments-api canary DB error timestamps from logs
- `slo_budget`: object with these exact fields:
  - `monthly_budget_pct`: 0.1 (representing 99.9% availability SLO)
  - `consumed_pct`: the error_rate_pct from before-metrics (during incident)
  - `remaining_pct`: monthly_budget_pct minus consumed_pct
- `mttr_seconds`: Mean Time To Recovery - seconds between first canary DB error and last canary DB error (parse timestamps and compute difference)
- `error_velocity`: errors per minute during incident - affected_txn_count divided by (mttr_seconds / 60), rounded to 2 decimal places
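The derived fields above are plain arithmetic once the relevant log entries have been extracted. The following is a minimal sketch, assuming the canary DB errors are already available as `(timestamp, txn_id)` pairs with ISO 8601 timestamps; that input shape, the function names, the zero-duration fallback, and the 2-decimal rounding of `remaining_pct` (borrowed from the summary requirement) are all assumptions, not part of the task text.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is read as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def derive_postcheck_fields(error_events, total_requests, before_error_rate_pct):
    """Compute the derived postcheck.json fields.

    error_events: list of (timestamp, txn_id) pairs for payments-api canary
    DB errors (hypothetical shape; the real log format must be parsed first).
    """
    timestamps = [ts for ts, _ in error_events]
    start = min(timestamps, key=parse_ts)
    end = max(timestamps, key=parse_ts)

    # Seconds between the first and last canary DB error.
    mttr_seconds = int((parse_ts(end) - parse_ts(start)).total_seconds())

    # Unique transactions, since one transaction may log several errors.
    affected_txn_count = len({txn for _, txn in error_events})
    impact_pct = round(affected_txn_count / total_requests * 100, 2)

    # Assumption: report 0.0 when the window has zero length to avoid dividing by zero.
    if mttr_seconds > 0:
        error_velocity = round(affected_txn_count / (mttr_seconds / 60), 2)
    else:
        error_velocity = 0.0

    monthly_budget_pct = 0.1  # 99.9% availability SLO, per the task text
    return {
        "affected_txn_count": affected_txn_count,
        "impact_pct": impact_pct,
        "incident_window": {"start": start, "end": end},
        "slo_budget": {
            "monthly_budget_pct": monthly_budget_pct,
            "consumed_pct": before_error_rate_pct,
            # Assumption: rounded to 2 places, as the summary section requires.
            "remaining_pct": round(monthly_budget_pct - before_error_rate_pct, 2),
        },
        "mttr_seconds": mttr_seconds,
        "error_velocity": error_velocity,
    }
```

Note that `remaining_pct` goes negative whenever the incident error rate exceeds the 0.1% monthly budget, which is expected for any incident with a visible error spike.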
/app/config/alert_suppression.json
JSON object with fields:
- `service`: "payments-api"
- `suppression_window_minutes`: 30
- `reason`: description of the incident
- `created_by`: "incident-commander"
- `expires_at`: ISO 8601 timestamp, computed as incident start time (first canary error) + 30 minutes
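The `expires_at` value is the only computed field in this file. A minimal sketch of the calculation, assuming the incident start is an ISO 8601 string taken from the first canary error log entry; the function names are illustrative, and the output keeps an explicit UTC offset (`+00:00`) rather than a `Z` suffix, which may or may not match what a grader expects.

```python
from datetime import datetime, timedelta


def suppression_expiry(incident_start: str, window_minutes: int = 30) -> str:
    """Return incident start plus the suppression window as ISO 8601."""
    start = datetime.fromisoformat(incident_start.replace("Z", "+00:00"))
    return (start + timedelta(minutes=window_minutes)).isoformat()


def build_alert_suppression(incident_start: str, reason: str) -> dict:
    """Assemble the alert_suppression.json object with the fields listed above."""
    return {
        "service": "payments-api",
        "suppression_window_minutes": 30,
        "reason": reason,
        "created_by": "incident-commander",
        "expires_at": suppression_expiry(incident_start, 30),
    }
```

Using `timedelta` rather than string manipulation keeps the result correct when the window crosses an hour or day boundary.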
/app/data/deployments/canary_manifest.json
Update the deployment manifest:
- Find the payments-api deployment entry
- Change its `status` field from "active" to "rolled_back"
- Preserve all other fields and entries exactly
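The manifest update is a targeted single-field edit. The sketch below assumes the manifest is a JSON object holding a `deployments` list whose entries have `service` and `status` keys; that schema is hypothetical, since the task text does not show the file, so the key names would need adjusting to the real manifest.

```python
import copy


def mark_rolled_back(manifest: dict, service: str = "payments-api") -> dict:
    """Return a copy of the manifest with the service's status set to rolled_back.

    Assumes a hypothetical schema: {"deployments": [{"service": ..., "status": ...}]}.
    """
    updated = copy.deepcopy(manifest)  # leave the caller's object untouched
    for entry in updated.get("deployments", []):
        # Only touch the matching entry, and only its status field.
        if entry.get("service") == service and entry.get("status") == "active":
            entry["status"] = "rolled_back"
    return updated
```

Python dictionaries keep insertion order, so loading with `json.load` and writing back with `json.dump` preserves key order; whitespace and indentation may still differ from the original file.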
Methodology
Behind the benchmark
The current version of the benchmark spans a wide range of task categories: typical software engineering tasks, advanced ML and data analytics, and build and dependency management. It tests agents on long-horizon planning, task tracking, evaluating and executing their own solutions, and recovering from errors and incorrect earlier steps.
Our benchmark is built to challenge even the most advanced frontier models. Tasks are constructed with experts in the loop, who confirm that every challenge is solvable in the environment in which it runs and verify the reliability of all dependencies. We have calibrated the tasks to span a range of difficulties, so they provide meaningful feedback for agents and models across the cost/performance spectrum, from those pursuing Pareto-optimal results to those delivering truly frontier-level capabilities.

