Released April 24, 2026
Open Benchmarks Grants

SlopCode Bench

A benchmark measuring code quality degradation in AI-assisted codebases. Tracks checkpoint solve rates, erosion (code bloat ratio), and verbosity to evaluate whether models produce correct and clean code under realistic conditions.
Overview

The Snorkel Agentic Coding benchmark comprises 100 multi-step coding tasks, evenly distributed across four difficulty tiers, designed to evaluate models across a diverse range of capabilities germane to real-world software engineering work.

Taking insights from our contributions to the Terminal-Bench project, our Agentic Coding tasks evaluate agents in fully sandboxed execution environments. Each task is paired with a human-validated reference solution, comprehensive unit tests, and scoring rubrics that assess both final outputs and the agent's trajectory.

Leaderboard

Showing best version by % Checkpoints.
| Model | Harness | Version | Strict Solve % | Iso Solve % | Core Solve % | $/CKPT | Erosion | Verbosity | % AST-Grep | % Cloned |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT 5.5 (High) | Codex | 0.124.0 | 14.29 | 28.06 | 65.31 | $1.51 | 0.494 | 0.269 | 0.249 | 0.047 |
| GPT 5.3-Codex (High) | Codex | 0.98.0 | 11.22 | 26.02 | 59.18 | $0.69 | 0.644 | 0.336 | 0.314 | 0.069 |
| GPT 5.4 (High) | Codex | 0.110.0 | 10.71 | 23.47 | 61.22 | $0.82 | 0.508 | 0.273 | 0.240 | 0.058 |
| GPT 5.2-Codex (High) | Codex | 0.93.0 | 9.69 | 21.94 | 54.59 | $0.85 | 0.728 | 0.398 | 0.364 | 0.097 |
| Opus 4.6 (High) | Claude Code | 2.1.32 | 9.69 | 20.92 | 65.31 | $3.17 | 0.737 | 0.318 | 0.288 | 0.103 |
| Opus 4.7 (High) | Claude Code | 2.1.111 | 8.16 | 20.92 | 64.29 | $2.17 | 0.759 | 0.357 | 0.327 | 0.084 |
| Kimi K2.6 (High) | Kimi CLI | 1.37.0 | 10.71 | 18.88 | 51.02 | $0.74 | 0.764 | 0.399 | 0.359 | 0.129 |
| Opus 4.5 (High) | Claude Code | 2.0.51 | 9.18 | 17.35 | 56.12 | $2.53 | 0.691 | 0.297 | 0.294 | 0.091 |
| Sonnet 4.6 (High) | Claude Code | 2.1.44 | 7.14 | 16.84 | 56.12 | $1.96 | 0.741 | 0.316 | 0.298 | 0.093 |
| Composer 2 | Cursor CLI | 2026.04.13-a9d7fb5 | 6.12 | 16.33 | 51.53 | $0.44 | 0.716 | 0.353 | 0.318 | 0.107 |
| GLM 5.1 (High) | Claude Code | 2.1.44 | 9.69 | 13.78 | 38.78 | $1.47 | 0.684 | 0.322 | 0.301 | 0.096 |
| GPT 5.4-Mini (High) | Codex | 0.110.0 | 5.10 | 13.78 | 51.02 | $0.45 | 0.655 | 0.330 | 0.305 | 0.076 |
| Kimi K2.5 (High) | Kimi CLI | 1.37.0 | 4.59 | 9.69 | 39.80 | $0.33 | 0.712 | 0.309 | 0.306 | 0.094 |
| Kimi K2.5 | OpenCode | 1.4.3 | 4.59 | 8.67 | 31.12 | $0.53 | 0.702 | 0.319 | 0.297 | 0.117 |
| GLM 5.1 (High) | OpenCode | 1.4.3 | 5.61 | 8.16 | 20.41 | $0.59 | 0.550 | 0.387 | 0.329 | 0.145 |
| GPT 5.3-Codex-Spark (High) | Codex | 0.100.0 | 3.06 | 8.16 | 29.08 | $0.20 | 0.586 | 0.357 | 0.340 | 0.086 |
| Kimi K2.5 (High) | Claude Code | 2.1.44 | 3.57 | 7.14 | 28.06 | $1.07 | 0.692 | 0.310 | 0.301 | 0.097 |
| MiniMax M2.7 (High) | Claude Code | 2.1.44 | 2.55 | 4.08 | 28.57 | $0.33 | 0.500 | 0.265 | 0.227 | 0.108 |
| MiniMax M2.7 | OpenCode | 1.4.3 | 1.53 | 3.57 | 20.92 | $0.27 | 0.746 | 0.418 | 0.379 | 0.146 |

Performance scatter

Why iterative evaluation

Aider and SWE-Bench evaluate an agent’s ability to solve an issue given a frozen repository. This is an important capability, but it captures a single point in time. An agent could produce a fix that is entirely viable yet utterly different from the ground truth, and that difference would fundamentally change how a developer approaches the inevitable extension. Measuring qualitative metrics at a single snapshot therefore yields a noisy signal, one that is scaffolded by prior human decisions. Nor are agents evaluated on long-horizon coding tasks, where they must either live with or redesign their original choices. Treating agentic benchmarks as iterative processes is the only way to evaluate software engineering as it is actually practiced.

We must adopt this framing both now and for the future of agentic coding. Much of the recent discourse on agentic coding tools has focused on the “slop” they generate (verbose comments, defensive coding, bloat). While “slop” is ill-defined, these grievances point squarely at the limitations of single-iteration benchmarks. Code riddled with these issues is hard to understand and maintain. The same holds for structural problems in model-generated code: a minor modification often means rewriting the entire codebase because that is easier than extending agent-written code. Iterative benchmarks like SCBench are crucial for truly autonomous SWE agents. Without them, we would have no way to measure an agent’s ability to work autonomously given only specification updates, because no one can know every required feature or extension from the outset.

Design principles

None of this would be possible without deliberate design choices in benchmark construction:
No prescribed interfaces
All that is provided is the external contract of either the CLI interface or the API endpoints and response formats. Agents select the underlying architecture and the approach to solving the problem. Providing a function signature or other internal hints would mask the signal we want to measure.
No explicit test cases or test suite
The model only sees the examples in the spec and the explanation of the behaviors. Part of eroding code quality is the inability to think of obvious edge cases for a spec. Thus, we require the agent to identify and handle the specified edge cases.
Black-box, language agnostic evaluation
Solutions are judged purely on the outputs they produce, given an input. Each problem includes normalization code to ensure that minor arbitrary decisions, such as white-space formatting, do not affect the solution’s correctness.
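
As a rough illustration of black-box scoring with normalization, the sketch below compares a solution's output with the expected output after smoothing over arbitrary formatting choices. The function names and the specific normalization rules are assumptions made for this example; the benchmark's actual per-problem normalization code is not shown here.

```python
import json


def normalize_text(output: str) -> str:
    """Hypothetical normalizer: trim the output as a whole, then trailing whitespace per line."""
    lines = [line.rstrip() for line in output.strip().splitlines()]
    return "\n".join(lines)


def normalize_json(output: str) -> str:
    """Hypothetical normalizer for JSON output: ignore key order and spacing."""
    return json.dumps(json.loads(output), sort_keys=True, separators=(",", ":"))


def outputs_match(actual: str, expected: str, normalizer=normalize_text) -> bool:
    """Black-box check: only normalized outputs are compared, never the source code."""
    return normalizer(actual) == normalizer(expected)


# Trailing spaces and extra blank lines at the end do not change the verdict.
print(outputs_match("a  \nb\n\n", "a\nb"))
# Key order and spacing in JSON output do not change the verdict either.
print(outputs_match('{"b": 1, "a": 2}', '{"a":2,"b":1}', normalize_json))
```

Because the check sees only program output, the same comparison works regardless of the language or architecture the agent chose.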

Problem Catalog · All 36 v1.0 Problems

developer tools: 8
web: 7
data processing: 6
cli tools: 5
configuration management: 2
dsl: 2
algorithms: 2
simulation: 1
databases: 1
networking: 1
file systems: 1

easy: 12
medium: 12
hard: 12

cfgpipe · configuration-management · easy · 6
CLI configuration resolver that reads a JSON schema, resolves typed parameters from prioritized sources (default, env, file, primary/secondary stores, args), supports nested groups, watch mode with structured change events, advanced types (duration, pattern, map, list, redacted), and store prefix composition.
By Gabriel Orlanski

code-search · developer-tools · easy · 5
Multi-language code search tool (inspired by ast-grep) that finds patterns and applies refactorings. Starts with regex search in Python, adds AST-based pattern matching with metavariables, then auto-fix with conflict resolution. Supports Python, JS, C++, Rust, Java, Go, Haskell.
By Gabriel Orlanski

circuit-eval · simulation · medium · 8
CLI tool for digital circuit evaluation and optimization. Parses scalar and vector circuits in .circ, .json, and .bench formats. Evaluates with 2-valued and 3-valued logic, generates truth tables, checks equivalence, and optimizes circuits with configurable passes.
By Gabriel Orlanski

database-migration · databases · medium · 5
SQLite migration CLI. Starts with basic DDL (create table, add/drop columns), adds data transformations and backfills, then foreign keys/indexes/check constraints with rollback support, and finally dependency management with topological sorting and cycle detection.
By Albert Ge

dag-execution · dsl · hard · 3
Workflow orchestration system with a custom DSL for defining DAGs of tasks with dependencies and parameters. Includes a parser, execution engine, and JSONL logging. Adds caching with content-hashing and time-based strategies, then dynamic cache overrides per-task.
By Gabriel Orlanski

dynamic-buffer · data-processing · hard · 4
Code generator that infers data transformations from input/output examples and emits working code in Python, JS, C++, or Rust. Handles filtering, column ops, stateful transforms (prefix sums, sliding windows), and window functions. Generated code streams data with fixed buffers.
By Gabriel Orlanski

Methodology

CKPT Solved
The checkpoint and all prior checkpoints are solved.
Isolated Solved
Passes the tests for this checkpoint alone; earlier checkpoints are not considered.
Core Solved
Passes the core tests for a checkpoint.
$ / CKPT
Average USD cost per checkpoint
Erosion
Fraction of total complexity mass in high-complexity functions (CC > 10), where mass(f) = CC(f) × √SLOC(f). 0 = no high-complexity functions, 1 = all mass in high-CC functions.
Verbosity
Union of AST-Grep flagged lines and clone lines divided by LOC. Bounded [0, 1].
% AST-Grep
Percentage of lines flagged by AST-Grep rules for wasteful code patterns.
% Cloned
Percentage of lines that are structural duplicates (clone lines / LOC).
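
The definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: the function names, the list-of-booleans representation of checkpoint results, and the (CC, SLOC) tuples are all hypothetical inputs chosen for the example.

```python
import math


def strict_solved(passed_by_checkpoint: list[bool]) -> list[bool]:
    """CKPT Solved: a checkpoint counts only if it and every earlier checkpoint pass."""
    solved, all_prior_ok = [], True
    for passed in passed_by_checkpoint:
        all_prior_ok = all_prior_ok and passed
        solved.append(all_prior_ok)
    return solved


def erosion(functions: list[tuple[int, int]], cc_threshold: int = 10) -> float:
    """functions holds (CC, SLOC) pairs; mass(f) = CC(f) * sqrt(SLOC(f))."""
    masses = [(cc, cc * math.sqrt(sloc)) for cc, sloc in functions]
    total = sum(mass for _, mass in masses)
    if total == 0:
        return 0.0
    high = sum(mass for cc, mass in masses if cc > cc_threshold)
    return high / total


def verbosity(flagged_lines: set[int], clone_lines: set[int], loc: int) -> float:
    """Union of AST-Grep flagged lines and clone lines, divided by LOC."""
    if loc == 0:
        return 0.0
    return len(flagged_lines | clone_lines) / loc


# One failed checkpoint blocks every later one from counting as strictly solved.
print(strict_solved([True, False, True]))
# Masses are 12*4 = 48, 2*2 = 4, 3*3 = 9; only the first function has CC > 10.
print(erosion([(12, 16), (2, 4), (3, 9)]))
# Line 3 is both flagged and cloned, so the union has 4 lines out of 10.
print(verbosity({1, 2, 3}, {3, 4}, loc=10))
```

Taking the union in `verbosity` means a line that is both flagged and cloned is counted once, which is why Verbosity can be smaller than the sum of % AST-Grep and % Cloned.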

What the AST-Grep rules look for

The % AST-Grep metric scores generated code against 341 named slop patterns (205 unique rule types after deduplication) defined in configs/slop_rules.yaml. Each rule pairs an AST-Grep pattern with a human-readable diagnosis. Diagnosis text is quoted verbatim from the YAML. (The file has 14 additional work-in-progress entries we exclude from these counts.)

Production Patterns: 341
Unique Rules: 205
Warning: 331
Info: 7
Hint: 3
Language Scope: Python

chained-comparison-opportunity · warning
Use chained comparison (e.g., a < b < c) instead of 'and'

isinstance-return-ladder · warning
Long isinstance/elif ladder returning simple values; prefer a dispatch table or polymorphism.

json-dumps-then-loads · warning
json.loads(json.dumps(x)) is noisy; copy the structure directly

nested-if-no-else · warning
Nested if statements without else - consider flattening or combining conditions

manual-min-max · warning
Manual min/max logic - use built-in min() or max()

for-range-len · warning
range(len(seq)) loop suggests index juggling; prefer enumerate
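
For illustration, the snippet below pairs a flagged form with an idiomatic rewrite for three of the rules listed above. The function names are invented for this example, and the flagged forms are guesses at what each rule targets based on its diagnosis text; the actual patterns live in configs/slop_rules.yaml.

```python
# Each pair shows a hypothetical flagged form followed by an idiomatic rewrite.

def first_negative_flagged(values):
    # for-range-len: indexes into the sequence through range(len(...))
    for i in range(len(values)):
        if values[i] < 0:
            return i
    return -1


def first_negative(values):
    # enumerate yields the index and the item together
    for i, value in enumerate(values):
        if value < 0:
            return i
    return -1


def clamp_flagged(value, upper):
    # manual-min-max: hand-written comparison instead of the built-in
    if value > upper:
        return upper
    return value


def clamp(value, upper):
    return min(value, upper)


def can_publish_flagged(is_editor, reviewed):
    # nested-if-no-else: two stacked conditions with no else branch
    if is_editor:
        if reviewed:
            return True
    return False


def can_publish(is_editor, reviewed):
    # The two conditions combine into a single boolean expression
    return bool(is_editor and reviewed)
```

In each pair the behavior is the same; the rewrite only removes lines that the rules would count toward % AST-Grep.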
Sample rule:
chained-comparison-opportunity

$A < $B and $B < $C
$A > $B and $B > $C

AST-Grep matches the pattern above; the rule fires on each match and contributes to the % AST-Grep score.
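
To make the sample rule concrete, the sketch below approximates its two patterns with Python's standard `ast` module rather than AST-Grep itself. It is a stand-in under that assumption, not the benchmark's matcher: `find_chained_comparison_candidates` is a hypothetical helper, and it only handles the two-operand `and` case.

```python
import ast


def _is_simple_compare(node: ast.AST) -> bool:
    """True for a comparison with exactly one operator, such as `a < b`."""
    return isinstance(node, ast.Compare) and len(node.ops) == 1


def find_chained_comparison_candidates(source: str) -> list[int]:
    """Return line numbers of `A < B and B < C` or `A > B and B > C` expressions."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And)):
            continue
        if len(node.values) != 2:
            continue
        left, right = node.values
        if not (_is_simple_compare(left) and _is_simple_compare(right)):
            continue
        same_op = type(left.ops[0]) is type(right.ops[0])
        strict_op = isinstance(left.ops[0], (ast.Lt, ast.Gt))
        # The right-hand side of the first comparison must reappear as the
        # left-hand side of the second, mirroring the shared $B metavariable.
        shared_middle = ast.dump(left.comparators[0]) == ast.dump(right.left)
        if same_op and strict_op and shared_middle:
            hits.append(node.lineno)
    return sorted(hits)


sample = "\n".join([
    "if low < x and x < high:",   # flagged: shared middle operand x
    "    pass",
    "if low < x < high:",         # already chained, not flagged
    "    pass",
    "if a > b and b > c:",        # flagged: the '>' variant
    "    pass",
    "if a < b and c < d:",        # no shared operand, not flagged
    "    pass",
])
print(find_chained_comparison_candidates(sample))  # [1, 5]
```

Comparing `ast.dump` output plays the role of the repeated `$B` metavariable: both occurrences must be the same expression for the rule to fire.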

Acknowledgments

The benchmark is led by Gabriel Orlanski (University of Wisconsin–Madison) with support from DARPA, NPF and Snorkel AI through the Open Benchmarks Grants Program.
