
SNORKEL DATA SERIES // Terminal-Style Coding Tasks for Agents

Frontier datasets for terminal-based agentic coding

Built for teams training agents in terminal and repository environments, this Snorkel Data Series provides the high-volume, expert-authored task data needed to move from simple code generation to autonomous software engineering.

Two dataset tracks: SWE-Bench-CLI+ and Terminal-Bench+

Designed for leading AI labs, our coding tracks are curriculum-structured to progressively increase difficulty, paired with dockerized evaluation infrastructure designed to match production engineering environments.
  • SWE-Bench CLI+
    Terminal-based, repo-grounded SWE tasks inside real repositories, spanning 7+ languages.

    Agents must navigate real codebases, manage cross-file dependencies, and execute fixes via the CLI across multiple languages.

  • Terminal-Bench+
    Multi-step terminal tasks with milestones, tools, and larger environments.

    Optimized for long-horizon planning, tool use, and system-state manipulation under realistic constraints.
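
To make the idea of milestones in a multi-step terminal task concrete, here is a minimal sketch of how partial progress might be measured against an ordered list of checks on system state. Everything in it is hypothetical: the `Milestone` class, the `milestone_progress` function, the dictionary-based state snapshot, and the example file paths are assumptions for illustration and do not describe the actual Terminal-Bench+ task format or scoring.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state snapshot: maps a file path to that file's contents.
State = Dict[str, str]


@dataclass
class Milestone:
    name: str
    check: Callable[[State], bool]  # returns True once the milestone is reached


def milestone_progress(milestones: List[Milestone], state: State) -> float:
    """Fraction of milestones reached in order, stopping at the first failure."""
    if not milestones:
        return 0.0
    reached = 0
    for milestone in milestones:
        if not milestone.check(state):
            break
        reached += 1
    return reached / len(milestones)


# Illustrative task: write a config file, set a port in it, then start a service.
milestones = [
    Milestone("config exists", lambda s: "/etc/app.conf" in s),
    Milestone("port configured", lambda s: "port=8080" in s.get("/etc/app.conf", "")),
    Milestone("service started", lambda s: "started" in s.get("/var/log/app.log", "")),
]

# Snapshot after the agent wrote the config but never started the service.
snapshot = {"/etc/app.conf": "port=8080\n"}
progress = milestone_progress(milestones, snapshot)
```

Checking milestones in order, as assumed here, gives credit only for progress that builds on earlier steps, which is one way to reflect long-horizon planning in a score.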

This Data Series is intentionally calibrated to stress state-of-the-art coding agents.

Built for Frontier model evaluation.

  • Tiered difficulty from Core to Frontier
  • Calibrated to remain challenging for models that have memorized public software benchmarks
  • Designed for SFT/RL training, benchmarking, and deployment validation

If your agent succeeds here, it performs in production.

Why the Snorkel Data Series

High-volume quarterly drops
Multi-layer quality pipeline
Unified execution environment
Direct roadmap influence

Expert-led validation

Every task is built and validated through a multi-layer quality pipeline.

01 Human review
SMEs verify clarity, correctness, and full solvability.

02 LLM-assisted validation
Automated checks flag instruction-test mismatches and missing constraints.

03 Deterministic testing
Code-based unit tests validate compliance, syntax, formatting, and outcomes.

04 Guardrails
Additional checks catch cheating paths, non-determinism, and reward hacking.
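
As a rough illustration of how a deterministic test and a non-determinism guardrail could be combined, the sketch below runs a test command several times and accepts it only if every run exits successfully with identical output. The function names (`run_check`, `is_deterministic_pass`), the repeat count, and the example commands are assumptions made for this example; they are not taken from the Snorkel pipeline.

```python
import subprocess
import sys


def run_check(command, timeout=60):
    """Run one test command and return (exit_code, stdout)."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode, result.stdout


def is_deterministic_pass(command, repeats=3):
    """Accept only if all runs exit with code 0 and produce identical output."""
    outcomes = {run_check(command) for _ in range(repeats)}
    # A single unique outcome means the runs agreed with each other;
    # exit code 0 means that agreed outcome was a pass.
    return len(outcomes) == 1 and next(iter(outcomes))[0] == 0


if __name__ == "__main__":
    # Hypothetical stand-ins for a task's test command.
    stable = [sys.executable, "-c", "print('ok')"]
    flaky = [sys.executable, "-c", "import random; print(random.random())"]
    print("stable check accepted:", is_deterministic_pass(stable))
    print("flaky check accepted:", is_deterministic_pass(flaky))
```

Comparing whole outcomes across repeated runs is a deliberately simple choice: it flags tests whose output varies between runs, but it would not catch non-determinism that only appears under different machines or timing conditions.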

Accelerate agent performance using verifiable, multi-step CLI environments with the Snorkel Data Series.