Released May 26, 2026
Open Benchmarks Grants

Continual Learning Bench

Continual Learning Bench (CLB) evaluates whether AI systems genuinely improve from prior experience. Unlike static benchmarks that treat every task as independent, CLB measures performance across stateful task sequences, rewarding systems that accumulate and apply knowledge over time.
Built with Snorkel AI
Overview

Most benchmarks make a core assumption: models are stateless. Once they complete a task, they move on to the next as if the first never happened. In practice, deployed systems keep encountering new information and operate in sequential environments, where they should improve with experience.

Continual Learning Bench 1.0 is a benchmark of expert-validated task sequences across real-world domains (software engineering, data science, strategic modeling) where tasks are not independent, systems are expected to change during evaluation, and performance depends on what the system has seen before.

Leaderboard

Sorted by aggregate reward. Only systems with complete task coverage receive a rank.
Rank  System                                Avg Reward  Avg Gain  Avg Cost
1     ICL · Claude Sonnet 4.6               0.223       0.254     $30.43
2     ICL · GPT-5.4                         0.201       0.201     $18.39
3     Claude Code · Sonnet 4.6              0.19        0.239     $38.60
4     Mem0 · GPT-5.4                        0.151       0.202     $18.34
5     ICL · Claude Opus 4.7                 0.102       0.195     $49.62
6     ICL Notepad · GPT-5.4                 0.08        0.078     $14.28
7     ICL · Gemini 3 Flash                  0.08        0.164     $7.60
8     Codex · GPT-5.4                       0.066       0.146     $27.21
9     ACE · GPT-5.4                         0.046       0.086     $62.75
10    ICL Notepad · Claude Sonnet 4.6       0.035       0.182     $31.53
11    ICL Notepad · Gemini 3.1 Pro Preview  -0.002      0.094     $13.32
12    ICL · Gemini 3.1 Pro Preview          -0.056      0.062     $15.23

Aggregate metrics

Pareto frontier: the systems that are non-dominated on the selected metric pair.
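As a rough illustration of what "non-dominated" means, the sketch below computes a Pareto frontier over the leaderboard rows. It assumes the metric pair is Avg Reward (higher is better) against Avg Cost (lower is better); the page lets you choose the pair, so this is one possible reading. The names `Entry`, `dominates`, and `pareto_frontier` are illustrative and not part of the benchmark.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    system: str
    reward: float  # Avg Reward, higher is better
    cost: float    # Avg Cost in USD, lower is better


# Values copied from the leaderboard table.
LEADERBOARD = [
    Entry("ICL · Claude Sonnet 4.6", 0.223, 30.43),
    Entry("ICL · GPT-5.4", 0.201, 18.39),
    Entry("Claude Code · Sonnet 4.6", 0.19, 38.60),
    Entry("Mem0 · GPT-5.4", 0.151, 18.34),
    Entry("ICL · Claude Opus 4.7", 0.102, 49.62),
    Entry("ICL Notepad · GPT-5.4", 0.08, 14.28),
    Entry("ICL · Gemini 3 Flash", 0.08, 7.60),
    Entry("Codex · GPT-5.4", 0.066, 27.21),
    Entry("ACE · GPT-5.4", 0.046, 62.75),
    Entry("ICL Notepad · Claude Sonnet 4.6", 0.035, 31.53),
    Entry("ICL Notepad · Gemini 3.1 Pro Preview", -0.002, 13.32),
    Entry("ICL · Gemini 3.1 Pro Preview", -0.056, 15.23),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    at_least_as_good = a.reward >= b.reward and a.cost <= b.cost
    strictly_better = a.reward > b.reward or a.cost < b.cost
    return at_least_as_good and strictly_better


def pareto_frontier(entries: List[Entry]) -> List[Entry]:
    """Keep every entry that no other entry dominates."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


frontier = sorted(pareto_frontier(LEADERBOARD), key=lambda e: e.cost)
for e in frontier:
    print(f"{e.system}: reward={e.reward}, cost=${e.cost:.2f}")
```

Under this reward-versus-cost assumption, four systems remain on the frontier: ICL · Gemini 3 Flash, Mem0 · GPT-5.4, ICL · GPT-5.4, and ICL · Claude Sonnet 4.6. A different metric pair would generally give a different frontier.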

How it works

Each task is a sequence of instances. A continually-learning system carries state from one instance to the next; the stateless baseline resets between every instance. The difference between the two is the system's gain.
CONTINUAL LEARNING: S1 → S2 → S3 → S4
STATELESS BASELINE: S1 → RESET → S2 → RESET → S3 → RESET → S4
GAIN = reward (continual) − reward (stateless baseline)
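The comparison between a continual run and its stateless baseline can be sketched as a small evaluation loop. This is a toy model, not the benchmark harness: `ToySystem`, `run_sequence`, and the reward rule (reward rises by 0.25 for each instance the system remembers) are hypothetical and chosen only to make the gain visible.

```python
from typing import List


class ToySystem:
    """Hypothetical system whose reward grows with the number of instances it remembers."""

    def __init__(self) -> None:
        self.seen = 0

    def reset(self) -> None:
        # Wipe all carried state, as the stateless baseline does between instances.
        self.seen = 0

    def solve(self, instance: str) -> float:
        # Made-up reward rule: 0.25 with no memory, capped at 1.0.
        reward = min(1.0, 0.25 + 0.25 * self.seen)
        self.seen += 1
        return reward


def run_sequence(system: ToySystem, instances: List[str], stateless: bool) -> List[float]:
    """Run one task sequence; reset before every instance if stateless."""
    system.reset()
    rewards = []
    for instance in instances:
        if stateless:
            system.reset()
        rewards.append(system.solve(instance))
    return rewards


instances = ["S1", "S2", "S3", "S4"]
continual = run_sequence(ToySystem(), instances, stateless=False)
baseline = run_sequence(ToySystem(), instances, stateless=True)

# Per-instance gain: reward (continual) minus reward (stateless baseline).
gain = [c - b for c, b in zip(continual, baseline)]
print(continual, baseline, gain)
```

In this toy setup the baseline stays flat while the continual run improves, so the gain grows along the sequence. A system that carries state but does not benefit from it would show a gain near zero.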

Stateful system vs. stateless baseline

Each stateful system's reward curve is compared against its own stateless baseline: two lines on the same task, one with continual learning (stateful) and one without (stateless baseline). Values are per-instance reward, averaged across runs.

Per-task breakdown

System                                Mean Cum. Reward  Mean Cum. Gain  Cost            Runs
ICL · GPT-5.4                         46.198 ± 1.001    26.437 ± 1.001  $1.93 ± $0.05   5
Claude Code · Sonnet 4.6              44.282 ± 1.449    24.522 ± 1.449  $10.40 ± $2.27  5
ICL · Claude Sonnet 4.6               36.584 ± 1.262    16.825 ± 1.262  $3.60 ± $0.17   5
ICL Notepad · Claude Sonnet 4.6       35.993 ± 2.414    16.233 ± 2.414  $2.99 ± $0.27   5
Mem0 · GPT-5.4                        33.794 ± 2.986    14.033 ± 2.986  $1.39 ± $0.07   5
ICL · Claude Opus 4.7                 33.572 ± 3.082    13.813 ± 3.082  $7.58 ± $0.42   5
ICL · Gemini 3 Flash                  33.039 ± 0.879    13.279 ± 0.879  $0.68 ± $0.02   5
ICL · Gemini 3.1 Pro Preview          33.033 ± 1.136    13.273 ± 1.136  $3.84 ± $0.17   5
Codex · GPT-5.4                       32.828 ± 0.000    13.068 ± 0.000  $3.15 ± $0.00   1
ICL Notepad · GPT-5.4                 31.915 ± 2.122    12.153 ± 2.122  $1.02 ± $0.05   5
ICL Notepad · Gemini 3.1 Pro Preview  29.122 ± 3.011    9.362 ± 3.011   $2.80 ± $0.53   5
ACE · GPT-5.4                         19.778 ± 0.009    0.017 ± 0.009   $3.96 ± $0.33   5
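The "mean ± spread" cells in this table could be produced from per-run totals along the following lines. The page does not say what the ± term is, so this sketch assumes it is the sample standard deviation across runs, reported as 0 when only one run exists (consistent with the single-run Codex row showing ± 0.000). The helper names and the per-run numbers are made up for illustration.

```python
import statistics
from typing import List, Tuple


def summarize(runs: List[float]) -> Tuple[float, float]:
    """Return (mean, spread) of per-run cumulative values.

    Assumption: spread is the sample standard deviation, and 0.0 for a single run.
    """
    mean = statistics.mean(runs)
    spread = statistics.stdev(runs) if len(runs) > 1 else 0.0
    return mean, spread


def format_cell(runs: List[float]) -> str:
    """Format a table cell in the same 'mean ± spread' style as the table above."""
    mean, spread = summarize(runs)
    return f"{mean:.3f} ± {spread:.3f}"


# Hypothetical per-run cumulative rewards for one system on one task.
print(format_cell([2.0, 4.0, 6.0]))
# A single run has no measurable spread.
print(format_cell([32.828]))
```

If the table actually reports a standard error or a confidence half-width, only the `spread` line would change.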

Task suite 1.0

Tasks are authored and validated by domain experts. Each task is a sequence of related instances rather than a single static problem — success requires the agent to adapt as the sequence unfolds.
codebase_adaptation
The agent resolves a sequence of GitHub issues on a shared codebase by executing bash commands in a Docker container. Success is measured by how few steps are needed per issue — rewarding agents that accumulate reusable knowledge of the repo over time.
19 sub-tasks
blind_spectrum_monitoring
The agent monitors RF spectrum signals to detect anomalies and identify emitters, operating with incomplete sensor data and shifting sensor configurations. It must learn persistent emitter patterns while adapting to changing array geometry across monitoring sessions.
90 sub-tasks
cohort_studies
The agent estimates patient survival across sequential clinical studies with inconsistent variable definitions and coding conventions. It must synthesize epidemiological knowledge across schemas to improve Kaplan-Meier survival estimates for predefined population cohorts.
20 sub-tasks
database_exploration
The agent answers natural-language questions about an unknown SQLite database by issuing exploratory queries before committing to a final answer. The schema drifts across instances, requiring the agent to relearn structure over time.
40 sub-tasks
exploitable_poker
The agent plays heads-up poker against a deterministic opponent whose strategy has exploitable patterns. It must infer weaknesses from hand outcomes and adapt its betting decisions to accumulate profit over many hands.
120 sub-tasks
sales_prediction
The agent forecasts furniture sales across stores and time periods by writing Python analysis code in Docker. It must learn store-specific growth patterns and schema conventions from historical data, improving its models with each sequential prediction task.
12 sub-tasks
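For readers unfamiliar with the Kaplan-Meier estimates mentioned under cohort_studies, here is a minimal sketch of the standard product-limit estimator. It is not the benchmark's scoring code; the function name and the toy cohort are hypothetical. Each subject has a follow-up duration and a flag saying whether the event was observed (True) or the subject was censored (False).

```python
from collections import Counter
from typing import List, Tuple


def kaplan_meier(durations: List[float], events: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival probability) at each distinct event time."""
    deaths = Counter(t for t, observed in zip(durations, events) if observed)
    exits = Counter(durations)  # every subject leaves the risk set at their own duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction of at-risk subjects surviving t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Subjects with an event or censoring at t are removed after t is processed.
        at_risk -= exits[t]
    return curve


# Toy cohort: five subjects, two of them censored (at times 2 and 4).
durations = [1, 2, 2, 3, 4]
events = [True, True, False, True, False]
for t, s in kaplan_meier(durations, events):
    print(f"t={t}: S(t) ~= {s:.2f}")
```

On this toy cohort the estimated survival steps down to roughly 0.80, 0.60, and 0.30 at times 1, 2, and 3. Censored subjects never trigger a step, but they shrink the risk set for later event times.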

Methodology

reward ↑
Raw task performance score. Higher is better.
gain ↑
Reward minus the same system's stateless baseline: a direct measure of how much the system learned from experience.
Agg. Reward / Gain ↑
Each task's reward is normalized against a reference ceiling and a fixed stateless baseline (for gain, against the system's own corresponding stateless baseline), then averaged across tasks. Primary ranking metric.
Cost ↓
Aggregate table: the sum, over included tasks, of each task's mean spend per single rollout. Task table: mean spend per single task rollout.
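One way to read the aggregate definitions above is sketched below. The exact normalization formula is not given on this page, so the code assumes a linear rescaling in which the stateless baseline maps to 0 and the reference ceiling maps to 1. All function names, task names, and numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def normalize(reward: float, baseline: float, ceiling: float) -> float:
    """Assumed normalization: 0 at the stateless baseline, 1 at the reference ceiling."""
    return (reward - baseline) / (ceiling - baseline)


def aggregate_reward(
    rewards: Dict[str, float],
    baselines: Dict[str, float],
    ceilings: Dict[str, float],
) -> float:
    """Average the normalized per-task scores across tasks."""
    return mean(normalize(rewards[t], baselines[t], ceilings[t]) for t in rewards)


def aggregate_cost(rollout_costs: Dict[str, List[float]]) -> float:
    """Sum over tasks of each task's mean spend per single rollout."""
    return sum(mean(costs) for costs in rollout_costs.values())


# Hypothetical two-task example.
rewards = {"task_a": 30.0, "task_b": 5.0}
baselines = {"task_a": 20.0, "task_b": 10.0}
ceilings = {"task_a": 60.0, "task_b": 20.0}
costs = {"task_a": [1.0, 3.0], "task_b": [0.5, 0.5]}

print(aggregate_reward(rewards, baselines, ceilings))
print(aggregate_cost(costs))
```

Under this assumed formula, a system that scores below the baseline on a task contributes a negative term, which would be one explanation for the negative aggregate rewards near the bottom of the leaderboard.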

Acknowledgments

Led by Stanford and Laude Institute, with TB 2.1 lead Kelly Buchanan.
