Continual Learning Bench
A benchmark evaluating whether AI systems genuinely improve from prior experience. Unlike static benchmarks that treat every task as independent, Continual Learning Bench (CLB) measures performance across stateful task sequences, rewarding systems that accumulate and apply knowledge over time.
Overview
Most benchmarks make a core assumption: models are stateless. Once a model completes a task, it moves on to the next as if the first never happened. In practice, deployed systems encounter new information and operate in sequential environments where they should improve with experience.
Continual Learning Bench 1.0 is a benchmark of expert-validated task sequences across real-world domains (software engineering, data science, strategic modeling) where tasks are not independent, systems are expected to change during evaluation, and performance depends on what the system has seen before.
Leaderboard
Sorted by aggregate reward. Only systems with complete task coverage receive a rank.
| Rank | System | Avg Reward | Avg Gain | Avg Cost |
|---|---|---|---|---|
| 1 | | 0.223 | 0.254 | $30.43 |
| 2 | | 0.201 | 0.201 | $18.39 |
| 3 | | 0.19 | 0.239 | $38.60 |
| 4 | | 0.151 | 0.202 | $18.34 |
| 5 | | 0.102 | 0.195 | $49.62 |
| 6 | | 0.08 | 0.078 | $14.28 |
| 7 | | 0.08 | 0.164 | $7.60 |
| 8 | | 0.066 | 0.146 | $27.21 |
| 9 | | 0.046 | 0.086 | $62.75 |
| 10 | | 0.035 | 0.182 | $31.53 |
| 11 | | -0.002 | 0.094 | $13.32 |
| 12 | | -0.056 | 0.062 | $15.23 |
Aggregate metrics
Chart: Pareto frontier of the non-dominated systems on the selected metric pair.
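A non-dominated set like the one this chart highlights can be recomputed from the leaderboard figures. The sketch below assumes the metric pair is average reward (higher is better) against average cost (lower is better); the page does not state which pair is plotted, so treat that choice as an assumption. Systems are identified by rank only, and `Entry`, `dominates`, and `pareto_frontier` are illustrative names, not part of the benchmark.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    rank: int
    reward: float  # Avg Reward, higher is better
    cost: float    # Avg Cost in USD, lower is better


# (rank, Avg Reward, Avg Cost) copied from the leaderboard rows.
LEADERBOARD = [
    Entry(1, 0.223, 30.43), Entry(2, 0.201, 18.39), Entry(3, 0.19, 38.60),
    Entry(4, 0.151, 18.34), Entry(5, 0.102, 49.62), Entry(6, 0.08, 14.28),
    Entry(7, 0.08, 7.60), Entry(8, 0.066, 27.21), Entry(9, 0.046, 62.75),
    Entry(10, 0.035, 31.53), Entry(11, -0.002, 13.32), Entry(12, -0.056, 15.23),
]


def dominates(a: Entry, b: Entry) -> bool:
    """True if a is at least as good as b on both metrics and strictly better on one."""
    at_least_as_good = a.reward >= b.reward and a.cost <= b.cost
    strictly_better = a.reward > b.reward or a.cost < b.cost
    return at_least_as_good and strictly_better


def pareto_frontier(entries: list[Entry]) -> list[Entry]:
    """Keep every entry that no other entry dominates."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


frontier_ranks = [e.rank for e in pareto_frontier(LEADERBOARD)]
print(frontier_ranks)
```

Under the assumed reward-versus-cost pairing, a system stays on the frontier only if every system with higher reward also costs more.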
How it works
Each task is a sequence of instances. A continually-learning system carries state from one instance to the next; the stateless baseline resets between every instance. The difference between the two is the system's gain.
Continual learning: S1 → S2 → S3 → S4
Stateless baseline: S1 → RESET → S2 → RESET → S3 → RESET → S4
Gain = reward (continual) − reward (stateless baseline)
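The comparison between the two runs can be sketched in a few lines. This is a toy illustration, not the benchmark's harness: `ToyLearner`, `run_sequence`, and the reward rule are hypothetical, and summing rewards over the sequence before taking the difference is an assumption about how gain is accumulated.

```python
class ToyLearner:
    """Hypothetical system whose reward grows with the number of instances seen."""

    def __init__(self) -> None:
        self.seen = 0

    def reset(self) -> None:
        # Drop all accumulated state.
        self.seen = 0

    def solve(self, instance: str) -> float:
        # Invented reward rule: 0.25 per instance of experience, capped at 1.0.
        reward = min(1.0, 0.25 * (self.seen + 1))
        self.seen += 1
        return reward


def run_sequence(system: ToyLearner, instances: list[str], reset_between: bool) -> list[float]:
    """Run instances in order, optionally resetting state before each one."""
    system.reset()
    rewards = []
    for index, instance in enumerate(instances):
        if reset_between and index > 0:
            system.reset()
        rewards.append(system.solve(instance))
    return rewards


instances = ["S1", "S2", "S3", "S4"]
continual = run_sequence(ToyLearner(), instances, reset_between=False)
stateless = run_sequence(ToyLearner(), instances, reset_between=True)
gain = sum(continual) - sum(stateless)
print(continual, stateless, gain)
```

Because the same system is used in both runs, the difference isolates what carrying state contributed rather than how strong the underlying model is.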
Stateful system vs. stateless baseline
The chart compares a stateful system's reward curve against its own stateless baseline: two lines on the same task, one with continual learning (stateful) and one without (stateless baseline). Values are per-instance reward, averaged across runs.
Per-task breakdown
| System | Mean Cum. Reward | Mean Cum. Gain | Cost | Runs |
|---|---|---|---|---|
| ICL · GPT-5.4 | 46.198 ± 1.001 | 26.437 ± 1.001 | $1.93 ± $0.05 | 5 |
| Claude Code · Sonnet 4.6 | 44.282 ± 1.449 | 24.522 ± 1.449 | $10.40 ± $2.27 | 5 |
| ICL · Claude Sonnet 4.6 | 36.584 ± 1.262 | 16.825 ± 1.262 | $3.60 ± $0.17 | 5 |
| ICL Notepad · Claude Sonnet 4.6 | 35.993 ± 2.414 | 16.233 ± 2.414 | $2.99 ± $0.27 | 5 |
| Mem0 · GPT-5.4 | 33.794 ± 2.986 | 14.033 ± 2.986 | $1.39 ± $0.07 | 5 |
| ICL · Claude Opus 4.7 | 33.572 ± 3.082 | 13.813 ± 3.082 | $7.58 ± $0.42 | 5 |
| ICL · Gemini 3 Flash | 33.039 ± 0.879 | 13.279 ± 0.879 | $0.68 ± $0.02 | 5 |
| ICL · Gemini 3.1 Pro Preview | 33.033 ± 1.136 | 13.273 ± 1.136 | $3.84 ± $0.17 | 5 |
| Codex · GPT-5.4 | 32.828 ± 0.000 | 13.068 ± 0.000 | $3.15 ± $0.00 | 1 |
| ICL Notepad · GPT-5.4 | 31.915 ± 2.122 | 12.153 ± 2.122 | $1.02 ± $0.05 | 5 |
| ICL Notepad · Gemini 3.1 Pro Preview | 29.122 ± 3.011 | 9.362 ± 3.011 | $2.80 ± $0.53 | 5 |
| ACE · GPT-5.4 | 19.778 ± 0.009 | 0.017 ± 0.009 | $3.96 ± $0.33 | 5 |
Task suite 1.0
Tasks are authored and validated by domain experts. Each task is a sequence of related instances rather than a single static problem — success requires the agent to adapt as the sequence unfolds.
codebase_adaptation
The agent resolves a sequence of GitHub issues on a shared codebase by executing bash commands in a Docker container. Success is measured by how few steps are needed per issue — rewarding agents that accumulate reusable knowledge of the repo over time.
19 sub-tasks
blind_spectrum_monitoring
The agent monitors RF spectrum signals to detect anomalies and identify emitters, operating with incomplete sensor data and shifting sensor configurations. It must learn persistent emitter patterns while adapting to changing array geometry across monitoring sessions.
90 sub-tasks
cohort_studies
The agent estimates patient survival across sequential clinical studies with inconsistent variable definitions and coding conventions. It must synthesize epidemiological knowledge across schemas to improve Kaplan-Meier survival estimates for predefined population cohorts.
20 sub-tasks
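For readers unfamiliar with the estimator named in the cohort_studies description, the Kaplan-Meier (product-limit) estimate multiplies, at each observed event time, the fraction of at-risk patients who survive it. The sketch below is a minimal pure-Python version on made-up data; the function name and input layout are assumptions, and it says nothing about how the benchmark scores the agent's estimates.

```python
def kaplan_meier(durations: list[float], events: list[int]) -> list[tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time.

    events[i] is 1 if the event was observed at durations[i], 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t.
        at_risk -= leaving
    return curve


# Made-up cohort: five patients, two of them censored.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(curve)
```

Censored patients count as at risk up to their last follow-up and then drop out without lowering the curve, which is why inconsistent event coding across studies can change the estimate.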
database_exploration
The agent answers natural-language questions about an unknown SQLite database by issuing exploratory queries before committing to a final answer. The schema drifts across instances, requiring the agent to relearn structure over time.
40 sub-tasks
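The kind of exploratory query the database_exploration description refers to can be illustrated with Python's standard `sqlite3` module. The sketch below reads the schema from `sqlite_master` and `PRAGMA table_info`, then detects a change after a simulated drift; the `orders` table, the added column, and the `describe_schema` helper are invented for illustration and are not taken from the benchmark.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> dict[str, list[tuple[str, str]]]:
    """Map each table name to its list of (column name, declared type) pairs."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # Table names come from sqlite_master itself, so quoting them is sufficient here.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
before = describe_schema(conn)

# Simulated schema drift between two instances (hypothetical change).
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")
after = describe_schema(conn)

drifted = before != after
print(before, after, drifted)
```

An agent that caches the result of a helper like this and re-checks it at the start of each instance can notice drift cheaply instead of rediscovering the whole schema.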
exploitable_poker
The agent plays heads-up poker against a deterministic opponent whose strategy has exploitable patterns. It must infer weaknesses from hand outcomes and adapt its betting decisions to accumulate profit over many hands.
120 sub-tasks
sales_prediction
The agent forecasts furniture sales across stores and time periods by writing Python analysis code in Docker. It must learn store-specific growth patterns and schema conventions from historical data, improving its models with each sequential prediction task.
12 sub-tasks
Methodology
reward ↑
Raw task performance score. Higher is better.
gain ↑
Reward minus the reward of the same system's stateless baseline: a direct measure of how much the system learned from experience.
Agg. Reward / Gain ↑
Each task's reward (or gain) is normalized against a reference ceiling and a fixed stateless baseline (or, for gain, the system's corresponding stateless baseline), then averaged across tasks. Aggregate reward is the primary ranking metric.
Cost ↓
Aggregate table: the sum, over included tasks, of each task's mean spend per single rollout. Task table: mean spend per single rollout of that task.
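The Agg. Reward definition above does not spell out the normalization formula. One plausible reading, shown here strictly as an assumption, rescales each task's score so that the stateless baseline maps to 0 and the reference ceiling maps to 1, then takes the unweighted mean across tasks. The task names and numbers are invented, and `normalize` and `aggregate` are hypothetical helpers.

```python
def normalize(score: float, baseline: float, ceiling: float) -> float:
    """Assumed rescaling: baseline maps to 0.0, ceiling maps to 1.0."""
    if ceiling == baseline:
        raise ValueError("ceiling and baseline must differ")
    return (score - baseline) / (ceiling - baseline)


def aggregate(per_task: dict[str, tuple[float, float, float]]) -> float:
    """Unweighted mean of normalized scores; values are (score, baseline, ceiling)."""
    normalized = [normalize(*values) for values in per_task.values()]
    return sum(normalized) / len(normalized)


# Invented numbers, not benchmark results.
example = {
    "task_a": (30.0, 20.0, 60.0),  # above baseline: normalizes to 0.25
    "task_b": (5.0, 10.0, 30.0),   # below baseline: normalizes to -0.25
}
agg_reward = aggregate(example)
print(agg_reward)
```

Under this assumed formula, a task score below the baseline contributes a negative term, which would be one way for aggregate rewards such as the negative values at the bottom of the leaderboard to arise.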

