Continual Learning Bench

A benchmark evaluating whether AI systems genuinely improve from prior experience. Unlike static benchmarks that treat every task as independent, Continual Learning Bench measures performance across sequential, stateful task sequences, rewarding systems that accumulate and apply knowledge over time.

Overview

Most benchmarks make a core assumption: models are stateless. Once they complete a task, they move on to the next as if the first never happened. In practice, deployed systems encounter new information and operate in sequential environments where meaningful improvement should occur.

Continual Learning Bench is a benchmark of expert-validated task sequences across real-world domains (software engineering, data science, strategic modeling) where tasks are not independent, systems are expected to change during evaluation, and performance depends on what the system has seen before.

Leaderboard

Sorted by aggregate reward. Only systems with complete task coverage receive a rank.

Rank	System	Avg Reward	Avg Gain	Avg Cost
1	ICL · Claude Sonnet 4.6	0.196	0.241	$30.43
2	ICL · GPT-5.4	0.189	0.189	$18.39
3	Claude Code · Sonnet 4.6	0.185	0.241	$38.60
4	ICL · Claude Opus 4.7	0.183	0.195	$49.62
5	Mem0 · GPT-5.4	0.148	0.224	$18.34
6	ICL Notepad · Claude Sonnet 4.6	0.132	0.182	$31.53
7	ICL Notepad · GPT-5.4	0.104	0.156	$14.28
8	ICL · Gemini 3 Flash	0.092	0.155	$7.60
9	ACE · GPT-5.4	0.066	0.077	$62.75
10	Codex · GPT-5.4	0.057	0.120	$27.21
11	ICL Notepad · Gemini 3.1 Pro Preview	0.003	0.081	$13.32
12	ICL · Gemini 3.1 Pro Preview	-0.076	0.036	$15.23
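
The ranking rule stated above the table (sort by aggregate reward, rank only systems with complete task coverage) can be sketched as follows. This is an illustration, not the leaderboard's actual code: the `Entry` class, its `tasks_completed` field, and the "Hypothetical partial system" row are assumptions made for the example, and the page does not say where unranked systems are displayed.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    system: str
    avg_reward: float
    tasks_completed: int  # hypothetical field: number of tasks with results


def rank_entries(entries, total_tasks):
    """Sort by aggregate reward (descending); number only full-coverage systems."""
    ordered = sorted(entries, key=lambda e: e.avg_reward, reverse=True)
    ranked = []
    next_rank = 1
    for entry in ordered:
        if entry.tasks_completed == total_tasks:
            ranked.append((next_rank, entry.system))
            next_rank += 1
        else:
            # Incomplete coverage: listed, but without a rank.
            ranked.append((None, entry.system))
    return ranked


entries = [
    Entry("ICL · GPT-5.4", 0.189, 6),
    Entry("ICL · Claude Sonnet 4.6", 0.196, 6),
    Entry("Hypothetical partial system", 0.250, 4),  # invented row
]
leaderboard = rank_entries(entries, total_tasks=6)
```

In this sketch the partial system keeps its sorted position but receives no rank, so the two complete systems take ranks 1 and 2.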

How it works

Each task is a sequence of instances. A continually-learning system carries state from one instance to the next; the stateless baseline resets between every instance. The difference between the two is the system's gain.

gain = reward (continual) − reward (stateless baseline)
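
The comparison described above can be sketched with a toy harness. Everything here is an assumption made for illustration: the `CountingSystem` class and its `reset` and `solve` methods are hypothetical and are not the benchmark's interface. The sketch only shows the idea of running the same instance sequence twice, once carrying state and once resetting before every instance.

```python
class CountingSystem:
    """Toy system (hypothetical): it succeeds on an instance it has seen before."""

    def __init__(self):
        self.memory = set()

    def reset(self):
        self.memory.clear()

    def solve(self, instance):
        reward = 1.0 if instance in self.memory else 0.0
        self.memory.add(instance)
        return reward


def run_sequence(system, instances, stateless):
    """Return per-instance rewards; reset before each instance if stateless."""
    system.reset()
    rewards = []
    for instance in instances:
        if stateless:
            system.reset()
        rewards.append(system.solve(instance))
    return rewards


instances = ["a", "b", "a", "b", "a"]
continual = run_sequence(CountingSystem(), instances, stateless=False)
baseline = run_sequence(CountingSystem(), instances, stateless=True)

# Gain is the continual reward minus the stateless-baseline reward.
gain = [c - b for c, b in zip(continual, baseline)]
cumulative_gain = sum(gain)
```

Here the stateless run never benefits from earlier instances, so all of the reward on repeated instances shows up as gain.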

Stateful system vs. stateless baseline

Figure (interactive on the original page): reward curves for a selected stateful system and its own stateless baseline on the same task, with and without continual learning. Values are per-instance reward, averaged across runs.
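
A curve of per-instance reward averaged across runs could be computed along these lines. The run data below is invented for illustration, and `mean_curve` is a hypothetical helper, not part of the benchmark.

```python
from statistics import mean


def mean_curve(runs):
    """Average per-instance reward across runs; every run must cover the same instances."""
    lengths = {len(run) for run in runs}
    if len(lengths) != 1:
        raise ValueError("all runs must cover the same number of instances")
    # zip(*runs) groups the rewards for instance 0, instance 1, ... across runs.
    return [mean(values) for values in zip(*runs)]


# Invented example: two runs of three instances each.
stateful_runs = [[0.0, 0.5, 1.0], [0.0, 1.0, 1.0]]
stateless_runs = [[0.0, 0.0, 0.5], [0.0, 0.5, 0.5]]

stateful_curve = mean_curve(stateful_runs)
stateless_curve = mean_curve(stateless_runs)
```

Plotting the two resulting lists against the instance index gives the two lines of the comparison.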

Per-task breakdown

System	Mean Cum. Reward	Mean Cum. Gain	Cost	Runs
ICL · GPT-5.4	46.198 ± 1.001	26.437 ± 1.001	$1.93 ± $0.05	5
Claude Code · Sonnet 4.6	44.282 ± 1.449	24.522 ± 1.449	$10.40 ± $2.27	5
ICL · Claude Sonnet 4.6	36.584 ± 1.262	16.825 ± 1.262	$3.60 ± $0.17	5
ICL Notepad · Claude Sonnet 4.6	35.993 ± 2.414	16.233 ± 2.414	$2.99 ± $0.27	5
Mem0 · GPT-5.4	33.794 ± 2.986	14.033 ± 2.986	$1.39 ± $0.07	5
ICL · Claude Opus 4.7	33.572 ± 3.082	13.813 ± 3.082	$7.58 ± $0.42	5
ICL · Gemini 3 Flash	33.039 ± 0.879	13.279 ± 0.879	$0.68 ± $0.02	5
ICL · Gemini 3.1 Pro Preview	33.033 ± 1.136	13.273 ± 1.136	$3.84 ± $0.17	5
Codex · GPT-5.4	32.828 ± 0.000	13.068 ± 0.000	$3.15 ± $0.00	1
ICL Notepad · GPT-5.4	31.915 ± 2.122	12.153 ± 2.122	$1.02 ± $0.05	5
ICL Notepad · Gemini 3.1 Pro Preview	29.122 ± 3.011	9.362 ± 3.011	$2.80 ± $0.53	5
ACE · GPT-5.4	19.778 ± 0.009	0.017 ± 0.009	$3.96 ± $0.33	5

System	Mean Cum. Reward	Mean Cum. Gain	Cost	Runs
ACE · GPT-5.4	11.580 ± 0.181	1.580 ± 0.181	$15.66 ± $0.79	5
Mem0 · GPT-5.4	11.105 ± 1.100	2.980 ± 1.100	$2.62 ± $0.16	5
ICL · Claude Opus 4.7	10.360 ± 0.924	1.485 ± 0.924	$8.03 ± $1.44	5
ICL · GPT-5.4	10.140 ± 1.959	0.690 ± 1.959	$3.56 ± $0.70	5
ICL · Claude Sonnet 4.6	9.755 ± 0.223	2.705 ± 0.223	$6.92 ± $0.74	5
ICL Notepad · GPT-5.4	8.860 ± 0.685	0.610 ± 0.685	$3.49 ± $0.19	5
ICL Notepad · Claude Sonnet 4.6	8.765 ± 0.455	0.890 ± 0.455	$4.24 ± $0.53	5
Codex · GPT-5.4	7.450 ± 0.000	-0.800 ± 0.000	$3.79 ± $0.00	1
ICL · Gemini 3 Flash	7.420 ± 0.934	-0.330 ± 0.934	$2.75 ± $0.35	5
ICL Notepad · Gemini 3.1 Pro Preview	7.110 ± 0.634	-1.315 ± 0.634	$3.51 ± $0.05	5
Claude Code · Sonnet 4.6	6.630 ± 1.755	0.730 ± 1.755	$6.78 ± $0.43	5
ICL · Gemini 3.1 Pro Preview	5.125 ± 0.652	-1.550 ± 0.652	$3.40 ± $0.96	5

System	Mean Cum. Reward	Mean Cum. Gain	Cost	Runs
ICL · GPT-5.4	0.957 ± 0.393	-0.037 ± 0.393	$3.73 ± $0.34	5
Codex · GPT-5.4	0.821 ± 0.000	0.316 ± 0.000	$7.76 ± $0.00	1
Mem0 · GPT-5.4	0.778 ± 0.315	-0.067 ± 0.315	$6.00 ± $0.55	5
ICL · Claude Sonnet 4.6	0.762 ± 0.240	0.185 ± 0.240	$5.62 ± $0.40	5
ACE · GPT-5.4	0.759 ± 0.127	0.383 ± 0.127	$12.80 ± $1.11	5
ICL · Gemini 3 Flash	0.576 ± 0.215	0.437 ± 0.215	$1.25 ± $0.15	5
Claude Code · Sonnet 4.6	0.496 ± 0.226	-0.309 ± 0.226	$7.21 ± $0.50	5
ICL Notepad · GPT-5.4	0.476 ± 0.046	-0.900 ± 0.046	$4.44 ± $0.05	5
ICL Notepad · Gemini 3.1 Pro Preview	0.327 ± 0.086	-0.366 ± 0.086	$1.53 ± $0.06	5
ICL · Gemini 3.1 Pro Preview	0.257 ± 0.143	-0.563 ± 0.143	$1.83 ± $0.10	5
ICL · Claude Opus 4.7	-0.121 ± 0.093	-0.091 ± 0.093	$6.97 ± $0.22	5
ICL Notepad · Claude Sonnet 4.6	-0.784 ± 0.358	0.795 ± 0.358	$11.56 ± $0.23	5

System	Mean Cum. Reward	Mean Cum. Gain	Cost	Runs
Claude Code · Sonnet 4.6	22.053 ± 1.182	13.853 ± 1.182	$3.34 ± $0.23	5
Mem0 · GPT-5.4	17.240 ± 1.300	12.907 ± 1.300	$1.97 ± $0.10	5
ICL · Claude Opus 4.7	15.653 ± 1.781	9.587 ± 1.781	$5.22 ± $0.22	5
ICL · Gemini 3 Flash	15.027 ± 0.712	11.493 ± 0.712	$0.42 ± $0.02	5
ICL · Claude Sonnet 4.6	15.013 ± 1.427	8.480 ± 1.427	$1.95 ± $0.16	5
ICL · GPT-5.4	13.880 ± 1.715	8.347 ± 1.715	$1.03 ± $0.05	5
ICL Notepad · GPT-5.4	12.373 ± 1.158	6.373 ± 1.158	$1.45 ± $0.07	5
ICL · Gemini 3.1 Pro Preview	11.560 ± 2.894	6.827 ± 2.894	$1.32 ± $0.15	5
ICL Notepad · Claude Sonnet 4.6	11.000 ± 0.877	3.667 ± 0.877	$2.24 ± $0.05	5
Codex · GPT-5.4	9.600 ± 0.000	6.133 ± 0.000	$1.85 ± $0.00	1
ICL Notepad · Gemini 3.1 Pro Preview	8.520 ± 0.651	4.320 ± 0.651	$2.63 ± $0.16	5
ACE · GPT-5.4	7.853 ± 1.291	2.387 ± 1.291	$8.78 ± $0.70	5

System	Mean Cum. Reward	Mean Cum. Gain	Cost	Runs
Claude Code · Sonnet 4.6	343.020 ± 29.367	58.520 ± 29.367	$8.65 ± $0.30	5
ICL · Claude Sonnet 4.6	339.920 ± 19.530	23.220 ± 19.530	$9.39 ± $0.41	5
ACE · GPT-5.4	143.500 ± 41.196	1.600 ± 41.196	$13.20 ± $0.81	5
ICL · Claude Opus 4.7	116.680 ± 26.485	-41.020 ± 26.485	$17.47 ± $0.66	5
ICL Notepad · Claude Sonnet 4.6	115.420 ± 29.931	-201.580 ± 29.931	$6.75 ± $0.31	5
ICL · GPT-5.4	95.760 ± 26.414	-37.840 ± 26.414	$4.61 ± $0.09	5
ICL · Gemini 3 Flash	94.840 ± 16.912	-101.960 ± 16.912	$1.91 ± $0.15	5
Codex · GPT-5.4	85.000 ± 0.000	20.500 ± 0.000	$8.27 ± $0.00	1
ICL Notepad · GPT-5.4	81.340 ± 15.206	0.240 ± 15.206	$1.52 ± $0.04	5
ICL · Gemini 3.1 Pro Preview	76.400 ± 3.868	32.900 ± 3.868	$3.99 ± $0.09	5
Mem0 · GPT-5.4	73.420 ± 41.548	-16.980 ± 41.548	$3.57 ± $0.11	5
ICL Notepad · Gemini 3.1 Pro Preview	53.500 ± 6.684	17.000 ± 6.684	$1.33 ± $0.02	5

System	Mean Cum. Reward	Mean Cum. Gain	Cost	Runs
ICL Notepad · Claude Sonnet 4.6	10.073 ± 0.095	5.790 ± 0.095	$3.75 ± $0.10	5
ICL · Claude Sonnet 4.6	10.018 ± 0.215	4.840 ± 0.215	$2.96 ± $0.13	5
Claude Code · Sonnet 4.6	9.573 ± 0.254	4.537 ± 0.254	$2.21 ± $0.08	5
ICL · GPT-5.4	9.230 ± 0.254	3.633 ± 0.254	$3.52 ± $0.07	5
ICL · Claude Opus 4.7	9.039 ± 0.292	4.649 ± 0.292	$4.36 ± $0.14	5
ICL Notepad · GPT-5.4	8.486 ± 0.317	4.003 ± 0.317	$2.35 ± $0.10	5
ICL · Gemini 3 Flash	8.481 ± 0.211	3.210 ± 0.211	$0.59 ± $0.05	5
Codex · GPT-5.4	8.320 ± 0.198	3.144 ± 0.198	$2.38 ± $0.33	5
ICL Notepad · Gemini 3.1 Pro Preview	8.108 ± 0.350	4.952 ± 0.350	$1.53 ± $0.11	5
Mem0 · GPT-5.4	7.815 ± 0.198	3.064 ± 0.198	$2.79 ± $0.29	5
ICL · Gemini 3.1 Pro Preview	6.468 ± 0.054	2.560 ± 0.054	$0.85 ± $0.01	5
ACE · GPT-5.4	6.116 ± 0.129	0.913 ± 0.129	$8.36 ± $0.48	5
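
One way to produce a "mean ± spread" summary like those in the tables above is sketched here. The page does not say which spread statistic is reported, so the sample standard deviation used below is an assumption, as are the `summarize` helper and the example numbers. The sketch reads "Mean Cum. Reward" as reward summed over a task's instances, then averaged over runs.

```python
from statistics import mean, stdev


def summarize(runs, baseline_total=0.0):
    """Return (mean, spread, n_runs) of cumulative reward minus a baseline total."""
    totals = [sum(run) - baseline_total for run in runs]
    # With a single run there is no spread to estimate; report 0.0.
    spread = stdev(totals) if len(totals) > 1 else 0.0
    return mean(totals), spread, len(totals)


# Invented example: two runs of two instances each.
runs = [[1.0, 2.0], [2.0, 3.0]]
reward_mean, reward_spread, n_runs = summarize(runs)
gain_mean, gain_spread, _ = summarize(runs, baseline_total=1.5)
single_mean, single_spread, single_n = summarize([[4.0, 0.5]])
```

Subtracting a constant baseline total shifts the mean but leaves the spread unchanged, which is consistent with the identical ± values in the reward and gain columns of each row above.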

Task suite 1.0

Tasks are authored and validated by domain experts. Each task is a sequence of related instances rather than a single static problem — success requires the agent to adapt as the sequence unfolds.

codebase_adaptation

The agent resolves a sequence of GitHub issues on a shared codebase by executing bash commands in a Docker container. Success is measured by how few steps are needed per issue — rewarding agents that accumulate reusable knowledge of the repo over time.

19 sub-tasks

blind_spectrum_monitoring

The agent monitors RF spectrum signals to detect anomalies and identify emitters, operating with incomplete sensor data and shifting sensor configurations. It must learn persistent emitter patterns while adapting to changing array geometry across monitoring sessions.

90 sub-tasks

cohort_studies

The agent estimates patient survival across sequential clinical studies with inconsistent variable definitions and coding conventions. It must synthesize epidemiological knowledge across schemas to improve Kaplan-Meier survival estimates for predefined population cohorts.

20 sub-tasks
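
For readers unfamiliar with the estimator named above, here is a minimal Kaplan-Meier sketch in plain Python. It is the textbook product-limit estimator, not the task's grading code, and the patient data is invented.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct observed event time.

    durations: follow-up time per patient.
    events: True if the event (e.g. death) was observed, False if censored.
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Invented cohort: five patients, two of them censored.
durations = [1, 2, 2, 3, 4]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
```

For this cohort the estimated survival drops to about 0.8, 0.6, and 0.3 at times 1, 2, and 3. Censored patients count as at risk until they leave the study but never trigger a drop.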

database_exploration

The agent answers natural-language questions about an unknown SQLite database by issuing exploratory queries before committing to a final answer. The schema drifts across instances, requiring the agent to relearn structure over time.

40 sub-tasks
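
The kind of exploratory querying described above might look like the following, using Python's standard `sqlite3` module. The table names and the `describe_schema` and `changed_tables` helpers are invented for illustration; the actual task databases and tooling are not shown on this page.

```python
import sqlite3


def describe_schema(conn):
    """Map each user table to its list of (column_name, declared_type)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    schema = {}
    for table in tables:
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


def changed_tables(old, new):
    """Tables that were added, removed, or altered between two schema snapshots."""
    return sorted(t for t in set(old) | set(new) if old.get(t) != new.get(t))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
before = describe_schema(conn)

# Simulate schema drift between instances.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")
after = describe_schema(conn)
drifted = changed_tables(before, after)
```

Keeping the previous snapshot lets an agent spend its exploratory queries only on the tables that changed.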

exploitable_poker

The agent plays heads-up poker against a deterministic opponent whose strategy has exploitable patterns. It must infer weaknesses from hand outcomes and adapt its betting decisions to accumulate profit over many hands.

120 sub-tasks
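
As an illustration of inferring an exploitable pattern from hand outcomes, the sketch below tracks how often the opponent folds in a given situation and bluffs only when that rate beats the break-even threshold bet / (pot + bet). The `FoldTracker` class, the situation labels, and the assumption that a called bluff always loses are simplifications introduced here, not the benchmark's opponent model.

```python
from collections import defaultdict


class FoldTracker:
    """Counts how often the opponent folds in each labeled situation."""

    def __init__(self):
        self.seen = defaultdict(int)
        self.folds = defaultdict(int)

    def observe(self, situation, opponent_folded):
        self.seen[situation] += 1
        if opponent_folded:
            self.folds[situation] += 1

    def fold_rate(self, situation):
        n = self.seen.get(situation, 0)
        return self.folds.get(situation, 0) / n if n else None


def should_bluff(fold_rate, pot, bet):
    """Bluff when the expected value is positive, assuming a called bluff loses.

    EV = fold_rate * pot - (1 - fold_rate) * bet > 0
       <=> fold_rate > bet / (pot + bet)
    """
    if fold_rate is None:
        return False  # no observations yet
    return fold_rate > bet / (pot + bet)


tracker = FoldTracker()
for folded in [True, True, False, True]:
    tracker.observe("river_bet", folded)

rate = tracker.fold_rate("river_bet")                # 3 folds in 4 hands
small_bluff = should_bluff(rate, pot=10, bet=10)     # threshold 0.5
large_bluff = should_bluff(rate, pot=10, bet=40)     # threshold 0.8
```

Against a deterministic opponent, the observed fold rate in a given situation should settle quickly, which is what makes carrying these counts across hands valuable.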

sales_prediction

The agent forecasts furniture sales across stores and time periods by writing Python analysis code in Docker. It must learn store-specific growth patterns and schema conventions from historical data, improving its models with each sequential prediction task.

12 sub-tasks

Methodology

reward ↑

Raw task performance score. Higher is better.

gain ↑

Reward minus the reward of the same system's stateless baseline: a direct measure of how much the system learned from experience.

Agg. Reward / Gain ↑

Each task's reward (or gain) is normalized against a reference ceiling and a fixed (or corresponding) stateless baseline, then averaged across tasks. This is the primary ranking metric.
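
The page does not give the normalization formula, so the sketch below is only one plausible reading: a min-max style rescaling in which the stateless baseline maps to 0 and the reference ceiling maps to 1. The formula, the `normalize` and `aggregate` helpers, and all numbers are assumptions made for illustration.

```python
def normalize(value, baseline, ceiling):
    """Rescale so that baseline -> 0.0 and ceiling -> 1.0 (assumed formula)."""
    span = ceiling - baseline
    if span <= 0:
        raise ValueError("ceiling must exceed baseline")
    return (value - baseline) / span


def aggregate(per_task):
    """Average the normalized scores across tasks."""
    scores = [
        normalize(task["reward"], task["baseline"], task["ceiling"])
        for task in per_task
    ]
    return sum(scores) / len(scores)


# Invented numbers for two tasks on very different reward scales.
per_task = [
    {"reward": 30.0, "baseline": 20.0, "ceiling": 70.0},  # normalized: 0.2
    {"reward": 5.0, "baseline": 10.0, "ceiling": 20.0},   # normalized: -0.5
]
agg_reward = aggregate(per_task)
```

Under this reading, tasks with very different raw reward scales contribute equally to the average, and a system that falls below the baseline on enough tasks ends up with a negative aggregate, as in the last leaderboard row.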

Cost ↓

Aggregate table: the sum, over all included tasks, of each task's mean spend per single rollout. Task table: mean spend per single rollout of that task.
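
The aggregate cost definition can be checked against the figures on this page. The sketch below sums the per-task mean costs listed in the six per-task tables for two systems; `aggregate_cost` is a helper name introduced here.

```python
from decimal import Decimal

# Per-task mean rollout cost in USD, copied from the six per-task tables.
per_task_mean_cost = {
    "ICL · GPT-5.4": ["1.93", "3.56", "3.73", "1.03", "4.61", "3.52"],
    "Claude Code · Sonnet 4.6": ["10.40", "6.78", "7.21", "3.34", "8.65", "2.21"],
}


def aggregate_cost(costs):
    """Sum the per-task mean costs exactly, avoiding float rounding."""
    return sum((Decimal(c) for c in costs), Decimal("0"))


totals = {
    system: aggregate_cost(costs)
    for system, costs in per_task_mean_cost.items()
}
```

The sums come to $18.38 and $38.59, within a cent of the leaderboard's $18.39 and $38.60. A gap of that size is what one would expect if the leaderboard sums unrounded per-task means.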

Acknowledgments

The benchmark is led by researchers at UC Berkeley Skylab, UW-Madison, and Snorkel AI via the Open Benchmarks Grants program. Snorkel is actively collaborating on baseline human performance calibration for select tasks.
