Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting). Each domain is validated by domain experts and designed so that its tasks share a learnable latent structure (for example, codebase layout, disease outbreak dynamics, or opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, and introduce a gain metric to isolate learning from prior capabilities. We find that these systems leave substantial headroom for continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances. Dedicated memory systems do not fix this; in fact, naive ICL outperforms them. By covering diverse real-world domains with expert-validated tasks and isolating online learning from underlying model capability, CL-Bench shows the need for better continual learning systems.
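The abstract does not define the gain metric, so the following is only an illustrative sketch of one way such a metric could separate learning from prior capability. It assumes that the same task sequence is scored once by a stateful system and once by a stateless baseline using the same model. The function name `gain` and the per-task score lists are hypothetical and are not taken from CL-Bench.

```python
from statistics import mean


def gain(stateful_scores, stateless_scores):
    """Hypothetical gain: mean per-task improvement of a stateful run
    over a stateless run of the same model on the same task sequence.

    This is an assumed definition for illustration only; the actual
    CL-Bench metric may differ.
    """
    if len(stateful_scores) != len(stateless_scores):
        raise ValueError("both runs must cover the same tasks")
    if not stateful_scores:
        raise ValueError("at least one task is required")
    # Subtracting the stateless score removes what the model could
    # already do without accumulating experience across tasks.
    return mean(s - b for s, b in zip(stateful_scores, stateless_scores))


# Made-up scores: the stateful run improves over the sequence,
# while the stateless baseline stays flat.
example_gain = gain([0.5, 0.7, 0.9], [0.5, 0.5, 0.5])
```

Under this assumed definition, a gain near zero would mean the system is not benefiting from earlier tasks, regardless of how strong the underlying model is.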
Memory & Continual Learning sessions at AI Engineer World's Fair 2026 in San Francisco.
Wednesday, July 1, 2026
10:45 AM - 11:05 AM · 20m
Track 3 · Room 2003
Capacity: 250 attendees

Parth Asawa
CS PhD student
UC Berkeley
@pgasawa
Parth Asawa is a PhD student at UC Berkeley advised by Professor Matei Zaharia and Professor Joey Gonzalez. Parth's research is on continual learning, studying how to enable models to stably learn from streams of experiences over time. His work focuses on sample-efficient learning and spans the stack of data, learning algorithms, and evaluation.