The full landscape of learning without forgetting — from catastrophic forgetting to five families of solutions, unified by a Bayesian framework and the stability-plasticity tradeoff.
You train a neural network to classify cats and dogs. It reaches 95% accuracy. Then you train it on a new dataset of birds and fish. When you go back and test on cats and dogs, accuracy has collapsed to near-random. The network has catastrophically forgotten what it learned.
This isn't a minor annoyance; it's a fundamental failure mode. Standard training assumes that all data is available at once and drawn i.i.d. from a fixed distribution. But the real world delivers data sequentially: a self-driving car encounters new road conditions, a medical AI sees new diseases, a chatbot learns new topics. Each time, the model must learn the new without destroying the old.
Continual learning (also called incremental learning, lifelong learning, or sequential learning) aims to solve this: learn a sequence of tasks T1, T2, ..., TK while maintaining performance on all of them. The challenge is to be plastic enough to learn new things while being stable enough to retain old knowledge.
This survey by Wang et al. (2023) is one of the most comprehensive maps of the field. It organizes over 200 papers into a clean taxonomy: 7 learning scenarios, a unifying Bayesian framework, and 5 families of methods. Let's walk through the entire landscape.
Catastrophic forgetting: accuracy on Task A collapses as training shifts to Task B.
Not all continual learning problems are the same. The survey identifies seven distinct scenarios, each defined by what changes between tasks and what information is available at test time. Understanding these scenarios is crucial because the difficulty varies enormously — some are almost trivial, others remain unsolved.
Instance-Incremental Learning (IIL) is the simplest. The task stays the same; data just arrives in batches. Think of training a spam classifier that gets new email batches every week. Same labels, same distribution, just more data. Standard online learning handles this.
Domain-Incremental Learning (DIL) keeps the same label space but the input distribution shifts. A self-driving car trained in sunny California must adapt to snowy Boston. The output classes (stop sign, pedestrian, lane) stay the same, but the visual appearance changes dramatically.
Task-Incremental Learning (TIL) introduces disjoint label sets across tasks, but gives the model a task ID at test time. You know which task you're being tested on, so you can use a task-specific classification head. This is a significant simplification — the model just needs to learn task-specific features without confusing tasks.
Class-Incremental Learning (CIL) is the hardest standard scenario. Like TIL, the label sets are disjoint (Task 1 has classes 1-10, Task 2 has classes 11-20), but there is no task ID at test time. The model must classify among all classes seen so far. It must not only remember old class boundaries but also distinguish old classes from new ones without being told which task an input belongs to.
Task-Free Continual Learning (TFCL) removes task boundaries entirely. The data distribution shifts gradually and continuously — there are no discrete task switches. The learner must detect and adapt to distribution changes on its own.
Online Continual Learning (OCL) adds a single-pass constraint: each data point is seen exactly once. No epoch-based training, no replaying the current batch. This mirrors streaming data in production systems.
Continual Pre-training (CPT) is the newest scenario, motivated by foundation models. A pre-trained model (like GPT or CLIP) is sequentially pre-trained on new domains or corpora while preserving its general capabilities. The goal is downstream transfer, not performance on any single pre-training task.
A continual learning problem is a sequence of K datasets D1, D2, ..., DK, where Dk = {(xi, yi)} for i = 1, ..., nk is drawn from distribution Pk(X, Y). When learning task k, only Dk is available (previous data D1:k-1 is gone). The goal: after training on all K tasks, perform well on the test sets of all tasks.
At the heart of continual learning lies a fundamental tradeoff. A model needs plasticity (the ability to learn new tasks) and stability (the ability to retain old tasks). Too much plasticity → catastrophic forgetting. Too much stability → intransigence (inability to learn new things).
The survey unifies continual learning methods through Bayesian inference. Suppose we've learned tasks 1 through k−1, encoding our knowledge in a posterior p(θ|D1:k-1). When task k arrives, Bayes' rule gives us:
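In the notation above, and assuming the tasks are conditionally independent given θ, the update can be written as:

```latex
p(\theta \mid \mathcal{D}_{1:k})
  \;=\; \frac{p(\mathcal{D}_k \mid \theta)\; p(\theta \mid \mathcal{D}_{1:k-1})}{p(\mathcal{D}_k \mid \mathcal{D}_{1:k-1})}
  \;\propto\; p(\mathcal{D}_k \mid \theta)\; p(\theta \mid \mathcal{D}_{1:k-1})
```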
Read this carefully. The posterior from all previous tasks, p(θ|D1:k-1), becomes the prior for the new task. The likelihood p(Dk|θ) captures the new data. Bayesian updating is inherently sequential — it's continual learning baked into probability theory.
The posterior p(θ|D1:k-1) is intractable for neural networks. The Laplace approximation says: approximate it as a Gaussian centered at the MAP estimate μk-1 with precision given by the Fisher information matrix:
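As a sketch in the same notation, writing F̂ for the Fisher information estimated on the earlier tasks (so the covariance is its inverse), the approximation reads:

```latex
p(\theta \mid \mathcal{D}_{1:k-1}) \;\approx\; \mathcal{N}\!\left(\theta;\ \mu_{k-1},\ \hat{F}^{-1}\right)
```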
Taking the negative log gives us the EWC loss (we'll derive it fully in Chapter 3):
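Under that Gaussian, and introducing a weighting coefficient λ (an assumed hyperparameter, standard in EWC but not defined in the text above), the objective takes the following form up to an additive constant:

```latex
\mathcal{L}_{\mathrm{EWC}}(\theta)
  \;=\; \ell_k(\theta) \;+\; \frac{\lambda}{2}\,(\theta - \mu_{k-1})^{\top}\, \hat{F}\, (\theta - \mu_{k-1}),
\qquad
\ell_k(\theta) \;=\; -\log p(\mathcal{D}_k \mid \theta)
```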
The Fisher matrix F̂ acts as a per-parameter importance measure. Parameters that were crucial for old tasks have high Fisher values → the quadratic penalty strongly discourages changing them. Parameters with low Fisher values are free to adapt.
The survey highlights another theoretical angle: models that converge to flat minima in the loss landscape are more robust to forgetting. Why? A flat minimum means the loss doesn't change much when parameters shift slightly. So when gradient descent for task B nudges weights away from task A's solution, a flat minimum means task A's loss increases only gently. Sharp minima, by contrast, are fragile — even small parameter changes cause large loss increases.
If the gradient for the new task is orthogonal to the gradient subspace of old tasks, learning the new task doesn't interfere with old ones at all. This is the theoretical ideal that optimization-based methods (Chapter 5) try to achieve.
Stability-plasticity tradeoff.
The first family of methods fights forgetting by adding a penalty term to the loss function. The penalty discourages changes to parameters that were important for previous tasks. The key question: how do you measure "importance"?
EWC (Kirkpatrick et al., 2017) is the flagship regularization method. It starts from the Bayesian framework we derived in Chapter 2. Let's trace the full derivation.
After learning task k−1, we have the MAP estimate μk-1. The Laplace approximation gives:
Approximating the prior as Gaussian with precision F̂:
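A compact version of these two steps, in assumed notation (ℓ_k for the negative log-likelihood of the new task, λ for a penalty weight, and a diagonal Fisher estimate in the last line), might look like this:

```latex
\begin{aligned}
-\log p(\theta \mid \mathcal{D}_{1:k})
  &= -\log p(\mathcal{D}_k \mid \theta) \;-\; \log p(\theta \mid \mathcal{D}_{1:k-1}) \;+\; \text{const} \\
  &\approx \ell_k(\theta) \;+\; \tfrac{1}{2}\,(\theta - \mu_{k-1})^{\top}\, \hat{F}\, (\theta - \mu_{k-1}) \;+\; \text{const} \\[4pt]
\mathcal{L}_{\mathrm{EWC}}(\theta)
  &= \ell_k(\theta) \;+\; \frac{\lambda}{2} \sum_i \hat{F}_i\, \bigl(\theta_i - \mu_{k-1,i}\bigr)^2
\end{aligned}
```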
Here F̂i is the i-th diagonal entry of the Fisher information matrix (the entry for parameter i), computed on the data from previous tasks. Highly important parameters (high F̂i) are strongly penalized for moving. Unimportant parameters (low F̂i) are free to change.
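To make the penalty concrete, here is a minimal NumPy sketch. It rests on several assumptions: logistic regression stands in for the network, the Fisher diagonal is the empirical version (squared per-sample gradients using the observed labels), and the data, `mu_prev`, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_fisher(theta, X, y):
    """Empirical diagonal Fisher: mean squared per-sample gradient of the
    logistic-regression log-likelihood, evaluated at theta."""
    p = sigmoid(X @ theta)                      # predicted P(y = 1 | x)
    per_sample_grad = (y - p)[:, None] * X      # shape (n_samples, n_params)
    return np.mean(per_sample_grad ** 2, axis=0)

def ewc_penalty(theta, mu_prev, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - mu_prev_i)^2"""
    return 0.5 * lam * float(np.sum(fisher_diag * (theta - mu_prev) ** 2))

def ewc_penalty_grad(theta, mu_prev, fisher_diag, lam):
    """Gradient of the penalty, to be added to the new-task gradient."""
    return lam * fisher_diag * (theta - mu_prev)

# Toy "old task" (assumed data): feature 2 is nearly constant, so the
# likelihood barely depends on its weight and its Fisher value is tiny.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(500, 3))
X_old[:, 2] *= 0.01
y_old = (X_old[:, 0] > 0).astype(float)
mu_prev = np.array([2.0, 0.0, 0.0])   # stand-in for the old-task MAP estimate

fisher = diagonal_fisher(mu_prev, X_old, y_old)

# Moving the same distance costs far more along a high-Fisher coordinate.
step = 0.5
cost_important = ewc_penalty(mu_prev + step * np.eye(3)[0], mu_prev, fisher, lam=100.0)
cost_unimportant = ewc_penalty(mu_prev + step * np.eye(3)[2], mu_prev, fisher, lam=100.0)
```

In a training loop, `ewc_penalty_grad` would simply be added to the gradient of the new-task loss at every step.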
SI (Zenke et al., 2017) takes a different approach to measuring importance. Instead of the Fisher matrix (which requires a separate computation pass), SI accumulates importance online during training. For each parameter, it tracks the path integral of the gradient along the optimization trajectory:
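Following the notation of Zenke et al. (symbols assumed here: g_i is the loss gradient with respect to parameter i, ν indexes tasks, Δ_i^{(ν)} is the total change of parameter i during task ν, and ξ is a small damping constant), the per-task contribution and the resulting importance can be sketched as:

```latex
\omega_i^{(\nu)}
  \;=\; -\int_{t_{\nu-1}}^{t_\nu} g_i\bigl(\theta(t)\bigr)\, \frac{d\theta_i}{dt}\, dt
  \;\approx\; -\sum_{t} g_i(t)\, \Delta\theta_i(t),
\qquad
\Omega_i \;=\; \sum_{\nu < k} \frac{\omega_i^{(\nu)}}{\bigl(\Delta_i^{(\nu)}\bigr)^2 + \xi}
```

The discrete sum is the form accumulated in practice, one term per optimizer step.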
Parameters that contributed most to reducing the loss (large gradient times large parameter change) get high importance. This is computationally cheaper than EWC and doesn't require storing and processing old data to compute the Fisher matrix.
LwF (Li & Hoiem, 2016) uses knowledge distillation instead of parameter penalties. Before training on the new task, it records the model's outputs on the new data under the old parameters (the "soft targets"). Then during training, it adds a distillation loss that encourages the model to maintain these soft targets:
This preserves the model's functional behavior rather than its parameters. The insight: what matters isn't that weights stay the same, but that the model's predictions stay consistent.
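A minimal NumPy sketch of such a distillation term follows. The function names, the temperature of 2, and the weight `lam` are illustrative assumptions, and the cross-entropy form shown is one common way to write the loss, not necessarily the exact one used in the original paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Cross-entropy between the softened old outputs (soft targets)
    and the softened outputs of the model being trained."""
    soft_targets = softmax(old_logits, temperature)
    log_probs = np.log(softmax(new_logits, temperature) + 1e-12)
    return float(-np.mean(np.sum(soft_targets * log_probs, axis=1)))

def lwf_loss(new_task_loss, new_logits_old_head, recorded_old_logits,
             lam=1.0, temperature=2.0):
    """New-task loss plus a weighted distillation term on the old-task outputs."""
    return new_task_loss + lam * distillation_loss(
        new_logits_old_head, recorded_old_logits, temperature)

# Old-head logits recorded on new-task inputs before training (toy values).
recorded = np.array([[2.0, 0.5, -1.0],
                     [0.1, 1.5, 0.3]])

unchanged = distillation_loss(recorded, recorded)          # outputs preserved
drifted = distillation_loss(recorded[:, ::-1], recorded)   # outputs changed
```

The term is smallest when the new outputs match the recorded ones, so any drift in the old-task predictions raises the loss regardless of how the weights moved.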
The most intuitive solution to forgetting: just keep some old data around and replay it while learning new tasks. If the model trains on a mix of new and old data, it shouldn't forget. Replay methods implement this idea in different ways.
The simplest approach: maintain a fixed-size memory buffer M of exemplars from previous tasks. When training on task k, interleave samples from M with the current task data:
The critical question: which samples should you store? With a buffer of only 200 samples per class (typical budget), selection matters enormously. Common strategies include:
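One widely used strategy is reservoir sampling, which keeps a uniform random subset of everything seen so far without knowing the stream length in advance. The sketch below is illustrative only: the class and function names are hypothetical, and it also shows the interleaving of buffer samples with current-task data.

```python
import random

class ReservoirBuffer:
    """Fixed-size memory that holds a uniform random sample of the stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)          # buffer not full yet
        else:
            j = self.rng.randrange(self.num_seen)   # uniform in [0, num_seen)
            if j < self.capacity:
                self.items[j] = example         # replace a stored example

    def sample(self, batch_size):
        k = min(batch_size, len(self.items))
        return self.rng.sample(self.items, k)

def mixed_batch(current_batch, buffer, replay_size):
    """Concatenate current-task examples with examples replayed from memory."""
    return list(current_batch) + buffer.sample(replay_size)

# Fill the buffer during "task 1", then build a mixed batch during "task 2".
buffer = ReservoirBuffer(capacity=5)
for i in range(100):
    buffer.add(("task1", i))

batch = mixed_batch([("task2", 0), ("task2", 1)], buffer, replay_size=3)
```

Every example in the stream ends up in the buffer with the same probability, which is why this strategy needs no task boundaries.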
What if you can't store real data (privacy constraints, storage limits)? Generate it instead. Deep Generative Replay (DGR, Shin et al., 2017) trains a generative model (GAN or VAE) alongside the task model. When learning task k:
The generator itself must avoid forgetting, creating a recursive problem. DGR solves this by replaying to the generator too — the generator replays to itself.
A middle ground: instead of replaying raw inputs, replay features from an intermediate layer. This is cheaper (features are smaller than images), more privacy-preserving (features can't easily be inverted to raw inputs), and can be more effective (the feature space is where classification actually happens).
Memory buffer management: exemplar selection strategies compared.
Instead of penalizing parameter changes (regularization) or replaying old data, optimization-based methods directly manipulate the gradient to prevent interference. The idea: constrain the gradient for the new task so it doesn't increase the loss on old tasks.
GEM (Lopez-Paz & Ranzato, 2017) stores a small episodic memory of old samples (like experience replay), but uses them differently. Instead of training on them, it uses them to constrain the gradient. The constraint:
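Writing g for the gradient on the current task k and g_j for the gradient computed on the memory M_j of old task j (symbols assumed here), the constraint can be stated as:

```latex
\langle g,\, g_j \rangle \;\ge\; 0 \quad \text{for all } j < k,
\qquad
g = \nabla_\theta\, \ell(\theta; \mathcal{D}_k), \quad
g_j = \nabla_\theta\, \ell(\theta; \mathcal{M}_j)
```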
In words: the gradient for the new task must have a non-negative inner product with the gradient for every old task, computed on that task's memory samples. To first order, this ensures that an update on the new task does not increase the loss on the stored samples of any old task.
If the constraint is violated, GEM projects the current gradient onto the closest feasible direction using quadratic programming:
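With g the proposed gradient and g̃ the projected one (same assumed symbols as in the constraint), the projection can be sketched as the following program:

```latex
\min_{\tilde{g}} \ \tfrac{1}{2}\, \lVert g - \tilde{g} \rVert_2^2
\quad \text{subject to} \quad
\langle \tilde{g},\, g_j \rangle \;\ge\; 0 \ \ \text{for all } j < k
```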
GEM's quadratic programming is expensive (scales with the number of tasks). A-GEM (Chaudhry et al., 2019) simplifies: instead of satisfying constraints for every old task individually, it uses a single averaged gradient from a random subset of the memory:
This is just a single gradient projection — vastly cheaper than solving QP. The price: it's an approximation, so it may allow small increases in old task loss.
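The A-GEM rule can be sketched in a few lines of NumPy. When the current gradient g conflicts with the reference gradient g_ref from memory (negative inner product), the component of g along g_ref is removed; otherwise g is used as is. The function name and the toy vectors are hypothetical.

```python
import numpy as np

def agem_project(grad, ref_grad):
    """Return grad unchanged if it does not conflict with ref_grad;
    otherwise remove its component along ref_grad."""
    dot = float(grad @ ref_grad)
    if dot >= 0.0:
        return grad.copy()
    return grad - (dot / float(ref_grad @ ref_grad)) * ref_grad

g = np.array([1.0, -2.0])       # gradient on the current task
g_ref = np.array([0.0, 1.0])    # averaged gradient on a memory batch

g_proj = agem_project(g, g_ref)             # conflict: component along g_ref removed
g_same = agem_project(np.array([1.0, 3.0]), g_ref)   # no conflict: unchanged
```

After projection the inner product with the reference gradient is zero, so to first order the averaged memory loss does not increase.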
OGD (Orthogonal Gradient Descent) projects the new task gradient onto the subspace orthogonal to all previous task gradients. If the gradient spaces are truly orthogonal, tasks don't interfere at all.
GPM (Gradient Projection Memory) goes further: it maintains a compact representation of the gradient subspace for old tasks using SVD. The new task gradient is projected to be orthogonal to this subspace. As more tasks arrive, the "forbidden" subspace grows, and the model's freedom shrinks — eventually limiting plasticity.
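The shared geometric step, projecting a gradient onto the orthogonal complement of a stored subspace, can be sketched as follows. This is a simplification under stated assumptions: the basis is built by an SVD of a few stored directions, whereas the actual methods construct their subspaces differently and per layer. Function names are hypothetical.

```python
import numpy as np

def orthonormal_basis(old_directions, tol=1e-10):
    """Columns of the result form an orthonormal basis of the span of the
    rows of old_directions (one stored direction per row)."""
    u, s, _ = np.linalg.svd(old_directions.T, full_matrices=False)
    return u[:, s > tol]

def project_orthogonal(grad, basis):
    """Remove from grad every component lying in the span of basis."""
    return grad - basis @ (basis.T @ grad)

# Two stored old-task directions spanning the first two coordinates.
old = np.array([[1.0, 0.0, 0.0],
                [1.0, 1.0, 0.0]])
basis = orthonormal_basis(old)

g_new = np.array([3.0, -2.0, 5.0])
g_proj = project_orthogonal(g_new, basis)   # only the third coordinate survives
```

Each additional stored direction adds a column to the basis, which is the shrinking freedom described above: in three dimensions, two stored directions leave a single direction to learn in.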
GEM gradient projection: the new gradient is projected to satisfy old-task constraints.
The last two families of methods take a structural approach: either learn features that are inherently robust to forgetting, or change the network architecture itself to accommodate new tasks.
The core insight: if you have a strong, general-purpose feature extractor, you can learn new tasks by only adapting a small classification head. The backbone features are robust enough to generalize across tasks without fine-tuning.
Pre-trained backbones: Models like CLIP, DINOv2, or ImageNet-pretrained ResNets already have excellent features. Freeze the backbone, train only the classifier per task. This is simple and surprisingly effective — especially in the CIL setting where the survey shows pre-trained backbones dominate.
Prompt tuning for CL: L2P (Learning to Prompt, Wang et al., 2022) maintains a pool of learnable prompts. For each input, a key-query matching mechanism selects relevant prompts, which are prepended to the frozen transformer's input. Different tasks naturally select different prompts, providing implicit task-specific adaptation without modifying the backbone. DualPrompt extends this with complementary "general" and "expert" prompt spaces.
Contrastive learning: Co2L and other methods use supervised contrastive learning to create tightly clustered, well-separated representations. Features learned contrastively are more transferable and less prone to forgetting because they capture class-discriminative structure rather than task-specific shortcuts.
Instead of fighting forgetting within a fixed architecture, grow the architecture to accommodate new knowledge.
Progressive Neural Networks (Rusu et al., 2016) add a new column of layers for each task, with lateral connections to all previous columns. Old columns are frozen — zero forgetting by construction. The cost: model size grows linearly with tasks.
PackNet (Mallya & Lazebnik, 2018) takes the opposite approach: instead of growing, it carves out task-specific subnetworks within a fixed architecture. After training on each task, it prunes the least important weights (by magnitude), freezes the remaining ones, and reassigns the pruned weights to the next task. Each task gets its own binary mask.
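The mask bookkeeping can be illustrated with a small NumPy sketch. It covers only the magnitude-based selection and the freeing of pruned weights; the retraining that follows pruning in the actual method is omitted, and the function name and `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def assign_task_mask(weights, free, keep_fraction):
    """Among the still-free weights, keep the largest-magnitude fraction
    for the current task and return them as a boolean mask."""
    free_idx = np.flatnonzero(free)
    n_keep = int(round(keep_fraction * free_idx.size))
    order = np.argsort(-np.abs(weights.ravel()[free_idx]))   # largest first
    kept = free_idx[order[:n_keep]]
    task_mask = np.zeros(weights.size, dtype=bool)
    task_mask[kept] = True
    return task_mask.reshape(weights.shape)

weights = np.array([[0.9, -0.1, 0.4],
                    [-0.7, 0.05, 0.2]])
free = np.ones_like(weights, dtype=bool)     # nothing is frozen before task 1

mask_task1 = assign_task_mask(weights, free, keep_fraction=0.5)
weights_task1 = np.where(mask_task1, weights, 0.0)   # pruned weights are zeroed
free = free & ~mask_task1    # kept weights are frozen; the rest go to the next task
```

At inference time, applying the mask of task 1 recovers exactly the subnetwork that was frozen for it, whatever later tasks did with the remaining weights.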
Parameter isolation methods (HAT, SupSup) learn attention masks or routing functions that select different parameter subsets for different tasks. The shared parameters enable forward transfer; the isolation prevents forgetting.
Measuring continual learning performance requires more than just final accuracy. We need metrics that capture forgetting, forward transfer, and overall trajectory. The survey defines six key metrics.
Let ai,j be the accuracy on task j's test set after training on task i. This gives us a K×K matrix A where entry (i,j) captures the model's state at time i on task j.
Average Accuracy (AA) is the simplest metric: after all K tasks, average the accuracy across all tasks. This is the headline number, but it doesn't tell you how you got there.
Average Incremental Accuracy (AIA) averages the AA obtained at each step. This captures the trajectory: a method that maintains high accuracy throughout scores better than one that dips and recovers.
Forgetting Measure (FM): for each old task j, forgetting is the largest drop from any earlier accuracy on that task to its current accuracy. Average this across all old tasks. FM = 0 means no forgetting; FM > 0 means performance was lost.
Backward Transfer (BWT) asks how much learning later tasks affects earlier tasks. BWT < 0 means forgetting (typical). BWT > 0 means positive backward transfer: learning new tasks actually helped old tasks (rare but possible when tasks share structure).
Forward Transfer (FWT) asks how much learning earlier tasks helps on later tasks, compared to training from scratch. ãj is the accuracy of a randomly initialized reference model trained only on task j. FWT > 0 means earlier tasks provided useful knowledge (positive transfer).
Intransigence Measure (IM) compares the accuracy the continual learner reaches on task k with ak*, the accuracy achievable by training on task k alone (the oracle). IM measures how much the prior tasks hinder learning the new task. High IM means the model is too stable: it can't adapt.
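The metrics can be computed directly from the accuracy matrix. The sketch below uses one common convention, which is an assumption about the exact formulas intended here: FWT averages a(j,j) − ãj over tasks 2..k, and IM is ak* − a(k,k). The matrix values and reference accuracies are made up for illustration, and only the lower triangle of A (task j evaluated after step i ≥ j) is used.

```python
import numpy as np

def average_accuracy(A, k):
    """AA_k: mean accuracy over tasks 1..k after training on task k (k is 1-indexed)."""
    return float(np.mean(A[k - 1, :k]))

def average_incremental_accuracy(A, k):
    """AIA_k: mean of AA_1, ..., AA_k."""
    return float(np.mean([average_accuracy(A, i) for i in range(1, k + 1)]))

def forgetting_measure(A, k):
    """FM_k: for each old task, best earlier accuracy minus current accuracy,
    averaged over old tasks. Earlier steps start when the task was learned."""
    drops = [np.max(A[j:k - 1, j]) - A[k - 1, j] for j in range(k - 1)]
    return float(np.mean(drops))

def backward_transfer(A, k):
    """BWT_k: current accuracy minus accuracy right after learning, averaged over old tasks."""
    return float(np.mean([A[k - 1, j] - A[j, j] for j in range(k - 1)]))

def forward_transfer(A, ref_acc, k):
    """FWT_k: accuracy right after learning task j minus a from-scratch reference, tasks 2..k."""
    return float(np.mean([A[j, j] - ref_acc[j] for j in range(1, k)]))

def intransigence(A, oracle_acc, k):
    """IM_k: oracle accuracy on task k minus the continual learner's accuracy on it."""
    return float(oracle_acc[k - 1] - A[k - 1, k - 1])

# Row i: state after training on task i+1. Column j: accuracy on task j+1.
A = np.array([
    [0.90, 0.00, 0.00],
    [0.70, 0.85, 0.00],
    [0.60, 0.75, 0.80],
])
ref_acc = np.array([0.90, 0.80, 0.78])      # hypothetical from-scratch accuracies
oracle_acc = np.array([0.90, 0.88, 0.86])   # hypothetical oracle accuracies

K = A.shape[0]
aa, aia = average_accuracy(A, K), average_incremental_accuracy(A, K)
fm, bwt = forgetting_measure(A, K), backward_transfer(A, K)
fwt, im = forward_transfer(A, ref_acc, K), intransigence(A, oracle_acc, K)
```

In this toy matrix accuracy on each old task only ever declines, so FM equals −BWT; the two diverge when an old task's accuracy peaks at some step after the one in which it was learned.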
Continual learning isn't just an academic exercise. The survey catalogs applications across every major ML domain where the real world refuses to be i.i.d.
Class-incremental image classification is the most studied setting. Benchmarks like Split-CIFAR100 (20 tasks of 5 classes each) and Split-ImageNet (10 tasks of 100 classes each) are standard. The state of the art is dominated by methods using pre-trained ViT backbones with prompt tuning (L2P, DualPrompt, CODA-Prompt).
Object detection: A detector must learn new object categories without forgetting old ones. This is harder than classification because detectors must handle background, localization, and classification simultaneously. The background class is especially problematic — old-class objects in new-task images are labeled as "background," actively teaching the model to forget them.
Semantic segmentation: Similar to detection but at the pixel level. Each pixel must be classified, and new classes appear incrementally. Methods like PLOP, MiB, and RECALL adapt distillation and replay for dense prediction.
Continual learning in NLP faces unique challenges: pre-trained language models (BERT, GPT) already encode massive knowledge, and the question is how to sequentially adapt them. Key settings include:
RL agents face continual learning naturally: environments change, new tasks are assigned, reward functions evolve. Policy distillation and progressive nets have been applied to sequential game learning. The challenge is compounded because RL already has non-stationary data (the policy changes the data distribution).
Can a GAN or diffusion model learn to generate new data distributions without forgetting old ones? Lifelong GAN and CLoG address this by combining replay mechanisms with generative architectures. The generator must avoid mode collapse across all seen distributions.
The newest frontier: continually pre-training or fine-tuning foundation models (GPT, CLIP, LLaMA). The challenge is preserving the model's broad capabilities while specializing for new domains. Methods like LoRA-based adapters and prompt tuning provide lightweight continual adaptation without modifying the base model.
Continual learning doesn't exist in isolation. It connects to several neighboring fields, and understanding these connections clarifies what makes continual learning unique.
Meta-learning learns how to learn: it finds initializations or optimizers that generalize across tasks. MAML-style approaches can be adapted for continual learning: the meta-learned initialization should be a good starting point for any future task while preserving performance on past tasks. OML (Online-aware Meta-Learning) explicitly optimizes for this.
Transfer learning moves knowledge from a source task to a target task, but doesn't require maintaining performance on the source. Continual learning adds the retention requirement: succeed on the new task and maintain the old ones. Forward transfer in CL is essentially transfer learning; backward transfer is the uniquely CL phenomenon.
Multi-task learning trains on all tasks simultaneously and is usually treated as the upper bound for CL. It doesn't face forgetting because all data is always available. The gap between multi-task performance and continual learning performance is one way to measure CL difficulty. As CL methods improve, they approach this upper bound.
Curriculum learning orders training examples from easy to hard. In CL, the task order is typically fixed by the environment, but some work explores whether reordering tasks can reduce forgetting. The connection: both concern the order in which data is presented to the learner.
| Family | Core Idea | Key Methods | Weakness |
|---|---|---|---|
| Regularization | Penalize changing important params | EWC, SI, LwF | Accumulating constraints reduce plasticity |
| Replay | Store or generate old data | ER, DGR, iCaRL | Memory cost, privacy concerns |
| Optimization | Constrain/project gradients | GEM, A-GEM, GPM | Gradient space fills up over tasks |
| Representation | Learn robust, transferable features | L2P, DualPrompt, Co2L | Relies on pre-trained backbone quality |
| Architecture | Grow or mask network per task | PackNet, PNN, HAT | Model size grows, capacity limited |