Continual Learning Survey

Chapter 0: The Problem

You train a neural network to classify cats and dogs. It reaches 95% accuracy. Then you train it on a new dataset of birds and fish. When you go back and test on cats and dogs, accuracy has collapsed to near-random. The network has catastrophically forgotten what it learned.

This isn't a minor annoyance; it's a fundamental failure mode. Standard training assumes that all data is available at once and drawn from a single fixed distribution (the i.i.d. assumption). But the real world delivers data sequentially: a self-driving car encounters new road conditions, a medical AI sees new diseases, a chatbot learns new topics. Each time, the model must learn the new without destroying the old.

Why this happens: When you train on task B, gradient descent adjusts weights to minimize loss on B. But those same weights encoded task A's knowledge. The gradients for B have no mechanism to preserve A's solution; they simply overwrite it. This is catastrophic forgetting (also called catastrophic interference), first identified by McCloskey & Cohen (1989) and later reviewed by French (1999).
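The overwriting can be reproduced in a few lines. The sketch below is a minimal illustration, assuming a two-weight linear model and two made-up single-point tasks (`x_a`, `x_b` and the helper names are hypothetical, not from the survey). Training on B alone pulls the shared weight away from A's solution.

```python
import numpy as np

def sgd(w, x, y, lr=0.1, steps=500):
    """Plain gradient descent on squared error for a linear model w @ x."""
    for _ in range(steps):
        grad = 2.0 * (w @ x - y) * x   # d/dw of (w.x - y)^2
        w = w - lr * grad
    return w

def loss(w, x, y):
    return float((w @ x - y) ** 2)

# Hypothetical toy tasks: one training point each, sharing the first weight.
x_a, y_a = np.array([1.0, 0.0]), 1.0   # task A
x_b, y_b = np.array([1.0, 1.0]), 0.0   # task B

w = sgd(np.zeros(2), x_a, y_a)
loss_a_before = loss(w, x_a, y_a)   # task A is solved

w = sgd(w, x_b, y_b)                # train on B only; nothing protects A
loss_a_after = loss(w, x_a, y_a)    # task A loss has gone back up
loss_b_after = loss(w, x_b, y_b)    # task B is solved
```

In this toy setting a solution that fits both tasks exists (weights [1, −1]), but plain gradient descent on B has no reason to find it.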

Continual learning (also called incremental learning, lifelong learning, or sequential learning) aims to solve this: learn a sequence of tasks T₁, T₂, ..., T_K while maintaining performance on all of them. The challenge is to be plastic enough to learn new things while being stable enough to retain old knowledge.

The survey by Wang et al. (2023) is one of the most comprehensive maps of the field. It organizes over 200 papers into a clean taxonomy: 7 learning scenarios, a unifying Bayesian framework, and 5 families of methods. Let's walk through the entire landscape.

Catastrophic forgetting: accuracy on Task A collapses as training shifts to Task B.

Why does training on Task B destroy a neural network's performance on Task A?

Gradient descent for B modifies the same weights that encoded A's knowledge, with no mechanism to preserve them
The network runs out of capacity
Task B's data corrupts the stored training data for Task A

Chapter 1: The Landscape

Not all continual learning problems are the same. The survey identifies seven distinct scenarios, each defined by what changes between tasks and what information is available at test time. Understanding these scenarios is crucial because the difficulty varies enormously — some are almost trivial, others remain unsolved.

The seven scenarios

Instance-Incremental Learning (IIL) is the simplest. The task stays the same; data just arrives in batches. Think of training a spam classifier that gets new email batches every week. Same labels, same distribution, just more data. Standard online learning handles this.

Domain-Incremental Learning (DIL) keeps the same label space but the input distribution shifts. A self-driving car trained in sunny California must adapt to snowy Boston. The output classes (stop sign, pedestrian, lane) stay the same, but the visual appearance changes dramatically.

Task-Incremental Learning (TIL) introduces disjoint label sets across tasks, but gives the model a task ID at test time. You know which task you're being tested on, so you can use a task-specific classification head. This is a significant simplification — the model just needs to learn task-specific features without confusing tasks.

Class-Incremental Learning (CIL) is the hardest standard scenario. Like TIL, the label sets are disjoint (Task 1 has classes 1-10, Task 2 has classes 11-20), but there is no task ID at test time. The model must classify among all classes seen so far. It must not only remember old class boundaries but also distinguish old classes from new ones without being told which task an input belongs to.

Why CIL is so hard: In TIL, the model picks from 10 classes per task. In CIL, after 5 tasks it picks from 50 classes, including classes it hasn't seen training data for since task 1. The model must maintain inter-task decision boundaries with no task oracle. This is the benchmark that most modern methods target.
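The difference at test time can be sketched as follows. This is a minimal illustration with made-up logits; the function names and the assumption that every task has the same number of classes are hypothetical.

```python
import numpy as np

def predict_til(logits, task_id, classes_per_task):
    """Task-incremental: the task ID restricts the choice to that task's classes."""
    start = task_id * classes_per_task
    stop = start + classes_per_task
    return start + int(np.argmax(logits[start:stop]))

def predict_cil(logits):
    """Class-incremental: no task ID, so every class seen so far competes."""
    return int(np.argmax(logits))

# Hypothetical logits for 2 tasks x 3 classes; the true class is 1 (task 0).
logits = np.array([0.2, 1.5, 0.1, 0.3, 2.0, 0.4])
til_pred = predict_til(logits, task_id=0, classes_per_task=3)  # chooses among classes 0-2
cil_pred = predict_cil(logits)                                 # chooses among classes 0-5
```

With the task ID the model answers correctly; without it, an overconfident logit from the newer task wins, which is the typical CIL failure.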

Task-Free Continual Learning (TFCL) removes task boundaries entirely. The data distribution shifts gradually and continuously — there are no discrete task switches. The learner must detect and adapt to distribution changes on its own.

Online Continual Learning (OCL) adds a single-pass constraint: each data point is seen exactly once. No epoch-based training, no replaying the current batch. This mirrors streaming data in production systems.

Continual Pre-training (CPT) is the newest scenario, motivated by foundation models. A pre-trained model (like GPT or CLIP) is sequentially pre-trained on new domains or corpora while preserving its general capabilities. The goal is downstream transfer, not performance on any single pre-training task.

Scenario comparison.

Formal definition

A continual learning problem is a sequence of K datasets D₁, D₂, ..., D_K, where D_k = {(x_i, y_i)}_{i=1}^{n_k} contains n_k samples drawn from distribution P_k(X, Y). When learning task k, only D_k is available (the previous data D_1:k-1 is no longer accessible). The goal: after training on all K tasks, perform well on the test sets of all tasks.

What makes Class-Incremental Learning (CIL) harder than Task-Incremental Learning (TIL)?

CIL provides no task ID at test time, so the model must distinguish among ALL classes ever seen without knowing which task an input belongs to
CIL has more training data
CIL requires larger models

Chapter 2: Stability vs Plasticity

At the heart of continual learning lies a fundamental tradeoff. A model needs plasticity (the ability to learn new tasks) and stability (the ability to retain old tasks). Too much plasticity → catastrophic forgetting. Too much stability → intransigence (inability to learn new things).

The Bayesian framework

The survey unifies continual learning methods through Bayesian inference. Suppose we've learned tasks 1 through k−1, encoding our knowledge in a posterior p(θ|D_1:k-1). When task k arrives, Bayes' rule gives us:

p(θ|D_1:k) ∝ p(D_k|θ) · p(θ|D_1:k-1)

Read this carefully. The posterior from all previous tasks, p(θ|D_1:k-1), becomes the prior for the new task. The likelihood p(D_k|θ) captures the new data. Bayesian updating is inherently sequential — it's continual learning baked into probability theory.

The key insight: Many continual learning methods can be seen as approximations to this Bayesian recursion. The prior p(θ|D_1:k-1) encodes stability (what we know), and the likelihood p(D_k|θ) provides plasticity (what we're learning). The tradeoff is about how we balance these two terms.
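When the posterior is tractable, the recursion is exact. The sketch below uses a conjugate Beta-Bernoulli (coin-flip) model, chosen only because its posterior has a closed form; the model and the data are illustrative assumptions, not from the survey. Updating task by task gives the same posterior as seeing all the data at once.

```python
def update(prior, data):
    """Beta-Bernoulli conjugate update: prior (a, b) plus a list of 0/1 observations."""
    a, b = prior
    ones = sum(data)
    return (a + ones, b + len(data) - ones)

d1 = [1, 0, 1, 1]      # hypothetical "task 1" observations
d2 = [0, 0, 1]         # hypothetical "task 2" observations
prior = (1, 1)         # uniform Beta(1, 1) prior

sequential = update(update(prior, d1), d2)   # posterior after D_1 becomes the prior for D_2
batch = update(prior, d1 + d2)               # all data at once
```

For neural networks no such closed form exists, which is why the methods that follow rely on approximations.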

From Bayes to EWC

The posterior p(θ|D_1:k-1) is intractable for neural networks. The Laplace approximation says: approximate it as a Gaussian centered at the MAP estimate μ_k-1 with precision given by the Fisher information matrix:

p(θ|D_1:k-1) ≈ N(θ; μ_k-1, F̂_1:k-1⁻¹)

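Taking the negative log of the posterior p(θ|D_1:k) under this approximation, writing ℓ_k(θ) = −log p(D_k|θ) for the new-task loss and adding a hyperparameter λ to weight the penalty, gives the EWC loss (we'll derive it fully in Chapter 3):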

L_EWC(θ) = ℓ_k(θ) + (λ/2) (θ − μ_k-1)^T F̂_1:k-1 (θ − μ_k-1)

The Fisher matrix F̂ acts as a per-parameter importance measure. Parameters that were crucial for old tasks have high Fisher values → the quadratic penalty strongly discourages changing them. Parameters with low Fisher values are free to adapt.

Flat loss landscapes

The survey highlights another theoretical angle: models that converge to flat minima in the loss landscape are more robust to forgetting. Why? A flat minimum means the loss doesn't change much when parameters shift slightly. So when gradient descent for task B nudges weights away from task A's solution, a flat minimum means task A's loss increases only gently. Sharp minima, by contrast, are fragile — even small parameter changes cause large loss increases.
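A one-dimensional quadratic makes the argument concrete. In this sketch the curvature values and the drift `delta` are made-up numbers; the loss around task A's minimum is modeled as 0.5 · curvature · θ².

```python
def loss_increase(curvature, delta):
    """Increase of the 1-D quadratic loss 0.5 * curvature * theta**2 when the
    parameter moves a distance delta away from its minimum."""
    return 0.5 * curvature * delta ** 2

delta = 0.5                           # hypothetical drift caused by training on task B
flat = loss_increase(0.1, delta)      # flat minimum: small curvature
sharp = loss_increase(10.0, delta)    # sharp minimum: large curvature
```

The same drift costs 100 times more loss at the sharp minimum, since the increase scales linearly with curvature.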

Gradient orthogonality

If the gradient for the new task is orthogonal to the gradient subspace of old tasks, then to first order a step on the new task leaves the old-task losses unchanged, so learning the new task doesn't interfere with the old ones. This is the theoretical ideal that optimization-based methods (Chapter 5) try to achieve.

Stability-plasticity tradeoff.

In the Bayesian framework for continual learning, what role does the posterior from previous tasks play?

It becomes the prior for the new task, encoding stability by constraining how far parameters can move
It is discarded after each task
It determines the learning rate

Chapter 3: Regularization Methods

The first family of methods fights forgetting by adding a penalty term to the loss function. The penalty discourages changes to parameters that were important for previous tasks. The key question: how do you measure "importance"?

EWC: Elastic Weight Consolidation

EWC (Kirkpatrick et al., 2017) is the flagship regularization method. It starts from the Bayesian framework we derived in Chapter 2. Let's trace the full derivation.

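After learning task k−1, we have the MAP estimate μ_k-1. Taking the log of the Bayesian recursion from Chapter 2 gives:

log p(θ|D_1:k) = log p(D_k|θ) + log p(θ|D_1:k-1) + const

Applying the Laplace approximation to the prior (a Gaussian centered at μ_k-1 with precision F̂), keeping only the diagonal of F̂, negating, and weighting the penalty by a hyperparameter λ: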

L_EWC(θ) = ℓ_k(θ) + (λ/2) ∑_i F̂_i (θ_i − μ_k-1,i)²

Here F̂_i is the i-th diagonal entry of the Fisher information matrix, computed on the data from previous tasks. Highly important parameters (high F̂_i) are strongly penalized for moving. Unimportant parameters (low F̂_i) are free to change.
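The intermediate step from the Gaussian prior to the per-parameter penalty can be written out as follows. This is a sketch using the symbols defined above; dropping the off-diagonal Fisher terms is the standard diagonal approximation.

```latex
\begin{aligned}
-\log p(\theta \mid D_{1:k-1})
  &\approx \tfrac{1}{2}\,(\theta - \mu_{k-1})^{\top}\,\hat{F}_{1:k-1}\,(\theta - \mu_{k-1}) + \text{const} \\
  &\approx \tfrac{1}{2}\sum_i \hat{F}_i\,(\theta_i - \mu_{k-1,i})^2 + \text{const}
  \qquad \text{(diagonal of } \hat{F}_{1:k-1} \text{ only)}
\end{aligned}
```

Adding the new-task loss ℓ_k(θ) = −log p(D_k|θ) and scaling the quadratic term by λ recovers L_EWC(θ).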

Think of it this way: EWC treats each weight like a spring. Important weights are attached to stiff springs (high F) anchored at their old values. Unimportant weights have loose springs. When you train on the new task, gradient descent can freely move the loose springs but must fight the stiff ones — hence "elastic weight consolidation."
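The spring picture can be tried on a toy problem. The sketch below assumes a two-weight linear model with squared error and one made-up training point per task. It also assumes a Gaussian likelihood with unit noise, under which the diagonal Fisher on task A reduces to `x_a ** 2`. All names and values are illustrative.

```python
import numpy as np

def ewc_penalty_grad(theta, mu, fisher_diag, lam):
    """Gradient of (lam / 2) * sum_i F_i * (theta_i - mu_i)^2."""
    return lam * fisher_diag * (theta - mu)

# Toy setup (all values hypothetical): linear model, squared error.
x_a, y_a = np.array([1.0, 0.0]), 1.0    # task A training point
x_b, y_b = np.array([1.0, 1.0]), 0.0    # task B training point
mu = np.array([1.0, 0.0])               # weights after training on task A
# Assumption: Gaussian likelihood with unit noise, so the diagonal Fisher
# on task A is x_a ** 2 (stiff spring on weight 0, no spring on weight 1).
fisher_diag = x_a ** 2

theta = mu.copy()
for _ in range(1000):
    grad_b = 2.0 * (theta @ x_b - y_b) * x_b
    theta = theta - 0.05 * (grad_b + ewc_penalty_grad(theta, mu, fisher_diag, lam=10.0))

loss_a = float((theta @ x_a - y_a) ** 2)   # task A is still solved
loss_b = float((theta @ x_b - y_b) ** 2)   # task B is solved too
```

The penalty holds weight 0 at its old value, so the optimizer solves task B by moving only the unconstrained weight 1.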

Synaptic Intelligence (SI)

SI (Zenke et al., 2017) takes a different approach to measuring importance. Instead of the Fisher matrix (which requires a separate computation pass), SI accumulates importance online during training. For each parameter, it tracks the path integral of the gradient along the optimization trajectory:

ω_i = −∑_t ∇_{θ_i} L(t) · Δθ_i(t)

Parameters whose updates contributed most to reducing the loss get high importance. Under gradient descent each product ∇_{θ_i} L(t) · Δθ_i(t) is negative, so the minus sign makes ω_i positive. This is computationally cheaper than EWC because it needs no extra pass over the data at the end of each task to estimate the Fisher matrix.
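A minimal sketch of the online accumulation is shown below, assuming plain gradient descent and a made-up one-point regression task. The sign is chosen so that ω is positive for parameters whose updates reduced the loss. The full method also normalizes ω by each parameter's total displacement, which is omitted here.

```python
import numpy as np

def train_with_si(theta, grad_fn, lr=0.1, steps=200):
    """Gradient descent that also accumulates a per-parameter path integral."""
    omega = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        delta = -lr * g
        omega += -g * delta        # positive whenever the step reduced the loss
        theta = theta + delta
    return theta, omega

# Hypothetical task: squared error on a single point that only uses weight 0.
x, y = np.array([1.0, 0.0]), 1.0
grad_fn = lambda th: 2.0 * (th @ x - y) * x

theta, omega = train_with_si(np.zeros(2), grad_fn)
```

Weight 0 did all the work and accumulates a positive importance; weight 1 never moved and stays at zero.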

Learning without Forgetting (LwF)

LwF (Li & Hoiem, 2016) uses knowledge distillation instead of parameter penalties. Before training on the new task, it records the model's outputs on the new data under the old parameters (the "soft targets"). Then during training, it adds a distillation loss that encourages the model to maintain these soft targets:

L_LwF = ℓ_new(θ) + λ · KL(f_old(x) ‖ f_θ(x))

This preserves the model's functional behavior rather than its parameters. The insight: what matters isn't that weights stay the same, but that the model's predictions stay consistent.
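The loss above can be sketched in NumPy as follows. This is a simplified illustration: it applies the distillation term to a single set of logits and leaves out details such as temperature scaling and separate per-task output heads. The function names and example logits are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) per row, for rows that are probability vectors."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def lwf_loss(new_logits, labels, old_logits, lam=1.0):
    """Cross-entropy on the new labels plus lam * KL(old outputs || new outputs)."""
    probs = softmax(new_logits)
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    distill = kl_divergence(softmax(old_logits), probs)
    return float(np.mean(ce + lam * distill))

old_logits = np.array([[2.0, 0.0, 0.0]])   # recorded before training on the new task
labels = np.array([1])
same = lwf_loss(old_logits, labels, old_logits)                     # outputs unchanged
drift = lwf_loss(np.array([[0.0, 0.0, 2.0]]), labels, old_logits)   # outputs drifted
```

Both calls have the same cross-entropy on the new label, so the difference between `drift` and `same` is exactly the distillation penalty for moving away from the recorded soft targets.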

Prior-focused vs Likelihood-focused: EWC and SI are prior-focused — they constrain the parameter space by penalizing changes to important weights. LwF is likelihood-focused — it constrains the function space by preserving input-output mappings. Both approximate the Bayesian recursion, just from different angles.

How does EWC decide which parameters are "important" for old tasks?

Via the diagonal Fisher information matrix: parameters with high Fisher values had large influence on old task performance and are penalized for changing
By measuring parameter magnitude: larger weights are more important
By random selection

Chapter 4: Replay Methods

The most intuitive solution to forgetting: just keep some old data around and replay it while learning new tasks. If the model trains on a mix of new and old data, it shouldn't forget. Replay methods implement this idea in different ways.

Experience Replay

The simplest approach: maintain a fixed-size memory buffer M of exemplars from previous tasks. When training on task k, interleave samples from M with the current task data:

L_ER(θ) = E_{(x,y)~D_k}[ℓ(f_θ(x), y)] + α · E_{(x,y)~M}[ℓ(f_θ(x), y)]

The critical question: which samples should you store? With a buffer of only 20 samples per class (a typical budget), selection matters enormously. Common strategies include:

Random sampling: simple but surprisingly effective
Herding: select exemplars whose mean feature representation is closest to the class mean (used by iCaRL)
Reservoir sampling: maintains a uniform sample of all data seen so far, even without knowing the total count in advance — ideal for streaming settings
Gradient-based: store samples that produce the most diverse or representative gradients
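Of the strategies listed, reservoir sampling is the easiest to show in code. The sketch below is a standard reservoir update (Algorithm R); the function name, buffer size, and stream are illustrative assumptions.

```python
import random

def reservoir_update(buffer, capacity, item, n_seen, rng):
    """Offer one stream item to the buffer; n_seen counts the items seen before it."""
    if len(buffer) < capacity:
        buffer.append(item)             # buffer not full yet: always store
    else:
        j = rng.randint(0, n_seen)      # uniform over 0..n_seen inclusive
        if j < capacity:
            buffer[j] = item            # replace a stored item
    return n_seen + 1

rng = random.Random(0)
buffer, n_seen = [], 0
for item in range(1000):                # hypothetical stream of 1000 samples
    n_seen = reservoir_update(buffer, 10, item, n_seen, rng)
```

Each new item is accepted with probability capacity / (n_seen + 1), which keeps every item seen so far equally likely to be in the buffer without knowing the stream length.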

Generative Replay

What if you can't store real data (privacy constraints, storage limits)? Generate it instead. Deep Generative Replay (DGR, Shin et al., 2017) trains a generative model (GAN or VAE) alongside the task model. When learning task k:

Use the generator to produce pseudo-samples from previous task distributions
Mix these with real data from task k
Train both the task model and the generator on this mixed dataset

The generator itself must avoid forgetting, creating a recursive problem. DGR solves this by replaying to the generator too — the generator replays to itself.

The memory-accuracy tradeoff: Experience replay with just 20 samples per class can match or exceed regularization methods with zero replay. But storing raw samples raises privacy concerns and has a linear memory cost. Generative replay avoids storing data but requires training and maintaining a separate generative model, which can itself suffer from mode collapse or forgetting.

Feature replay

A middle ground: instead of replaying raw inputs, replay features from an intermediate layer. This is cheaper (features are smaller than images), more privacy-preserving (features can't easily be inverted to raw inputs), and can be more effective (the feature space is where classification actually happens).

Memory buffer management: exemplar selection strategies compared.

What is the key advantage of generative replay over experience replay?

It doesn't require storing real data samples, avoiding privacy concerns and fixed memory costs
It always produces better accuracy
It is faster to train

Chapter 5: Optimization Methods

Instead of penalizing parameter changes (regularization) or replaying old data, optimization-based methods directly manipulate the gradient to prevent interference. The idea: constrain the gradient for the new task so it doesn't increase the loss on old tasks.

GEM: Gradient Episodic Memory

GEM (Lopez-Paz & Ranzato, 2017) stores a small episodic memory of old samples (like experience replay), but uses them differently. Instead of training on them, it uses them to constrain the gradient. The constraint:

⟨∇L_k, ∇L_t⟩ ≥ 0 for all old tasks t

In words: the gradient for the new task must have a non-negative inner product with the gradient for every old task. To first order, this ensures that an update on the new task does not increase the loss on any old task, as measured on the stored memory samples.

If the constraint is violated, GEM projects the current gradient onto the closest feasible direction using quadratic programming:

ĝ = argmin_g ‖g − ∇L_k‖² s.t. g^T ∇L_t ≥ 0 ∀ t

A-GEM: Averaged GEM

GEM's quadratic programming is expensive (scales with the number of tasks). A-GEM (Chaudhry et al., 2019) simplifies: instead of satisfying constraints for every old task individually, it uses a single reference gradient g_ref, averaged over a random batch drawn from the memory:

if ⟨∇L_k, g_ref⟩ < 0: ĝ = ∇L_k − (⟨∇L_k, g_ref⟩ / ‖g_ref‖²) g_ref

This is just a single gradient projection — vastly cheaper than solving QP. The price: it's an approximation, so it may allow small increases in old task loss.
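The projection is short enough to write directly. A minimal sketch with made-up gradient vectors (the function name and the values are assumptions):

```python
import numpy as np

def agem_project(grad, grad_ref):
    """Return grad unchanged if it agrees with grad_ref; otherwise remove
    the component that points against grad_ref."""
    dot = grad @ grad_ref
    if dot >= 0:
        return grad
    return grad - (dot / (grad_ref @ grad_ref)) * grad_ref

g_new = np.array([1.0, -1.0])    # hypothetical new-task gradient
g_ref = np.array([0.0, 1.0])     # hypothetical gradient on a memory batch
g_proj = agem_project(g_new, g_ref)
```

After projection the inner product with `g_ref` is exactly zero, so the update no longer opposes the memory gradient, while the component of `g_new` unrelated to `g_ref` is kept.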

Gradient Projection Methods

OGD (Orthogonal Gradient Descent) projects the new task gradient onto the subspace orthogonal to all previous task gradients. If the gradient spaces are truly orthogonal, tasks don't interfere at all.

GPM (Gradient Projection Memory) goes further: it maintains a compact representation of the gradient subspace for old tasks using SVD. The new task gradient is projected to be orthogonal to this subspace. As more tasks arrive, the "forbidden" subspace grows, and the model's freedom shrinks — eventually limiting plasticity.

The geometry of no-forgetting: Think of each task as claiming a subspace of the gradient space. If task A uses dimensions 1-3 and task B uses dimensions 4-6, their gradients are orthogonal and there's zero interference. But real tasks share structure — their gradient subspaces overlap. Gradient projection methods try to find the best compromise: update along directions that help the new task while being orthogonal to (or at least not opposing) old task directions.
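The subspace picture can be sketched with an SVD. This is a simplified illustration of projecting out a protected subspace, not the exact procedure of OGD or GPM; the `energy` threshold, function names, and gradient values are assumptions.

```python
import numpy as np

def protected_basis(old_grads, energy=0.99):
    """Orthonormal basis (columns) capturing most of the old-gradient energy, via SVD."""
    u, s, _ = np.linalg.svd(old_grads.T, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    keep = np.searchsorted(ratio, energy) + 1   # smallest rank reaching the threshold
    return u[:, :keep]

def project_out(grad, basis):
    """Remove from grad its component inside the protected subspace."""
    return grad - basis @ (basis.T @ grad)

# Hypothetical old-task gradients (rows) living in the first two dimensions.
old_grads = np.array([[1.0, 0.0, 0.0],
                      [1.0, 1.0, 0.0]])
basis = protected_basis(old_grads)
g_new = np.array([0.5, -2.0, 3.0])
g_proj = project_out(g_new, basis)
```

Only the third dimension is left for the new task here. With more old tasks the basis gains columns and fewer free directions remain, which is the plasticity loss described above.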

GEM gradient projection: the new gradient is projected to satisfy old-task constraints.

What does GEM's constraint ⟨∇L_k, ∇L_t⟩ ≥ 0 enforce?

That each gradient step for the new task must not increase the loss on any old task; the gradients must point in compatible directions
That the new task gradient must be large
That the learning rate must decrease over time

Chapter 6: Representation & Architecture

The last two families of methods take a structural approach: either learn features that are inherently robust to forgetting, or change the network architecture itself to accommodate new tasks.

Representation-based methods

The core insight: if you have a strong, general-purpose feature extractor, you can learn new tasks by only adapting a small classification head. The backbone features are robust enough to generalize across tasks without fine-tuning.

Pre-trained backbones: Models like CLIP, DINOv2, or ImageNet-pretrained ResNets already have excellent features. Freeze the backbone, train only the classifier per task. This is simple and surprisingly effective — especially in the CIL setting where the survey shows pre-trained backbones dominate.

Prompt tuning for CL: L2P (Learning to Prompt, Wang et al., 2022) maintains a pool of learnable prompts. For each input, a key-query matching mechanism selects relevant prompts, which are prepended to the frozen transformer's input. Different tasks naturally select different prompts, providing implicit task-specific adaptation without modifying the backbone. DualPrompt extends this with complementary "general" and "expert" prompt spaces.
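The key-query matching step can be sketched as a top-N cosine-similarity lookup. This is an illustrative simplification: the query would come from the frozen backbone's features, and here both the query and the keys are made-up vectors with hypothetical names.

```python
import numpy as np

def select_prompts(query, keys, top_n):
    """Indices of the top_n prompt keys most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = k @ q
    return np.argsort(-sims)[:top_n]

keys = np.array([[1.0, 0.0],      # hypothetical learnable keys, one per prompt
                 [0.0, 1.0],
                 [1.0, 1.0]])
query = np.array([2.0, 0.1])      # hypothetical feature of the current input
chosen = select_prompts(query, keys, top_n=2)
```

The prompts attached to the chosen keys would then be prepended to the input tokens; inputs from different tasks tend to land near different keys and so pick different prompts.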

Contrastive learning: Co²L and other methods use supervised contrastive learning to create tightly clustered, well-separated representations. Features learned contrastively are more transferable and less prone to forgetting because they capture class-discriminative structure rather than task-specific shortcuts.

Architecture-based methods

Instead of fighting forgetting within a fixed architecture, grow the architecture to accommodate new knowledge.

Progressive Neural Networks (Rusu et al., 2016) add a new column of layers for each task, with lateral connections to all previous columns. Old columns are frozen — zero forgetting by construction. The cost: model size grows linearly with tasks.

PackNet (Mallya & Lazebnik, 2018) takes the opposite approach: instead of growing, it carves out task-specific subnetworks within a fixed architecture. After training on each task, it prunes the least important weights (by magnitude), freezes the remaining ones, and reassigns the pruned weights to the next task. Each task gets its own binary mask.
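The prune-and-freeze step can be sketched on a flat weight vector. This is a minimal illustration of magnitude-based mask assignment only; the function name, the `keep_fraction` parameter, and the weights are assumptions.

```python
import numpy as np

def packnet_split(weights, free_mask, keep_fraction):
    """Among the currently free weights, keep the largest-magnitude fraction
    for the current task and release the rest for later tasks."""
    free_idx = np.flatnonzero(free_mask)
    n_keep = int(round(keep_fraction * free_idx.size))
    order = np.argsort(-np.abs(weights[free_idx]))   # largest magnitude first
    task_mask = np.zeros_like(free_mask)
    task_mask[free_idx[order[:n_keep]]] = True
    return task_mask, free_mask & ~task_mask

weights = np.array([0.9, -0.1, 0.05, -0.8, 0.3, 0.02])   # hypothetical trained weights
free = np.ones(6, dtype=bool)
mask_1, free = packnet_split(weights, free, keep_fraction=0.5)
```

Weights selected by `mask_1` would be frozen for task 1, and the remaining entries in `free` are available for the next task, which repeats the same split on its own share.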

Parameter isolation methods (HAT, SupSup) learn attention masks or routing functions that select different parameter subsets for different tasks. The shared parameters enable forward transfer; the isolation prevents forgetting.

The representation revolution: The survey's most striking finding is that simple methods with strong pre-trained representations often beat sophisticated continual learning algorithms with weaker backbones. This suggests that much of continual learning's difficulty comes from representation quality, not the learning algorithm itself. In the foundation model era, the problem may be shifting from "how to learn sequentially" to "how to adapt large pre-trained models sequentially."

Why do pre-trained backbones (like CLIP) help with continual learning?

They provide robust, general-purpose features that transfer across tasks: only a small classifier head needs updating, minimizing forgetting
They are faster to train
They use less memory

Chapter 7: Evaluation Metrics

Measuring continual learning performance requires more than just final accuracy. We need metrics that capture forgetting, forward transfer, and overall trajectory. The survey defines six key metrics.

Setup

Let a_i,j be the accuracy on task j's test set after training on task i. This gives us a K×K matrix A where entry (i,j) captures the model's state at time i on task j.

Average Accuracy (AA)

AA_K = (1/K) ∑_{j=1}^{K} a_K,j

The simplest metric: after all K tasks, average the accuracy across all tasks. This is the headline number, but it doesn't tell you how you got there.

Average Incremental Accuracy (AIA)

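AIA = (1/K) ∑_{k=1}^{K} AA_k

Here AA_k = (1/k) ∑_{j=1}^{k} a_k,j is the average accuracy over the first k tasks after training on task k. Averaging AA_k over all steps captures the trajectory: a method that maintains high accuracy throughout scores better than one that dips and recovers.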

Forgetting Measure (FM)

f_j,k = max_{i∈{1,...,k-1}} (a_i,j − a_k,j)

FM_K = (1/(K-1)) ∑_{j=1}^{K-1} f_j,K

For each old task j, forgetting is the maximum accuracy drop from any previous point. Average this across all old tasks. FM = 0 means no forgetting; FM > 0 means performance was lost.

Worked example: Suppose after Task 1, accuracy on Task 1 is 90%. After Task 2, it drops to 75%. After Task 3, it drops to 70%. The forgetting for Task 1 after Task 3 is f_1,3 = max(90−70, 75−70) = 20%. The max runs over all earlier time steps, so it measures the drop from peak performance to the current accuracy.

Backward Transfer (BWT)

BWT_K = (1/(K-1)) ∑_{j=1}^{K-1} (a_K,j − a_j,j)

How much does learning later tasks affect earlier tasks? BWT < 0 means forgetting (typical). BWT > 0 means positive backward transfer — learning new tasks actually helped old tasks (rare but possible when tasks share structure).

Forward Transfer (FWT)

FWT_K = (1/(K-1)) ∑_{j=2}^{K} (a_j,j − ã_j)

How much does learning earlier tasks help on later tasks, compared to training from scratch? ã_j is the accuracy on task j of a reference model trained on D_j alone from random initialization. FWT > 0 means earlier tasks provided useful knowledge (positive transfer).

Intransigence Measure (IM)

IM_k = a_k^* − a_k,k

Here a_k^* is the accuracy on task k of a reference model trained jointly on all data seen so far, D_1:k (the oracle). IM measures how far the continual learner falls short of that reference on the newest task. High IM means the model is too stable: it can't adapt.

The full picture: AA tells you the result, and AIA tells you how it evolved along the way. FM and BWT tell you how much you forgot. FWT tells you how much you transferred forward. IM tells you how much stability hurt plasticity. Together, these six metrics give a complete picture of a continual learner's behavior.
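Four of these metrics need only the accuracy matrix and can be computed as below. The sketch uses a made-up 3-task matrix whose first column matches the worked example (90%, 75%, 70%); the other entries are hypothetical. FWT and IM are left out because they need separately trained reference models.

```python
import numpy as np

def cl_metrics(acc):
    """acc[i, j]: accuracy on task j after training on task i (0-indexed, K x K).
    Only entries on or below the diagonal are used."""
    K = acc.shape[0]
    aa = [acc[k, :k + 1].mean() for k in range(K)]            # AA_1 ... AA_K
    fm = np.mean([np.max(acc[j:K - 1, j] - acc[K - 1, j])     # worst drop per old task
                  for j in range(K - 1)])
    bwt = np.mean([acc[K - 1, j] - acc[j, j] for j in range(K - 1)])
    return {"AA": float(aa[-1]), "AIA": float(np.mean(aa)),
            "FM": float(fm), "BWT": float(bwt)}

acc = np.array([[0.90, 0.00, 0.00],
                [0.75, 0.85, 0.00],
                [0.70, 0.80, 0.88]])
metrics = cl_metrics(acc)
```

For this matrix FM is 0.125 and BWT is −0.125. The two agree in magnitude because each task's accuracy peaked right after it was learned; they differ when accuracy on an old task peaks at some later step.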

What does negative Backward Transfer (BWT < 0) indicate?

The model has forgotten: accuracy on earlier tasks decreased after learning later ones
The model failed to learn new tasks
The training diverged

Chapter 8: Applications

Continual learning isn't just an academic exercise. The survey catalogs applications across every major ML domain where the real world refuses to be i.i.d.

Computer Vision

Class-incremental image classification is the most studied setting. Benchmarks like Split-CIFAR100 (20 tasks of 5 classes each) and Split-ImageNet (10 tasks of 100 classes each) are standard. The state of the art is dominated by methods using pre-trained ViT backbones with prompt tuning (L2P, DualPrompt, CODA-Prompt).

Object detection: A detector must learn new object categories without forgetting old ones. This is harder than classification because detectors must handle background, localization, and classification simultaneously. The background class is especially problematic — old-class objects in new-task images are labeled as "background," actively teaching the model to forget them.

Semantic segmentation: Similar to detection but at the pixel level. Each pixel must be classified, and new classes appear incrementally. Methods like PLOP, MiB, and RECALL adapt distillation and replay for dense prediction.

Natural Language Processing

Continual learning in NLP faces unique challenges: pre-trained language models (BERT, GPT) already encode massive knowledge, and the question is how to sequentially adapt them. Key settings include:

Continual relation extraction: Learn new relation types without forgetting old ones
Continual named entity recognition: New entity types appear over time
Continual text classification: New sentiment categories, topics, or intents
Continual language model pre-training: Adapt LLMs to new domains or languages sequentially

Reinforcement Learning

RL agents face continual learning naturally: environments change, new tasks are assigned, reward functions evolve. Policy distillation and progressive nets have been applied to sequential game learning. The challenge is compounded because RL already has non-stationary data (the policy changes the data distribution).

Generative Models

Can a GAN or diffusion model learn to generate new data distributions without forgetting old ones? Lifelong GAN and CLoG address this by combining replay mechanisms with generative architectures. The generator must avoid mode collapse across all seen distributions.

Foundation Models

The newest frontier: continually pre-training or fine-tuning foundation models (GPT, CLIP, LLaMA). The challenge is preserving the model's broad capabilities while specializing for new domains. Methods like LoRA-based adapters and prompt tuning provide lightweight continual adaptation without modifying the base model.

The common thread: Across all applications, the survey finds that the best-performing methods combine (1) a strong pre-trained backbone, (2) a small replay buffer or distillation mechanism, and (3) careful task-boundary handling. No single method family dominates — the best approach depends on the specific constraints (memory budget, privacy requirements, task structure).

Why is continual object detection harder than continual classification?

Old-class objects in new-task images are labeled as "background," actively teaching the model to forget them
Detection models have more parameters
Detection requires color images

Chapter 9: Connections

Continual learning doesn't exist in isolation. It connects to several neighboring fields, and understanding these connections clarifies what makes continual learning unique.

Meta-Learning

Meta-learning learns how to learn: it finds initializations or optimizers that generalize across tasks. MAML-style approaches can be adapted for continual learning: the meta-learned initialization should be a good starting point for any future task while preserving performance on past tasks. OML (Online-aware Meta-Learning) explicitly optimizes for this.

Transfer Learning

Transfer learning moves knowledge from a source task to a target task, but doesn't require maintaining performance on the source. Continual learning adds the retention requirement: succeed on the new task and maintain the old ones. Forward transfer in CL is essentially transfer learning; backward transfer is the uniquely CL phenomenon.

Multi-Task Learning

Multi-task learning trains on all tasks simultaneously (the upper bound for CL). It doesn't face forgetting because all data is always available. The gap between multi-task performance and continual learning performance is one way to measure CL difficulty. As CL methods improve, they approach this upper bound.

Curriculum Learning

Curriculum learning orders training examples from easy to hard. In CL, the task order is typically fixed by the environment, but some work explores whether reordering tasks can reduce forgetting. The connection: both concern the order in which data is presented to the learner.

The five families: a cheat sheet

Family	Core Idea	Key Methods	Weakness
Regularization	Penalize changing important params	EWC, SI, LwF	Accumulating constraints reduce plasticity
Replay	Store or generate old data	ER, DGR, iCaRL	Memory cost, privacy concerns
Optimization	Constrain/project gradients	GEM, A-GEM, GPM	Gradient space fills up over tasks
Representation	Learn robust, transferable features	L2P, DualPrompt, Co²L	Relies on pre-trained backbone quality
Architecture	Grow or mask network per task	PackNet, PNN, HAT	Model size grows, capacity limited

The survey's verdict: No single method dominates. The best approach depends on your constraints: memory budget, privacy requirements, number of tasks, availability of pre-trained models, and whether you have task IDs at test time. The field is converging on hybrid methods that combine the strengths of multiple families — e.g., replay + distillation + pre-trained backbone.

What is the fundamental difference between continual learning and transfer learning?

Continual learning requires maintaining performance on ALL previous tasks, not just transferring to the new one
Continual learning uses more data
Transfer learning is only for NLP