Mathematics of Continual Learning

Chapter 0: Catastrophic Forgetting, Mathematically

You train a network on Task A. It performs beautifully. Then you train it on Task B. It learns B — but now it has forgotten A. This is catastrophic forgetting, and it has plagued neural networks since the 1980s.

But what does forgetting actually look like mathematically? Peng and Vidal give us an elegant framework. Suppose we have T tasks, each defined by a data pair (x_t, y_t). After training on tasks 1 through j, we get parameters θ^j. The error matrix captures everything:

ε_ij = y_i − x_i^T θ^j

This is the error on task i after training through task j. Arrange these errors in a matrix with one row per training stage j and one column per task i. Three learning paradigms correspond to three regions of this matrix:

Online learning: Only the entries next to the diagonal, where task i is evaluated with θⁱ⁻¹ (parameters before seeing task i)
Fine-tuning: Only the diagonal, where task i is evaluated with θⁱ (parameters after training on task i)
Continual learning: The entire lower triangle, every ε_ij with i ≤ j, because we care about performance on all past tasks at every stage

The key insight: Continual learning is the hardest paradigm because it demands that every entry in the lower triangle of the error matrix be small. Online learning only needs one diagonal. Fine-tuning only needs another. CL needs the whole triangle.

From this matrix, two metrics capture everything we care about:

MSE at time t: E_t = (1/t) ∑_{i=1}^{t} ε_{it}²

Forgetting at time t: F_t = (1/(t−1)) ∑_{i=1}^{t−1} (ε_{it}² − ε_{ii}²)

MSE is the average squared error across all tasks seen so far, evaluated at the current parameters. Forgetting measures how much worse we got on old tasks compared to right after we trained on them. If F_t > 0, we forgot. If F_t < 0, we actually got better at old tasks — that's positive backward transfer.
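As a concrete reading of these two definitions, here is a minimal NumPy sketch. The helper names (error_matrix, mse_at, forgetting_at) and the two-task toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def error_matrix(X, y, thetas):
    """eps[i, j] = y_i - x_i^T theta^j (0-based task i, stage j).

    X: (T, d) inputs, one row per task; y: (T,) targets;
    thetas: (T, d), row j holds the parameters after training through task j.
    """
    return y[:, None] - X @ thetas.T

def mse_at(eps, t):
    """E_t: mean squared error over tasks 1..t at stage t (t is 1-based)."""
    return np.mean(eps[:t, t - 1] ** 2)

def forgetting_at(eps, t):
    """F_t: mean increase in squared error on tasks 1..t-1 since each was learned."""
    if t < 2:
        raise ValueError("forgetting needs at least one earlier task (t >= 2)")
    current = eps[: t - 1, t - 1] ** 2        # error on old tasks now
    just_after = np.diag(eps)[: t - 1] ** 2   # error right after learning them
    return np.mean(current - just_after)

# Toy example: two 1-D tasks whose solutions conflict (theta = 1 vs. theta = 3).
X = np.array([[1.0], [1.0]])
y = np.array([1.0, 3.0])
thetas = np.array([[1.0], [3.0]])   # theta^1 fits task 1, theta^2 fits task 2
eps = error_matrix(X, y, thetas)

print(mse_at(eps, 2))         # 2.0: task 1 now has squared error 4, task 2 has 0
print(forgetting_at(eps, 2))  # 4.0: task 1 went from squared error 0 to 4
```

In this toy case fitting task 2 exactly destroys the fit to task 1, so F_2 is positive.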

Why is continual learning harder than fine-tuning or online learning?

CL requires good performance on ALL past tasks at EVERY stage (the full lower triangle of the error matrix), while fine-tuning and online learning only need a single diagonal
CL uses larger models
CL has more hyperparameters to tune

Chapter 1: The Hidden Connection

In 1960, Bernard Widrow and Marcian Hoff invented the Least Mean Squares (LMS) algorithm. Their guiding principle was deceptively simple: inject new information with minimal disturbance to stored information. They called it the minimal disturbance principle.

Widrow tried to apply this to multi-layer networks in the 1960s and 70s. It didn't work well — the tools for training deep networks didn't exist yet. So he pivoted to adaptive filtering, where LMS became the dominant algorithm for noise cancellation, echo suppression, and channel equalization.

Meanwhile, the neural network community ran into the same problem in the late 1980s, named it "catastrophic interference," and spent the following decades developing methods to fight it. They built Experience Replay, Elastic Weight Consolidation, Gradient Episodic Memory, Progressive Neural Networks: a zoo of techniques.

The 60-year-old secret: The adaptive filtering community had already solved these problems mathematically. LMS is online SGD. The Affine Projection Algorithm is gradient projection. Recursive Least Squares is regularization. The Kalman filter models task relationships. Peng and Vidal (2025) finally connected the dots.

Here is the mapping — Table I from the paper — that unifies 60 years of parallel research:

Adaptive Filter	CL Method	Memory	Key Property
LMS	Online SGD	None	Must revisit tasks to converge
APA	ICL / Gradient projection	Past samples	Exact solution with full replay
RLS	Regularization (EWC)	Sufficient statistics	Forgetting factor trades recency vs. accuracy
Kalman Filter	Task-relationship models	State + covariance	Positive backward transfer possible

Each row adds more structure: more memory, more assumptions about the task sequence, and better theoretical guarantees. The progression from LMS to KF is the progression from "no memory, must revisit" to "models how tasks relate, can improve the past."

What is the minimal disturbance principle (Widrow, 1960)?

Adapt to new information with minimal disruption to already-stored knowledge, the core objective of all continual learning methods
Minimize the norm of the gradient during training
Use the smallest possible learning rate

Chapter 2: LMS — The Simplest Continual Learner

Let's derive LMS from the minimal disturbance principle. We have current parameters θ^{t−1} and a new task (x_t, y_t). We want to find new parameters θ^t that:

Stay as close as possible to θ^{t−1} (minimal disturbance)
Reduce the error on the new task

Formally:

minimize ||θ − θ^{t−1}||² subject to y_t − x_t^T θ = (1 − γ) ε_t

Here ε_t = y_t − x_t^T θ^{t−1} is the error on the new task before the update, and γ ∈ (0, 1] controls how much of that error we correct. When γ = 1, we fully correct the error on the new task. When γ < 1, we only partially correct it.

This is a constrained optimization with a quadratic objective and a single linear equality constraint. Using the method of Lagrange multipliers, we get:

θ^t = θ^{t−1} − γ (x_t^T θ^{t−1} − y_t) x_t / ||x_t||²

If we absorb ||x_t||² into the learning rate (or normalize the data), this simplifies to:

θ^t = θ^{t−1} − γ (x_t^T θ^{t−1} − y_t) x_t

This is online SGD. The update rule is identical: take the current error, multiply by the input, scale by a learning rate, and subtract from the current parameters. The LMS algorithm invented in 1960 for adaptive filtering IS the same algorithm used for online gradient descent in neural networks.
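The normalized update can be checked numerically against the constraint it was derived from. The sketch below is illustrative only: the function name lms_step and the random test values are assumptions.

```python
import numpy as np

def lms_step(theta, x, y, gamma=1.0):
    """Normalized minimal-disturbance update for one sample (x, y)."""
    err = y - x @ theta                      # error before the update
    return theta + gamma * err * x / (x @ x)

rng = np.random.default_rng(0)
theta_prev = rng.normal(size=4)
x_t, y_t, gamma = rng.normal(size=4), 0.7, 0.5

theta_new = lms_step(theta_prev, x_t, y_t, gamma)
err_before = y_t - x_t @ theta_prev
err_after = y_t - x_t @ theta_new

# The constraint of the optimization problem: the error shrinks by a factor (1 - gamma).
print(np.isclose(err_after, (1 - gamma) * err_before))
```

The step is always a multiple of x_t, which is why the update moves the parameters as little as possible: any component orthogonal to x_t would add distance without changing the error.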

Convergence guarantees

The paper proves two key theorems about LMS convergence in the continual learning setting:

Theorem 1 (i.i.d. tasks): If tasks are drawn i.i.d. from a distribution with input covariance Σ_x, then the expected MSE converges exponentially:

E[E_t] ≤ (1 − γ(2−γ)λ_min(Σ_x))^t ||θ*||²

The convergence rate depends on λ_min(Σ_x), the smallest eigenvalue of the input covariance. Ill-conditioned data (small λ_min) means slow convergence. The bound is tightest at γ = 1 (full correction), where the factor γ(2−γ) reaches its maximum of 1.

Theorem 2 (recurring tasks): For tasks that cycle through a fixed set with period T and γ = 1, the MSE after seeing each task at least once satisfies:

E_T ≤ ||θ*||² / (e(T−1))

LMS only converges if tasks are repeated. See a task once and move on? LMS will forget it. This is the fundamental limitation of memoryless algorithms.
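A two-task toy problem makes the difference between one pass and repeated passes visible. This is a hedged sketch with made-up data (two inputs in two dimensions, θ* = (1, 2), γ = 1); the helper run_passes is a name introduced here for illustration.

```python
import numpy as np

def lms_step(theta, x, y, gamma=1.0):
    """Normalized LMS update for one sample."""
    return theta + gamma * (y - x @ theta) * x / (x @ x)

def run_passes(X, y, n_passes, gamma=1.0):
    """Start from zero and cycle through the tasks n_passes times."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):
            theta = lms_step(theta, x_t, y_t, gamma)
    return theta

# Two tasks that share the same true parameters theta* = (1, 2).
X = np.array([[1.0, 0.0], [1.0, 1.0]])
theta_star = np.array([1.0, 2.0])
y = X @ theta_star

errors_one_pass = y - X @ run_passes(X, y, 1)
errors_many_passes = y - X @ run_passes(X, y, 40)

print(errors_one_pass)      # task 1 error is -1, task 2 error is 0: task 1 was forgotten
print(errors_many_passes)   # both errors are numerically zero after revisiting
```

After a single pass the update for task 2 has pushed the parameters off task 1's hyperplane. Each further cycle halves the remaining error in this example, because the two hyperplanes meet at 45 degrees.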

Why does LMS (online SGD) forget previously learned tasks?

LMS has no memory: each update only considers the current task, so the parameter vector drifts away from old task solutions unless those tasks are revisited
The learning rate is too large
LMS uses a nonlinear update rule that causes instability

Chapter 3: APA — Memory Equals Solution

LMS forgets because it only uses the current sample. What if we remembered all past samples and demanded that the new parameters satisfy all constraints simultaneously?

The Affine Projection Algorithm (APA) does exactly this. Instead of projecting onto a single hyperplane (the constraint from the current task), it projects onto the intersection of all hyperplanes (the constraints from all tasks seen so far).

minimize ||θ − θ^{t−1}||² subject to X_:t^T θ = y_:t

Where X_:t = [x₁, ..., x_t] stacks all past inputs and y_:t = [y₁, ..., y_t]^T stacks all past targets. This says: find the closest point to θ^{t−1} that satisfies every task constraint.

The closed-form solution

Using Lagrange multipliers on this constrained least-squares problem:

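θ^t = θ^{t−1} + X_:t (X_:t^T X_:t)⁻¹ (y_:t − X_:t^T θ^{t−1})

Starting from θ⁰ = 0, this collapses to θ^t = X_:t (X_:t^T X_:t)⁻¹ y_:t, the minimum-norm solution of the t task constraints (if the inputs are linearly dependent, for example when t > d, replace the inverse with a pseudoinverse). It coincides with the ordinary least-squares solution once the constraints determine θ uniquely. With noiseless targets, APA with full memory satisfies every constraint seen so far at every step. Once we've seen at least as many linearly independent tasks as parameters (t ≥ d), we get θ^t = θ*, the true solution. Zero forgetting. Zero error. Exact recovery.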

Why does this work so perfectly? Because each task constraint y_i = x_i^Tθ defines a hyperplane in parameter space. The true θ* lies at the intersection of all these hyperplanes. By projecting onto their intersection, we converge to θ* as soon as we have enough constraints to uniquely determine it.
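The projection onto the intersection of hyperplanes can be sketched in a few lines of NumPy. The names (apa_step, seen_errors) and the random noiseless data are assumptions for illustration; np.linalg.pinv is used so the same code covers both t ≤ d and t > d.

```python
import numpy as np

def apa_step(theta_prev, X_seen, y_seen):
    """Project theta_prev onto {theta : X_seen^T theta = y_seen}.

    X_seen is d x t, one column per task seen so far. The pseudoinverse
    assumes the constraints are consistent (noiseless targets).
    """
    residual = y_seen - X_seen.T @ theta_prev
    return theta_prev + np.linalg.pinv(X_seen.T) @ residual

rng = np.random.default_rng(0)
d = 4
theta_star = rng.normal(size=d)
X = rng.normal(size=(d, d))     # d tasks; column t-1 is x_t
y = X.T @ theta_star            # noiseless targets

theta = np.zeros(d)
seen_errors = []
for t in range(1, d + 1):
    theta = apa_step(theta, X[:, :t], y[:t])
    # Largest absolute error over ALL tasks seen so far, not just the newest one.
    seen_errors.append(np.max(np.abs(y[:t] - X[:, :t].T @ theta)))

print(seen_errors)                      # all numerically zero: nothing is forgotten
print(np.allclose(theta, theta_star))   # exact recovery once t = d
```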

The connection to modern CL methods is immediate:

Experience Replay: Storing and replaying past data is exactly what APA does — keeping all past constraints active
Ideal Continual Learner (ICL): The theoretical gold standard in CL is equivalent to APA with full memory
Gradient Episodic Memory (GEM): Projects the current gradient to avoid increasing past task losses — an approximate version of APA's projection

The tradeoff is obvious: APA requires storing all past data. For t tasks in d dimensions, we need O(td) memory and O(td²) computation per step. Practical CL methods approximate this by storing a subset (coreset) of past samples.

Why does APA with full memory achieve zero forgetting?

It projects onto the intersection of ALL past task constraints simultaneously, converging to the exact OLS solution θ* once enough linearly independent tasks are seen
It uses a very small learning rate
It freezes old parameters

Chapter 4: RLS — Graceful Forgetting

APA stores all past data. But what if we want to prioritize recent tasks? In a non-stationary world, old tasks might no longer be relevant. We need a way to gracefully forget the past while retaining useful information.

Recursive Least Squares (RLS) introduces a forgetting factor λ ∈ (0, 1] that exponentially downweights old data:

minimize ∑_{i=1}^{t} λ^{t−i} (y_i − x_i^T θ)²

When λ = 1, all past data is weighted equally (identical to APA/OLS). When λ < 1, old tasks are exponentially forgotten. The effective memory window is roughly 1/(1 − λ) samples.

The recursive update

The beauty of RLS is that we never need to re-solve the full problem. The solution updates recursively:

θ^t = θ^{t−1} + P_t x_t (y_t − x_t^T θ^{t−1})

Where P_t is the inverse correlation matrix, updated via the matrix inversion lemma:

P_t = (1/λ)(P_{t−1} − P_{t−1} x_t x_t^T P_{t−1} / (λ + x_t^T P_{t−1} x_t))
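To see that the recursion really tracks the weighted least-squares problem, the sketch below compares it with a batch solve. The initialization P_0 = I/δ with θ⁰ = 0 is an assumption made here (a common way to start RLS); it adds a small ridge term λ^t δ ||θ||² to the objective. Names and data are illustrative.

```python
import numpy as np

def rls_step(theta, P, x, y, lam=1.0):
    """One RLS update with forgetting factor lam."""
    Px = P @ x
    P_new = (P - np.outer(Px, Px) / (lam + x @ Px)) / lam   # matrix inversion lemma
    theta_new = theta + P_new @ x * (y - x @ theta)
    return theta_new, P_new

rng = np.random.default_rng(0)
T, d, lam, delta = 30, 3, 0.95, 1e-2
X = rng.normal(size=(T, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=T)

# Sequential pass: one sample at a time, O(d^2) state.
theta, P = np.zeros(d), np.eye(d) / delta
for x_t, y_t in zip(X, y):
    theta, P = rls_step(theta, P, x_t, y_t, lam)

# Batch reference: exponentially weighted least squares with the matching ridge term.
w = lam ** np.arange(T - 1, -1, -1)            # weight lam^(T-i) for sample i
A = (X * w[:, None]).T @ X + lam**T * delta * np.eye(d)
b = (X * w[:, None]).T @ y
theta_batch = np.linalg.solve(A, b)

print(np.allclose(theta, theta_batch))
```

The sequential pass never stores past samples, yet it ends at the same parameters as the batch solve over all of them.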

The connection to EWC: Elastic Weight Consolidation (Kirkpatrick et al., 2017) penalizes changes to parameters proportional to their Fisher Information. The Fisher Information is essentially the inverse of P_t. RLS is the sequential update of this information matrix. EWC is RLS with an approximation: it resets and re-estimates the Fisher after each task, while RLS updates it smoothly.

The forgetting factor λ creates a fundamental tradeoff:

λ = 1: Perfect memory, no forgetting. Converges to the global optimum over all tasks. But cannot adapt to non-stationary environments.
λ = 0.99: Effective window of ~100 tasks. Good for slowly changing environments.
λ = 0.95: Effective window of ~20 tasks. Tracks rapid changes but forgets fast.
λ → 0: Only the current task matters. Equivalent to LMS with no memory.
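The window sizes quoted above follow from summing the geometric weights 1 + λ + λ² + ... = 1/(1 − λ). A small sketch, with function names chosen here for illustration:

```python
def effective_window(lam):
    """Total weight of all past samples, 1 / (1 - lam), for 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return 1.0 / (1.0 - lam)

def weight_inside_window(lam):
    """Fraction of the total weight carried by the most recent 1/(1-lam) samples."""
    n = round(effective_window(lam))
    return 1.0 - lam ** n

for lam in (0.99, 0.95):
    print(lam, round(effective_window(lam)), round(weight_inside_window(lam), 2))
```

The window is a rule of thumb rather than a hard cutoff: for both values of λ, roughly 63 to 64 percent of the total weight sits inside it, and older samples still contribute the rest.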

RLS needs O(d²) memory (for P_t) and O(d²) computation per step — independent of the number of tasks. Compare to APA's O(td) memory: RLS compresses all past information into the sufficient statistics (θ^t, P_t).

What does the forgetting factor λ in RLS control?

How quickly old tasks are exponentially downweighted: λ=1 means no forgetting, λ<1 means recent tasks matter more, with effective memory window ≈ 1/(1−λ)
The learning rate for the current task
The number of parameters to freeze

Chapter 5: Kalman Filter — The Gold Standard

LMS, APA, and RLS all share a hidden assumption: the true parameter θ* is fixed. Tasks might arrive sequentially, but they all come from the same underlying model. What if the true parameters change from task to task?

The Kalman filter introduces a state transition model that describes how the true parameters evolve:

State: θ_t = A θ_{t−1} + w_t (w_t ~ N(0, Q))

Measurement: y_t = x_t^T θ_t + v_t (v_t ~ N(0, R))

Here A is the state transition matrix (how parameters relate across tasks), Q is the process noise (how much parameters change between tasks), and R is the measurement noise. Unlike RLS, the Kalman filter models the dynamics of how tasks are related.

The Kalman filter update

Predict: Use the transition model to predict where θ will be before seeing the new task:

θ_{t|t−1} = A θ_{t−1}

P_{t|t−1} = A P_{t−1} A^T + Q

Update: Incorporate the new measurement to correct the prediction:

K_t = P_{t|t−1} x_t / (x_t^T P_{t|t−1} x_t + R)

θ_t = θ_{t|t−1} + K_t (y_t − x_t^T θ_{t|t−1})

P_t = (I − K_t x_t^T) P_{t|t−1}

The Kalman gain K_t is a trust slider. When P_{t|t−1} is large (we're uncertain about our prediction), K_t is large: trust the new measurement more. When R is large (the measurement is noisy), K_t is small: trust the prediction more. The KF optimally balances prior knowledge with new evidence at every step.
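The predict and update equations translate almost line for line into NumPy. The sketch below uses a scalar measurement as in the text; the function names and the toy values for A, Q, x, and y are assumptions chosen to make the "trust slider" behaviour easy to read off.

```python
import numpy as np

def kf_predict(theta, P, A, Q):
    """Propagate the estimate and its covariance through the task dynamics."""
    return A @ theta, A @ P @ A.T + Q

def kf_update(theta_pred, P_pred, x, y, R):
    """Correct the prediction with one scalar measurement y = x^T theta + noise."""
    S = x @ P_pred @ x + R                      # innovation variance (scalar)
    K = P_pred @ x / S                          # Kalman gain
    theta = theta_pred + K * (y - x @ theta_pred)
    P = (np.eye(len(x)) - np.outer(K, x)) @ P_pred
    return theta, P, K

A, Q = np.eye(2), 0.01 * np.eye(2)
x, y = np.array([1.0, 0.0]), 1.0
theta0 = np.zeros(2)

def gain_for(prior_scale, R):
    """Gain on the observed coordinate for prior covariance prior_scale * I."""
    theta_pred, P_pred = kf_predict(theta0, prior_scale * np.eye(2), A, Q)
    return kf_update(theta_pred, P_pred, x, y, R)[2][0]

k_baseline = gain_for(1.0, 1.0)      # about 0.50
k_uncertain = gain_for(10.0, 1.0)    # about 0.91: uncertain prior, trust the data
k_noisy = gain_for(1.0, 100.0)       # about 0.01: noisy data, trust the prior
print(k_baseline, k_uncertain, k_noisy)
```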

Positive backward transfer via smoothing

Here is where the Kalman filter truly shines. After processing all T tasks, we can run the Rauch-Tung-Striebel (RTS) smoother backward through the sequence. This uses future task information to improve past task estimates:

θ_{t|T} = θ_t + G_t (θ_{t+1|T} − A θ_t)

where G_t = P_t A^T P_{t+1|t}⁻¹.

The smoother gives positive backward transfer: learning task B genuinely improves your estimate of task A, even after you've moved past it. No other method in the LMS/APA/RLS family can do this, because they don't model inter-task dynamics.
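A forward filter followed by the backward RTS pass fits in one short function. This is a sketch under stated assumptions: scalar measurements, a known A, Q, and R, and synthetic drifting parameters generated here only for illustration. The function name kalman_filter_and_smooth is not from the paper.

```python
import numpy as np

def kalman_filter_and_smooth(X, y, A, Q, R, theta0, P0):
    """Return filtered estimates theta_t and smoothed estimates theta_{t|T}."""
    T, d = X.shape
    filt, P_filt, P_pred = [], [], []
    theta, P = theta0, P0
    for x_t, y_t in zip(X, y):
        theta_p, P_p = A @ theta, A @ P @ A.T + Q           # predict
        K = P_p @ x_t / (x_t @ P_p @ x_t + R)               # gain
        theta = theta_p + K * (y_t - x_t @ theta_p)         # update
        P = (np.eye(d) - np.outer(K, x_t)) @ P_p
        filt.append(theta); P_filt.append(P); P_pred.append(P_p)

    smooth = [None] * T
    smooth[-1] = filt[-1]                                   # nothing later to use
    for t in range(T - 2, -1, -1):                          # backward pass
        G = P_filt[t] @ A.T @ np.linalg.inv(P_pred[t + 1])
        smooth[t] = filt[t] + G @ (smooth[t + 1] - A @ filt[t])
    return np.array(filt), np.array(smooth)

rng = np.random.default_rng(0)
T, d = 8, 2
A, Q, R = np.eye(d), 0.01 * np.eye(d), 0.1
X = rng.normal(size=(T, d))
theta_true = 1.0 + np.cumsum(0.1 * rng.normal(size=(T, d)), axis=0)  # slow drift
y = np.sum(X * theta_true, axis=1) + np.sqrt(R) * rng.normal(size=T)

filtered, smoothed = kalman_filter_and_smooth(X, y, A, Q, R, np.zeros(d), np.eye(d))
print(filtered[0], smoothed[0])   # the estimate for task 1 changes after seeing tasks 2..T
```

A useful sanity check: with A = I and Q = 0 the parameters never change, so G_t = I and every smoothed estimate equals the final filtered one. All the information from later tasks flows back to the first.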

What enables the Kalman filter to achieve positive backward transfer (improving past task estimates)?

The state transition model A relates tasks to each other, and the RTS smoother uses future task information to refine past estimates backward through the sequence
It stores all past data like APA
It uses a very large forgetting factor

Chapter 6: From Linear to Deep

Everything so far has been for linear models: y = x^Tθ. Neural networks are nonlinear. How do we extend these ideas?

The paper identifies three strategies, each with different tradeoffs:

Strategy 1: Linearize the network (NTK connection)

Near initialization, a neural network f(x; θ) can be approximated by its first-order Taylor expansion:

f(x; θ) ≈ f(x; θ₀) + ∇_θf(x; θ₀)^T(θ − θ₀)

This is a linear model in (θ − θ₀) with features φ(x) = ∇_θf(x; θ₀) — the Neural Tangent Kernel (NTK) features. We can apply LMS/APA/RLS/KF directly in this linearized space.

The catch: the NTK approximation is only valid for small parameter changes from initialization. Large updates break the linearization.
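The quality of the linearization can be probed on a tiny network. The model below (one tanh hidden layer with a hand-written gradient) and its sizes are assumptions chosen to keep the example self-contained; it is not the architecture used in the paper.

```python
import numpy as np

H, D = 8, 3   # hidden width and input dimension (arbitrary toy sizes)

def unpack(theta):
    """Split the flat parameter vector into hidden weights W and output weights v."""
    return theta[: H * D].reshape(H, D), theta[H * D:]

def f(x, theta):
    W, v = unpack(theta)
    return v @ np.tanh(W @ x)

def grad_f(x, theta):
    """Gradient of f with respect to theta: the tangent features phi(x)."""
    W, v = unpack(theta)
    a = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - a**2), x)        # df/dW
    return np.concatenate([dW.ravel(), a])    # df/dv = tanh(W x)

def f_linearized(x, theta, theta0):
    """First-order Taylor expansion of f around theta0."""
    return f(x, theta0) + grad_f(x, theta0) @ (theta - theta0)

rng = np.random.default_rng(0)
theta0 = rng.normal(size=H * D + H)
x = rng.normal(size=D)
direction = rng.normal(size=theta0.size)

def linearization_error(scale):
    theta = theta0 + scale * direction
    return abs(f(x, theta) - f_linearized(x, theta, theta0))

for scale in (1e-3, 1e-2, 1e-1, 1.0):
    print(scale, linearization_error(scale))
```

The error grows as the parameters move away from θ₀, which is the regime where a linear method applied to the tangent features stops describing the real network.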

Strategy 2: Layer-wise adaptive filtering

Apply RLS independently to each layer of the network. Each layer's weights are updated using its own P matrix (inverse covariance). This is computationally tractable because each layer is small compared to the full network.

This connects to block-diagonal Fisher Information approximations used in practical EWC implementations.

Strategy 3: Linear classifier on frozen features

Use a pre-trained foundation model (CLIP, DINOv2, etc.) as a frozen feature extractor. Only train a linear classifier on top. Now the problem IS linear, and LMS/APA/RLS/KF apply exactly.
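A small experiment illustrates the point. In the sketch below a fixed random tanh map stands in for the frozen backbone (an assumption made to avoid any dependency on CLIP or DINOv2), and the linear head is trained with RLS at λ = 1 on two tasks presented one after the other.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT = 5, 16
W_frozen = rng.normal(size=(D_FEAT, D_IN))   # stand-in for frozen backbone weights

def features(x):
    """Frozen feature extractor: never updated during continual learning."""
    return np.tanh(W_frozen @ x)

def rls_step(w, P, phi, y):
    """RLS update of the linear head with forgetting factor 1."""
    Pphi = P @ phi
    P = P - np.outer(Pphi, Pphi) / (1.0 + phi @ Pphi)
    return w + P @ phi * (y - phi @ w), P

# Two tasks with different input distributions, seen sequentially.
task_a = rng.normal(loc=-1.0, size=(40, D_IN))
task_b = rng.normal(loc=+1.0, size=(40, D_IN))
inputs = np.vstack([task_a, task_b])
targets = np.sin(inputs.sum(axis=1))

delta = 1.0                                   # ridge strength, sets P_0 = I / delta
w, P = np.zeros(D_FEAT), np.eye(D_FEAT) / delta
for x_t, y_t in zip(inputs, targets):
    w, P = rls_step(w, P, features(x_t), y_t)

# Reference: ridge regression trained jointly on both tasks at once.
Phi = np.array([features(x) for x in inputs])
w_joint = np.linalg.solve(Phi.T @ Phi + delta * np.eye(D_FEAT), Phi.T @ targets)

print(np.allclose(w, w_joint, atol=1e-6))
```

The sequentially trained head matches the jointly trained one, so in this linear setting nothing is lost by seeing task A before task B.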

This helps explain why prompt tuning works. Methods like L2P (Learning to Prompt) and CODA-Prompt keep the backbone frozen and learn small prompt vectors. The optimization over prompt parameters, given fixed backbone features, is approximately linear, which puts it close to the domain where adaptive filtering theory applies.

The gap between theory and practice is narrowing. As pre-trained models get larger and more capable, the "linear head on frozen features" approach becomes more practical. The adaptive filtering framework provides exact guarantees for this setting.

Why does training a linear classifier on frozen pre-trained features connect directly to adaptive filtering theory?

With frozen features, the optimization over classifier weights IS a linear model, exactly the setting where LMS, APA, RLS, and KF have proven convergence guarantees
Pre-trained models don't suffer from catastrophic forgetting
Linear classifiers are always better than nonlinear ones

Chapter 7: The Minimal Disturbance Principle

Let's zoom out and see the unifying thread. Every continual learning method, whether invented in 1960 or 2024, solves some version of this optimization:

minimize d(θ, θ^old) subject to θ satisfies new task constraints

The distance d(·, ·) and the constraint formulation differ across methods, but the structure is always the same: change as little as possible while accommodating new information. This is Widrow's minimal disturbance principle, generalized.

How each method implements this

LMS / Online SGD: d = ||θ − θ^old||² with a single equality constraint. The simplest version: minimize Euclidean distance subject to correcting the current sample's error.

APA / Replay: Same Euclidean distance, but with constraints from ALL past tasks. More constraints = less freedom to move = less forgetting.

EWC: d = (θ − θ^old)^T F (θ − θ^old) where F is the Fisher Information matrix. Important parameters (high Fisher) are disturbed less. This is RLS with a diagonal approximation to P_t⁻¹.

GEM / A-GEM: Instead of hard constraints, project the gradient: ensure the update doesn't increase loss on any past task. This is a relaxed version of APA's projection — instead of hitting the intersection of hyperplanes exactly, just ensure you don't move away from it.

Progressive Networks: Old parameters are frozen entirely, so the disturbance to them is exactly zero, and new parameters are added for new tasks. This is the most extreme form of minimal disturbance, at the cost of growing network size.

The landscape of CL methods is a landscape of distance functions and constraint sets. The adaptive filtering framework reveals that all these methods are points on a spectrum. The choice of d(·, ·) determines what kind of "memory" you have: Euclidean distance = no preference among parameters. Fisher-weighted = parameter importance. Task model (KF) = inter-task dynamics. More structure in the distance function = better continual learning, but more computation.
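The effect of swapping the distance function can be seen in closed form. The sketch below solves the minimal-disturbance problem with a weighted distance and a single hard constraint; this hard-constraint form is an illustrative assumption (EWC itself uses a soft penalty), and the diagonal F is made up for the example.

```python
import numpy as np

def weighted_min_disturbance_step(theta_old, F, x, y):
    """argmin (theta - theta_old)^T F (theta - theta_old)  subject to  x^T theta = y.

    Lagrange multipliers give a step along F^{-1} x instead of along x.
    """
    Finv_x = np.linalg.solve(F, x)
    return theta_old + Finv_x * (y - x @ theta_old) / (x @ Finv_x)

theta_old = np.zeros(2)
x, y = np.array([1.0, 1.0]), 1.0

# Euclidean distance (F = I): this is the normalized LMS step with gamma = 1.
step_euclid = weighted_min_disturbance_step(theta_old, np.eye(2), x, y)

# Fisher-style weighting: the first parameter is 100 times more "important".
step_fisher = weighted_min_disturbance_step(theta_old, np.diag([100.0, 1.0]), x, y)

print(step_euclid)   # [0.5, 0.5]: both parameters share the correction
print(step_fisher)   # about [0.0099, 0.9901]: the important parameter barely moves
```

Both updates satisfy the new constraint exactly. They differ only in which parameters pay for it.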

How does EWC modify the minimal disturbance principle compared to basic LMS?

EWC replaces the Euclidean distance with a Fisher-weighted distance, so important parameters (high Fisher Information) are disturbed less than unimportant ones
EWC uses a larger learning rate
EWC trains on more data per step

Chapter 8: What the Math Tells Us

Let's collect the key results and see what they teach us about designing continual learning systems.

The four-algorithm progression

Algorithm	Memory	Compute/step	Forgetting	Backward Transfer
LMS	O(d)	O(d)	Forgets unless tasks repeat	No
APA	O(td)	O(td²)	Zero (with full memory)	No
RLS	O(d²)	O(d²)	Controlled by λ	No
KF	O(d²)	O(d²)	Optimal given model	Yes (with smoother)

Lessons from the theory

Lesson 1: Memory is necessary. LMS (no memory) provably cannot avoid forgetting unless tasks repeat. This is why SGD without replay forgets — it's not a bug, it's a mathematical necessity. Any memoryless algorithm will have this limitation.

Lesson 2: Replay works because it's projection. Experience Replay isn't just a heuristic that "seems to help." It's performing projection onto the intersection of task constraints. APA proves that full replay gives the exact solution. Partial replay (coresets) gives an approximation whose quality depends on how well the coreset spans the constraint space.

Lesson 3: Regularization is exponential forgetting. EWC-style methods correspond to RLS with a forgetting factor. They trade recency for accuracy. The forgetting factor λ is implicit in the regularization strength: stronger regularization = higher λ = more memory of old tasks = less adaptability to new ones.

Lesson 4: Task models enable backward transfer. Only the Kalman filter achieves positive backward transfer, because only it models how tasks relate. If you can specify (or learn) the transition matrix A, you can improve past task performance by processing future tasks. This is unique — no amount of replay or regularization can do this without a task dynamics model.

Lesson 5: There is no free lunch. Better CL performance requires either more memory (APA), more computation (RLS), or more assumptions (KF). The adaptive filtering framework makes these tradeoffs precise and quantitative.

Cheat sheet: LMS = SGD. APA = Replay. RLS = Regularization. KF = Task modeling. More structure in your algorithm = less forgetting = more cost. Choose based on your constraints.

Which adaptive filtering algorithm is the ONLY one that can achieve positive backward transfer, and why?

The Kalman filter, because it models inter-task dynamics via the state transition matrix A, allowing the RTS smoother to propagate future task information backward to improve past estimates
APA, because it stores all past data
RLS, because of the forgetting factor

Chapter 9: Connections

This paper sits at the intersection of two rich fields. Here's how it connects to other topics:

Related Veanors lessons

Continual Learning Survey — The comprehensive taxonomy of CL methods (7 scenarios, 5 method families). This math paper provides the theoretical foundation for why each family works.
Nested Learning — Multi-timescale updates as nested optimization levels. The LMS→APA→RLS→KF progression mirrors the nested learning hierarchy: faster inner loops (LMS) vs. slower outer loops with more memory (KF).

Related Gleams lessons

Kalman Filter — Learn the KF from scratch with interactive simulations. This paper's Chapter 5 assumes KF familiarity; the micro lesson builds that foundation.

Key methods explained by this framework

EWC (Kirkpatrick et al., 2017): Fisher-weighted regularization ≈ RLS with diagonal P_t⁻¹ approximation
GEM / A-GEM (Lopez-Paz and Ranzato, 2017 / Chaudhry et al., 2019): Gradient projection ≈ relaxed APA constraint satisfaction
Experience Replay: Storing and replaying past data = APA with full or partial memory
Progressive Neural Networks (Rusu et al., 2016): Zero disturbance to old parameters, new capacity for new tasks
L2P / CODA-Prompt: Optimization over small prompt vectors on a frozen backbone, which is approximately linear and therefore close to the setting where adaptive filtering theory applies

Formulas at a glance

Method	Update Rule
LMS	θ^t = θ^{t−1} − γ (x_t^T θ^{t−1} − y_t) x_t
APA	θ^t = X_:t (X_:t^T X_:t)⁻¹ y_:t
RLS	θ^t = θ^{t−1} + P_t x_t (y_t − x_t^T θ^{t−1})
KF	θ_t = θ_{t|t−1} + K_t (y_t − x_t^T θ_{t|t−1})

The big picture: Adaptive filtering and continual learning were parallel threads of the same story for 60 years. This paper weaves them together. The mathematics of LMS, APA, RLS, and the Kalman filter provide exact, provable guarantees for continual learning — not just heuristics. As we move toward lifelong learning systems, these classical tools become more relevant than ever.

What is the key mapping between adaptive filtering and modern CL methods?

LMS = SGD (memoryless), APA = Replay (store past data), RLS = Regularization (sufficient statistics), KF = Task modeling (inter-task dynamics)
All adaptive filters map to SGD with different learning rates
Adaptive filters only work for signal processing, not machine learning