Peng & Vidal — IEEE Signal Processing Magazine, 2025

Mathematics of Continual Learning

Adaptive filtering IS continual learning. LMS, APA, RLS, and the Kalman filter are continual learning algorithms — and their 60-year-old math reveals exactly why modern CL methods work or fail.

Prerequisites: Linear algebra + Least squares + Basic probability
10
Chapters
5+
Simulations

Chapter 0: Catastrophic Forgetting, Mathematically

You train a network on Task A. It performs beautifully. Then you train it on Task B. It learns B — but now it has forgotten A. This is catastrophic forgetting, and it has plagued neural networks since the 1980s.

But what does forgetting actually look like mathematically? Peng and Vidal give us an elegant framework. Suppose we have T tasks, each defined by a data pair (xt, yt). After training on tasks 1 through j, we get parameters θj. The error matrix captures everything:

εij = yi − xiT θj

This is the error on task i after training through task j. It forms a matrix — rows are tasks, columns are training stages. Three learning paradigms correspond to three regions of this matrix:

The key insight: Continual learning is the hardest paradigm because it demands that every entry in the lower triangle of the error matrix be small. Online learning only needs one diagonal. Fine-tuning only needs another. CL needs the whole triangle.

From this matrix, two metrics capture everything we care about:

MSE at time t:   Et = (1/t) ∑i=1t εit²
Forgetting at time t:   Ft = (1/(t−1)) ∑i=1t−1it² − εii²)

MSE is the average squared error across all tasks seen so far, evaluated at the current parameters. Forgetting measures how much worse we got on old tasks compared to right after we trained on them. If Ft > 0, we forgot. If Ft < 0, we actually got better at old tasks — that's positive backward transfer.

Why is continual learning harder than fine-tuning or online learning?

Chapter 1: The Hidden Connection

In 1960, Bernard Widrow and Marcian Hoff invented the Least Mean Squares (LMS) algorithm. Their guiding principle was deceptively simple: inject new information with minimal disturbance to stored information. They called it the minimal disturbance principle.

Widrow tried to apply this to multi-layer networks in the 1960s and 70s. It didn't work well — the tools for training deep networks didn't exist yet. So he pivoted to adaptive filtering, where LMS became the dominant algorithm for noise cancellation, echo suppression, and channel equalization.

Meanwhile, the neural network community rediscovered forgetting in the 1990s, coined "catastrophic interference," and spent three decades developing methods to fight it. They built Experience Replay, Elastic Weight Consolidation, Gradient Episodic Memory, Progressive Neural Networks — a zoo of techniques.

The 60-year-old secret: The adaptive filtering community had already solved these problems mathematically. LMS is online SGD. The Affine Projection Algorithm is gradient projection. Recursive Least Squares is regularization. The Kalman filter models task relationships. Peng and Vidal (2025) finally connected the dots.

Here is the mapping — Table I from the paper — that unifies 60 years of parallel research:

Adaptive FilterCL MethodMemoryKey Property
LMSOnline SGDNoneMust revisit tasks to converge
APAICL / Gradient projectionPast samplesExact solution with full replay
RLSRegularization (EWC)Sufficient statisticsForgetting factor trades recency vs. accuracy
Kalman FilterTask-relationship modelsState + covariancePositive backward transfer possible

Each row adds more structure: more memory, more assumptions about the task sequence, and better theoretical guarantees. The progression from LMS to KF is the progression from "no memory, must revisit" to "models how tasks relate, can improve the past."

What is the minimal disturbance principle (Widrow, 1960)?

Chapter 2: LMS — The Simplest Continual Learner

Let's derive LMS from the minimal disturbance principle. We have current parameters θt−1 and a new task (xt, yt). We want to find new parameters θt that:

  1. Stay as close as possible to θt−1 (minimal disturbance)
  2. Reduce the error on the new task

Formally:

minimize   ||θ − θt−1||²   subject to   yt − xtTθ = (1 − γ)εt

Here γ ∈ (0, 1] controls how much of the error we correct. When γ = 1, we fully correct the error on the new task. When γ < 1, we only partially correct it.

This is a constrained optimization with a quadratic objective and a single linear equality constraint. Using the method of Lagrange multipliers, we get:

θt = θt−1 − γ (xtTθt−1 − yt) xt / ||xt||²

If we absorb ||xt||² into the learning rate (or normalize the data), this simplifies to:

θt = θt−1 − γ (xtTθt−1 − yt) xt
This is online SGD. The update rule is identical: take the current error, multiply by the input, scale by a learning rate, and subtract from the current parameters. The LMS algorithm invented in 1960 for adaptive filtering IS the same algorithm used for online gradient descent in neural networks.

Convergence guarantees

The paper proves two key theorems about LMS convergence in the continual learning setting:

Theorem 1 (i.i.d. tasks): If tasks are drawn i.i.d. from a distribution with input covariance Σx, then the expected MSE converges exponentially:

E[Et] ≤ (1 − γ(2−γ)λminx))t ||θ*||²

The convergence rate depends on λminx) — the smallest eigenvalue of the input covariance. Ill-conditioned data (small λmin) means slow convergence. The optimal γ = 1 (full correction).

Theorem 2 (recurring tasks): For tasks that cycle through a fixed set with period T and γ = 1, the MSE after seeing each task at least once satisfies:

ET ≤ ||θ*||² / (e(T−1))

LMS only converges if tasks are repeated. See a task once and move on? LMS will forget it. This is the fundamental limitation of memoryless algorithms.

Why does LMS (online SGD) forget previously learned tasks?

Chapter 3: APA — Memory Equals Solution

LMS forgets because it only uses the current sample. What if we remembered all past samples and demanded that the new parameters satisfy all constraints simultaneously?

The Affine Projection Algorithm (APA) does exactly this. Instead of projecting onto a single hyperplane (the constraint from the current task), it projects onto the intersection of all hyperplanes (the constraints from all tasks seen so far).

minimize   ||θ − θt−1||²   subject to   X:tT θ = y:t

Where X:t = [x1, ..., xt] stacks all past inputs and y:t = [y1, ..., yt]T stacks all past targets. This says: find the closest point to θt−1 that satisfies every task constraint.

The closed-form solution

Using Lagrange multipliers on this constrained least-squares problem:

θt = X:t (X:tT X:t)−1 y:t
This is the ordinary least-squares solution. APA with full memory gives the exact OLS solution at every step. Once we've seen at least as many linearly independent tasks as parameters (t ≥ d), we get θt = θ* — the true solution. Zero forgetting. Zero error. Exact recovery.

Why does this work so perfectly? Because each task constraint yi = xiTθ defines a hyperplane in parameter space. The true θ* lies at the intersection of all these hyperplanes. By projecting onto their intersection, we converge to θ* as soon as we have enough constraints to uniquely determine it.

The connection to modern CL methods is immediate:

The tradeoff is obvious: APA requires storing all past data. For t tasks in d dimensions, we need O(td) memory and O(td²) computation per step. Practical CL methods approximate this by storing a subset (coreset) of past samples.

Why does APA with full memory achieve zero forgetting?

Chapter 4: RLS — Graceful Forgetting

APA stores all past data. But what if we want to prioritize recent tasks? In a non-stationary world, old tasks might no longer be relevant. We need a way to gracefully forget the past while retaining useful information.

Recursive Least Squares (RLS) introduces a forgetting factor λ ∈ (0, 1] that exponentially downweights old data:

minimize   ∑i=1t λt−i (yi − xiTθ)²

When λ = 1, all past data is weighted equally (identical to APA/OLS). When λ < 1, old tasks are exponentially forgotten. The effective memory window is roughly 1/(1 − λ) samples.

The recursive update

The beauty of RLS is that we never need to re-solve the full problem. The solution updates recursively:

θt = θt−1 + Pt xt (yt − xtT θt−1)

Where Pt is the inverse correlation matrix, updated via the matrix inversion lemma:

Pt = (1/λ)(Pt−1 − Pt−1 xt xtT Pt−1 / (λ + xtT Pt−1 xt))
The connection to EWC: Elastic Weight Consolidation (Kirkpatrick et al., 2017) penalizes changes to parameters proportional to their Fisher Information. The Fisher Information is essentially the inverse of Pt. RLS is the sequential update of this information matrix. EWC is RLS with an approximation: it resets and re-estimates the Fisher after each task, while RLS updates it smoothly.

The forgetting factor λ creates a fundamental tradeoff:

RLS needs O(d²) memory (for Pt) and O(d²) computation per step — independent of the number of tasks. Compare to APA's O(td) memory: RLS compresses all past information into the sufficient statisticst, Pt).

What does the forgetting factor λ in RLS control?

Chapter 5: Kalman Filter — The Gold Standard

LMS, APA, and RLS all share a hidden assumption: the true parameter θ* is fixed. Tasks might arrive sequentially, but they all come from the same underlying model. What if the true parameters change from task to task?

The Kalman filter introduces a state transition model that describes how the true parameters evolve:

State:   θt = A θt−1 + wt   (wt ~ N(0, Q))
Measurement:   yt = xtT θt + vt   (vt ~ N(0, R))

Here A is the state transition matrix (how parameters relate across tasks), Q is the process noise (how much parameters change between tasks), and R is the measurement noise. Unlike RLS, the Kalman filter models the dynamics of how tasks are related.

The Kalman filter update

Predict: Use the transition model to predict where θ will be before seeing the new task:

θt|t−1 = A θt−1
Pt|t−1 = A Pt−1 AT + Q

Update: Incorporate the new measurement to correct the prediction:

Kt = Pt|t−1 xt / (xtT Pt|t−1 xt + R)
θt = θt|t−1 + Kt (yt − xtT θt|t−1)
Pt = (I − Kt xtT) Pt|t−1
The Kalman gain Kt is a trust slider. When Pt|t−1 is large (we're uncertain about our prediction), Kt is large — trust the new measurement more. When R is large (the measurement is noisy), Kt is small — trust the prediction more. The KF optimally balances prior knowledge with new evidence at every step.

Positive backward transfer via smoothing

Here is where the Kalman filter truly shines. After processing all T tasks, we can run the Rauch-Tung-Striebel (RTS) smoother backward through the sequence. This uses future task information to improve past task estimates:

θt|T = θt + Gtt+1|T − Aθt)

where Gt = Pt AT Pt+1|t−1.

The smoother gives positive backward transfer: learning task B genuinely improves your estimate of task A, even after you've moved past it. No other method in the LMS/APA/RLS family can do this, because they don't model inter-task dynamics.

What enables the Kalman filter to achieve positive backward transfer (improving past task estimates)?

Chapter 6: From Linear to Deep

Everything so far has been for linear models: y = xTθ. Neural networks are nonlinear. How do we extend these ideas?

The paper identifies three strategies, each with different tradeoffs:

Strategy 1: Linearize the network (NTK connection)

Near initialization, a neural network f(x; θ) can be approximated by its first-order Taylor expansion:

f(x; θ) ≈ f(x; θ0) + ∇θf(x; θ0)T(θ − θ0)

This is a linear model in (θ − θ0) with features φ(x) = ∇θf(x; θ0) — the Neural Tangent Kernel (NTK) features. We can apply LMS/APA/RLS/KF directly in this linearized space.

The catch: the NTK approximation is only valid for small parameter changes from initialization. Large updates break the linearization.

Strategy 2: Layer-wise adaptive filtering

Apply RLS independently to each layer of the network. Each layer's weights are updated using its own P matrix (inverse covariance). This is computationally tractable because each layer is small compared to the full network.

This connects to block-diagonal Fisher Information approximations used in practical EWC implementations.

Strategy 3: Linear classifier on frozen features

Use a pre-trained foundation model (CLIP, DINOv2, etc.) as a frozen feature extractor. Only train a linear classifier on top. Now the problem IS linear, and LMS/APA/RLS/KF apply exactly.

This is why prompt tuning works. Methods like L2P (Learning to Prompt) and CODA-Prompt keep the backbone frozen and learn small prompt vectors. The optimization over prompt parameters, given fixed backbone features, is approximately linear — putting it squarely in the domain where adaptive filtering theory applies.

The gap between theory and practice is narrowing. As pre-trained models get larger and more capable, the "linear head on frozen features" approach becomes more practical. The adaptive filtering framework provides exact guarantees for this setting.

Why does training a linear classifier on frozen pre-trained features connect directly to adaptive filtering theory?

Chapter 7: The Minimal Disturbance Principle

Let's zoom out and see the unifying thread. Every continual learning method, whether invented in 1960 or 2024, solves some version of this optimization:

minimize   d(θ, θold)   subject to   θ satisfies new task constraints

The distance d(·, ·) and the constraint formulation differ across methods, but the structure is always the same: change as little as possible while accommodating new information. This is Widrow's minimal disturbance principle, generalized.

How each method implements this

LMS / Online SGD: d = ||θ − θold||² with a single equality constraint. The simplest version: minimize Euclidean distance subject to correcting the current sample's error.

APA / Replay: Same Euclidean distance, but with constraints from ALL past tasks. More constraints = less freedom to move = less forgetting.

EWC: d = (θ − θold)T F (θ − θold) where F is the Fisher Information matrix. Important parameters (high Fisher) are disturbed less. This is RLS with a diagonal approximation to Pt−1.

GEM / A-GEM: Instead of hard constraints, project the gradient: ensure the update doesn't increase loss on any past task. This is a relaxed version of APA's projection — instead of hitting the intersection of hyperplanes exactly, just ensure you don't move away from it.

Progressive Networks: d = 0 for old parameters (freeze them entirely) and add new parameters for new tasks. The most extreme minimal disturbance: zero disturbance to old parameters, at the cost of growing network size.

The landscape of CL methods is a landscape of distance functions and constraint sets. The adaptive filtering framework reveals that all these methods are points on a spectrum. The choice of d(·, ·) determines what kind of "memory" you have: Euclidean distance = no preference among parameters. Fisher-weighted = parameter importance. Task model (KF) = inter-task dynamics. More structure in the distance function = better continual learning, but more computation.
How does EWC modify the minimal disturbance principle compared to basic LMS?

Chapter 8: What the Math Tells Us

Let's collect the key results and see what they teach us about designing continual learning systems.

The four-algorithm progression

AlgorithmMemoryCompute/stepForgettingBackward Transfer
LMSO(d)O(d)Forgets unless tasks repeatNo
APAO(td)O(td²)Zero (with full memory)No
RLSO(d²)O(d²)Controlled by λNo
KFO(d²)O(d²)Optimal given modelYes (with smoother)

Lessons from the theory

Lesson 1: Memory is necessary. LMS (no memory) provably cannot avoid forgetting unless tasks repeat. This is why SGD without replay forgets — it's not a bug, it's a mathematical necessity. Any memoryless algorithm will have this limitation.

Lesson 2: Replay works because it's projection. Experience Replay isn't just a heuristic that "seems to help." It's performing projection onto the intersection of task constraints. APA proves that full replay gives the exact solution. Partial replay (coresets) gives an approximation whose quality depends on how well the coreset spans the constraint space.

Lesson 3: Regularization is exponential forgetting. EWC-style methods correspond to RLS with a forgetting factor. They trade recency for accuracy. The forgetting factor λ is implicit in the regularization strength: stronger regularization = higher λ = more memory of old tasks = less adaptability to new ones.

Lesson 4: Task models enable backward transfer. Only the Kalman filter achieves positive backward transfer, because only it models how tasks relate. If you can specify (or learn) the transition matrix A, you can improve past task performance by processing future tasks. This is unique — no amount of replay or regularization can do this without a task dynamics model.

Lesson 5: There is no free lunch. Better CL performance requires either more memory (APA), more computation (RLS), or more assumptions (KF). The adaptive filtering framework makes these tradeoffs precise and quantitative.

Cheat sheet: LMS = SGD. APA = Replay. RLS = Regularization. KF = Task modeling. More structure in your algorithm = less forgetting = more cost. Choose based on your constraints.
Which adaptive filtering algorithm is the ONLY one that can achieve positive backward transfer, and why?

Chapter 9: Connections

This paper sits at the intersection of two rich fields. Here's how it connects to other topics:

Related Veanors lessons

Related Gleams lessons

Key methods explained by this framework

Formulas at a glance

MethodUpdate Rule
LMSθt = θt−1 − γ(xtTθt−1 − yt) xt
APAθt = X:t(X:tTX:t)−1 y:t
RLSθt = θt−1 + Pt xt(yt − xtTθt−1)
KFθt = θt|t−1 + Kt(yt − xtTθt|t−1)
The big picture: Adaptive filtering and continual learning were parallel threads of the same story for 60 years. This paper weaves them together. The mathematics of LMS, APA, RLS, and the Kalman filter provide exact, provable guarantees for continual learning — not just heuristics. As we move toward lifelong learning systems, these classical tools become more relevant than ever.
What is the key mapping between adaptive filtering and modern CL methods?