Adaptive filtering IS continual learning. LMS, APA, RLS, and the Kalman filter are continual learning algorithms — and their 60-year-old math reveals exactly why modern CL methods work or fail.
You train a network on Task A. It performs beautifully. Then you train it on Task B. It learns B — but now it has forgotten A. This is catastrophic forgetting, and it has plagued neural networks since the 1980s.
But what does forgetting actually look like mathematically? Peng and Vidal give us an elegant framework. Suppose we have T tasks, each defined by a data pair (xt, yt). After training on tasks 1 through j, we get parameters θj. The error matrix captures everything:
This is the error on task i after training through task j. It forms a matrix — rows are tasks, columns are training stages. Three learning paradigms correspond to three regions of this matrix:
From this matrix, two metrics capture everything we care about:
MSE is the average squared error across all tasks seen so far, evaluated at the current parameters. Forgetting measures how much worse we got on old tasks compared to right after we trained on them. If Ft > 0, we forgot. If Ft < 0, we actually got better at old tasks — that's positive backward transfer.
In 1960, Bernard Widrow and Marcian Hoff invented the Least Mean Squares (LMS) algorithm. Their guiding principle was deceptively simple: inject new information with minimal disturbance to stored information. They called it the minimal disturbance principle.
Widrow tried to apply this to multi-layer networks in the 1960s and 70s. It didn't work well — the tools for training deep networks didn't exist yet. So he pivoted to adaptive filtering, where LMS became the dominant algorithm for noise cancellation, echo suppression, and channel equalization.
Meanwhile, the neural network community rediscovered forgetting in the 1990s, coined "catastrophic interference," and spent three decades developing methods to fight it. They built Experience Replay, Elastic Weight Consolidation, Gradient Episodic Memory, Progressive Neural Networks — a zoo of techniques.
Here is the mapping — Table I from the paper — that unifies 60 years of parallel research:
| Adaptive Filter | CL Method | Memory | Key Property |
|---|---|---|---|
| LMS | Online SGD | None | Must revisit tasks to converge |
| APA | ICL / Gradient projection | Past samples | Exact solution with full replay |
| RLS | Regularization (EWC) | Sufficient statistics | Forgetting factor trades recency vs. accuracy |
| Kalman Filter | Task-relationship models | State + covariance | Positive backward transfer possible |
Each row adds more structure: more memory, more assumptions about the task sequence, and better theoretical guarantees. The progression from LMS to KF is the progression from "no memory, must revisit" to "models how tasks relate, can improve the past."
Let's derive LMS from the minimal disturbance principle. We have current parameters θt−1 and a new task (xt, yt). We want to find new parameters θt that:
Formally:
Here γ ∈ (0, 1] controls how much of the error we correct. When γ = 1, we fully correct the error on the new task. When γ < 1, we only partially correct it.
This is a constrained optimization with a quadratic objective and a single linear equality constraint. Using the method of Lagrange multipliers, we get:
If we absorb ||xt||² into the learning rate (or normalize the data), this simplifies to:
The paper proves two key theorems about LMS convergence in the continual learning setting:
Theorem 1 (i.i.d. tasks): If tasks are drawn i.i.d. from a distribution with input covariance Σx, then the expected MSE converges exponentially:
The convergence rate depends on λmin(Σx) — the smallest eigenvalue of the input covariance. Ill-conditioned data (small λmin) means slow convergence. The optimal γ = 1 (full correction).
Theorem 2 (recurring tasks): For tasks that cycle through a fixed set with period T and γ = 1, the MSE after seeing each task at least once satisfies:
LMS only converges if tasks are repeated. See a task once and move on? LMS will forget it. This is the fundamental limitation of memoryless algorithms.
LMS forgets because it only uses the current sample. What if we remembered all past samples and demanded that the new parameters satisfy all constraints simultaneously?
The Affine Projection Algorithm (APA) does exactly this. Instead of projecting onto a single hyperplane (the constraint from the current task), it projects onto the intersection of all hyperplanes (the constraints from all tasks seen so far).
Where X:t = [x1, ..., xt] stacks all past inputs and y:t = [y1, ..., yt]T stacks all past targets. This says: find the closest point to θt−1 that satisfies every task constraint.
Using Lagrange multipliers on this constrained least-squares problem:
Why does this work so perfectly? Because each task constraint yi = xiTθ defines a hyperplane in parameter space. The true θ* lies at the intersection of all these hyperplanes. By projecting onto their intersection, we converge to θ* as soon as we have enough constraints to uniquely determine it.
The connection to modern CL methods is immediate:
The tradeoff is obvious: APA requires storing all past data. For t tasks in d dimensions, we need O(td) memory and O(td²) computation per step. Practical CL methods approximate this by storing a subset (coreset) of past samples.
APA stores all past data. But what if we want to prioritize recent tasks? In a non-stationary world, old tasks might no longer be relevant. We need a way to gracefully forget the past while retaining useful information.
Recursive Least Squares (RLS) introduces a forgetting factor λ ∈ (0, 1] that exponentially downweights old data:
When λ = 1, all past data is weighted equally (identical to APA/OLS). When λ < 1, old tasks are exponentially forgotten. The effective memory window is roughly 1/(1 − λ) samples.
The beauty of RLS is that we never need to re-solve the full problem. The solution updates recursively:
Where Pt is the inverse correlation matrix, updated via the matrix inversion lemma:
The forgetting factor λ creates a fundamental tradeoff:
RLS needs O(d²) memory (for Pt) and O(d²) computation per step — independent of the number of tasks. Compare to APA's O(td) memory: RLS compresses all past information into the sufficient statistics (θt, Pt).
LMS, APA, and RLS all share a hidden assumption: the true parameter θ* is fixed. Tasks might arrive sequentially, but they all come from the same underlying model. What if the true parameters change from task to task?
The Kalman filter introduces a state transition model that describes how the true parameters evolve:
Here A is the state transition matrix (how parameters relate across tasks), Q is the process noise (how much parameters change between tasks), and R is the measurement noise. Unlike RLS, the Kalman filter models the dynamics of how tasks are related.
Predict: Use the transition model to predict where θ will be before seeing the new task:
Update: Incorporate the new measurement to correct the prediction:
Here is where the Kalman filter truly shines. After processing all T tasks, we can run the Rauch-Tung-Striebel (RTS) smoother backward through the sequence. This uses future task information to improve past task estimates:
where Gt = Pt AT Pt+1|t−1.
The smoother gives positive backward transfer: learning task B genuinely improves your estimate of task A, even after you've moved past it. No other method in the LMS/APA/RLS family can do this, because they don't model inter-task dynamics.
Everything so far has been for linear models: y = xTθ. Neural networks are nonlinear. How do we extend these ideas?
The paper identifies three strategies, each with different tradeoffs:
Near initialization, a neural network f(x; θ) can be approximated by its first-order Taylor expansion:
This is a linear model in (θ − θ0) with features φ(x) = ∇θf(x; θ0) — the Neural Tangent Kernel (NTK) features. We can apply LMS/APA/RLS/KF directly in this linearized space.
The catch: the NTK approximation is only valid for small parameter changes from initialization. Large updates break the linearization.
Apply RLS independently to each layer of the network. Each layer's weights are updated using its own P matrix (inverse covariance). This is computationally tractable because each layer is small compared to the full network.
This connects to block-diagonal Fisher Information approximations used in practical EWC implementations.
Use a pre-trained foundation model (CLIP, DINOv2, etc.) as a frozen feature extractor. Only train a linear classifier on top. Now the problem IS linear, and LMS/APA/RLS/KF apply exactly.
The gap between theory and practice is narrowing. As pre-trained models get larger and more capable, the "linear head on frozen features" approach becomes more practical. The adaptive filtering framework provides exact guarantees for this setting.
Let's zoom out and see the unifying thread. Every continual learning method, whether invented in 1960 or 2024, solves some version of this optimization:
The distance d(·, ·) and the constraint formulation differ across methods, but the structure is always the same: change as little as possible while accommodating new information. This is Widrow's minimal disturbance principle, generalized.
LMS / Online SGD: d = ||θ − θold||² with a single equality constraint. The simplest version: minimize Euclidean distance subject to correcting the current sample's error.
APA / Replay: Same Euclidean distance, but with constraints from ALL past tasks. More constraints = less freedom to move = less forgetting.
EWC: d = (θ − θold)T F (θ − θold) where F is the Fisher Information matrix. Important parameters (high Fisher) are disturbed less. This is RLS with a diagonal approximation to Pt−1.
GEM / A-GEM: Instead of hard constraints, project the gradient: ensure the update doesn't increase loss on any past task. This is a relaxed version of APA's projection — instead of hitting the intersection of hyperplanes exactly, just ensure you don't move away from it.
Progressive Networks: d = 0 for old parameters (freeze them entirely) and add new parameters for new tasks. The most extreme minimal disturbance: zero disturbance to old parameters, at the cost of growing network size.
Let's collect the key results and see what they teach us about designing continual learning systems.
| Algorithm | Memory | Compute/step | Forgetting | Backward Transfer |
|---|---|---|---|---|
| LMS | O(d) | O(d) | Forgets unless tasks repeat | No |
| APA | O(td) | O(td²) | Zero (with full memory) | No |
| RLS | O(d²) | O(d²) | Controlled by λ | No |
| KF | O(d²) | O(d²) | Optimal given model | Yes (with smoother) |
Lesson 1: Memory is necessary. LMS (no memory) provably cannot avoid forgetting unless tasks repeat. This is why SGD without replay forgets — it's not a bug, it's a mathematical necessity. Any memoryless algorithm will have this limitation.
Lesson 2: Replay works because it's projection. Experience Replay isn't just a heuristic that "seems to help." It's performing projection onto the intersection of task constraints. APA proves that full replay gives the exact solution. Partial replay (coresets) gives an approximation whose quality depends on how well the coreset spans the constraint space.
Lesson 3: Regularization is exponential forgetting. EWC-style methods correspond to RLS with a forgetting factor. They trade recency for accuracy. The forgetting factor λ is implicit in the regularization strength: stronger regularization = higher λ = more memory of old tasks = less adaptability to new ones.
Lesson 4: Task models enable backward transfer. Only the Kalman filter achieves positive backward transfer, because only it models how tasks relate. If you can specify (or learn) the transition matrix A, you can improve past task performance by processing future tasks. This is unique — no amount of replay or regularization can do this without a task dynamics model.
Lesson 5: There is no free lunch. Better CL performance requires either more memory (APA), more computation (RLS), or more assumptions (KF). The adaptive filtering framework makes these tradeoffs precise and quantitative.
This paper sits at the intersection of two rich fields. Here's how it connects to other topics:
| Method | Update Rule |
|---|---|
| LMS | θt = θt−1 − γ(xtTθt−1 − yt) xt |
| APA | θt = X:t(X:tTX:t)−1 y:t |
| RLS | θt = θt−1 + Pt xt(yt − xtTθt−1) |
| KF | θt = θt|t−1 + Kt(yt − xtTθt|t−1) |