Markov models, hidden Markov models, the forward-backward algorithm, Viterbi decoding, Kalman filters, and particle filters — modeling temporal structure.
Much of the data we encounter has a natural ordering: speech signals, DNA sequences, stock prices, robot trajectories. The key property: nearby points in the sequence are statistically related. Today's weather helps predict tomorrow's.
Two types of latent variable model for sequences:
Hidden Markov Models (HMMs): Discrete latent states, arbitrary observations. The workhorse of speech recognition, bioinformatics, and NLP (before neural methods).
Linear Dynamical Systems (LDS): Continuous Gaussian latent states, linear dynamics. The Kalman filter. Foundation of control theory and tracking.
A first-order Markov model factorizes the joint distribution as:
The graphical model is a chain: x1 → x2 → ... → xN.
For discrete states with K values, the model has K−1 parameters for p(x1) and K(K−1) parameters for the transition matrix Ajk = p(xn=k|xn−1=j). For a stationary process (transition probabilities don't change over time), the parameters are shared across all time steps.
A hidden Markov model (HMM) has two layers: a Markov chain over latent states z1, ..., zN and observations x1, ..., xN that depend on the corresponding latent state.
The joint distribution:
Three components define an HMM:
| Component | Notation | Description |
|---|---|---|
| Initial state | πk = p(z1k=1) | Probability of starting in state k |
| Transitions | Ajk = p(znk=1|z(n-1)j=1) | Probability of transitioning from j to k |
| Emissions | p(xn|zn) | Observation distribution given state |
The HMM is a special case of the graphical model framework from Chapter 8. The latent states form a chain: z1 → z2 → ... → zN, with each zn emitting observation xn.
Key conditional independencies (verified by d-separation):
• xn ⊥ xm | zn, zm (observations are independent given their states)
• zn+1 ⊥ z1:n-1 | zn (Markov property: future is independent of past given present)
• xn+1:N ⊥ x1:n-1 | zn (future observations are independent of past observations given current state)
The forward algorithm computes p(x1:n, zn) recursively:
The backward algorithm computes p(xn+1:N|zn) recursively:
Combining them gives the posterior over states:
The scaling technique prevents numerical underflow: normalize α(zn) at each step by its sum cn = ∑k α(znk). The log-likelihood is ln p(x) = ∑n ln cn.
The Viterbi algorithm finds the single most probable state sequence z* = arg maxz p(z|x). It uses the max-sum algorithm (Chapter 8) on the HMM chain.
Instead of summing over zn-1 (forward algorithm), we take the max:
Then backtrack from the final state to recover the optimal sequence.
The Baum-Welch algorithm is EM applied to HMMs. The latent variables are the state sequence z1:N.
E-step: Run forward-backward to compute γ(zn) = p(zn|x) and ξ(zn-1, zn) = p(zn-1, zn|x).
M-step: Update parameters:
A 3-state HMM generates observations. The top row shows the true hidden states. The bottom shows the observations. The middle shows the posterior state probabilities from forward-backward.
Replace discrete latent states with continuous Gaussian states, and you get a linear dynamical system (LDS):
| Property | HMM | LDS |
|---|---|---|
| Latent states | Discrete (K values) | Continuous (Gaussian) |
| Inference | Forward-backward | Kalman filter/smoother |
| Learning | Baum-Welch (EM) | EM with Kalman E-step |
| Cost per step | O(K2) | O(Dz3) |
The Kalman filter computes the posterior p(zn|x1:n) recursively in two steps:
Predict: Propagate the previous posterior through the dynamics:
Update: Incorporate the new observation:
where Kn is the Kalman gain.
The Kalman smoother (backward pass) uses future observations to refine the estimate: p(zn|x1:N). The smoothed estimate is always more accurate than the filtered estimate.
When the dynamics or observations are nonlinear or non-Gaussian, the Kalman filter no longer applies. Particle filters (sequential Monte Carlo) use sampling to approximate the posterior.
The algorithm maintains a set of weighted particles {(zn(l), wn(l))}l=1L:
1. Propagate: Move each particle through the dynamics: zn(l) ~ p(zn|zn-1(l)).
2. Reweight: wn(l) ∝ p(xn|zn(l)).
3. Resample: Draw L new particles with replacement, proportional to weights (SIR from Ch 11).
| Model | Latent | Inference | Key application |
|---|---|---|---|
| Markov model | None (observed) | Direct | Language models, sequence statistics |
| HMM | Discrete | Forward-backward, Viterbi | Speech, bioinformatics, NLP |
| LDS / Kalman | Continuous Gaussian | Kalman filter/smoother | Tracking, control, navigation |
| Nonlinear SSM | Continuous | Particle filter | Robotics, nonlinear tracking |
What comes next: Chapter 14, the final chapter, covers combining models — committees, boosting, decision trees, and mixtures of experts.