Bishop PRML, Chapter 13

Sequential Data

Markov models, hidden Markov models, the forward-backward algorithm, Viterbi decoding, Kalman filters, and particle filters — modeling temporal structure.

Prerequisites: Chapters 2, 8–9 (distributions, graphical models, EM).
12
Chapters
2
Simulations
12
Quizzes

Chapter 0: Why Sequential Models?

Much of the data we encounter has a natural ordering: speech signals, DNA sequences, stock prices, robot trajectories. The key property: nearby points in the sequence are statistically related. Today's weather helps predict tomorrow's.

i.i.d. is not enough: Previous chapters assumed data points are independent and identically distributed (i.i.d.). Sequential models relax this: observation xt depends on previous observations. The challenge is keeping this tractable — without the Markov assumption (xt depends only on xt-1), the model would need to condition on the entire past, with exponentially growing complexity.

Two types of latent variable model for sequences:

Hidden Markov Models (HMMs): Discrete latent states, arbitrary observations. The workhorse of speech recognition, bioinformatics, and NLP (before neural methods).

Linear Dynamical Systems (LDS): Continuous Gaussian latent states, linear dynamics. The Kalman filter. Foundation of control theory and tracking.

Check: What assumption makes sequential models tractable?

Chapter 1: Markov Models

A first-order Markov model factorizes the joint distribution as:

p(x1, ..., xN) = p(x1) ∏n=2N p(xn|xn−1)

The graphical model is a chain: x1 → x2 → ... → xN.

For discrete states with K values, the model has K−1 parameters for p(x1) and K(K−1) parameters for the transition matrix Ajk = p(xn=k|xn−1=j). For a stationary process (transition probabilities don't change over time), the parameters are shared across all time steps.

Limitation: Observed Markov models assume we see the true state directly. In practice, observations are noisy or incomplete. A patient's true disease state is hidden; we only observe symptoms. This motivates hidden Markov models, where the Markov chain is over latent states, and observations are noisy functions of those states.
Check: What does a first-order Markov model assume?

Chapter 2: Hidden Markov Models

A hidden Markov model (HMM) has two layers: a Markov chain over latent states z1, ..., zN and observations x1, ..., xN that depend on the corresponding latent state.

The joint distribution:

p(x, z) = p(z1) [∏n=2N p(zn|zn−1)] [∏n=1N p(xn|zn)]

Three components define an HMM:

ComponentNotationDescription
Initial stateπk = p(z1k=1)Probability of starting in state k
TransitionsAjk = p(znk=1|z(n-1)j=1)Probability of transitioning from j to k
Emissionsp(xn|zn)Observation distribution given state
Three fundamental problems:
1. Evaluation: p(x|θ) — how likely is an observation sequence? (Forward algorithm)
2. Decoding: arg maxz p(z|x) — what is the most likely state sequence? (Viterbi algorithm)
3. Learning: arg maxθ p(x|θ) — what parameters best explain the data? (Baum-Welch / EM)
Check: What are the three components that define an HMM?

Chapter 3: HMM as a Graphical Model

The HMM is a special case of the graphical model framework from Chapter 8. The latent states form a chain: z1 → z2 → ... → zN, with each zn emitting observation xn.

Key conditional independencies (verified by d-separation):

• xn ⊥ xm | zn, zm (observations are independent given their states)

• zn+1 ⊥ z1:n-1 | zn (Markov property: future is independent of past given present)

• xn+1:N ⊥ x1:n-1 | zn (future observations are independent of past observations given current state)

The graphical model perspective matters: Seeing the HMM as a graphical model tells us exactly which inference algorithms apply. The chain structure means the sum-product algorithm (Ch 8) gives exact marginals — this is the forward-backward algorithm. The max-sum algorithm gives the MAP sequence — this is the Viterbi algorithm. The graphical model framework unifies what were historically separate algorithms.
Check: What inference algorithm on the HMM graph corresponds to the forward-backward algorithm?

Chapter 4: The Forward-Backward Algorithm

The forward algorithm computes p(x1:n, zn) recursively:

α(zn) = p(xn|zn) ∑zn-1 α(zn-1) p(zn|zn-1)

The backward algorithm computes p(xn+1:N|zn) recursively:

β(zn) = ∑zn+1 β(zn+1) p(xn+1|zn+1) p(zn+1|zn)

Combining them gives the posterior over states:

p(zn|x) = γ(zn) = α(zn) β(zn) / p(x)
Computational cost: Naive summation over all KN state sequences is exponential. Forward-backward costs O(NK2) — linear in sequence length, quadratic in number of states. This is because the chain structure allows us to factor the sum using dynamic programming. Each step multiplies the forward vector by the K × K transition matrix.

The scaling technique prevents numerical underflow: normalize α(zn) at each step by its sum cn = ∑k α(znk). The log-likelihood is ln p(x) = ∑n ln cn.

Check: What is the computational complexity of the forward-backward algorithm?

Chapter 5: The Viterbi Algorithm

The Viterbi algorithm finds the single most probable state sequence z* = arg maxz p(z|x). It uses the max-sum algorithm (Chapter 8) on the HMM chain.

Instead of summing over zn-1 (forward algorithm), we take the max:

ω(zn) = ln p(xn|zn) + maxzn-1 [ω(zn-1) + ln p(zn|zn-1)]

Then backtrack from the final state to recover the optimal sequence.

Viterbi vs. marginal decoding: The Viterbi path is the globally most probable sequence. An alternative: choose the most probable state at each time step individually using γ(zn). These can differ! The Viterbi path respects transition constraints (no impossible transitions), while marginal decoding might select states that are individually likely but form an improbable sequence.
Check: How does the Viterbi algorithm differ from the forward algorithm?

Chapter 6: Learning HMMs with EM

The Baum-Welch algorithm is EM applied to HMMs. The latent variables are the state sequence z1:N.

E-step: Run forward-backward to compute γ(zn) = p(zn|x) and ξ(zn-1, zn) = p(zn-1, zn|x).

M-step: Update parameters:

Ajknew = ∑n=2N ξ(z(n-1)j, znk) / ∑n=2N γ(z(n-1)j)
πknew = γ(z1k)
EM for HMMs unifies everything: The E-step is the forward-backward algorithm (inference). The M-step uses the expected sufficient statistics to update transitions and emissions. For Gaussian emissions, the M-step updates mean and covariance using responsibility-weighted data — identical in spirit to EM for GMMs but with temporal structure.
Check: What is the E-step in EM for HMMs?

Chapter 7: HMM Simulation

Hidden Markov Model: Inference

A 3-state HMM generates observations. The top row shows the true hidden states. The bottom shows the observations. The middle shows the posterior state probabilities from forward-backward.

T=30, K=3
Check: Can the forward-backward algorithm perfectly recover the true hidden states?

Chapter 8: Linear Dynamical Systems

Replace discrete latent states with continuous Gaussian states, and you get a linear dynamical system (LDS):

zn = Azn-1 + wn,    wn ~ N(0, Γ)
xn = Czn + vn,    vn ~ N(0, Σ)
HMM vs. LDS: Both are state-space models with the same graphical structure (chain over latent states with emissions). The difference: HMM has discrete states (inference uses forward-backward). LDS has continuous Gaussian states (inference uses the Kalman filter). The HMM can represent multimodal, nonlinear dynamics but with finite states. The LDS handles continuous dynamics but is limited to linear-Gaussian models.
PropertyHMMLDS
Latent statesDiscrete (K values)Continuous (Gaussian)
InferenceForward-backwardKalman filter/smoother
LearningBaum-Welch (EM)EM with Kalman E-step
Cost per stepO(K2)O(Dz3)
Check: What is the continuous analog of the forward-backward algorithm?

Chapter 9: The Kalman Filter

The Kalman filter computes the posterior p(zn|x1:n) recursively in two steps:

Predict: Propagate the previous posterior through the dynamics:

p(zn|x1:n-1) = N(zn|Aμn-1, APn-1AT + Γ)

Update: Incorporate the new observation:

Kn = Pn|n-1CT(CPn|n-1CT + Σ)−1
μn = Aμn-1 + Kn(xnCAμn-1)

where Kn is the Kalman gain.

The Kalman gain as a trust slider: K balances the prediction (from the dynamics model) and the observation. When the observation noise Σ is small (reliable sensor), K is large and we trust the measurement. When the process noise Γ is small (reliable model), K is small and we trust the prediction. The Kalman filter optimally fuses these two information sources in the Gaussian case.

The Kalman smoother (backward pass) uses future observations to refine the estimate: p(zn|x1:N). The smoothed estimate is always more accurate than the filtered estimate.

Check: What does the Kalman gain control?

Chapter 10: Particle Filters

When the dynamics or observations are nonlinear or non-Gaussian, the Kalman filter no longer applies. Particle filters (sequential Monte Carlo) use sampling to approximate the posterior.

The algorithm maintains a set of weighted particles {(zn(l), wn(l))}l=1L:

1. Propagate: Move each particle through the dynamics: zn(l) ~ p(zn|zn-1(l)).

2. Reweight: wn(l) ∝ p(xn|zn(l)).

3. Resample: Draw L new particles with replacement, proportional to weights (SIR from Ch 11).

From Kalman to particles: The Kalman filter is the exact solution for linear-Gaussian models. Extended and unscented Kalman filters handle mild nonlinearity via local linearization. Particle filters handle arbitrary nonlinearity and non-Gaussianity, but with cost O(L) per step and potential weight degeneracy in high dimensions. The hierarchy: Kalman (exact, cheap, linear) → EKF/UKF (approximate, cheap, mildly nonlinear) → particles (approximate, expensive, any nonlinearity).
Check: When are particle filters preferred over Kalman filters?

Chapter 11: Summary

ModelLatentInferenceKey application
Markov modelNone (observed)DirectLanguage models, sequence statistics
HMMDiscreteForward-backward, ViterbiSpeech, bioinformatics, NLP
LDS / KalmanContinuous GaussianKalman filter/smootherTracking, control, navigation
Nonlinear SSMContinuousParticle filterRobotics, nonlinear tracking
The state-space framework: All sequential models share the same structure: a latent state evolves over time according to transition dynamics, and we observe it through a noisy measurement process. The graphical model is always a chain with emissions. The differences are in the state space (discrete vs. continuous), the noise model (Gaussian vs. general), and the dynamics (linear vs. nonlinear). This framework unifies an enormous range of applications across engineering, science, and machine learning.

What comes next: Chapter 14, the final chapter, covers combining models — committees, boosting, decision trees, and mixtures of experts.

"The most widely used framework for treating sequential data
is the hidden Markov model."
— Christopher Bishop, PRML §13.2
Check: What unifying structure do HMMs, Kalman filters, and particle filters share?