Bishop PRML, Chapter 13

Sequential Data

Markov models, hidden Markov models, the forward-backward algorithm, Viterbi decoding, Kalman filters, and particle filters — modeling temporal structure.

Prerequisites: Chapters 2, 8–9 (distributions, graphical models, EM).

Chapter 0: Why Sequential Models?

Much of the data we encounter has a natural ordering: speech signals, DNA sequences, stock prices, robot trajectories. The key property: nearby points in the sequence are statistically related. Today's weather helps predict tomorrow's.

i.i.d. is not enough: Previous chapters assumed data points are independent and identically distributed (i.i.d.). Sequential models relax this: observation x_t depends on previous observations. The challenge is keeping this tractable. Without the Markov assumption (x_t depends only on x_{t-1}), each conditional p(x_t | x_1, ..., x_{t-1}) would condition on the entire past, and for discrete variables with K values its table would need K^{t-1}(K−1) free parameters, which grows exponentially with t.

Two types of latent variable model for sequences:

Hidden Markov Models (HMMs): Discrete latent states, arbitrary observations. The workhorse of speech recognition, bioinformatics, and NLP (before neural methods).

Linear Dynamical Systems (LDS): Continuous Gaussian latent states, linear dynamics. The Kalman filter. Foundation of control theory and tracking.

Check: What assumption makes sequential models tractable?

• The Markov property: the current state depends only on the previous state, not the entire history
• The data must be Gaussian
• All observations must be independent

Chapter 1: Markov Models

A first-order Markov model factorizes the joint distribution as:

p(x₁, ..., x_N) = p(x₁) ∏_{n=2}^{N} p(x_n | x_{n-1})

The graphical model is a chain: x₁ → x₂ → ... → x_N.

For discrete states with K values, the model has K−1 free parameters for p(x₁) and K(K−1) free parameters for the transition matrix A_jk = p(x_n = k | x_{n-1} = j), because each of its K rows must sum to 1. For a stationary process (transition probabilities don't change over time), the parameters are shared across all time steps.
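
A minimal sketch of the factorization and parameter count above, assuming a hypothetical two-state weather chain. The names `pi`, `A`, `markov_log_prob`, and `num_free_parameters`, and all numbers, are illustrative and not taken from the text.

```python
import math

# Hypothetical 2-state chain: 0 = sunny, 1 = rainy (illustrative numbers).
pi = [0.7, 0.3]                 # p(x_1)
A = [[0.9, 0.1],                # A[j][k] = p(x_n = k | x_{n-1} = j)
     [0.4, 0.6]]                # each row sums to 1

def markov_log_prob(seq, pi, A):
    """ln p(x_1..x_N) = ln p(x_1) + sum over n >= 2 of ln p(x_n | x_{n-1})."""
    logp = math.log(pi[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(A[prev][cur])
    return logp

def num_free_parameters(K):
    """(K - 1) for p(x_1) plus K(K - 1) for the stationary transition matrix."""
    return (K - 1) + K * (K - 1)

seq = [0, 0, 1, 1]
p = math.exp(markov_log_prob(seq, pi, A))   # product 0.7 * 0.9 * 0.1 * 0.6
```

Working in log space keeps long sequences from underflowing, which is the same concern the scaling trick addresses for HMMs.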

Limitation: Observed Markov models assume we see the true state directly. In practice, observations are noisy or incomplete. A patient's true disease state is hidden; we only observe symptoms. This motivates hidden Markov models, where the Markov chain is over latent states, and observations are noisy functions of those states.

Check: What does a first-order Markov model assume?

• The next state depends only on the current state: p(x_n | x_1, ..., x_{n-1}) = p(x_n | x_{n-1})
• All states are independent
• The process has no memory

Chapter 2: Hidden Markov Models

A hidden Markov model (HMM) has two layers: a Markov chain over latent states z₁, ..., z_N and observations x₁, ..., x_N that depend on the corresponding latent state.

The joint distribution:

p(x, z) = p(z₁) [∏_{n=2}^{N} p(z_n | z_{n-1})] [∏_{n=1}^{N} p(x_n | z_n)]

Three components define an HMM:

Component	Notation	Description
Initial state	π_k = p(z_{1k} = 1)	Probability of starting in state k
Transitions	A_jk = p(z_{nk} = 1 | z_{n-1,j} = 1)	Probability of transitioning from j to k
Emissions	p(x_n | z_n)	Observation distribution given state

Three fundamental problems:
1. Evaluation: p(x|θ) — how likely is an observation sequence? (Forward algorithm)
2. Decoding: arg max_z p(z|x) — what is the most likely state sequence? (Viterbi algorithm)
3. Learning: arg max_θ p(x|θ) — what parameters best explain the data? (Baum-Welch / EM)
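
To make the joint distribution and the evaluation problem concrete, here is a small sketch that assumes discrete emissions stored in a hypothetical matrix `B` (the text allows arbitrary emission distributions). It computes p(x, z) from the three components and then evaluates p(x) by brute-force summation over all K^N state sequences, which is only feasible for tiny N. All names and numbers are illustrative.

```python
import itertools

# Hypothetical 2-state HMM with 2 discrete observation symbols.
pi = [0.6, 0.4]                  # pi[k]   = p(z_1 = k)
A = [[0.7, 0.3],                 # A[j][k] = p(z_n = k | z_{n-1} = j)
     [0.2, 0.8]]
B = [[0.9, 0.1],                 # B[k][v] = p(x_n = v | z_n = k)
     [0.3, 0.7]]

def joint_prob(x, z, pi, A, B):
    """p(x, z) = p(z_1) * prod p(z_n | z_{n-1}) * prod p(x_n | z_n)."""
    p = pi[z[0]] * B[z[0]][x[0]]
    for n in range(1, len(x)):
        p *= A[z[n - 1]][z[n]] * B[z[n]][x[n]]
    return p

def brute_force_likelihood(x, pi, A, B):
    """p(x) as an explicit sum over all K^N latent sequences (exponential cost)."""
    K = len(pi)
    return sum(joint_prob(x, z, pi, A, B)
               for z in itertools.product(range(K), repeat=len(x)))

x = [0, 1, 1]
px = brute_force_likelihood(x, pi, A, B)
```

The exponential cost of this sum is what the forward algorithm avoids.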

Check: What are the three components that define an HMM?

• Initial state probabilities, transition matrix, and emission distributions
• Mean, variance, and number of states
• Encoder, decoder, and bottleneck

Chapter 3: HMM as a Graphical Model

The HMM is a special case of the graphical model framework from Chapter 8. The latent states form a chain: z₁ → z₂ → ... → z_N, with each z_n emitting observation x_n.

Key conditional independencies (verified by d-separation):

• x_n ⊥ x_m | z_n, z_m (observations are independent given their states)

• z_{n+1} ⊥ z_{1:n-1} | z_n (Markov property: future is independent of past given present)

• x_{n+1:N} ⊥ x_{1:n-1} | z_n (future observations are independent of past observations given current state)

The graphical model perspective matters: Seeing the HMM as a graphical model tells us exactly which inference algorithms apply. The chain structure means the sum-product algorithm (Ch 8) gives exact marginals — this is the forward-backward algorithm. The max-sum algorithm gives the MAP sequence — this is the Viterbi algorithm. The graphical model framework unifies what were historically separate algorithms.

Check: What inference algorithm on the HMM graph corresponds to the forward-backward algorithm?

• The sum-product algorithm on the chain-structured factor graph
• The max-sum algorithm
• Loopy belief propagation

Chapter 4: The Forward-Backward Algorithm

The forward algorithm computes α(z_n) = p(x_{1:n}, z_n) recursively, starting from α(z₁) = p(z₁) p(x₁|z₁):

α(z_n) = p(x_n|z_n) ∑_{z_{n-1}} α(z_{n-1}) p(z_n|z_{n-1})

The backward algorithm computes β(z_n) = p(x_{n+1:N}|z_n) recursively, starting from β(z_N) = 1:

β(z_n) = ∑_{z_{n+1}} β(z_{n+1}) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)

Combining them gives the posterior over states:

p(z_n|x) = γ(z_n) = α(z_n) β(z_n) / p(x),  where p(x) = ∑_{z_N} α(z_N)

Computational cost: Naive summation over all K^N state sequences is exponential. Forward-backward costs O(NK²) — linear in sequence length, quadratic in number of states. This is because the chain structure allows us to factor the sum using dynamic programming. Each step multiplies the forward vector by the K × K transition matrix.

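The scaling technique prevents numerical underflow: instead of α(z_n), propagate the normalized vector α̂(z_n) = p(z_n | x_{1:n}). At each step, apply the forward recursion to α̂(z_{n-1}) and divide the result by its sum over states, c_n = p(x_n | x_{1:n-1}). The log-likelihood is ln p(x) = ∑_n ln c_n.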
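
The recursions and the scaling trick can be sketched as follows. This is a minimal version that assumes discrete emissions in a hypothetical matrix `B` and renormalizes the forward vector at every step; the function name, variable names, and example numbers are illustrative.

```python
import math

def forward_backward(x, pi, A, B):
    """Scaled forward-backward for an HMM with discrete emissions.

    Returns (gamma, log_lik) with gamma[n][k] = p(z_n = k | x).
    Assumes every observation has nonzero probability, so no c[n] is zero.
    """
    N, K = len(x), len(pi)
    alpha_hat = [[0.0] * K for _ in range(N)]   # normalized forward vectors
    c = [0.0] * N                               # per-step normalizers

    # Forward pass: one recursion step, then normalize by the sum over states.
    for n in range(N):
        for k in range(K):
            if n == 0:
                prior = pi[k]
            else:
                prior = sum(alpha_hat[n - 1][j] * A[j][k] for j in range(K))
            alpha_hat[n][k] = B[k][x[n]] * prior
        c[n] = sum(alpha_hat[n])
        alpha_hat[n] = [a / c[n] for a in alpha_hat[n]]

    # Backward pass, rescaled with the same normalizers.
    beta_hat = [[1.0] * K for _ in range(N)]
    for n in range(N - 2, -1, -1):
        for j in range(K):
            beta_hat[n][j] = sum(
                beta_hat[n + 1][k] * B[k][x[n + 1]] * A[j][k] for k in range(K)
            ) / c[n + 1]

    gamma = [[alpha_hat[n][k] * beta_hat[n][k] for k in range(K)]
             for n in range(N)]
    log_lik = sum(math.log(cn) for cn in c)
    return gamma, log_lik

# Hypothetical 2-state, 2-symbol example.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 1, 0]
gamma, log_lik = forward_backward(x, pi, A, B)
```

Each time step touches every pair of states once, which is where the O(NK²) cost comes from.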

Check: What is the computational complexity of the forward-backward algorithm?

• O(NK^2): linear in sequence length N, quadratic in number of states K
• O(K^N): exponential in sequence length
• O(N log K)

Chapter 5: The Viterbi Algorithm

The Viterbi algorithm finds the single most probable state sequence z* = arg max_z p(z|x). It uses the max-sum algorithm (Chapter 8) on the HMM chain.

Instead of summing over z_n-1 (forward algorithm), we take the max:

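ω(z_n) = ln p(x_n|z_n) + max_{z_{n-1}} [ω(z_{n-1}) + ln p(z_n|z_{n-1})]

initialized with ω(z₁) = ln p(z₁) + ln p(x₁|z₁). Record the maximizing z_{n-1} for each z_n, then backtrack from the best final state to recover the optimal sequence.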

Viterbi vs. marginal decoding: The Viterbi path is the globally most probable sequence. An alternative: choose the most probable state at each time step individually using γ(z_n). These can differ! The Viterbi path respects transition constraints (no impossible transitions), while marginal decoding might select states that are individually likely but form an improbable sequence.
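
A minimal sketch of the max-sum recursion with backtracking, again assuming discrete emissions in a hypothetical matrix `B`. The helper `_log` maps zero probabilities to negative infinity so that impossible transitions are never selected; all names and numbers are illustrative.

```python
import math

def _log(p):
    """ln p, with ln 0 treated as -infinity instead of raising an error."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(x, pi, A, B):
    """Return (path, score): the most probable state sequence and ln p(x, path)."""
    N, K = len(x), len(pi)
    omega = [[0.0] * K for _ in range(N)]   # best log joint ending in each state
    back = [[0] * K for _ in range(N)]      # argmax predecessor for backtracking

    for k in range(K):
        omega[0][k] = _log(pi[k]) + _log(B[k][x[0]])

    for n in range(1, N):
        for k in range(K):
            scores = [omega[n - 1][j] + _log(A[j][k]) for j in range(K)]
            best = max(range(K), key=lambda j: scores[j])
            back[n][k] = best
            omega[n][k] = _log(B[k][x[n]]) + scores[best]

    # Backtrack from the best final state.
    last = max(range(K), key=lambda k: omega[N - 1][k])
    path = [last]
    for n in range(N - 1, 0, -1):
        path.append(back[n][path[-1]])
    path.reverse()
    return path, omega[N - 1][last]

# Hypothetical 2-state, 2-symbol example.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.3, 0.7]]
x = [0, 1, 1, 0]
path, score = viterbi(x, pi, A, B)
```

Marginal decoding would instead take the argmax of γ(z_n) at each n separately, with no backtracking step tying the choices together.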

Check: How does the Viterbi algorithm differ from the forward algorithm?

• Viterbi replaces summation with maximization, finding the most probable state sequence instead of marginal probabilities
• Viterbi is faster
• They give the same result

Chapter 6: Learning HMMs with EM

The Baum-Welch algorithm is EM applied to HMMs. The latent variables are the state sequence z_1:N.

E-step: Run forward-backward to compute γ(z_n) = p(z_n|x) and ξ(z_{n-1}, z_n) = p(z_{n-1}, z_n|x).

M-step: Update parameters:

A_jk^new = ∑_{n=2}^{N} ξ(z_{n-1,j}, z_{nk}) / ∑_{n=2}^{N} γ(z_{n-1,j})

π_k^new = γ(z_{1k})

EM for HMMs unifies everything: The E-step is the forward-backward algorithm (inference). The M-step uses the expected sufficient statistics to update transitions and emissions. For Gaussian emissions, the M-step updates mean and covariance using responsibility-weighted data — identical in spirit to EM for GMMs but with temporal structure.
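
The E-step and M-step can be sketched for a single observation sequence as follows. This assumes discrete emissions in a hypothetical matrix `B` rather than the Gaussian emissions mentioned above, so the emission update is a responsibility-weighted symbol count. The function names, initial parameters, and data are illustrative.

```python
import math

def e_step(x, pi, A, B):
    """Scaled forward-backward; returns gamma, xi, and ln p(x)."""
    N, K = len(x), len(pi)
    a = [[0.0] * K for _ in range(N)]   # normalized forward vectors
    b = [[1.0] * K for _ in range(N)]   # rescaled backward vectors
    c = [0.0] * N
    for n in range(N):
        for k in range(K):
            prior = pi[k] if n == 0 else sum(a[n - 1][j] * A[j][k] for j in range(K))
            a[n][k] = B[k][x[n]] * prior
        c[n] = sum(a[n])
        a[n] = [v / c[n] for v in a[n]]
    for n in range(N - 2, -1, -1):
        for j in range(K):
            b[n][j] = sum(b[n + 1][k] * B[k][x[n + 1]] * A[j][k]
                          for k in range(K)) / c[n + 1]
    gamma = [[a[n][k] * b[n][k] for k in range(K)] for n in range(N)]
    # xi[i][j][k] is the posterior of the pair (z_i = j, z_{i+1} = k), 0-indexed.
    xi = [[[a[n - 1][j] * A[j][k] * B[k][x[n]] * b[n][k] / c[n]
            for k in range(K)] for j in range(K)] for n in range(1, N)]
    return gamma, xi, sum(math.log(v) for v in c)

def m_step(x, gamma, xi, V):
    """Re-estimate pi, A, B from expected sufficient statistics (V symbols)."""
    N, K = len(gamma), len(gamma[0])
    pi = gamma[0][:]
    A = [[sum(xi[n][j][k] for n in range(N - 1)) /
          sum(gamma[n][j] for n in range(N - 1))
          for k in range(K)] for j in range(K)]
    B = [[sum(gamma[n][k] for n in range(N) if x[n] == v) /
          sum(gamma[n][k] for n in range(N))
          for v in range(V)] for k in range(K)]
    return pi, A, B

# Hypothetical data and starting point; run a few EM iterations.
x = [0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
pi = [0.5, 0.5]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [[0.7, 0.3], [0.4, 0.6]]
log_liks = []
for _ in range(10):
    gamma, xi, ll = e_step(x, pi, A, B)
    log_liks.append(ll)
    pi, A, B = m_step(x, gamma, xi, V=2)
```

Tracking `log_liks` is a cheap sanity check: EM should never decrease the log-likelihood from one iteration to the next.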

Check: What is the E-step in EM for HMMs?

• Running the forward-backward algorithm to compute state posteriors and pairwise posteriors
• Running the Viterbi algorithm
• Computing cluster assignments

Chapter 7: HMM Simulation

Hidden Markov Model: Inference

A 3-state HMM generates observations. The top row shows the true hidden states. The bottom shows the observations. The middle shows the posterior state probabilities from forward-backward.

Simulation settings: T = 30 time steps, K = 3 states.

Check: Can the forward-backward algorithm perfectly recover the true hidden states?

• Yes, always
• No: it gives posterior probabilities over states, reflecting uncertainty. When emissions overlap, states are ambiguous.
• Only if there are more observations than states

Chapter 8: Linear Dynamical Systems

Replace discrete latent states with continuous Gaussian states, and you get a linear dynamical system (LDS):

z_n = A z_{n-1} + w_n,  w_n ~ N(0, Γ)

x_n = C z_n + v_n,  v_n ~ N(0, Σ)
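
As an illustration of the two equations above, the following sketch simulates a scalar LDS, so A, C, Γ, and Σ are plain numbers and Γ, Σ are variances. The function name `simulate_lds`, the fixed starting value `z0`, and the seed argument are assumptions made for the example.

```python
import math
import random

def simulate_lds(N, A, C, Gamma, Sigma, z0=0.0, seed=0):
    """Draw latent states z_1..z_N and observations x_1..x_N from a scalar LDS.

    z0 is a hypothetical fixed state at time 0; the text does not specify
    how the chain is initialized.
    """
    rng = random.Random(seed)
    zs, xs = [], []
    z = z0
    for _ in range(N):
        z = A * z + rng.gauss(0.0, math.sqrt(Gamma))   # z_n = A z_{n-1} + w_n
        x = C * z + rng.gauss(0.0, math.sqrt(Sigma))   # x_n = C z_n + v_n
        zs.append(z)
        xs.append(x)
    return zs, xs

# A slowly decaying state observed through noise (illustrative numbers).
zs, xs = simulate_lds(N=50, A=0.95, C=1.0, Gamma=0.1, Sigma=0.5, z0=5.0)
```

Setting both noise variances to zero reduces the model to the deterministic recursion z_n = A z_{n-1}, which is a convenient way to check the code.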

HMM vs. LDS: Both are state-space models with the same graphical structure (chain over latent states with emissions). The difference: HMM has discrete states (inference uses forward-backward). LDS has continuous Gaussian states (inference uses the Kalman filter). The HMM can represent multimodal, nonlinear dynamics but with finite states. The LDS handles continuous dynamics but is limited to linear-Gaussian models.

Property	HMM	LDS
Latent states	Discrete (K values)	Continuous (Gaussian)
Inference	Forward-backward	Kalman filter/smoother
Learning	Baum-Welch (EM)	EM with Kalman E-step
Cost per step	O(K²)	O(D_z³), where D_z is the latent dimension

Check: What is the continuous analog of the forward-backward algorithm?

• The Kalman filter (forward pass) and Rauch-Tung-Striebel smoother (backward pass)
• The Viterbi algorithm
• PCA

Chapter 9: The Kalman Filter

The Kalman filter computes the posterior p(z_n|x_1:n) recursively in two steps:

Predict: Propagate the previous posterior through the dynamics:

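p(z_n|x_{1:n-1}) = N(z_n | Aμ_{n-1}, P_{n|n-1}),  with P_{n|n-1} = A P_{n-1} A^T + Γ

Update: Incorporate the new observation:

K_n = P_{n|n-1} C^T (C P_{n|n-1} C^T + Σ)⁻¹

μ_n = Aμ_{n-1} + K_n (x_n − C A μ_{n-1})

P_n = (I − K_n C) P_{n|n-1}

where K_n is the Kalman gain, and μ_n and P_n are the mean and covariance of the filtered posterior p(z_n|x_{1:n}).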

The Kalman gain as a trust slider: K balances the prediction (from the dynamics model) and the observation. When the observation noise Σ is small (reliable sensor), K is large and we trust the measurement. When the process noise Γ is small (reliable model), K is small and we trust the prediction. The Kalman filter optimally fuses these two information sources in the Gaussian case.
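
The trust-slider behavior can be checked with a scalar sketch, in which the matrix inverse becomes a division. The function name `kalman_step` and the numbers are illustrative, and the covariance update P_n = (1 − K_n C) P_{n|n-1} is the standard one, included here so that the step can be iterated.

```python
def kalman_step(mu_prev, P_prev, x, A, C, Gamma, Sigma):
    """One predict/update cycle for a scalar linear-Gaussian model."""
    # Predict: push the previous posterior through the dynamics.
    mu_pred = A * mu_prev
    P_pred = A * P_prev * A + Gamma
    # Update: the gain weighs the prediction against the new observation.
    K = P_pred * C / (C * P_pred * C + Sigma)
    mu = mu_pred + K * (x - C * mu_pred)
    P = (1.0 - K * C) * P_pred
    return mu, P, K

# Reliable sensor (tiny Sigma): the gain approaches 1/C and mu follows x.
mu_a, P_a, K_a = kalman_step(0.0, 1.0, 2.0, A=1.0, C=1.0, Gamma=0.1, Sigma=1e-9)

# Reliable model (no process noise, no prior uncertainty): the gain is 0
# and mu is just the model prediction A * mu_prev.
mu_b, P_b, K_b = kalman_step(3.0, 0.0, 10.0, A=0.5, C=1.0, Gamma=0.0, Sigma=1.0)
```

In the first call the observation dominates; in the second the observation is ignored entirely.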

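The Kalman smoother (backward pass) uses future observations to refine the estimate: p(z_n|x_{1:N}). Because it conditions on more data, the smoothed posterior covariance is never larger than the filtered one; the two coincide at the final step n = N, where there are no future observations.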
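
A scalar sketch of the backward pass, using the Rauch-Tung-Striebel recursions on top of a compact forward filter. The function names, the prior `(mu0, P0)` on the state before the first observation, and the data are assumptions made for the example.

```python
def kalman_filter(xs, A, C, Gamma, Sigma, mu0, P0):
    """Forward pass; returns filtered means, variances, and predicted variances."""
    mus, Ps, P_preds = [], [], []
    mu, P = mu0, P0
    for x in xs:
        mu_pred = A * mu
        P_pred = A * P * A + Gamma
        K = P_pred * C / (C * P_pred * C + Sigma)
        mu = mu_pred + K * (x - C * mu_pred)
        P = (1.0 - K * C) * P_pred
        mus.append(mu)
        Ps.append(P)
        P_preds.append(P_pred)
    return mus, Ps, P_preds

def rts_smoother(mus, Ps, P_preds, A):
    """Backward pass; returns smoothed means and variances."""
    N = len(mus)
    s_mus, s_Ps = mus[:], Ps[:]          # the last step is already smoothed
    for n in range(N - 2, -1, -1):
        J = Ps[n] * A / P_preds[n + 1]   # smoother gain
        s_mus[n] = mus[n] + J * (s_mus[n + 1] - A * mus[n])
        s_Ps[n] = Ps[n] + J * (s_Ps[n + 1] - P_preds[n + 1]) * J
    return s_mus, s_Ps

# Hypothetical noisy observations of a slowly drifting scalar state.
xs = [1.0, 1.2, 0.9, 1.1]
mus, Ps, P_preds = kalman_filter(xs, A=1.0, C=1.0, Gamma=0.1, Sigma=0.5,
                                 mu0=0.0, P0=1.0)
s_mus, s_Ps = rts_smoother(mus, Ps, P_preds, A=1.0)
```

Comparing `s_Ps` with `Ps` shows the variance reduction from using future observations, with no change at the last time step.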

Check: What does the Kalman gain control?

• How much to trust the measurement vs. the model prediction, balancing observation noise and process noise
• The number of latent states
• The step size of the dynamics

Chapter 10: Particle Filters

When the dynamics or observations are nonlinear or non-Gaussian, the Kalman filter no longer applies. Particle filters (sequential Monte Carlo) use sampling to approximate the posterior.

The algorithm maintains a set of weighted particles {(z_n^{(l)}, w_n^{(l)})}_{l=1}^{L}:

1. Propagate: Move each particle through the dynamics: z_n^{(l)} ~ p(z_n | z_{n-1}^{(l)}).

2. Reweight: w_n^{(l)} ∝ p(x_n | z_n^{(l)}), normalized so the weights sum to 1.

3. Resample: Draw L new particles with replacement, proportional to weights (SIR from Ch 11).
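
The three steps above can be sketched as a bootstrap particle filter for a scalar state. The model is passed in as three hypothetical callables (`init_sample`, `transition_sample`, `obs_likelihood`), and the example model and numbers are illustrative; resampling happens at every step, which is why the weights depend only on the current observation.

```python
import math
import random

def bootstrap_particle_filter(xs, L, init_sample, transition_sample,
                              obs_likelihood, seed=0):
    """Return (posterior means per step, final particle set)."""
    rng = random.Random(seed)
    particles = [init_sample(rng) for _ in range(L)]
    means = []
    for x in xs:
        # 1. Propagate each particle through the dynamics.
        particles = [transition_sample(z, rng) for z in particles]
        # 2. Reweight by the observation likelihood and normalize.
        #    (Assumes at least one particle has nonzero likelihood.)
        weights = [obs_likelihood(x, z) for z in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        means.append(sum(w * z for w, z in zip(weights, particles)))
        # 3. Resample L particles with replacement, proportional to weight.
        particles = rng.choices(particles, weights=weights, k=L)
    return means, particles

# Hypothetical model: random-walk dynamics with a precise Gaussian sensor.
def init_sample(rng):
    return 0.0

def transition_sample(z, rng):
    return z + rng.gauss(0.0, 1.0)

def obs_likelihood(x, z):
    # Unnormalized Gaussian likelihood with standard deviation 0.1.
    return math.exp(-0.5 * ((x - z) / 0.1) ** 2)

means, particles = bootstrap_particle_filter(
    [0.5], 2000, init_sample, transition_sample, obs_likelihood)
```

Because this example model is linear-Gaussian, its exact posterior mean after one step is 0.5 / 1.01, which gives a reference point for the Monte Carlo estimate.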

From Kalman to particles: The Kalman filter is the exact solution for linear-Gaussian models. The extended Kalman filter (EKF) handles mild nonlinearity via local linearization, while the unscented Kalman filter (UKF) instead propagates a small set of deterministically chosen sigma points through the nonlinearity. Particle filters handle arbitrary nonlinearity and non-Gaussianity, but with cost O(L) per step and potential weight degeneracy in high dimensions. The hierarchy: Kalman (exact, cheap, linear) → EKF/UKF (approximate, cheap, mildly nonlinear) → particles (approximate, expensive, any nonlinearity).

Check: When are particle filters preferred over Kalman filters?

• When the dynamics or observations are nonlinear or non-Gaussian, where the Kalman filter's assumptions break down
• Always: particle filters are strictly better
• Only for discrete state spaces

Chapter 11: Summary

Model	Latent	Inference	Key application
Markov model	None (observed)	Direct	Language models, sequence statistics
HMM	Discrete	Forward-backward, Viterbi	Speech, bioinformatics, NLP
LDS / Kalman	Continuous Gaussian	Kalman filter/smoother	Tracking, control, navigation
Nonlinear SSM	Continuous	Particle filter	Robotics, nonlinear tracking

The state-space framework: All sequential models share the same structure: a latent state evolves over time according to transition dynamics, and we observe it through a noisy measurement process. The graphical model is always a chain with emissions. The differences are in the state space (discrete vs. continuous), the noise model (Gaussian vs. general), and the dynamics (linear vs. nonlinear). This framework unifies an enormous range of applications across engineering, science, and machine learning.

What comes next: Chapter 14, the final chapter, covers combining models — committees, boosting, decision trees, and mixtures of experts.

"The most widely used framework for treating sequential data
is the hidden Markov model."
— Christopher Bishop, PRML §13.2

Check: What unifying structure do HMMs, Kalman filters, and particle filters share?

• A state-space model: latent state evolves via transition dynamics, observations are noisy measurements of the state
• They all use discrete states
• They all require Gaussian noise