The framework behind early speech recognition, gene finding, and the conceptual ancestor of every sequence model in deep learning — from RNNs to Transformers.
A casino dealer has two dice: a fair die (each face equally likely) and a loaded die (heavily biased toward rolling a 6). The dealer secretly switches between them as the night goes on. You're sitting at the table. You can see every roll, but you never see which die is being used.
This is the fundamental HMM setup: there is a hidden state (which die) that you cannot observe directly, and an observation (the roll) that depends on that hidden state. Your goal: figure out the hidden sequence from the observations alone.
Watch the dealer switch between Fair and Loaded dice. You only see the rolls — can you guess which die is being used?
An HMM is fully specified by five objects. Think of it as a recipe: once you have these five ingredients, the entire model is defined and you can do all the inference you want.
Python import numpy as np N, M = 2, 6 # 2 states, 6 observation symbols A = np.array([[0.95, 0.05], # [N, N] transition matrix [0.10, 0.90]]) B = np.array([[1/6]*6, # [N, M] emission matrix [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]) pi = np.array([0.5, 0.5]) # [N] initial distribution assert A.shape == (N, N) assert B.shape == (N, M) assert np.allclose(A.sum(axis=1), 1) # each row sums to 1 # Sampling from the HMM (the generative story) def sample_hmm(A, B, pi, T): states, obs = [], [] s = np.random.choice(len(pi), p=pi) for t in range(T): states.append(s) obs.append(np.random.choice(B.shape[1], p=B[s])) s = np.random.choice(N, p=A[s]) return states, obs
The five components visualized as a graphical model. Hidden states on top, observations on bottom.
The transition matrix A captures the dynamics of the hidden states. A[i][j] is the probability of moving from state i to state j. Each row must sum to 1 (you have to go somewhere). In the casino: a high A[Fair][Fair] means the dealer tends to stick with the fair die.
The Markov property is the key assumption: the next state depends only on the current state, not on the entire history. The dealer's decision to switch dice depends only on which die they're currently using — not on all previous switches.
The stationary distribution tells you where the system spends its time in the long run. For our casino with A[F][F]=0.95, A[L][L]=0.90: the stationary distribution is π∞ = [2/3, 1/3]. The dealer uses the fair die twice as often as the loaded die. You find it by solving π · A = π, or equivalently: πFair = 0.10 / (0.05 + 0.10) = 2/3. This is useful for sanity-checking — if your Baum-Welch output gives a stationary distribution that doesn't match the observed state frequencies, something is wrong.
Adjust P(stay with Fair die) to see how the transition dynamics change. The complementary probabilities update automatically.
| To Fair | To Loaded | |
|---|---|---|
| From Fair | A[F][F] = 0.95 | A[F][L] = 0.05 |
| From Loaded | A[L][F] = 0.10 | A[L][L] = 0.90 |
The emission matrix B describes what you see given the hidden state. B[j][k] = P(observing k | in state j). For the fair die, B[Fair] is uniform: each face has probability 1/6. For the loaded die, B[Loaded] is heavily biased toward 6.
The emission distribution is what makes the hidden state indirectly observable. If you see many 6s in a row, the loaded die is more likely — but you can never be 100% sure, because the fair die can also produce 6s.
The emission likelihood ratio is what drives inference. When you observe a 6: P(6|Loaded)/P(6|Fair) = 0.5/(1/6) = 3.0. A 6 is 3 times more likely under the loaded die. When you observe a 3: P(3|Loaded)/P(3|Fair) = 0.1/(1/6) = 0.6. A 3 slightly favors the fair die. The larger the emission likelihood ratio, the more diagnostic an observation is. If both dice had identical emission distributions, observations would carry zero information about the hidden state — the problem would be unsolvable.
Adjust how biased the loaded die is toward rolling 6. Watch the probability bars shift.
The timesteps with observation 6 (positions 1, 2, 4, 6) are most likely Loaded because the likelihood ratio is P(6|Loaded)/P(6|Fair) = 0.5/(1/6) = 3.0 — a 6 is 3x more likely under Loaded. Position 3 (observation 3) slightly favors Fair: ratio = 0.1/(1/6) = 0.6. Position 5 (observation 1) also favors Fair with the same 0.6 ratio.
But the key insight is that isolated likelihood ratios aren't enough. The transition matrix creates temporal coherence — if the dealer tends to stay with the same die (A[L][L] = 0.90), then a cluster of 6s is much stronger evidence than scattered 6s. The sequence [6,6] is more convincing than two isolated 6s because the first 6 shifts your belief toward Loaded, and the high self-transition probability keeps it there. This is exactly what the Forward algorithm formalizes: it combines emission evidence with transition dynamics at every timestep.
Given an HMM, there are three fundamental questions you can ask. Everything in HMM theory boils down to solving one of these three:
| Problem | Question | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | P(O | λ) | Forward | O(N²T) |
| Decoding | argmax P(Q | O, λ) | Viterbi | O(N²T) |
| Learning | argmaxλ P(O | λ) | Baum-Welch | O(N²T) per iter |
Click each problem to see what question it answers and how information flows.
The Forward algorithm answers: how likely is this observation sequence, given our model? It builds a trellis — a grid where each column is a timestep and each row is a state. Each cell holds the probability of being in that state at that time, having seen all observations so far.
The recursion is elegant: to compute αt(j), sum over all states i at time t−1, multiply by the transition A[i][j] and the emission B[j][ot]:
Think of αt as a [N] vector — one entry per state. The recursion is a matrix-vector multiply followed by an element-wise multiply:
AT · αt-1 costs O(N²) — one matrix-vector product. The element-wise multiply with the emission column costs O(N). Over T timesteps, the total cost is O(N² · T). For a speech recognizer with N=5,000 phoneme states and a 10-second utterance (T=1,000 frames), that's 25 billion operations — manageable, but it shows why HMM state counts matter.
Click "Step" to advance through the trellis computation one column at a time. Watch how probabilities flow from left to right.
Python def forward(obs, A, B, pi): """Compute forward probabilities alpha[t][j].""" N = len(pi) # number of states T = len(obs) # sequence length alpha = [[0.0]*N for _ in range(T)] # Base case for j in range(N): alpha[0][j] = pi[j] * B[j][obs[0]] # Recursion for t in range(1, T): for j in range(N): alpha[t][j] = B[j][obs[t]] * sum( alpha[t-1][i] * A[i][j] for i in range(N)) return alpha, sum(alpha[T-1])
Python — Log-space version import math def logsumexp(a, b): """Numerically stable log(exp(a) + exp(b)).""" mx = max(a, b) return mx + math.log(1 + math.exp(-abs(a - b))) def forward_log(obs, A, B, pi): """Forward algorithm in log space — no underflow.""" N = len(pi) T = len(obs) logA = [[math.log(A[i][j]) for j in range(N)] for i in range(N)] logB = [[math.log(B[j][k]) for k in range(len(B[0]))] for j in range(N)] la = [math.log(pi[j]) + logB[j][obs[0]] for j in range(N)] for t in range(1, T): new_la = [] for j in range(N): terms = [la[i] + logA[i][j] for i in range(N)] val = terms[0] for k in range(1, N): val = logsumexp(val, terms[k]) new_la.append(val + logB[j][obs[t]]) la = new_la # Total log-probability total = la[0] for j in range(1, N): total = logsumexp(total, la[j]) return total # log P(O | model)
We define αt(j) = P(o1, o2, ..., ot, statet = j). We want to compute αt(j) from values at time t−1.
Your task: Starting from the definition of αt(j), use the law of total probability to derive the recursion αt(j) = B[j][ot] · ∑i αt-1(i) · A[i][j]. Explain why each factor appears.
Full derivation:
Start: αt(j) = P(o1, ..., ot, st=j)
Marginalize over previous state: = ∑i P(o1..t, st-1=i, st=j)
Chain rule: = ∑i P(o1..t-1, st-1=i) · P(st=j | st-1=i) · P(ot | st=j)
Substitute definitions: = ∑i αt-1(i) · A[i][j] · B[j][ot]
Factor out emission (doesn't depend on i): = B[j][ot] · ∑i αt-1(i) · A[i][j]
The key insight: The Markov property is what makes this recursion possible. Without it, P(st=j | st-1=i, o1..t-1) would depend on the entire history, and we couldn't factor the joint probability so cleanly. The emission independence assumption lets us pull B[j][ot] out of the sum. Together, they reduce an O(NT) sum-over-all-paths to an O(N2) sum at each timestep.
python def forward(obs, A, B, pi): N = len(pi) T = len(obs) alpha = [[0.0] * N for _ in range(T)] # Base case: first observation for j in range(N): alpha[0][j] = pi[j] * B[j][obs[0]] # Recursion: sum over all paths into state j for t in range(1, T): for j in range(N): alpha[t][j] = B[j][obs[t]] * sum( alpha[t-1][i] * A[i][j] for i in range(N) ) # Total probability: sum over final states prob = sum(alpha[T-1]) return alpha, prob
The Forward algorithm IS the Bayes filter, specialized to discrete states. The transition matrix A is the motion model. The emission matrix B is the sensor model. The α vector is the (unnormalized) belief. The only difference: the Bayes filter normalizes at each step (giving a posterior), while Forward accumulates the joint probability (giving P(observations AND state)). Divide αt by its sum and you get the Bayes filter posterior exactly.
In a Kalman filter, what plays the role of the transition matrix A? What plays the role of the emission matrix B? (Hint: they're both matrices, but they operate on continuous vectors.)
The Viterbi algorithm answers the most dramatic question: what is the single most likely sequence of hidden states? It's identical in structure to the Forward algorithm, but replaces the sum with a max and keeps backpointers to trace back the winning path.
At each cell, instead of summing all paths (Forward), we keep only the best path into each state. The key data structure is the backpointer array ψ — a [T × N] matrix where ψ[t][j] stores which previous state led to the best score for state j at time t. After filling the trellis, you find the best final state, then follow backpointers backward to recover the entire optimal path.
Roll dice and watch Viterbi decode which die the dealer is using in real time. The trellis builds as you go. Green = Fair, Red = Loaded.
Python def viterbi(obs, A, B, pi): """Find the most likely hidden state sequence.""" N = len(pi) T = len(obs) delta = [[0.0]*N for _ in range(T)] psi = [[0]*N for _ in range(T)] # Base case for j in range(N): delta[0][j] = pi[j] * B[j][obs[0]] # Recursion: max instead of sum for t in range(1, T): for j in range(N): best_i, best_v = 0, 0 for i in range(N): v = delta[t-1][i] * A[i][j] if v > best_v: best_v, best_i = v, i delta[t][j] = best_v * B[j][obs[t]] psi[t][j] = best_i # Backtrace path = [0] * T path[T-1] = max(range(N), key=lambda j: delta[T-1][j]) for t in range(T-2, -1, -1): path[t] = psi[t+1][path[t+1]] return path
The Forward algorithm computes αt(j) = P(o1..t, st=j) using a sum. Viterbi computes δt(j) = maxs1..t-1 P(o1..t, s1..t-1, st=j) using a max. Both look similar, but they answer fundamentally different questions.
Your task: Prove that the Viterbi path (obtained by backtracing from argmaxj δT(j)) is indeed the globally optimal path — the one that maximizes P(s1, ..., sT | o1, ..., oT). Why doesn't greedily picking the best state at each timestep (argmaxj γt(j)) give the same answer?
Proof of optimality:
We want: argmaxs1..T P(s1..T, o1..T)
Write the joint: P(s1..T, o1..T) = π(s1) · B[s1][o1] · ∏t=2..T A[st-1][st] · B[st][ot]
The max over all sequences decomposes recursively: maxs1..T = maxsT maxs1..T-1 because the max operator distributes over independent sub-problems (just like sum does in Forward, but max satisfies max(a·b, a·c) = a·max(b,c) for a>0).
This gives: δt(j) = B[j][ot] · maxi [δt-1(i) · A[i][j]]
At each step, we only keep the BEST predecessor for each state (stored in ψ[t][j]). The backtrace recovers the globally optimal path because the Bellman principle of optimality holds: any sub-path of an optimal path must itself be optimal.
The key insight: Viterbi works because the max operator, like addition, distributes over multiplication of positive numbers. This is why it's called the "max-product" algorithm (vs "sum-product" for Forward). The greedy approach fails because it ignores transition constraints — the marginal posterior γt(j) = P(st=j|O) doesn't encode whether consecutive states are reachable from each other.
What if you don't know the model parameters (A, B, π)? You only have observation sequences. The Baum-Welch algorithm learns the parameters from data using Expectation-Maximization (EM): iterate between estimating the hidden states (E-step) and updating the parameters (M-step).
Watch Baum-Welch learn the transition probability from data. Each iteration refines the estimate. The orange line is the learned value; the teal dashed line is the true value.
Let's walk through the M-step numerically. The transition re-estimation formula is:
where ξt(i,j) = αt(i) · A[i,j] · B[j,ot+1] · βt+1(j) / P(O). In plain English: the numerator counts how often we expected to transition from i to j, and the denominator counts how often we expected to be in state i at all. The ratio gives the new transition probability.
In the M-step of Baum-Welch, we update A[i][j] using soft counts from the E-step. The update rule is: Anew[i][j] = ∑t ξt(i,j) / ∑t γt(i).
Your task: Derive why this is the maximum likelihood estimate. Start from the expected complete-data log-likelihood and show that the optimal A[i][j] has this form. Why does it look like "expected count of transitions from i to j" divided by "expected count of being in state i"?
Full derivation:
The expected complete-data log-likelihood with respect to A (ignoring terms not involving A):
Q(A) = ∑t=1T-1 ∑i ∑j ξt(i,j) · log A[i][j]
Subject to: ∑j A[i][j] = 1 for each row i.
Lagrangian for row i: Li = ∑j cij log A[i][j] − λi(∑j A[i][j] − 1)
where cij = ∑t=1T-1 ξt(i,j)
Take derivative and set to zero: ∂Li/∂A[i][j] = cij/A[i][j] − λi = 0
So: A[i][j] = cij / λi
Apply constraint: ∑j cij/λi = 1, so λi = ∑j cij = ∑t γt(i)
Final answer: Anew[i][j] = ∑t ξt(i,j) / ∑t γt(i)
The key insight: This is just counting! In the fully-observed case, you'd estimate A[i][j] by counting transitions i→j and dividing by total time in state i. With hidden states, you can't count directly, so you use expected counts from the posterior. The same logic applies to B: expected emission counts / expected state occupancy. EM always has this structure: compute expected sufficient statistics, then maximize as if they were real counts.
The log-likelihood surface of an HMM has multiple modes due to label symmetry: swapping state 0 and state 1 (permuting the rows/columns of A and B) gives a different parameter setting with identical likelihood. For N states, there are N! equivalent permutations. Beyond this trivial symmetry, genuinely different explanations can fit the data: a model with A[F][F]=0.95 and B[L][6]=0.5 might fit the data as well as one with A[F][F]=0.80 and B[L][6]=0.7 (fewer switches but weaker emission signal). EM gets trapped in whichever basin its initialization falls into. This is why multiple random restarts are standard practice — run Baum-Welch 10-50 times with different initial parameters and keep the solution with highest final log-likelihood.
HMMs dominated sequence modeling from the 1970s through the 2000s. Then neural networks took over. Understanding this evolution reveals what each approach does well — and why Transformers ultimately won.
| Model | Era | Hidden State | Key Limitation |
|---|---|---|---|
| HMM | 1970s–2000s | Discrete, finite | Fixed number of states, Markov assumption |
| RNN/LSTM | 2010s | Continuous vector | Sequential processing, vanishing gradients |
| Transformer | 2017+ | Contextual embeddings | Quadratic attention cost |
The scaling wall. Training complexity is O(N² · T · iterations). For a speech recognizer with N=5,000 phoneme states, T=1,000 frames per utterance, and 100 EM iterations: that's 5000² × 1000 × 100 = 2.5 trillion operations per training utterance. Modern speech corpora have millions of utterances. HMMs weren't replaced because they're wrong — they were replaced because they can't handle the state space needed for modern tasks. A Transformer with 768-dimensional hidden states effectively has a continuous state space of infinite cardinality.
| System | States (N) | Seq Length (T) | Forward Pass Cost |
|---|---|---|---|
| Dishonest casino | 2 | 100 | 400 ops |
| POS tagger | ~45 | ~30 | ~61K ops |
| Gene finder | ~100 | ~10K | ~100M ops |
| Speech recognizer | ~5,000 | ~1,000 | ~25B ops |
| Large-vocab speech | ~50,000 | ~3,000 | ~7.5T ops |
From discrete hidden states to continuous representations. Each generation relaxes a constraint of the previous one.
The classic CMU/HTK design (circa 1995):
1. Topology: 3 states per phoneme (beginning/middle/end), strictly left-to-right with self-loops. No skips — they add too many transitions. ~40 phonemes × 3 = 120 base states, but context-dependent triphones expand this to ~5,000–10,000 tied states using decision-tree clustering.
2. Emissions: Continuous GMMs (Gaussian Mixture Models), typically 8–32 mixtures per state. A VQ codebook (discrete) saves RAM but loses ~15% relative accuracy. Each mixture has a 13-dim mean + diagonal covariance = 26 parameters. With 10K states × 16 mixtures = 160K Gaussians × 26 params = 4.2M float parameters ≈ 17 MB. Fits in budget.
3. Silence/breathing: A special 1-state or 3-state silence model with a very "flat" Gaussian (high variance). Word boundaries allow optional silence insertion. Variable word lengths are handled naturally: each word is a concatenation of phoneme HMMs, and self-loops absorb duration variation.
4. Tractability: Beam search (Viterbi with pruning). At each timestep, only keep states with δ within a beam width of the current best δ. Typical beam keeps ~500–2000 active states (of 420K). Combined with a language model (bigram/trigram) that prunes unlikely word sequences, decoding runs at 0.5–3x real-time.
HMMs don't exist in isolation. They're part of a rich family of models for reasoning about hidden states over time. Understanding these connections deepens your intuition for all of them.
How HMMs connect to other models. Each arrow relaxes a constraint or adds a capability.
The deep connection: both HMM decoding and Transformer attention solve the alignment problem — "which input positions are relevant to this output position?" The HMM does it with hard assignments via DP (one best path). Attention does it with soft weights via learned similarity (weighted combination of all positions). Early neural machine translation literally used attention to replace the HMM alignment model. The "attention is all you need" paper eliminated the recurrence (no sequential dependence), making alignment fully parallel — but the core operation (figure out which inputs matter for each output) is the same problem HMMs were solving in 1970.
In CTC (Connectionist Temporal Classification), the forward algorithm is used inside a neural network's loss function. How is this the literal fusion of HMM dynamic programming with neural emissions?
You now understand the mathematics of hidden states. Every modern sequence model — from LSTMs to Transformers — carries the DNA of the HMM inside it.