The Complete Beginner's Path

Understand Hidden
Markov Models

The framework behind early speech recognition, gene finding, and the conceptual ancestor of every sequence model in deep learning — from RNNs to Transformers.

Prerequisites: Basic probability + Intuition for sequences. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Hidden States?

A casino dealer has two dice: a fair die (each face equally likely) and a loaded die (heavily biased toward rolling a 6). The dealer secretly switches between them as the night goes on. You're sitting at the table. You can see every roll, but you never see which die is being used.

This is the fundamental HMM setup: there is a hidden state (which die) that you cannot observe directly, and an observation (the roll) that depends on that hidden state. Your goal: figure out the hidden sequence from the observations alone.

The core idea: The world has states you cannot see. Those states produce signals you can see. An HMM gives you the mathematical machinery to reason backward from observations to hidden states — to see through the veil.
The Dishonest Casino

Watch the dealer switch between Fair and Loaded dice. You only see the rolls — can you guess which die is being used?

Check: In the casino analogy, what is the "hidden state"?

Chapter 1: The HMM Tuple — (S, O, A, B, π)

An HMM is fully specified by five objects. Think of it as a recipe: once you have these five ingredients, the entire model is defined and you can do all the inference you want.

S — States
The set of hidden states. Casino: {Fair, Loaded}
O — Observations
What you can see. Casino: {1, 2, 3, 4, 5, 6}
A — Transition Matrix
P(next state | current state). How likely is a die switch?
B — Emission Matrix
P(observation | state). Each die's roll probabilities.
π — Initial Distribution
P(starting state). Which die does the dealer pick first?
Think of it this way: S and O define the vocabulary. A describes how hidden states evolve over time. B describes how hidden states produce observations. π tells you where you start. Together: a complete generative story.
Matrix shapes for implementation: Every HMM component has a concrete shape. Let N = number of states and M = number of observation symbols. Then:
• Transition matrix A is [N × N] — each row sums to 1.
• Emission matrix B is [N × M] — each row sums to 1.
• Initial distribution π is [N] — sums to 1.
For the dishonest casino: N=2 (Fair, Loaded), M=6 (dice faces 1–6). So A is [2×2], B is [2×6], π is [2]. That's 2+12+2 = 16 parameters to fully specify the model.
Python
import numpy as np

N, M = 2, 6    # 2 states, 6 observation symbols
A  = np.array([[0.95, 0.05],   # [N, N] transition matrix
               [0.10, 0.90]])
B  = np.array([[1/6]*6,           # [N, M] emission matrix
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
pi = np.array([0.5, 0.5])       # [N] initial distribution

assert A.shape == (N, N)
assert B.shape == (N, M)
assert np.allclose(A.sum(axis=1), 1)  # each row sums to 1

# Sampling from the HMM (the generative story)
def sample_hmm(A, B, pi, T):
    states, obs = [], []
    s = np.random.choice(len(pi), p=pi)
    for t in range(T):
        states.append(s)
        obs.append(np.random.choice(B.shape[1], p=B[s]))
        s = np.random.choice(N, p=A[s])
    return states, obs
HMM Structure

The five components visualized as a graphical model. Hidden states on top, observations on bottom.

Check: Which component describes how likely the system is to switch from one hidden state to another?

Chapter 2: Transitions — A[i][j]

The transition matrix A captures the dynamics of the hidden states. A[i][j] is the probability of moving from state i to state j. Each row must sum to 1 (you have to go somewhere). In the casino: a high A[Fair][Fair] means the dealer tends to stick with the fair die.

A[i][j] = P(statet+1 = j | statet = i)      ∑j A[i][j] = 1

The Markov property is the key assumption: the next state depends only on the current state, not on the entire history. The dealer's decision to switch dice depends only on which die they're currently using — not on all previous switches.

Why "Markov"? Named after Andrey Markov, who studied chains of random events where the future depends only on the present. "Given the present, the future is independent of the past." This memoryless property makes HMMs tractable.

The stationary distribution tells you where the system spends its time in the long run. For our casino with A[F][F]=0.95, A[L][L]=0.90: the stationary distribution is π = [2/3, 1/3]. The dealer uses the fair die twice as often as the loaded die. You find it by solving π · A = π, or equivalently: πFair = 0.10 / (0.05 + 0.10) = 2/3. This is useful for sanity-checking — if your Baum-Welch output gives a stationary distribution that doesn't match the observed state frequencies, something is wrong.

Interactive Transition Matrix

Adjust P(stay with Fair die) to see how the transition dynamics change. The complementary probabilities update automatically.

P(Fair → Fair)0.95
P(Loaded → Loaded)0.90
To FairTo Loaded
From FairA[F][F] = 0.95A[F][L] = 0.05
From LoadedA[L][F] = 0.10A[L][L] = 0.90
Check: What does the Markov property assume?

Chapter 3: Emissions — B[j][k]

The emission matrix B describes what you see given the hidden state. B[j][k] = P(observing k | in state j). For the fair die, B[Fair] is uniform: each face has probability 1/6. For the loaded die, B[Loaded] is heavily biased toward 6.

B[j][k] = P(observationt = k | statet = j)      ∑k B[j][k] = 1

The emission distribution is what makes the hidden state indirectly observable. If you see many 6s in a row, the loaded die is more likely — but you can never be 100% sure, because the fair die can also produce 6s.

Fair vs Loaded: Fair die: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]. Loaded die: [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]. The loaded die rolls a 6 half the time — but it can still roll other numbers.

The emission likelihood ratio is what drives inference. When you observe a 6: P(6|Loaded)/P(6|Fair) = 0.5/(1/6) = 3.0. A 6 is 3 times more likely under the loaded die. When you observe a 3: P(3|Loaded)/P(3|Fair) = 0.1/(1/6) = 0.6. A 3 slightly favors the fair die. The larger the emission likelihood ratio, the more diagnostic an observation is. If both dice had identical emission distributions, observations would carry zero information about the hidden state — the problem would be unsolvable.

Emission Distributions

Adjust how biased the loaded die is toward rolling 6. Watch the probability bars shift.

Loaded P(6)0.50
Check: If you observe a 6, can you be certain the loaded die was used?
Checkpoint — Before you move on
You observe the sequence [6, 6, 3, 6, 1, 6]. Without running any algorithm, explain intuitively which timesteps are most likely to be the loaded die and WHY. What role does the emission likelihood ratio P(obs|Loaded)/P(obs|Fair) play in your reasoning?
✓ Gate cleared
Model Answer

The timesteps with observation 6 (positions 1, 2, 4, 6) are most likely Loaded because the likelihood ratio is P(6|Loaded)/P(6|Fair) = 0.5/(1/6) = 3.0 — a 6 is 3x more likely under Loaded. Position 3 (observation 3) slightly favors Fair: ratio = 0.1/(1/6) = 0.6. Position 5 (observation 1) also favors Fair with the same 0.6 ratio.

But the key insight is that isolated likelihood ratios aren't enough. The transition matrix creates temporal coherence — if the dealer tends to stay with the same die (A[L][L] = 0.90), then a cluster of 6s is much stronger evidence than scattered 6s. The sequence [6,6] is more convincing than two isolated 6s because the first 6 shifts your belief toward Loaded, and the high self-transition probability keeps it there. This is exactly what the Forward algorithm formalizes: it combines emission evidence with transition dynamics at every timestep.

Chapter 4: The Three Problems

Given an HMM, there are three fundamental questions you can ask. Everything in HMM theory boils down to solving one of these three:

1. Evaluation (Forward Algorithm)
What is P(observations | model)? How well does this model explain the data?
2. Decoding (Viterbi Algorithm)
What is the most likely sequence of hidden states given the observations?
3. Learning (Baum-Welch / EM)
How do I learn A, B, π from data when the states are truly hidden?
ProblemQuestionAlgorithmComplexity
EvaluationP(O | λ)ForwardO(N²T)
Decodingargmax P(Q | O, λ)ViterbiO(N²T)
Learningargmaxλ P(O | λ)Baum-WelchO(N²T) per iter
N = number of states, T = length of observation sequence. Without dynamic programming, evaluation alone would be O(NT) — exponential in sequence length. The Forward and Viterbi algorithms make HMMs practical by reducing this to O(N²T).
The Three Problems Visualized

Click each problem to see what question it answers and how information flows.

Check: Which algorithm finds the most likely sequence of hidden states?

Chapter 5: The Forward Algorithm

The Forward algorithm answers: how likely is this observation sequence, given our model? It builds a trellis — a grid where each column is a timestep and each row is a state. Each cell holds the probability of being in that state at that time, having seen all observations so far.

αt(j) = P(o1, o2, ..., ot, statet = j | model)

The recursion is elegant: to compute αt(j), sum over all states i at time t−1, multiply by the transition A[i][j] and the emission B[j][ot]:

αt(j) = B[j][ot] · ∑i αt-1(i) · A[i][j]
Base case: α1(j) = π(j) · B[j][o1]. Start with the initial probability of each state times the probability of emitting the first observation. Then recurse forward through time.

Think of αt as a [N] vector — one entry per state. The recursion is a matrix-vector multiply followed by an element-wise multiply:

αt = B[:, ot] ⊙ (AT · αt-1)

AT · αt-1 costs O(N²) — one matrix-vector product. The element-wise multiply with the emission column costs O(N). Over T timesteps, the total cost is O(N² · T). For a speech recognizer with N=5,000 phoneme states and a 10-second utterance (T=1,000 frames), that's 25 billion operations — manageable, but it shows why HMM state counts matter.

The log-space trick: In practice, α values underflow to 0 for long sequences because you're multiplying many small probabilities. Solution: work in log space. Instead of α, track log(α). The product becomes a sum, and the sum-over-states becomes a logsumexp:

log(αt(j)) = log(B[j,ot]) + logsumexpi(log(αt-1(i)) + log(A[i,j]))

where logsumexp(a, b) = max(a,b) + log(1 + exp(−|a−b|)). This keeps everything numerically stable even for sequences of length 10,000+.
Interactive Forward Trellis

Click "Step" to advance through the trellis computation one column at a time. Watch how probabilities flow from left to right.

Step 0 / 6
Python
def forward(obs, A, B, pi):
    """Compute forward probabilities alpha[t][j]."""
    N = len(pi)            # number of states
    T = len(obs)            # sequence length
    alpha = [[0.0]*N for _ in range(T)]

    # Base case
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]

    # Recursion
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t-1][i] * A[i][j] for i in range(N))

    return alpha, sum(alpha[T-1])
Python — Log-space version
import math

def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    mx = max(a, b)
    return mx + math.log(1 + math.exp(-abs(a - b)))

def forward_log(obs, A, B, pi):
    """Forward algorithm in log space — no underflow."""
    N = len(pi)
    T = len(obs)
    logA = [[math.log(A[i][j]) for j in range(N)] for i in range(N)]
    logB = [[math.log(B[j][k]) for k in range(len(B[0]))] for j in range(N)]
    la = [math.log(pi[j]) + logB[j][obs[0]] for j in range(N)]

    for t in range(1, T):
        new_la = []
        for j in range(N):
            terms = [la[i] + logA[i][j] for i in range(N)]
            val = terms[0]
            for k in range(1, N):
                val = logsumexp(val, terms[k])
            new_la.append(val + logB[j][obs[t]])
        la = new_la

    # Total log-probability
    total = la[0]
    for j in range(1, N):
        total = logsumexp(total, la[j])
    return total  # log P(O | model)
Check: What does αt(j) represent?
🔨 Derivation Derive the Forward recursion from first principles ✓ ATTEMPTED

We define αt(j) = P(o1, o2, ..., ot, statet = j). We want to compute αt(j) from values at time t−1.

Your task: Starting from the definition of αt(j), use the law of total probability to derive the recursion αt(j) = B[j][ot] · ∑i αt-1(i) · A[i][j]. Explain why each factor appears.

Write αt(j) = ∑i P(o1..t, statet-1=i, statet=j). This marginalizes over all possible previous states. Now apply the chain rule to factor this joint probability.
The Markov property gives P(statet=j | statet-1=i, o1..t-1) = P(statet=j | statet-1=i) = A[i][j]. The emission independence gives P(ot | statet=j, everything else) = P(ot | statet=j) = B[j][ot].
After factoring: P(o1..t-1, statet-1=i) · A[i][j] · B[j][ot]. The first term IS αt-1(i) by definition. Now factor out what doesn't depend on i.

Full derivation:

Start: αt(j) = P(o1, ..., ot, st=j)

Marginalize over previous state: = ∑i P(o1..t, st-1=i, st=j)

Chain rule: = ∑i P(o1..t-1, st-1=i) · P(st=j | st-1=i) · P(ot | st=j)

Substitute definitions: = ∑i αt-1(i) · A[i][j] · B[j][ot]

Factor out emission (doesn't depend on i): = B[j][ot] · ∑i αt-1(i) · A[i][j]

The key insight: The Markov property is what makes this recursion possible. Without it, P(st=j | st-1=i, o1..t-1) would depend on the entire history, and we couldn't factor the joint probability so cleanly. The emission independence assumption lets us pull B[j][ot] out of the sum. Together, they reduce an O(NT) sum-over-all-paths to an O(N2) sum at each timestep.

💻 Build It Implement the Forward Algorithm from Scratch ✓ ATTEMPTED
You've seen the recursion. Now implement it yourself. Given a sequence of observations and HMM parameters, compute the forward probabilities α[t][j] for all timesteps and states, and return the total sequence probability P(O|model).
signature def forward(obs: list[int], A: list[list[float]], B: list[list[float]], pi: list[float]) -> tuple[list[list[float]], float]: """ Compute forward probabilities for an HMM. Args: obs: observation sequence (T integers, each in 0..M-1) A: transition matrix [N x N], A[i][j] = P(state j | state i) B: emission matrix [N x M], B[j][k] = P(obs k | state j) pi: initial distribution [N], pi[j] = P(start in state j) Returns: alpha: [T x N] forward probabilities prob: P(obs | model) = sum of alpha[T-1] """
Test case
A = [[0.95, 0.05], [0.10, 0.90]]
B = [[1/6]*6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]
pi = [0.5, 0.5]
obs = [5, 5, 2]   # observations: 6, 6, 3

Expected: alpha[0] = [0.0833, 0.2500]
alpha[1] ≈ [0.0147, 0.1145]
P(O|model) ≈ sum(alpha[2]) ≈ 0.0076
For each timestep t from 1 to T-1, for each state j, you need to sum over ALL previous states i: alpha[t-1][i] * A[i][j]. Then multiply by B[j][obs[t]]. The tricky part is getting the base case right: alpha[0][j] = pi[j] * B[j][obs[0]].
python
def forward(obs, A, B, pi):
    N = len(pi)
    T = len(obs)
    alpha = [[0.0] * N for _ in range(T)]

    # Base case: first observation
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]

    # Recursion: sum over all paths into state j
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t-1][i] * A[i][j]
                for i in range(N)
            )

    # Total probability: sum over final states
    prob = sum(alpha[T-1])
    return alpha, prob
Bonus challenge: Modify this to work in log-space. Replace multiplication with addition and the sum-over-states with logsumexp. This prevents underflow for sequences longer than ~50 timesteps.
🔗 Pattern Recognition
Forward Algorithm = Bayes Predict-Update on Discrete States
HMM Forward
αt(j) = B[j][ot] · ∑i αt-1(i) · A[i][j]
Predict: propagate through transitions
Update: multiply by emission likelihood
Bayes Filter
bel(xt) = η · P(zt|xt) · ∑x P(xt|xt-1) · bel(xt-1)
Predict: propagate through motion model
Update: multiply by sensor likelihood
Bayes Filter lesson

The Forward algorithm IS the Bayes filter, specialized to discrete states. The transition matrix A is the motion model. The emission matrix B is the sensor model. The α vector is the (unnormalized) belief. The only difference: the Bayes filter normalizes at each step (giving a posterior), while Forward accumulates the joint probability (giving P(observations AND state)). Divide αt by its sum and you get the Bayes filter posterior exactly.

In a Kalman filter, what plays the role of the transition matrix A? What plays the role of the emission matrix B? (Hint: they're both matrices, but they operate on continuous vectors.)

Chapter 6: Viterbi Decoding — The Showcase

The Viterbi algorithm answers the most dramatic question: what is the single most likely sequence of hidden states? It's identical in structure to the Forward algorithm, but replaces the sum with a max and keeps backpointers to trace back the winning path.

δt(j) = maxi [ δt-1(i) · A[i][j] ] · B[j][ot]

At each cell, instead of summing all paths (Forward), we keep only the best path into each state. The key data structure is the backpointer array ψ — a [T × N] matrix where ψ[t][j] stores which previous state led to the best score for state j at time t. After filling the trellis, you find the best final state, then follow backpointers backward to recover the entire optimal path.

The backpointer traceback: Start at the end: pick state* = argmaxj δT(j). Then walk backward: state at t = ψ[t+1][state at t+1]. This costs O(T) — you just follow the arrows. The dynamic programming trellis is O(N² · T) to fill, so the traceback is essentially free.
Forward vs Viterbi: Forward sums over all paths (total probability). Viterbi takes the max over all paths (best single path). The trellis structure is identical — only the aggregation changes.
The Dishonest Casino — Live Viterbi Decoding

Roll dice and watch Viterbi decode which die the dealer is using in real time. The trellis builds as you go. Green = Fair, Red = Loaded.

Rolls: 0 | Accuracy: --
P(switch die)0.05
Loaded P(6)0.50
Python
def viterbi(obs, A, B, pi):
    """Find the most likely hidden state sequence."""
    N = len(pi)
    T = len(obs)
    delta = [[0.0]*N for _ in range(T)]
    psi   = [[0]*N for _ in range(T)]

    # Base case
    for j in range(N):
        delta[0][j] = pi[j] * B[j][obs[0]]

    # Recursion: max instead of sum
    for t in range(1, T):
        for j in range(N):
            best_i, best_v = 0, 0
            for i in range(N):
                v = delta[t-1][i] * A[i][j]
                if v > best_v:
                    best_v, best_i = v, i
            delta[t][j] = best_v * B[j][obs[t]]
            psi[t][j] = best_i

    # Backtrace
    path = [0] * T
    path[T-1] = max(range(N), key=lambda j: delta[T-1][j])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1][path[t+1]]
    return path
Check: How does Viterbi differ from the Forward algorithm?
🔨 Derivation Why does replacing sum with max give the best path? ✓ ATTEMPTED

The Forward algorithm computes αt(j) = P(o1..t, st=j) using a sum. Viterbi computes δt(j) = maxs1..t-1 P(o1..t, s1..t-1, st=j) using a max. Both look similar, but they answer fundamentally different questions.

Your task: Prove that the Viterbi path (obtained by backtracing from argmaxj δT(j)) is indeed the globally optimal path — the one that maximizes P(s1, ..., sT | o1, ..., oT). Why doesn't greedily picking the best state at each timestep (argmaxj γt(j)) give the same answer?

δT(j) = max over ALL possible paths s1..T-1 of P(o1..T, s1..T-1, sT=j). So maxj δT(j) is the probability of the single best complete path through the entire trellis. The max has already considered all NT-1 possible predecessors at each step.
The greedy approach picks argmaxj γt(j) independently at each timestep. But γt(j) = P(st=j | O) marginalizes over all paths, so the "best state" at time t might connect to the "best state" at time t+1 via a zero-probability transition. The greedy path may be individually optimal at each step but jointly impossible.
Consider: γ1 = [0.6, 0.4], γ2 = [0.4, 0.6]. Greedy picks path [0, 1]. But if A[0][1] = 0 (impossible transition), this path has zero probability. The Viterbi path might be [0, 0] or [1, 1] instead.

Proof of optimality:

We want: argmaxs1..T P(s1..T, o1..T)

Write the joint: P(s1..T, o1..T) = π(s1) · B[s1][o1] · ∏t=2..T A[st-1][st] · B[st][ot]

The max over all sequences decomposes recursively: maxs1..T = maxsT maxs1..T-1 because the max operator distributes over independent sub-problems (just like sum does in Forward, but max satisfies max(a·b, a·c) = a·max(b,c) for a>0).

This gives: δt(j) = B[j][ot] · maxit-1(i) · A[i][j]]

At each step, we only keep the BEST predecessor for each state (stored in ψ[t][j]). The backtrace recovers the globally optimal path because the Bellman principle of optimality holds: any sub-path of an optimal path must itself be optimal.

The key insight: Viterbi works because the max operator, like addition, distributes over multiplication of positive numbers. This is why it's called the "max-product" algorithm (vs "sum-product" for Forward). The greedy approach fails because it ignores transition constraints — the marginal posterior γt(j) = P(st=j|O) doesn't encode whether consecutive states are reachable from each other.

⚔ Adversarial: Your Viterbi path goes through a state with zero emission probability for the observed symbol. How is this possible?
You have a 3-state HMM. State C has B[C][x] = 0 for observation x. Yet after running Viterbi on a sequence containing x at timestep t, your decoded path has statet = C. Your code has no bugs. What happened?

Chapter 7: Baum-Welch (EM)

What if you don't know the model parameters (A, B, π)? You only have observation sequences. The Baum-Welch algorithm learns the parameters from data using Expectation-Maximization (EM): iterate between estimating the hidden states (E-step) and updating the parameters (M-step).

E-Step: Estimate Hidden States
Run Forward-Backward to compute P(statet = j | observations). Uses current A, B, π.
M-Step: Update Parameters
Re-estimate A, B, π using the soft state assignments from E-step. Each parameter is a weighted average.
↓ repeat until convergence
The Backward variable: βt(i) = P(ot+1, ..., oT | statet = i). Combined with the forward variable: γt(i) = αt(i) · βt(i) / P(O) gives the posterior probability of being in state i at time t.
γt(i) = P(statet = i | O, model) = αt(i) · βt(i) / P(O)
EM Convergence

Watch Baum-Welch learn the transition probability from data. Each iteration refines the estimate. The orange line is the learned value; the teal dashed line is the true value.

Iteration: 0 | A[F][F]: 0.50

Let's walk through the M-step numerically. The transition re-estimation formula is:

Anew[i,j] = ∑t ξt(i,j) / ∑t γt(i)

where ξt(i,j) = αt(i) · A[i,j] · B[j,ot+1] · βt+1(j) / P(O). In plain English: the numerator counts how often we expected to transition from i to j, and the denominator counts how often we expected to be in state i at all. The ratio gives the new transition probability.

Worked example: Suppose after the E-step, the expected counts are: transitions Fair→Fair = 142.3, Fair→Loaded = 7.7, total time in Fair = 150. Then Anew[F,F] = 142.3 / 150 = 0.949 and Anew[F,L] = 7.7 / 150 = 0.051. Compare the true values: 0.95 and 0.05. One EM iteration already gets close. The emission matrix updates follow the same logic: Bnew[j,k] = (expected count of observing k in state j) / (expected total time in state j).
Guaranteed to converge? Yes — EM always increases the likelihood (or keeps it the same). But it finds a local maximum, not necessarily the global one. Multiple random restarts help. In practice, Baum-Welch works remarkably well.
Check: What does the E-step of Baum-Welch compute?
🔨 Derivation Derive the M-step update for the transition matrix ✓ ATTEMPTED

In the M-step of Baum-Welch, we update A[i][j] using soft counts from the E-step. The update rule is: Anew[i][j] = ∑t ξt(i,j) / ∑t γt(i).

Your task: Derive why this is the maximum likelihood estimate. Start from the expected complete-data log-likelihood and show that the optimal A[i][j] has this form. Why does it look like "expected count of transitions from i to j" divided by "expected count of being in state i"?

The complete-data log-likelihood (if you knew the states) is: L = ∑t log A[st][st+1] + ∑t log B[st][ot] + log π[s1]. The expected version replaces indicator functions with posteriors: E[L] = ∑ti,j ξt(i,j) · log A[i][j] + ...
For row i of A, maximize ∑j cij · log A[i][j] subject to ∑j A[i][j] = 1, where cij = ∑t ξt(i,j). Use a Lagrange multiplier λ for the constraint.
Set ∂/∂A[i][j] of [∑j cij log A[i][j] − λ(∑j A[i][j] − 1)] = 0. This gives cij/A[i][j] = λ. So A[i][j] = cij/λ. Use the constraint ∑j A[i][j] = 1 to find λ = ∑j cij.

Full derivation:

The expected complete-data log-likelihood with respect to A (ignoring terms not involving A):

Q(A) = ∑t=1T-1ij ξt(i,j) · log A[i][j]

Subject to: ∑j A[i][j] = 1 for each row i.

Lagrangian for row i: Li = ∑j cij log A[i][j] − λi(∑j A[i][j] − 1)

where cij = ∑t=1T-1 ξt(i,j)

Take derivative and set to zero: ∂Li/∂A[i][j] = cij/A[i][j] − λi = 0

So: A[i][j] = cij / λi

Apply constraint: ∑j ciji = 1, so λi = ∑j cij = ∑t γt(i)

Final answer: Anew[i][j] = ∑t ξt(i,j) / ∑t γt(i)

The key insight: This is just counting! In the fully-observed case, you'd estimate A[i][j] by counting transitions i→j and dividing by total time in state i. With hidden states, you can't count directly, so you use expected counts from the posterior. The same logic applies to B: expected emission counts / expected state occupancy. EM always has this structure: compute expected sufficient statistics, then maximize as if they were real counts.

💥 Break-It Lab What Dies When You Break the HMM? ✓ ATTEMPTED
A working dishonest casino HMM: A[F][F]=0.95, A[L][L]=0.90, B[Loaded][6]=0.5, π=[0.5, 0.5]. The Viterbi decoder correctly identifies loaded-die segments. Now let's break things.
Wrong initial distribution (π = [1.0, 0.0]) ACTIVE
Failure mode: The model is absolutely certain it starts in Fair. For the first few timesteps, it refuses to decode Loaded even with strong evidence (runs of 6s), because the prior is infinitely confident. The effect fades as the transition matrix eventually allows switching — but the first ~5 timesteps are wrong. This shows how initial conditions create a "burn-in" period where the model's beliefs are dominated by its prior rather than evidence.
Identical emissions (B[Loaded] = B[Fair] = uniform) ACTIVE
Failure mode: Observations carry ZERO information about the hidden state. The emission likelihood ratio is 1.0 for every observation. Viterbi degrades to following the transition matrix's stationary distribution — it always predicts Fair (the more common state) regardless of what it observes. Accuracy drops to ~67% (just predicting the majority class). The HMM is blind.
Zero transition (A[F][L] = 0, stuck in Fair) ACTIVE
Failure mode: Once in Fair, the model can NEVER transition to Loaded. If π starts you in Fair, the Viterbi path stays in Fair forever — even when seeing 20 consecutive 6s. A zero transition is an absolute barrier. In practice, this creates an absorbing state. This is why Baum-Welch can get stuck in local optima: if any A[i][j] reaches zero during learning, it can never recover (the expected count ξt(i,j) will also be zero).
Checkpoint — Before you move on
You've now seen all three HMM algorithms. In your own words: why is Baum-Welch's convergence only guaranteed to a LOCAL maximum? Construct a scenario where two very different parameter settings (different A, B matrices) could both explain the same observation sequence reasonably well.
✓ Gate cleared
Model Answer

The log-likelihood surface of an HMM has multiple modes due to label symmetry: swapping state 0 and state 1 (permuting the rows/columns of A and B) gives a different parameter setting with identical likelihood. For N states, there are N! equivalent permutations. Beyond this trivial symmetry, genuinely different explanations can fit the data: a model with A[F][F]=0.95 and B[L][6]=0.5 might fit the data as well as one with A[F][F]=0.80 and B[L][6]=0.7 (fewer switches but weaker emission signal). EM gets trapped in whichever basin its initialization falls into. This is why multiple random restarts are standard practice — run Baum-Welch 10-50 times with different initial parameters and keep the solution with highest final log-likelihood.

Chapter 8: HMMs vs Modern Models

HMMs dominated sequence modeling from the 1970s through the 2000s. Then neural networks took over. Understanding this evolution reveals what each approach does well — and why Transformers ultimately won.

ModelEraHidden StateKey Limitation
HMM1970s–2000sDiscrete, finiteFixed number of states, Markov assumption
RNN/LSTM2010sContinuous vectorSequential processing, vanishing gradients
Transformer2017+Contextual embeddingsQuadratic attention cost
Speech recognition: HMMs powered every major speech system for 30 years (Dragon, Siri v1, Google Voice). Each phoneme was an HMM state. Words were sequences of phoneme-HMMs. This worked — until deep learning produced a 30% error reduction overnight.
POS tagging: In NLP, HMMs were the standard for part-of-speech tagging. States = tags (noun, verb, adjective...), observations = words. "The dog runs" → [DET, NOUN, VERB]. Trigram HMMs achieved ~96% accuracy — a bar that stood for years.
What breaks — the first-order Markov assumption: HMMs assume the current state depends only on the previous state. For speech, this means a phoneme depends only on the immediately preceding phoneme — not on the word, the sentence, or the topic. For gene-finding, it means a codon depends only on the previous codon. This is clearly wrong for any task requiring long-range context. Neural models won precisely because they can condition on entire histories.

The scaling wall. Training complexity is O(N² · T · iterations). For a speech recognizer with N=5,000 phoneme states, T=1,000 frames per utterance, and 100 EM iterations: that's 5000² × 1000 × 100 = 2.5 trillion operations per training utterance. Modern speech corpora have millions of utterances. HMMs weren't replaced because they're wrong — they were replaced because they can't handle the state space needed for modern tasks. A Transformer with 768-dimensional hidden states effectively has a continuous state space of infinite cardinality.

SystemStates (N)Seq Length (T)Forward Pass Cost
Dishonest casino2100400 ops
POS tagger~45~30~61K ops
Gene finder~100~10K~100M ops
Speech recognizer~5,000~1,000~25B ops
Large-vocab speech~50,000~3,000~7.5T ops
Evolution of Sequence Models

From discrete hidden states to continuous representations. Each generation relaxes a constraint of the previous one.

Check: What is the primary limitation of HMMs compared to RNNs?
🏗 Design Challenge You're the Architect: Design an HMM for Speech Recognition ✓ ATTEMPTED
It's 1995. You're building a dictation system that must recognize 20,000 English words from audio. Your input is a stream of MFCC feature vectors (13 coefficients, extracted every 10ms). Your output is the most likely word sequence. You must design the HMM topology.
Vocabulary
20,000 words
Input features
13-dim MFCC @ 100 Hz
RAM budget
64 MB (1995 workstation)
Latency target
<2x real-time decoding
Training data
100 hours of transcribed speech
1. How many states per phoneme? (3? 5? 10?) What topology — left-to-right only, or allow self-loops and skips?
2. Emission model: discrete (VQ codebook) or continuous (Gaussian mixtures)? How many Gaussians per state?
3. How do you handle silence, breathing, and variable word lengths?
4. With 20,000 words averaging 7 phonemes × 3 states = ~420,000 total states, how do you keep the Viterbi trellis tractable?

The classic CMU/HTK design (circa 1995):

1. Topology: 3 states per phoneme (beginning/middle/end), strictly left-to-right with self-loops. No skips — they add too many transitions. ~40 phonemes × 3 = 120 base states, but context-dependent triphones expand this to ~5,000–10,000 tied states using decision-tree clustering.

2. Emissions: Continuous GMMs (Gaussian Mixture Models), typically 8–32 mixtures per state. A VQ codebook (discrete) saves RAM but loses ~15% relative accuracy. Each mixture has a 13-dim mean + diagonal covariance = 26 parameters. With 10K states × 16 mixtures = 160K Gaussians × 26 params = 4.2M float parameters ≈ 17 MB. Fits in budget.

3. Silence/breathing: A special 1-state or 3-state silence model with a very "flat" Gaussian (high variance). Word boundaries allow optional silence insertion. Variable word lengths are handled naturally: each word is a concatenation of phoneme HMMs, and self-loops absorb duration variation.

4. Tractability: Beam search (Viterbi with pruning). At each timestep, only keep states with δ within a beam width of the current best δ. Typical beam keeps ~500–2000 active states (of 420K). Combined with a language model (bigram/trigram) that prunes unlikely word sequences, decoding runs at 0.5–3x real-time.

Chapter 9: Connections

HMMs don't exist in isolation. They're part of a rich family of models for reasoning about hidden states over time. Understanding these connections deepens your intuition for all of them.

Bayes Filter
The general framework: predict → update with Bayes' rule. HMMs are a special case where states and observations are discrete.
↓ continuous states
Kalman Filter
A "continuous HMM" with linear-Gaussian assumptions. States are real-valued vectors. The math is the same predict-update cycle.
↓ actions + rewards
Reinforcement Learning (POMDPs)
Add actions and rewards to an HMM. The agent must act optimally despite hidden states. Belief-state planning.
↓ learned representations
Deep Sequence Models
Replace discrete states with neural hidden representations. The forward pass is a learned version of the HMM recursion.
The key insight: HMM, Kalman filter, particle filter, RNN — they all share the same skeleton: maintain a belief about hidden states, update it as new observations arrive. The difference is the representation of that belief and how updates are computed.
The Family Tree

How HMMs connect to other models. Each arrow relaxes a constraint or adds a capability.

Check: A Kalman filter can be seen as...
🔗 Pattern Recognition
Attention = Soft Viterbi Alignment
HMM (Viterbi)
For each output position, find the BEST (hardmax) alignment to input positions via dynamic programming.
Score = transition + emission (log-additive).
Result: hard 1-to-1 alignment path.
Transformer (Attention)
For each output position, compute a SOFT (softmax) weighting over all input positions.
Score = Q · KT / √d (learned similarity).
Result: soft many-to-many alignment weights.
Transformer lesson

The deep connection: both HMM decoding and Transformer attention solve the alignment problem — "which input positions are relevant to this output position?" The HMM does it with hard assignments via DP (one best path). Attention does it with soft weights via learned similarity (weighted combination of all positions). Early neural machine translation literally used attention to replace the HMM alignment model. The "attention is all you need" paper eliminated the recurrence (no sequential dependence), making alignment fully parallel — but the core operation (figure out which inputs matter for each output) is the same problem HMMs were solving in 1970.

In CTC (Connectionist Temporal Classification), the forward algorithm is used inside a neural network's loss function. How is this the literal fusion of HMM dynamic programming with neural emissions?

"The HMM is one of the most important statistical tools for modeling sequences. Master it, and you hold the key to speech, language, biology, and beyond."
— Lawrence Rabiner

You now understand the mathematics of hidden states. Every modern sequence model — from LSTMs to Transformers — carries the DNA of the HMM inside it.