microHMM — Hidden Markov Models from Scratch

Chapter 1: The HMM Tuple — (S, O, A, B, π)

An HMM is fully specified by five objects. Think of it as a recipe: once you have these five ingredients, the entire model is defined and you can do all the inference you want.

S — States

The set of hidden states. Casino: {Fair, Loaded}

↓

O — Observations

What you can see. Casino: {1, 2, 3, 4, 5, 6}

↓

A — Transition Matrix

P(next state | current state). How likely is a die switch?

↓

B — Emission Matrix

P(observation | state). Each die's roll probabilities.

↓

π — Initial Distribution

P(starting state). Which die does the dealer pick first?

Think of it this way: S and O define the vocabulary. A describes how hidden states evolve over time. B describes how hidden states produce observations. π tells you where you start. Together: a complete generative story.

Matrix shapes for implementation: Every HMM component has a concrete shape. Let N = number of states and M = number of observation symbols. Then:
• Transition matrix A is [N × N] — each row sums to 1.
• Emission matrix B is [N × M] — each row sums to 1.
• Initial distribution π is [N] — sums to 1.
For the dishonest casino: N=2 (Fair, Loaded), M=6 (dice faces 1–6). So A is [2×2], B is [2×6], π is [2]. That's 2+12+2 = 16 parameters to fully specify the model.

Python
import numpy as np

N, M = 2, 6    # 2 states, 6 observation symbols
A  = np.array([[0.95, 0.05],   # [N, N] transition matrix
               [0.10, 0.90]])
B  = np.array([[1/6]*6,           # [N, M] emission matrix
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])
pi = np.array([0.5, 0.5])       # [N] initial distribution

assert A.shape == (N, N)
assert B.shape == (N, M)
assert np.allclose(A.sum(axis=1), 1)  # each row sums to 1

# Sampling from the HMM (the generative story)
def sample_hmm(A, B, pi, T):
    states, obs = [], []
    s = np.random.choice(len(pi), p=pi)
    for t in range(T):
        states.append(s)
        obs.append(np.random.choice(B.shape[1], p=B[s]))
        s = np.random.choice(N, p=A[s])
    return states, obs

HMM Structure

The five components visualized as a graphical model. Hidden states on top, observations on bottom.

Check: Which component describes how likely the system is to switch from one hidden state to another?

B (Emission matrix) A (Transition matrix) π (Initial distribution)

Chapter 2: Transitions — A[i][j]

The transition matrix A captures the dynamics of the hidden states. A[i][j] is the probability of moving from state i to state j. Each row must sum to 1 (you have to go somewhere). In the casino: a high A[Fair][Fair] means the dealer tends to stick with the fair die.

A[i][j] = P(state_t+1 = j | state_t = i) ∑_j A[i][j] = 1

The Markov property is the key assumption: the next state depends only on the current state, not on the entire history. The dealer's decision to switch dice depends only on which die they're currently using — not on all previous switches.

Why "Markov"? Named after Andrey Markov, who studied chains of random events where the future depends only on the present. "Given the present, the future is independent of the past." This memoryless property makes HMMs tractable.

The stationary distribution tells you where the system spends its time in the long run. For our casino with A[F][F]=0.95, A[L][L]=0.90: the stationary distribution is π_∞ = [2/3, 1/3]. The dealer uses the fair die twice as often as the loaded die. You find it by solving π · A = π, or equivalently: π_Fair = 0.10 / (0.05 + 0.10) = 2/3. This is useful for sanity-checking — if your Baum-Welch output gives a stationary distribution that doesn't match the observed state frequencies, something is wrong.

Interactive Transition Matrix

Adjust P(stay with Fair die) to see how the transition dynamics change. The complementary probabilities update automatically.

P(Fair → Fair)0.95

P(Loaded → Loaded)0.90

	To Fair	To Loaded
From Fair	A[F][F] = 0.95	A[F][L] = 0.05
From Loaded	A[L][F] = 0.10	A[L][L] = 0.90

Check: What does the Markov property assume?

The next state depends only on the current state All states are equally likely Transitions are deterministic

Chapter 3: Emissions — B[j][k]

The emission matrix B describes what you see given the hidden state. B[j][k] = P(observing k | in state j). For the fair die, B[Fair] is uniform: each face has probability 1/6. For the loaded die, B[Loaded] is heavily biased toward 6.

B[j][k] = P(observation_t = k | state_t = j) ∑_k B[j][k] = 1

The emission distribution is what makes the hidden state indirectly observable. If you see many 6s in a row, the loaded die is more likely — but you can never be 100% sure, because the fair die can also produce 6s.

Fair vs Loaded: Fair die: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]. Loaded die: [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]. The loaded die rolls a 6 half the time — but it can still roll other numbers.

The emission likelihood ratio is what drives inference. When you observe a 6: P(6|Loaded)/P(6|Fair) = 0.5/(1/6) = 3.0. A 6 is 3 times more likely under the loaded die. When you observe a 3: P(3|Loaded)/P(3|Fair) = 0.1/(1/6) = 0.6. A 3 slightly favors the fair die. The larger the emission likelihood ratio, the more diagnostic an observation is. If both dice had identical emission distributions, observations would carry zero information about the hidden state — the problem would be unsolvable.

Emission Distributions

Adjust how biased the loaded die is toward rolling 6. Watch the probability bars shift.

Loaded P(6)0.50

Check: If you observe a 6, can you be certain the loaded die was used?

Yes — only the loaded die rolls 6 No — the fair die can also roll 6, just less often It depends on the transition matrix

Checkpoint — Before you move on

You observe the sequence [6, 6, 3, 6, 1, 6]. Without running any algorithm, explain intuitively which timesteps are most likely to be the loaded die and WHY. What role does the emission likelihood ratio P(obs|Loaded)/P(obs|Fair) play in your reasoning?

✓ Gate cleared

Model Answer

The timesteps with observation 6 (positions 1, 2, 4, 6) are most likely Loaded because the likelihood ratio is P(6|Loaded)/P(6|Fair) = 0.5/(1/6) = 3.0 — a 6 is 3x more likely under Loaded. Position 3 (observation 3) slightly favors Fair: ratio = 0.1/(1/6) = 0.6. Position 5 (observation 1) also favors Fair with the same 0.6 ratio.

But the key insight is that isolated likelihood ratios aren't enough. The transition matrix creates temporal coherence — if the dealer tends to stay with the same die (A[L][L] = 0.90), then a cluster of 6s is much stronger evidence than scattered 6s. The sequence [6,6] is more convincing than two isolated 6s because the first 6 shifts your belief toward Loaded, and the high self-transition probability keeps it there. This is exactly what the Forward algorithm formalizes: it combines emission evidence with transition dynamics at every timestep.

Chapter 4: The Three Problems

Given an HMM, there are three fundamental questions you can ask. Everything in HMM theory boils down to solving one of these three:

1. Evaluation (Forward Algorithm)

What is P(observations | model)? How well does this model explain the data?

↓

2. Decoding (Viterbi Algorithm)

What is the most likely sequence of hidden states given the observations?

↓

3. Learning (Baum-Welch / EM)

How do I learn A, B, π from data when the states are truly hidden?

Problem	Question	Algorithm	Complexity
Evaluation	P(O \| λ)	Forward	O(N²T)
Decoding	argmax P(Q \| O, λ)	Viterbi	O(N²T)
Learning	argmax_λ P(O \| λ)	Baum-Welch	O(N²T) per iter

N = number of states, T = length of observation sequence. Without dynamic programming, evaluation alone would be O(N^T) — exponential in sequence length. The Forward and Viterbi algorithms make HMMs practical by reducing this to O(N²T).

The Three Problems Visualized

Click each problem to see what question it answers and how information flows.

Check: Which algorithm finds the most likely sequence of hidden states?

Forward Algorithm Viterbi Algorithm Baum-Welch Algorithm

Chapter 5: The Forward Algorithm

The Forward algorithm answers: how likely is this observation sequence, given our model? It builds a trellis — a grid where each column is a timestep and each row is a state. Each cell holds the probability of being in that state at that time, having seen all observations so far.

α_t(j) = P(o₁, o₂, ..., o_t, state_t = j | model)

The recursion is elegant: to compute α_t(j), sum over all states i at time t−1, multiply by the transition A[i][j] and the emission B[j][o_t]:

α_t(j) = B[j][o_t] · ∑_i α_t-1(i) · A[i][j]

Base case: α₁(j) = π(j) · B[j][o₁]. Start with the initial probability of each state times the probability of emitting the first observation. Then recurse forward through time.

Think of α_t as a [N] vector — one entry per state. The recursion is a matrix-vector multiply followed by an element-wise multiply:

α_t = B[:, o_t] ⊙ (A^T · α_t-1)

A^T · α_t-1 costs O(N²) — one matrix-vector product. The element-wise multiply with the emission column costs O(N). Over T timesteps, the total cost is O(N² · T). For a speech recognizer with N=5,000 phoneme states and a 10-second utterance (T=1,000 frames), that's 25 billion operations — manageable, but it shows why HMM state counts matter.

The log-space trick: In practice, α values underflow to 0 for long sequences because you're multiplying many small probabilities. Solution: work in log space. Instead of α, track log(α). The product becomes a sum, and the sum-over-states becomes a logsumexp:

log(α_t(j)) = log(B[j,o_t]) + logsumexp_i(log(α_t-1(i)) + log(A[i,j]))

where logsumexp(a, b) = max(a,b) + log(1 + exp(−|a−b|)). This keeps everything numerically stable even for sequences of length 10,000+.

Interactive Forward Trellis

Click "Step" to advance through the trellis computation one column at a time. Watch how probabilities flow from left to right.

Step 0 / 6

Python
def forward(obs, A, B, pi):
    """Compute forward probabilities alpha[t][j]."""
    N = len(pi)            # number of states
    T = len(obs)            # sequence length
    alpha = [[0.0]*N for _ in range(T)]

    # Base case
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]

    # Recursion
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t-1][i] * A[i][j] for i in range(N))

    return alpha, sum(alpha[T-1])

Python — Log-space version
import math

def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    mx = max(a, b)
    return mx + math.log(1 + math.exp(-abs(a - b)))

def forward_log(obs, A, B, pi):
    """Forward algorithm in log space — no underflow."""
    N = len(pi)
    T = len(obs)
    logA = [[math.log(A[i][j]) for j in range(N)] for i in range(N)]
    logB = [[math.log(B[j][k]) for k in range(len(B[0]))] for j in range(N)]
    la = [math.log(pi[j]) + logB[j][obs[0]] for j in range(N)]

    for t in range(1, T):
        new_la = []
        for j in range(N):
            terms = [la[i] + logA[i][j] for i in range(N)]
            val = terms[0]
            for k in range(1, N):
                val = logsumexp(val, terms[k])
            new_la.append(val + logB[j][obs[t]])
        la = new_la

    # Total log-probability
    total = la[0]
    for j in range(1, N):
        total = logsumexp(total, la[j])
    return total  # log P(O | model)

Check: What does α_t(j) represent?

The joint probability of seeing observations o1..ot AND being in state j at time t The probability of transitioning from state i to state j The most likely state at time t

🔨 Derivation Derive the Forward recursion from first principles ▶ ✓ ATTEMPTED

We define α_t(j) = P(o₁, o₂, ..., o_t, state_t = j). We want to compute α_t(j) from values at time t−1.

Your task: Starting from the definition of α_t(j), use the law of total probability to derive the recursion α_t(j) = B[j][o_t] · ∑_i α_t-1(i) · A[i][j]. Explain why each factor appears.

Write α_t(j) = ∑_i P(o_1..t, state_t-1=i, state_t=j). This marginalizes over all possible previous states. Now apply the chain rule to factor this joint probability.

The Markov property gives P(state_t=j | state_t-1=i, o_1..t-1) = P(state_t=j | state_t-1=i) = A[i][j]. The emission independence gives P(o_t | state_t=j, everything else) = P(o_t | state_t=j) = B[j][o_t].

After factoring: P(o_1..t-1, state_t-1=i) · A[i][j] · B[j][o_t]. The first term IS α_t-1(i) by definition. Now factor out what doesn't depend on i.

Full derivation:

Start: α_t(j) = P(o₁, ..., o_t, s_t=j)

Marginalize over previous state: = ∑_i P(o_1..t, s_t-1=i, s_t=j)

Chain rule: = ∑_i P(o_1..t-1, s_t-1=i) · P(s_t=j | s_t-1=i) · P(o_t | s_t=j)

Substitute definitions: = ∑_i α_t-1(i) · A[i][j] · B[j][o_t]

Factor out emission (doesn't depend on i): = B[j][o_t] · ∑_i α_t-1(i) · A[i][j]

The key insight: The Markov property is what makes this recursion possible. Without it, P(s_t=j | s_t-1=i, o_1..t-1) would depend on the entire history, and we couldn't factor the joint probability so cleanly. The emission independence assumption lets us pull B[j][o_t] out of the sum. Together, they reduce an O(N^T) sum-over-all-paths to an O(N²) sum at each timestep.

💻 Build It Implement the Forward Algorithm from Scratch ▶ ✓ ATTEMPTED

You've seen the recursion. Now implement it yourself. Given a sequence of observations and HMM parameters, compute the forward probabilities α[t][j] for all timesteps and states, and return the total sequence probability P(O|model).

signature def forward(obs: list[int], A: list[list[float]], B: list[list[float]], pi: list[float]) -> tuple[list[list[float]], float]: """ Compute forward probabilities for an HMM. Args: obs: observation sequence (T integers, each in 0..M-1) A: transition matrix [N x N], A[i][j] = P(state j | state i) B: emission matrix [N x M], B[j][k] = P(obs k | state j) pi: initial distribution [N], pi[j] = P(start in state j) Returns: alpha: [T x N] forward probabilities prob: P(obs | model) = sum of alpha[T-1] """

Test case

A = [[0.95, 0.05], [0.10, 0.90]]
B = [[1/6]*6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]
pi = [0.5, 0.5]
obs = [5, 5, 2] # observations: 6, 6, 3

Expected: alpha[0] = [0.0833, 0.2500]
alpha[1] ≈ [0.0147, 0.1145]
P(O|model) ≈ sum(alpha[2]) ≈ 0.0076

For each timestep t from 1 to T-1, for each state j, you need to sum over ALL previous states i: alpha[t-1][i] * A[i][j]. Then multiply by B[j][obs[t]]. The tricky part is getting the base case right: alpha[0][j] = pi[j] * B[j][obs[0]].

python
def forward(obs, A, B, pi):
    N = len(pi)
    T = len(obs)
    alpha = [[0.0] * N for _ in range(T)]

    # Base case: first observation
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]

    # Recursion: sum over all paths into state j
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t-1][i] * A[i][j]
                for i in range(N)
            )

    # Total probability: sum over final states
    prob = sum(alpha[T-1])
    return alpha, prob

Bonus challenge: Modify this to work in log-space. Replace multiplication with addition and the sum-over-states with logsumexp. This prevents underflow for sequences longer than ~50 timesteps.

🔗 Pattern Recognition

Forward Algorithm = Bayes Predict-Update on Discrete States

HMM Forward

α_t(j) = B[j][o_t] · ∑_i α_t-1(i) · A[i][j]
Predict: propagate through transitions
Update: multiply by emission likelihood

Bayes Filter

bel(x_t) = η · P(z_t|x_t) · ∑_x P(x_t|x_t-1) · bel(x_t-1)
Predict: propagate through motion model
Update: multiply by sensor likelihood
→ Bayes Filter lesson

The Forward algorithm IS the Bayes filter, specialized to discrete states. The transition matrix A is the motion model. The emission matrix B is the sensor model. The α vector is the (unnormalized) belief. The only difference: the Bayes filter normalizes at each step (giving a posterior), while Forward accumulates the joint probability (giving P(observations AND state)). Divide α_t by its sum and you get the Bayes filter posterior exactly.

In a Kalman filter, what plays the role of the transition matrix A? What plays the role of the emission matrix B? (Hint: they're both matrices, but they operate on continuous vectors.)

Chapter 6: Viterbi Decoding — The Showcase

The Viterbi algorithm answers the most dramatic question: what is the single most likely sequence of hidden states? It's identical in structure to the Forward algorithm, but replaces the sum with a max and keeps backpointers to trace back the winning path.

δ_t(j) = max_i [ δ_t-1(i) · A[i][j] ] · B[j][o_t]

At each cell, instead of summing all paths (Forward), we keep only the best path into each state. The key data structure is the backpointer array ψ — a [T × N] matrix where ψ[t][j] stores which previous state led to the best score for state j at time t. After filling the trellis, you find the best final state, then follow backpointers backward to recover the entire optimal path.

The backpointer traceback: Start at the end: pick state* = argmax_j δ_T(j). Then walk backward: state at t = ψ[t+1][state at t+1]. This costs O(T) — you just follow the arrows. The dynamic programming trellis is O(N² · T) to fill, so the traceback is essentially free.

Forward vs Viterbi: Forward sums over all paths (total probability). Viterbi takes the max over all paths (best single path). The trellis structure is identical — only the aggregation changes.

The Dishonest Casino — Live Viterbi Decoding

Roll dice and watch Viterbi decode which die the dealer is using in real time. The trellis builds as you go. Green = Fair, Red = Loaded.

Rolls: 0 | Accuracy: --

P(switch die)0.05

Loaded P(6)0.50

Python
def viterbi(obs, A, B, pi):
    """Find the most likely hidden state sequence."""
    N = len(pi)
    T = len(obs)
    delta = [[0.0]*N for _ in range(T)]
    psi   = [[0]*N for _ in range(T)]

    # Base case
    for j in range(N):
        delta[0][j] = pi[j] * B[j][obs[0]]

    # Recursion: max instead of sum
    for t in range(1, T):
        for j in range(N):
            best_i, best_v = 0, 0
            for i in range(N):
                v = delta[t-1][i] * A[i][j]
                if v > best_v:
                    best_v, best_i = v, i
            delta[t][j] = best_v * B[j][obs[t]]
            psi[t][j] = best_i

    # Backtrace
    path = [0] * T
    path[T-1] = max(range(N), key=lambda j: delta[T-1][j])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1][path[t+1]]
    return path

Check: How does Viterbi differ from the Forward algorithm?

It replaces the sum with a max, keeping only the best path into each state It runs backward instead of forward It uses a different trellis structure

🔨 Derivation Why does replacing sum with max give the best path? ▶ ✓ ATTEMPTED

The Forward algorithm computes α_t(j) = P(o_1..t, s_t=j) using a sum. Viterbi computes δ_t(j) = max_{s_1..t-1} P(o_1..t, s_1..t-1, s_t=j) using a max. Both look similar, but they answer fundamentally different questions.

Your task: Prove that the Viterbi path (obtained by backtracing from argmax_j δ_T(j)) is indeed the globally optimal path — the one that maximizes P(s₁, ..., s_T | o₁, ..., o_T). Why doesn't greedily picking the best state at each timestep (argmax_j γ_t(j)) give the same answer?

δ_T(j) = max over ALL possible paths s_1..T-1 of P(o_1..T, s_1..T-1, s_T=j). So max_j δ_T(j) is the probability of the single best complete path through the entire trellis. The max has already considered all N^T-1 possible predecessors at each step.

The greedy approach picks argmax_j γ_t(j) independently at each timestep. But γ_t(j) = P(s_t=j | O) marginalizes over all paths, so the "best state" at time t might connect to the "best state" at time t+1 via a zero-probability transition. The greedy path may be individually optimal at each step but jointly impossible.

Consider: γ₁ = [0.6, 0.4], γ₂ = [0.4, 0.6]. Greedy picks path [0, 1]. But if A[0][1] = 0 (impossible transition), this path has zero probability. The Viterbi path might be [0, 0] or [1, 1] instead.

Proof of optimality:

We want: argmax_{s_1..T} P(s_1..T, o_1..T)

Write the joint: P(s_1..T, o_1..T) = π(s₁) · B[s₁][o₁] · ∏_t=2..T A[s_t-1][s_t] · B[s_t][o_t]

The max over all sequences decomposes recursively: max_{s_1..T} = max_{s_T} max_{s_1..T-1} because the max operator distributes over independent sub-problems (just like sum does in Forward, but max satisfies max(a·b, a·c) = a·max(b,c) for a>0).

This gives: δ_t(j) = B[j][o_t] · max_i [δ_t-1(i) · A[i][j]]

At each step, we only keep the BEST predecessor for each state (stored in ψ[t][j]). The backtrace recovers the globally optimal path because the Bellman principle of optimality holds: any sub-path of an optimal path must itself be optimal.

The key insight: Viterbi works because the max operator, like addition, distributes over multiplication of positive numbers. This is why it's called the "max-product" algorithm (vs "sum-product" for Forward). The greedy approach fails because it ignores transition constraints — the marginal posterior γ_t(j) = P(s_t=j|O) doesn't encode whether consecutive states are reachable from each other.

⚔ Adversarial: Your Viterbi path goes through a state with zero emission probability for the observed symbol. How is this possible?

You have a 3-state HMM. State C has B[C][x] = 0 for observation x. Yet after running Viterbi on a sequence containing x at timestep t, your decoded path has state_t = C. Your code has no bugs. What happened?

The transition probability compensated for the zero emission This is impossible — zero emission means zero path probability This is actually impossible — if B[C][x]=0, then δ_t(C)=0, so C can never be on the Viterbi path at time t

Chapter 7: Baum-Welch (EM)

What if you don't know the model parameters (A, B, π)? You only have observation sequences. The Baum-Welch algorithm learns the parameters from data using Expectation-Maximization (EM): iterate between estimating the hidden states (E-step) and updating the parameters (M-step).

E-Step: Estimate Hidden States

Run Forward-Backward to compute P(state_t = j | observations). Uses current A, B, π.

↓

M-Step: Update Parameters

Re-estimate A, B, π using the soft state assignments from E-step. Each parameter is a weighted average.

↓ repeat until convergence

The Backward variable: β_t(i) = P(o_t+1, ..., o_T | state_t = i). Combined with the forward variable: γ_t(i) = α_t(i) · β_t(i) / P(O) gives the posterior probability of being in state i at time t.

γ_t(i) = P(state_t = i | O, model) = α_t(i) · β_t(i) / P(O)

EM Convergence

Watch Baum-Welch learn the transition probability from data. Each iteration refines the estimate. The orange line is the learned value; the teal dashed line is the true value.

Iteration: 0 | A[F][F]: 0.50

Let's walk through the M-step numerically. The transition re-estimation formula is:

A_new[i,j] = ∑_t ξ_t(i,j) / ∑_t γ_t(i)

where ξ_t(i,j) = α_t(i) · A[i,j] · B[j,o_t+1] · β_t+1(j) / P(O). In plain English: the numerator counts how often we expected to transition from i to j, and the denominator counts how often we expected to be in state i at all. The ratio gives the new transition probability.

Worked example: Suppose after the E-step, the expected counts are: transitions Fair→Fair = 142.3, Fair→Loaded = 7.7, total time in Fair = 150. Then A_new[F,F] = 142.3 / 150 = 0.949 and A_new[F,L] = 7.7 / 150 = 0.051. Compare the true values: 0.95 and 0.05. One EM iteration already gets close. The emission matrix updates follow the same logic: B_new[j,k] = (expected count of observing k in state j) / (expected total time in state j).

Guaranteed to converge? Yes — EM always increases the likelihood (or keeps it the same). But it finds a local maximum, not necessarily the global one. Multiple random restarts help. In practice, Baum-Welch works remarkably well.

Check: What does the E-step of Baum-Welch compute?

The posterior probability of each hidden state at each timestep The maximum likelihood parameters The Viterbi path

🔨 Derivation Derive the M-step update for the transition matrix ▶ ✓ ATTEMPTED

In the M-step of Baum-Welch, we update A[i][j] using soft counts from the E-step. The update rule is: A_new[i][j] = ∑_t ξ_t(i,j) / ∑_t γ_t(i).

Your task: Derive why this is the maximum likelihood estimate. Start from the expected complete-data log-likelihood and show that the optimal A[i][j] has this form. Why does it look like "expected count of transitions from i to j" divided by "expected count of being in state i"?

The complete-data log-likelihood (if you knew the states) is: L = ∑_t log A[s_t][s_t+1] + ∑_t log B[s_t][o_t] + log π[s₁]. The expected version replaces indicator functions with posteriors: E[L] = ∑_t∑_i,j ξ_t(i,j) · log A[i][j] + ...

For row i of A, maximize ∑_j c_ij · log A[i][j] subject to ∑_j A[i][j] = 1, where c_ij = ∑_t ξ_t(i,j). Use a Lagrange multiplier λ for the constraint.

Set ∂/∂A[i][j] of [∑_j c_ij log A[i][j] − λ(∑_j A[i][j] − 1)] = 0. This gives c_ij/A[i][j] = λ. So A[i][j] = c_ij/λ. Use the constraint ∑_j A[i][j] = 1 to find λ = ∑_j c_ij.

Full derivation:

The expected complete-data log-likelihood with respect to A (ignoring terms not involving A):

Q(A) = ∑_t=1^T-1 ∑_i ∑_j ξ_t(i,j) · log A[i][j]

Subject to: ∑_j A[i][j] = 1 for each row i.

Lagrangian for row i: L_i = ∑_j c_ij log A[i][j] − λ_i(∑_j A[i][j] − 1)

where c_ij = ∑_t=1^T-1 ξ_t(i,j)

Take derivative and set to zero: ∂L_i/∂A[i][j] = c_ij/A[i][j] − λ_i = 0

So: A[i][j] = c_ij / λ_i

Apply constraint: ∑_j c_ij/λ_i = 1, so λ_i = ∑_j c_ij = ∑_t γ_t(i)

Final answer: A_new[i][j] = ∑_t ξ_t(i,j) / ∑_t γ_t(i)

The key insight: This is just counting! In the fully-observed case, you'd estimate A[i][j] by counting transitions i→j and dividing by total time in state i. With hidden states, you can't count directly, so you use expected counts from the posterior. The same logic applies to B: expected emission counts / expected state occupancy. EM always has this structure: compute expected sufficient statistics, then maximize as if they were real counts.

💥 Break-It Lab What Dies When You Break the HMM? ▶ ✓ ATTEMPTED

A working dishonest casino HMM: A[F][F]=0.95, A[L][L]=0.90, B[Loaded][6]=0.5, π=[0.5, 0.5]. The Viterbi decoder correctly identifies loaded-die segments. Now let's break things.

Wrong initial distribution (π = [1.0, 0.0]) ACTIVE

Failure mode: The model is absolutely certain it starts in Fair. For the first few timesteps, it refuses to decode Loaded even with strong evidence (runs of 6s), because the prior is infinitely confident. The effect fades as the transition matrix eventually allows switching — but the first ~5 timesteps are wrong. This shows how initial conditions create a "burn-in" period where the model's beliefs are dominated by its prior rather than evidence.

Identical emissions (B[Loaded] = B[Fair] = uniform) ACTIVE

Failure mode: Observations carry ZERO information about the hidden state. The emission likelihood ratio is 1.0 for every observation. Viterbi degrades to following the transition matrix's stationary distribution — it always predicts Fair (the more common state) regardless of what it observes. Accuracy drops to ~67% (just predicting the majority class). The HMM is blind.

Zero transition (A[F][L] = 0, stuck in Fair) ACTIVE

Failure mode: Once in Fair, the model can NEVER transition to Loaded. If π starts you in Fair, the Viterbi path stays in Fair forever — even when seeing 20 consecutive 6s. A zero transition is an absolute barrier. In practice, this creates an absorbing state. This is why Baum-Welch can get stuck in local optima: if any A[i][j] reaches zero during learning, it can never recover (the expected count ξ_t(i,j) will also be zero).

Checkpoint — Before you move on

You've now seen all three HMM algorithms. In your own words: why is Baum-Welch's convergence only guaranteed to a LOCAL maximum? Construct a scenario where two very different parameter settings (different A, B matrices) could both explain the same observation sequence reasonably well.

✓ Gate cleared

Model Answer

The log-likelihood surface of an HMM has multiple modes due to label symmetry: swapping state 0 and state 1 (permuting the rows/columns of A and B) gives a different parameter setting with identical likelihood. For N states, there are N! equivalent permutations. Beyond this trivial symmetry, genuinely different explanations can fit the data: a model with A[F][F]=0.95 and B[L][6]=0.5 might fit the data as well as one with A[F][F]=0.80 and B[L][6]=0.7 (fewer switches but weaker emission signal). EM gets trapped in whichever basin its initialization falls into. This is why multiple random restarts are standard practice — run Baum-Welch 10-50 times with different initial parameters and keep the solution with highest final log-likelihood.

Chapter 8: HMMs vs Modern Models

HMMs dominated sequence modeling from the 1970s through the 2000s. Then neural networks took over. Understanding this evolution reveals what each approach does well — and why Transformers ultimately won.

Model	Era	Hidden State	Key Limitation
HMM	1970s–2000s	Discrete, finite	Fixed number of states, Markov assumption
RNN/LSTM	2010s	Continuous vector	Sequential processing, vanishing gradients
Transformer	2017+	Contextual embeddings	Quadratic attention cost

Speech recognition: HMMs powered every major speech system for 30 years (Dragon, Siri v1, Google Voice). Each phoneme was an HMM state. Words were sequences of phoneme-HMMs. This worked — until deep learning produced a 30% error reduction overnight.

POS tagging: In NLP, HMMs were the standard for part-of-speech tagging. States = tags (noun, verb, adjective...), observations = words. "The dog runs" → [DET, NOUN, VERB]. Trigram HMMs achieved ~96% accuracy — a bar that stood for years.

What breaks — the first-order Markov assumption: HMMs assume the current state depends only on the previous state. For speech, this means a phoneme depends only on the immediately preceding phoneme — not on the word, the sentence, or the topic. For gene-finding, it means a codon depends only on the previous codon. This is clearly wrong for any task requiring long-range context. Neural models won precisely because they can condition on entire histories.

The scaling wall. Training complexity is O(N² · T · iterations). For a speech recognizer with N=5,000 phoneme states, T=1,000 frames per utterance, and 100 EM iterations: that's 5000² × 1000 × 100 = 2.5 trillion operations per training utterance. Modern speech corpora have millions of utterances. HMMs weren't replaced because they're wrong — they were replaced because they can't handle the state space needed for modern tasks. A Transformer with 768-dimensional hidden states effectively has a continuous state space of infinite cardinality.

System	States (N)	Seq Length (T)	Forward Pass Cost
Dishonest casino	2	100	400 ops
POS tagger	~45	~30	~61K ops
Gene finder	~100	~10K	~100M ops
Speech recognizer	~5,000	~1,000	~25B ops
Large-vocab speech	~50,000	~3,000	~7.5T ops

Evolution of Sequence Models

From discrete hidden states to continuous representations. Each generation relaxes a constraint of the previous one.

Check: What is the primary limitation of HMMs compared to RNNs?

HMMs require discrete, finite hidden states; RNNs use continuous vectors HMMs are slower to train HMMs can't handle sequences

🏗 Design Challenge You're the Architect: Design an HMM for Speech Recognition ▶ ✓ ATTEMPTED

It's 1995. You're building a dictation system that must recognize 20,000 English words from audio. Your input is a stream of MFCC feature vectors (13 coefficients, extracted every 10ms). Your output is the most likely word sequence. You must design the HMM topology.

Vocabulary

20,000 words

Input features

13-dim MFCC @ 100 Hz

RAM budget

64 MB (1995 workstation)

Latency target

<2x real-time decoding

Training data

100 hours of transcribed speech

1. How many states per phoneme? (3? 5? 10?) What topology — left-to-right only, or allow self-loops and skips?

2. Emission model: discrete (VQ codebook) or continuous (Gaussian mixtures)? How many Gaussians per state?

3. How do you handle silence, breathing, and variable word lengths?

4. With 20,000 words averaging 7 phonemes × 3 states = ~420,000 total states, how do you keep the Viterbi trellis tractable?

The classic CMU/HTK design (circa 1995):

1. Topology: 3 states per phoneme (beginning/middle/end), strictly left-to-right with self-loops. No skips — they add too many transitions. ~40 phonemes × 3 = 120 base states, but context-dependent triphones expand this to ~5,000–10,000 tied states using decision-tree clustering.

2. Emissions: Continuous GMMs (Gaussian Mixture Models), typically 8–32 mixtures per state. A VQ codebook (discrete) saves RAM but loses ~15% relative accuracy. Each mixture has a 13-dim mean + diagonal covariance = 26 parameters. With 10K states × 16 mixtures = 160K Gaussians × 26 params = 4.2M float parameters ≈ 17 MB. Fits in budget.

3. Silence/breathing: A special 1-state or 3-state silence model with a very "flat" Gaussian (high variance). Word boundaries allow optional silence insertion. Variable word lengths are handled naturally: each word is a concatenation of phoneme HMMs, and self-loops absorb duration variation.

4. Tractability: Beam search (Viterbi with pruning). At each timestep, only keep states with δ within a beam width of the current best δ. Typical beam keeps ~500–2000 active states (of 420K). Combined with a language model (bigram/trigram) that prunes unlikely word sequences, decoding runs at 0.5–3x real-time.

Chapter 9: Connections

HMMs don't exist in isolation. They're part of a rich family of models for reasoning about hidden states over time. Understanding these connections deepens your intuition for all of them.

Bayes Filter

The general framework: predict → update with Bayes' rule. HMMs are a special case where states and observations are discrete.

↓ continuous states

Kalman Filter

A "continuous HMM" with linear-Gaussian assumptions. States are real-valued vectors. The math is the same predict-update cycle.

↓ actions + rewards

Reinforcement Learning (POMDPs)

Add actions and rewards to an HMM. The agent must act optimally despite hidden states. Belief-state planning.

↓ learned representations

Deep Sequence Models

Replace discrete states with neural hidden representations. The forward pass is a learned version of the HMM recursion.

The key insight: HMM, Kalman filter, particle filter, RNN — they all share the same skeleton: maintain a belief about hidden states, update it as new observations arrive. The difference is the representation of that belief and how updates are computed.

The Family Tree

How HMMs connect to other models. Each arrow relaxes a constraint or adds a capability.

Check: A Kalman filter can be seen as...

A discrete HMM with more states A continuous-state HMM with linear-Gaussian assumptions A replacement for HMMs that uses neural networks

🔗 Pattern Recognition

Attention = Soft Viterbi Alignment

HMM (Viterbi)

For each output position, find the BEST (hardmax) alignment to input positions via dynamic programming.
Score = transition + emission (log-additive).
Result: hard 1-to-1 alignment path.

Transformer (Attention)

For each output position, compute a SOFT (softmax) weighting over all input positions.
Score = Q · K^T / √d (learned similarity).
Result: soft many-to-many alignment weights.
→ Transformer lesson

The deep connection: both HMM decoding and Transformer attention solve the alignment problem — "which input positions are relevant to this output position?" The HMM does it with hard assignments via DP (one best path). Attention does it with soft weights via learned similarity (weighted combination of all positions). Early neural machine translation literally used attention to replace the HMM alignment model. The "attention is all you need" paper eliminated the recurrence (no sequential dependence), making alignment fully parallel — but the core operation (figure out which inputs matter for each output) is the same problem HMMs were solving in 1970.

In CTC (Connectionist Temporal Classification), the forward algorithm is used inside a neural network's loss function. How is this the literal fusion of HMM dynamic programming with neural emissions?

"The HMM is one of the most important statistical tools for modeling sequences. Master it, and you hold the key to speech, language, biology, and beyond."

— Lawrence Rabiner

You now understand the mathematics of hidden states. Every modern sequence model — from LSTMs to Transformers — carries the DNA of the HMM inside it.

Understand Hidden
Markov Models

Chapter 0: Why Hidden States?

Chapter 1: The HMM Tuple — (S, O, A, B, π)

Chapter 2: Transitions — A[i][j]

Chapter 3: Emissions — B[j][k]

Chapter 4: The Three Problems

Chapter 5: The Forward Algorithm

Chapter 6: Viterbi Decoding — The Showcase

Chapter 7: Baum-Welch (EM)

Chapter 8: HMMs vs Modern Models

Chapter 9: Connections

Understand HiddenMarkov Models

Chapter 0: Why Hidden States?

Chapter 1: The HMM Tuple — (S, O, A, B, π)

Chapter 2: Transitions — A[i][j]

Chapter 3: Emissions — B[j][k]

Chapter 4: The Three Problems

Chapter 5: The Forward Algorithm

Chapter 6: Viterbi Decoding — The Showcase

Chapter 7: Baum-Welch (EM)

Chapter 8: HMMs vs Modern Models

Chapter 9: Connections

Understand Hidden
Markov Models