The framework behind early speech recognition, gene finding, and the conceptual ancestor of every sequence model in deep learning — from RNNs to Transformers.
A casino dealer has two dice: a fair die (each face equally likely) and a loaded die (heavily biased toward rolling a 6). The dealer secretly switches between them as the night goes on. You're sitting at the table. You can see every roll, but you never see which die is being used.
This is the fundamental HMM setup: there is a hidden state (which die) that you cannot observe directly, and an observation (the roll) that depends on that hidden state. Your goal: figure out the hidden sequence from the observations alone.
Watch the dealer switch between Fair and Loaded dice. You only see the rolls — can you guess which die is being used?
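The whole setup can be simulated in a few lines. This is a sketch using the transition probabilities from the table below; the loaded die's 50%-on-6 bias is an assumed value for illustration:

```python
import random

# Illustrative casino parameters. The transition probabilities match the
# table below; the loaded die's exact bias is an assumption.
STATES = ["Fair", "Loaded"]
A = {"Fair":   {"Fair": 0.95, "Loaded": 0.05},
     "Loaded": {"Fair": 0.10, "Loaded": 0.90}}
B = {"Fair":   [1/6] * 6,                        # uniform over faces 1..6
     "Loaded": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}   # heavily biased toward 6

def simulate(T, seed=0):
    """Generate T (hidden state, observed roll) pairs from the HMM."""
    rng = random.Random(seed)
    state = "Fair"
    rolls = []
    for _ in range(T):
        # Emit a roll from the current die, then (possibly) switch dice.
        face = rng.choices(range(1, 7), weights=B[state])[0]
        rolls.append((state, face))
        state = rng.choices(STATES, weights=[A[state][s] for s in STATES])[0]
    return rolls
```

Running `simulate(300)` and hiding the first element of each pair reproduces exactly the situation you face at the table: the rolls are visible, the die is not.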
An HMM is fully specified by five objects. Think of it as a recipe: once you have these five ingredients, the entire model is defined and you can do all the inference you want.
The five components visualized as a graphical model. Hidden states on top, observations on bottom.
The transition matrix A captures the dynamics of the hidden states. A[i][j] is the probability of moving from state i to state j. Each row must sum to 1 (you have to go somewhere). In the casino: a high A[Fair][Fair] means the dealer tends to stick with the fair die.
The Markov property is the key assumption: the next state depends only on the current state, not on the entire history. The dealer's decision to switch dice depends only on which die they're currently using — not on all previous switches.
Adjust P(stay with Fair die) to see how the transition dynamics change. The complementary probabilities update automatically.
| | To Fair | To Loaded |
|---|---|---|
| From Fair | A[F][F] = 0.95 | A[F][L] = 0.05 |
| From Loaded | A[L][F] = 0.10 | A[L][L] = 0.90 |
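One concrete consequence of the Markov property: the time spent in a state before leaving is geometrically distributed, so the expected run length is 1 / (1 − P(stay)). With the table's numbers, a quick check:

```python
def expected_run_length(p_stay):
    """Mean number of consecutive steps in a state with self-loop p_stay."""
    return 1.0 / (1.0 - p_stay)

print(expected_run_length(0.95))  # fair die: about 20 rolls in a row
print(expected_run_length(0.90))  # loaded die: about 10 rolls in a row
```

So under these parameters the dealer keeps the fair die for roughly 20 rolls at a stretch and the loaded die for roughly 10.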
The emission matrix B describes what you see given the hidden state. B[j][k] = P(observing k | in state j). For the fair die, B[Fair] is uniform: each face has probability 1/6. For the loaded die, B[Loaded] is heavily biased toward 6.
The emission distribution is what makes the hidden state indirectly observable. If you see many 6s in a row, the loaded die is more likely — but you can never be 100% sure, because the fair die can also produce 6s.
Adjust how biased the loaded die is toward rolling 6. Watch the probability bars shift.
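To see how emission evidence accumulates, here is a small Bayes-update sketch. It assumes a 50%-on-6 loaded die and deliberately ignores switching (the full algorithms below handle that), isolating how each roll shifts your belief:

```python
# The loaded die's 50% weight on 6 is an assumed bias for illustration.
B = {"Fair": [1/6] * 6, "Loaded": [0.1] * 5 + [0.5]}

def posterior_loaded(rolls, prior=0.5):
    """P(Loaded | rolls) for a single fixed die, no switching."""
    p_fair, p_loaded = (1 - prior), prior
    for r in rolls:
        p_fair *= B["Fair"][r - 1]
        p_loaded *= B["Loaded"][r - 1]
    return p_loaded / (p_fair + p_loaded)

# Each observed 6 multiplies the odds of Loaded by 0.5 / (1/6) = 3,
# so three 6s in a row give 27:1 odds, a posterior of 27/28.
print(posterior_loaded([6, 6, 6]))  # ~0.964
```

Note that even after three straight 6s the posterior is about 0.96, not 1: the fair die really can produce that streak.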
Given an HMM, there are three fundamental questions you can ask. Everything in HMM theory boils down to solving one of these three:
| Problem | Question | Algorithm | Complexity |
|---|---|---|---|
| Evaluation | P(O \| λ) | Forward | O(N²T) |
| Decoding | argmax_Q P(Q \| O, λ) | Viterbi | O(N²T) |
| Learning | argmax_λ P(O \| λ) | Baum-Welch | O(N²T) per iter |
Click each problem to see what question it answers and how information flows.
The Forward algorithm answers: how likely is this observation sequence, given our model? It builds a trellis — a grid where each column is a timestep and each row is a state. Each cell holds the probability of being in that state at that time, having seen all observations so far.
The recursion is elegant: to compute α_t(j), weight each state i's probability α_{t−1}(i) at time t−1 by the transition A[i][j], sum over i, then multiply by the emission B[j][o_t]:

α_t(j) = B[j][o_t] · Σᵢ α_{t−1}(i) · A[i][j]
Click "Step" to advance through the trellis computation one column at a time. Watch how probabilities flow from left to right.
```python
def forward(obs, A, B, pi):
    """Compute forward probabilities alpha[t][j]."""
    N = len(pi)   # number of states
    T = len(obs)  # sequence length
    alpha = [[0.0] * N for _ in range(T)]
    # Base case: initial distribution times first emission
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    # Recursion: sum over all predecessor states
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t-1][i] * A[i][j] for i in range(N))
    return alpha, sum(alpha[T-1])
```
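One practical caveat: the repeated products in `forward` underflow to zero for long sequences, since every term is below 1. A common fix, sketched here rather than part of the classical presentation, is to run the same recursion in log space with the log-sum-exp trick:

```python
import math

def forward_log(obs, A, B, pi):
    """Log of P(O | lambda), numerically stable for long sequences."""
    N, T = len(pi), len(obs)
    log_alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):
        log_alpha[0][j] = math.log(pi[j]) + math.log(B[j][obs[0]])
    for t in range(1, T):
        for j in range(N):
            terms = [log_alpha[t-1][i] + math.log(A[i][j]) for i in range(N)]
            m = max(terms)  # log-sum-exp: factor out the max to avoid underflow
            log_alpha[t][j] = math.log(B[j][obs[t]]) + m + math.log(
                sum(math.exp(x - m) for x in terms))
    m = max(log_alpha[T-1])
    return m + math.log(sum(math.exp(x - m) for x in log_alpha[T-1]))
```

Exponentiating the result recovers the same likelihood as `forward` on short sequences, but `forward_log` stays finite where the plain version would round to zero.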
The Viterbi algorithm answers the most dramatic question: what is the single most likely sequence of hidden states? It's identical in structure to the Forward algorithm, but replaces the sum with a max and keeps backpointers to trace back the winning path.
At each cell, instead of summing all paths (Forward), we keep only the best path into each state. After filling the trellis, we trace back from the final best state to recover the entire optimal sequence. This is dynamic programming at its finest.
Roll dice and watch Viterbi decode which die the dealer is using in real time. The trellis builds as you go. Green = Fair, Red = Loaded.
```python
def viterbi(obs, A, B, pi):
    """Find the most likely hidden state sequence."""
    N = len(pi)
    T = len(obs)
    delta = [[0.0] * N for _ in range(T)]  # best path probability so far
    psi = [[0] * N for _ in range(T)]      # backpointers
    # Base case
    for j in range(N):
        delta[0][j] = pi[j] * B[j][obs[0]]
    # Recursion: max instead of sum
    for t in range(1, T):
        for j in range(N):
            best_i, best_v = 0, 0.0
            for i in range(N):
                v = delta[t-1][i] * A[i][j]
                if v > best_v:
                    best_v, best_i = v, i
            delta[t][j] = best_v * B[j][obs[t]]
            psi[t][j] = best_i
    # Backtrace from the best final state
    path = [0] * T
    path[T-1] = max(range(N), key=lambda j: delta[T-1][j])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1][path[t+1]]
    return path
```
What if you don't know the model parameters (A, B, π)? You only have observation sequences. The Baum-Welch algorithm learns the parameters from data using Expectation-Maximization (EM): iterate between estimating the hidden states (E-step) and updating the parameters (M-step).
Watch Baum-Welch learn the transition probability from data. Each iteration refines the estimate. The orange line is the learned value; the teal dashed line is the true value.
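For completeness, here is a minimal single-sequence sketch of one Baum-Welch iteration: a forward pass, its backward counterpart, and re-estimation from expected counts. Variable names (`gamma` for state posteriors, xi-style transition counts) follow standard HMM notation; a production version would also need the log-space scaling discussed with the Forward algorithm:

```python
def backward(obs, A, B):
    """Backward probabilities beta[t][i], the mirror of the forward pass."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]        # base case: beta[T-1][i] = 1
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def baum_welch_step(obs, A, B, pi):
    """One EM iteration on a single sequence: re-estimate A, B, pi."""
    N, T = len(pi), len(obs)
    # E-step: forward-backward
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j]
                                             for i in range(N))
    beta = backward(obs, A, B)
    likelihood = sum(alpha[T - 1])
    # gamma[t][i] = P(state i at time t | O)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)]
             for t in range(T)]
    # M-step: normalized expected counts
    new_pi = gamma[0][:]
    new_A = []
    for i in range(N):
        denom = sum(gamma[t][i] for t in range(T - 1))
        row = []
        for j in range(N):
            # expected number of i -> j transitions (xi summed over time)
            num = sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                      for t in range(T - 1)) / likelihood
            row.append(num / denom)
        new_A.append(row)
    M = len(B[0])
    new_B = [[sum(g[j] for t, g in enumerate(gamma) if obs[t] == k) /
              sum(g[j] for g in gamma) for k in range(M)]
             for j in range(N)]
    return new_A, new_B, new_pi
```

Iterating `baum_welch_step` from a rough initial guess never decreases the likelihood of the data, which is exactly the EM guarantee the widget above illustrates.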
HMMs dominated sequence modeling from the 1970s through the 2000s. Then neural networks took over. Understanding this evolution reveals what each approach does well — and why Transformers ultimately won.
| Model | Era | Hidden State | Key Limitation |
|---|---|---|---|
| HMM | 1970s–2000s | Discrete, finite | Fixed number of states, Markov assumption |
| RNN/LSTM | 2010s | Continuous vector | Sequential processing, vanishing gradients |
| Transformer | 2017+ | Contextual embeddings | Quadratic attention cost |
From discrete hidden states to continuous representations. Each generation relaxes a constraint of the previous one.
HMMs don't exist in isolation. They're part of a rich family of models for reasoning about hidden states over time. Understanding these connections deepens your intuition for all of them.
How HMMs connect to other models. Each arrow relaxes a constraint or adds a capability.
You now understand the mathematics of hidden states. Every modern sequence model — from LSTMs to Transformers — carries the DNA of the HMM inside it.