The Complete Beginner's Path

Understand Hidden Markov Models

The framework behind early speech recognition, gene finding, and the conceptual ancestor of every sequence model in deep learning — from RNNs to Transformers.

Prerequisites: Basic probability + Intuition for sequences. That's it.
10 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Why Hidden States?

A casino dealer has two dice: a fair die (each face equally likely) and a loaded die (heavily biased toward rolling a 6). The dealer secretly switches between them as the night goes on. You're sitting at the table. You can see every roll, but you never see which die is being used.

This is the fundamental HMM setup: there is a hidden state (which die) that you cannot observe directly, and an observation (the roll) that depends on that hidden state. Your goal: figure out the hidden sequence from the observations alone.

The core idea: The world has states you cannot see. Those states produce signals you can see. An HMM gives you the mathematical machinery to reason backward from observations to hidden states — to see through the veil.
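This generative story is easy to run yourself. Below is a minimal sketch of the dishonest casino; the switch probability and loaded-die bias are illustrative assumptions chosen to match the sliders later on this page.

```python
import random

# Illustrative emission probabilities (assumptions, matching later chapters):
FAIR = [1/6] * 6                       # fair die: uniform over faces 1-6
LOADED = [1/10] * 5 + [1/2]            # loaded die: rolls 6 half the time

def simulate(n_rolls, p_switch=0.05, seed=0):
    """Generate (hidden_dice, visible_rolls) for n_rolls turns."""
    rng = random.Random(seed)
    die = "Fair"                       # assume the dealer starts fair
    dice, rolls = [], []
    for _ in range(n_rolls):
        if rng.random() < p_switch:    # the dealer secretly switches
            die = "Loaded" if die == "Fair" else "Fair"
        faces = FAIR if die == "Fair" else LOADED
        rolls.append(rng.choices(range(1, 7), weights=faces)[0])
        dice.append(die)
    return dice, rolls

dice, rolls = simulate(10)
print(rolls)   # you see this...
print(dice)    # ...but not this
```

The inference problem of this whole page is recovering the second list from the first.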
The Dishonest Casino

Watch the dealer switch between Fair and Loaded dice. You only see the rolls — can you guess which die is being used?

Check: In the casino analogy, what is the "hidden state"?

Chapter 1: The HMM Tuple — (S, O, A, B, π)

An HMM is fully specified by five objects. Think of it as a recipe: once you have these five ingredients, the entire model is defined and you can do all the inference you want.

S — States
The set of hidden states. Casino: {Fair, Loaded}
O — Observations
What you can see. Casino: {1, 2, 3, 4, 5, 6}
A — Transition Matrix
P(next state | current state). How likely is a die switch?
B — Emission Matrix
P(observation | state). Each die's roll probabilities.
π — Initial Distribution
P(starting state). Which die does the dealer pick first?
Think of it this way: S and O define the vocabulary. A describes how hidden states evolve over time. B describes how hidden states produce observations. π tells you where you start. Together: a complete generative story.
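Written out as data, the casino model is just five small objects. The numbers below mirror the examples used later on this page; the uniform initial distribution is an assumption.

```python
# The five-tuple (S, O, A, B, pi) for the dishonest casino.
S  = ["Fair", "Loaded"]               # hidden states
O  = [1, 2, 3, 4, 5, 6]               # observable symbols
A  = [[0.95, 0.05],                   # P(next state | current state)
      [0.10, 0.90]]
B  = [[1/6] * 6,                      # P(roll | Fair): uniform
      [1/10] * 5 + [1/2]]             # P(roll | Loaded): biased toward 6
pi = [0.5, 0.5]                       # P(starting state) -- assumed uniform

# Sanity check: every probability row must sum to 1.
assert all(abs(sum(row) - 1) < 1e-9 for row in A)
assert all(abs(sum(row) - 1) < 1e-9 for row in B)
assert abs(sum(pi) - 1) < 1e-9
```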
HMM Structure

The five components visualized as a graphical model. Hidden states on top, observations on bottom.

Check: Which component describes how likely the system is to switch from one hidden state to another?

Chapter 2: Transitions — A[i][j]

The transition matrix A captures the dynamics of the hidden states. A[i][j] is the probability of moving from state i to state j. Each row must sum to 1 (you have to go somewhere). In the casino: a high A[Fair][Fair] means the dealer tends to stick with the fair die.

A[i][j] = P(state_{t+1} = j | state_t = i)      Σ_j A[i][j] = 1

The Markov property is the key assumption: the next state depends only on the current state, not on the entire history. The dealer's decision to switch dice depends only on which die they're currently using — not on all previous switches.

Why "Markov"? Named after Andrey Markov, who studied chains of random events where the future depends only on the present. "Given the present, the future is independent of the past." This memoryless property makes HMMs tractable.
Interactive Transition Matrix

Adjust P(stay with Fair die) to see how the transition dynamics change. The complementary probabilities update automatically.

P(Fair → Fair): 0.95
P(Loaded → Loaded): 0.90

              To Fair           To Loaded
From Fair     A[F][F] = 0.95    A[F][L] = 0.05
From Loaded   A[L][F] = 0.10    A[L][L] = 0.90
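One consequence of the Markov property worth knowing: the number of consecutive turns spent with one die is geometrically distributed, with mean 1 / P(leave). A quick check, assuming the matrix values shown on this page:

```python
# Expected run lengths implied by the self-transition probabilities.
A = [[0.95, 0.05],
     [0.10, 0.90]]

for i, name in enumerate(["Fair", "Loaded"]):
    p_leave = 1 - A[i][i]              # probability of switching away
    print(f"{name}: expected run length = {1 / p_leave:.0f} rolls")
```

So with P(Fair → Fair) = 0.95, the dealer keeps the fair die for about 20 rolls at a stretch.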
Check: What does the Markov property assume?

Chapter 3: Emissions — B[j][k]

The emission matrix B describes what you see given the hidden state. B[j][k] = P(observing k | in state j). For the fair die, B[Fair] is uniform: each face has probability 1/6. For the loaded die, B[Loaded] is heavily biased toward 6.

B[j][k] = P(observation_t = k | state_t = j)      Σ_k B[j][k] = 1

The emission distribution is what makes the hidden state indirectly observable. If you see many 6s in a row, the loaded die is more likely — but you can never be 100% sure, because the fair die can also produce 6s.

Fair vs Loaded: Fair die: [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]. Loaded die: [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]. The loaded die rolls a 6 half the time — but it can still roll other numbers.
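How much evidence is a single 6? A one-roll Bayes update makes "more likely, but never certain" precise. This sketch assumes an equal prior over the two dice:

```python
# Single-roll Bayes update: which die produced a 6?
p6_fair, p6_loaded = 1/6, 1/2          # emission probabilities for a 6
prior_fair = prior_loaded = 0.5        # assumed equal prior

post_fair   = prior_fair   * p6_fair   # unnormalized posteriors
post_loaded = prior_loaded * p6_loaded
z = post_fair + post_loaded            # normalizing constant
print(f"P(Loaded | rolled a 6) = {post_loaded / z:.2f}")   # 0.75
```

A single 6 shifts belief toward the loaded die, but a 25% chance remains that the fair die produced it.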
Emission Distributions

Adjust how biased the loaded die is toward rolling 6. Watch the probability bars shift.

Loaded P(6): 0.50
Check: If you observe a 6, can you be certain the loaded die was used?

Chapter 4: The Three Problems

Given an HMM, there are three fundamental questions you can ask. Everything in HMM theory boils down to solving one of these three:

1. Evaluation (Forward Algorithm)
What is P(observations | model)? How well does this model explain the data?
2. Decoding (Viterbi Algorithm)
What is the most likely sequence of hidden states given the observations?
3. Learning (Baum-Welch / EM)
How do I learn A, B, π from data when the states are truly hidden?
Problem    | Question              | Algorithm  | Complexity
Evaluation | P(O | λ)              | Forward    | O(N²T)
Decoding   | argmax_Q P(Q | O, λ)  | Viterbi    | O(N²T)
Learning   | argmax_λ P(O | λ)     | Baum-Welch | O(N²T) per iteration
N = number of states, T = length of observation sequence. Without dynamic programming, evaluation alone would be O(N^T) — exponential in sequence length. The Forward and Viterbi algorithms make HMMs practical by reducing this to O(N²T).
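The exponential cost is easy to make concrete: brute-force evaluation enumerates every hidden path and sums its probability. A small sketch for the casino model (parameters and the uniform start are the assumptions used throughout this page):

```python
from itertools import product

A  = [[0.95, 0.05], [0.10, 0.90]]
B  = [[1/6] * 6, [1/10] * 5 + [1/2]]
pi = [0.5, 0.5]
obs = [5, 5, 1, 5]                     # rolls 6, 6, 2, 6 (faces coded 0-5)
N, T = 2, len(obs)

# Brute force: sum over all N**T hidden paths.
p_brute = 0.0
for path in product(range(N), repeat=T):
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, T):
        p *= A[path[t-1]][path[t]] * B[path[t]][obs[t]]
    p_brute += p

# The Forward recursion gets the same answer in O(N²T) work.
alpha = [pi[j] * B[j][obs[0]] for j in range(N)]
for t in range(1, T):
    alpha = [B[j][obs[t]] * sum(alpha[i] * A[i][j] for i in range(N))
             for j in range(N)]
assert abs(sum(alpha) - p_brute) < 1e-12
print(f"{N**T} paths enumerated; P(O) = {p_brute:.6f}")
```

With T = 4 this is only 16 paths; at T = 100 it would be 2¹⁰⁰, which is why the trellis in the next chapter matters.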
The Three Problems Visualized

Click each problem to see what question it answers and how information flows.

Check: Which algorithm finds the most likely sequence of hidden states?

Chapter 5: The Forward Algorithm

The Forward algorithm answers: how likely is this observation sequence, given our model? It builds a trellis — a grid where each column is a timestep and each row is a state. Each cell holds the probability of being in that state at that time, having seen all observations so far.

α_t(j) = P(o_1, o_2, ..., o_t, state_t = j | model)

The recursion is elegant: to compute α_t(j), sum over all states i at time t−1, multiply by the transition A[i][j] and the emission B[j][o_t]:

α_t(j) = B[j][o_t] · Σ_i α_{t−1}(i) · A[i][j]
Base case: α_1(j) = π(j) · B[j][o_1]. Start with the initial probability of each state times the probability of emitting the first observation. Then recurse forward through time.
Interactive Forward Trellis

Click "Step" to advance through the trellis computation one column at a time. Watch how probabilities flow from left to right.

Step 0 / 6
Python
def forward(obs, A, B, pi):
    """Compute forward probabilities alpha[t][j]."""
    N = len(pi)            # number of states
    T = len(obs)            # sequence length
    alpha = [[0.0]*N for _ in range(T)]

    # Base case
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]

    # Recursion
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t-1][i] * A[i][j] for i in range(N))

    return alpha, sum(alpha[T-1])
Check: What does α_t(j) represent?

Chapter 6: Viterbi Decoding — The Showcase

The Viterbi algorithm answers the most dramatic question: what is the single most likely sequence of hidden states? It's identical in structure to the Forward algorithm, but replaces the sum with a max and keeps backpointers to trace back the winning path.

δ_t(j) = max_i [ δ_{t−1}(i) · A[i][j] ] · B[j][o_t]

At each cell, instead of summing all paths (Forward), we keep only the best path into each state. After filling the trellis, we trace back from the final best state to recover the entire optimal sequence. This is dynamic programming at its finest.

Forward vs Viterbi: Forward sums over all paths (total probability). Viterbi takes the max over all paths (best single path). The trellis structure is identical — only the aggregation changes.
The Dishonest Casino — Live Viterbi Decoding

Roll dice and watch Viterbi decode which die the dealer is using in real time. The trellis builds as you go. Green = Fair, Red = Loaded.

Rolls: 0 | Accuracy: --
P(switch die): 0.05
Loaded P(6): 0.50
Python
def viterbi(obs, A, B, pi):
    """Find the most likely hidden state sequence."""
    N = len(pi)
    T = len(obs)
    delta = [[0.0]*N for _ in range(T)]
    psi   = [[0]*N for _ in range(T)]

    # Base case
    for j in range(N):
        delta[0][j] = pi[j] * B[j][obs[0]]

    # Recursion: max instead of sum
    for t in range(1, T):
        for j in range(N):
            best_i, best_v = 0, 0
            for i in range(N):
                v = delta[t-1][i] * A[i][j]
                if v > best_v:
                    best_v, best_i = v, i
            delta[t][j] = best_v * B[j][obs[t]]
            psi[t][j] = best_i

    # Backtrace
    path = [0] * T
    path[T-1] = max(range(N), key=lambda j: delta[T-1][j])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1][path[t+1]]
    return path
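One practical caveat the code above glosses over: for long sequences, products of many small probabilities underflow float64. The standard remedy is to run Viterbi in log space, where products become sums. A sketch of that variant, using the casino parameters as an assumed example:

```python
import math

def viterbi_log(obs, A, B, pi):
    """Viterbi in log space: sums of logs instead of products."""
    N, T = len(pi), len(obs)
    def log(p):
        return math.log(p) if p > 0 else float("-inf")
    delta = [[0.0] * N for _ in range(T)]
    psi   = [[0] * N for _ in range(T)]
    for j in range(N):
        delta[0][j] = log(pi[j]) + log(B[j][obs[0]])
    for t in range(1, T):
        for j in range(N):
            # best predecessor under log-probabilities
            best_i = max(range(N), key=lambda i: delta[t-1][i] + log(A[i][j]))
            delta[t][j] = (delta[t-1][best_i] + log(A[best_i][j])
                           + log(B[j][obs[t]]))
            psi[t][j] = best_i
    path = [0] * T
    path[T-1] = max(range(N), key=lambda j: delta[T-1][j])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t+1][path[t+1]]
    return path

A  = [[0.95, 0.05], [0.10, 0.90]]
B  = [[1/6] * 6, [1/10] * 5 + [1/2]]
pi = [0.5, 0.5]
print(viterbi_log([5] * 6, A, B, pi))   # a run of 6s decodes as all-Loaded
```

The recursion and backtrace are unchanged; only the arithmetic moves to log space.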
Check: How does Viterbi differ from the Forward algorithm?

Chapter 7: Baum-Welch (EM)

What if you don't know the model parameters (A, B, π)? You only have observation sequences. The Baum-Welch algorithm learns the parameters from data using Expectation-Maximization (EM): iterate between estimating the hidden states (E-step) and updating the parameters (M-step).

E-Step: Estimate Hidden States
Run Forward-Backward to compute P(statet = j | observations). Uses current A, B, π.
M-Step: Update Parameters
Re-estimate A, B, π using the soft state assignments from E-step. Each parameter is a weighted average.
↓ repeat until convergence
The Backward variable: β_t(i) = P(o_{t+1}, ..., o_T | state_t = i). Combined with the forward variable, γ_t(i) = α_t(i) · β_t(i) / P(O) gives the posterior probability of being in state i at time t.
γ_t(i) = P(state_t = i | O, model) = α_t(i) · β_t(i) / P(O)
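The Backward pass is the mirror image of Forward, running right to left. A sketch under the casino parameters, together with the γ computation that the E-step relies on:

```python
def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}, ..., o_T | state_t = i)."""
    N, T = len(A), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    beta[T-1] = [1.0] * N                    # base case: no future left
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j]
                             for j in range(N))
    return beta

A  = [[0.95, 0.05], [0.10, 0.90]]
B  = [[1/6] * 6, [1/10] * 5 + [1/2]]
pi = [0.5, 0.5]
obs = [5, 0, 5, 5]                           # rolls: 6, 1, 6, 6

beta = backward(obs, A, B)

# Forward pass, keeping alpha for every timestep.
alphas = [[pi[j] * B[j][obs[0]] for j in range(2)]]
for t in range(1, len(obs)):
    alphas.append([B[j][obs[t]] * sum(alphas[-1][i] * A[i][j]
                                      for i in range(2)) for j in range(2)])
p_obs = sum(alphas[-1])

# gamma_t(i) = alpha_t(i) * beta_t(i) / P(O): posterior state probabilities.
gamma = [[a * b / p_obs for a, b in zip(alphas[t], beta[t])]
         for t in range(len(obs))]
assert all(abs(sum(g) - 1) < 1e-9 for g in gamma)
```

A useful invariant for debugging: Σ_i α_t(i)·β_t(i) equals P(O) at every timestep, so each γ row sums to 1.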
EM Convergence

Watch Baum-Welch learn the transition probability from data. Each iteration refines the estimate. The orange line is the learned value; the teal dashed line is the true value.

Iteration: 0 | A[F][F]: 0.50
Guaranteed to converge? Yes — EM always increases the likelihood (or keeps it the same). But it finds a local maximum, not necessarily the global one. Multiple random restarts help. In practice, Baum-Welch works remarkably well.
Check: What does the E-step of Baum-Welch compute?

Chapter 8: HMMs vs Modern Models

HMMs dominated sequence modeling from the 1970s through the 2000s. Then neural networks took over. Understanding this evolution reveals what each approach does well — and why Transformers ultimately won.

Model       | Era         | Hidden State          | Key Limitation
HMM         | 1970s–2000s | Discrete, finite      | Fixed number of states, Markov assumption
RNN/LSTM    | 2010s       | Continuous vector     | Sequential processing, vanishing gradients
Transformer | 2017+       | Contextual embeddings | Quadratic attention cost
Speech recognition: HMMs powered every major speech system for 30 years (Dragon, Siri v1, Google Voice). Each phoneme was an HMM state. Words were sequences of phoneme-HMMs. This worked — until deep learning produced a 30% error reduction overnight.
POS tagging: In NLP, HMMs were the standard for part-of-speech tagging. States = tags (noun, verb, adjective...), observations = words. "The dog runs" → [DET, NOUN, VERB]. Trigram HMMs achieved ~96% accuracy — a bar that stood for years.
Evolution of Sequence Models

From discrete hidden states to continuous representations. Each generation relaxes a constraint of the previous one.

Check: What is the primary limitation of HMMs compared to RNNs?

Chapter 9: Connections

HMMs don't exist in isolation. They're part of a rich family of models for reasoning about hidden states over time. Understanding these connections deepens your intuition for all of them.

Bayes Filter
The general framework: predict → update with Bayes' rule. HMMs are a special case where states and observations are discrete.
↓ continuous states
Kalman Filter
A "continuous HMM" with linear-Gaussian assumptions. States are real-valued vectors. The math is the same predict-update cycle.
↓ actions + rewards
Reinforcement Learning (POMDPs)
Add actions and rewards to an HMM. The agent must act optimally despite hidden states. Belief-state planning.
↓ learned representations
Deep Sequence Models
Replace discrete states with neural hidden representations. The forward pass is a learned version of the HMM recursion.
The key insight: HMM, Kalman filter, particle filter, RNN — they all share the same skeleton: maintain a belief about hidden states, update it as new observations arrive. The difference is the representation of that belief and how updates are computed.
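That shared skeleton fits in a few lines. Below is a minimal predict-update filter for the discrete case, applied to the casino model as an assumed example; the Kalman filter runs the same two steps with Gaussians instead of vectors of probabilities.

```python
def filter_step(belief, o, A, B):
    """One Bayes-filter cycle: predict with A, update with emission o."""
    N = len(belief)
    # Predict: push the belief through the transition dynamics.
    predicted = [sum(belief[i] * A[i][j] for i in range(N)) for j in range(N)]
    # Update: weight by the emission likelihood, then renormalize.
    updated = [predicted[j] * B[j][o] for j in range(N)]
    z = sum(updated)
    return [u / z for u in updated]

A  = [[0.95, 0.05], [0.10, 0.90]]
B  = [[1/6] * 6, [1/10] * 5 + [1/2]]
belief = [0.5, 0.5]                    # assumed uniform starting belief
for roll in [5, 5, 5]:                 # three 6s in a row (faces coded 0-5)
    belief = filter_step(belief, roll, A, B)
print(f"P(Loaded) = {belief[1]:.2f}")  # belief shifts strongly toward Loaded
```

This is exactly the Forward recursion with a normalization at each step, which is why the Forward algorithm is sometimes described as filtering.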
The Family Tree

How HMMs connect to other models. Each arrow relaxes a constraint or adds a capability.

Check: A Kalman filter can be seen as...
"The HMM is one of the most important statistical tools for modeling sequences. Master it, and you hold the key to speech, language, biology, and beyond."
— Lawrence Rabiner

You now understand the mathematics of hidden states. Every modern sequence model — from LSTMs to Transformers — carries the DNA of the HMM inside it.