Workbook — Probabilistic Reasoning

Probability & Bayes

Every calculation a probabilistic reasoning engineer needs to derive from scratch. Bayes rule, distributions, MLE/MAP, Kalman filters, HMMs, information theory — all solvable in-browser with instant feedback.

Prerequisites: Basic algebra (fractions, exponents) + Summation notation (∑). That's it.
10
Chapters
57
Exercises
5
Exercise Types
Mastery
0 / 57 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Bayes Rule from Scratch

A medical test for a rare disease comes back positive. The test is 95% accurate. Should you panic? Most people — including many doctors — get this wrong. The answer depends on something they forget to ask: how rare is the disease?

Bayes' theorem tells us how to update our belief about a hypothesis given new evidence:

Bayes' Theorem:
P(H | E) = P(E | H) × P(H) / P(E)

Where:
P(H | E) = posterior — probability of hypothesis after seeing evidence
P(E | H) = likelihood — probability of evidence given hypothesis is true
P(H) = prior — probability of hypothesis before evidence
P(E) = evidence — total probability of seeing this evidence

Expanding P(E) via the law of total probability:
P(E) = P(E | H) × P(H) + P(E | ¬H) × P(¬H)
The base rate trap. A 95% accurate test with 1% prevalence means most positive results are false positives. The prior P(H) = 0.01 is so small that even a good test can't overcome it in a single shot. This is Bayes' theorem in action — and why sequential testing matters.
Exercise 0.1: The Medical Test Derive

Disease prevalence: P(D) = 0.01. Test sensitivity: P(+|D) = 0.95. False positive rate: P(+|¬D) = 0.05. You test positive. What is P(D|+)?

Compute P(+) first, then apply Bayes' theorem. Express as a percentage rounded to one decimal.

% probability
Show derivation
P(+) = P(+|D) × P(D) + P(+|¬D) × P(¬D)
P(+) = 0.95 × 0.01 + 0.05 × 0.99 = 0.0095 + 0.0495 = 0.059
P(D|+) = P(+|D) × P(D) / P(+) = 0.0095 / 0.059 ≈ 0.161

Only 16.1%! Despite a 95% accurate test, the low prevalence means 84% of positive results are false positives. Out of every 1000 people: 10 have the disease (9.5 test positive), 990 don't (49.5 test positive). So 9.5 true positives out of 59 total positives = 16.1%.

Exercise 0.2: Sequential Update — Second Test Derive

After the first positive test (P(D|+) = 0.161), you take a second independent test with the same accuracy. It also comes back positive. What is P(D|++) now?

Use the posterior from the first test as the new prior. Apply Bayes again.

% probability
Show derivation
New prior: P(D) = 0.161, P(¬D) = 0.839
P(+) = 0.95 × 0.161 + 0.05 × 0.839 = 0.15295 + 0.04195 = 0.1949
P(D|++) = 0.15295 / 0.1949 ≈ 0.785

Two positive tests bring you to 78.5%. The posterior after each test becomes the prior for the next — this is sequential Bayesian updating. Each piece of evidence shifts the belief, and the updates are multiplicative in the odds ratio.

Exercise 0.3: Odds Form of Bayes Trace
Bayes' theorem can be written in odds form: posterior odds = prior odds × likelihood ratio. If the prior odds of disease are 1:99 and the likelihood ratio of a positive test is P(+|D)/P(+|¬D) = 0.95/0.05 = 19, what are the posterior odds after one positive test?
Show derivation
Prior odds = P(D)/P(¬D) = 0.01/0.99 = 1/99
Likelihood ratio = P(+|D)/P(+|¬D) = 0.95/0.05 = 19
Posterior odds = (1/99) × 19 = 19/99 ≈ 0.192
Convert to probability: 19/(19+99) = 19/118 ≈ 0.161 = 16.1%

The odds form is powerful because sequential updates become simple multiplication: just keep multiplying by the likelihood ratio for each new observation. After two positive tests: (1/99) × 19 × 19 = 361/99 ≈ 3.65, which is 3.65/(1+3.65) = 78.5%.

Exercise 0.4: Spam Filter Derive

A spam filter sees the word "free" in an email. P(spam) = 0.3. P("free"|spam) = 0.8. P("free"|not spam) = 0.1. What is P(spam|"free")?

probability (0 to 1)
Show derivation
P("free") = P("free"|spam) × P(spam) + P("free"|¬spam) × P(¬spam)
P("free") = 0.8 × 0.3 + 0.1 × 0.7 = 0.24 + 0.07 = 0.31
P(spam|"free") = 0.24 / 0.31 ≈ 0.774

The word "free" bumps spam probability from 30% to 77.4%. Naive Bayes spam filters apply this update for every word independently — each word's likelihood ratio multiplies the odds. With enough discriminative words, the posterior becomes very confident.

Exercise 0.5: Three Positive Tests Derive

Using the odds form: starting from prior odds 1:99, likelihood ratio 19 per positive test. After three consecutive positive tests, what is the probability of disease?

% probability
Show derivation
Posterior odds = (1/99) × 193 = 6859/99 ≈ 69.28
P(D|+++) = 69.28 / (1 + 69.28) = 69.28 / 70.28 ≈ 0.986

Three independent positive tests push probability to 98.6%. This shows the power of sequential updating: even with a low prior, repeated consistent evidence converges to near-certainty. The odds multiply by 19 each time: 0.010 → 0.192 → 3.65 → 69.3.

Exercise 0.6: Implement bayesUpdate() Build

Write a function that performs a single Bayesian update. Takes prior P(H), likelihood P(E|H), and false-positive P(E|¬H). Returns the posterior P(H|E).

Return a single number between 0 and 1.
Show solution
javascript
function bayesUpdate(prior, likelihood, falsePositive) {
  const pEvidence = likelihood * prior + falsePositive * (1 - prior);
  return (likelihood * prior) / pEvidence;
}

Chapter 1: Probability Distributions

Before you can reason about uncertainty, you need to speak the language of distributions. A probability distribution assigns a probability to every possible outcome. Three distributions cover most of what you'll encounter in ML: the Gaussian (continuous, bell-shaped), the Binomial (counting successes), and the Poisson (counting rare events).

Gaussian (Normal):
f(x) = (1 / √(2πσ²)) × exp(-(x - μ)² / (2σ²))
Mean = μ,   Variance = σ²

68-95-99.7 rule: 68% within ±1σ, 95% within ±2σ, 99.7% within ±3σ

Binomial: P(X = k) = C(n,k) × pk × (1-p)n-k
Mean = np,   Variance = np(1-p)

Poisson: P(X = k) = λk × e / k!
Mean = λ,   Variance = λ
The Gaussian is everywhere. By the central limit theorem, the sum of many independent random variables approaches a Gaussian — regardless of the original distribution. This is why measurement noise, estimation errors, and parameter initializations are almost always modeled as Gaussian. When you see a bell curve, the CLT is lurking.
Exercise 1.1: Gaussian Tail Probability Derive

For a standard normal distribution (μ=0, σ=1), what is P(X > 2)? Use the 68-95-99.7 rule.

95% of values fall within ±2σ. The remaining 5% is split equally between both tails.

% probability
Show derivation
P(|X| ≤ 2) ≈ 0.95
P(|X| > 2) = 1 - 0.95 = 0.05
P(X > 2) = 0.05 / 2 = 0.025 = 2.5%

The exact value is 2.28%, so the 68-95-99.7 rule gives a good approximation. This is used everywhere: 2σ events are "unlikely" (p ≈ 0.025 per tail), 3σ events are "rare" (p ≈ 0.0015).

Exercise 1.2: Binomial — Coin Flips Derive

A biased coin has P(heads) = 0.6. In 10 flips, what is P(exactly 7 heads)? Compute C(10,7) × 0.67 × 0.43.

probability (0 to 1)
Show derivation
C(10,7) = 10! / (7! × 3!) = (10 × 9 × 8) / (3 × 2 × 1) = 120
0.67 = 0.0279936
0.43 = 0.064
P(X=7) = 120 × 0.0279936 × 0.064 = 0.2150

The mean is np = 6, so getting 7 heads is only 1 above average — quite plausible at 21.5%. The binomial peaks near the mean and decays on both sides.

Exercise 1.3: Poisson — Server Crashes Derive

A server averages λ=3 crashes per day. What is P(exactly 5 crashes tomorrow)?

probability (0 to 1)
Show derivation
P(X=5) = λ5 × e / 5!
= 35 × e-3 / 120
= 243 × 0.04979 / 120
= 12.099 / 120 = 0.1008

About 10% chance. The Poisson distribution models the count of rare, independent events in a fixed interval. It's the go-to for traffic rates, mutation counts, and photon arrivals.

Exercise 1.4: Which Distribution? Trace
You're modeling the number of typos in a 500-word essay, where each word has a small independent probability of containing a typo. Which distribution fits best?
Show reasoning

Technically Binomial(500, p) works, but when n is large and p is small, the Poisson with λ = np is the standard approximation. The Poisson is the limit of the Binomial as n → ∞ and p → 0 with np = λ held constant. Both (B) and (C) are defensible, but Poisson is the canonical choice for "count of rare events in many trials."

Exercise 1.5: Binomial Mean and Variance Derive

For the biased coin with p=0.6 and n=10: compute the mean and standard deviation of the number of heads.

standard deviation
Show derivation
Mean = np = 10 × 0.6 = 6
Variance = np(1-p) = 10 × 0.6 × 0.4 = 2.4
Std dev = √2.4 ≈ 1.549

So we expect 6 ± 1.55 heads. Using the Gaussian approximation: ~68% of the time we'd get between 4.45 and 7.55 heads, i.e., 5-7 heads. The Gaussian approximation works well when np and n(1-p) are both ≥ 5.

Exercise 1.6: Implement gaussianPDF() Build

Write a function that computes the probability density of the Gaussian distribution at a point x, given mean μ and standard deviation σ.

Use Math.PI, Math.sqrt, Math.exp.
Show solution
javascript
function gaussianPDF(x, mu, sigma) {
  const coeff = 1 / Math.sqrt(2 * Math.PI * sigma * sigma);
  const exponent = -(x - mu) * (x - mu) / (2 * sigma * sigma);
  return coeff * Math.exp(exponent);
}

Chapter 2: Maximum Likelihood Estimation

You have data. You want to fit a model. Maximum Likelihood Estimation (MLE) asks: "What parameter values make the observed data most probable?" It's the foundation of nearly all statistical learning — from logistic regression to neural network training.

MLE Principle:
θ̂MLE = argmaxθ P(data | θ)

For independent data points x1, ..., xn:
L(θ) = ∏i P(xi | θ)
log L(θ) = ∑i log P(xi | θ)   // log-likelihood (easier to optimize)

Gaussian MLE:
μ̂ = (1/n) ∑ xi   // sample mean
σ̂² = (1/n) ∑ (xi - μ̂)²   // biased sample variance
MLE = the training objective you already know. Minimizing cross-entropy loss in a classifier is the same as maximizing the log-likelihood. Minimizing MSE for regression is MLE under a Gaussian noise assumption. When you train a neural network with these losses, you are doing MLE.
Exercise 2.1: Gaussian MLE — Mean Derive

Data: [2, 4, 6, 8, 10]. What is the MLE estimate for the Gaussian mean μ?

μ̂
Show derivation
μ̂ = (1/n) ∑ xi = (2 + 4 + 6 + 8 + 10) / 5 = 30 / 5 = 6

The MLE for the Gaussian mean is simply the sample mean. This follows from taking the derivative of the log-likelihood with respect to μ, setting it to zero, and solving.

Exercise 2.2: Gaussian MLE — Standard Deviation Derive

Same data: [2, 4, 6, 8, 10]. What is the MLE estimate for σ (standard deviation, not variance)? Remember: MLE uses 1/n, not 1/(n-1).

σ̂
Show derivation
σ̂² = (1/5) × [(2-6)² + (4-6)² + (6-6)² + (8-6)² + (10-6)²]
= (1/5) × [16 + 4 + 0 + 4 + 16] = 40/5 = 8
σ̂ = √8 = 2√2 ≈ 2.828

Note: the unbiased estimate uses 1/(n-1) = 1/4, giving σ² = 10 and σ = 3.16. MLE is slightly biased — it underestimates the true variance. For large n, the difference is negligible.

Exercise 2.3: Bernoulli MLE Derive

You flip a coin 10 times and get 7 heads. What is the MLE estimate for p (probability of heads)?

Show derivation
L(p) = p7 × (1-p)3
log L(p) = 7 log(p) + 3 log(1-p)
d/dp [log L] = 7/p - 3/(1-p) = 0
7(1-p) = 3p   ⇒   7 - 7p = 3p   ⇒   p = 7/10 = 0.7

The Bernoulli MLE is simply the fraction of successes: p̂ = k/n. Intuitively obvious, but the derivation proves it's optimal — no other estimate makes the observed data more probable.

Exercise 2.4: Poisson MLE Derive

Server crash counts over 5 days: [2, 5, 3, 1, 4]. What is the MLE for λ (the Poisson rate parameter)?

λ̂
Show derivation
log L(λ) = ∑ [xi log(λ) - λ - log(xi!)]
d/dλ [log L] = ∑ xi / λ - n = 0
λ̂ = (1/n) ∑ xi = (2+5+3+1+4)/5 = 15/5 = 3

The Poisson MLE is the sample mean — just like the Gaussian MLE for μ. This makes sense: λ IS the mean of the Poisson distribution, so the best estimate of the mean is... the sample mean.

Exercise 2.5: Implement mlEstimate() Build

Write a function that computes Gaussian MLE estimates for μ and σ from an array of data points. Return an object {mu, sigma}.

Return an object with mu and sigma fields.
Show solution
javascript
function mlEstimate(data) {
  const n = data.length;
  const mu = data.reduce((s, x) => s + x, 0) / n;
  const variance = data.reduce((s, x) => s + (x - mu) ** 2, 0) / n;
  return { mu, sigma: Math.sqrt(variance) };
}
Exercise 2.6: MLE vs Unbiased Trace
You have only n=2 data points: [3, 7]. The MLE variance is (1/2)∑(xi-5)² = 4. The unbiased estimate is (1/1)∑(xi-5)² = 8. The true variance is 8. Which statement is correct?
Show reasoning

MLE divides by n, unbiased by (n-1). The ratio is (n-1)/n. At n=2 the MLE is half the truth. At n=100 the MLE is 99% of the truth. At n=1000 it's 99.9%. In ML with millions of data points, the difference is utterly negligible — MLE is the standard choice.

Chapter 3: MAP vs MLE

MLE trusts the data completely. But what if you only have 3 coin flips and got 3 heads — does that mean p=1.0? MLE says yes. Your intuition says no. Maximum A Posteriori (MAP) estimation adds a prior belief that regularizes the answer, especially when data is scarce.

MAP Estimation:
θ̂MAP = argmaxθ P(θ | data) = argmaxθ P(data | θ) × P(θ)

For a Bernoulli with Beta(α, β) prior:
MAP estimate: p̂ = (k + α - 1) / (n + α + β - 2)
MLE estimate: p̂ = k / n

Key insight: MAP = MLE when the prior is uniform (flat).
MAP = MLE + regularization. In neural network terms: MLE is unregularized training. A Gaussian prior on weights N(0, σ²) gives a MAP objective equivalent to L2 regularization (weight decay). A Laplace prior gives L1 (sparsity). Every regularizer has a Bayesian interpretation as a prior.
Exercise 3.1: MAP with Beta Prior Derive

You observe 7 heads in 10 flips. Your prior is Beta(2,2) — a mild belief that the coin is roughly fair. What is the MAP estimate for p?

Use the formula: p̂MAP = (k + α - 1) / (n + α + β - 2)

MAP
Show derivation
MAP = (7 + 2 - 1) / (10 + 2 + 2 - 2) = 8 / 12 = 0.667
MLE = 7 / 10 = 0.700

The Beta(2,2) prior pulls the estimate from 0.70 toward 0.50 (the prior mean), landing at 0.667. The prior acts like 2 "phantom" observations of each outcome. Think of α-1 as prior heads and β-1 as prior tails.

Exercise 3.2: MAP with Strong Prior Derive

Same data (7/10 heads), but now your prior is Beta(20,20) — a strong belief the coin is fair. What is the MAP estimate?

MAP
Show derivation
MAP = (7 + 20 - 1) / (10 + 20 + 20 - 2) = 26 / 48 ≈ 0.542

The strong prior (equivalent to 38 phantom observations) overwhelms the 10 real observations, pulling the estimate close to 0.5. With Beta(20,20), you'd need ~40+ real observations to significantly move away from the prior belief of fairness.

Exercise 3.3: When Does MAP = MLE? Trace
Under which condition does the MAP estimate exactly equal the MLE estimate?
Show reasoning

MAP maximizes P(data|θ) × P(θ). If P(θ) is constant, the prior drops out and MAP reduces to maximizing P(data|θ) alone — which is exactly MLE. For the Beta prior: Beta(1,1) is uniform on [0,1], so MAP with Beta(1,1) gives p̂ = (k+0)/(n+0) = k/n = MLE.

Also: as n → ∞, the data overwhelms any finite prior, so MAP converges to MLE asymptotically. But they are only exactly equal when the prior is flat.

Exercise 3.4: L2 Regularization as MAP Derive

In neural network training with MSE loss and L2 weight decay λ=0.01, the total loss is L = (1/n)∑(yi - f(xi))² + 0.01 × ∑wj². If this corresponds to MAP with a Gaussian prior N(0, σ2prior) on weights, what is σprior?

Hint: The MAP penalty from a Gaussian prior is (1/(2σ²))∑wj². Set this equal to λ∑wj².

σprior
Show derivation
MAP penalty: (1 / 2σ²) × ∑wj² = λ × ∑wj²
1 / (2σ²) = λ = 0.01
σ² = 1 / (2 × 0.01) = 50
σ = √50 = 5√2 ≈ 7.07

Small λ (weak regularization) = large σ (wide prior = vague belief about weight magnitudes). Large λ (strong regularization) = small σ (tight prior = strong belief weights should be near zero). This is the Bayesian interpretation of every regularizer you've ever used.

Exercise 3.5: MAP Convergence Derive

With Beta(2,2) prior, how many heads out of n=100 flips would give a MAP estimate of exactly 0.70?

Solve: (k + 1) / (100 + 2) = 0.70

heads
Show derivation
(k + 2 - 1) / (100 + 2 + 2 - 2) = 0.70
(k + 1) / 102 = 0.70
k + 1 = 71.4
k = 70.4

With n=100 data points, you need about 70-71 heads to get a MAP of 0.70. Compare to MLE: you'd need exactly 70 heads (70/100). The prior barely shifts the answer — with enough data, MAP and MLE converge. The 2 phantom observations are negligible against 100 real ones.

Chapter 4: Kalman Filter Math

You're tracking a drone. Your physics model predicts where it should be, but it's approximate. Your GPS gives a noisy measurement. Neither alone is reliable. The Kalman filter optimally fuses prediction and measurement — and you need to be able to trace every step of the math by hand.

Predict step:
- = F · x̂ + B · u    // predicted state
P- = F · P · FT + Q    // predicted covariance

Update step:
K = P- · HT · (H · P- · HT + R)-1    // Kalman gain
x̂ = x̂- + K · (z - H · x̂-)    // updated state
P = (I - K · H) · P-    // updated covariance

Where:
F = state transition, B = control input, Q = process noise
H = measurement matrix, R = measurement noise, z = measurement
The predict-update rhythm. Predict: "Based on physics, I think the state is here, with this much uncertainty." Update: "I got a measurement — let me optimally blend it with my prediction." The Kalman gain K is the blending weight: K near 1 trusts the measurement, K near 0 trusts the prediction.
Exercise 4.1: 1D Predict Step Derive

1D constant-velocity tracker: state = [position, velocity]. F = [[1, dt], [0, 1]] with dt=1. Current state x̂ = [10, 2], P = [[4, 0], [0, 1]], Q = [[0.25, 0], [0, 0.25]], no control input. What is the predicted position x̂-[0]?

predicted position
Show derivation
- = F · x̂ = [[1,1],[0,1]] · [10, 2] = [10+2, 0+2] = [12, 2]

Position advances by velocity × dt: 10 + 2×1 = 12. Velocity stays at 2 (constant-velocity model). The predict step is just applying the physics model forward.

Exercise 4.2: Predicted Covariance Derive

Same tracker. Compute P-[0][0] — the predicted position variance. F = [[1,1],[0,1]], P = [[4,0],[0,1]], Q = [[0.25,0],[0,0.25]].

P- = FPFT + Q. Compute FP first, then (FP)FT, then add Q.

P-[0][0]
Show derivation
FP = [[1,1],[0,1]] · [[4,0],[0,1]] = [[4,1],[0,1]]
FPFT = [[4,1],[0,1]] · [[1,0],[1,1]] = [[5,1],[1,1]]
P- = FPFT + Q = [[5.25, 1],[1, 1.25]]

Position variance grew from 4 to 5.25. The extra 1 came from velocity uncertainty propagating into position uncertainty (the off-diagonal cross term), and 0.25 from process noise. Uncertainty always grows in the predict step — that's the whole reason we need measurements.

Exercise 4.3: Scalar Kalman Gain Derive

Simplified 1D scalar case: predicted variance P- = 5.25, measurement noise R = 2, H = 1 (we directly measure position). What is the Kalman gain K?

K
Show derivation
K = P- · HT / (H · P- · HT + R)
K = 5.25 × 1 / (1 × 5.25 × 1 + 2) = 5.25 / 7.25 ≈ 0.724

K = 0.724: we trust the measurement about 72% and our prediction about 28%. This makes sense — our prediction variance (5.25) is larger than our measurement noise (2), so we lean more toward the measurement.

Exercise 4.4: R → 0 vs R → ∞ Trace
In the scalar case K = P- / (P- + R). What happens as the measurement noise R approaches zero?
Show reasoning
K = P- / (P- + R)
As R → 0: K → P- / P- = 1
As R → ∞: K → P- / ∞ = 0

K is literally a trust slider. R = 0 means perfect measurements, so K = 1 (just use the measurement). R = ∞ means useless measurements, so K = 0 (ignore them, trust the model). This is the single most important intuition for the Kalman filter.

Exercise 4.5: Updated State Derive

Predicted position x̂- = 12, measurement z = 11.5, Kalman gain K = 0.724. What is the updated position estimate x̂?

updated position
Show derivation
x̂ = x̂- + K × (z - x̂-)
x̂ = 12 + 0.724 × (11.5 - 12)
x̂ = 12 + 0.724 × (-0.5) = 12 - 0.362 = 11.638

The innovation (z - x̂-) = -0.5 says the measurement is below the prediction. K = 0.724 says we trust the measurement more, so we move 72.4% of the way toward the measurement: 12 → 11.638 (closer to 11.5 than to 12).

Exercise 4.6: Arrange the Kalman Cycle Design

Put these Kalman filter operations in the correct order for one predict-update cycle.

?
?
?
?
?
Predict state: x̂- = Fx̂ Predict covariance: P- = FPFT+Q Compute Kalman gain K Update state: x̂ = x̂- + K(z-Hx̂-) Update covariance: P = (I-KH)P-
Show explanation

The correct order is: Predict statePredict covarianceKalman gainUpdate stateUpdate covariance.

Predict must happen before update (we need x̂- and P- before computing K). The Kalman gain must come before state/covariance update (K is needed for both). State and covariance updates can happen in either order since they're independent, but conventionally state comes first.

Chapter 5: Kalman Gain Derivations

The Kalman gain K is the heart of the filter — it determines how much weight to give measurement versus prediction. In this chapter, you'll compute K for specific numbers, trace what happens to uncertainty after the update, and implement the full 1D Kalman update.

Scalar (1D) Kalman gain:
K = P- / (P- + R)    // when H = 1

After update:
P = (1 - K) × P-    // always smaller than P-

Equivalently:
1/P = 1/P- + 1/R    // information form: uncertainties add in inverse!
Uncertainty always shrinks in the update. P = (1-K) × P- and 0 < K < 1, so P < P- always. Every measurement reduces uncertainty. This is the fundamental guarantee of the Kalman filter: fusion can never be worse than the best individual source.
Exercise 5.1: Compute K Derive

1D Kalman filter, H=1. Predicted variance P- = 4, measurement noise R = 1. What is K?

K
Show derivation
K = P- / (P- + R) = 4 / (4 + 1) = 4/5 = 0.8

K = 0.8 means we put 80% weight on the measurement and 20% on the prediction. Our prediction is 4x noisier than our measurement (P-/R = 4), so we lean heavily toward the measurement.

Exercise 5.2: Updated Variance Derive

With P- = 4, K = 0.8: what is the posterior variance P after the update?

P
Show derivation
P = (1 - K) × P- = (1 - 0.8) × 4 = 0.2 × 4 = 0.8

Variance dropped from 4 to 0.8 — an 80% reduction! Verify with information form: 1/P = 1/P- + 1/R = 1/4 + 1/1 = 0.25 + 1 = 1.25, so P = 1/1.25 = 0.8. Fusing two sources always produces less uncertainty than either alone (0.8 < min(4, 1)).

Exercise 5.3: Information Form Derive

A GPS has variance 25 (PGPS = 25). A LIDAR has variance 4 (PLIDAR = 4). Using the information form 1/P = 1/P1 + 1/P2, what is the fused variance?

fused variance
Show derivation
1/P = 1/25 + 1/4 = 0.04 + 0.25 = 0.29
P = 1/0.29 ≈ 3.448

The fused variance (3.45) is less than the better sensor alone (4). The GPS barely helps here (it's so noisy), but it still shaves off 0.55 from the LIDAR's variance. In information form, every additional sensor adds information (1/P), so fusion always helps.

Exercise 5.4: K Bounds Trace
For K = P- / (P- + R) with P- > 0 and R > 0, what range can K take?
Show reasoning

With P- > 0 and R > 0: the numerator P- is strictly positive, the denominator P- + R is strictly larger than P-, so K = P-/(P-+R) is in (0, 1). It approaches 0 as R → ∞ and approaches 1 as R → 0, but never exactly reaches either (since both P- and R are strictly positive, not zero or infinity).

Exercise 5.5: Implement kalmanUpdate1D() Build

Implement the Kalman filter update step for the 1D scalar case. Given predicted state, predicted variance, measurement, and measurement noise, return the updated state and variance.

Return an object with x and p fields.
Show solution
javascript
function kalmanUpdate1D(xPred, pPred, z, R) {
  const K = pPred / (pPred + R);    // Kalman gain
  const x = xPred + K * (z - xPred); // updated state
  const p = (1 - K) * pPred;         // updated variance
  return { x, p };
}
Exercise 5.6: Find the Bug Debug

This Kalman update gives correct state estimates but the variance diverges over time (keeps growing). Click the buggy line.

function kalmanUpdate1D(xPred, pPred, z, R) {
  const K = pPred / (pPred + R);
  const x = xPred + K * (z - xPred);
  const p = (1 + K) * pPred;
  return { x, p };
}
Show explanation

Line 4 is the bug. It says (1 + K) instead of (1 - K). With (1+K), the variance increases at every update instead of shrinking. Since 0 < K < 1, multiplying by (1+K) gives a value between 1x and 2x of P- — guaranteed divergence. The correct formula P = (1-K) × P- always shrinks variance because (1-K) ∈ (0,1).

Chapter 6: HMM Forward Algorithm

A Hidden Markov Model (HMM) has states you can't see (hidden) that produce observations you can see. A dishonest casino switches between a fair die and a loaded die — you see the rolls but not which die is in play. The forward algorithm computes P(observations) by summing over all possible hidden state sequences.

HMM Parameters:
A = transition matrix: A[i][j] = P(state j at t+1 | state i at t)
B = emission matrix: B[i][k] = P(observation k | state i)
π = initial distribution: π[i] = P(state i at t=0)

Forward variable:
αt(i) = P(o1, ..., ot, qt = i)

Initialization: α1(i) = π[i] × B[i][o1]
Recursion: αt+1(j) = [∑i αt(i) × A[i][j]] × B[j][ot+1]
Termination: P(O) = ∑i αT(i)
Why not enumerate all paths? With N states and T timesteps, there are NT possible state sequences. For N=3, T=100: that's 3100 ≈ 5 × 1047 paths. The forward algorithm computes the same result in O(N²T) using dynamic programming — summing at each step instead of enumerating paths.
Exercise 6.1: Forward Initialization Derive

Dishonest casino HMM: 2 states (Fair, Loaded), 2 observations (Low=0, High=1). π = [0.6, 0.4]. B = [[0.8, 0.2], [0.3, 0.7]] (Fair die: 80% Low; Loaded: 70% High). First observation: High (o1=1). Compute α1(Fair) and α1(Loaded).

α1(Loaded)
Show derivation
α1(Fair) = π[Fair] × B[Fair][High] = 0.6 × 0.2 = 0.12
α1(Loaded) = π[Loaded] × B[Loaded][High] = 0.4 × 0.7 = 0.28

Seeing a "High" observation makes the Loaded state more likely (0.28 vs 0.12), even though the prior favored Fair (0.6 vs 0.4). The emission probability flipped the balance because Loaded strongly favors High rolls.

Exercise 6.2: Forward Recursion Derive

Transition matrix A = [[0.7, 0.3], [0.4, 0.6]] (Fair stays Fair 70%, Loaded stays Loaded 60%). From α1 = [0.12, 0.28], the second observation is also High (o2=1). Compute α2(Loaded).

α2(j) = [∑i α1(i) × A[i][j]] × B[j][o2]

α2(Loaded)
Show derivation
i α1(i) × A[i][Loaded] = 0.12 × 0.3 + 0.28 × 0.6 = 0.036 + 0.168 = 0.204
α2(Loaded) = 0.204 × B[Loaded][High] = 0.204 × 0.7 = 0.1428

For completeness: α2(Fair) = (0.12 × 0.7 + 0.28 × 0.4) × 0.2 = (0.084 + 0.112) × 0.2 = 0.196 × 0.2 = 0.0392. Two consecutive High rolls: α(Loaded) = 0.1428 dominates α(Fair) = 0.0392.

Exercise 6.3: Total Probability Derive

Using α2 = [0.0392, 0.1428] from the exercise above, what is P(o1=High, o2=High) — the total probability of the observation sequence?

P(observations)
Show derivation
P(O) = ∑i α2(i) = 0.0392 + 0.1428 = 0.182

There is an 18.2% probability of seeing two consecutive High observations under this HMM. This value is used in model training (Baum-Welch) and model comparison (which HMM best explains the data?).

Exercise 6.4: Forward Algorithm Complexity Trace
An HMM has N=5 hidden states and you observe T=200 timesteps. How many multiplications does the forward algorithm need (approximately)?
Show reasoning

At each timestep t, for each of the N states j, we sum over all N previous states i: ∑i αt(i) × A[i][j]. That's N multiplications for each of N states = N² per timestep. Over T timesteps: O(N²T) = 25 × 200 = 5,000. Compare to brute-force enumeration of 5200 paths — a number with 140 digits.

Exercise 6.5: Implement hmmForward() Build

Implement the forward algorithm. Given initial distribution π, transition matrix A, emission matrix B, and observation sequence, return the total probability P(observations).

Initialize α from π and B, then recurse through observations.
Show solution
javascript
function hmmForward(pi, A, B, obs) {
  const N = pi.length;
  // Initialize
  let alpha = pi.map((p, i) => p * B[i][obs[0]]);

  // Recurse
  for (let t = 1; t < obs.length; t++) {
    const newAlpha = [];
    for (let j = 0; j < N; j++) {
      let sum = 0;
      for (let i = 0; i < N; i++) sum += alpha[i] * A[i][j];
      newAlpha.push(sum * B[j][obs[t]]);
    }
    alpha = newAlpha;
  }

  // Terminate
  return alpha.reduce((s, a) => s + a, 0);
}

Chapter 7: Viterbi Algorithm

The forward algorithm gives P(observations). But often you want to know: which state sequence is most likely? That's the Viterbi algorithm. Instead of summing over previous states (forward), Viterbi takes the max — it finds the single best path through the state space.

Viterbi variable:
δt(j) = max over all paths ending at state j at time t

Initialization: δ1(i) = π[i] × B[i][o1]
Recursion: δt+1(j) = maxit(i) × A[i][j]] × B[j][ot+1]
Track backpointers: ψt+1(j) = argmaxit(i) × A[i][j]]
Termination: best final state = argmaxi δT(i), then backtrace
Sum vs Max: Forward vs Viterbi. Forward: αt+1(j) = [i αt(i) × A[i][j]] × B. Viterbi: δt+1(j) = [maxi δt(i) × A[i][j]] × B. That's the only difference! Replace sum with max, and add backpointers to reconstruct the path. Same complexity: O(N²T).
Exercise 7.1: Viterbi Initialization Derive

Same casino HMM: π = [0.6, 0.4], B = [[0.8, 0.2], [0.3, 0.7]]. First observation: High. Compute δ1 for both states. Which state is the most likely starting state?

δ1(Loaded)
Show derivation
δ1(Fair) = π[Fair] × B[Fair][High] = 0.6 × 0.2 = 0.12
δ1(Loaded) = π[Loaded] × B[Loaded][High] = 0.4 × 0.7 = 0.28

Viterbi initialization is identical to forward initialization. The divergence happens at the recursion step (max vs sum). Loaded is the most likely starting state given a High observation.

Exercise 7.2: Viterbi Recursion Derive

From δ1 = [0.12, 0.28], A = [[0.7, 0.3], [0.4, 0.6]], second observation: High. Compute δ2(Loaded) and the backpointer ψ2(Loaded).

δ2(Loaded) = max(0.12×0.3, 0.28×0.6) × B[Loaded][High]

δ2(Loaded)
Show derivation
From Fair: δ1(Fair) × A[Fair][Loaded] = 0.12 × 0.3 = 0.036
From Loaded: δ1(Loaded) × A[Loaded][Loaded] = 0.28 × 0.6 = 0.168
max(0.036, 0.168) = 0.168   ⇒   ψ2(Loaded) = Loaded
δ2(Loaded) = 0.168 × B[Loaded][High] = 0.168 × 0.7 = 0.1176

The best path to Loaded at t=2 came from Loaded at t=1 (backpointer = Loaded). So the Viterbi path so far is [Loaded, Loaded]. Two High observations in a row — the loaded die is the most plausible explanation.

Exercise 7.3: Complete Viterbi Path Trace
For the observation sequence [High, High, Low] in our casino HMM, δ2 = [0.0392, 0.1176]. At t=3 with observation Low: δ3(Fair) = max(0.0392×0.7, 0.1176×0.4) × 0.8 = max(0.02744, 0.04704) × 0.8 = 0.0376. δ3(Loaded) = max(0.0392×0.3, 0.1176×0.6) × 0.3 = 0.02117. Which is the most likely full path?
Show reasoning

Best final state: argmax(δ3) = Fair (0.0376 > 0.0212). Backtrace: ψ3(Fair) = Loaded (0.04704 > 0.02744). ψ2(Loaded) = Loaded. So the Viterbi path is [Loaded, Loaded, Fair]. The casino used the loaded die for the two High rolls, then switched to the fair die for the Low roll. Intuitive!

Exercise 7.4: Find the Bug Debug

This Viterbi implementation gives P(observations) instead of the best-path probability. Click the buggy line.

function viterbi(pi, A, B, obs) {
  const N = pi.length;
  let delta = pi.map((p, i) => p * B[i][obs[0]]);
  for (let t = 1; t < obs.length; t++) {
    const newDelta = [];
    for (let j = 0; j < N; j++) {
      let val = 0;
      for (let i = 0; i < N; i++) val += delta[i] * A[i][j];
      newDelta.push(val * B[j][obs[t]]);
    }
    delta = newDelta;
  }
  return Math.max(...delta);
}
Show explanation

Line 8 is the bug. It uses val += delta[i] * A[i][j] (summing — that's the forward algorithm!). It should use val = Math.max(val, delta[i] * A[i][j]) to find the maximum. The crucial difference between Forward and Viterbi is sum vs max at the recursion step.

Exercise 7.5: Log-Domain Viterbi Trace
With a long observation sequence (T=1000), the δ values become astronomically small (products of hundreds of probabilities < 1). What is the standard fix?
Show reasoning

Log-domain Viterbi: log δt+1(j) = maxi[log δt(i) + log A[i][j]] + log B[j][ot+1]. Products become sums, max of products becomes max of sums. Values stay in a numerically stable range (e.g., -500 to 0 instead of 10-217 to 100). This is why log-probabilities are the default in all production HMM implementations.

Chapter 8: Information Theory

How surprised should you be by an outcome? How much information does a measurement carry? Information theory quantifies uncertainty with entropy, measures the distance between distributions with KL divergence, and connects directly to the cross-entropy loss you use in every classifier.

Entropy (uncertainty of a distribution):
H(X) = -∑i pi log2 pi   // bits

KL Divergence (distance from q to p):
KL(p ‖ q) = ∑i pi log(pi / qi)

Cross-Entropy:
CE(p, q) = -∑i pi log qi = H(p) + KL(p ‖ q)

Key properties:
H(X) ≥ 0    KL(p ‖ q) ≥ 0    KL(p ‖ q) ≠ KL(q ‖ p) in general
Cross-entropy loss IS information theory. When you train a classifier by minimizing -∑ yi log p̂i, you are minimizing the cross-entropy between the true labels p and your model's predictions q. Minimizing CE(p,q) = H(p) + KL(p‖q), and since H(p) is fixed, you're really minimizing KL(p‖q) — the gap between your model and reality.
Exercise 8.1: Fair Coin Entropy Derive

Compute the entropy (in bits) of a fair coin: P(H) = P(T) = 0.5.

bits
Show derivation
H = -(0.5 × log2(0.5) + 0.5 × log2(0.5))
= -(0.5 × (-1) + 0.5 × (-1))
= -(-1) = 1 bit

A fair coin has maximum entropy for a binary variable: 1 bit. This means you need exactly 1 yes/no question to determine the outcome. It's the most uncertain a 2-outcome distribution can be.

Exercise 8.2: Biased Coin Entropy Derive

Compute the entropy (in bits) of a biased coin: P(H) = 0.9, P(T) = 0.1.

bits
Show derivation
H = -(0.9 × log2(0.9) + 0.1 × log2(0.1))
= -(0.9 × (-0.1520) + 0.1 × (-3.3219))
= -(- 0.1368 - 0.3322) = 0.469 bits

Less than half a bit — much less uncertain than a fair coin. When outcomes are predictable (p near 0 or 1), entropy drops toward 0. Maximum entropy (1 bit) occurs at p = 0.5.

Exercise 8.3: KL Divergence Derive

Compute KL(p ‖ q) where p = [0.9, 0.1] and q = [0.5, 0.5]. Use natural log (ln).

nats
Show derivation
KL(p ‖ q) = ∑ pi ln(pi / qi)
= 0.9 × ln(0.9/0.5) + 0.1 × ln(0.1/0.5)
= 0.9 × ln(1.8) + 0.1 × ln(0.2)
ln(1.8) = 0.5878,   ln(0.2) = -1.6094
= 0.9 × 0.5878 + 0.1 × (-1.6094) = 0.5290 - 0.1609 = 0.368 nats

KL divergence measures how much information is lost when using q to approximate p. It's always ≥ 0 and equals 0 only when p = q.

Exercise 8.4: KL Asymmetry Derive

Now compute KL(q ‖ p) with the same p = [0.9, 0.1] and q = [0.5, 0.5]. Is it equal to KL(p ‖ q)?

nats
Show derivation
KL(q ‖ p) = ∑ qi ln(qi / pi)
= 0.5 × ln(0.5/0.9) + 0.5 × ln(0.5/0.1)
= 0.5 × ln(0.5556) + 0.5 × ln(5)
= 0.5 × (-0.5878) + 0.5 × 1.6094
= -0.2939 + 0.8047 = 0.5108 nats

KL(p‖q) = 0.368 but KL(q‖p) = 0.511 — they're different! KL divergence is not symmetric. This is why it's called a "divergence" not a "distance." The direction matters: KL(p‖q) penalizes places where p is high but q is low.

Exercise 8.5: Cross-Entropy = H + KL Derive

Compute the cross-entropy CE(p, q) with p = [0.9, 0.1], q = [0.5, 0.5] using natural log. Then verify: CE(p,q) = H(p) + KL(p‖q).

nats
Show derivation
CE(p, q) = -∑ pi ln(qi) = -(0.9 × ln(0.5) + 0.1 × ln(0.5))
= -(0.9 + 0.1) × ln(0.5) = -ln(0.5) = ln(2) ≈ 0.693 nats
Verify: H(p) = -(0.9 ln 0.9 + 0.1 ln 0.1) = 0.325 nats
H(p) + KL(p‖q) = 0.325 + 0.368 = 0.693 ✓

Since q = [0.5, 0.5], the cross-entropy simplifies to ln(2) regardless of p. When your model predicts uniform (maximum ignorance), the cross-entropy is always ln(number of classes). Training reduces this toward H(p), the irreducible entropy of the true distribution.

Exercise 8.6: Implement entropy() and klDivergence() Build

Write two functions: entropy(p) computes H(X) in nats (use ln), and klDivergence(p, q) computes KL(p ‖ q) in nats. Handle pi = 0 by treating 0 × ln(0) = 0.

Use Math.log for natural log. Guard against log(0).
Show solution
javascript
function entropy(p) {
  return -p.reduce((s, pi) =>
    s + (pi > 0 ? pi * Math.log(pi) : 0), 0);
}

function klDivergence(p, q) {
  return p.reduce((s, pi, i) =>
    s + (pi > 0 ? pi * Math.log(pi / q[i]) : 0), 0);
}

Chapter 9: Capstone: Sensor Fusion

You're building a positioning system with two sensors: a GPS (noisy but unbiased, σGPS = 5m) and an IMU (low per-step noise σIMU = 0.1m, but drifts over time). Design a 1D Kalman filter to fuse them, and trace five complete predict-update cycles.

Model:
State: position x (1D scalar)
Predict: x̂- = x̂ + u   // u = IMU displacement
P- = P + Q    // Q = IMU noise variance = 0.01 per step
Update: z = GPS reading, R = 25 // (5m)²
K = P- / (P- + R)

Initial conditions:
0 = 0,   P0 = 100 // very uncertain initially
The fundamental sensor fusion trade-off. The GPS is coarse (R=25) but doesn't drift. The IMU is precise per step (Q=0.01) but accumulates error. The Kalman filter combines them: it uses the IMU between GPS readings for smooth tracking, and snaps to GPS periodically to prevent drift. This is exactly how your phone determines its position.
Exercise 9.1: First Update Derive

Starting state: x̂=0, P=100. IMU says you moved u=1.0m. GPS reads z=1.2m. Compute the Kalman gain K and the updated state x̂ after one predict-update cycle.

Predict: x̂- = 0 + 1.0 = 1.0, P- = 100 + 0.01 = 100.01. Update with R=25.

updated position x̂
Show derivation
P- = 100 + 0.01 = 100.01
K = 100.01 / (100.01 + 25) = 100.01 / 125.01 ≈ 0.800
x̂ = 1.0 + 0.800 × (1.2 - 1.0) = 1.0 + 0.160 = 1.160
P = (1 - 0.800) × 100.01 = 0.200 × 100.01 = 20.00

The first GPS reading is powerful: K = 0.8 because our initial uncertainty (100) dwarfs the GPS noise (25). After the first update, P drops from 100 to 20 — an 80% reduction in uncertainty from a single measurement.

Exercise 9.2: Second Cycle Derive

From x̂=1.160, P=20.00. IMU: u=0.8m. GPS: z=2.1m. Compute updated P after the second cycle.

P (variance after update)
Show derivation
Predict: x̂- = 1.160 + 0.8 = 1.960, P- = 20.00 + 0.01 = 20.01
K = 20.01 / (20.01 + 25) = 20.01 / 45.01 ≈ 0.4446
x̂ = 1.960 + 0.4446 × (2.1 - 1.960) = 1.960 + 0.0622 = 2.022
P = (1 - 0.4446) × 20.01 = 0.5554 × 20.01 ≈ 11.12

K dropped from 0.80 to 0.44 — we're now trusting the prediction more relative to GPS. Variance continued to drop: 100 → 20 → 11.1. Each GPS fix reduces uncertainty further.

Exercise 9.3: Steady-State Kalman Gain Derive

At steady state, P converges to a constant Pss. The equation becomes: Pss = (1-Kss) × (Pss + Q) where Kss = (Pss + Q) / (Pss + Q + R). With Q=0.01, R=25, solve for Pss.

This simplifies to the quadratic: Pss² + Q × Pss - Q × R = 0. Substitute values and solve.

Pss
Show derivation
P² + 0.01P - 0.01 × 25 = 0
P² + 0.01P - 0.25 = 0
P = (-0.01 + √(0.0001 + 1)) / 2 = (-0.01 + √1.0001) / 2
≈ (-0.01 + 1.00005) / 2 = 0.99005 / 2 = 0.495

Steady-state variance = 0.495, much less than either sensor alone (GPS: 25, IMU accumulated drift). The steady-state Kalman gain is Kss = (0.495 + 0.01) / (0.495 + 0.01 + 25) ≈ 0.505/25.505 ≈ 0.0198. At steady state, each GPS update only nudges by ~2% — the filter is highly confident.

Exercise 9.4: What If GPS Disappears? Trace
After reaching steady state (P ≈ 0.5), GPS signal is lost. You run on IMU only for 100 steps (Q = 0.01 per step, no updates). What happens to P?
Show reasoning

Without GPS updates, each predict step adds Q to the variance: Pt = Pt-1 + Q. After 100 steps: P = 0.5 + 100 × 0.01 = 1.5. Uncertainty grows linearly with time — this is IMU drift. After 10,000 steps it would be 100.5. Without occasional GPS corrections, the estimate degrades. This is why real navigation systems have "GPS denied" alerts: the filter knows its own uncertainty is growing.

Exercise 9.5: Arrange the Sensor Fusion Pipeline Design

Put these real-world sensor fusion steps in order for a phone navigation system.

?
?
?
?
?
Initialize state & covariance IMU predict (100 Hz) GPS update (1 Hz) Compute Kalman gain Output fused position
Show explanation

The correct order is: InitializeIMU predictCompute Kalman gainGPS updateOutput fused position.

In practice, the IMU runs at ~100 Hz (predict 100x per second), while GPS arrives at ~1 Hz. Between GPS updates, you run 100 predict steps. When a GPS measurement arrives, you compute K and update. The output is the fused estimate after each step (predict-only between GPS, full update when GPS is available).

Exercise 9.6: Full 5-Step Fusion Build

Implement a function that runs a 1D Kalman filter for n steps. Takes initial state/variance, process noise Q, measurement noise R, and arrays of IMU displacements and GPS readings. Returns the final state and variance.

Loop through steps, predict then update each time.
Show solution
javascript
function runKalman1D(x0, P0, Q, R, imuDisplacements, gpsReadings) {
  let x = x0, p = P0;
  for (let i = 0; i < imuDisplacements.length; i++) {
    // Predict
    x = x + imuDisplacements[i];
    p = p + Q;
    // Update
    const K = p / (p + R);
    x = x + K * (gpsReadings[i] - x);
    p = (1 - K) * p;
  }
  return { x, p };
}
The proof of work. If you completed every exercise in this workbook — derived Bayes from scratch, computed MLEs, traced Kalman updates by hand, implemented HMM algorithms, and designed a sensor fusion pipeline — you understand the probabilistic foundations that underpin modern ML. Every loss function, every regularizer, every attention mechanism has Bayesian roots. "Probability theory is nothing but common sense reduced to calculation." — Laplace

Related Lessons

TopicLesson
Bayes filter fundamentalsBayes Filter — From Absolute Zero
Kalman filter deep diveKalman Filter — From Absolute Zero
Hidden Markov modelsHMM — From Absolute Zero
Bayesian estimationBayesian Estimation — From Absolute Zero
Extended Kalman filterEKF — From Absolute Zero