Every calculation a probabilistic reasoning engineer needs to derive from scratch. Bayes rule, distributions, MLE/MAP, Kalman filters, HMMs, information theory — all solvable in-browser with instant feedback.
A medical test for a rare disease comes back positive. The test is 95% accurate. Should you panic? Most people — including many doctors — get this wrong. The answer depends on something they forget to ask: how rare is the disease?
Bayes' theorem tells us how to update our belief about a hypothesis given new evidence:
Disease prevalence: P(D) = 0.01. Test sensitivity: P(+|D) = 0.95. False positive rate: P(+|¬D) = 0.05. You test positive. What is P(D|+)?
Compute P(+) first, then apply Bayes' theorem. Express as a percentage rounded to one decimal.
Only 16.1%! Despite a 95% accurate test, the low prevalence means 84% of positive results are false positives. Out of every 1000 people: 10 have the disease (9.5 test positive), 990 don't (49.5 test positive). So 9.5 true positives out of 59 total positives = 16.1%.
After the first positive test (P(D|+) = 0.161), you take a second independent test with the same accuracy. It also comes back positive. What is P(D|++) now?
Use the posterior from the first test as the new prior. Apply Bayes again.
Two positive tests bring you to 78.5%. The posterior after each test becomes the prior for the next — this is sequential Bayesian updating. Each piece of evidence shifts the belief, and the updates are multiplicative in the odds ratio.
The odds form is powerful because sequential updates become simple multiplication: just keep multiplying by the likelihood ratio for each new observation. After two positive tests: (1/99) × 19 × 19 = 361/99 ≈ 3.65, which is 3.65/(1+3.65) = 78.5%.
A spam filter sees the word "free" in an email. P(spam) = 0.3. P("free"|spam) = 0.8. P("free"|not spam) = 0.1. What is P(spam|"free")?
The word "free" bumps spam probability from 30% to 77.4%. Naive Bayes spam filters apply this update for every word independently — each word's likelihood ratio multiplies the odds. With enough discriminative words, the posterior becomes very confident.
Using the odds form: starting from prior odds 1:99, likelihood ratio 19 per positive test. After three consecutive positive tests, what is the probability of disease?
Three independent positive tests push probability to 98.6%. This shows the power of sequential updating: even with a low prior, repeated consistent evidence converges to near-certainty. The odds multiply by 19 each time: 0.010 → 0.192 → 3.65 → 69.3.
Write a function that performs a single Bayesian update. Takes prior P(H), likelihood P(E|H), and false-positive P(E|¬H). Returns the posterior P(H|E).
javascript function bayesUpdate(prior, likelihood, falsePositive) { const pEvidence = likelihood * prior + falsePositive * (1 - prior); return (likelihood * prior) / pEvidence; }
Before you can reason about uncertainty, you need to speak the language of distributions. A probability distribution assigns a probability to every possible outcome. Three distributions cover most of what you'll encounter in ML: the Gaussian (continuous, bell-shaped), the Binomial (counting successes), and the Poisson (counting rare events).
For a standard normal distribution (μ=0, σ=1), what is P(X > 2)? Use the 68-95-99.7 rule.
95% of values fall within ±2σ. The remaining 5% is split equally between both tails.
The exact value is 2.28%, so the 68-95-99.7 rule gives a good approximation. This is used everywhere: 2σ events are "unlikely" (p ≈ 0.025 per tail), 3σ events are "rare" (p ≈ 0.0015).
A biased coin has P(heads) = 0.6. In 10 flips, what is P(exactly 7 heads)? Compute C(10,7) × 0.67 × 0.43.
The mean is np = 6, so getting 7 heads is only 1 above average — quite plausible at 21.5%. The binomial peaks near the mean and decays on both sides.
A server averages λ=3 crashes per day. What is P(exactly 5 crashes tomorrow)?
About 10% chance. The Poisson distribution models the count of rare, independent events in a fixed interval. It's the go-to for traffic rates, mutation counts, and photon arrivals.
Technically Binomial(500, p) works, but when n is large and p is small, the Poisson with λ = np is the standard approximation. The Poisson is the limit of the Binomial as n → ∞ and p → 0 with np = λ held constant. Both (B) and (C) are defensible, but Poisson is the canonical choice for "count of rare events in many trials."
For the biased coin with p=0.6 and n=10: compute the mean and standard deviation of the number of heads.
So we expect 6 ± 1.55 heads. Using the Gaussian approximation: ~68% of the time we'd get between 4.45 and 7.55 heads, i.e., 5-7 heads. The Gaussian approximation works well when np and n(1-p) are both ≥ 5.
Write a function that computes the probability density of the Gaussian distribution at a point x, given mean μ and standard deviation σ.
javascript function gaussianPDF(x, mu, sigma) { const coeff = 1 / Math.sqrt(2 * Math.PI * sigma * sigma); const exponent = -(x - mu) * (x - mu) / (2 * sigma * sigma); return coeff * Math.exp(exponent); }
You have data. You want to fit a model. Maximum Likelihood Estimation (MLE) asks: "What parameter values make the observed data most probable?" It's the foundation of nearly all statistical learning — from logistic regression to neural network training.
Data: [2, 4, 6, 8, 10]. What is the MLE estimate for the Gaussian mean μ?
The MLE for the Gaussian mean is simply the sample mean. This follows from taking the derivative of the log-likelihood with respect to μ, setting it to zero, and solving.
Same data: [2, 4, 6, 8, 10]. What is the MLE estimate for σ (standard deviation, not variance)? Remember: MLE uses 1/n, not 1/(n-1).
Note: the unbiased estimate uses 1/(n-1) = 1/4, giving σ² = 10 and σ = 3.16. MLE is slightly biased — it underestimates the true variance. For large n, the difference is negligible.
You flip a coin 10 times and get 7 heads. What is the MLE estimate for p (probability of heads)?
The Bernoulli MLE is simply the fraction of successes: p̂ = k/n. Intuitively obvious, but the derivation proves it's optimal — no other estimate makes the observed data more probable.
Server crash counts over 5 days: [2, 5, 3, 1, 4]. What is the MLE for λ (the Poisson rate parameter)?
The Poisson MLE is the sample mean — just like the Gaussian MLE for μ. This makes sense: λ IS the mean of the Poisson distribution, so the best estimate of the mean is... the sample mean.
Write a function that computes Gaussian MLE estimates for μ and σ from an array of data points. Return an object {mu, sigma}.
javascript function mlEstimate(data) { const n = data.length; const mu = data.reduce((s, x) => s + x, 0) / n; const variance = data.reduce((s, x) => s + (x - mu) ** 2, 0) / n; return { mu, sigma: Math.sqrt(variance) }; }
MLE divides by n, unbiased by (n-1). The ratio is (n-1)/n. At n=2 the MLE is half the truth. At n=100 the MLE is 99% of the truth. At n=1000 it's 99.9%. In ML with millions of data points, the difference is utterly negligible — MLE is the standard choice.
MLE trusts the data completely. But what if you only have 3 coin flips and got 3 heads — does that mean p=1.0? MLE says yes. Your intuition says no. Maximum A Posteriori (MAP) estimation adds a prior belief that regularizes the answer, especially when data is scarce.
You observe 7 heads in 10 flips. Your prior is Beta(2,2) — a mild belief that the coin is roughly fair. What is the MAP estimate for p?
Use the formula: p̂MAP = (k + α - 1) / (n + α + β - 2)
The Beta(2,2) prior pulls the estimate from 0.70 toward 0.50 (the prior mean), landing at 0.667. The prior acts like 2 "phantom" observations of each outcome. Think of α-1 as prior heads and β-1 as prior tails.
Same data (7/10 heads), but now your prior is Beta(20,20) — a strong belief the coin is fair. What is the MAP estimate?
The strong prior (equivalent to 38 phantom observations) overwhelms the 10 real observations, pulling the estimate close to 0.5. With Beta(20,20), you'd need ~40+ real observations to significantly move away from the prior belief of fairness.
MAP maximizes P(data|θ) × P(θ). If P(θ) is constant, the prior drops out and MAP reduces to maximizing P(data|θ) alone — which is exactly MLE. For the Beta prior: Beta(1,1) is uniform on [0,1], so MAP with Beta(1,1) gives p̂ = (k+0)/(n+0) = k/n = MLE.
Also: as n → ∞, the data overwhelms any finite prior, so MAP converges to MLE asymptotically. But they are only exactly equal when the prior is flat.
In neural network training with MSE loss and L2 weight decay λ=0.01, the total loss is L = (1/n)∑(yi - f(xi))² + 0.01 × ∑wj². If this corresponds to MAP with a Gaussian prior N(0, σ2prior) on weights, what is σprior?
Hint: The MAP penalty from a Gaussian prior is (1/(2σ²))∑wj². Set this equal to λ∑wj².
Small λ (weak regularization) = large σ (wide prior = vague belief about weight magnitudes). Large λ (strong regularization) = small σ (tight prior = strong belief weights should be near zero). This is the Bayesian interpretation of every regularizer you've ever used.
With Beta(2,2) prior, how many heads out of n=100 flips would give a MAP estimate of exactly 0.70?
Solve: (k + 1) / (100 + 2) = 0.70
With n=100 data points, you need about 70-71 heads to get a MAP of 0.70. Compare to MLE: you'd need exactly 70 heads (70/100). The prior barely shifts the answer — with enough data, MAP and MLE converge. The 2 phantom observations are negligible against 100 real ones.
You're tracking a drone. Your physics model predicts where it should be, but it's approximate. Your GPS gives a noisy measurement. Neither alone is reliable. The Kalman filter optimally fuses prediction and measurement — and you need to be able to trace every step of the math by hand.
1D constant-velocity tracker: state = [position, velocity]. F = [[1, dt], [0, 1]] with dt=1. Current state x̂ = [10, 2], P = [[4, 0], [0, 1]], Q = [[0.25, 0], [0, 0.25]], no control input. What is the predicted position x̂-[0]?
Position advances by velocity × dt: 10 + 2×1 = 12. Velocity stays at 2 (constant-velocity model). The predict step is just applying the physics model forward.
Same tracker. Compute P-[0][0] — the predicted position variance. F = [[1,1],[0,1]], P = [[4,0],[0,1]], Q = [[0.25,0],[0,0.25]].
P- = FPFT + Q. Compute FP first, then (FP)FT, then add Q.
Position variance grew from 4 to 5.25. The extra 1 came from velocity uncertainty propagating into position uncertainty (the off-diagonal cross term), and 0.25 from process noise. Uncertainty always grows in the predict step — that's the whole reason we need measurements.
Simplified 1D scalar case: predicted variance P- = 5.25, measurement noise R = 2, H = 1 (we directly measure position). What is the Kalman gain K?
K = 0.724: we trust the measurement about 72% and our prediction about 28%. This makes sense — our prediction variance (5.25) is larger than our measurement noise (2), so we lean more toward the measurement.
K is literally a trust slider. R = 0 means perfect measurements, so K = 1 (just use the measurement). R = ∞ means useless measurements, so K = 0 (ignore them, trust the model). This is the single most important intuition for the Kalman filter.
Predicted position x̂- = 12, measurement z = 11.5, Kalman gain K = 0.724. What is the updated position estimate x̂?
The innovation (z - x̂-) = -0.5 says the measurement is below the prediction. K = 0.724 says we trust the measurement more, so we move 72.4% of the way toward the measurement: 12 → 11.638 (closer to 11.5 than to 12).
Put these Kalman filter operations in the correct order for one predict-update cycle.
The correct order is: Predict state → Predict covariance → Kalman gain → Update state → Update covariance.
Predict must happen before update (we need x̂- and P- before computing K). The Kalman gain must come before state/covariance update (K is needed for both). State and covariance updates can happen in either order since they're independent, but conventionally state comes first.
The Kalman gain K is the heart of the filter — it determines how much weight to give measurement versus prediction. In this chapter, you'll compute K for specific numbers, trace what happens to uncertainty after the update, and implement the full 1D Kalman update.
1D Kalman filter, H=1. Predicted variance P- = 4, measurement noise R = 1. What is K?
K = 0.8 means we put 80% weight on the measurement and 20% on the prediction. Our prediction is 4x noisier than our measurement (P-/R = 4), so we lean heavily toward the measurement.
With P- = 4, K = 0.8: what is the posterior variance P after the update?
Variance dropped from 4 to 0.8 — an 80% reduction! Verify with information form: 1/P = 1/P- + 1/R = 1/4 + 1/1 = 0.25 + 1 = 1.25, so P = 1/1.25 = 0.8. Fusing two sources always produces less uncertainty than either alone (0.8 < min(4, 1)).
A GPS has variance 25 (PGPS = 25). A LIDAR has variance 4 (PLIDAR = 4). Using the information form 1/P = 1/P1 + 1/P2, what is the fused variance?
The fused variance (3.45) is less than the better sensor alone (4). The GPS barely helps here (it's so noisy), but it still shaves off 0.55 from the LIDAR's variance. In information form, every additional sensor adds information (1/P), so fusion always helps.
With P- > 0 and R > 0: the numerator P- is strictly positive, the denominator P- + R is strictly larger than P-, so K = P-/(P-+R) is in (0, 1). It approaches 0 as R → ∞ and approaches 1 as R → 0, but never exactly reaches either (since both P- and R are strictly positive, not zero or infinity).
Implement the Kalman filter update step for the 1D scalar case. Given predicted state, predicted variance, measurement, and measurement noise, return the updated state and variance.
javascript function kalmanUpdate1D(xPred, pPred, z, R) { const K = pPred / (pPred + R); // Kalman gain const x = xPred + K * (z - xPred); // updated state const p = (1 - K) * pPred; // updated variance return { x, p }; }
This Kalman update gives correct state estimates but the variance diverges over time (keeps growing). Click the buggy line.
function kalmanUpdate1D(xPred, pPred, z, R) { const K = pPred / (pPred + R); const x = xPred + K * (z - xPred); const p = (1 + K) * pPred; return { x, p }; }
Line 4 is the bug. It says (1 + K) instead of (1 - K). With (1+K), the variance increases at every update instead of shrinking. Since 0 < K < 1, multiplying by (1+K) gives a value between 1x and 2x of P- — guaranteed divergence. The correct formula P = (1-K) × P- always shrinks variance because (1-K) ∈ (0,1).
A Hidden Markov Model (HMM) has states you can't see (hidden) that produce observations you can see. A dishonest casino switches between a fair die and a loaded die — you see the rolls but not which die is in play. The forward algorithm computes P(observations) by summing over all possible hidden state sequences.
Dishonest casino HMM: 2 states (Fair, Loaded), 2 observations (Low=0, High=1). π = [0.6, 0.4]. B = [[0.8, 0.2], [0.3, 0.7]] (Fair die: 80% Low; Loaded: 70% High). First observation: High (o1=1). Compute α1(Fair) and α1(Loaded).
Seeing a "High" observation makes the Loaded state more likely (0.28 vs 0.12), even though the prior favored Fair (0.6 vs 0.4). The emission probability flipped the balance because Loaded strongly favors High rolls.
Transition matrix A = [[0.7, 0.3], [0.4, 0.6]] (Fair stays Fair 70%, Loaded stays Loaded 60%). From α1 = [0.12, 0.28], the second observation is also High (o2=1). Compute α2(Loaded).
α2(j) = [∑i α1(i) × A[i][j]] × B[j][o2]
For completeness: α2(Fair) = (0.12 × 0.7 + 0.28 × 0.4) × 0.2 = (0.084 + 0.112) × 0.2 = 0.196 × 0.2 = 0.0392. Two consecutive High rolls: α(Loaded) = 0.1428 dominates α(Fair) = 0.0392.
Using α2 = [0.0392, 0.1428] from the exercise above, what is P(o1=High, o2=High) — the total probability of the observation sequence?
There is an 18.2% probability of seeing two consecutive High observations under this HMM. This value is used in model training (Baum-Welch) and model comparison (which HMM best explains the data?).
At each timestep t, for each of the N states j, we sum over all N previous states i: ∑i αt(i) × A[i][j]. That's N multiplications for each of N states = N² per timestep. Over T timesteps: O(N²T) = 25 × 200 = 5,000. Compare to brute-force enumeration of 5200 paths — a number with 140 digits.
Implement the forward algorithm. Given initial distribution π, transition matrix A, emission matrix B, and observation sequence, return the total probability P(observations).
javascript function hmmForward(pi, A, B, obs) { const N = pi.length; // Initialize let alpha = pi.map((p, i) => p * B[i][obs[0]]); // Recurse for (let t = 1; t < obs.length; t++) { const newAlpha = []; for (let j = 0; j < N; j++) { let sum = 0; for (let i = 0; i < N; i++) sum += alpha[i] * A[i][j]; newAlpha.push(sum * B[j][obs[t]]); } alpha = newAlpha; } // Terminate return alpha.reduce((s, a) => s + a, 0); }
The forward algorithm gives P(observations). But often you want to know: which state sequence is most likely? That's the Viterbi algorithm. Instead of summing over previous states (forward), Viterbi takes the max — it finds the single best path through the state space.
Same casino HMM: π = [0.6, 0.4], B = [[0.8, 0.2], [0.3, 0.7]]. First observation: High. Compute δ1 for both states. Which state is the most likely starting state?
Viterbi initialization is identical to forward initialization. The divergence happens at the recursion step (max vs sum). Loaded is the most likely starting state given a High observation.
From δ1 = [0.12, 0.28], A = [[0.7, 0.3], [0.4, 0.6]], second observation: High. Compute δ2(Loaded) and the backpointer ψ2(Loaded).
δ2(Loaded) = max(0.12×0.3, 0.28×0.6) × B[Loaded][High]
The best path to Loaded at t=2 came from Loaded at t=1 (backpointer = Loaded). So the Viterbi path so far is [Loaded, Loaded]. Two High observations in a row — the loaded die is the most plausible explanation.
Best final state: argmax(δ3) = Fair (0.0376 > 0.0212). Backtrace: ψ3(Fair) = Loaded (0.04704 > 0.02744). ψ2(Loaded) = Loaded. So the Viterbi path is [Loaded, Loaded, Fair]. The casino used the loaded die for the two High rolls, then switched to the fair die for the Low roll. Intuitive!
This Viterbi implementation gives P(observations) instead of the best-path probability. Click the buggy line.
function viterbi(pi, A, B, obs) { const N = pi.length; let delta = pi.map((p, i) => p * B[i][obs[0]]); for (let t = 1; t < obs.length; t++) { const newDelta = []; for (let j = 0; j < N; j++) { let val = 0; for (let i = 0; i < N; i++) val += delta[i] * A[i][j]; newDelta.push(val * B[j][obs[t]]); } delta = newDelta; } return Math.max(...delta); }
Line 8 is the bug. It uses val += delta[i] * A[i][j] (summing — that's the forward algorithm!). It should use val = Math.max(val, delta[i] * A[i][j]) to find the maximum. The crucial difference between Forward and Viterbi is sum vs max at the recursion step.
Log-domain Viterbi: log δt+1(j) = maxi[log δt(i) + log A[i][j]] + log B[j][ot+1]. Products become sums, max of products becomes max of sums. Values stay in a numerically stable range (e.g., -500 to 0 instead of 10-217 to 100). This is why log-probabilities are the default in all production HMM implementations.
How surprised should you be by an outcome? How much information does a measurement carry? Information theory quantifies uncertainty with entropy, measures the distance between distributions with KL divergence, and connects directly to the cross-entropy loss you use in every classifier.
Compute the entropy (in bits) of a fair coin: P(H) = P(T) = 0.5.
A fair coin has maximum entropy for a binary variable: 1 bit. This means you need exactly 1 yes/no question to determine the outcome. It's the most uncertain a 2-outcome distribution can be.
Compute the entropy (in bits) of a biased coin: P(H) = 0.9, P(T) = 0.1.
Less than half a bit — much less uncertain than a fair coin. When outcomes are predictable (p near 0 or 1), entropy drops toward 0. Maximum entropy (1 bit) occurs at p = 0.5.
Compute KL(p ‖ q) where p = [0.9, 0.1] and q = [0.5, 0.5]. Use natural log (ln).
KL divergence measures how much information is lost when using q to approximate p. It's always ≥ 0 and equals 0 only when p = q.
Now compute KL(q ‖ p) with the same p = [0.9, 0.1] and q = [0.5, 0.5]. Is it equal to KL(p ‖ q)?
KL(p‖q) = 0.368 but KL(q‖p) = 0.511 — they're different! KL divergence is not symmetric. This is why it's called a "divergence" not a "distance." The direction matters: KL(p‖q) penalizes places where p is high but q is low.
Compute the cross-entropy CE(p, q) with p = [0.9, 0.1], q = [0.5, 0.5] using natural log. Then verify: CE(p,q) = H(p) + KL(p‖q).
Since q = [0.5, 0.5], the cross-entropy simplifies to ln(2) regardless of p. When your model predicts uniform (maximum ignorance), the cross-entropy is always ln(number of classes). Training reduces this toward H(p), the irreducible entropy of the true distribution.
Write two functions: entropy(p) computes H(X) in nats (use ln), and klDivergence(p, q) computes KL(p ‖ q) in nats. Handle pi = 0 by treating 0 × ln(0) = 0.
javascript function entropy(p) { return -p.reduce((s, pi) => s + (pi > 0 ? pi * Math.log(pi) : 0), 0); } function klDivergence(p, q) { return p.reduce((s, pi, i) => s + (pi > 0 ? pi * Math.log(pi / q[i]) : 0), 0); }
You're building a positioning system with two sensors: a GPS (noisy but unbiased, σGPS = 5m) and an IMU (low per-step noise σIMU = 0.1m, but drifts over time). Design a 1D Kalman filter to fuse them, and trace five complete predict-update cycles.
Starting state: x̂=0, P=100. IMU says you moved u=1.0m. GPS reads z=1.2m. Compute the Kalman gain K and the updated state x̂ after one predict-update cycle.
Predict: x̂- = 0 + 1.0 = 1.0, P- = 100 + 0.01 = 100.01. Update with R=25.
The first GPS reading is powerful: K = 0.8 because our initial uncertainty (100) dwarfs the GPS noise (25). After the first update, P drops from 100 to 20 — an 80% reduction in uncertainty from a single measurement.
From x̂=1.160, P=20.00. IMU: u=0.8m. GPS: z=2.1m. Compute updated P after the second cycle.
K dropped from 0.80 to 0.44 — we're now trusting the prediction more relative to GPS. Variance continued to drop: 100 → 20 → 11.1. Each GPS fix reduces uncertainty further.
At steady state, P converges to a constant Pss. The equation becomes: Pss = (1-Kss) × (Pss + Q) where Kss = (Pss + Q) / (Pss + Q + R). With Q=0.01, R=25, solve for Pss.
This simplifies to the quadratic: Pss² + Q × Pss - Q × R = 0. Substitute values and solve.
Steady-state variance = 0.495, much less than either sensor alone (GPS: 25, IMU accumulated drift). The steady-state Kalman gain is Kss = (0.495 + 0.01) / (0.495 + 0.01 + 25) ≈ 0.505/25.505 ≈ 0.0198. At steady state, each GPS update only nudges by ~2% — the filter is highly confident.
Without GPS updates, each predict step adds Q to the variance: Pt = Pt-1 + Q. After 100 steps: P = 0.5 + 100 × 0.01 = 1.5. Uncertainty grows linearly with time — this is IMU drift. After 10,000 steps it would be 100.5. Without occasional GPS corrections, the estimate degrades. This is why real navigation systems have "GPS denied" alerts: the filter knows its own uncertainty is growing.
Put these real-world sensor fusion steps in order for a phone navigation system.
The correct order is: Initialize → IMU predict → Compute Kalman gain → GPS update → Output fused position.
In practice, the IMU runs at ~100 Hz (predict 100x per second), while GPS arrives at ~1 Hz. Between GPS updates, you run 100 predict steps. When a GPS measurement arrives, you compute K and update. The output is the fused estimate after each step (predict-only between GPS, full update when GPS is available).
Implement a function that runs a 1D Kalman filter for n steps. Takes initial state/variance, process noise Q, measurement noise R, and arrays of IMU displacements and GPS readings. Returns the final state and variance.
javascript function runKalman1D(x0, P0, Q, R, imuDisplacements, gpsReadings) { let x = x0, p = P0; for (let i = 0; i < imuDisplacements.length; i++) { // Predict x = x + imuDisplacements[i]; p = p + Q; // Update const K = p / (p + R); x = x + K * (gpsReadings[i] - x); p = (1 - K) * p; } return { x, p }; }
| Topic | Lesson |
|---|---|
| Bayes filter fundamentals | Bayes Filter — From Absolute Zero |
| Kalman filter deep dive | Kalman Filter — From Absolute Zero |
| Hidden Markov models | HMM — From Absolute Zero |
| Bayesian estimation | Bayesian Estimation — From Absolute Zero |
| Extended Kalman filter | EKF — From Absolute Zero |