How to combine prior knowledge with observed data to form optimal estimates. The mathematical foundation behind Kalman filters, machine learning, and scientific inference.
Estimation is the art of inferring unknown quantities from noisy data. You flip a coin 10 times and see 7 heads — what's the true bias? You measure a patient's temperature with a noisy thermometer — what's the true temperature? You observe stock returns over a year — what's the true expected return?
In all these cases, there's a hidden quantity (the parameter or state) that you cannot directly observe, and noisy data that gives you clues. Bayesian estimation provides a principled framework for combining prior knowledge with data to form the best possible estimate.
The teal line is the true value. The red dots are noisy measurements. Your job: estimate the teal line from only the red dots.
Bayesian estimation has three ingredients. The prior p(θ) encodes what you believed before seeing data. The likelihood p(data | θ) says how probable the observed data is for each possible value of θ. The posterior p(θ | data) is your updated belief after seeing the data.
This is Bayes' theorem. The denominator p(data) = ∫ p(data|θ) · p(θ) dθ is just a normalizing constant (it ensures the posterior integrates to 1). In practice, you almost never compute it directly — either it cancels out (conjugate priors) or you approximate it (MCMC). The key insight: the posterior is proportional to the likelihood times the prior. Data and prior beliefs get multiplied together.
For Gaussians, the compromise has a clean formula. If the prior is N(μ0, σ0²) and the likelihood centers at x with variance σ², the posterior mean is:
This is a precision-weighted average. Precision = 1/variance. The source with higher precision (lower variance) gets more weight. If σ0 = 1 and σ = 2: the prior has precision 1, the data has precision 0.25, so the prior gets 4 times the weight. The posterior variance is always smaller than either input variance — combining information always reduces uncertainty.
Drag the sliders to move the prior and likelihood. Watch how the posterior (green) shifts between them.
Bayes' theorem says p(θ|data) = p(data|θ) · p(θ) / p(data). The numerator is just the likelihood times the prior — a product of two functions. But dividing by p(data) = ∫ p(data|θ) · p(θ) dθ magically makes the result integrate to 1.
Your task: Prove that ∫ p(θ|data) dθ = 1 by expanding the definition and showing the cancellation.
Full derivation:
∫ p(θ|data) dθ = ∫ p(data|θ) · p(θ) / p(data) dθ
Since p(data) is a constant w.r.t. θ:
= [1/p(data)] · ∫ p(data|θ) · p(θ) dθ
But by definition, p(data) = ∫ p(data|θ) · p(θ) dθ (marginalization over θ).
= [1/p(data)] · p(data) = 1. □
The key insight: The evidence p(data) isn't some mysterious quantity — it's literally defined as "whatever makes the posterior integrate to 1." That's why we can often ignore it and write p(θ|data) ∝ p(data|θ) · p(θ). The proportionality constant is fixed by the requirement that probabilities sum to 1.
The posterior precision = prior precision + likelihood precision. Precision is 1/variance, so adding precisions means the posterior variance is ALWAYS smaller than either input: 1/(1/σ1² + 1/σ2²) < min(σ1², σ2²). Multiplying two Gaussians and renormalizing creates a distribution that's narrower than either factor because you're combining INDEPENDENT sources of information. Each source rules out different parts of the parameter space. The regions that survive are only those consistent with BOTH sources — a smaller set than either alone.
Once you have the posterior distribution, how do you get a single number estimate? One natural choice: pick the value of θ that maximizes the posterior. This is Maximum A Posteriori (MAP) estimation.
MAP finds the peak (mode) of the posterior distribution. It's like the most likely value given everything you know. For Gaussians, the MAP estimate has a clean formula — it's the precision-weighted average of the prior mean and the data.
The orange dot marks the MAP estimate (posterior mode). Notice how it lies between the prior and likelihood peaks, weighted by their precisions.
Another estimator: instead of the peak, take the mean of the posterior. This is the Minimum Mean Square Error (MMSE) estimator. It minimizes the expected squared error — on average, no other estimator gets closer to the true value.
For symmetric distributions (like Gaussians), MAP and MMSE give the same answer because the mode equals the mean. But for skewed distributions, they differ — sometimes dramatically.
Adjust skewness. For symmetric distributions, MAP = MMSE. As skew increases, they diverge.
| Estimator | Definition | Loss Function | Optimal When |
|---|---|---|---|
| MAP | Posterior mode | 0-1 loss (hit or miss) | You want most probable value |
| MMSE | Posterior mean | Squared error | You want minimum average error |
| Median | Posterior median | Absolute error | You want robustness to outliers |
For Gaussian posteriors, all three estimates are identical (mode = mean = median). This is why Gaussians are so convenient — you don't need to think about which estimator to use. For non-Gaussian posteriors, the choice matters. In practice, MMSE (the posterior mean) is the most commonly used because squared error is the default loss function in engineering and science.
Given a posterior Beta(α, β) with PDF p(θ) ∝ θα−1(1−θ)β−1, derive both the MAP (mode) and the MMSE (mean) in closed form.
Your task: (1) Find the mode by taking the derivative, setting it to zero, and solving. (2) State the mean (use the known formula). (3) Show that they're equal only when α = β.
MAP (mode):
d/dθ [(α−1)lnθ + (β−1)ln(1−θ)] = 0
(α−1)/θ = (β−1)/(1−θ)
(α−1)(1−θ) = (β−1)θ
α−1 = θ(α−1 + β−1) = θ(α+β−2)
θ̂MAP = (α−1)/(α+β−2)
MMSE (mean): For Beta(α,β), the mean is E[θ] = α/(α+β).
When are they equal? Set (α−1)/(α+β−2) = α/(α+β). Cross-multiply: (α−1)(α+β) = α(α+β−2). Expand: α²+αβ−α−β = α²+αβ−2α. Simplify: −α−β = −2α, so α = β. They're equal exactly when the distribution is symmetric.
The key insight: The mode and mean of a Beta distribution are pulled in opposite directions by skewness. With α > β, the mode is farther right than the mean because the long left tail pulls the mean leftward. The gap (α−1)/(α+β−2) − α/(α+β) vanishes only at symmetry or in the large-sample limit where both approach the true value.
Computing the posterior can be hard — you need to multiply two functions and normalize. But for certain prior-likelihood pairs, the posterior has the same functional form as the prior. These are conjugate priors, and they make Bayesian updates trivially easy.
| Likelihood | Conjugate Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli/Binomial | Beta(α, β) | Beta(α+k, β+n-k) | Coin bias estimation |
| Normal (known σ) | Normal(μ0, σ0) | Normal(μn, σn) | Estimating a mean |
| Poisson | Gamma(a, b) | Gamma(a+Σx, b+n) | Rate estimation |
| Multinomial | Dirichlet | Dirichlet | Category probabilities |
Python import numpy as np def normal_normal_update(mu0, sig0, sig, observations): """Sequential Normal-Normal conjugate update.""" mu, var = mu0, sig0**2 for x in observations: prec_prior = 1 / var prec_obs = 1 / sig**2 var = 1 / (prec_prior + prec_obs) mu = var * (prec_prior * mu + prec_obs * x) return mu, np.sqrt(var) # Example: prior N(0, 10), noise sig=1, 5 observations obs = [2.8, 3.2, 2.9, 3.1, 3.0] mu_post, sig_post = normal_normal_update(0, 10, 1, obs) # mu_post ≈ 2.99, sig_post ≈ 0.45
Click "Flip" to flip a coin with the hidden bias. The purple Beta distribution is your posterior belief about the coin's bias. Watch it sharpen with data.
Prior: Beta(α, β) → p(θ) ∝ θα−1(1−θ)β−1. Likelihood: Binomial(n, k) → p(k|θ) ∝ θk(1−θ)n−k.
Your task: Multiply them together and show the posterior is Beta(α+k, β+n−k).
Full derivation:
p(θ|data) ∝ θk(1−θ)n−k · θα−1(1−θ)β−1
= θk + α − 1(1−θ)(n−k) + β − 1
= θ(α+k) − 1(1−θ)(β+n−k) − 1
This is exactly the kernel of a Beta(α+k, β+n−k) distribution! The normalization constant (Beta function) is determined by the requirement that the posterior integrates to 1.
The key insight: Conjugacy works because the prior and likelihood have the SAME functional form in θ. When you multiply θa · θb = θa+b, the exponents simply add. The hyperparameters α and β can be interpreted as "pseudo-observations": α−1 prior heads and β−1 prior tails. Real data just adds to the count. This is why Bayesian updating is so natural for counting problems.
python def beta_binomial_update(alpha_prior, beta_prior, observations): history = [] a, b = alpha_prior, beta_prior for obs in observations: # Update: head (1) increments alpha, tail (0) increments beta if obs == 1: a += 1 else: b += 1 # MAP: mode of Beta(a, b) if a + b - 2 > 0: map_est = (a - 1) / (a + b - 2) else: map_est = 0.5 # undefined for uniform # MMSE: mean of Beta(a, b) mmse_est = a / (a + b) history.append({ 'alpha': a, 'beta': b, 'map': map_est, 'mmse': mmse_est }) return history
What if data arrives one observation at a time, rather than all at once? You don't need to redo the entire computation. Thanks to conjugacy and Bayes' rule, you can update sequentially: today's posterior becomes tomorrow's prior.
This is not just a mathematical curiosity — it's the reason online learning works. Each update uses constant memory (just the current posterior parameters) and constant time (one multiply and one add for Normal-Normal). You never need to store or revisit old data. This sequential update IS the Kalman filter when the model is linear-Gaussian with a time-varying state.
Python — Verify batch = sequential import numpy as np # Sequential update mu, var = 0.0, 4.0 # prior N(0, 2) → var=4 sig2 = 1.0 # observation noise variance for x in [2.5, 3.1, 2.8]: new_prec = 1/var + 1/sig2 mu = (1/var * mu + 1/sig2 * x) / new_prec var = 1 / new_prec print(f"Sequential: μ={mu:.6f}, σ={np.sqrt(var):.6f}") # Batch update n, xbar = 3, np.mean([2.5, 3.1, 2.8]) var_batch = 1 / (1/4.0 + n/sig2) mu_batch = var_batch * (0/4.0 + n * xbar / sig2) print(f"Batch: μ={mu_batch:.6f}, σ={np.sqrt(var_batch):.6f}") # Both print identical values!
Each observation (red dot) arrives sequentially. Watch the green posterior sharpen as data accumulates. The blue dashed line is the prior.
The Kalman filter is nothing more than recursive Bayesian estimation with Gaussians. The state is the unknown parameter. The prior is the predicted state (Gaussian). The likelihood comes from the measurement model (also Gaussian). The posterior is the updated state estimate (still Gaussian, thanks to conjugacy!).
| Bayesian Estimation | Kalman Filter |
|---|---|
| Prior p(θ) | Predicted state N(x̂¯, P¯) |
| Likelihood p(z|θ) | Measurement model N(Hx, R) |
| Posterior p(θ|z) | Updated state N(x̂, P) |
| Posterior mean (MMSE) | x̂ = x̂¯ + K(z − Hx̂¯) |
| Posterior precision = prior prec. + likelihood prec. | P¯¹ = P¯¯¹ + H¹R¯¹H |
The blue Gaussian is the prediction (prior). The red is the measurement (likelihood). The green is the Kalman estimate (posterior) — always the narrowest.
These are the SAME equation. The Kalman gain K is just the ratio of prediction uncertainty to total uncertainty — exactly the precision-weighted average from Bayesian estimation, but generalized to matrices. When you learn the Kalman filter, you already know 90% of it from this lesson. The "predict" step propagates the Gaussian prior forward in time; the "update" step applies Bayes' theorem with the measurement as likelihood.
Attention in transformers also computes a weighted average where weights depend on "relevance" (similarity). Could you frame attention as precision-weighted fusion? What would "precision" mean for tokens?
Bayesian and frequentist are two philosophies of statistics. They often agree on the numbers but disagree profoundly on what those numbers mean.
| Aspect | Bayesian | Frequentist |
|---|---|---|
| Parameters | Random variables with distributions | Fixed but unknown constants |
| Probability means | Degree of belief | Long-run frequency |
| Prior knowledge | Encoded in the prior | Not used formally |
| Result | Posterior distribution p(θ|data) | Point estimate + confidence interval |
| Small samples | Naturally handles (prior regularizes) | Can be unreliable |
| Computation | Can be expensive (integrals) | Usually simple formulas |
With few flips, the Bayesian estimate (with uniform prior) is more conservative than MLE. With many flips, they converge. Adjust the number of flips.
Conjugate priors are elegant but limited — most real problems don't have conjugate forms. Modern Bayesian computation uses powerful algorithms to approximate posteriors for arbitrary models.
| Method | Accuracy | Speed | Use Case |
|---|---|---|---|
| Conjugate priors | Exact | Instant | Simple models (coin, Gaussian) |
| MCMC (HMC, NUTS) | Asymptotically exact | Slow | Complex models, small-medium data |
| Variational Inference | Approximate | Fast | Large-scale, deep learning |
| Laplace Approximation | Gaussian approx. | Fast | Well-peaked posteriors |
| Particle Filters | Approximate | Medium | Nonlinear, non-Gaussian sequential |
The curse of dimensionality. Even with MCMC, high-dimensional parameter spaces are brutal. A small neural network might have 10,000 parameters. MCMC needs to explore a 10,000-dimensional posterior — the volume of that space is unimaginably large. This is why Bayesian neural networks remain expensive: the posterior is complex, multimodal, and high-dimensional. Variational inference trades accuracy for speed by fitting a simple distribution (usually a diagonal Gaussian) to the true posterior.
| Scenario | Parameters | Conjugate? | Method |
|---|---|---|---|
| Coin bias | 1 | Yes (Beta) | Closed-form update |
| Sensor mean | 1 | Yes (Normal) | Closed-form / Kalman |
| Logistic regression | 10–100 | No | MCMC or Laplace approx. |
| Gaussian process | ~1,000 | Partially | Sparse VI, inducing points |
| Neural network | 10K–1B | No | VI, MC Dropout, ensembles |
You now understand Bayesian estimation. Every time you update a belief with evidence, you're doing Bayes. Now you know the mathematics behind it.
Industry standard (VWO, Optimizely, Google):
Prior: Beta(1, 1) or weakly informative Beta(32, 968) centered at historical 3.2% CVR. The weak prior (equivalent to ~1000 pseudo-observations) prevents extreme early estimates but washes out after ~5000 real observations. Too-strong priors make tests insensitive to real differences.
Stopping rule: "Expected loss" framework. Stop when E[loss | choosing B] < ε (e.g., < 0.01% conversion rate). This naturally handles peeking because it's a PROPERTY of the current posterior, not a p-value that gets inflated. Alternatively: P(B > A | data) > 0.95 AND at least 1000 samples per arm (the minimum prevents lucky streaks).
Multi-variant: Draw 10,000 samples from each variant's Beta posterior. For each draw, record which variant was highest. P(i is best) = fraction of draws where i won. Cost: O(K × N_samples), trivially fast for <10 variants.
Minimum sample: With Beta(1,1) prior: ~100 conversions per variant before peeking meaningfully (prior influence <2%). First meaningful peek at ~3000 visitors per variant (given 3.2% CVR = ~96 conversions). Daily peeking with this minimum is safe because the Bayesian approach doesn't accumulate Type I error.
Thompson Sampling is literally "sample from the posterior, act as if the sample is truth." Arms with high uncertainty have wide posteriors → their samples occasionally beat the current best → they get explored. Arms with tight posteriors centered low never get sampled high → they get ignored. The Bayesian posterior IS the exploration strategy. No epsilon-greedy, no UCB formula needed — just sample and act.
The Bayes filter (for robot localization) also maintains a posterior over states. Could you design a "Thompson Sampling" version of robot navigation that samples a position from the belief and acts as if it's there? What would go wrong?