How to combine prior knowledge with observed data to form optimal estimates. The mathematical foundation behind Kalman filters, machine learning, and scientific inference.
Estimation is the art of inferring unknown quantities from noisy data. You flip a coin 10 times and see 7 heads — what's the true bias? You measure a patient's temperature with a noisy thermometer — what's the true temperature? You observe stock returns over a year — what's the true expected return?
In all these cases, there's a hidden quantity (the parameter or state) that you cannot directly observe, and noisy data that gives you clues. Bayesian estimation provides a principled framework for combining prior knowledge with data to form the best possible estimate.
The teal line is the true value. The red dots are noisy measurements. Your job: estimate the teal line from only the red dots.
Bayesian estimation has three ingredients. The prior p(θ) encodes what you believed before seeing data. The likelihood p(data | θ) says how probable the observed data is for each possible value of θ. The posterior p(θ | data) is your updated belief after seeing the data.
This is Bayes' theorem: p(θ | data) = p(data | θ) p(θ) / p(data). The denominator p(data) is just a normalizing constant. The key insight: the posterior is proportional to the likelihood times the prior. Data and prior beliefs get multiplied together.
Drag the sliders to move the prior and likelihood. Watch how the posterior (green) shifts between them.
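The multiply-then-normalize recipe can be sketched numerically on a grid. A minimal example, with hypothetical numbers (7 heads in 10 flips, flat prior over the coin's bias θ):

```python
import numpy as np

# Hypothetical coin example: estimate the bias theta after 7 heads in 10 flips.
theta = np.linspace(0.001, 0.999, 999)      # grid over possible biases
prior = np.ones_like(theta)                 # flat prior p(theta)
likelihood = theta**7 * (1 - theta)**3      # Binomial likelihood (up to a constant)

unnormalized = prior * likelihood           # Bayes: posterior ∝ likelihood × prior
dtheta = theta[1] - theta[0]
posterior = unnormalized / (unnormalized.sum() * dtheta)  # normalize to integrate to 1

mode = theta[np.argmax(posterior)]          # peaks at 0.7 with a flat prior
```

Any prior can be dropped into the same three lines; only the `prior` array changes.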
Once you have the posterior distribution, how do you get a single number estimate? One natural choice: pick the value of θ that maximizes the posterior. This is Maximum A Posteriori (MAP) estimation.
MAP finds the peak (mode) of the posterior distribution — the single most probable value given everything you know. For Gaussians, the MAP estimate has a clean closed form: it's the precision-weighted average of the prior mean and the data mean.
The orange dot marks the MAP estimate (posterior mode). Notice how it lies between the prior and likelihood peaks, weighted by their precisions.
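The precision-weighted average for the Gaussian case fits in a few lines. A minimal sketch with illustrative numbers (prior N(0, 4), five measurements with noise variance 1):

```python
# Gaussian prior × Gaussian likelihood: the posterior mean (= mode) is the
# precision-weighted average of the prior mean and the sample mean.
def gaussian_map(mu0, var0, data_mean, var_noise, n):
    prior_prec = 1.0 / var0            # precision = 1 / variance
    data_prec = n / var_noise          # n measurements each contribute 1/var_noise
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * mu0 + data_prec * data_mean)
    return post_mean, post_var

# Illustrative numbers: vague prior at 0, five precise measurements averaging 2.
mean, var = gaussian_map(mu0=0.0, var0=4.0, data_mean=2.0, var_noise=1.0, n=5)
# The estimate lands close to the data mean because the data is far more precise.
```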
Another estimator: instead of the peak, take the mean of the posterior. This is the Minimum Mean Square Error (MMSE) estimator. It minimizes the expected squared error — on average, no other estimator gets closer to the true value.
For symmetric distributions (like Gaussians), MAP and MMSE give the same answer because the mode equals the mean. But for skewed distributions, they differ — sometimes dramatically.
Adjust skewness. For symmetric distributions, MAP = MMSE. As skew increases, they diverge.
| Estimator | Definition | Loss Function | Optimal When |
|---|---|---|---|
| MAP | Posterior mode | 0-1 loss (hit or miss) | You want most probable value |
| MMSE | Posterior mean | Squared error | You want minimum average error |
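A concrete case where the two rows of the table diverge: a Gamma posterior with hypothetical parameters (2, 1), whose mode and mean both have closed forms:

```python
# For a skewed Gamma(shape=a, rate=b) posterior (a > 1), MAP and MMSE differ:
a, b = 2.0, 1.0              # hypothetical posterior parameters
map_est = (a - 1) / b        # posterior mode: 1.0
mmse_est = a / b             # posterior mean: 2.0
# The mean is pulled toward the heavy right tail; the mode is not.
```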
Computing the posterior can be hard — you need to multiply two functions and normalize. But for certain prior-likelihood pairs, the posterior has the same functional form as the prior. These are conjugate priors, and they make Bayesian updates trivially easy.
| Likelihood | Conjugate Prior | Posterior | Use Case |
|---|---|---|---|
| Bernoulli/Binomial | Beta(α, β) | Beta(α+k, β+n-k) | Coin bias estimation |
| Normal (known σ²) | Normal(μ₀, σ₀²) | Normal(μₙ, σₙ²) | Estimating a mean |
| Poisson | Gamma(a, b) | Gamma(a+Σx, b+n) | Rate estimation |
| Multinomial | Dirichlet | Dirichlet | Category probabilities |
Click "Flip" to flip a coin with the hidden bias. The purple Beta distribution is your posterior belief about the coin's bias. Watch it sharpen with data.
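The Beta-Binomial row of the table can be sketched directly — the "update" is just adding counts. Illustrative numbers: a uniform Beta(1, 1) prior and 7 heads in 10 flips:

```python
# Conjugate Beta-Binomial update: a Beta(alpha, beta) prior plus k heads
# in n flips gives a Beta(alpha + k, beta + n - k) posterior.
def beta_update(alpha, beta, k, n):
    return alpha + k, beta + (n - k)

alpha, beta = 1.0, 1.0                           # uniform prior
alpha, beta = beta_update(alpha, beta, k=7, n=10)  # posterior: Beta(8, 4)

post_mean = alpha / (alpha + beta)               # MMSE estimate of the bias
post_mode = (alpha - 1) / (alpha + beta - 2)     # MAP estimate of the bias
```

No integrals anywhere — conjugacy turns the Bayesian update into bookkeeping.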
What if data arrives one observation at a time, rather than all at once? You don't need to redo the entire computation. Thanks to conjugacy and Bayes' rule, you can update sequentially: today's posterior becomes tomorrow's prior.
Each observation (red dot) arrives sequentially. Watch the green posterior sharpen as data accumulates. The blue dashed line is the prior.
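The today's-posterior-becomes-tomorrow's-prior loop, sketched for a Gaussian mean with known noise variance (all numbers illustrative):

```python
import random

def update(mu, var, z, var_noise):
    # One conjugate Normal-Normal step: combine the prior N(mu, var)
    # with a measurement z ~ N(theta, var_noise).
    k = var / (var + var_noise)        # how much to trust the new datum
    return mu + k * (z - mu), (1 - k) * var

random.seed(0)
true_theta, var_noise = 3.0, 0.5
mu, var = 0.0, 10.0                    # broad initial prior
for _ in range(50):
    z = random.gauss(true_theta, var_noise ** 0.5)
    mu, var = update(mu, var, z, var_noise)
# After 50 observations the posterior is tight around the true value.
```

Note that each step uses only the current (mu, var) — the raw data never needs to be stored.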
The Kalman filter is nothing more than recursive Bayesian estimation with Gaussians. The state is the unknown parameter. The prior is the predicted state (Gaussian). The likelihood comes from the measurement model (also Gaussian). The posterior is the updated state estimate (still Gaussian, thanks to conjugacy!).
| Bayesian Estimation | Kalman Filter |
|---|---|
| Prior p(θ) | Predicted state N(x̂⁻, P⁻) |
| Likelihood p(z|θ) | Measurement model N(Hx, R) |
| Posterior p(θ|z) | Updated state N(x̂, P) |
| Posterior mean (MMSE) | x̂ = x̂⁻ + K(z − Hx̂⁻) |
| Posterior precision = prior prec. + likelihood prec. | P⁻¹ = (P⁻)⁻¹ + HᵀR⁻¹H |
The blue Gaussian is the prediction (prior). The red is the measurement (likelihood). The green is the Kalman estimate (posterior) — always the narrowest.
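The correspondence in the table can be sketched as a scalar (1-D, H = 1) Kalman filter; the noise variances and measurements below are illustrative:

```python
# Minimal 1-D Kalman filter as recursive Bayesian estimation.
# q is process noise variance, r is measurement noise variance.
def kalman_step(x, p, z, q, r):
    # Predict: this step's prior (state assumed roughly constant, plus noise q)
    x_pred, p_pred = x, p + q
    # Update: combine prediction (prior) with measurement (likelihood)
    k = p_pred / (p_pred + r)           # Kalman gain: relative precision of the data
    x_new = x_pred + k * (z - x_pred)   # posterior mean (the MMSE estimate)
    p_new = (1 - k) * p_pred            # posterior variance, always < p_pred
    return x_new, p_new

x, p = 0.0, 1.0
for z in [1.2, 0.9, 1.1, 1.0]:          # noisy measurements of a value near 1
    x, p = kalman_step(x, p, z, q=0.01, r=0.25)
```

The update line is exactly the conjugate Normal-Normal rule from the sequential-estimation section, just with the prediction step bolted on.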
Bayesian and frequentist are two philosophies of statistics. They often agree on the numbers but disagree profoundly on what those numbers mean.
| Aspect | Bayesian | Frequentist |
|---|---|---|
| Parameters | Random variables with distributions | Fixed but unknown constants |
| Probability means | Degree of belief | Long-run frequency |
| Prior knowledge | Encoded in the prior | Not used formally |
| Result | Posterior distribution p(θ|data) | Point estimate + confidence interval |
| Small samples | Handled naturally (the prior regularizes) | Can be unreliable |
| Computation | Can be expensive (integrals) | Usually simple formulas |
With few flips, the Bayesian estimate (with uniform prior) is more conservative than MLE. With many flips, they converge. Adjust the number of flips.
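The small-sample contrast is easy to check with the closed forms — a uniform Beta(1, 1) prior gives Laplace's rule of succession (counts below are illustrative):

```python
# Coin bias from k heads in n flips: MLE vs. posterior mean under a
# uniform Beta(1, 1) prior (Laplace's rule of succession).
def mle(k, n):
    return k / n

def bayes_mean(k, n, alpha=1.0, beta=1.0):
    return (alpha + k) / (alpha + beta + n)

# Small sample: 2 heads in 2 flips.
small = (mle(2, 2), bayes_mean(2, 2))            # MLE says 1.0; Bayes says 0.75
# Large sample: 700 heads in 1000 flips.
large = (mle(700, 1000), bayes_mean(700, 1000))  # both land near 0.7
```

With two flips, the MLE confidently declares the coin always lands heads; the prior pulls the Bayesian estimate back toward 0.5. With a thousand flips the prior's two pseudo-counts are negligible.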
Conjugate priors are elegant but limited — most real problems don't have conjugate forms. Modern Bayesian computation uses powerful algorithms to approximate posteriors for arbitrary models.
| Method | Accuracy | Speed | Use Case |
|---|---|---|---|
| Conjugate priors | Exact | Instant | Simple models (coin, Gaussian) |
| MCMC (HMC, NUTS) | Asymptotically exact | Slow | Complex models, small-medium data |
| Variational Inference | Approximate | Fast | Large-scale, deep learning |
| Laplace Approximation | Gaussian approx. | Fast | Well-peaked posteriors |
| Particle Filters | Approximate | Medium | Nonlinear, non-Gaussian sequential |
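As a taste of the MCMC row, here is a minimal random-walk Metropolis sampler. The target is the (actually conjugate) Beta(8, 4) coin posterior, chosen deliberately so the sample mean can be checked against the exact answer; the step size and chain length are illustrative choices:

```python
import math
import random

def log_post(theta, k=7, n=10):
    # Unnormalized log-posterior: Binomial likelihood × uniform prior.
    # MCMC only ever needs the posterior up to a constant.
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)

random.seed(1)
theta, samples = 0.5, []
for i in range(20000):
    prop = theta + random.gauss(0.0, 0.1)   # random-walk proposal
    accept_prob = math.exp(min(0.0, log_post(prop) - log_post(theta)))
    if random.random() < accept_prob:
        theta = prop                        # accept; otherwise keep current theta
    if i >= 2000:                           # discard burn-in
        samples.append(theta)

mcmc_mean = sum(samples) / len(samples)     # should approach the exact mean 8/12
```

The same loop works for any model where `log_post` can be evaluated, conjugate or not — that generality is what the "slow" in the table buys.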
You now understand Bayesian estimation. Every time you update a belief with evidence, you're doing Bayes. Now you know the mathematics behind it.