The Complete Beginner's Path

Understand Bayesian Estimation

How to combine prior knowledge with observed data to form optimal estimates. The mathematical foundation behind Kalman filters, machine learning, and scientific inference.

Prerequisites: basic probability and an intuition for uncertainty. No measure theory required.
9 Chapters · 7+ Simulations · 0 Assumed Knowledge

Chapter 0: Parameters vs States

Estimation is the art of inferring unknown quantities from noisy data. You flip a coin 10 times and see 7 heads — what's the true bias? You measure a patient's temperature with a noisy thermometer — what's the true temperature? You observe stock returns over a year — what's the true expected return?

In all these cases, there's a hidden quantity (the parameter or state) that you cannot directly observe, and noisy data that gives you clues. Bayesian estimation provides a principled framework for combining prior knowledge with data to form the best possible estimate.

The core question: Given what I knew before (my prior) and what I just observed (my data), what should I now believe about the unknown quantity? Bayesian estimation answers this question using probability theory.
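To make the setup concrete, here is a minimal simulation of the noisy-thermometer scenario: a hidden true value observed through Gaussian noise, recovered with a simple sample mean. The specific numbers (true value 3.0, noise σ = 1.0) are illustrative choices, not from the text.

```python
import random

random.seed(0)

true_value = 3.0    # the hidden quantity we want to recover
noise_sigma = 1.0   # measurement noise level

# Each observation is the true value plus Gaussian noise.
observations = [random.gauss(true_value, noise_sigma) for _ in range(100)]

# The sample mean is a simple estimator; it tightens as data accumulates.
estimate = sum(observations) / len(observations)
print(round(estimate, 2))  # close to 3.0, but not exact
```

The estimate is never exactly 3.0 from finite noisy data; the rest of the chapters are about doing better than a plain average by also using prior knowledge.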
Noisy Observations of a Hidden Value

The teal line is the true value. The red dots are noisy measurements. Your job: estimate the teal line from only the red dots.

Noise level: 1.0
Check: What is the goal of estimation?

Chapter 1: Prior, Likelihood, Posterior

Bayesian estimation has three ingredients. The prior p(θ) encodes what you believed before seeing data. The likelihood p(data | θ) says how probable the observed data is for each possible value of θ. The posterior p(θ | data) is your updated belief after seeing the data.

p(θ | data) = p(data | θ) · p(θ) / p(data)

This is Bayes' theorem. The denominator p(data) is just a normalizing constant. The key insight: the posterior is proportional to the likelihood times the prior. Data and prior beliefs get multiplied together.

Key insight: The posterior is a compromise between the prior and the data. Strong prior + weak data = posterior close to prior. Weak prior + strong data = posterior close to the likelihood. As you gather more data, the prior matters less and the data dominates.
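The multiply-and-normalize recipe can be sketched numerically on a grid. This toy sketch uses the same illustrative numbers as the sliders below (prior N(−1.0, 1.5²), likelihood centered at 1.5 with σ = 1.0):

```python
import math

def gauss(x, mu, sigma):
    """Gaussian probability density at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Grid of candidate values for theta
grid = [i * 0.01 for i in range(-500, 501)]

prior      = [gauss(t, -1.0, 1.5) for t in grid]  # belief before data
likelihood = [gauss(t,  1.5, 1.0) for t in grid]  # how well each theta explains the data

# Posterior ∝ prior × likelihood, then normalize so the weights sum to 1
unnorm = [p * l for p, l in zip(prior, likelihood)]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

# The posterior mean lands between -1.0 and 1.5, closer to the sharper likelihood
post_mean = sum(t * w for t, w in zip(grid, posterior))
print(round(post_mean, 2))
```

Normalizing by the grid sum stands in for dividing by p(data): it only rescales, so it never moves the peak or the balance between prior and likelihood.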
Interactive: Prior × Likelihood = Posterior

Drag the sliders to move the prior and likelihood. Watch how the posterior (green) shifts between them.

Prior mean: −1.0
Prior σ: 1.5
Likelihood mean: 1.5
Likelihood σ: 1.0
Check: What does the posterior represent?

Chapter 2: MAP Estimation

Once you have the posterior distribution, how do you get a single number estimate? One natural choice: pick the value of θ that maximizes the posterior. This is Maximum A Posteriori (MAP) estimation.

θ̂_MAP = argmax_θ p(θ | data) = argmax_θ p(data | θ) · p(θ)

MAP finds the peak (mode) of the posterior distribution: the single most probable value of θ given everything you know. For Gaussians, the MAP estimate has a clean closed form: the precision-weighted average of the prior mean and the data mean.

MAP vs MLE: Maximum Likelihood Estimation (MLE) ignores the prior entirely — it just maximizes p(data | θ). MAP includes the prior, which acts as a regularizer. With a Gaussian prior, MAP is equivalent to L2 regularization in machine learning.
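For the Gaussian case, the precision-weighted average mentioned above fits in a few lines. A minimal sketch, using the same illustrative numbers as the sliders (prior mean −1.0, prior σ 1.5; data mean 1.5, data σ 1.0):

```python
def gaussian_map(prior_mean, prior_sigma, data_mean, data_sigma):
    """MAP estimate when prior and likelihood are both Gaussian:
    the precision-weighted average of the two means."""
    w_prior = 1.0 / prior_sigma ** 2  # precision (inverse variance) of the prior
    w_data = 1.0 / data_sigma ** 2    # precision of the likelihood
    return (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)

# The sharper (higher-precision) source gets more weight.
print(round(gaussian_map(-1.0, 1.5, 1.5, 1.0), 3))
```

Setting the prior precision to zero (an infinitely wide prior) recovers the MLE: the estimate becomes the data mean alone.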
MAP: Finding the Posterior Peak

The orange dot marks the MAP estimate (posterior mode). Notice how it lies between the prior and likelihood peaks, weighted by their precisions.

Prior σ: 1.5
Likelihood σ: 1.0
θ̂_MAP = 0.00
Check: What does MAP estimation find?

Chapter 3: MMSE — The Posterior Mean

Another estimator: instead of the peak, take the mean of the posterior. This is the Minimum Mean Square Error (MMSE) estimator. It minimizes the expected squared error — on average, no other estimator gets closer to the true value.

θ̂_MMSE = E[θ | data] = ∫ θ · p(θ | data) dθ

For symmetric distributions (like Gaussians), MAP and MMSE give the same answer because the mode equals the mean. But for skewed distributions, they differ — sometimes dramatically.

When they differ: Imagine a posterior that's skewed right (long tail to the right). The MAP finds the peak on the left side. The MMSE (mean) is pulled rightward by the tail. Which is "better" depends on your loss function: squared error favors MMSE, zero-one loss favors MAP.
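The Gamma distribution is a convenient right-skewed example where both estimators have closed forms. A small sketch (the shape and scale values are illustrative):

```python
# For a Gamma(shape=a, scale=s) posterior (right-skewed for small a):
#   MAP  (mode) = (a - 1) * s, valid for a > 1
#   MMSE (mean) = a * s
a, s = 3.0, 1.0
map_est = (a - 1) * s   # peak of the density, on the left
mmse_est = a * s        # pulled rightward by the long tail
print(map_est, mmse_est)  # 2.0 3.0
```

The gap between the two is exactly one scale unit here: the heavier the right tail, the further the mean sits to the right of the mode.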
MAP vs MMSE on Skewed Posteriors

Adjust skewness. For symmetric distributions, MAP = MMSE. As skew increases, they diverge.

Skewness: 0.0
MAP = 2.00 | MMSE = 2.00
Estimator | Definition | Loss function | Optimal when
MAP | Posterior mode | 0-1 loss (hit or miss) | You want the most probable value
MMSE | Posterior mean | Squared error | You want minimum average error
Check: When does the MMSE estimate differ from the MAP estimate?

Chapter 4: Conjugate Priors

Computing the posterior can be hard — you need to multiply two functions and normalize. But for certain prior-likelihood pairs, the posterior has the same functional form as the prior. These are conjugate priors, and they make Bayesian updates trivially easy.

Likelihood | Conjugate prior | Posterior | Use case
Bernoulli/Binomial | Beta(α, β) | Beta(α+k, β+n−k) | Coin bias estimation
Normal (known σ) | Normal(μ₀, σ₀) | Normal(μₙ, σₙ) | Estimating a mean
Poisson | Gamma(a, b) | Gamma(a+Σx, b+n) | Rate estimation
Multinomial | Dirichlet | Dirichlet | Category probabilities
Why this matters: With conjugate priors, you just update the parameters. Beta(α, β) + k heads in n flips = Beta(α+k, β+n-k). No integrals, no computation. The parameters (α, β) are called hyperparameters and they accumulate evidence like a running tally.
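The running-tally update really is just two additions. A minimal Beta-Binomial sketch, starting from the uniform prior Beta(1, 1) and the 7-heads-in-10-flips example from Chapter 0:

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate update: Beta prior + Binomial data -> Beta posterior."""
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1), then observe 7 heads and 3 tails
a, b = beta_update(1, 1, heads=7, tails=3)

map_bias = (a - 1) / (a + b - 2)  # posterior mode of Beta(a, b)
mean_bias = a / (a + b)           # posterior mean
print(a, b, map_bias, round(mean_bias, 3))  # 8 4 0.7 0.667
```

Note that the MAP estimate under the uniform prior equals the MLE (7/10), while the posterior mean is pulled slightly toward 0.5; this foreshadows the Bayesian-vs-frequentist comparison in Chapter 7.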
Interactive Coin Estimation (Beta-Binomial)

Click "Flip" to flip a coin with the hidden bias. The purple Beta distribution is your posterior belief about the coin's bias. Watch it sharpen with data.

True bias: ??? Heads: 0 Tails: 0 MAP: 0.50
Check: What makes a conjugate prior special?

Chapter 5: Recursive Bayesian Estimation

What if data arrives one observation at a time, rather than all at once? You don't need to redo the entire computation. Thanks to conjugacy and Bayes' rule, you can update sequentially: today's posterior becomes tomorrow's prior.

Start
Prior p(θ) — your initial belief
Observe x₁
Posterior₁ ∝ p(x₁|θ) · Prior
Observe x₂
Posterior₂ ∝ p(x₂|θ) · Posterior₁
↓ ...
Observe xₙ
Posteriorₙ ∝ p(xₙ|θ) · Posteriorₙ₋₁
Beautiful property: The final posterior is identical whether you process data all at once or one at a time. Order doesn't matter either. This makes Bayesian estimation naturally suited to streaming data, sensor processing, and online learning.
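The order-independence property is easy to verify numerically. A minimal Normal-Normal sketch (the prior N(0, 2.0²) matches the demo below; the data values and noise variance are made-up illustrations):

```python
def normal_update(mu, var, x, noise_var):
    """One Bayesian update of a Normal belief with a Normal measurement."""
    k = var / (var + noise_var)  # weight given to the new observation
    return mu + k * (x - mu), (1 - k) * var

mu, var = 0.0, 4.0  # prior: N(0, 2.0^2)
data = [1.2, 0.8, 1.5, 1.1]
for x in data:      # today's posterior becomes tomorrow's prior
    mu, var = normal_update(mu, var, x, noise_var=1.0)

# Batch answer for comparison: precision-weighted combination of prior and all data
prec = 1 / 4.0 + len(data) / 1.0
batch_mu = (0.0 / 4.0 + sum(data) / 1.0) / prec
print(round(mu, 6), round(batch_mu, 6))  # identical
```

Shuffling `data` before the loop leaves the final `mu` and `var` unchanged, which is the order-independence claim in code.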
Sequential Updates: Normal-Normal Model

Each observation (red dot) arrives sequentially. Watch the green posterior sharpen as data accumulates. The blue dashed line is the prior.

n=0 | Prior: μ=0, σ=2.0
Check: In recursive Bayesian estimation, what role does the previous posterior play?

Chapter 6: Connection to the Kalman Filter

The Kalman filter is nothing more than recursive Bayesian estimation with Gaussians. The state is the unknown parameter. The prior is the predicted state (Gaussian). The likelihood comes from the measurement model (also Gaussian). The posterior is the updated state estimate (still Gaussian, thanks to conjugacy!).

Bayesian Estimation | Kalman Filter
Prior p(θ) | Predicted state N(x̂⁻, P⁻)
Likelihood p(z|θ) | Measurement model N(Hx, R)
Posterior p(θ|z) | Updated state N(x̂, P)
Posterior mean (MMSE) | x̂ = x̂⁻ + K(z − Hx̂⁻)
Posterior precision = prior precision + likelihood precision | P⁻¹ = (P⁻)⁻¹ + HᵀR⁻¹H
Key insight: The Kalman gain K is just the ratio of precisions — how much to trust data vs prediction. K = P⁻Hᵀ(HP⁻Hᵀ + R)⁻¹. This is Bayes' rule in matrix form. The "predict" step propagates the prior forward in time; the "update" step applies Bayes' theorem.
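In one dimension (H = 1), the whole measurement update collapses to a few lines. A sketch using the same σ values as the demo below (predict σ 1.5, measure σ 1.0; the predicted mean 0.0 and measurement 1.0 are illustrative):

```python
def kalman_update_1d(x_pred, p_pred, z, r):
    """1D Kalman measurement update = Bayes' rule for two Gaussians.
    x_pred, p_pred: predicted state mean and variance (the prior)
    z, r: measurement and its noise variance (the likelihood)"""
    k = p_pred / (p_pred + r)      # Kalman gain: prior uncertainty vs total
    x = x_pred + k * (z - x_pred)  # posterior mean (the MMSE estimate)
    p = (1 - k) * p_pred           # posterior variance, smaller than either input
    return x, p

x, p = kalman_update_1d(x_pred=0.0, p_pred=1.5 ** 2, z=1.0, r=1.0 ** 2)
print(round(x, 3), round(p, 3))
```

Compare with `gaussian_map` from Chapter 2: the update formula is the same precision-weighted average, just rearranged around the gain k.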
Kalman as Bayesian Update (1D)

The blue Gaussian is the prediction (prior). The red is the measurement (likelihood). The green is the Kalman estimate (posterior) — always the narrowest.

Predict σ: 1.5
Measure σ: 1.0
Check: The Kalman filter is best described as:

Chapter 7: Bayesian vs Frequentist

Bayesian and frequentist are two philosophies of statistics. They often agree on the numbers but disagree profoundly on what those numbers mean.

Aspect | Bayesian | Frequentist
Parameters | Random variables with distributions | Fixed but unknown constants
Probability means | Degree of belief | Long-run frequency
Prior knowledge | Encoded in the prior | Not used formally
Result | Posterior distribution p(θ|data) | Point estimate + confidence interval
Small samples | Naturally handled (prior regularizes) | Can be unreliable
Computation | Can be expensive (integrals) | Usually simple formulas
The honest truth: Neither approach is universally better. Use Bayesian methods when you have genuine prior knowledge, small datasets, or need full uncertainty quantification. Use frequentist methods when you need simplicity, have large datasets (where the prior washes out), or need guaranteed coverage properties.
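The coin demo below can be reproduced with two one-line estimators. For the Bayesian side, one common choice is the posterior mean under a uniform Beta(1, 1) prior (Laplace's rule of succession); whether the demo reports the posterior mean or mode is not specified, so treat this as an illustrative sketch:

```python
def mle_bias(heads, flips):
    """Frequentist point estimate: the raw frequency."""
    return heads / flips

def bayes_bias(heads, flips):
    """Posterior mean with a uniform Beta(1, 1) prior: (k+1)/(n+2)."""
    return (heads + 1) / (flips + 2)

# Few flips: the prior shrinks the estimate toward 0.5
print(mle_bias(4, 5), round(bayes_bias(4, 5), 3))    # 0.8 vs 0.714
# Many flips: the prior washes out and the two converge
print(mle_bias(400, 500), round(bayes_bias(400, 500), 3))
```

With 4 heads in 5 flips, the MLE commits to 0.8 while the Bayesian estimate hedges toward 0.5; at 400 in 500 the difference is under one percentage point.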
Bayesian vs Frequentist: Coin Estimation

With few flips, the Bayesian estimate (with uniform prior) is more conservative than MLE. With many flips, they converge. Adjust the number of flips.

Number of flips: 5
Heads observed: 4
Check: In Bayesian statistics, what is a parameter?

Chapter 8: Modern Bayesian Methods

Conjugate priors are elegant but limited — most real problems don't have conjugate forms. Modern Bayesian computation uses powerful algorithms to approximate posteriors for arbitrary models.

MCMC
Markov Chain Monte Carlo: draw samples from the posterior by constructing a Markov chain. Gold standard for accuracy. Can be slow.
Variational Inference
Approximate the posterior with a simpler distribution by minimizing KL divergence. Fast but approximate.
Bayesian Neural Nets
Put distributions over neural network weights. Get uncertainty estimates for predictions. Computationally expensive.
Method | Accuracy | Speed | Use case
Conjugate priors | Exact | Instant | Simple models (coin, Gaussian)
MCMC (HMC, NUTS) | Asymptotically exact | Slow | Complex models, small-medium data
Variational inference | Approximate | Fast | Large-scale, deep learning
Laplace approximation | Gaussian approx. | Fast | Well-peaked posteriors
Particle filters | Approximate | Medium | Nonlinear, non-Gaussian sequential
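The core MCMC idea fits in a short sketch: random-walk Metropolis, the simplest member of the family in the table above. This is a toy, not a production sampler; the target `log_post` is a stand-in for any model where you can evaluate prior × likelihood but not the normalizing constant (here a posterior with an N(2, 1) shape, chosen for illustration).

```python
import math
import random

random.seed(1)

def log_post(theta):
    """Unnormalized log-posterior; normalizing constant never needed."""
    return -0.5 * (theta - 2.0) ** 2

# Random-walk Metropolis: propose a nearby point, accept with
# probability min(1, posterior ratio). The chain's visited values
# approximate draws from the posterior.
theta, samples = 0.0, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 1.0)
    if math.log(random.random()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

burned = samples[2000:]                    # discard burn-in
print(round(sum(burned) / len(burned), 2))  # ≈ 2.0, the true posterior mean
```

Working in log space avoids underflow, and because only the ratio of posteriors appears, the intractable p(data) cancels — exactly why MCMC works when conjugacy fails.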
Connections: Bayesian estimation connects to everything. Kalman filters = recursive Bayesian with Gaussians. POMDPs maintain a Bayesian belief over states. Regularization in ML = implicit Bayesian prior. Ensemble methods approximate Bayesian model averaging. Understanding Bayes' theorem is understanding the heart of inference.
"Probability theory is nothing but common sense reduced to calculation."
— Pierre-Simon Laplace

You now understand Bayesian estimation. Every time you update a belief with evidence, you're doing Bayes. Now you know the mathematics behind it.

Check: When conjugate priors are unavailable, which method draws samples from the posterior?