The Complete Beginner's Path

Understand Bayesian
Estimation

How to combine prior knowledge with observed data to form optimal estimates. The mathematical foundation behind Kalman filters, machine learning, and scientific inference.

Prerequisites: Basic probability + Intuition for uncertainty. No measure theory required.
9
Chapters
7+
Simulations
0
Assumed Knowledge

Chapter 0: Parameters vs States

Estimation is the art of inferring unknown quantities from noisy data. You flip a coin 10 times and see 7 heads — what's the true bias? You measure a patient's temperature with a noisy thermometer — what's the true temperature? You observe stock returns over a year — what's the true expected return?

In all these cases, there's a hidden quantity (the parameter or state) that you cannot directly observe, and noisy data that gives you clues. Bayesian estimation provides a principled framework for combining prior knowledge with data to form the best possible estimate.

The core question: Given what I knew before (my prior) and what I just observed (my data), what should I now believe about the unknown quantity? Bayesian estimation answers this question using probability theory.
Noisy Observations of a Hidden Value

The teal line is the true value. The red dots are noisy measurements. Your job: estimate the teal line from only the red dots.

Noise level1.0
Check: What is the goal of estimation?

Chapter 1: Prior, Likelihood, Posterior

Bayesian estimation has three ingredients. The prior p(θ) encodes what you believed before seeing data. The likelihood p(data | θ) says how probable the observed data is for each possible value of θ. The posterior p(θ | data) is your updated belief after seeing the data.

p(θ | data) = p(data | θ) · p(θ) / p(data)

This is Bayes' theorem. The denominator p(data) = ∫ p(data|θ) · p(θ) dθ is just a normalizing constant (it ensures the posterior integrates to 1). In practice, you almost never compute it directly — either it cancels out (conjugate priors) or you approximate it (MCMC). The key insight: the posterior is proportional to the likelihood times the prior. Data and prior beliefs get multiplied together.

Key insight: The posterior is a compromise between the prior and the data. Strong prior + weak data = posterior close to prior. Weak prior + strong data = posterior close to the likelihood. As you gather more data, the prior matters less and the data dominates.

For Gaussians, the compromise has a clean formula. If the prior is N(μ0, σ0²) and the likelihood centers at x with variance σ², the posterior mean is:

μpost = (σ² · μ0 + σ0² · x) / (σ² + σ0²)

This is a precision-weighted average. Precision = 1/variance. The source with higher precision (lower variance) gets more weight. If σ0 = 1 and σ = 2: the prior has precision 1, the data has precision 0.25, so the prior gets 4 times the weight. The posterior variance is always smaller than either input variance — combining information always reduces uncertainty.

Interactive: Prior × Likelihood = Posterior

Drag the sliders to move the prior and likelihood. Watch how the posterior (green) shifts between them.

Prior mean-1.0
Prior σ1.5
Likelihood mean1.5
Likelihood σ1.0
Check: What does the posterior represent?
🔨 Derivation Why does dividing by p(data) make it a valid distribution? ✓ ATTEMPTED

Bayes' theorem says p(θ|data) = p(data|θ) · p(θ) / p(data). The numerator is just the likelihood times the prior — a product of two functions. But dividing by p(data) = ∫ p(data|θ) · p(θ) dθ magically makes the result integrate to 1.

Your task: Prove that ∫ p(θ|data) dθ = 1 by expanding the definition and showing the cancellation.

Write out ∫ [p(data|θ) · p(θ) / p(data)] dθ. Notice that p(data) does not depend on θ — it's a constant you can pull out of the integral.
You get (1/p(data)) · ∫ p(data|θ) · p(θ) dθ. But that integral IS the definition of p(data) (the marginal likelihood / evidence).
You have p(data)/p(data) = 1. Done! The denominator was constructed specifically to be the normalization constant.

Full derivation:

∫ p(θ|data) dθ = ∫ p(data|θ) · p(θ) / p(data) dθ

Since p(data) is a constant w.r.t. θ:

= [1/p(data)] · ∫ p(data|θ) · p(θ) dθ

But by definition, p(data) = ∫ p(data|θ) · p(θ) dθ (marginalization over θ).

= [1/p(data)] · p(data) = 1. □

The key insight: The evidence p(data) isn't some mysterious quantity — it's literally defined as "whatever makes the posterior integrate to 1." That's why we can often ignore it and write p(θ|data) ∝ p(data|θ) · p(θ). The proportionality constant is fixed by the requirement that probabilities sum to 1.

⚔ Adversarial: Your Bayesian estimator assigns 95% posterior probability to the WRONG hypothesis. The prior was 50/50. How is this possible?
You have two hypotheses: H0 (θ=0.3) and H1 (θ=0.7). Your prior is P(H0) = P(H1) = 0.5. You observe 10 coin flips. The Bayesian posterior gives P(H1|data) = 0.95. But the TRUE bias is 0.3 (H0 is correct). What caused this?
Checkpoint — Before you move on
In your own words: why is the posterior ALWAYS narrower (more certain) than both the prior and the likelihood individually? What mathematical operation causes uncertainty to decrease?
✓ Gate cleared
Model Answer

The posterior precision = prior precision + likelihood precision. Precision is 1/variance, so adding precisions means the posterior variance is ALWAYS smaller than either input: 1/(1/σ1² + 1/σ2²) < min(σ1², σ2²). Multiplying two Gaussians and renormalizing creates a distribution that's narrower than either factor because you're combining INDEPENDENT sources of information. Each source rules out different parts of the parameter space. The regions that survive are only those consistent with BOTH sources — a smaller set than either alone.

Chapter 2: MAP Estimation

Once you have the posterior distribution, how do you get a single number estimate? One natural choice: pick the value of θ that maximizes the posterior. This is Maximum A Posteriori (MAP) estimation.

θ̂MAP = argmaxθ p(θ | data) = argmaxθ p(data | θ) · p(θ)

MAP finds the peak (mode) of the posterior distribution. It's like the most likely value given everything you know. For Gaussians, the MAP estimate has a clean formula — it's the precision-weighted average of the prior mean and the data.

MAP vs MLE: Maximum Likelihood Estimation (MLE) ignores the prior entirely — it just maximizes p(data | θ). MAP includes the prior, which acts as a regularizer. With a Gaussian prior, MAP is equivalent to L2 regularization in machine learning.
MAP with numbers (Beta posterior): Suppose you flip a coin 10 times and get 7 heads. With a Beta(1,1) uniform prior, your posterior is Beta(8, 4). The MAP estimate is:

θ̂MAP = (α−1) / (α+β−2) = 7/10 = 0.700

This is the same as MLE here because the uniform prior adds no bias. Now try a Beta(5,5) prior (biased toward 0.5). Posterior becomes Beta(12, 8). MAP = 11/18 = 0.611 — pulled toward 0.5 by the prior. The prior acts exactly like adding fake observations: Beta(5,5) is like having already seen 4 heads and 4 tails before the experiment started.
MAP: Finding the Posterior Peak

The orange dot marks the MAP estimate (posterior mode). Notice how it lies between the prior and likelihood peaks, weighted by their precisions.

Prior σ1.5
Likelihood σ1.0
θ̂MAP = 0.00
Check: What does MAP estimation find?

Chapter 3: MMSE — The Posterior Mean

Another estimator: instead of the peak, take the mean of the posterior. This is the Minimum Mean Square Error (MMSE) estimator. It minimizes the expected squared error — on average, no other estimator gets closer to the true value.

θ̂MMSE = E[θ | data] = ∫ θ · p(θ | data) dθ

For symmetric distributions (like Gaussians), MAP and MMSE give the same answer because the mode equals the mean. But for skewed distributions, they differ — sometimes dramatically.

When they differ — with numbers: Take a Beta(3, 2) posterior. The MAP (mode) is (α−1)/(α+β−2) = 2/3 = 0.667. The MMSE (mean) is α/(α+β) = 3/5 = 0.600. They differ by 0.067. For a symmetric Beta(5, 5), both equal 0.5 exactly. The takeaway: skewness drives the wedge. Which estimate is "better" depends on your loss function: squared error favors MMSE, zero-one loss (hit or miss) favors MAP.
MAP vs MMSE on Skewed Posteriors

Adjust skewness. For symmetric distributions, MAP = MMSE. As skew increases, they diverge.

Skewness0.0
MAP = 2.00 MMSE = 2.00
EstimatorDefinitionLoss FunctionOptimal When
MAPPosterior mode0-1 loss (hit or miss)You want most probable value
MMSEPosterior meanSquared errorYou want minimum average error
MedianPosterior medianAbsolute errorYou want robustness to outliers

For Gaussian posteriors, all three estimates are identical (mode = mean = median). This is why Gaussians are so convenient — you don't need to think about which estimator to use. For non-Gaussian posteriors, the choice matters. In practice, MMSE (the posterior mean) is the most commonly used because squared error is the default loss function in engineering and science.

Check: When does the MMSE estimate differ from the MAP estimate?
🔨 Derivation Derive the MAP and MMSE for a Beta(α, β) posterior ✓ ATTEMPTED

Given a posterior Beta(α, β) with PDF p(θ) ∝ θα−1(1−θ)β−1, derive both the MAP (mode) and the MMSE (mean) in closed form.

Your task: (1) Find the mode by taking the derivative, setting it to zero, and solving. (2) State the mean (use the known formula). (3) Show that they're equal only when α = β.

Take d/dθ of the PDF (or equivalently, the log-PDF for easier calculus), set it to zero, and solve for θ. For Beta, differentiate θα−1(1−θ)β−1.
log p(θ) = (α−1)log(θ) + (β−1)log(1−θ) + const. Differentiate: (α−1)/θ − (β−1)/(1−θ) = 0. Now solve for θ.
(α−1)(1−θ) = (β−1)θ. Expand: α−1 − (α−1)θ = (β−1)θ. Collect θ terms on one side.

MAP (mode):

d/dθ [(α−1)lnθ + (β−1)ln(1−θ)] = 0

(α−1)/θ = (β−1)/(1−θ)

(α−1)(1−θ) = (β−1)θ

α−1 = θ(α−1 + β−1) = θ(α+β−2)

θ̂MAP = (α−1)/(α+β−2)

MMSE (mean): For Beta(α,β), the mean is E[θ] = α/(α+β).

When are they equal? Set (α−1)/(α+β−2) = α/(α+β). Cross-multiply: (α−1)(α+β) = α(α+β−2). Expand: α²+αβ−α−β = α²+αβ−2α. Simplify: −α−β = −2α, so α = β. They're equal exactly when the distribution is symmetric.

The key insight: The mode and mean of a Beta distribution are pulled in opposite directions by skewness. With α > β, the mode is farther right than the mean because the long left tail pulls the mean leftward. The gap (α−1)/(α+β−2) − α/(α+β) vanishes only at symmetry or in the large-sample limit where both approach the true value.

Chapter 4: Conjugate Priors

Computing the posterior can be hard — you need to multiply two functions and normalize. But for certain prior-likelihood pairs, the posterior has the same functional form as the prior. These are conjugate priors, and they make Bayesian updates trivially easy.

LikelihoodConjugate PriorPosteriorUse Case
Bernoulli/BinomialBeta(α, β)Beta(α+k, β+n-k)Coin bias estimation
Normal (known σ)Normal(μ0, σ0)Normal(μn, σn)Estimating a mean
PoissonGamma(a, b)Gamma(a+Σx, b+n)Rate estimation
MultinomialDirichletDirichletCategory probabilities
Why this matters: With conjugate priors, you just update the parameters. Beta(α, β) + k heads in n flips = Beta(α+k, β+n-k). No integrals, no computation. The parameters (α, β) are called hyperparameters and they accumulate evidence like a running tally.
Normal-Normal worked example: You're estimating a temperature θ. Your prior is N(μ0=0, σ0=10) — very uncertain. You take n=5 measurements with known noise σ=1, getting sample mean x̄=3.0. The posterior mean is:

μn = (σ² · μ0 + n · σ0² · x̄) / (σ² + n · σ0²) = (1·0 + 5·100·3) / (1 + 5·100) = 1500/501 = 2.994

The data overwhelms the vague prior — 5 measurements with σ=1 carry far more information than a prior with σ0=10. Now try σ0=0.5 (strong prior): μn = (1·0 + 5·0.25·3)/(1+5·0.25) = 3.75/2.25 = 1.667. The prior fights back.
Python
import numpy as np

def normal_normal_update(mu0, sig0, sig, observations):
    """Sequential Normal-Normal conjugate update."""
    mu, var = mu0, sig0**2
    for x in observations:
        prec_prior = 1 / var
        prec_obs   = 1 / sig**2
        var = 1 / (prec_prior + prec_obs)
        mu  = var * (prec_prior * mu + prec_obs * x)
    return mu, np.sqrt(var)

# Example: prior N(0, 10), noise sig=1, 5 observations
obs = [2.8, 3.2, 2.9, 3.1, 3.0]
mu_post, sig_post = normal_normal_update(0, 10, 1, obs)
# mu_post ≈ 2.99, sig_post ≈ 0.45
Interactive Coin Estimation (Beta-Binomial)

Click "Flip" to flip a coin with the hidden bias. The purple Beta distribution is your posterior belief about the coin's bias. Watch it sharpen with data.

True bias: ??? Heads: 0 Tails: 0 MAP: 0.50
Check: What makes a conjugate prior special?
🔨 Derivation Prove Beta is conjugate to Binomial ✓ ATTEMPTED

Prior: Beta(α, β) → p(θ) ∝ θα−1(1−θ)β−1. Likelihood: Binomial(n, k) → p(k|θ) ∝ θk(1−θ)n−k.

Your task: Multiply them together and show the posterior is Beta(α+k, β+n−k).

p(θ|data) ∝ p(data|θ) · p(θ) = θk(1−θ)n−k · θα−1(1−θ)β−1. Now combine the exponents.
θk · θα−1 = θk+α−1. Similarly for the (1−θ) terms. What does this look like?

Full derivation:

p(θ|data) ∝ θk(1−θ)n−k · θα−1(1−θ)β−1

= θk + α − 1(1−θ)(n−k) + β − 1

= θ(α+k) − 1(1−θ)(β+n−k) − 1

This is exactly the kernel of a Beta(α+k, β+n−k) distribution! The normalization constant (Beta function) is determined by the requirement that the posterior integrates to 1.

The key insight: Conjugacy works because the prior and likelihood have the SAME functional form in θ. When you multiply θa · θb = θa+b, the exponents simply add. The hyperparameters α and β can be interpreted as "pseudo-observations": α−1 prior heads and β−1 prior tails. Real data just adds to the count. This is why Bayesian updating is so natural for counting problems.

💻 Build It Implement the Beta-Binomial Posterior Update ✓ ATTEMPTED
Write a function that takes a Beta prior and a sequence of binary observations (coin flips), performs sequential Bayesian updates, and returns the posterior parameters plus the MAP and MMSE estimates at each step.
signature def beta_binomial_update(alpha_prior, beta_prior, observations): """ Sequential Bayesian update for Beta-Binomial model. Args: alpha_prior: float, initial alpha parameter of Beta prior beta_prior: float, initial beta parameter of Beta prior observations: list of 0s and 1s (tails and heads) Returns: history: list of dicts, one per observation, each containing: {'alpha': float, 'beta': float, 'map': float, 'mmse': float} """
Test case
history = beta_binomial_update(1, 1, [1, 1, 0, 1, 0]) # After all 5 flips: alpha=4, beta=3 # MAP = (4-1)/(4+3-2) = 3/5 = 0.600 # MMSE = 4/(4+3) = 4/7 ≈ 0.571 assert history[-1]['alpha'] == 4 assert history[-1]['beta'] == 3 assert abs(history[-1]['map'] - 0.6) < 0.001 assert abs(history[-1]['mmse'] - 4/7) < 0.001
When alpha=1 and beta=1 (uniform prior, no data yet), the MAP formula gives 0/0. In this case, any value in [0,1] is equally likely — return 0.5 as a convention. The MMSE formula alpha/(alpha+beta) always works.
python
def beta_binomial_update(alpha_prior, beta_prior, observations):
    history = []
    a, b = alpha_prior, beta_prior
    for obs in observations:
        # Update: head (1) increments alpha, tail (0) increments beta
        if obs == 1:
            a += 1
        else:
            b += 1

        # MAP: mode of Beta(a, b)
        if a + b - 2 > 0:
            map_est = (a - 1) / (a + b - 2)
        else:
            map_est = 0.5  # undefined for uniform

        # MMSE: mean of Beta(a, b)
        mmse_est = a / (a + b)

        history.append({
            'alpha': a,
            'beta': b,
            'map': map_est,
            'mmse': mmse_est
        })
    return history
Bonus challenge: Extend this to compute the 95% credible interval at each step using scipy.stats.beta.ppf(0.025, a, b) and scipy.stats.beta.ppf(0.975, a, b). Plot how the interval width shrinks as n grows.

Chapter 5: Recursive Bayesian Estimation

What if data arrives one observation at a time, rather than all at once? You don't need to redo the entire computation. Thanks to conjugacy and Bayes' rule, you can update sequentially: today's posterior becomes tomorrow's prior.

Start
Prior p(θ) — your initial belief
Observe x1
Posterior1 ∝ p(x1|θ) · Prior
Observe x2
Posterior2 ∝ p(x2|θ) · Posterior1
↓ ...
Observe xn
Posteriorn ∝ p(xn|θ) · Posteriorn-1
Beautiful property: The final posterior is identical whether you process data all at once or one at a time. Order doesn't matter either. This makes Bayesian estimation naturally suited to streaming data, sensor processing, and online learning.

This is not just a mathematical curiosity — it's the reason online learning works. Each update uses constant memory (just the current posterior parameters) and constant time (one multiply and one add for Normal-Normal). You never need to store or revisit old data. This sequential update IS the Kalman filter when the model is linear-Gaussian with a time-varying state.

Precision accumulates: Each observation adds its precision (1/σ²) to the posterior precision. After n observations: posterior precision = prior precision + n · observation precision. The posterior variance is 1/(prior precision + n/σ²), which shrinks as 1/n for large n. This is why more data always helps — and why the rate of improvement slows down (diminishing returns).
Batch = Sequential, verified with numbers: Prior N(0, 2). Three observations with σ=1: x1=2.5, x2=3.1, x3=2.8.

Sequential: After x1: μ=2.0, σ=0.894. After x2: μ=2.489, σ=0.707. After x3: μ=2.590, σ=0.632.

Batch: n=3, x̄=2.8. μn = (1·0 + 3·4·2.8)/(1 + 3·4) = 33.6/13 = 2.585. σn = 1/√(0.25 + 3) = 0.555.

The tiny difference (~0.005) is only because we're rounding at each step. With exact arithmetic, they match perfectly. This equivalence is what makes the Kalman filter possible — you can process one measurement at a time and get the same answer as if you'd seen them all at once.
Python — Verify batch = sequential
import numpy as np

# Sequential update
mu, var = 0.0, 4.0     # prior N(0, 2) → var=4
sig2 = 1.0               # observation noise variance
for x in [2.5, 3.1, 2.8]:
    new_prec = 1/var + 1/sig2
    mu = (1/var * mu + 1/sig2 * x) / new_prec
    var = 1 / new_prec
print(f"Sequential: μ={mu:.6f}, σ={np.sqrt(var):.6f}")

# Batch update
n, xbar = 3, np.mean([2.5, 3.1, 2.8])
var_batch = 1 / (1/4.0 + n/sig2)
mu_batch = var_batch * (0/4.0 + n * xbar / sig2)
print(f"Batch:      μ={mu_batch:.6f}, σ={np.sqrt(var_batch):.6f}")
# Both print identical values!
Sequential Updates: Normal-Normal Model

Each observation (red dot) arrives sequentially. Watch the green posterior sharpen as data accumulates. The blue dashed line is the prior.

n=0 | Prior: μ=0, σ=2.0
Check: In recursive Bayesian estimation, what role does the previous posterior play?
💥 Break-It Lab What Dies When You Break the Estimation? ✓ ATTEMPTED
A Normal-Normal Bayesian estimator is tracking a true value of θ=3.0 with measurement noise σ=1.0 and a moderately informative prior N(0, 2). Watch what happens when you remove key components.
Use flat (infinite-variance) prior OFF
Failure mode: With a flat prior, the posterior equals the likelihood exactly. You lose all regularization — with 1-2 observations, your estimate is entirely at the mercy of noise. The MAP becomes identical to the MLE. Small samples give wildly varying estimates.
Use wrong likelihood model (σ=0.1 assumed, actual σ=1.0) OFF
Failure mode: Assuming measurements are 10x more precise than they really are. The estimator over-trusts noisy data. Each observation yanks the posterior wildly because the model says "this measurement is basically exact." The posterior becomes overconfident — very narrow but centered at the wrong value.
Reduce to 2 observations (from 20) OFF
Failure mode: With only 2 noisy observations, the posterior barely moves from the prior. You might be far from the truth if your prior was wrong. The posterior width stays large — high uncertainty. This is CORRECT behavior (admitting ignorance) but shows why small samples are dangerous if you ignore that uncertainty.

Chapter 6: Connection to the Kalman Filter

The Kalman filter is nothing more than recursive Bayesian estimation with Gaussians. The state is the unknown parameter. The prior is the predicted state (Gaussian). The likelihood comes from the measurement model (also Gaussian). The posterior is the updated state estimate (still Gaussian, thanks to conjugacy!).

Bayesian EstimationKalman Filter
Prior p(θ)Predicted state N(x̂¯, P¯)
Likelihood p(z|θ)Measurement model N(Hx, R)
Posterior p(θ|z)Updated state N(x̂, P)
Posterior mean (MMSE)x̂ = x̂¯ + K(z − Hx̂¯)
Posterior precision = prior prec. + likelihood prec.P¯¹ = P¯¯¹ + H¹R¯¹H
Key insight: The Kalman gain K is just the ratio of precisions — how much to trust data vs prediction. K = P¯H¹(HP¯H¹+R)¯¹. This is Bayes' rule in matrix form. The "predict" step propagates the prior forward in time; the "update" step applies Bayes' theorem.
In 1D, the Kalman gain has a beautifully simple form: K = σpred² / (σpred² + σmeas²). When prediction is very uncertain (σpred ≫ σmeas), K → 1: trust the measurement. When measurement is very noisy (σmeas ≫ σpred), K → 0: trust the prediction. K is literally a trust slider between 0 and 1, set by the ratio of uncertainties. The posterior mean = prediction + K · (measurement − prediction).
Kalman as Bayesian Update (1D)

The blue Gaussian is the prediction (prior). The red is the measurement (likelihood). The green is the Kalman estimate (posterior) — always the narrowest.

Predict σ1.5
Measure σ1.0
Check: The Kalman filter is best described as:
🔗 Pattern Recognition
Precision-Weighted Fusion is Universal
This Lesson (Bayes)
μpost = (σ2²·μ1 + σ1²·μ2) / (σ1² + σ2²)
"Weight each source by the OTHER's variance"
Kalman Filter
x̂ = x̄ + K(z − Hx̄), K = P̄HT(HP̄HT+R)−1
"Trust slider between prediction and measurement" → Kalman Filter lesson

These are the SAME equation. The Kalman gain K is just the ratio of prediction uncertainty to total uncertainty — exactly the precision-weighted average from Bayesian estimation, but generalized to matrices. When you learn the Kalman filter, you already know 90% of it from this lesson. The "predict" step propagates the Gaussian prior forward in time; the "update" step applies Bayes' theorem with the measurement as likelihood.

Attention in transformers also computes a weighted average where weights depend on "relevance" (similarity). Could you frame attention as precision-weighted fusion? What would "precision" mean for tokens?

⚔ Adversarial: Your Kalman filter's state estimate is diverging (getting worse over time). The measurement noise R is correct. What's the most likely Bayesian explanation?
You're tracking a 1D position. True dynamics: xt+1 = xt + vt (constant velocity). Your model assumes: xt+1 = xt (zero velocity). Measurements arrive at zt = xt + noise. After 50 timesteps, the estimate is far from truth.

Chapter 7: Bayesian vs Frequentist

Bayesian and frequentist are two philosophies of statistics. They often agree on the numbers but disagree profoundly on what those numbers mean.

AspectBayesianFrequentist
ParametersRandom variables with distributionsFixed but unknown constants
Probability meansDegree of beliefLong-run frequency
Prior knowledgeEncoded in the priorNot used formally
ResultPosterior distribution p(θ|data)Point estimate + confidence interval
Small samplesNaturally handles (prior regularizes)Can be unreliable
ComputationCan be expensive (integrals)Usually simple formulas
The honest truth: Neither approach is universally better. Use Bayesian methods when you have genuine prior knowledge, small datasets, or need full uncertainty quantification. Use frequentist methods when you need simplicity, have large datasets (where the prior washes out), or need guaranteed coverage properties.
Bayesian vs Frequentist: Coin Estimation

With few flips, the Bayesian estimate (with uniform prior) is more conservative than MLE. With many flips, they converge. Adjust the number of flips.

Number of flips5
Heads observed4
Check: In Bayesian statistics, what is a parameter?

Chapter 8: Modern Bayesian Methods

Conjugate priors are elegant but limited — most real problems don't have conjugate forms. Modern Bayesian computation uses powerful algorithms to approximate posteriors for arbitrary models.

MCMC
Markov Chain Monte Carlo: draw samples from the posterior by constructing a Markov chain. Gold standard for accuracy. Can be slow.
Variational Inference
Approximate the posterior with a simpler distribution by minimizing KL divergence. Fast but approximate.
Bayesian Neural Nets
Put distributions over neural network weights. Get uncertainty estimates for predictions. Computationally expensive.
MethodAccuracySpeedUse Case
Conjugate priorsExactInstantSimple models (coin, Gaussian)
MCMC (HMC, NUTS)Asymptotically exactSlowComplex models, small-medium data
Variational InferenceApproximateFastLarge-scale, deep learning
Laplace ApproximationGaussian approx.FastWell-peaked posteriors
Particle FiltersApproximateMediumNonlinear, non-Gaussian sequential
What breaks — non-conjugate priors: Conjugate priors are the exception, not the rule. Most real models — logistic regression, neural networks, hierarchical models — have no closed-form posterior. The prior × likelihood product doesn't simplify. You're left with an intractable integral for the normalizing constant. This is exactly when you need MCMC (draw samples) or variational inference (optimize an approximation). The entire field of modern Bayesian computation exists because conjugacy breaks down.

The curse of dimensionality. Even with MCMC, high-dimensional parameter spaces are brutal. A small neural network might have 10,000 parameters. MCMC needs to explore a 10,000-dimensional posterior — the volume of that space is unimaginably large. This is why Bayesian neural networks remain expensive: the posterior is complex, multimodal, and high-dimensional. Variational inference trades accuracy for speed by fitting a simple distribution (usually a diagonal Gaussian) to the true posterior.

ScenarioParametersConjugate?Method
Coin bias1Yes (Beta)Closed-form update
Sensor mean1Yes (Normal)Closed-form / Kalman
Logistic regression10–100NoMCMC or Laplace approx.
Gaussian process~1,000PartiallySparse VI, inducing points
Neural network10K–1BNoVI, MC Dropout, ensembles
Connections: Bayesian estimation connects to everything. Kalman filters = recursive Bayesian with Gaussians. POMDPs maintain a Bayesian belief over states. Regularization in ML = implicit Bayesian prior (L2 = Gaussian prior, L1 = Laplace prior). Dropout in neural networks approximates Bayesian model averaging. Understanding Bayes' theorem is understanding the heart of inference.
"Probability theory is nothing but common sense reduced to calculation."
— Pierre-Simon Laplace

You now understand Bayesian estimation. Every time you update a belief with evidence, you're doing Bayes. Now you know the mathematics behind it.

Check: When conjugate priors are unavailable, which method draws samples from the posterior?
🏗 Design Challenge You're the Architect: Bayesian A/B Testing System ✓ ATTEMPTED
Your e-commerce company runs 50 A/B tests per month. The current frequentist approach requires fixed sample sizes (10,000 per variant) and produces binary "significant/not" verdicts. The VP of Product wants: (1) ability to peek at results early without inflating false positives, (2) probability statements ("95% chance B is better"), (3) a way to incorporate historical data from similar tests, and (4) support for tests with 3+ variants.
Traffic
500K daily visitors, split across ~5 concurrent tests
Baseline CVR
3.2% conversion rate (historical)
MDE
Must detect 5% relative lifts (3.2% → 3.36%)
Decision speed
Want to call winners within 7 days, not 14
1. What prior do you use for each variant's conversion rate? Flat Beta(1,1)? Informative prior from historical 3.2%? How do you set the strength?
2. What stopping rule do you use? P(B > A) > 0.95? Expected loss < $X? How do you handle "peeking" (checking results daily)?
3. With 3+ variants, how do you compute P(variant i is best)? Direct posterior sampling? What's the computational cost?
4. How many samples do you need before your first "peek" is meaningful? What's the minimum sample per variant?

Industry standard (VWO, Optimizely, Google):

Prior: Beta(1, 1) or weakly informative Beta(32, 968) centered at historical 3.2% CVR. The weak prior (equivalent to ~1000 pseudo-observations) prevents extreme early estimates but washes out after ~5000 real observations. Too-strong priors make tests insensitive to real differences.

Stopping rule: "Expected loss" framework. Stop when E[loss | choosing B] < ε (e.g., < 0.01% conversion rate). This naturally handles peeking because it's a PROPERTY of the current posterior, not a p-value that gets inflated. Alternatively: P(B > A | data) > 0.95 AND at least 1000 samples per arm (the minimum prevents lucky streaks).

Multi-variant: Draw 10,000 samples from each variant's Beta posterior. For each draw, record which variant was highest. P(i is best) = fraction of draws where i won. Cost: O(K × N_samples), trivially fast for <10 variants.

Minimum sample: With Beta(1,1) prior: ~100 conversions per variant before peeking meaningfully (prior influence <2%). First meaningful peek at ~3000 visitors per variant (given 3.2% CVR = ~96 conversions). Daily peeking with this minimum is safe because the Bayesian approach doesn't accumulate Type I error.

🔗 Pattern Recognition
Posterior Sampling Solves Exploration-Exploitation
This Lesson (Bayesian Estimation)
Maintain Beta(α, β) posterior for each arm's reward probability.
Posterior encodes BOTH the best estimate AND the uncertainty.
RL (Thompson Sampling)
At each step: sample θi ~ Beta(αi, βi) for each arm. Play the arm with highest sample.
Naturally explores uncertain arms. → RL Algorithms lesson

Thompson Sampling is literally "sample from the posterior, act as if the sample is truth." Arms with high uncertainty have wide posteriors → their samples occasionally beat the current best → they get explored. Arms with tight posteriors centered low never get sampled high → they get ignored. The Bayesian posterior IS the exploration strategy. No epsilon-greedy, no UCB formula needed — just sample and act.

The Bayes filter (for robot localization) also maintains a posterior over states. Could you design a "Thompson Sampling" version of robot navigation that samples a position from the belief and acts as if it's there? What would go wrong?