microBayes — From Prior Beliefs to Optimal Estimates

Chapter 1: Prior, Likelihood, Posterior

Bayesian estimation has three ingredients. The prior p(θ) encodes what you believed before seeing data. The likelihood p(data | θ) says how probable the observed data is for each possible value of θ. The posterior p(θ | data) is your updated belief after seeing the data.

p(θ | data) = p(data | θ) · p(θ) / p(data)

This is Bayes' theorem. The denominator p(data) = ∫ p(data|θ) · p(θ) dθ is just a normalizing constant (it ensures the posterior integrates to 1). In practice, you almost never compute it directly — either it cancels out (conjugate priors) or you approximate it (MCMC). The key insight: the posterior is proportional to the likelihood times the prior. Data and prior beliefs get multiplied together.

Key insight: The posterior is a compromise between the prior and the data. Strong prior + weak data = posterior close to prior. Weak prior + strong data = posterior close to the likelihood. As you gather more data, the prior matters less and the data dominates.

For Gaussians, the compromise has a clean formula. If the prior is N(μ₀, σ₀²) and the likelihood centers at x with variance σ², the posterior mean is:

μ_post = (σ² · μ₀ + σ₀² · x) / (σ² + σ₀²)

This is a precision-weighted average. Precision = 1/variance. The source with higher precision (lower variance) gets more weight. If σ₀ = 1 and σ = 2: the prior has precision 1, the data has precision 0.25, so the prior gets 4 times the weight. The posterior variance is always smaller than either input variance — combining information always reduces uncertainty.

Interactive: Prior × Likelihood = Posterior

Drag the sliders to move the prior and likelihood. Watch how the posterior (green) shifts between them.

Prior mean-1.0

Prior σ1.5

Likelihood mean1.5

Likelihood σ1.0

Check: What does the posterior represent?

Your belief before seeing any data How likely the data is Your updated belief about the parameter after seeing data

🔨 Derivation Why does dividing by p(data) make it a valid distribution? ▶ ✓ ATTEMPTED

Bayes' theorem says p(θ|data) = p(data|θ) · p(θ) / p(data). The numerator is just the likelihood times the prior — a product of two functions. But dividing by p(data) = ∫ p(data|θ) · p(θ) dθ magically makes the result integrate to 1.

Your task: Prove that ∫ p(θ|data) dθ = 1 by expanding the definition and showing the cancellation.

Write out ∫ [p(data|θ) · p(θ) / p(data)] dθ. Notice that p(data) does not depend on θ — it's a constant you can pull out of the integral.

You get (1/p(data)) · ∫ p(data|θ) · p(θ) dθ. But that integral IS the definition of p(data) (the marginal likelihood / evidence).

You have p(data)/p(data) = 1. Done! The denominator was constructed specifically to be the normalization constant.

Full derivation:

∫ p(θ|data) dθ = ∫ p(data|θ) · p(θ) / p(data) dθ

Since p(data) is a constant w.r.t. θ:

= [1/p(data)] · ∫ p(data|θ) · p(θ) dθ

But by definition, p(data) = ∫ p(data|θ) · p(θ) dθ (marginalization over θ).

= [1/p(data)] · p(data) = 1. □

The key insight: The evidence p(data) isn't some mysterious quantity — it's literally defined as "whatever makes the posterior integrate to 1." That's why we can often ignore it and write p(θ|data) ∝ p(data|θ) · p(θ). The proportionality constant is fixed by the requirement that probabilities sum to 1.

⚔ Adversarial: Your Bayesian estimator assigns 95% posterior probability to the WRONG hypothesis. The prior was 50/50. How is this possible?

You have two hypotheses: H₀ (θ=0.3) and H₁ (θ=0.7). Your prior is P(H₀) = P(H₁) = 0.5. You observe 10 coin flips. The Bayesian posterior gives P(H₁|data) = 0.95. But the TRUE bias is 0.3 (H₀ is correct). What caused this?

The prior was wrong The data was an unlikely but possible realization (e.g., 8/10 heads from a 0.3-bias coin) Bayes' theorem is wrong in discrete cases

Checkpoint — Before you move on

In your own words: why is the posterior ALWAYS narrower (more certain) than both the prior and the likelihood individually? What mathematical operation causes uncertainty to decrease?

✓ Gate cleared

Model Answer

The posterior precision = prior precision + likelihood precision. Precision is 1/variance, so adding precisions means the posterior variance is ALWAYS smaller than either input: 1/(1/σ₁² + 1/σ₂²) < min(σ₁², σ₂²). Multiplying two Gaussians and renormalizing creates a distribution that's narrower than either factor because you're combining INDEPENDENT sources of information. Each source rules out different parts of the parameter space. The regions that survive are only those consistent with BOTH sources — a smaller set than either alone.

Chapter 2: MAP Estimation

Once you have the posterior distribution, how do you get a single number estimate? One natural choice: pick the value of θ that maximizes the posterior. This is Maximum A Posteriori (MAP) estimation.

θ̂_MAP = argmax_θ p(θ | data) = argmax_θ p(data | θ) · p(θ)

MAP finds the peak (mode) of the posterior distribution. It's like the most likely value given everything you know. For Gaussians, the MAP estimate has a clean formula — it's the precision-weighted average of the prior mean and the data.

MAP vs MLE: Maximum Likelihood Estimation (MLE) ignores the prior entirely — it just maximizes p(data | θ). MAP includes the prior, which acts as a regularizer. With a Gaussian prior, MAP is equivalent to L2 regularization in machine learning.

MAP with numbers (Beta posterior): Suppose you flip a coin 10 times and get 7 heads. With a Beta(1,1) uniform prior, your posterior is Beta(8, 4). The MAP estimate is:

θ̂_MAP = (α−1) / (α+β−2) = 7/10 = 0.700

This is the same as MLE here because the uniform prior adds no bias. Now try a Beta(5,5) prior (biased toward 0.5). Posterior becomes Beta(12, 8). MAP = 11/18 = 0.611 — pulled toward 0.5 by the prior. The prior acts exactly like adding fake observations: Beta(5,5) is like having already seen 4 heads and 4 tails before the experiment started.

MAP: Finding the Posterior Peak

The orange dot marks the MAP estimate (posterior mode). Notice how it lies between the prior and likelihood peaks, weighted by their precisions.

Prior σ1.5

Likelihood σ1.0

      θ̂MAP =
      0.00
    

Check: What does MAP estimation find?

The mode (peak) of the posterior distribution The mean of the prior distribution The most likely data point

Chapter 3: MMSE — The Posterior Mean

Another estimator: instead of the peak, take the mean of the posterior. This is the Minimum Mean Square Error (MMSE) estimator. It minimizes the expected squared error — on average, no other estimator gets closer to the true value.

θ̂_MMSE = E[θ | data] = ∫ θ · p(θ | data) dθ

For symmetric distributions (like Gaussians), MAP and MMSE give the same answer because the mode equals the mean. But for skewed distributions, they differ — sometimes dramatically.

When they differ — with numbers: Take a Beta(3, 2) posterior. The MAP (mode) is (α−1)/(α+β−2) = 2/3 = 0.667. The MMSE (mean) is α/(α+β) = 3/5 = 0.600. They differ by 0.067. For a symmetric Beta(5, 5), both equal 0.5 exactly. The takeaway: skewness drives the wedge. Which estimate is "better" depends on your loss function: squared error favors MMSE, zero-one loss (hit or miss) favors MAP.

MAP vs MMSE on Skewed Posteriors

Adjust skewness. For symmetric distributions, MAP = MMSE. As skew increases, they diverge.

Skewness0.0

      MAP = 2.00
      MMSE = 2.00
    

Estimator	Definition	Loss Function	Optimal When
MAP	Posterior mode	0-1 loss (hit or miss)	You want most probable value
MMSE	Posterior mean	Squared error	You want minimum average error
Median	Posterior median	Absolute error	You want robustness to outliers

For Gaussian posteriors, all three estimates are identical (mode = mean = median). This is why Gaussians are so convenient — you don't need to think about which estimator to use. For non-Gaussian posteriors, the choice matters. In practice, MMSE (the posterior mean) is the most commonly used because squared error is the default loss function in engineering and science.

Check: When does the MMSE estimate differ from the MAP estimate?

When the posterior is skewed (asymmetric) When there's too much data Never — they're always the same

🔨 Derivation Derive the MAP and MMSE for a Beta(α, β) posterior ▶ ✓ ATTEMPTED

Given a posterior Beta(α, β) with PDF p(θ) ∝ θ^α−1(1−θ)^β−1, derive both the MAP (mode) and the MMSE (mean) in closed form.

Your task: (1) Find the mode by taking the derivative, setting it to zero, and solving. (2) State the mean (use the known formula). (3) Show that they're equal only when α = β.

Take d/dθ of the PDF (or equivalently, the log-PDF for easier calculus), set it to zero, and solve for θ. For Beta, differentiate θ^α−1(1−θ)^β−1.

log p(θ) = (α−1)log(θ) + (β−1)log(1−θ) + const. Differentiate: (α−1)/θ − (β−1)/(1−θ) = 0. Now solve for θ.

(α−1)(1−θ) = (β−1)θ. Expand: α−1 − (α−1)θ = (β−1)θ. Collect θ terms on one side.

MAP (mode):

d/dθ [(α−1)lnθ + (β−1)ln(1−θ)] = 0

(α−1)/θ = (β−1)/(1−θ)

(α−1)(1−θ) = (β−1)θ

α−1 = θ(α−1 + β−1) = θ(α+β−2)

θ̂_MAP = (α−1)/(α+β−2)

MMSE (mean): For Beta(α,β), the mean is E[θ] = α/(α+β).

When are they equal? Set (α−1)/(α+β−2) = α/(α+β). Cross-multiply: (α−1)(α+β) = α(α+β−2). Expand: α²+αβ−α−β = α²+αβ−2α. Simplify: −α−β = −2α, so α = β. They're equal exactly when the distribution is symmetric.

The key insight: The mode and mean of a Beta distribution are pulled in opposite directions by skewness. With α > β, the mode is farther right than the mean because the long left tail pulls the mean leftward. The gap (α−1)/(α+β−2) − α/(α+β) vanishes only at symmetry or in the large-sample limit where both approach the true value.

Chapter 4: Conjugate Priors

Computing the posterior can be hard — you need to multiply two functions and normalize. But for certain prior-likelihood pairs, the posterior has the same functional form as the prior. These are conjugate priors, and they make Bayesian updates trivially easy.

Likelihood	Conjugate Prior	Posterior	Use Case
Bernoulli/Binomial	Beta(α, β)	Beta(α+k, β+n-k)	Coin bias estimation
Normal (known σ)	Normal(μ₀, σ₀)	Normal(μ_n, σ_n)	Estimating a mean
Poisson	Gamma(a, b)	Gamma(a+Σx, b+n)	Rate estimation
Multinomial	Dirichlet	Dirichlet	Category probabilities

Why this matters: With conjugate priors, you just update the parameters. Beta(α, β) + k heads in n flips = Beta(α+k, β+n-k). No integrals, no computation. The parameters (α, β) are called hyperparameters and they accumulate evidence like a running tally.

Normal-Normal worked example: You're estimating a temperature θ. Your prior is N(μ₀=0, σ₀=10) — very uncertain. You take n=5 measurements with known noise σ=1, getting sample mean x̄=3.0. The posterior mean is:

μ_n = (σ² · μ₀ + n · σ₀² · x̄) / (σ² + n · σ₀²) = (1·0 + 5·100·3) / (1 + 5·100) = 1500/501 = 2.994

The data overwhelms the vague prior — 5 measurements with σ=1 carry far more information than a prior with σ₀=10. Now try σ₀=0.5 (strong prior): μ_n = (1·0 + 5·0.25·3)/(1+5·0.25) = 3.75/2.25 = 1.667. The prior fights back.

Python
import numpy as np

def normal_normal_update(mu0, sig0, sig, observations):
    """Sequential Normal-Normal conjugate update."""
    mu, var = mu0, sig0**2
    for x in observations:
        prec_prior = 1 / var
        prec_obs   = 1 / sig**2
        var = 1 / (prec_prior + prec_obs)
        mu  = var * (prec_prior * mu + prec_obs * x)
    return mu, np.sqrt(var)

# Example: prior N(0, 10), noise sig=1, 5 observations
obs = [2.8, 3.2, 2.9, 3.1, 3.0]
mu_post, sig_post = normal_normal_update(0, 10, 1, obs)
# mu_post ≈ 2.99, sig_post ≈ 0.45

Interactive Coin Estimation (Beta-Binomial)

Click "Flip" to flip a coin with the hidden bias. The purple Beta distribution is your posterior belief about the coin's bias. Watch it sharpen with data.

      True bias: ???
      Heads: 0
      Tails: 0
      MAP: 0.50
    

Check: What makes a conjugate prior special?

The posterior has the same functional form as the prior It always gives a symmetric posterior It eliminates the need for data

🔨 Derivation Prove Beta is conjugate to Binomial ▶ ✓ ATTEMPTED

Prior: Beta(α, β) → p(θ) ∝ θ^α−1(1−θ)^β−1. Likelihood: Binomial(n, k) → p(k|θ) ∝ θ^k(1−θ)^n−k.

Your task: Multiply them together and show the posterior is Beta(α+k, β+n−k).

p(θ|data) ∝ p(data|θ) · p(θ) = θ^k(1−θ)^n−k · θ^α−1(1−θ)^β−1. Now combine the exponents.

θ^k · θ^α−1 = θ^k+α−1. Similarly for the (1−θ) terms. What does this look like?

Full derivation:

p(θ|data) ∝ θ^k(1−θ)^n−k · θ^α−1(1−θ)^β−1

= θ^{k + α − 1}(1−θ)^{(n−k) + β − 1}

= θ^{(α+k) − 1}(1−θ)^{(β+n−k) − 1}

This is exactly the kernel of a Beta(α+k, β+n−k) distribution! The normalization constant (Beta function) is determined by the requirement that the posterior integrates to 1.

The key insight: Conjugacy works because the prior and likelihood have the SAME functional form in θ. When you multiply θ^a · θ^b = θ^a+b, the exponents simply add. The hyperparameters α and β can be interpreted as "pseudo-observations": α−1 prior heads and β−1 prior tails. Real data just adds to the count. This is why Bayesian updating is so natural for counting problems.

💻 Build It Implement the Beta-Binomial Posterior Update ▶ ✓ ATTEMPTED

Write a function that takes a Beta prior and a sequence of binary observations (coin flips), performs sequential Bayesian updates, and returns the posterior parameters plus the MAP and MMSE estimates at each step.

signature def beta_binomial_update(alpha_prior, beta_prior, observations): """ Sequential Bayesian update for Beta-Binomial model. Args: alpha_prior: float, initial alpha parameter of Beta prior beta_prior: float, initial beta parameter of Beta prior observations: list of 0s and 1s (tails and heads) Returns: history: list of dicts, one per observation, each containing: {'alpha': float, 'beta': float, 'map': float, 'mmse': float} """

Test case

history = beta_binomial_update(1, 1, [1, 1, 0, 1, 0]) # After all 5 flips: alpha=4, beta=3 # MAP = (4-1)/(4+3-2) = 3/5 = 0.600 # MMSE = 4/(4+3) = 4/7 ≈ 0.571 assert history[-1]['alpha'] == 4 assert history[-1]['beta'] == 3 assert abs(history[-1]['map'] - 0.6) < 0.001 assert abs(history[-1]['mmse'] - 4/7) < 0.001

When alpha=1 and beta=1 (uniform prior, no data yet), the MAP formula gives 0/0. In this case, any value in [0,1] is equally likely — return 0.5 as a convention. The MMSE formula alpha/(alpha+beta) always works.

python
def beta_binomial_update(alpha_prior, beta_prior, observations):
    history = []
    a, b = alpha_prior, beta_prior
    for obs in observations:
        # Update: head (1) increments alpha, tail (0) increments beta
        if obs == 1:
            a += 1
        else:
            b += 1

        # MAP: mode of Beta(a, b)
        if a + b - 2 > 0:
            map_est = (a - 1) / (a + b - 2)
        else:
            map_est = 0.5  # undefined for uniform

        # MMSE: mean of Beta(a, b)
        mmse_est = a / (a + b)

        history.append({
            'alpha': a,
            'beta': b,
            'map': map_est,
            'mmse': mmse_est
        })
    return history

Bonus challenge: Extend this to compute the 95% credible interval at each step using scipy.stats.beta.ppf(0.025, a, b) and scipy.stats.beta.ppf(0.975, a, b). Plot how the interval width shrinks as n grows.

Chapter 5: Recursive Bayesian Estimation

What if data arrives one observation at a time, rather than all at once? You don't need to redo the entire computation. Thanks to conjugacy and Bayes' rule, you can update sequentially: today's posterior becomes tomorrow's prior.

Start

Prior p(θ) — your initial belief

↓

Observe x₁

Posterior₁ ∝ p(x₁|θ) · Prior

↓

Observe x₂

Posterior₂ ∝ p(x₂|θ) · Posterior₁

↓ ...

Observe x_n

Posterior_n ∝ p(x_n|θ) · Posterior_n-1

Beautiful property: The final posterior is identical whether you process data all at once or one at a time. Order doesn't matter either. This makes Bayesian estimation naturally suited to streaming data, sensor processing, and online learning.

This is not just a mathematical curiosity — it's the reason online learning works. Each update uses constant memory (just the current posterior parameters) and constant time (one multiply and one add for Normal-Normal). You never need to store or revisit old data. This sequential update IS the Kalman filter when the model is linear-Gaussian with a time-varying state.

Precision accumulates: Each observation adds its precision (1/σ²) to the posterior precision. After n observations: posterior precision = prior precision + n · observation precision. The posterior variance is 1/(prior precision + n/σ²), which shrinks as 1/n for large n. This is why more data always helps — and why the rate of improvement slows down (diminishing returns).

Batch = Sequential, verified with numbers: Prior N(0, 2). Three observations with σ=1: x₁=2.5, x₂=3.1, x₃=2.8.

Sequential: After x₁: μ=2.0, σ=0.894. After x₂: μ=2.489, σ=0.707. After x₃: μ=2.590, σ=0.632.

Batch: n=3, x̄=2.8. μ_n = (1·0 + 3·4·2.8)/(1 + 3·4) = 33.6/13 = 2.585. σ_n = 1/√(0.25 + 3) = 0.555.

The tiny difference (~0.005) is only because we're rounding at each step. With exact arithmetic, they match perfectly. This equivalence is what makes the Kalman filter possible — you can process one measurement at a time and get the same answer as if you'd seen them all at once.

Python — Verify batch = sequential
import numpy as np

# Sequential update
mu, var = 0.0, 4.0     # prior N(0, 2) → var=4
sig2 = 1.0               # observation noise variance
for x in [2.5, 3.1, 2.8]:
    new_prec = 1/var + 1/sig2
    mu = (1/var * mu + 1/sig2 * x) / new_prec
    var = 1 / new_prec
print(f"Sequential: μ={mu:.6f}, σ={np.sqrt(var):.6f}")

# Batch update
n, xbar = 3, np.mean([2.5, 3.1, 2.8])
var_batch = 1 / (1/4.0 + n/sig2)
mu_batch = var_batch * (0/4.0 + n * xbar / sig2)
print(f"Batch:      μ={mu_batch:.6f}, σ={np.sqrt(var_batch):.6f}")
# Both print identical values!

Sequential Updates: Normal-Normal Model

Each observation (red dot) arrives sequentially. Watch the green posterior sharpen as data accumulates. The blue dashed line is the prior.

n=0 | Prior: μ=0, σ=2.0

Check: In recursive Bayesian estimation, what role does the previous posterior play?

It becomes the prior for the next update It is discarded after each step It is averaged with all previous posteriors

💥 Break-It Lab What Dies When You Break the Estimation? ▶ ✓ ATTEMPTED

A Normal-Normal Bayesian estimator is tracking a true value of θ=3.0 with measurement noise σ=1.0 and a moderately informative prior N(0, 2). Watch what happens when you remove key components.

Use flat (infinite-variance) prior OFF

Failure mode: With a flat prior, the posterior equals the likelihood exactly. You lose all regularization — with 1-2 observations, your estimate is entirely at the mercy of noise. The MAP becomes identical to the MLE. Small samples give wildly varying estimates.

Use wrong likelihood model (σ=0.1 assumed, actual σ=1.0) OFF

Failure mode: Assuming measurements are 10x more precise than they really are. The estimator over-trusts noisy data. Each observation yanks the posterior wildly because the model says "this measurement is basically exact." The posterior becomes overconfident — very narrow but centered at the wrong value.

Reduce to 2 observations (from 20) OFF

Failure mode: With only 2 noisy observations, the posterior barely moves from the prior. You might be far from the truth if your prior was wrong. The posterior width stays large — high uncertainty. This is CORRECT behavior (admitting ignorance) but shows why small samples are dangerous if you ignore that uncertainty.

Chapter 6: Connection to the Kalman Filter

The Kalman filter is nothing more than recursive Bayesian estimation with Gaussians. The state is the unknown parameter. The prior is the predicted state (Gaussian). The likelihood comes from the measurement model (also Gaussian). The posterior is the updated state estimate (still Gaussian, thanks to conjugacy!).

Bayesian Estimation	Kalman Filter
Prior p(θ)	Predicted state N(x̂¯, P¯)
Likelihood p(z\|θ)	Measurement model N(Hx, R)
Posterior p(θ\|z)	Updated state N(x̂, P)
Posterior mean (MMSE)	x̂ = x̂¯ + K(z − Hx̂¯)
Posterior precision = prior prec. + likelihood prec.	P¯¹ = P¯¯¹ + H¹R¯¹H

Key insight: The Kalman gain K is just the ratio of precisions — how much to trust data vs prediction. K = P¯H¹(HP¯H¹+R)¯¹. This is Bayes' rule in matrix form. The "predict" step propagates the prior forward in time; the "update" step applies Bayes' theorem.

In 1D, the Kalman gain has a beautifully simple form: K = σ_pred² / (σ_pred² + σ_meas²). When prediction is very uncertain (σ_pred ≫ σ_meas), K → 1: trust the measurement. When measurement is very noisy (σ_meas ≫ σ_pred), K → 0: trust the prediction. K is literally a trust slider between 0 and 1, set by the ratio of uncertainties. The posterior mean = prediction + K · (measurement − prediction).

Kalman as Bayesian Update (1D)

The blue Gaussian is the prediction (prior). The red is the measurement (likelihood). The green is the Kalman estimate (posterior) — always the narrowest.

Predict σ1.5

Measure σ1.0

Check: The Kalman filter is best described as:

Recursive Bayesian estimation with Gaussian distributions A type of neural network A frequentist hypothesis test

🔗 Pattern Recognition

Precision-Weighted Fusion is Universal

This Lesson (Bayes)

μ_post = (σ₂²·μ₁ + σ₁²·μ₂) / (σ₁² + σ₂²)
"Weight each source by the OTHER's variance"

Kalman Filter

x̂ = x̄ + K(z − Hx̄), K = P̄H^T(HP̄H^T+R)⁻¹
"Trust slider between prediction and measurement" → Kalman Filter lesson

These are the SAME equation. The Kalman gain K is just the ratio of prediction uncertainty to total uncertainty — exactly the precision-weighted average from Bayesian estimation, but generalized to matrices. When you learn the Kalman filter, you already know 90% of it from this lesson. The "predict" step propagates the Gaussian prior forward in time; the "update" step applies Bayes' theorem with the measurement as likelihood.

Attention in transformers also computes a weighted average where weights depend on "relevance" (similarity). Could you frame attention as precision-weighted fusion? What would "precision" mean for tokens?

⚔ Adversarial: Your Kalman filter's state estimate is diverging (getting worse over time). The measurement noise R is correct. What's the most likely Bayesian explanation?

You're tracking a 1D position. True dynamics: x_t+1 = x_t + v_t (constant velocity). Your model assumes: x_t+1 = x_t (zero velocity). Measurements arrive at z_t = x_t + noise. After 50 timesteps, the estimate is far from truth.

The measurement noise R is too small The prior was wrong The process model (likelihood structure) is wrong — the "prior propagation" step doesn't match reality

Chapter 7: Bayesian vs Frequentist

Bayesian and frequentist are two philosophies of statistics. They often agree on the numbers but disagree profoundly on what those numbers mean.

Aspect	Bayesian	Frequentist
Parameters	Random variables with distributions	Fixed but unknown constants
Probability means	Degree of belief	Long-run frequency
Prior knowledge	Encoded in the prior	Not used formally
Result	Posterior distribution p(θ\|data)	Point estimate + confidence interval
Small samples	Naturally handles (prior regularizes)	Can be unreliable
Computation	Can be expensive (integrals)	Usually simple formulas

The honest truth: Neither approach is universally better. Use Bayesian methods when you have genuine prior knowledge, small datasets, or need full uncertainty quantification. Use frequentist methods when you need simplicity, have large datasets (where the prior washes out), or need guaranteed coverage properties.

Bayesian vs Frequentist: Coin Estimation

With few flips, the Bayesian estimate (with uniform prior) is more conservative than MLE. With many flips, they converge. Adjust the number of flips.

Number of flips5

Heads observed4

Check: In Bayesian statistics, what is a parameter?

A random variable with a probability distribution A fixed constant that can never change A number you choose before the experiment

Chapter 8: Modern Bayesian Methods

Conjugate priors are elegant but limited — most real problems don't have conjugate forms. Modern Bayesian computation uses powerful algorithms to approximate posteriors for arbitrary models.

MCMC

Markov Chain Monte Carlo: draw samples from the posterior by constructing a Markov chain. Gold standard for accuracy. Can be slow.

↓

Variational Inference

Approximate the posterior with a simpler distribution by minimizing KL divergence. Fast but approximate.

↓

Bayesian Neural Nets

Put distributions over neural network weights. Get uncertainty estimates for predictions. Computationally expensive.

Method	Accuracy	Speed	Use Case
Conjugate priors	Exact	Instant	Simple models (coin, Gaussian)
MCMC (HMC, NUTS)	Asymptotically exact	Slow	Complex models, small-medium data
Variational Inference	Approximate	Fast	Large-scale, deep learning
Laplace Approximation	Gaussian approx.	Fast	Well-peaked posteriors
Particle Filters	Approximate	Medium	Nonlinear, non-Gaussian sequential

What breaks — non-conjugate priors: Conjugate priors are the exception, not the rule. Most real models — logistic regression, neural networks, hierarchical models — have no closed-form posterior. The prior × likelihood product doesn't simplify. You're left with an intractable integral for the normalizing constant. This is exactly when you need MCMC (draw samples) or variational inference (optimize an approximation). The entire field of modern Bayesian computation exists because conjugacy breaks down.

The curse of dimensionality. Even with MCMC, high-dimensional parameter spaces are brutal. A small neural network might have 10,000 parameters. MCMC needs to explore a 10,000-dimensional posterior — the volume of that space is unimaginably large. This is why Bayesian neural networks remain expensive: the posterior is complex, multimodal, and high-dimensional. Variational inference trades accuracy for speed by fitting a simple distribution (usually a diagonal Gaussian) to the true posterior.

Scenario	Parameters	Conjugate?	Method
Coin bias	1	Yes (Beta)	Closed-form update
Sensor mean	1	Yes (Normal)	Closed-form / Kalman
Logistic regression	10–100	No	MCMC or Laplace approx.
Gaussian process	~1,000	Partially	Sparse VI, inducing points
Neural network	10K–1B	No	VI, MC Dropout, ensembles

Connections: Bayesian estimation connects to everything. Kalman filters = recursive Bayesian with Gaussians. POMDPs maintain a Bayesian belief over states. Regularization in ML = implicit Bayesian prior (L2 = Gaussian prior, L1 = Laplace prior). Dropout in neural networks approximates Bayesian model averaging. Understanding Bayes' theorem is understanding the heart of inference.

"Probability theory is nothing but common sense reduced to calculation."

— Pierre-Simon Laplace

You now understand Bayesian estimation. Every time you update a belief with evidence, you're doing Bayes. Now you know the mathematics behind it.

Check: When conjugate priors are unavailable, which method draws samples from the posterior?

MCMC (Markov Chain Monte Carlo) Gradient descent Principal component analysis

🏗 Design Challenge You're the Architect: Bayesian A/B Testing System ▶ ✓ ATTEMPTED

Your e-commerce company runs 50 A/B tests per month. The current frequentist approach requires fixed sample sizes (10,000 per variant) and produces binary "significant/not" verdicts. The VP of Product wants: (1) ability to peek at results early without inflating false positives, (2) probability statements ("95% chance B is better"), (3) a way to incorporate historical data from similar tests, and (4) support for tests with 3+ variants.

Traffic

500K daily visitors, split across ~5 concurrent tests

Baseline CVR

3.2% conversion rate (historical)

MDE

Must detect 5% relative lifts (3.2% → 3.36%)

Decision speed

Want to call winners within 7 days, not 14

1. What prior do you use for each variant's conversion rate? Flat Beta(1,1)? Informative prior from historical 3.2%? How do you set the strength?

2. What stopping rule do you use? P(B > A) > 0.95? Expected loss < $X? How do you handle "peeking" (checking results daily)?

3. With 3+ variants, how do you compute P(variant i is best)? Direct posterior sampling? What's the computational cost?

4. How many samples do you need before your first "peek" is meaningful? What's the minimum sample per variant?

Industry standard (VWO, Optimizely, Google):

Prior: Beta(1, 1) or weakly informative Beta(32, 968) centered at historical 3.2% CVR. The weak prior (equivalent to ~1000 pseudo-observations) prevents extreme early estimates but washes out after ~5000 real observations. Too-strong priors make tests insensitive to real differences.

Stopping rule: "Expected loss" framework. Stop when E[loss | choosing B] < ε (e.g., < 0.01% conversion rate). This naturally handles peeking because it's a PROPERTY of the current posterior, not a p-value that gets inflated. Alternatively: P(B > A | data) > 0.95 AND at least 1000 samples per arm (the minimum prevents lucky streaks).

Multi-variant: Draw 10,000 samples from each variant's Beta posterior. For each draw, record which variant was highest. P(i is best) = fraction of draws where i won. Cost: O(K × N_samples), trivially fast for <10 variants.

Minimum sample: With Beta(1,1) prior: ~100 conversions per variant before peeking meaningfully (prior influence <2%). First meaningful peek at ~3000 visitors per variant (given 3.2% CVR = ~96 conversions). Daily peeking with this minimum is safe because the Bayesian approach doesn't accumulate Type I error.

🔗 Pattern Recognition

Posterior Sampling Solves Exploration-Exploitation

This Lesson (Bayesian Estimation)

Maintain Beta(α, β) posterior for each arm's reward probability.
Posterior encodes BOTH the best estimate AND the uncertainty.

RL (Thompson Sampling)

At each step: sample θ_i ~ Beta(α_i, β_i) for each arm. Play the arm with highest sample.
Naturally explores uncertain arms. → RL Algorithms lesson

Thompson Sampling is literally "sample from the posterior, act as if the sample is truth." Arms with high uncertainty have wide posteriors → their samples occasionally beat the current best → they get explored. Arms with tight posteriors centered low never get sampled high → they get ignored. The Bayesian posterior IS the exploration strategy. No epsilon-greedy, no UCB formula needed — just sample and act.

The Bayes filter (for robot localization) also maintains a posterior over states. Could you design a "Thompson Sampling" version of robot navigation that samples a position from the belief and acts as if it's there? What would go wrong?

Understand Bayesian
Estimation

Chapter 0: Parameters vs States

Chapter 1: Prior, Likelihood, Posterior

Chapter 2: MAP Estimation

Chapter 3: MMSE — The Posterior Mean

Chapter 4: Conjugate Priors

Chapter 5: Recursive Bayesian Estimation

Chapter 6: Connection to the Kalman Filter

Chapter 7: Bayesian vs Frequentist

Chapter 8: Modern Bayesian Methods

Understand BayesianEstimation

Chapter 0: Parameters vs States

Chapter 1: Prior, Likelihood, Posterior

Chapter 2: MAP Estimation

Chapter 3: MMSE — The Posterior Mean

Chapter 4: Conjugate Priors

Chapter 5: Recursive Bayesian Estimation

Chapter 6: Connection to the Kalman Filter

Chapter 7: Bayesian vs Frequentist

Chapter 8: Modern Bayesian Methods

Understand Bayesian
Estimation