Ch 13: Sampling & Bootstrapping — Piech Probability CS

Chapter 0: Why Sampling?

You are the king of Bhutan and you want to know the average happiness of the 774,000 people in your country. You cannot ask every single person. But you can pick 200 citizens at random and ask them. The question that drives this entire chapter is: what can you conclude about the whole population from that tiny slice?

This is not just a Bhutanese concern. Every scientific study, every A/B test, every political poll faces the same challenge. You observe a sample and need to make claims about the population. The mathematics of sampling tells you exactly how confident those claims can be.

Suppose you sample n = 200 people and record their happiness scores: 72, 85, 63, ... , 71. You can think of each score as a random variable X_i. Because each person was chosen independently and at random from the same population, these are IID — independent, identically distributed — draws from some unknown distribution F.

The core idea: A random sample is a window into a hidden distribution. From the sample you can estimate the population's mean and variance, and — crucially — you can quantify how uncertain those estimates are. That uncertainty is the standard error.

Population (F)

Unknown distribution with true mean μ and variance σ²

↓ draw n IID samples

Sample X₁, ..., X_n

Observed data: a histogram of values

↓ compute statistics

Estimates

Sample mean X̄ ≈ μ, sample variance S² ≈ σ², with standard error

The plan for this chapter: first we will learn how to estimate population parameters from a sample and how to measure the uncertainty of those estimates. Then we will discover bootstrapping — a general-purpose resampling trick that gives you uncertainty estimates for any statistic, not just the mean. Finally, we will build the foundations of information theory — a way to quantify uncertainty itself.

Concept	What it answers
Sample mean X̄	What is the population mean μ?
Sample variance S²	How spread out is the population?
Standard error	How much would X̄ change if I drew a different sample?
Bootstrap	What is the uncertainty of any statistic?
Entropy	How much uncertainty does a distribution contain?
KL divergence	How different are two distributions?

Check: Why do we need sampling theory?

Because populations are too large to measure entirely Because random variables are always normally distributed Because the sample mean is always exactly equal to the population mean

Chapter 1: Random Sampling Methods

We draw n observations X₁, X₂, ..., X_n from an unknown distribution F with true mean μ and true variance σ². From this sample we compute two estimates.

The sample mean is our best guess for μ:

X̄ = (1/n) ∑_i=1ⁿ X_i

The sample variance is our best guess for σ²:

S² = (1/(n−1)) ∑_i=1ⁿ (X_i − X̄)²

Why n−1 and not n? Because the sample mean X̄ itself was computed from the data, so the deviations (X_i − X̄) are slightly too small on average. Dividing by n−1 corrects for this — it makes S² an unbiased estimator of σ². This correction is called Bessel's correction.

Key insight — unbiasedness: An estimator is unbiased if its expected value equals the true parameter. Let us prove E[X̄] = μ. By linearity of expectation: E[X̄] = E[(1/n)∑X_i] = (1/n)∑E[X_i] = (1/n)(nμ) = μ. Done. The proof that E[S²] = σ² is trickier (it requires expanding the square and using independence) but the result is the same: our estimates are unbiased.

Unbiased is good, but it does not mean precise. If you take a sample of size n = 3 versus n = 3000, both sample means are unbiased, but the n = 3000 one is far more precise. We need a way to measure that precision.

The standard error of the sample mean measures how much X̄ would wiggle if you repeated the sampling process:

Var(X̄) = Var((1/n)∑X_i) = (1/n²)∑Var(X_i) = (1/n²)(nσ²) = σ²/n

SE = Std(X̄) = σ/√n ≈ S/√n

Since we do not know σ, we plug in our sample estimate S. The standard error shrinks as √n — to halve your uncertainty, you need four times as many samples.

Worked example: You survey n = 200 Bhutanese citizens. Sample mean X̄ = 83, sample variance S² = 450. The standard error is SE = √(450/200) = √2.25 = 1.5. You report: "The average happiness is 83 ± 1.5." This means if you repeated the survey with a different random sample of 200, the new mean would typically land within about 1.5 of 83.

Symbol	Meaning
μ	True population mean (unknown)
σ²	True population variance (unknown)
X̄	Sample mean (our estimate of μ)
S²	Sample variance (our estimate of σ²)
SE = S/√n	Standard error of the mean
n	Sample size

But what about the standard error of S²? Or the standard error of the median? Or any other statistic? The formula SE = S/√n only works for the mean. For anything else, we need a more powerful tool. That tool is the bootstrap.

Check: If you quadruple your sample size from n = 100 to n = 400, what happens to the standard error of the mean?

It is cut in half (SE ∝ 1/√n) It is cut to one quarter It stays the same

Chapter 2: The Bootstrap Idea

Imagine you could clone the entire country of Bhutan and survey a fresh sample of 200 people from each clone. If you did this 10,000 times, you would get 10,000 sample means, and the spread of those means would tell you exactly how uncertain your original estimate is. Of course, you cannot clone Bhutan. But you can do the next best thing.

The bootstrap replaces the unknown population with the sample itself. Your 200 data points are the best picture you have of F. So you treat them as if they were the population, and resample from them.

The bootstrap recipe:
1. You have a sample of n data points.
2. Draw a resample of size n with replacement from your original sample.
3. Compute the statistic of interest on the resample.
4. Repeat steps 2-3 many times (say 10,000).
5. The distribution of resampled statistics approximates the true sampling distribution.

Why with replacement? Because each draw must be independent, just like the original sampling process. Some data points will appear twice in a resample, others not at all — that is exactly the randomness that generates variation.

Why size n? Because the variability of a statistic depends on sample size. If you resampled 50 points from a dataset of 200, you would overestimate the uncertainty. The resample must be the same size as the original.

pseudocode
def bootstrap(sample):
  n = len(sample)
  pmf = estimate underlying PMF from sample
  stats = []
  repeat 10000 times:
    resample = draw n values from pmf # with replacement
    stat = compute_statistic(resample)
    stats.append(stat)
  return stats  # distribution of the statistic

In practice, "estimate underlying PMF from sample" is simple: the probability of each unique value is its frequency in the sample. Equivalently, you just pick n items from your sample array uniformly at random with replacement.

Key insight: The bootstrap is a general-purpose tool. It works for any statistic — the mean, the variance, the median, the 90th percentile, a regression coefficient, anything. You do not need a formula for the standard error. You just resample, compute, and look at the spread.

Worked example: You have n = 200 happiness scores with S² = 450. You want the standard error of S² (no formula exists for this). Bootstrap: draw 10,000 resamples of size 200 with replacement, compute S² for each. Suppose the 10,000 resampled variances have a standard deviation of 42. Then SE(S²) ≈ 42. You report: "The variance of happiness is 450 ± 42."

Check: In the bootstrap, why must the resample be the same size n as the original sample?

Because the variability of a statistic depends on sample size — a smaller resample would overestimate uncertainty Because the bootstrap only works with equal-sized arrays Because the Central Limit Theorem requires it

Chapter 3: Bootstrap Confidence Intervals

Beyond standard errors, the bootstrap gives you confidence intervals — a range that captures the true parameter with high probability. And it gives you p-values for hypothesis testing.

To build a 95% confidence interval for a statistic, take the 10,000 bootstrap replicates and find the 2.5th and 97.5th percentiles. That interval covers the true parameter roughly 95% of the time.

CI_95% = [ θ*_(0.025), θ*_(0.975) ]

where θ*_(q) is the q-th quantile of the bootstrap distribution.

Worked example — p-value by bootstrap: Is Nepal happier than Bhutan? You sample n₁ = 200 Bhutanese (mean 7.3) and n₂ = 300 Nepalese (mean 7.8). The observed difference is 0.5. The null hypothesis: both populations have the same distribution.

To test it:
1. Combine all 500 observations into one pool (this is the null — no difference).
2. Draw 200 values (Bhutan resample) and 300 values (Nepal resample) from the pool with replacement.
3. Compute the difference of resample means.
4. Repeat 10,000 times.
5. p-value = fraction of resampled differences ≥ 0.5.

If p = 0.03, then under the null hypothesis there is only a 3% chance of seeing a difference this large by chance. That is evidence that Nepal really is happier.

pseudocode
def pvalue_bootstrap(bhutan, nepal):
  observed_diff = mean(nepal) - mean(bhutan)
  pool = combine(bhutan, nepal)
  pmf = estimate PMF from pool
  count = 0
  repeat 10000 times:
    bhutan_re = draw len(bhutan) from pmf
    nepal_re  = draw len(nepal) from pmf
    diff = mean(nepal_re) - mean(bhutan_re)
    if |diff| ≥ observed_diff: count += 1
  return count / 10000

Why bootstrap beats the t-test: The classical t-test assumes both samples are Gaussian with equal variance. The bootstrap makes no such assumption — it works with any distribution. In the modern era of cheap computation, bootstrapping is the more general and often more correct tool.

The bootstrap does have limits. It breaks down when the underlying distribution has very heavy tails (where extreme outliers dominate) or when the samples are not truly IID (e.g., time-series data with autocorrelation). But for the vast majority of practical problems, it is remarkably robust.

Check: To test whether two groups differ, the bootstrap null hypothesis resamples from...

Each group separately The combined pool of both groups A standard normal distribution

Chapter 4: Information Theory Basics

We are about to change gears entirely. Sampling asked: "What can we learn about a population?" Information theory asks a deeper question: "How do we measure uncertainty itself?"

Consider a game: Think of an Animal. A child is thinking of an animal. You get to ask yes/no questions. Which question should you ask first? "Is it a pet?" or "Is it a dog?" The answer depends on which question reduces your uncertainty the most.

Suppose the animals and their probabilities (based on four-year-old popularity) are:

Animal	P
Dog	0.4
Cat	0.3
Elephant	0.15
Bear	0.10
Monkey	0.05

If you ask "Is it a dog?" and the answer is "Yes" (probability 0.4), you are done — zero uncertainty. But if "No" (probability 0.6), you still have four animals to distinguish. If you ask "Is it a pet?" and the answer is "Yes" (probability 0.7: dog + cat), you narrow it to two animals. "No" (probability 0.3) narrows it to three.

Key insight: The best question is the one that, on average, leaves you with the least remaining uncertainty. To make this precise, we need a formula that quantifies "how much uncertainty a probability distribution contains." That formula is entropy.

Information theory was invented by Claude Shannon in 1948 at Bell Labs. He was trying to figure out the most efficient way to compress and transmit messages. His key insight: the amount of "information" in an event is related to how surprising it is. Rare events carry more information than common ones.

The surprise (or self-information) of an event E with probability p is:

Surprise(E) = log₂(1/P(E)) = −log₂ P(E)

The base-2 logarithm means surprise is measured in bits. An event with probability 1/2 has 1 bit of surprise. An event with probability 1/4 has 2 bits. An event with probability 1/8 has 3 bits.

P(E)	Surprise (bits)
1/2	1
1/4	2
1/8	3
1/16	4
1/32	5

Why logarithm? Because surprises should add. If two independent events each have probability 1/4, seeing both is like seeing one event of probability 1/16. The surprise of the joint event (4 bits) is the sum of the individual surprises (2 + 2).

Check: What is the surprise (in bits) of an event with probability 1/64?

4 bits 5 bits 6 bits

Chapter 5: Entropy

Surprise measures uncertainty about a single event. Entropy measures the uncertainty of an entire random variable — it is the expected surprise across all possible outcomes.

H(X) = E[Surprise(X)] = ∑_x P(X=x) · log₂(1/P(X=x))

Or equivalently:

H(X) = −∑_x P(X=x) · log₂ P(X=x)

Entropy is measured in bits. A fair coin has entropy 1 bit. A fair die with 8 sides has entropy 3 bits. A deterministic variable (probability 1 on one outcome) has entropy 0 bits — no uncertainty at all.

Worked example: Compute the entropy of our animal game. P = {Dog: 0.4, Cat: 0.3, Elephant: 0.15, Bear: 0.10, Monkey: 0.05}.

H = 0.4 · log₂(1/0.4) + 0.3 · log₂(1/0.3) + 0.15 · log₂(1/0.15) + 0.10 · log₂(1/0.10) + 0.05 · log₂(1/0.05)
= 0.4(1.322) + 0.3(1.737) + 0.15(2.737) + 0.10(3.322) + 0.05(4.322)
= 0.529 + 0.521 + 0.411 + 0.332 + 0.216
= 2.009 bits

What does 2.009 bits mean intuitively? It means that, on average, you need about 2 yes/no questions to identify the animal. If all five animals were equally likely (each 0.2), the entropy would be log₂(5) ≈ 2.322 bits — higher, because the distribution is more spread out. The uniform distribution always maximizes entropy for a given number of outcomes.

python
import numpy as np

def entropy(pmf):
    """Compute entropy H(X) in bits from a PMF dict."""
    h = 0
    for x in pmf:
        p = pmf[x]
        if p == 0: continue
        h += p * np.log2(1 / p)
    return h

# Test: fair coin
print(entropy({0: 0.5, 1: 0.5}))  # 1.0 bit

# Test: animal game
animals = {'Dog': 0.4, 'Cat': 0.3, 'Elephant': 0.15,
           'Bear': 0.10, 'Monkey': 0.05}
print(entropy(animals))  # 2.009 bits

Entropy and optimal questions: Going back to our animal game — the question "Is it a pet?" has an expected remaining uncertainty of about 1.3 bits. The question "Is it a dog?" has an expected remaining uncertainty of about 1.7 bits. The better question is the one that leaves you with lower expected entropy after hearing the answer. This idea — greedily picking the question that reduces entropy the most — is the exact algorithm behind decision trees in machine learning.

Check: Which distribution over 4 outcomes has the highest entropy?

P = {0.97, 0.01, 0.01, 0.01} P = {0.25, 0.25, 0.25, 0.25} (uniform) P = {0.5, 0.5, 0, 0}

Chapter 6: KL Divergence

Entropy measures the uncertainty within a single distribution. But often we need to measure how different two distributions are. Given a true distribution P and an approximation Q, the Kullback-Leibler divergence quantifies the "extra surprise" you incur by using Q when the truth is P.

KL(P || Q) = ∑_x P(x) · log₂(P(x) / Q(x))

Think of it this way: for each outcome x, you expected surprise log₂(1/P(x)) under the true distribution P, but Q told you the surprise would be log₂(1/Q(x)). The difference, weighted by P, is the KL divergence. It measures how "wrong" Q is as a model of P.

Key properties:
• KL(P || Q) ≥ 0 always (Gibbs' inequality). It equals zero if and only if P = Q.
• It is not symmetric: KL(P || Q) ≠ KL(Q || P) in general. It is not a true "distance."
• If Q(x) = 0 for some x where P(x) > 0, then KL = ∞. Q must "cover" all outcomes that P considers possible.

Worked example: Let P = {A: 0.5, B: 0.3, C: 0.2} be the true distribution and Q = {A: 0.33, B: 0.33, C: 0.34} be a uniform-ish approximation.

KL(P || Q) = 0.5 · log₂(0.5/0.33) + 0.3 · log₂(0.3/0.33) + 0.2 · log₂(0.2/0.34)
= 0.5(0.600) + 0.3(−0.138) + 0.2(−0.766)
= 0.300 − 0.041 − 0.153
= 0.106 bits

There are other ways to compare distributions. Total Variation distance sums the absolute differences: TV(P,Q) = (1/2)∑|P(x) − Q(x)|. The Earth Mover's distance (Wasserstein metric) asks how much "dirt" you need to move to reshape P into Q. KL divergence is the most common in machine learning because it arises naturally in maximum likelihood estimation — minimizing KL(data || model) is equivalent to maximizing the log-likelihood.

python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits. p and q are dicts with same keys."""
    kl = 0
    for x in p:
        if p[x] == 0: continue
        kl += p[x] * math.log2(p[x] / q[x])
    return kl

p = {'A': 0.5, 'B': 0.3, 'C': 0.2}
q = {'A': 0.33, 'B': 0.33, 'C': 0.34}
print(kl_divergence(p, q))  # 0.106 bits

Distance Measure	Formula	Properties
KL Divergence	∑ P log(P/Q)	Non-negative, asymmetric, unbounded
Total Variation	(1/2)∑\|P−Q\|	Symmetric, range [0,1]
Earth Mover's	Optimal transport cost	Symmetric, true metric, considers geometry

Check: KL divergence is zero when...

P and Q are identical distributions P is uniform Q has maximum entropy

Chapter 7: Showcase — Interactive Bootstrap Lab

Time to see bootstrapping in action. The simulation below lets you build a dataset, draw bootstrap resamples, and watch the sampling distribution of your statistic emerge in real time.

Bootstrap Resampler

Click Add Data to add random data points (or use the default). Click Bootstrap to draw one resample and compute the mean. Click Run 1000 to see the full bootstrap distribution and confidence interval.

n = 0, resamples = 0

What to notice: The histogram of bootstrap means forms a bell curve centered on the sample mean (CLT in action!). The blue lines mark the 2.5th and 97.5th percentiles — that is your 95% bootstrap confidence interval. Try adding more data: the CI gets narrower. Try with fewer data: it gets wider. The bootstrap automatically adapts to your sample size.

Entropy Explorer

Drag the sliders to change the probability distribution over 4 outcomes. Watch how entropy H changes. Maximum entropy is log₂(4) = 2 bits (uniform distribution).

H = 2.000 bits

P(A)0.25

P(B)0.25

P(C)0.25

Standard Error vs. Sample Size

Watch how the standard error SE = σ/√n shrinks as n grows. The population has σ = 20. Drag the slider to change n.

n25

SE = 4.000

Check: In the bootstrap simulation, what does the histogram of resampled means approximate?

The sampling distribution of the sample mean The population distribution The prior distribution

Chapter 8: Worked Problems

Problem 1: Standard Error

You measure the heights of n = 64 students. The sample mean is X̄ = 170 cm and the sample standard deviation is S = 8 cm. What is the standard error of the mean, and what is an approximate 95% confidence interval?

Solution: SE = S/√n = 8/√64 = 8/8 = 1.0 cm. An approximate 95% CI (using the normal approximation X̄ ± 2·SE) is 170 ± 2.0 = [168, 172] cm.

Problem 2: Bootstrap Variance

You have 100 exam scores with sample variance S² = 225. You run a bootstrap with 10,000 resamples. The 10,000 resampled variances have mean 224.3 and standard deviation 31.7. Report the estimated variance and its uncertainty.

Solution: The estimated population variance is S² = 225. The bootstrap standard error of S² is 31.7. Report: "The variance of exam scores is 225 ± 31.7." For a 95% CI, find the 2.5th and 97.5th percentiles of the 10,000 resampled variances (e.g., [165, 290]).

Problem 3: Entropy Calculation

A loaded die has P(1) = P(2) = P(3) = P(4) = 0.1, P(5) = 0.2, P(6) = 0.4. Compute its entropy. Compare to a fair die.

Solution:
H = 4(0.1 · log₂(10)) + 0.2 · log₂(5) + 0.4 · log₂(2.5)
= 4(0.1 · 3.322) + 0.2(2.322) + 0.4(1.322)
= 1.329 + 0.464 + 0.529
= 2.322 bits.

A fair six-sided die has H = log₂(6) = 2.585 bits. The loaded die has lower entropy (less uncertainty) because the outcomes are not equally likely — you can "predict" it somewhat.

Problem 4: KL Divergence

A model predicts P = {heads: 0.5, tails: 0.5}. The true distribution is Q = {heads: 0.7, tails: 0.3}. Compute KL(Q || P).

Solution:
KL(Q || P) = 0.7 · log₂(0.7/0.5) + 0.3 · log₂(0.3/0.5)
= 0.7 · log₂(1.4) + 0.3 · log₂(0.6)
= 0.7(0.485) + 0.3(−0.737)
= 0.340 − 0.221
= 0.119 bits.

The model (fair coin) is a poor approximation of the biased coin — each "heads" event is less surprising than the model expects, and each "tails" is more surprising.

Problem 5: p-value by Bootstrap

Drug A: 50 patients, 30 recovered (60%). Drug B: 50 patients, 38 recovered (76%). The observed difference is 16%. Describe how to compute a p-value using the bootstrap.

Solution: Under the null hypothesis (no difference), pool all 100 outcomes into one dataset (68 recovered, 32 not). Repeat 10,000 times: draw 50 outcomes with replacement for group A and 50 for group B. Compute the difference of recovery rates. The p-value is the fraction of times the resampled difference ≥ 16%. If p = 0.04, you have evidence that Drug B is genuinely better.

Check: If a fair six-sided die has entropy 2.585 bits, what is the entropy of a fair coin?

1 bit 2 bits 0.5 bits

Chapter 9: Connections

This chapter ties together three ideas that may seem unrelated — sampling, bootstrapping, and information theory — but share a common thread: they are all about what you can learn from data.

The unifying thread: Sampling tells you what parameters you can estimate and how uncertain those estimates are. Bootstrapping gives you a computational tool to measure that uncertainty for any statistic. Information theory gives you a mathematical framework for uncertainty itself — entropy measures it, and KL divergence measures the cost of getting it wrong.

Concept from this chapter	Where it appears next
Sample mean & standard error	Chapter 14: estimation theory (MLE, MAP)
Bootstrap confidence intervals	Hypothesis testing, A/B testing in practice
Entropy	Decision trees (information gain), compression, Huffman coding
KL divergence	Maximum likelihood (minimizing KL = maximizing log-likelihood)
KL divergence	Variational inference (ELBO), VAEs, policy gradient in RL
p-values via bootstrap	Scientific method, clinical trials, any hypothesis test

Looking backward: The Central Limit Theorem (Chapter 12) is why the bootstrap works — the sampling distribution of the mean is approximately Gaussian, and the bootstrap approximates that distribution. The Beta distribution (Chapter 12) is the conjugate prior for Bernoulli data, which connects to Thompson Sampling (a Bayesian approach to the explore-exploit tradeoff that uses posterior sampling — very similar in spirit to bootstrapping).

Looking forward: Chapter 14 will build on sampling theory to develop maximum likelihood estimation (MLE) — finding the parameters that make the observed sample most probable. MLE turns out to be equivalent to minimizing the KL divergence between the data distribution and your model. So everything in this chapter flows directly into the next.

In machine learning: Entropy and KL divergence are everywhere. Cross-entropy loss (the standard loss function for classification) is H(data) + KL(data || model). When you train a neural network with cross-entropy loss, you are literally minimizing KL divergence. Variational autoencoders minimize a KL term to keep the latent space well-structured. Reinforcement learning algorithms like PPO use KL divergence to constrain policy updates.

The big picture: You now have the tools to go from raw data to principled scientific claims. You can estimate parameters (sample mean, variance), quantify your uncertainty (standard error, bootstrap CIs), test hypotheses (bootstrap p-values), measure how much information a distribution contains (entropy), and compare distributions (KL divergence). These five capabilities are the foundation of modern data science.

Check: Minimizing KL divergence between data and model is equivalent to...

Minimizing the sample variance Maximizing the log-likelihood of the data Maximizing the entropy of the model