Drawing conclusions about a whole population from a handful of observations — and then stress-testing those conclusions with resampling.
You are the king of Bhutan and you want to know the average happiness of the 774,000 people in your country. You cannot ask every single person. But you can pick 200 citizens at random and ask them. The question that drives this entire chapter is: what can you conclude about the whole population from that tiny slice?
This is not just a Bhutanese concern. Every scientific study, every A/B test, every political poll faces the same challenge. You observe a sample and need to make claims about the population. The mathematics of sampling tells you exactly how confident those claims can be.
Suppose you sample n = 200 people and record their happiness scores: 72, 85, 63, ... , 71. You can think of each score as a random variable Xi. Because each person was chosen independently and at random from the same population, these are IID — independent, identically distributed — draws from some unknown distribution F.
The plan for this chapter: first we will learn how to estimate population parameters from a sample and how to measure the uncertainty of those estimates. Then we will discover bootstrapping — a general-purpose resampling trick that gives you uncertainty estimates for any statistic, not just the mean. Finally, we will build the foundations of information theory — a way to quantify uncertainty itself.
| Concept | What it answers |
|---|---|
| Sample mean X̄ | What is the population mean μ? |
| Sample variance S² | How spread out is the population? |
| Standard error | How much would X̄ change if I drew a different sample? |
| Bootstrap | What is the uncertainty of any statistic? |
| Entropy | How much uncertainty does a distribution contain? |
| KL divergence | How different are two distributions? |
We draw n observations X1, X2, ..., Xn from an unknown distribution F with true mean μ and true variance σ². From this sample we compute two estimates.
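The sample mean is our best guess for μ:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The sample variance is our best guess for σ²:

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$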
Why n−1 and not n? Because the sample mean X̄ itself was computed from the data, so the deviations (Xi − X̄) are slightly too small on average. Dividing by n−1 corrects for this — it makes S² an unbiased estimator of σ². This correction is called Bessel's correction.
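A quick simulation can make Bessel's correction concrete. The sketch below assumes a normal population with σ² = 4 and a sample size of 5 (both illustrative choices, not from the text) and uses NumPy's `ddof` argument to switch between dividing by n and dividing by n−1.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0          # assumed population variance (sigma = 2)
n = 5                   # deliberately small sample size
trials = 100_000

# Each row is one independent sample of size n.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

biased = samples.var(axis=1, ddof=0)     # divide by n
unbiased = samples.var(axis=1, ddof=1)   # divide by n - 1 (Bessel's correction)

print(f"average of divide-by-n estimates:     {biased.mean():.3f}")
print(f"average of divide-by-(n-1) estimates: {unbiased.mean():.3f}")
```

Averaged over many samples, the divide-by-n estimate should settle near (n−1)/n · σ² = 3.2, while the corrected estimate should settle near the true value of 4.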
Unbiased is good, but it does not mean precise. If you take a sample of size n = 3 versus n = 3000, both sample means are unbiased, but the n = 3000 one is far more precise. We need a way to measure that precision.
The standard error of the sample mean measures how much X̄ would wiggle if you repeated the sampling process:

$$\mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \approx \frac{S}{\sqrt{n}}$$

Since we do not know σ, we plug in our sample estimate S. The standard error shrinks in proportion to 1/√n: to halve your uncertainty, you need four times as many samples.
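The plug-in formula is short enough to check numerically. This is a minimal sketch on simulated scores; the population mean of 70 and σ of 10 are made-up values, and `standard_error` is a hypothetical helper name.

```python
import numpy as np

def standard_error(sample):
    """Estimated standard error of the mean: S / sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    return sample.std(ddof=1) / np.sqrt(len(sample))

rng = np.random.default_rng(1)
# Hypothetical happiness scores; the mean of 70 and sigma of 10 are made up.
population_draws = rng.normal(loc=70, scale=10, size=3200)

se_200 = standard_error(population_draws[:200])
se_800 = standard_error(population_draws[:800])
print(f"SE with n=200: {se_200:.3f}")
print(f"SE with n=800: {se_800:.3f}")   # roughly half of the n=200 value
```

Quadrupling the sample from 200 to 800 should cut the standard error roughly in half, up to sampling noise in S.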
| Symbol | Meaning |
|---|---|
| μ | True population mean (unknown) |
| σ² | True population variance (unknown) |
| X̄ | Sample mean (our estimate of μ) |
| S² | Sample variance (our estimate of σ²) |
| SE = S/√n | Standard error of the mean |
| n | Sample size |
But what about the standard error of S²? Or the standard error of the median? Or any other statistic? The formula SE = S/√n only works for the mean. For anything else, we need a more powerful tool. That tool is the bootstrap.
Imagine you could clone the entire country of Bhutan and survey a fresh sample of 200 people from each clone. If you did this 10,000 times, you would get 10,000 sample means, and the spread of those means would tell you exactly how uncertain your original estimate is. Of course, you cannot clone Bhutan. But you can do the next best thing.
The bootstrap replaces the unknown population with the sample itself. Your 200 data points are the best picture you have of F. So you treat them as if they were the population, and resample from them.
Why with replacement? Because each draw must be independent, just like the original sampling process. Some data points will appear twice in a resample, others not at all — that is exactly the randomness that generates variation.
Why size n? Because the variability of a statistic depends on sample size. If you resampled 50 points from a dataset of 200, you would overestimate the uncertainty. The resample must be the same size as the original.
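```python
import numpy as np

def bootstrap(sample, compute_statistic=np.mean, n_resamples=10_000, seed=None):
    """Return the bootstrap distribution of a statistic."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    # Estimate underlying PMF from sample: every observation gets weight 1/n,
    # so drawing from that PMF is a uniform draw from the array itself.
    stats = []
    for _ in range(n_resamples):
        resample = rng.choice(sample, size=n, replace=True)  # with replacement
        stats.append(compute_statistic(resample))
    return np.array(stats)  # distribution of the statistic
```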
In practice, "estimate underlying PMF from sample" is simple: the probability of each unique value is its frequency in the sample. Equivalently, you just pick n items from your sample array uniformly at random with replacement.
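As one illustration of "any statistic", the sketch below bootstraps the standard error of the median, for which there is no formula as simple as S/√n. The data are simulated (not real survey results), and `bootstrap_se` is a hypothetical helper that draws all resamples at once using index arrays.

```python
import numpy as np

def bootstrap_se(sample, statistic, n_resamples=10_000, seed=None):
    """Bootstrap estimate of the standard error of `statistic`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    # One row per resample: indices drawn uniformly with replacement.
    idx = rng.integers(0, n, size=(n_resamples, n))
    replicates = statistic(sample[idx], axis=1)
    return replicates.std(ddof=1)

rng = np.random.default_rng(42)
scores = rng.normal(loc=70, scale=10, size=200)   # simulated, not real data

se_mean = bootstrap_se(scores, np.mean, seed=0)
se_median = bootstrap_se(scores, np.median, seed=0)
formula = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"bootstrap SE of mean:   {se_mean:.3f}  (formula: {formula:.3f})")
print(f"bootstrap SE of median: {se_median:.3f}")
```

For the mean, the bootstrap value should land close to S/√n, which is a useful sanity check before trusting it on statistics that have no formula.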
Beyond standard errors, the bootstrap gives you confidence intervals — a range that captures the true parameter with high probability. And it gives you p-values for hypothesis testing.
To build a 95% confidence interval for a statistic, take the 10,000 bootstrap replicates and find the 2.5th and 97.5th percentiles. That interval covers the true parameter roughly 95% of the time.
$$\mathrm{CI}_{95\%} = \left[\theta^*_{(0.025)},\ \theta^*_{(0.975)}\right]$$

where θ*(q) is the q-th quantile of the bootstrap distribution.
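The percentile interval can be sketched in a few lines. The code below assumes simulated scores and a hypothetical function name `percentile_ci`; it takes the 2.5th and 97.5th percentiles of the bootstrap replicates, as described above.

```python
import numpy as np

def percentile_ci(sample, statistic=np.mean, level=0.95,
                  n_resamples=10_000, seed=None):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    idx = rng.integers(0, n, size=(n_resamples, n))   # resample indices
    replicates = statistic(sample[idx], axis=1)
    alpha = 1.0 - level
    lo, hi = np.quantile(replicates, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(7)
scores = rng.normal(loc=70, scale=10, size=200)   # simulated happiness scores
lo, hi = percentile_ci(scores, seed=0)
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

Passing `statistic=np.median` (or any function that accepts an `axis` argument) gives an interval for that statistic instead.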
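For a p-value, pool the two samples (the null hypothesis says both countries share one happiness distribution), resample both groups from the pool, and count how often the resampled difference in means is at least as large as the observed one:

```python
import numpy as np

def pvalue_bootstrap(bhutan, nepal, n_resamples=10_000, seed=None):
    """Two-sided bootstrap p-value for a difference in means."""
    rng = np.random.default_rng(seed)
    bhutan, nepal = np.asarray(bhutan), np.asarray(nepal)
    # Compare magnitudes on both sides so the test is two-sided.
    observed_diff = abs(nepal.mean() - bhutan.mean())
    pool = np.concatenate([bhutan, nepal])
    count = 0
    for _ in range(n_resamples):
        bhutan_re = rng.choice(pool, size=len(bhutan), replace=True)
        nepal_re = rng.choice(pool, size=len(nepal), replace=True)
        diff = abs(nepal_re.mean() - bhutan_re.mean())
        if diff >= observed_diff:
            count += 1
    return count / n_resamples
```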
The bootstrap does have limits. It breaks down when the underlying distribution has very heavy tails (where extreme outliers dominate) or when the samples are not truly IID (e.g., time-series data with autocorrelation). But for the vast majority of practical problems, it is remarkably robust.
We are about to change gears entirely. Sampling asked: "What can we learn about a population?" Information theory asks a deeper question: "How do we measure uncertainty itself?"
Consider a game: Think of an Animal. A child is thinking of an animal. You get to ask yes/no questions. Which question should you ask first? "Is it a pet?" or "Is it a dog?" The answer depends on which question reduces your uncertainty the most.
Suppose the animals and their probabilities (based on their popularity among four-year-olds) are:
| Animal | P |
|---|---|
| Dog | 0.4 |
| Cat | 0.3 |
| Elephant | 0.15 |
| Bear | 0.10 |
| Monkey | 0.05 |
If you ask "Is it a dog?" and the answer is "Yes" (probability 0.4), you are done — zero uncertainty. But if "No" (probability 0.6), you still have four animals to distinguish. If you ask "Is it a pet?" and the answer is "Yes" (probability 0.7: dog + cat), you narrow it to two animals. "No" (probability 0.3) narrows it to three.
Information theory was invented by Claude Shannon in 1948 at Bell Labs. He was trying to figure out the most efficient way to compress and transmit messages. His key insight: the amount of "information" in an event is related to how surprising it is. Rare events carry more information than common ones.
The surprise (or self-information) of an event E with probability p is:

$$I(E) = \log_2 \frac{1}{p} = -\log_2 p$$
The base-2 logarithm means surprise is measured in bits. An event with probability 1/2 has 1 bit of surprise. An event with probability 1/4 has 2 bits. An event with probability 1/8 has 3 bits.
| P(E) | Surprise (bits) |
|---|---|
| 1/2 | 1 |
| 1/4 | 2 |
| 1/8 | 3 |
| 1/16 | 4 |
| 1/32 | 5 |
Why logarithm? Because surprises should add. If two independent events each have probability 1/4, seeing both is like seeing one event of probability 1/16. The surprise of the joint event (4 bits) is the sum of the individual surprises (2 + 2).
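The additivity argument is easy to verify directly. A minimal sketch, with `surprise` as a hypothetical helper name:

```python
import math

def surprise(p):
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)

# Two independent events, each with probability 1/4.
joint = surprise(0.25 * 0.25)                 # one event of probability 1/16
separate = surprise(0.25) + surprise(0.25)    # sum of the individual surprises
print(joint, separate)   # 4.0 4.0
```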
Surprise measures uncertainty about a single event. Entropy measures the uncertainty of an entire random variable: it is the expected surprise across all possible outcomes.

$$H(X) = \sum_x P(x)\,\log_2 \frac{1}{P(x)}$$

Or equivalently:

$$H(X) = -\sum_x P(x)\,\log_2 P(x)$$
Entropy is measured in bits. A fair coin has entropy 1 bit. A fair die with 8 sides has entropy 3 bits. A deterministic variable (probability 1 on one outcome) has entropy 0 bits — no uncertainty at all.
For the animal game, H = 0.4·log2(1/0.4) + 0.3·log2(1/0.3) + 0.15·log2(1/0.15) + 0.10·log2(1/0.10) + 0.05·log2(1/0.05) ≈ 2.009 bits. What does 2.009 bits mean intuitively? It means that, on average, you need about 2 well-chosen yes/no questions to identify the animal. If all five animals were equally likely (each 0.2), the entropy would be log2(5) ≈ 2.322 bits, which is higher because the distribution is more spread out. The uniform distribution always maximizes entropy for a given number of outcomes.
```python
import numpy as np

def entropy(pmf):
    """Compute entropy H(X) in bits from a PMF dict."""
    h = 0
    for x in pmf:
        p = pmf[x]
        if p == 0:
            continue
        h += p * np.log2(1 / p)
    return h

# Test: fair coin
print(entropy({0: 0.5, 1: 0.5}))  # 1.0 bit

# Test: animal game
animals = {'Dog': 0.4, 'Cat': 0.3, 'Elephant': 0.15,
           'Bear': 0.10, 'Monkey': 0.05}
print(entropy(animals))  # 2.009 bits
```
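With entropy in hand, the two candidate first questions from the animal game can be compared. The sketch below computes, for each question, the average entropy left after hearing the answer and subtracts it from the starting entropy. It assumes, as the text does, that "pet" means dog or cat; the helper names are hypothetical.

```python
import math

def entropy(probs):
    """Entropy in bits of a collection of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

animals = {'Dog': 0.4, 'Cat': 0.3, 'Elephant': 0.15, 'Bear': 0.10, 'Monkey': 0.05}

def expected_remaining_entropy(yes_set):
    """Average entropy left after learning whether the animal is in yes_set."""
    total = 0.0
    for group in (set(yes_set), set(animals) - set(yes_set)):
        mass = sum(animals[a] for a in group)
        if mass == 0:
            continue
        # Renormalize within the group to get the conditional distribution.
        conditional = [animals[a] / mass for a in group]
        total += mass * entropy(conditional)
    return total

h = entropy(animals.values())
gain_dog = h - expected_remaining_entropy({'Dog'})
gain_pet = h - expected_remaining_entropy({'Dog', 'Cat'})
print(f"Is it a dog? gains {gain_dog:.3f} bits")
print(f"Is it a pet? gains {gain_pet:.3f} bits")
```

Under these particular probabilities, "Is it a dog?" gains about 0.97 bits and "Is it a pet?" about 0.88 bits, because a 40/60 split is closer to an even split than 70/30.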
Entropy measures the uncertainty within a single distribution. But often we need to measure how different two distributions are. Given a true distribution P and an approximation Q, the Kullback-Leibler divergence quantifies the "extra surprise" you incur by using Q when the truth is P.

$$\mathrm{KL}(P \,\|\, Q) = \sum_x P(x)\,\log_2 \frac{P(x)}{Q(x)}$$
Think of it this way: for each outcome x, you expected surprise log2(1/P(x)) under the true distribution P, but Q told you the surprise would be log2(1/Q(x)). The difference, weighted by P, is the KL divergence. It measures how "wrong" Q is as a model of P.
There are other ways to compare distributions. Total Variation distance sums the absolute differences: TV(P,Q) = (1/2)∑|P(x) − Q(x)|. The Earth Mover's distance (Wasserstein metric) asks how much "dirt" you need to move to reshape P into Q. KL divergence is the most common in machine learning because it arises naturally in maximum likelihood estimation — minimizing KL(data || model) is equivalent to maximizing the log-likelihood.
```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits. p and q are dicts with same keys."""
    kl = 0
    for x in p:
        if p[x] == 0:
            continue  # terms with P(x) = 0 contribute nothing
        # Requires q[x] > 0 wherever p[x] > 0; otherwise KL is infinite.
        kl += p[x] * math.log2(p[x] / q[x])
    return kl

p = {'A': 0.5, 'B': 0.3, 'C': 0.2}
q = {'A': 0.33, 'B': 0.33, 'C': 0.34}
print(kl_divergence(p, q))  # ≈ 0.105 bits
```
| Distance Measure | Formula | Properties |
|---|---|---|
| KL Divergence | ∑ P log(P/Q) | Non-negative, asymmetric, unbounded |
| Total Variation | (1/2)∑\|P−Q\| | Symmetric, range [0,1] |
| Earth Mover's | Optimal transport cost | Symmetric, true metric, considers geometry |
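The asymmetry of KL divergence and the symmetry of Total Variation can both be checked on a small example. This is a minimal sketch using two three-outcome distributions as dicts; it assumes Q(x) > 0 wherever P(x) > 0.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

def total_variation(p, q):
    """Total variation distance: half the sum of absolute differences."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

p = {'A': 0.5, 'B': 0.3, 'C': 0.2}
q = {'A': 0.33, 'B': 0.33, 'C': 0.34}

print(f"KL(P || Q) = {kl_divergence(p, q):.4f} bits")
print(f"KL(Q || P) = {kl_divergence(q, p):.4f} bits")   # differs: KL is asymmetric
print(f"TV(P, Q)   = {total_variation(p, q):.4f}")
print(f"TV(Q, P)   = {total_variation(q, p):.4f}")      # same: TV is symmetric
```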
Time to see bootstrapping in action. The simulation below lets you build a dataset, draw bootstrap resamples, and watch the sampling distribution of your statistic emerge in real time.
Click Add Data to add random data points (or use the default). Click Bootstrap to draw one resample and compute the mean. Click Run 1000 to see the full bootstrap distribution and confidence interval.
Drag the sliders to change the probability distribution over 4 outcomes. Watch how entropy H changes. Maximum entropy is log2(4) = 2 bits (uniform distribution).
Watch how the standard error SE = σ/√n shrinks as n grows. The population has σ = 20. Drag the slider to change n.
Problem 1: Standard Error
You measure the heights of n = 64 students. The sample mean is X̄ = 170 cm and the sample standard deviation is S = 8 cm. What is the standard error of the mean, and what is an approximate 95% confidence interval?
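One way to check your answer is to plug the numbers into SE = S/√n. The sketch below assumes a normal approximation with critical value 1.96 for the 95% interval; the problem does not specify which critical value to use.

```python
import math

n = 64
sample_mean = 170.0   # cm
s = 8.0               # sample standard deviation, cm

se = s / math.sqrt(n)
z = 1.96              # assumed normal critical value for an approximate 95% interval
ci = (sample_mean - z * se, sample_mean + z * se)
print(f"SE = {se:.2f} cm, 95% CI ≈ [{ci[0]:.2f}, {ci[1]:.2f}] cm")
```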
Problem 2: Bootstrap Variance
You have 100 exam scores with sample variance S² = 225. You run a bootstrap with 10,000 resamples. The 10,000 resampled variances have mean 224.3 and standard deviation 31.7. Report the estimated variance and its uncertainty.
Problem 3: Entropy Calculation
A loaded die has P(1) = P(2) = P(3) = P(4) = 0.1, P(5) = 0.2, P(6) = 0.4. Compute its entropy. Compare to a fair die.
Problem 4: KL Divergence
A model predicts P = {heads: 0.5, tails: 0.5}. The true distribution is Q = {heads: 0.7, tails: 0.3}. Compute KL(Q || P).
Problem 5: p-value by Bootstrap
Drug A: 50 patients, 30 recovered (60%). Drug B: 50 patients, 38 recovered (76%). The observed difference is 16%. Describe how to compute a p-value using the bootstrap.
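One possible approach, sketched below, encodes each patient as 1 (recovered) or 0 (did not recover) and applies the pooled-resampling idea from the bootstrap section. The two-sided comparison, the seed, and the small tolerance for floating-point ties are choices made for this sketch rather than part of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)
# Encode each patient as 1 (recovered) or 0 (did not recover).
drug_a = np.array([1] * 30 + [0] * 20)
drug_b = np.array([1] * 38 + [0] * 12)
observed = abs(drug_b.mean() - drug_a.mean())

# Null hypothesis: the drug makes no difference, so pool all 100 outcomes.
pool = np.concatenate([drug_a, drug_b])
n_resamples = 10_000
a_re = rng.choice(pool, size=(n_resamples, len(drug_a)), replace=True)
b_re = rng.choice(pool, size=(n_resamples, len(drug_b)), replace=True)
diffs = np.abs(b_re.mean(axis=1) - a_re.mean(axis=1))

# Small tolerance so exact ties are not lost to floating-point rounding.
p_value = np.mean(diffs >= observed - 1e-12)
print(f"two-sided bootstrap p-value ≈ {p_value:.3f}")
```

The p-value is the fraction of pooled resamples whose difference in recovery rates is at least as extreme as the observed 16 percentage points.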
This chapter ties together three ideas that may seem unrelated — sampling, bootstrapping, and information theory — but share a common thread: they are all about what you can learn from data.
| Concept from this chapter | Where it appears next |
|---|---|
| Sample mean & standard error | Chapter 14: estimation theory (MLE, MAP) |
| Bootstrap confidence intervals | Hypothesis testing, A/B testing in practice |
| Entropy | Decision trees (information gain), compression, Huffman coding |
| KL divergence | Maximum likelihood (minimizing KL = maximizing log-likelihood) |
| KL divergence | Variational inference (ELBO), VAEs, policy gradient in RL |
| p-values via bootstrap | Scientific method, clinical trials, any hypothesis test |
Looking backward: The Central Limit Theorem (Chapter 12) explains why the sampling distribution of the mean is approximately Gaussian; the bootstrap approximates that same sampling distribution by simulation, and it extends to statistics that have no simple formula. The Beta distribution (Chapter 12) is the conjugate prior for Bernoulli data, which connects to Thompson Sampling (a Bayesian approach to the explore-exploit tradeoff that uses posterior sampling, similar in spirit to bootstrapping).
Looking forward: Chapter 14 will build on sampling theory to develop maximum likelihood estimation (MLE) — finding the parameters that make the observed sample most probable. MLE turns out to be equivalent to minimizing the KL divergence between the data distribution and your model. So everything in this chapter flows directly into the next.
In machine learning: Entropy and KL divergence are everywhere. Cross-entropy loss (the standard loss function for classification) equals H(data) + KL(data || model). Because H(data) does not depend on the model, training a neural network with cross-entropy loss minimizes the KL divergence from the data distribution to the model. Variational autoencoders minimize a KL term to keep the latent space well-structured. Reinforcement learning algorithms like PPO use KL divergence to constrain policy updates.
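The decomposition of cross-entropy into entropy plus KL divergence can be checked numerically. In this sketch, `data` and `model` are small hypothetical distributions standing in for the empirical label distribution and a classifier's predictions.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return sum(p[x] * math.log2(1 / p[x]) for x in p if p[x] > 0)

def cross_entropy(p, q):
    """Expected surprise under model q when outcomes follow p."""
    return sum(p[x] * math.log2(1 / q[x]) for x in p if p[x] > 0)

def kl_divergence(p, q):
    """KL(P || Q) in bits; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

data = {'A': 0.5, 'B': 0.3, 'C': 0.2}      # stands in for the data distribution
model = {'A': 0.33, 'B': 0.33, 'C': 0.34}  # stands in for a model's predictions

lhs = cross_entropy(data, model)
rhs = entropy(data) + kl_divergence(data, model)
print(f"cross-entropy = {lhs:.4f}, H + KL = {rhs:.4f}")
```

The two printed values agree up to floating-point rounding, and only the KL term changes when `model` changes.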