Piech, Chapter 12

Beta Distribution & CLT

Distributions over probabilities, adding random variables, and the most beautiful theorem in statistics.

Prerequisites: Chapter 11 (General inference, Bayes with continuous RVs). Familiarity with common distributions (Normal, Poisson, Binomial).
10 chapters, 4 simulations, 10 quizzes

Chapter 0: Why Beta & CLT?

You flip a coin 10 times and see 9 heads. What is the true probability of heads? You could say 0.9, but that is a single number — it does not capture how uncertain you are. With only 10 flips, the true probability could easily be 0.7 or 0.95. What you really want is a distribution over the probability itself.

That is exactly what the Beta distribution gives you: a probability distribution whose support is [0, 1], perfectly shaped to express beliefs about an unknown probability.

Now consider a different question. You roll a die 100 times and sum the results. What does the distribution of that sum look like? Surprisingly, it looks like a bell curve: a Normal distribution. Each die is discrete and uniform, yet the sum, though still discrete, is approximated very closely by a continuous Gaussian. This is the Central Limit Theorem (CLT), perhaps the single most important result in all of probability.

This chapter in one sentence: The Beta distribution lets you reason about uncertainty in probabilities, and the CLT explains why Normal distributions appear everywhere: whenever you sum many independent, identically distributed random variables with finite variance, the sum is approximately Normal, no matter their original shape.

Worked example — why a point estimate fails: A new drug is tested on 6 patients. 4 are cured, 2 are not. A point estimate gives P(cure) = 4/6 ≈ 0.667. But would you bet your life on that number? With only 6 trials, the true cure rate could plausibly be anywhere from 0.3 to 0.9. The Beta distribution captures this — Beta(5, 3) has mean 0.625 and standard deviation 0.16, which honestly reflects our uncertainty.

Topic | Core question it answers
Beta distribution | What is the distribution of an unknown probability?
Conjugate priors | How do we update beliefs cleanly as new data arrives?
Adding random variables | What is the distribution of X + Y?
Convolution | How do we compute the PDF/PMF of a sum?
Central Limit Theorem | Why does the sum of anything approach a Normal?
Check: Why is a single point estimate (like 9/10 = 0.9) insufficient for describing an unknown probability?

Chapter 1: The Beta Distribution

The Beta distribution is a continuous distribution on [0, 1] controlled by two shape parameters, a and b. Its PDF is:

$$f(x) = \frac{1}{B(a,b)}\, x^{a-1} (1 - x)^{b-1} \quad \text{for } x \in [0, 1]$$

where $B(a, b) = \int_0^1 t^{a-1}(1 - t)^{b-1}\,dt$ is the Beta function, a normalizing constant that ensures the PDF integrates to 1. You rarely need to compute B(a, b) by hand: most scientific computing libraries provide it.
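As a rough sketch of what such a library routine does, the Beta function can be written through the Gamma function using the standard identity B(a, b) = Γ(a)Γ(b)/Γ(a + b), which this chapter does not derive. The helper names `beta_fn` and `beta_pdf` are illustrative and not taken from any particular library.

```python
import math

def beta_fn(a: float, b: float) -> float:
    """Beta function via B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b).

    Works in log space (math.lgamma) so large a, b do not overflow.
    """
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x; zero outside [0, 1].

    Assumes a, b >= 1 if you evaluate exactly at the endpoints 0 or 1.
    """
    if x < 0.0 or x > 1.0:
        return 0.0
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

# Midpoint-rule check that the Beta(5, 3) density integrates to about 1.
steps = 10_000
area = sum(beta_pdf((i + 0.5) / steps, 5, 3) for i in range(steps)) / steps
print(round(area, 6))
```

The numerical integral is only a sanity check on the normalizing constant; in practice you would call a library implementation rather than roll your own.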

The key statistics:

$$E[X] = \frac{a}{a + b} \qquad \mathrm{Var}(X) = \frac{ab}{(a+b)^2 (a+b+1)}$$

The parameters a and b control the shape in an intuitive way. Think of a as counting "successes + 1" and b as "failures + 1". When a > b, the distribution leans toward 1 (high probability). When b > a, it leans toward 0. When a = b, it is symmetric around 0.5.

Shape intuition: a and b act like pseudo-counts. Beta(1, 1) is the uniform distribution — complete ignorance. Beta(10, 10) is a tight bell centered at 0.5 — strong belief the probability is near half. Beta(100, 1) is a spike near 1.0 — overwhelming evidence of success. More data (larger a + b) means a tighter, more confident distribution.
[Interactive simulation: Beta PDF Explorer. Sliders set a and b (initial values a = 5, b = 3), and the plotted PDF changes shape as you drag them.]

Worked example — coin flips: You flip a coin 10 times and observe 9 heads, 1 tail. With a uniform prior (Beta(1,1)), your posterior belief about the coin's heads probability is X ~ Beta(1 + 9, 1 + 1) = Beta(10, 2). The mean is 10/12 ≈ 0.833 and the variance is 20 / (144 × 13) ≈ 0.0107, giving a standard deviation of about 0.103.

The Beta tells us: yes, the probability is probably high (the peak is near 0.9), but there is meaningful spread — values from 0.6 to 1.0 are all plausible. With 100 heads out of 110 flips, we would get Beta(101, 11), still centered near 0.9 but much tighter.
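The numbers in this example can be reproduced with a few lines of code. This sketch uses the mean and variance formulas above, plus the standard mode formula (a − 1)/(a + b − 2) for a, b > 1, which the chapter uses implicitly when it says "the peak is near 0.9". The function name `beta_summary` is made up for illustration.

```python
import math

def beta_summary(a: float, b: float) -> dict:
    """Mean, standard deviation, and mode of Beta(a, b); the mode formula needs a, b > 1."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    mode = (a - 1) / (a + b - 2)
    return {"mean": mean, "sd": math.sqrt(var), "mode": mode}

few_flips = beta_summary(10, 2)     # 9 heads, 1 tail, uniform prior
many_flips = beta_summary(101, 11)  # 100 heads, 10 tails, uniform prior
print(few_flips)
print(many_flips)
```

Both posteriors peak near 0.9, but the standard deviation of the second is several times smaller, which is the "much tighter" behavior described above.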

Beta(a, b) | Shape | Mean | Interpretation
Beta(1, 1) | Flat (uniform) | 0.50 | Total ignorance
Beta(2, 5) | Right-skewed (mass near 0, tail toward 1) | 0.29 | 1 success, 4 failures observed
Beta(10, 2) | Left-skewed (mass near 1, tail toward 0) | 0.83 | 9 successes, 1 failure observed
Beta(50, 50) | Tight symmetric | 0.50 | 98 trials, evenly split
Check: If you observe 4 cured out of 6 patients (with a uniform prior), what Beta distribution represents your belief about the cure rate?

Chapter 2: Beta as Conjugate Prior

In Chapter 1, we started with a uniform prior Beta(1, 1) and updated it with observed data. But what if we already had some prior belief? Maybe we think the coin is roughly fair — we could start with Beta(5, 5) instead of Beta(1, 1). The beautiful result: the posterior is still a Beta.

Conjugate prior property: If your prior is Beta(a, b) and you observe h heads and t tails, your posterior is Beta(a + h, b + t). The prior and posterior are from the same family — this is what "conjugate" means.

Here is the derivation. Let our prior belief about the heads probability be X ~ Beta(a, b). After observing h heads and t tails:

$$f(X = x \mid H = h, T = t) = \frac{P(H = h, T = t \mid X = x)\, f(X = x)}{P(H = h, T = t)}$$

The likelihood is Binomial: $P(H = h, T = t \mid X = x) \propto x^{h}(1 - x)^{t}$. The prior is Beta: $f(X = x) \propto x^{a-1}(1 - x)^{b-1}$. Multiplying:

$$f(X = x \mid \text{data}) \propto x^{h}(1 - x)^{t} \cdot x^{a-1}(1 - x)^{b-1} = x^{a+h-1}(1 - x)^{b+t-1}$$

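That is the kernel of Beta(a + h, b + t). Because a density must integrate to 1, the normalizing constant is forced to be 1/B(a + h, b + t), so we never have to compute P(H=h, T=t) directly. We just add the observed counts to the prior parameters.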
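If you would rather see the conjugacy result than trust the algebra, here is a small numerical check. It multiplies likelihood by prior on a grid, normalizes by brute force, and compares against the closed-form posterior. The prior Beta(5, 5) and the data (8 heads, 2 tails) are arbitrary illustration values, and the function names are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def update(prior, heads, tails):
    """Conjugate update: Beta(a, b) prior -> Beta(a + heads, b + tails) posterior."""
    a, b = prior
    return (a + heads, b + tails)

# Brute-force Bayes on a grid: posterior is proportional to likelihood * prior.
prior, heads, tails = (5, 5), 8, 2
steps = 2_000
grid = [(i + 0.5) / steps for i in range(steps)]
unnorm = [x ** heads * (1 - x) ** tails * beta_pdf(x, *prior) for x in grid]
total = sum(unnorm) / steps                  # midpoint-rule normalizer
numeric = [u / total for u in unnorm]

closed_form = [beta_pdf(x, *update(prior, heads, tails)) for x in grid]
max_gap = max(abs(n - c) for n, c in zip(numeric, closed_form))
print(update(prior, heads, tails), max_gap)
```

The largest gap between the two curves should be at the level of numerical noise, which is what "the posterior is still a Beta" means in practice.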

Think of a and b as pseudo-observations: a − 1 imaginary successes and b − 1 imaginary failures. When you start with Beta(1, 1), there are 0 imaginary trials — complete ignorance. When you start with Beta(10, 10), you are asserting the equivalent of 18 coin flips that came out evenly. New real data gets blended with these pseudo-observations.

[Interactive simulation: Bayesian Updating, Prior → Posterior. Set a prior Beta(a, b) (initially a = 1, b = 1) and a true p (initially 0.70). Each click of Flip generates one coin flip, and the posterior updates as the head and tail counts grow from h = 0, t = 0.]

Worked example — strong prior vs. weak prior: You believe a coin is fair: prior Beta(10, 10). You then observe 8 heads, 2 tails. Your posterior is Beta(18, 12). The posterior mean is 18/30 = 0.60 — the prior has "pulled" your estimate toward 0.5. Had you used a weak prior Beta(1, 1), the posterior would be Beta(9, 3) with mean 0.75 — much more influenced by the data. The strength of the prior (measured by a + b) determines how many real observations it takes to overwhelm it.
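One way to make "the strength of the prior is a + b" concrete: the posterior mean (a + h)/(a + b + h + t) can be rewritten as a weighted average of the prior mean a/(a + b) and the observed fraction h/(h + t), with weight (a + b)/(a + b + h + t) on the prior. That identity is simple algebra rather than something stated in the chapter, and the function names below are illustrative.

```python
def posterior_mean(a, b, heads, tails):
    """Mean of the Beta(a + heads, b + tails) posterior."""
    return (a + heads) / (a + b + heads + tails)

def prior_weight(a, b, heads, tails):
    """Share of the posterior mean that comes from the prior mean."""
    return (a + b) / (a + b + heads + tails)

for a, b in [(10, 10), (1, 1)]:
    w = prior_weight(a, b, 8, 2)
    # Weighted average of prior mean and observed fraction of heads.
    blended = w * a / (a + b) + (1 - w) * 8 / 10
    print((a, b), round(posterior_mean(a, b, 8, 2), 3), round(w, 3), round(blended, 3))
```

With Beta(10, 10) the prior carries two thirds of the weight after 10 flips; with Beta(1, 1) it carries one sixth, which is why the weak prior lands so much closer to the data.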

Bayesians vs. Frequentists: Are you allowed to just make up priors? Bayesians say yes — it lets you incorporate domain knowledge. Frequentists say no — results should depend only on observed data. For small datasets, a good prior can dramatically improve predictions. For large datasets, the prior washes out and both approaches converge to the same answer.
Check: Your prior is Beta(5, 5) and you observe 20 heads and 5 tails. What is the posterior?

Chapter 3: Adding Random Variables

We now shift to a completely different topic: what happens when you add two random variables together? If X and Y are random variables, what is the distribution of Z = X + Y? For independent X and Y, the operation that computes it is called convolution, and it underlies the Central Limit Theorem.

For discrete random variables, the PMF of the sum is:

$$P(X + Y = n) = \sum_{i=-\infty}^{\infty} P(X = i,\ Y = n - i)$$

If X and Y are independent, this simplifies to:

$$P(X + Y = n) = \sum_{i} P(X = i) \cdot P(Y = n - i)$$

The idea: to get a sum of n, enumerate all the ways X and Y can combine to produce n. X could be 0 and Y could be n. X could be 1 and Y could be n − 1. Add up the probabilities of all these mutually exclusive cases.

Key intuition: Convolution is not adding PDFs. It is computing the distribution of a new random variable Z = X + Y by considering every possible pair (X = i, Y = n − i) that sums to n. The probabilities of all these pairs get summed (discrete) or integrated (continuous).

Worked example — sum of two dice: Let X and Y each be uniform on {1, 2, 3, 4, 5, 6}. What is P(X + Y = 7)?

We need all pairs (i, 7 − i) where both i and 7 − i are in {1,...,6}. That gives i = 1,2,3,4,5,6 — all six values work. Each pair has probability (1/6)(1/6) = 1/36.

$$P(X + Y = 7) = \sum_{i=1}^{6} P(X = i) \cdot P(Y = 7 - i) = 6 \times \frac{1}{36} = \frac{6}{36} = \frac{1}{6}$$

For P(X + Y = 4): pairs are (1,3), (2,2), (3,1). That is 3/36 = 1/12. The triangle-shaped PMF of the dice sum peaks at 7 because 7 has the most contributing pairs.
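The same enumeration can be written as a short function. This is a minimal sketch of discrete convolution for independent variables, with each PMF stored as a `{value: probability}` dictionary (a representation chosen here for illustration) and exact fractions so the results match the hand calculation.

```python
from collections import defaultdict
from fractions import Fraction

def convolve(pmf_x, pmf_y):
    """PMF of X + Y for independent X, Y, each given as a {value: probability} dict."""
    pmf_sum = defaultdict(Fraction)
    for i, p_i in pmf_x.items():
        for j, p_j in pmf_y.items():
            pmf_sum[i + j] += p_i * p_j   # the pair (i, j) contributes to the sum i + j
    return dict(pmf_sum)

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve(die, die)
print(two_dice[7], two_dice[4])
```

Calling `convolve(two_dice, die)` would give the PMF for three dice, and repeating that step is exactly how sums of many variables are built up.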

Sum n | Contributing pairs | P(X+Y = n)
2 | (1,1) | 1/36
5 | (1,4),(2,3),(3,2),(4,1) | 4/36
7 | (1,6),(2,5),(3,4),(4,3),(5,2),(6,1) | 6/36
12 | (6,6) | 1/36
Check: When computing P(X + Y = n) for independent X, Y, why do we sum over all i?

Chapter 4: Convolution Results

For several important distribution families, the sum of two independent random variables stays in the same family. These closed-form results save you from computing the convolution sum or integral by hand.

The big three convolution theorems:
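Poissons: $X \sim \text{Poi}(\lambda_1),\ Y \sim \text{Poi}(\lambda_2) \Rightarrow X + Y \sim \text{Poi}(\lambda_1 + \lambda_2)$
Binomials (same p): $X \sim \text{Bin}(n_1, p),\ Y \sim \text{Bin}(n_2, p) \Rightarrow X + Y \sim \text{Bin}(n_1 + n_2, p)$
Normals: $X \sim N(\mu_1, \sigma_1^2),\ Y \sim N(\mu_2, \sigma_2^2) \Rightarrow X + Y \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$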

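Proof sketch for Poissons: Let $X \sim \text{Poi}(\lambda_1)$ and $Y \sim \text{Poi}(\lambda_2)$ be independent. Using the convolution formula:

$$P(X + Y = n) = \sum_{k=0}^{n} P(X = k) \cdot P(Y = n - k) = \sum_{k=0}^{n} \frac{e^{-\lambda_1} \lambda_1^{k}}{k!} \cdot \frac{e^{-\lambda_2} \lambda_2^{n-k}}{(n-k)!}$$

$$= \frac{e^{-(\lambda_1 + \lambda_2)}}{n!} \sum_{k=0}^{n} \binom{n}{k} \lambda_1^{k} \lambda_2^{n-k} = \frac{e^{-(\lambda_1 + \lambda_2)} (\lambda_1 + \lambda_2)^{n}}{n!}$$

The second line multiplies and divides by n!, so that $\frac{n!}{k!\,(n-k)!} = \binom{n}{k}$. The last step uses the Binomial theorem: $(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{k} b^{n-k}$. The result is exactly the PMF of $\text{Poi}(\lambda_1 + \lambda_2)$.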
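A quick numerical check of the Poisson result, as a sketch: convolve two Poisson PMFs directly and compare against the single Poisson with the summed rate. The rates 3 and 5 are illustration values, and the function names are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for K ~ Poi(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_sum_pmf(n: int, lam1: float, lam2: float) -> float:
    """P(X + Y = n) by direct convolution of Poi(lam1) and Poi(lam2)."""
    return sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2) for k in range(n + 1))

lam1, lam2 = 3.0, 5.0
# Largest disagreement between the convolution and Poi(lam1 + lam2) over n = 0..39.
worst = max(abs(poisson_sum_pmf(n, lam1, lam2) - poisson_pmf(n, lam1 + lam2))
            for n in range(40))
print(worst)
print(round(poisson_pmf(10, lam1 + lam2), 4))
```

The disagreement should be at floating-point precision, so the closed form saves an O(n) sum per value without losing anything.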

The Normal result is especially powerful because it extends to any number of independent summands. The means add, the variances add, and the result is still Normal. This is the algebraic backbone of the CLT.

Sum of two independent uniforms: If X, Y ~ Uni(0,1) independently, the PDF of Z = X + Y is triangular:

f(z) = z    if 0 < z ≤ 1      f(z) = 2 − z    if 1 < z ≤ 2

A uniform plus a uniform is not uniform. The distribution peaks at z = 1 and tapers linearly to zero at z = 0 and z = 2. Already with just two summands, we see the beginnings of a bell shape. This is the CLT in embryo.
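A short simulation makes the triangular shape tangible. This sketch draws many pairs of uniforms, adds them, and compares the share of sums landing in three intervals against the probabilities implied by the triangular PDF. The CDF in `triangular_cdf` comes from integrating f(z) above; the seed and trial count are arbitrary choices.

```python
import random

def triangular_cdf(z: float) -> float:
    """CDF of Z = X + Y for independent X, Y ~ Uni(0, 1), from integrating f(z)."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2
    if z <= 2:
        return 1 - (2 - z) ** 2 / 2
    return 1.0

rng = random.Random(0)          # fixed seed so the run is repeatable
trials = 100_000
sums = [rng.random() + rng.random() for _ in range(trials)]

for lo, hi in [(0.0, 0.5), (0.5, 1.5), (1.5, 2.0)]:
    simulated = sum(lo < s <= hi for s in sums) / trials
    exact = triangular_cdf(hi) - triangular_cdf(lo)
    print(f"P({lo} < Z <= {hi}): simulated {simulated:.3f}, triangular {exact:.3f}")
```

A uniform on [0, 2] would put half its mass in the middle interval; the triangular PDF puts three quarters there, and the simulation should agree to within sampling noise.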

Worked example — customer arrivals: Customers arrive at a store from two independent sources. Source 1: Poi(3) per hour. Source 2: Poi(5) per hour. Total arrivals per hour: Z ~ Poi(3 + 5) = Poi(8). The probability of exactly 10 total arrivals: $P(Z = 10) = e^{-8} \cdot 8^{10} / 10! \approx 0.0993$.

Distribution family | Closed under addition? | Condition
Poisson | Yes | Independence
Binomial | Yes | Same p, independence
Normal | Yes | Independence
Uniform | No | Sum is triangular, not uniform
Exponential | No | Sum of n IID Exp(λ) variables is Gamma(n, λ)
Check: X ~ Poi(4) and Y ~ Poi(7) are independent. What is the distribution of X + Y?

Chapter 5: The Central Limit Theorem

Imagine summing 100 independent Uni(0, 1) random variables. Each one is flat and rectangular. But their sum? A beautiful bell curve, almost indistinguishable from a Normal. This is the Central Limit Theorem in action.

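CLT (Sum version): Let $X_1, X_2, \ldots, X_n$ be IID random variables with mean $\mu$ and finite variance $\sigma^2$. Then for large n the sum is approximately Normal:

$$\sum_{i=1}^{n} X_i \approx N(n\mu,\ n\sigma^2)$$

CLT (Average version): Equivalently, the sample mean is approximately Normal:

$$\frac{1}{n} \sum_{i=1}^{n} X_i \approx N\!\left(\mu,\ \frac{\sigma^2}{n}\right)$$

The two versions are equivalent: divide the sum by n and you get the average. The sum has variance $n\sigma^2$ (growing); the average has variance $\sigma^2/n$ (shrinking). Both are approximately Normal, and the approximation improves as n → ∞.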

What makes this remarkable is the word any. The original Xi can be uniform, exponential, Bernoulli, Poisson, or any other distribution — as long as it has finite mean and finite variance, the sum converges to Normal. The original distribution's shape gets washed out.
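To see this with a distribution that looks nothing like a bell, here is a small simulation sketch that sums 50 IID Exp(1) draws, repeats that many times, and compares the results with what N(nμ, nσ²) predicts. The choices n = 50, 20,000 repetitions, and the seed are arbitrary; Exp(1) has mean 1 and variance 1.

```python
import random
import statistics

rng = random.Random(1)            # fixed seed so the run is repeatable
n, trials = 50, 20_000            # sum n IID Exp(1) draws, repeated `trials` times
mu, sigma2 = 1.0, 1.0             # mean and variance of Exp(1)

sums = [sum(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

clt_mean, clt_sd = n * mu, (n * sigma2) ** 0.5
# Share of simulated sums within 1.96 standard deviations of the CLT mean.
inside = sum(abs(s - clt_mean) <= 1.96 * clt_sd for s in sums) / trials

print("sample mean    ", round(statistics.fmean(sums), 2), "vs", clt_mean)
print("sample variance", round(statistics.variance(sums), 2), "vs", n * sigma2)
print("share within 1.96 SD", round(inside, 3), "vs 0.95")
```

The mean and variance match by linearity alone; the CLT's contribution is the third line, where the Normal's 95% band captures roughly 95% of the sums even though each summand is heavily skewed.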

Why Normals are everywhere: Height is the sum of many genetic and environmental factors. Measurement error is the sum of many small noise sources. Test scores aggregate many independent question outcomes. Stock returns are driven by many independent events. Whenever a quantity is the result of many small, independent contributions, the CLT kicks in and the result looks Normal. This is why the Normal distribution dominates statistics.

Worked example — 10 dice: Roll a 6-sided die 10 times. Let X = X1 + ... + X10. Each die has E[Xi] = 3.5 and Var(Xi) = 35/12. By the CLT, X ≈ N(10 × 3.5, 10 × 35/12) = N(35, 29.17).

What is P(X ≤ 25 or X ≥ 45)? Let Y ~ N(35, 29.17) be the approximating Normal. Using a continuity correction (because dice are discrete):

P(X ≤ 25) ≈ P(Y < 25.5) = Φ((25.5 − 35) / √29.17) = Φ(−1.76) ≈ 0.039
P(X ≥ 45) ≈ P(Y > 44.5) = 1 − Φ((44.5 − 35) / √29.17) = 1 − Φ(1.76) ≈ 0.039
P(X ≤ 25 or X ≥ 45) ≈ 0.039 + 0.039 = 0.078

Notice the continuity correction: we used 25.5 instead of 25 and 44.5 instead of 45. This accounts for the fact that we are approximating a discrete distribution with a continuous one. You need a continuity correction whenever the original Xi are discrete (dice, coins, Poisson counts). You do not need it when the Xi are already continuous (uniforms, exponentials).
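Ten dice is small enough that the exact answer can be computed by repeated convolution, which gives a way to judge how much the continuity correction helps. This sketch compares the exact tail probability with the Normal approximation with and without the correction; the function name `dice_sum_pmf` is illustrative.

```python
from statistics import NormalDist

def dice_sum_pmf(num_dice: int) -> dict:
    """Exact PMF of the sum of `num_dice` fair six-sided dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(num_dice):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

pmf = dice_sum_pmf(10)
exact = sum(p for total, p in pmf.items() if total <= 25 or total >= 45)

clt = NormalDist(mu=10 * 3.5, sigma=(10 * 35 / 12) ** 0.5)
approx = clt.cdf(25.5) + (1 - clt.cdf(44.5))      # with continuity correction
no_correction = clt.cdf(25) + (1 - clt.cdf(45))   # without, for comparison

print(round(exact, 4), round(approx, 4), round(no_correction, 4))
```

The corrected approximation should sit noticeably closer to the exact value than the uncorrected one, which understates both tails by cutting each boundary bar in half.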

Check: You average 10,000 independent Exp(λ=8) samples. What does the CLT say about the distribution of this average?

Chapter 6: CLT Proof Sketch

Why does the CLT work? The full proof uses moment-generating functions (MGFs) or characteristic functions. Here is the idea at a high level.

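A random variable X has a moment-generating function $M_X(t) = E[e^{tX}]$ whenever that expectation is finite near t = 0 (when it is not, the proof switches to characteristic functions, which always exist). The key property: if X and Y are independent, then $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$. Multiplication in MGF-space corresponds to convolution in distribution-space.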

Let $Z_i = (X_i - \mu)/\sigma$ be the standardized version of each $X_i$. The standardized sum is $S_n = (Z_1 + \cdots + Z_n)/\sqrt{n}$.

$$M_{S_n}(t) = \left[ M_Z\!\left(t/\sqrt{n}\right) \right]^{n}$$

Now Taylor-expand $M_Z(t/\sqrt{n})$ around t = 0. Since Z is standardized, $E[Z] = 0$ and $E[Z^2] = 1$:

$$M_Z\!\left(t/\sqrt{n}\right) \approx 1 + 0 + \frac{1}{2}\left(\frac{t}{\sqrt{n}}\right)^{2} + \cdots = 1 + \frac{t^2}{2n} + \cdots$$

Raising to the n-th power:

$$\left[ 1 + \frac{t^2}{2n} \right]^{n} \to e^{t^2/2} \quad \text{as } n \to \infty$$

But $e^{t^2/2}$ is exactly the MGF of the standard Normal N(0, 1). Since the MGF uniquely determines the distribution, the standardized sum converges to a standard Normal. Done.
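The limit can be watched numerically for one concrete choice of Z. As an assumption for illustration, take each Z_i to be +1 or −1 with probability 1/2 (a fair coin recoded to have mean 0 and variance 1). Its MGF is M_Z(s) = (e^s + e^(−s))/2 = cosh(s), so the MGF of the standardized sum is cosh(t/√n)^n.

```python
import math

def mgf_standardized_sum(t: float, n: int) -> float:
    """MGF of S_n = (Z_1 + ... + Z_n) / sqrt(n) when each Z_i is +1 or -1 with probability 1/2.

    For that choice E[Z] = 0, Var(Z) = 1, and M_Z(s) = cosh(s).
    """
    return math.cosh(t / math.sqrt(n)) ** n

t = 1.0
target = math.exp(t * t / 2)          # MGF of N(0, 1) evaluated at t
for n in (1, 10, 100, 10_000, 1_000_000):
    print(n, mgf_standardized_sum(t, n), target)
```

The printed values should creep toward the target as n grows, with the gap shrinking roughly in proportion to 1/n; that gap is the higher-order Taylor terms dying off.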

The deep reason: The $(1 + x/n)^n \to e^x$ limit is the engine of the CLT. Each individual variable contributes a tiny perturbation, $t^2/(2n)$, but n of them compound exponentially into the Gaussian shape. The higher-order terms (skewness, kurtosis of the original distribution) get divided by higher powers of $\sqrt{n}$ and vanish. Only the first two moments, mean and variance, survive.

Worked example — convergence rate: How fast does the CLT kick in? For symmetric distributions (like uniforms), n = 5 already gives a good Normal approximation. For heavily skewed distributions (like Exp(λ)), you might need n = 30–50. For Bernoulli(p) with p very close to 0 or 1, you might need n > 100. A common rule of thumb: np > 5 and n(1−p) > 5 for the Binomial.

What the CLT does NOT say: The CLT requires finite mean and finite variance. Sums of draws from a distribution without a finite variance (like the Cauchy distribution, whose mean is also undefined) do not converge to Normal. The sum of independent Cauchy random variables is still Cauchy; it never becomes Gaussian. This is a real limitation, not a theoretical curiosity: financial returns sometimes exhibit "fat tails" that violate CLT assumptions.
Step | What happens
1. Standardize | $Z_i = (X_i - \mu)/\sigma$, so $E[Z] = 0$, $\mathrm{Var}(Z) = 1$
2. MGF of sum | Product of individual MGFs: $[M_Z(t/\sqrt{n})]^{n}$
3. Taylor expand | $M_Z(t/\sqrt{n}) \approx 1 + t^2/(2n)$
4. Take limit | $(1 + t^2/(2n))^{n} \to e^{t^2/2}$, the MGF of N(0, 1)
Check: Why does the CLT fail for the Cauchy distribution?

Chapter 7: Showcase — CLT Demonstrator

This is the payoff simulation. Choose a source distribution (Dice, Uniform, Exponential, or Beta), set how many you sum (N), and watch the histogram of 2000 samples converge to a Normal as N increases. The orange curve is the theoretical Normal predicted by the CLT.

[Interactive simulation: Central Limit Theorem Demonstrator. Choose a distribution and increase N (initially 1) to sum more independent copies. The teal histogram should approach the orange Normal curve. Resample draws fresh samples.]

What to explore:
• Start with Dice, N=1. The histogram is flat (uniform on 1–6). Increase N to 2 — triangular. By N=5, it is already bell-shaped.
• Switch to Exponential. At N=1 it is heavily right-skewed. By N=10 it is roughly symmetric. By N=30 it is a near-perfect bell.
• Try Beta(2,5) — an asymmetric shape. By N=15, the skew has melted into symmetry.
• The orange Normal curve is the CLT prediction: $N(\mu = N \cdot E[X_i],\ \sigma^2 = N \cdot \mathrm{Var}(X_i))$. See how the histogram hugs it more tightly as N grows.

Worked example — algorithm runtime: You have an algorithm with unknown mean runtime μ and known variance $\sigma^2 = 4\ \text{sec}^2$. You want to estimate μ to within ±0.5 seconds with 95% confidence. How many trials n do you need?

By the CLT, the sample mean is approximately X̄ ~ N(μ, 4/n), so its standard deviation is 2/√n. We need P(−0.5 ≤ X̄ − μ ≤ 0.5) = 0.95. Standardizing:

P(−0.5√n/2 ≤ Z ≤ 0.5√n/2) = 0.95

For 95% confidence, we need 0.5√n/2 = 1.96, so √n = 7.84, giving n ≈ 61.5. Rounding up, we need 62 runs.
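The same calculation generalizes to any standard deviation, margin, and confidence level: solve z·σ/√n ≤ margin for n, where z is the two-sided Normal critical value. Here is a minimal sketch; the function name `required_trials` and its parameter names are made up for illustration, and it assumes n is large enough for the CLT approximation to be reasonable.

```python
import math
from statistics import NormalDist

def required_trials(sigma: float, margin: float, confidence: float) -> int:
    """Smallest n with P(|sample mean - mu| <= margin) >= confidence under the CLT."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided critical value
    return math.ceil((z * sigma / margin) ** 2)

print(required_trials(sigma=2.0, margin=0.5, confidence=0.95))
```

Because n scales with (σ/margin)², halving the margin quadruples the number of runs you need.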

Check: When N = 1 with the Dice distribution, the histogram looks uniform. What shape does it approach as N increases?

Chapter 8: Worked Problems

Problem 1: Beta expectation. A website A/B test shows 30 clicks out of 100 visitors for variant A. Using a uniform prior, what is the posterior distribution for the click-through rate, and what is its 95% credible interval?

With prior Beta(1, 1) and data h = 30, t = 70: posterior is Beta(31, 71). Mean = 31/102 ≈ 0.304. The 95% credible interval can be computed numerically: approximately [0.22, 0.40]. The posterior tells us the true rate is likely between 22% and 40%.

$$E[X] = \frac{31}{31 + 71} = \frac{31}{102} \approx 0.304 \qquad \mathrm{SD} = \sqrt{\frac{31 \times 71}{102^2 \times 103}} \approx 0.045$$
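"Computed numerically" can be as simple as accumulating the PDF on a fine grid until the running total crosses 2.5% and 97.5%. The sketch below does that for an equal-tailed interval using only the standard library; the function names, the grid size, and the choice of an equal-tailed (rather than highest-density) interval are assumptions for illustration. A statistics library's Beta quantile function would be the usual tool.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def equal_tailed_interval(a, b, mass=0.95, steps=100_000):
    """Equal-tailed credible interval from a midpoint-rule CDF on a grid."""
    tail = (1 - mass) / 2
    cdf, lower, upper = 0.0, None, None
    for i in range(steps):
        x = (i + 0.5) / steps
        cdf += beta_pdf(x, a, b) / steps      # running integral of the PDF
        if lower is None and cdf >= tail:
            lower = x
        if upper is None and cdf >= 1 - tail:
            upper = x
            break
    return lower, upper

lo, hi = equal_tailed_interval(31, 71)
print(round(lo, 3), round(hi, 3))
```

The result should agree with the interval quoted above to two decimal places, and sit slightly to the right of mean ± 1.96 SD because Beta(31, 71) has a longer right tail.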

Problem 2: Convolution of dice. You roll two 4-sided dice (each uniform on {1,2,3,4}). What is P(sum = 5)?

Pairs that sum to 5: (1,4), (2,3), (3,2), (4,1). Each has probability (1/4)(1/4) = 1/16.

P(X + Y = 5) = 4 × (1/16) = 4/16 = 1/4

Problem 3: CLT for proportions. A fair coin is flipped 400 times. What is the probability of getting between 185 and 215 heads (inclusive)?

Let S = number of heads. S ~ Bin(400, 0.5). By CLT, S ≈ N(200, 100). Using continuity correction:

P(185 ≤ S ≤ 215) ≈ P(184.5 < Y < 215.5) = Φ((215.5−200)/10) − Φ((184.5−200)/10)
= Φ(1.55) − Φ(−1.55) = 0.9394 − 0.0606 = 0.879

Problem 4: Updating with a strong prior. A doctor believes a treatment has a 60% success rate based on previous experience, modeled as Beta(12, 8). A new trial gives 15 successes and 25 failures. What is the posterior, and how does it differ from using a uniform prior?

With Beta(12, 8) prior: posterior = Beta(12 + 15, 8 + 25) = Beta(27, 33). Mean = 27/60 = 0.45.

With Beta(1, 1) prior: posterior = Beta(16, 26). Mean = 16/42 = 0.381.

The informative prior pulls the estimate up toward 0.6, while the data pulls it down toward 15/40 = 0.375. The compromise (0.45) is between the two. With more data, the prior's influence would diminish.

Problem 5: Sample size for CLT. Light bulbs have mean lifetime 1000 hours with SD 200 hours. An engineer tests n bulbs and averages their lifetimes. How large must n be so the average is within 20 hours of the true mean with 99% probability?

P(|X̄ − μ| ≤ 20) = 0.99  ⇒  20√n / 200 = 2.576  ⇒  √n = 25.76  ⇒  n = 664
Check: In Problem 3, why did we use 184.5 and 215.5 instead of 185 and 215?

Chapter 9: Connections

The Beta distribution and the Central Limit Theorem are not isolated results — they connect to nearly every branch of statistics, machine learning, and engineering.

Where it leads | How Beta / CLT appear
Bayesian inference | Beta priors are the standard starting point for binary outcomes; conjugate updating makes sequential inference tractable
A/B testing | Click-through rates modeled as Beta posteriors; compare two Betas to decide which variant wins
Thompson sampling (RL) | Multi-armed bandit algorithms sample from Beta posteriors to balance exploration and exploitation
Confidence intervals | The CLT justifies the ±1.96σ/√n formula for 95% confidence intervals on the mean
Hypothesis testing | z-tests and t-tests rely on the CLT to assume sample means are Normal
Machine learning | SGD averages many gradient samples; the CLT explains why minibatch gradients behave predictably
Signal processing | Convolution of signals is identical to convolution of random variables: same math, different domain
Dirichlet distribution | The multivariate generalization of Beta; used for topic models (LDA) and categorical priors
The deeper pattern: The Beta distribution answers "what is the probability of the probability?" — a fundamentally Bayesian question. The CLT answers "why is everything Normal?" — because addition is the universal operation on uncertainty. Together, they form two of the most powerful tools in applied probability. The Beta handles your beliefs; the CLT handles your sums.

What we built:

• Beta distribution & its shape parameters
• Conjugate prior property: Beta + data = Beta
• Convolution formula for sums of RVs
• Closed-form results: Poisson, Binomial, Normal
• Central Limit Theorem (sum & average)
• Proof sketch via MGFs

What comes next:

Chapter 13: Sampling — generating random variables
• MCMC and Monte Carlo methods
• Bootstrap: resampling to estimate uncertainty
• Bayesian networks and graphical models

Master Beta and CLT and you hold the two keys to applied statistics: principled belief updating and the universal convergence to Normality.

The three big results, one more time: (1) Beta(a, b) is the distribution for an unknown probability with a − 1 successes and b − 1 failures observed. (2) Conjugate updating: prior Beta(a, b) + h heads, t tails = posterior Beta(a + h, b + t). (3) CLT: the sum of n IID variables with mean $\mu$ and variance $\sigma^2$ is approximately $N(n\mu,\ n\sigma^2)$ for large n. These three facts carry you through Bayesian statistics, experimental design, and hypothesis testing.
"The central limit theorem is the supreme law of
Unreason... the more perfectly it is obeyed."
— Sir Francis Galton
Check: In Thompson sampling for multi-armed bandits, why is the Beta distribution a natural choice?