Ch 12: Beta Distribution & CLT — Piech Probability CS

Chapter 0: Why Beta & CLT?

You flip a coin 10 times and see 9 heads. What is the true probability of heads? You could say 0.9, but that is a single number — it does not capture how uncertain you are. With only 10 flips, the true probability could easily be 0.7 or 0.95. What you really want is a distribution over the probability itself.

That is exactly what the Beta distribution gives you: a probability distribution whose support is [0, 1], perfectly shaped to express beliefs about an unknown probability.

Now consider a different question. You roll a die 100 times and sum the results. What does the distribution of that sum look like? Surprisingly, it looks like a bell curve: a Normal distribution. Each roll is discrete and uniform, and the sum is still discrete, yet its shape is almost exactly Gaussian. This is the Central Limit Theorem (CLT), perhaps the single most important result in all of probability.

This chapter in one sentence: The Beta distribution lets you reason about uncertainty in probabilities, and the CLT explains why Normal distributions appear everywhere — whenever you sum many independent random variables, no matter their original shape, the sum converges to a Normal.

Worked example — why a point estimate fails: A new drug is tested on 6 patients. 4 are cured, 2 are not. A point estimate gives P(cure) = 4/6 ≈ 0.667. But would you bet your life on that number? With only 6 trials, the true cure rate could plausibly be anywhere from 0.3 to 0.9. The Beta distribution captures this — Beta(5, 3) has mean 0.625 and standard deviation 0.16, which honestly reflects our uncertainty.

Topic	Core question it answers
Beta distribution	What is the distribution of an unknown probability?
Conjugate priors	How do we update beliefs cleanly as new data arrives?
Adding random variables	What is the distribution of X + Y?
Convolution	How do we compute the PDF/PMF of a sum?
Central Limit Theorem	Why does the sum of anything approach a Normal?

Check: Why is a single point estimate (like 9/10 = 0.9) insufficient for describing an unknown probability?

• It is always wrong
• It cannot express how uncertain we are about the true value
• Probabilities must always be 0 or 1

Chapter 1: The Beta Distribution

The Beta distribution is a continuous distribution on [0, 1] controlled by two shape parameters, a and b. Its PDF is:

f(x) = (1 / B(a, b)) · x^{a−1} (1 − x)^{b−1} for x ∈ [0, 1]

where B(a, b) = ∫₀¹ t^{a−1} (1 − t)^{b−1} dt is the Beta function, a normalizing constant that ensures the PDF integrates to 1. You rarely need to compute B(a, b) by hand: most numerical libraries provide it, and it can also be built from the Gamma function via B(a, b) = Γ(a)Γ(b) / Γ(a + b).
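The PDF above can be evaluated with nothing more than the standard library. The following is a minimal sketch, assuming a, b ≥ 1 so the density stays finite at the endpoints; the function names `beta_function`, `beta_pdf`, and `integrate_pdf` are illustrative, not from the course.

```python
import math

def beta_function(a: float, b: float) -> float:
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b), via log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at x; zero outside [0, 1]. Assumes a, b >= 1."""
    if x < 0.0 or x > 1.0:
        return 0.0
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_function(a, b)

def integrate_pdf(a: float, b: float, steps: int = 10_000) -> float:
    """Midpoint-rule integral of the PDF over [0, 1]; should be close to 1."""
    h = 1.0 / steps
    return sum(beta_pdf((i + 0.5) * h, a, b) for i in range(steps)) * h

print(beta_function(1, 1))    # B(1, 1) = 1, so Beta(1, 1) is the uniform density
print(integrate_pdf(10, 2))   # numerically close to 1
```

The numerical integral is only a sanity check that B(a, b) really does normalize the density.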

The key statistics:

E[X] = a / (a + b)
Var(X) = ab / [(a + b)² (a + b + 1)]

The parameters a and b control the shape in an intuitive way. Think of a as counting "successes + 1" and b as "failures + 1". When a > b, the distribution leans toward 1 (high probability). When b > a, it leans toward 0. When a = b, it is symmetric around 0.5.

Shape intuition: a and b act like pseudo-counts. Beta(1, 1) is the uniform distribution — complete ignorance. Beta(10, 10) is a tight bell centered at 0.5 — strong belief the probability is near half. Beta(100, 1) is a spike near 1.0 — overwhelming evidence of success. More data (larger a + b) means a tighter, more confident distribution.

Beta PDF Explorer

Drag the sliders to change a and b. Watch how the shape shifts.

Worked example — coin flips: You flip a coin 10 times and observe 9 heads, 1 tail. With a uniform prior (Beta(1,1)), your posterior belief about the coin's heads probability is X ~ Beta(1 + 9, 1 + 1) = Beta(10, 2). The mean is 10/12 ≈ 0.833 and the variance is 20 / (144 × 13) ≈ 0.0107, giving a standard deviation of about 0.103.

The Beta tells us: yes, the probability is probably high (the peak is near 0.9), but there is meaningful spread — values from 0.6 to 1.0 are all plausible. With 100 heads out of 110 flips, we would get Beta(101, 11), still centered near 0.9 but much tighter.
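The numbers in this example follow directly from the mean and variance formulas. Here is a small sketch that reproduces them; the helper name `beta_summary` is hypothetical, and the mode formula (a − 1)/(a + b − 2), valid for a, b > 1, is a standard fact added here rather than something stated in the text.

```python
import math

def beta_summary(a: float, b: float) -> dict:
    """Mean, variance, SD and (when a, b > 1) mode of Beta(a, b)."""
    total = a + b
    variance = a * b / (total ** 2 * (total + 1))
    mode = (a - 1) / (total - 2) if a > 1 and b > 1 else None
    return {"mean": a / total, "var": variance, "sd": math.sqrt(variance), "mode": mode}

for a, b in [(10, 2), (101, 11)]:
    s = beta_summary(a, b)
    print(f"Beta({a}, {b}): mean={s['mean']:.3f} sd={s['sd']:.3f} mode={s['mode']:.3f}")
# Beta(10, 2): mean=0.833 sd=0.103 mode=0.900
# Beta(101, 11): mean=0.902 sd=0.028 mode=0.909
```

The standard deviation drops from about 0.10 to about 0.03 as the flip count grows from 10 to 110, which is the "much tighter" claim in numeric form.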

Beta(a, b)	Shape	Mean	Interpretation
Beta(1, 1)	Flat (uniform)	0.50	Total ignorance
Beta(2, 5)	Leans toward 0 (right-skewed)	0.29	1 success, 4 failures observed
Beta(10, 2)	Leans toward 1 (left-skewed)	0.83	9 successes, 1 failure observed
Beta(50, 50)	Tight symmetric	0.50	98 trials, evenly split

Check: If you observe 4 cured out of 6 patients (with a uniform prior), what Beta distribution represents your belief about the cure rate?

• Beta(5, 3)
• Beta(4, 2)
• Beta(4, 6)

Chapter 2: Beta as Conjugate Prior

In Chapter 1, we started with a uniform prior Beta(1, 1) and updated it with observed data. But what if we already had some prior belief? Maybe we think the coin is roughly fair — we could start with Beta(5, 5) instead of Beta(1, 1). The beautiful result: the posterior is still a Beta.

Conjugate prior property: If your prior is Beta(a, b) and you observe h heads and t tails, your posterior is Beta(a + h, b + t). The prior and posterior are from the same family — this is what "conjugate" means.

Here is the derivation. Let our prior belief about the heads probability be X ~ Beta(a, b). After observing h heads and t tails:

f(X = x | H = h, T = t) = P(H=h, T=t | X=x) · f(X=x) / P(H=h, T=t)

The likelihood is Binomial: P(H=h, T=t | X=x) ∝ x^h (1 − x)^t. The prior is Beta: f(X=x) ∝ x^{a−1} (1 − x)^{b−1}. Multiplying:

f(X = x | data) ∝ x^h (1 − x)^t · x^{a−1} (1 − x)^{b−1} = x^{a+h−1} (1 − x)^{b+t−1}

That is the kernel of Beta(a + h, b + t). The normalizing constant handles itself. We just add the observed counts to the prior parameters.

Think of a and b as pseudo-observations: a − 1 imaginary successes and b − 1 imaginary failures. When you start with Beta(1, 1), there are 0 imaginary trials — complete ignorance. When you start with Beta(10, 10), you are asserting the equivalent of 18 coin flips that came out evenly. New real data gets blended with these pseudo-observations.

Bayesian Updating: Prior → Posterior

Set a prior Beta(a, b). Each click of Flip generates a coin flip (true p = slider value). Watch the posterior update in real time.

Worked example — strong prior vs. weak prior: You believe a coin is fair: prior Beta(10, 10). You then observe 8 heads, 2 tails. Your posterior is Beta(18, 12). The posterior mean is 18/30 = 0.60 — the prior has "pulled" your estimate toward 0.5. Had you used a weak prior Beta(1, 1), the posterior would be Beta(9, 3) with mean 0.75 — much more influenced by the data. The strength of the prior (measured by a + b) determines how many real observations it takes to overwhelm it.
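Conjugate updating is just parameter addition, so it can be sketched in a few lines. The helper names `update_beta` and `beta_mean` are illustrative; the numbers are the ones from the strong-prior versus weak-prior example.

```python
from typing import Tuple

def update_beta(prior: Tuple[float, float], heads: int, tails: int) -> Tuple[float, float]:
    """Posterior parameters: Beta(a, b) prior plus h heads, t tails gives Beta(a + h, b + t)."""
    a, b = prior
    return (a + heads, b + tails)

def beta_mean(params: Tuple[float, float]) -> float:
    a, b = params
    return a / (a + b)

strong = update_beta((10, 10), heads=8, tails=2)
weak = update_beta((1, 1), heads=8, tails=2)
print(strong, beta_mean(strong))   # (18, 12) 0.6
print(weak, beta_mean(weak))       # (9, 3) 0.75

# Updating one flip at a time reaches the same posterior as one batch update.
step_by_step = (10, 10)
for flip in "HHTHHHHTHH":          # 8 heads and 2 tails, in an arbitrary order
    step_by_step = update_beta(step_by_step, int(flip == "H"), int(flip == "T"))
print(step_by_step == strong)      # True
```

The last check illustrates why conjugacy suits sequential data: the order of the observations does not matter, only the counts.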

Bayesians vs. Frequentists: Are you allowed to just make up priors? Bayesians say yes — it lets you incorporate domain knowledge. Frequentists say no — results should depend only on observed data. For small datasets, a good prior can dramatically improve predictions. For large datasets, the prior washes out and both approaches converge to the same answer.

Check: Your prior is Beta(5, 5) and you observe 20 heads and 5 tails. What is the posterior?

• Beta(20, 5)
• Beta(25, 10)
• Beta(21, 6)

Chapter 3: Adding Random Variables

We now shift to a completely different topic: what happens when you add two random variables together? If X and Y are random variables, what is the distribution of Z = X + Y? Computing the distribution of a sum of independent random variables is called convolution, and it underlies the Central Limit Theorem.

For discrete random variables, the PMF of the sum is:

P(X + Y = n) = ∑_{i=−∞}^{∞} P(X = i, Y = n − i)

If X and Y are independent, this simplifies to:

P(X + Y = n) = ∑_i P(X = i) · P(Y = n − i)

The idea: to get a sum of n, enumerate all the ways X and Y can combine to produce n. X could be 0 and Y could be n. X could be 1 and Y could be n − 1. Add up the probabilities of all these mutually exclusive cases.

Key intuition: Convolution is not adding PDFs. It is computing the distribution of a new random variable Z = X + Y by considering every possible pair (X = i, Y = n − i) that sums to n. The probabilities of all these pairs get summed (discrete) or integrated (continuous).

Worked example — sum of two dice: Let X and Y each be uniform on {1, 2, 3, 4, 5, 6}. What is P(X + Y = 7)?

We need all pairs (i, 7 − i) where both i and 7 − i are in {1,...,6}. That gives i = 1,2,3,4,5,6 — all six values work. Each pair has probability (1/6)(1/6) = 1/36.

P(X + Y = 7) = ∑_{i=1}^{6} P(X=i) · P(Y=7−i) = 6 × (1/36) = 6/36 = 1/6

For P(X + Y = 4): pairs are (1,3), (2,2), (3,1). That is 3/36 = 1/12. The triangle-shaped PMF of the dice sum peaks at 7 because 7 has the most contributing pairs.

Sum n	Contributing pairs	P(X+Y = n)
2	(1,1)	1/36
5	(1,4),(2,3),(3,2),(4,1)	4/36
7	(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)	6/36
12	(6,6)	1/36
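The convolution sum for independent discrete variables translates almost line for line into code. This is a minimal sketch, assuming each PMF is stored as a `{value: probability}` dictionary; `convolve_pmf` is a hypothetical helper name, and exact fractions are used so the results match the table.

```python
from collections import defaultdict
from fractions import Fraction

def convolve_pmf(pmf_x: dict, pmf_y: dict) -> dict:
    """PMF of X + Y for independent X and Y given as {value: probability} dicts."""
    pmf_sum = defaultdict(Fraction)
    for i, p_i in pmf_x.items():
        for j, p_j in pmf_y.items():
            # The pair (X = i, Y = j) contributes to the outcome i + j.
            pmf_sum[i + j] += p_i * p_j
    return dict(pmf_sum)

die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmf(die, die)
print(two_dice[7])   # 1/6
print(two_dice[4])   # 1/12
print(two_dice[2])   # 1/36
```

Looping over every (i, j) pair and accumulating at i + j is equivalent to summing over i with j = n − i fixed, just organized by pair rather than by target sum.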

Check: When computing P(X + Y = n) for independent X, Y, why do we sum over all i?

• The events (X=i, Y=n−i) for different i are mutually exclusive and exhaustive ways to get a sum of n
• We are averaging the probabilities across all values of i
• The PDFs of X and Y must be added together

Chapter 4: Convolution Results

For several important distribution families, the sum of two independent random variables stays in the same family. These closed-form results save you from computing the convolution sum or integral by hand.

The big three convolution theorems:
• Poissons: X ~ Poi(λ₁), Y ~ Poi(λ₂) ⇒ X + Y ~ Poi(λ₁ + λ₂)
• Binomials (same p): X ~ Bin(n₁, p), Y ~ Bin(n₂, p) ⇒ X + Y ~ Bin(n₁ + n₂, p)
• Normals: X ~ N(μ₁, σ₁²), Y ~ N(μ₂, σ₂²) ⇒ X + Y ~ N(μ₁ + μ₂, σ₁² + σ₂²)

Proof sketch for Poissons: Let X ~ Poi(λ₁) and Y ~ Poi(λ₂). Using the convolution formula:

P(X+Y = n) = ∑_{k=0}^{n} P(X=k) · P(Y=n−k) = ∑_{k=0}^{n} [e^{−λ₁} λ₁^k / k!] · [e^{−λ₂} λ₂^{n−k} / (n−k)!]

= (e^{−(λ₁+λ₂)} / n!) · ∑_{k=0}^{n} C(n,k) λ₁^k λ₂^{n−k} = e^{−(λ₁+λ₂)} (λ₁+λ₂)^n / n!

The second line uses 1/(k!(n−k)!) = C(n,k)/n!, and the last step uses the Binomial theorem: (a + b)^n = ∑_{k=0}^{n} C(n,k) a^k b^{n−k}. The result is exactly the PMF of Poi(λ₁ + λ₂).

The Normal result is especially powerful because it extends to any number of independent summands. The means add, the variances add, and the result is still Normal. This is the algebraic backbone of the CLT.

Sum of two independent uniforms: If X, Y ~ Uni(0,1) independently, the PDF of Z = X + Y is triangular:

f(z) = z if 0 < z ≤ 1
f(z) = 2 − z if 1 < z ≤ 2
f(z) = 0 otherwise

A uniform plus a uniform is not uniform. The distribution peaks at z = 1 and tapers linearly to zero at z = 0 and z = 2. Already with just two summands, we see the beginnings of a bell shape. This is the CLT in embryo.
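A quick simulation can confirm the triangular shape. The sketch below compares an empirical CDF of X + Y with the CDF obtained by integrating the triangular density (z²/2 on [0, 1], and 1 − (2 − z)²/2 on [1, 2]); the sample size and seed are arbitrary choices.

```python
import random

def triangular_cdf(z: float) -> float:
    """CDF of Z = X + Y for independent X, Y ~ Uni(0, 1), from integrating f(z)."""
    if z <= 0:
        return 0.0
    if z <= 1:
        return z * z / 2
    if z <= 2:
        return 1 - (2 - z) ** 2 / 2
    return 1.0

def empirical_cdf(z: float, samples: list) -> float:
    """Fraction of samples that are <= z."""
    return sum(s <= z for s in samples) / len(samples)

rng = random.Random(0)
samples = [rng.random() + rng.random() for _ in range(200_000)]
for z in (0.5, 1.0, 1.5):
    print(z, triangular_cdf(z), round(empirical_cdf(z, samples), 3))
```

The theoretical values at 0.5, 1.0 and 1.5 are 0.125, 0.5 and 0.875; the simulated fractions should land close to them.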

Worked example — customer arrivals: Customers arrive at a store from two independent sources. Source 1: Poi(3) per hour. Source 2: Poi(5) per hour. Total arrivals per hour: Poi(3 + 5) = Poi(8). The probability of exactly 10 total arrivals: P(Z = 10) = e⁻⁸ · 8¹⁰ / 10! ≈ 0.0993.
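Both the Poisson closure result and this arrival example can be checked numerically. The sketch below evaluates P(Z = 10) two ways: directly from Poi(8), and by convolving Poi(3) with Poi(5). The function names are illustrative.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poi(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def convolved_poisson_pmf(n: int, lam1: float, lam2: float) -> float:
    """P(X + Y = n) from the convolution sum, for independent Poi(lam1), Poi(lam2)."""
    return sum(poisson_pmf(k, lam1) * poisson_pmf(n - k, lam2) for k in range(n + 1))

direct = poisson_pmf(10, 8.0)
via_convolution = convolved_poisson_pmf(10, 3.0, 5.0)
print(round(direct, 4))                            # 0.0993
print(math.isclose(direct, via_convolution))       # True
```

Agreement up to floating-point error is what the Binomial-theorem step in the proof sketch predicts.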

Distribution family	Closed under addition?	Condition
Poisson	Yes	Independence
Binomial	Yes	Same p, independence
Normal	Yes	Independence
Uniform	No	Sum is triangular, not uniform
Exponential	No	Sum of n IID Exp(λ) variables is Gamma(n, λ)

Check: X ~ Poi(4) and Y ~ Poi(7) are independent. What is the distribution of X + Y?

• Poi(28)
• Poi(11)
• N(11, 11)

Chapter 5: The Central Limit Theorem

Imagine summing 100 independent Uni(0, 1) random variables. Each one is flat and rectangular. But their sum? A beautiful bell curve, almost indistinguishable from a Normal. This is the Central Limit Theorem in action.

CLT (Sum version): Let X₁, X₂, ..., X_n be IID random variables with mean μ and variance σ². Then for large n, the sum is approximately Normal:

∑_{i=1}^{n} X_i ≈ N(nμ, nσ²)

CLT (Average version): Equivalently, the sample mean is approximately Normal:

(1/n) ∑_{i=1}^{n} X_i ≈ N(μ, σ²/n)

The two versions are equivalent: divide the sum by n and you get the average. The sum has variance nσ² (growing); the average has variance σ²/n (shrinking). Both are approximately Normal, and the approximation improves as n → ∞.

What makes this remarkable is how little it assumes. The original X_i can be uniform, exponential, Bernoulli, Poisson, or any other distribution: as long as it has finite mean and finite variance, the standardized sum converges to a Normal. The original distribution's shape gets washed out.

Why Normals are everywhere: Height is the sum of many genetic and environmental factors. Measurement error is the sum of many small noise sources. Test scores aggregate many independent question outcomes. Stock returns are driven by many independent events. Whenever a quantity is the result of many small, independent contributions, the CLT kicks in and the result looks Normal. This is why the Normal distribution dominates statistics.

Worked example — 10 dice: Roll a 6-sided die 10 times. Let X = X₁ + ... + X₁₀. Each die has E[X_i] = 3.5 and Var(X_i) = 35/12. By the CLT, X ≈ N(10 × 3.5, 10 × 35/12) = N(35, 29.17).

Suppose you win if the total is at most 25 or at least 45. What is P(X ≤ 25 or X ≥ 45)? Let Y ~ N(35, 29.17) be the approximating Normal. Using a continuity correction (because dice are discrete):

P(X ≤ 25) ≈ P(Y < 25.5) = Φ((25.5 − 35) / √29.17) = Φ(−1.76) ≈ 0.039

P(X ≥ 45) ≈ P(Y > 44.5) = 1 − Φ((44.5 − 35) / √29.17) = 1 − Φ(1.76) ≈ 0.039

P(win) ≈ 0.039 + 0.039 = 0.078

Notice the continuity correction: we used 25.5 instead of 25 and 44.5 instead of 45. This accounts for the fact that we are approximating a discrete distribution with a continuous one. You need a continuity correction whenever the original X_i are discrete (dice, coins, Poisson counts). You do not need it when the X_i are already continuous (uniforms, exponentials).
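For ten dice the exact answer is still cheap to compute, so the CLT estimate can be compared against it. The sketch below builds the exact PMF by repeated convolution and evaluates the continuity-corrected Normal approximation with `statistics.NormalDist`; `dice_sum_pmf` is a hypothetical helper name.

```python
from statistics import NormalDist

def dice_sum_pmf(num_dice: int) -> dict:
    """Exact PMF of the sum of num_dice fair six-sided dice, by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(num_dice):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

n = 10
approx = NormalDist(mu=n * 3.5, sigma=(n * 35 / 12) ** 0.5)
# Continuity correction: X <= 25 becomes Y < 25.5, and X >= 45 becomes Y > 44.5.
clt_estimate = approx.cdf(25.5) + (1 - approx.cdf(44.5))

pmf = dice_sum_pmf(n)
exact = sum(p for total, p in pmf.items() if total <= 25 or total >= 45)
print(f"CLT estimate: {clt_estimate:.4f}   exact: {exact:.4f}")
```

The two values should agree closely, which gives some confidence that n = 10 dice is already enough for the approximation in this range.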

Check: You average 10,000 independent Exp(λ=8) samples. What does the CLT say about the distribution of this average?

• It is approximately Normal with mean μ = 1/8 and very small variance σ²/10000
• It is still Exponential
• It is uniformly distributed

Chapter 6: CLT Proof Sketch

Why does the CLT work? The full proof uses moment-generating functions (MGFs) or characteristic functions. Here is the idea at a high level.

A random variable X has a moment-generating function M_X(t) = E[e^{tX}] whenever that expectation is finite near t = 0 (characteristic functions cover the cases where it is not). The key property: if X and Y are independent, then M_{X+Y}(t) = M_X(t) · M_Y(t). Multiplication in MGF-space corresponds to convolution in distribution-space.

Let Z_i = (X_i − μ) / σ be the standardized version of each X_i. The standardized sum is S_n = (Z₁ + ... + Z_n) / √n.

M_{S_n}(t) = [M_Z(t / √n)]ⁿ

Now Taylor-expand M_Z(t / √n) around t = 0. Since Z is standardized, E[Z] = 0 and E[Z²] = 1:

M_Z(t / √n) ≈ 1 + 0 + (1/2)(t / √n)² + ... = 1 + t²/(2n) + ...

Raising to the n-th power:

[1 + t²/(2n)]^n → e^{t²/2} as n → ∞

But e^{t²/2} is exactly the MGF of the standard Normal N(0, 1). Since the MGF uniquely determines the distribution, the standardized sum converges to a standard Normal. Done.

The deep reason: The (1 + x/n)ⁿ → e^x limit is the engine of the CLT. Each individual variable contributes a tiny perturbation (t²/(2n)), but n of them compound exponentially into the Gaussian shape. The higher-order terms (skewness, kurtosis of the original distribution) get divided by higher powers of √n and vanish. Only the first two moments — mean and variance — survive.
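The limit at the heart of the sketch can be watched numerically. As an illustrative assumption, take Z to be +1 or −1 with probability 1/2 each (so E[Z] = 0, Var(Z) = 1, and M_Z(t) = cosh t); this choice of distribution is not from the text, it is just a convenient case with a closed-form MGF.

```python
import math

def mgf_rademacher(t: float) -> float:
    """MGF of Z = +1 or -1 with probability 1/2 each: E[e^{tZ}] = cosh(t)."""
    return math.cosh(t)

def mgf_standardized_sum(t: float, n: int) -> float:
    """MGF of S_n = (Z_1 + ... + Z_n) / sqrt(n), using independence."""
    return mgf_rademacher(t / math.sqrt(n)) ** n

def second_order_term(t: float, n: int) -> float:
    """The truncated Taylor expression [1 + t^2 / (2n)]^n from the sketch."""
    return (1 + t * t / (2 * n)) ** n

t = 1.5
target = math.exp(t * t / 2)   # MGF of N(0, 1) evaluated at t
for n in (1, 10, 100, 10_000):
    print(n, mgf_standardized_sum(t, n), second_order_term(t, n), target)
```

Both columns should approach the target as n grows: the exact MGF of the standardized sum and the truncated Taylor expression differ only in terms that vanish with n.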

Worked example — convergence rate: How fast does the CLT kick in? For symmetric distributions (like uniforms), n = 5 already gives a good Normal approximation. For heavily skewed distributions (like Exp(λ)), you might need n = 30–50. For Bernoulli(p) with p very close to 0 or 1, you might need n > 100. A common rule of thumb: np > 5 and n(1−p) > 5 for the Binomial.

What the CLT does NOT say: The CLT requires finite mean and finite variance. Distributions that lack them (like the Cauchy distribution, whose mean is undefined and whose variance is infinite) do not converge to Normal. The sum of independent Cauchy random variables is still Cauchy; it never becomes Gaussian. This is a real limitation, not a theoretical curiosity: financial returns sometimes exhibit "fat tails" that violate CLT assumptions.
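The Cauchy failure is easy to see in simulation. The sketch below measures the interquartile range (IQR) of sample means: for Normal draws it shrinks like 1/√n, while for standard Cauchy draws it stays near 2 however many terms are averaged. The inverse-CDF sampler tan(π(U − 1/2)) and the use of the IQR as a spread measure are choices made for this illustration.

```python
import math
import random
import statistics

def draw_cauchy(rng: random.Random) -> float:
    """Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))

def draw_normal(rng: random.Random) -> float:
    return rng.gauss(0.0, 1.0)

def iqr_of_sample_means(draw, n_terms: int, n_reps: int, rng: random.Random) -> float:
    """IQR of n_reps sample means, each averaging n_terms IID draws."""
    means = [sum(draw(rng) for _ in range(n_terms)) / n_terms for _ in range(n_reps)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

rng = random.Random(3)
for n_terms in (1, 100):
    print(
        n_terms,
        round(iqr_of_sample_means(draw_cauchy, n_terms, 4000, rng), 2),
        round(iqr_of_sample_means(draw_normal, n_terms, 4000, rng), 2),
    )
```

The IQR is used instead of the standard deviation because the sample standard deviation of Cauchy data is itself unstable.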

Step	What happens
1. Standardize	Z_i = (X_i − μ)/σ so E[Z]=0, Var(Z)=1
2. MGF of sum	Product of individual MGFs: [M_Z(t/√n)]ⁿ
3. Taylor expand	M_Z(t/√n) ≈ 1 + t²/(2n)
4. Take limit	(1 + t²/(2n))^n → e^{t²/2} = MGF of N(0,1)

Check: Why does the CLT fail for the Cauchy distribution?

• The Cauchy distribution has infinite variance, so the CLT's assumptions are violated
• The Cauchy is continuous, and CLT only applies to discrete variables
• The Cauchy distribution is symmetric, which cancels out the CLT

Chapter 7: Showcase — CLT Demonstrator

This is the payoff simulation. Choose a source distribution (Dice, Uniform, Exponential, or Beta), set how many you sum (N), and watch the histogram of 2000 samples converge to a Normal as N increases. The orange curve is the theoretical Normal predicted by the CLT.

Central Limit Theorem — Interactive Demonstrator

Choose a distribution. Increase N to sum more independent copies. The histogram (teal) should approach the orange Normal curve. Click Resample for fresh draws.

What to explore:
• Start with Dice, N=1. The histogram is flat (uniform on 1–6). Increase N to 2 — triangular. By N=5, it is already bell-shaped.
• Switch to Exponential. At N=1 it is heavily right-skewed. By N=10 it is roughly symmetric. By N=30 it is a near-perfect bell.
• Try Beta(2,5) — an asymmetric shape. By N=15, the skew has melted into symmetry.
• The orange Normal curve is the CLT prediction: N(μ=N·E[X_i], σ²=N·Var(X_i)). See how the histogram hugs it more tightly as N grows.
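For readers without the interactive widget, the same experiment can be approximated offline. This is a rough sketch, not the demonstrator's actual code: it draws 2000 sums, then reports the sample mean, variance, and skewness, which should approach N·E[X_i], N·Var(X_i), and 0 respectively. The rate 1 for the exponential and the seed are arbitrary assumptions.

```python
import random
import statistics

def simulate_sums(draw, n_terms: int, n_samples: int, rng: random.Random) -> list:
    """Draw n_samples independent sums, each adding n_terms IID draws."""
    return [sum(draw(rng) for _ in range(n_terms)) for _ in range(n_samples)]

def skewness(xs: list) -> float:
    """Sample skewness: near 0 for a symmetric shape such as the Normal."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

def draw_die(rng: random.Random) -> int:
    return rng.randint(1, 6)       # E = 3.5, Var = 35/12

def draw_exponential(rng: random.Random) -> float:
    return rng.expovariate(1.0)    # E = 1, Var = 1 (rate 1 is an arbitrary choice)

rng = random.Random(42)
for n_terms in (1, 5, 30):
    sums = simulate_sums(draw_exponential, n_terms, 2000, rng)
    print(
        f"N={n_terms:2d}  mean={statistics.fmean(sums):6.2f} (CLT: {n_terms})  "
        f"var={statistics.pvariance(sums):6.2f} (CLT: {n_terms})  "
        f"skew={skewness(sums):5.2f}"
    )
```

Passing `draw_die` instead of `draw_exponential` gives the dice case; the skewness column is the numeric counterpart of watching the histogram become symmetric.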

Worked example — algorithm runtime: You have an algorithm with unknown mean runtime μ and known variance σ² = 4 sec². You want to estimate μ to within ±0.5 seconds with 95% confidence. How many trials n do you need?

By the CLT, the sample mean X̄ ~ N(μ, 4/n). We need P(−0.5 ≤ X̄ − μ ≤ 0.5) = 0.95. Standardizing:

P(−0.5√n/2 ≤ Z ≤ 0.5√n/2) = 0.95

For 95% confidence, we need 0.5√n/2 = 1.96, so √n = 7.84, giving n ≈ 61.5. Rounding up, we need 62 runs.
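The same calculation generalizes to any σ, margin, and confidence level: solve z·σ/√n ≤ margin for n and round up. Below is a small sketch using `statistics.NormalDist` for the z-value; `required_samples` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def required_samples(sigma: float, margin: float, confidence: float) -> int:
    """Smallest n with P(|sample mean - mu| <= margin) >= confidence under the CLT."""
    # Two-sided interval: put (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)

print(required_samples(sigma=2.0, margin=0.5, confidence=0.95))   # 62
```

Because n grows with (σ/margin)², halving the margin requires roughly four times as many runs.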

Check: When N = 1 with the Dice distribution, the histogram looks uniform. What shape does it approach as N increases?

• Exponential
• Normal (bell curve)
• Triangular

Chapter 8: Worked Problems

Problem 1: Beta expectation. A website A/B test shows 30 clicks out of 100 visitors for variant A. Using a uniform prior, what is the posterior distribution for the click-through rate, and what is its 95% credible interval?

With prior Beta(1, 1) and data h = 30, t = 70: posterior is Beta(31, 71). Mean = 31/102 ≈ 0.304. The 95% credible interval can be computed numerically: approximately [0.22, 0.40]. The posterior tells us the true rate is likely between 22% and 40%.

E[X] = 31 / (31 + 71) = 31/102 ≈ 0.304
SD = √(31 × 71 / (102² × 103)) ≈ 0.045
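One way to compute the credible interval "numerically" without any third-party library is to accumulate the PDF on a fine grid until the running total crosses 2.5% and 97.5%. This is a rough sketch of that idea, not the method the course necessarily uses; the grid size is an arbitrary choice and the function names are illustrative.

```python
import math

def beta_log_pdf(x: float, a: float, b: float) -> float:
    """Log-density of Beta(a, b) at x in (0, 1), via log-gamma for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

def beta_quantiles(a: float, b: float, probs: list, steps: int = 20_000) -> list:
    """Approximate quantiles by accumulating a midpoint-rule CDF on a grid."""
    h = 1.0 / steps
    targets = sorted(probs)
    found, cdf, idx = [], 0.0, 0
    for i in range(steps):
        x = (i + 0.5) * h
        cdf += math.exp(beta_log_pdf(x, a, b)) * h
        while idx < len(targets) and cdf >= targets[idx]:
            found.append((i + 1) * h)   # right edge of the cell where the CDF crossed
            idx += 1
    return found

lo, hi = beta_quantiles(31, 71, [0.025, 0.975])
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

This gives the equal-tailed interval; a library routine for the inverse Beta CDF would be the usual choice in practice.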

Problem 2: Convolution of dice. You roll two 4-sided dice (each uniform on {1,2,3,4}). What is P(sum = 5)?

Pairs that sum to 5: (1,4), (2,3), (3,2), (4,1). Each has probability (1/4)(1/4) = 1/16.

P(X + Y = 5) = 4 × (1/16) = 4/16 = 1/4

Problem 3: CLT for proportions. A fair coin is flipped 400 times. What is the probability of getting between 185 and 215 heads (inclusive)?

Let S = number of heads. S ~ Bin(400, 0.5). By CLT, S ≈ N(200, 100). Using continuity correction:

P(185 ≤ S ≤ 215) ≈ P(184.5 < Y < 215.5) = Φ((215.5−200)/10) − Φ((184.5−200)/10)

= Φ(1.55) − Φ(−1.55) = 0.9394 − 0.0606 = 0.879
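Since Bin(400, 0.5) is small enough to sum exactly, the approximation can be checked against the true value. The sketch below compares the exact Binomial probability with the Normal approximation, with and without the continuity correction; the variable names are illustrative.

```python
import math
from statistics import NormalDist

n, p = 400, 0.5

# Exact: sum the Binomial PMF over 185..215 inclusive.
exact = sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(185, 216))

normal = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
with_correction = normal.cdf(215.5) - normal.cdf(184.5)
without_correction = normal.cdf(215) - normal.cdf(185)

print(f"exact={exact:.4f}  corrected={with_correction:.4f}  uncorrected={without_correction:.4f}")
```

The corrected value should sit noticeably closer to the exact one, which is the practical argument for the half-unit adjustment.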

Problem 4: Updating with a strong prior. A doctor believes a treatment has a 60% success rate based on previous experience, modeled as Beta(12, 8). A new trial gives 15 successes and 25 failures. What is the posterior, and how does it differ from using a uniform prior?

With Beta(12, 8) prior: posterior = Beta(12 + 15, 8 + 25) = Beta(27, 33). Mean = 27/60 = 0.45.

With Beta(1, 1) prior: posterior = Beta(16, 26). Mean = 16/42 = 0.381.

The informative prior pulls the estimate up toward 0.6, while the data pulls it down toward 15/40 = 0.375. The compromise (0.45) is between the two. With more data, the prior's influence would diminish.

Problem 5: Sample size for CLT. Light bulbs have mean lifetime 1000 hours with SD 200 hours. An engineer tests n bulbs and averages their lifetimes. How large must n be so the average is within 20 hours of the true mean with 99% probability?

P(|X̄ − μ| ≤ 20) = 0.99 ⇒ 20√n / 200 = 2.576 ⇒ √n = 25.76 ⇒ n ≈ 663.6, so n = 664 bulbs

Check: In Problem 3, why did we use 184.5 and 215.5 instead of 185 and 215?

• Continuity correction: we are approximating a discrete distribution (Binomial) with a continuous one (Normal)
• It is a rounding convention for large sample sizes
• The Normal distribution only accepts non-integer inputs

Chapter 9: Connections

The Beta distribution and the Central Limit Theorem are not isolated results — they connect to nearly every branch of statistics, machine learning, and engineering.

Where it leads	How Beta / CLT appear
Bayesian inference	Beta priors are the standard starting point for binary outcomes; conjugate updating makes sequential inference tractable
A/B testing	Click-through rates modeled as Beta posteriors; compare two Betas to decide which variant wins
Thompson sampling (RL)	Multi-armed bandit algorithms sample from Beta posteriors to balance exploration and exploitation
Confidence intervals	The CLT justifies the ±1.96σ/√n formula for 95% confidence intervals on the mean
Hypothesis testing	z-tests and t-tests rely on the CLT to assume sample means are Normal
Machine learning	SGD averages many gradient samples; the CLT explains why minibatch gradients behave predictably
Signal processing	Convolution of signals is identical to convolution of random variables — same math, different domain
Dirichlet distribution	The multivariate generalization of Beta; used for topic models (LDA) and categorical priors
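To make the Thompson sampling row concrete, here is a minimal sketch of a Bernoulli bandit that keeps one Beta posterior per arm. The arm success rates, round count, and seed are made-up values for illustration, and the function name `thompson_sampling` is hypothetical.

```python
import random

def thompson_sampling(true_rates: list, rounds: int, rng: random.Random):
    """Bernoulli bandit: one Beta(a, b) posterior per arm, each starting at Beta(1, 1)."""
    posteriors = [[1, 1] for _ in true_rates]
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Sample a plausible success rate for each arm from its current posterior.
        draws = [rng.betavariate(a, b) for a, b in posteriors]
        arm = max(range(len(draws)), key=draws.__getitem__)
        reward = rng.random() < true_rates[arm]
        # Conjugate update: a success increments a, a failure increments b.
        posteriors[arm][0 if reward else 1] += 1
        pulls[arm] += 1
    return pulls, posteriors

pulls, posteriors = thompson_sampling([0.2, 0.5, 0.8], rounds=2000, rng=random.Random(7))
print(pulls)        # the arm with rate 0.8 should receive most of the pulls
print(posteriors)
```

Arms with wide posteriors occasionally produce high draws and get explored; as evidence accumulates the posteriors tighten and the best arm dominates.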

The deeper pattern: The Beta distribution answers "what is the probability of the probability?" — a fundamentally Bayesian question. The CLT answers "why is everything Normal?" — because addition is the universal operation on uncertainty. Together, they form two of the most powerful tools in applied probability. The Beta handles your beliefs; the CLT handles your sums.

What we built:

• Beta distribution & its shape parameters
• Conjugate prior property: Beta + data = Beta
• Convolution formula for sums of RVs
• Closed-form results: Poisson, Binomial, Normal
• Central Limit Theorem (sum & average)
• Proof sketch via MGFs

What comes next:

• Chapter 13: Sampling — generating random variables
• MCMC and Monte Carlo methods
• Bootstrap: resampling to estimate uncertainty
• Bayesian networks and graphical models

Master Beta and CLT and you hold the two keys to applied statistics: principled belief updating and the universal convergence to Normality.

The three big results, one more time: (1) Beta(a,b) is the distribution for an unknown probability with a−1 successes and b−1 failures observed. (2) Conjugate updating: prior Beta(a,b) + h heads, t tails = posterior Beta(a+h, b+t). (3) CLT: the sum of n IID variables with mean μ and variance σ² is approximately N(nμ, nσ²) for large n. These three facts carry you through Bayesian statistics, experimental design, and hypothesis testing.

"The central limit theorem is the supreme law of
Unreason... the more perfectly it is obeyed."
— Sir Francis Galton

Check: In Thompson sampling for multi-armed bandits, why is the Beta distribution a natural choice?

• Each arm has a binary outcome (reward or not), and Beta models belief about the success probability with easy conjugate updates
• The CLT requires Beta priors to converge
• Beta distributions are the only distributions on [0,1]