Piech, Chapter 8

Continuous Distributions

From discrete counts to smooth curves — PDFs, CDFs, and the distributions that run the world.

Prerequisites: Chapter 7 (Discrete distributions, expectation, variance). That's it.
10 chapters · 4 simulations · 10 quizzes

Chapter 0: Why Continuous?

You are running to catch the bus. You will arrive at 2:15 pm but you do not know exactly when the bus comes. You want to calculate the probability you will have to wait more than five minutes. The arrival time is a continuous quantity — it could be 2:17 pm and 12.123 seconds, or 2:18 pm and 0.0007 seconds. There are infinitely many possibilities between any two moments.

This is fundamentally different from counting heads on coin flips. With discrete random variables, we could ask "what is the probability of exactly 3 heads?" and get a sensible non-zero answer. But what is the probability the bus arrives at exactly 2:17:12.12333911102389234? That question is absurd. No bus will arrive at precisely that instant. Real-valued measurements have infinite precision, and the probability of hitting any single exact value is zero.

So how do we do probability with continuous quantities like time, weight, height, and temperature?

The core idea: Instead of assigning probability to individual points, we assign probability density to regions. The probability that a continuous random variable falls in some range is the area under a curve over that range. The curve is called the probability density function (PDF), and area means integration.

Think of it as a limiting process. Start by discretizing time into 5-minute chunks and assigning a probability to each chunk. Then refine to 2.5-minute chunks. Keep going. In the limit, the bar chart becomes a smooth curve. The height of that curve at any point is the density, and the area under the curve between two points is the probability.
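
The refinement described above can be sketched numerically. This is only an illustration under stated assumptions: the density f(t) = t/450 on [0, 30] is hypothetical (picked because its total area is 1), and the helper name `bar_chart_probability` is made up for this sketch. Each bar's area stands in for one chunk's probability, and the total approaches the exact area as the bars narrow.

```python
def density(t):
    """Hypothetical PDF on [0, 30]; chosen only because its area is 1."""
    return t / 450 if 0 <= t <= 30 else 0.0


def bar_chart_probability(a, b, width):
    """Approximate P(a < T < b) with bars of the given width.

    Each bar's height is the density at its left edge, so its area
    (height * width) plays the role of that chunk's probability.
    """
    n_bars = round((b - a) / width)
    return sum(density(a + i * width) * width for i in range(n_bars))


exact = (20**2 - 15**2) / 900  # area under t/450 between 15 and 20

for width in (5, 2.5, 0.5, 0.01):
    approx = bar_chart_probability(15, 20, width)
    print(f"bar width {width:>5}: {approx:.5f}  (exact {exact:.5f})")
```

With wide bars the estimate is visibly off; by the time the bars are 0.01 minutes wide it agrees with the exact area to several decimal places.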

Discrete: PMF, where P(X = k) is a number you can look up
  ↓ refine the bins
Continuous: PDF, where f(x) is a density; integrate to get probability
  ↓ accumulate area
CDF: F(x) = P(X ≤ x), so no further integration is needed

This chapter introduces the machinery for continuous random variables — PDFs, CDFs, and the three most important continuous distributions: Uniform, Exponential, and Normal. We will close by showing that the Normal distribution can approximate the Binomial when n is large, connecting continuous and discrete worlds.

Discrete world | Continuous world
PMF: P(X = k) | PDF: f(x) (density, not probability)
Summation: ∑ | Integration: ∫
P(X = k) can be > 0 | P(X = exact value) = 0 always
CDF: $F(a) = \sum_{i \le a} P(X = i)$ | CDF: $F(x) = \int_{-\infty}^{x} f(t)\,dt$
E[X] = ∑ x·P(X=x) | E[X] = ∫ x·f(x) dx
Check: Why is P(X = exact value) = 0 for a continuous random variable?

Chapter 1: PDF vs PMF

In the discrete world, a probability mass function (PMF) tells you the probability that a random variable equals a specific value: P(X = 3) = 0.25, for instance. You can read off probabilities directly from the PMF.

In the continuous world, we instead have a probability density function (PDF), written f(x). The PDF tells you the density of probability at each point — not the probability itself. To get an actual probability, you must integrate the PDF over a range:

$P(a \le X \le b) = \int_a^b f(x)\,dx$

The PDF must satisfy two properties (mirroring the probability axioms):

Non-negative
f(x) ≥ 0 for all x
Integrates to 1
$\int_{-\infty}^{\infty} f(x)\,dx = 1$
Key insight: f(x) is NOT a probability. It is a probability density — probability per unit of x. It can even exceed 1! For example, if a random variable is concentrated in a tiny range (say 0 to 0.1), the density in that range must be about 10 to make the total area equal 1. Density greater than 1 is perfectly legal.

Worked example — finding a constant: Let X be continuous with PDF:

$f(x) = C(4x - 2x^2)$   for 0 < x < 2,    0 otherwise

What is C? Since the PDF must integrate to 1:

$\int_0^2 C(4x - 2x^2)\,dx = 1$

Compute the integral:

$C \cdot \left[2x^2 - \tfrac{2x^3}{3}\right]_0^2 = C \cdot \left[\left(8 - \tfrac{16}{3}\right) - 0\right] = C \cdot \tfrac{8}{3} = 1$

So C = 3/8.

Now: what is P(X > 1)?

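$P(X > 1) = \int_1^2 \tfrac{3}{8}(4x - 2x^2)\,dx = \tfrac{3}{8}\left[2x^2 - \tfrac{2x^3}{3}\right]_1^2$
$= \tfrac{3}{8}\left[\left(8 - \tfrac{16}{3}\right) - \left(2 - \tfrac{2}{3}\right)\right] = \tfrac{3}{8}\left[\tfrac{8}{3} - \tfrac{4}{3}\right] = \tfrac{3}{8} \cdot \tfrac{4}{3} = \tfrac{1}{2}$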

Exactly half the probability mass lies above x = 1. The PDF is symmetric about x = 1, so this makes perfect sense.
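
The two results above can be cross-checked numerically. The sketch below is not from the original text: `integrate` is a hypothetical midpoint-rule helper written for this illustration, and it simply recomputes C and P(X > 1) for the worked example's PDF.

```python
def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


def unnormalized(x):
    """The worked example's PDF without its constant C."""
    return 4 * x - 2 * x**2


# The total area must be 1, so C is the reciprocal of the unnormalized area.
C = 1 / integrate(unnormalized, 0, 2)
p_above_1 = integrate(lambda x: C * unnormalized(x), 1, 2)

print(f"C        ~ {C:.6f}   (hand calculation: 3/8 = 0.375)")
print(f"P(X > 1) ~ {p_above_1:.6f}   (hand calculation: 1/2)")
```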

Expectation and variance work the same way; just replace sums with integrals: E[X] = ∫ x·f(x) dx and $\mathrm{Var}(X) = E[X^2] - (E[X])^2$. The linear transform rules ($E[aX + b] = aE[X] + b$ and $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$) hold for continuous RVs too.
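
As a rough check of these formulas, the sketch below computes the mean and variance of the worked example's PDF, f(x) = (3/8)(4x − 2x²) on (0, 2), and then tests the linear transform rules for Y = 3X + 1. The `integrate` helper and the choice a = 3, b = 1 are assumptions made only for this illustration.

```python
def integrate(func, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width


def pdf(x):
    """PDF from the worked example: (3/8)(4x - 2x^2) on (0, 2)."""
    return (3 / 8) * (4 * x - 2 * x**2) if 0 < x < 2 else 0.0


mean = integrate(lambda x: x * pdf(x), 0, 2)               # E[X]
second_moment = integrate(lambda x: x**2 * pdf(x), 0, 2)   # E[X^2]
variance = second_moment - mean**2                         # Var(X)

# Linear transform Y = aX + b, with illustrative values a = 3, b = 1.
a, b = 3, 1
mean_y = integrate(lambda x: (a * x + b) * pdf(x), 0, 2)
var_y = integrate(lambda x: (a * x + b) ** 2 * pdf(x), 0, 2) - mean_y**2

print(f"E[X] ~ {mean:.4f}, Var(X) ~ {variance:.4f}")
print(f"E[Y] ~ {mean_y:.4f} vs a*E[X] + b = {a * mean + b:.4f}")
print(f"Var(Y) ~ {var_y:.4f} vs a^2 * Var(X) = {a**2 * variance:.4f}")
```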
Check: A PDF f(x) = 5 on the interval [0.1, 0.3] and 0 elsewhere. Is this valid?

Chapter 2: The CDF for Continuous RVs

Every time we want a probability, we have to solve an integral. That gets tedious fast. The cumulative distribution function (CDF) saves us by pre-computing the integral from −∞ up to any point x:

$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$

Once you have the CDF, you can answer any probability question without integrating again:

Question | Answer using CDF | Why
P(X ≤ a) | F(a) | Definition of CDF
P(X < a) | F(a) | P(X = a) = 0 for continuous
P(X > a) | 1 − F(a) | Complement rule
P(a < X < b) | F(b) − F(a) | Subtract the accumulated area
Key insight: For continuous random variables, P(X < a) = P(X ≤ a) because P(X = a) = 0. This is a luxury we did not have in the discrete world, where whether you include the endpoint matters.

The CDF has three important properties:

Non-decreasing
F(a) ≤ F(b) whenever a ≤ b
Limits
F(−∞) = 0 and F(∞) = 1
PDF recovery
f(x) = dF/dx — differentiate to get the PDF back

Worked example: For X ~ Uni(0, 10), F(x) = x/10 for 0 ≤ x ≤ 10. What is P(3 < X < 7)?

P(3 < X < 7) = F(7) − F(3) = 7/10 − 3/10 = 4/10 = 0.4

No integration needed — just two function evaluations and a subtraction.
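
The same lookups can be written as a few lines of code. This is a minimal sketch assuming the Uni(0, 10) CDF from the worked example; the function name `uniform_cdf` is illustrative. The last line recovers the density from the CDF with a finite-difference slope, which mirrors f(x) = dF/dx.

```python
def uniform_cdf(x, low=0.0, high=10.0):
    """CDF of Uni(low, high): 0 below the range, 1 above it."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


F = uniform_cdf

p_below_3 = F(3)          # P(X < 3) = P(X <= 3)
p_above_7 = 1 - F(7)      # complement rule
p_between = F(7) - F(3)   # P(3 < X < 7)

# Approximate the derivative of F at x = 5 to recover the density there.
h = 1e-6
density_at_5 = (F(5 + h) - F(5 - h)) / (2 * h)

print(p_below_3, p_above_7, p_between, density_at_5)
```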

PDF and CDF are two sides of one coin. Integration takes the PDF to the CDF. Differentiation takes the CDF back to the PDF. You will always work with whichever is more convenient for the distribution at hand. For the Exponential, the CDF has a clean closed form. For the Normal, the PDF has a clean formula but the CDF does not — you have to look it up in a table or use software.
Check: If F(5) = 0.3 and F(12) = 0.85, what is P(5 < X < 12)?

Chapter 3: The Uniform Distribution

The simplest continuous distribution is the Uniform. A uniform random variable X ~ Uni(α, β) is equally likely to take any value in the range [α, β]. Think of it as the continuous version of "roll a fair die" — every outcome is equally likely.

f(x) = 1/(β − α)   for α ≤ x ≤ β,    0 otherwise

Why 1/(β − α) and not just 1? Because the area under the curve must equal 1. The rectangle has width (β − α) and height 1/(β − α), so the area is exactly 1.

Property | Formula
PDF | f(x) = 1/(β − α) for x ∈ [α, β]
CDF | F(x) = (x − α) / (β − α) for x ∈ [α, β]
Expectation | E[X] = (α + β) / 2
Variance | $\mathrm{Var}(X) = (\beta - \alpha)^2 / 12$
P(a ≤ X ≤ b) | (b − a) / (β − α), for α ≤ a ≤ b ≤ β
Intuition: The Uniform is the "maximum ignorance" distribution over a bounded range. If you know absolutely nothing except that a quantity lies between α and β, the Uniform is the most honest model you can use. It treats every sub-interval of the same width as equally probable.

Worked example — the bus problem: You arrive at the bus stop at 2:15 pm. The bus arrives uniformly at random between 2:00 and 2:30. Let T ~ Uni(0, 30) be minutes after 2:00. What is the probability you wait less than 5 minutes, i.e., the bus arrives between 2:15 and 2:20?

P(15 < T < 20) = (20 − 15) / (30 − 0) = 5/30 = 1/6 ≈ 0.167

About a 16.7% chance. The full arithmetic: f(t) = 1/30 for 0 ≤ t ≤ 30, so $\int_{15}^{20} \tfrac{1}{30}\,dt = \left[\tfrac{t}{30}\right]_{15}^{20} = \tfrac{20}{30} - \tfrac{15}{30} = \tfrac{5}{30} = \tfrac{1}{6}$. The CDF shortcut gives the same answer instantly.
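
A quick simulation offers an independent check on the 1/6 answer. This is a sketch, not part of the original lesson: the seed and the number of trials are arbitrary choices, and the estimate will only be close to 1/6, not equal to it.

```python
import random

random.seed(0)      # fixed seed so the run is repeatable
trials = 200_000    # arbitrary; more trials give a tighter estimate

hits = 0
for _ in range(trials):
    bus_time = random.uniform(0, 30)   # minutes after 2:00
    if 15 < bus_time < 20:             # bus arrives between 2:15 and 2:20
        hits += 1

estimate = hits / trials
print(f"simulated P(15 < T < 20) ~ {estimate:.4f}  (exact: {1 / 6:.4f})")
```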

[Interactive simulation: Uniform Distribution, PDF and CDF. Sliders set α and β (defaults: α = 0, β = 10); the shaded area shows P(a < X < b) for the highlighted region.]
Check: X ~ Uni(2, 8). What is P(X > 5)?

Chapter 4: The Exponential Distribution

You are monitoring a server for crashes. On average, a crash occurs once every 10 hours. You want to know: what is the probability of no crash in the next 5 hours? The Exponential distribution is built for exactly this — it models the time until the next event in a Poisson process.

If X ~ Exp(λ), then X measures the waiting time until the next event, where events happen at a constant average rate of λ per unit time.

$f(x) = \lambda e^{-\lambda x}$   for x ≥ 0

The CDF has a beautiful closed form — no table lookups, no numerical integration:

$F(x) = 1 - e^{-\lambda x}$

Property | Formula
PDF | $f(x) = \lambda e^{-\lambda x}$
CDF | $F(x) = 1 - e^{-\lambda x}$
Expectation | E[X] = 1/λ
Variance | $\mathrm{Var}(X) = 1/\lambda^2$
Key insight — Memoryless property: The Exponential is memoryless. If you have already waited s minutes with no event, the probability of waiting at least t more minutes is the same as if you had just started waiting: P(X > s+t | X > s) = P(X > t). The past gives you no information about the future. This is the only continuous distribution with this property.

Proof of memorylessness:

$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t)$   ✓
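
The identity can also be spot-checked numerically. In the sketch below, the rate and the (s, t) pairs are arbitrary illustrative values, and `survival` is a hypothetical helper for P(X > x) = 1 − F(x).

```python
import math


def survival(x, rate):
    """P(X > x) for X ~ Exp(rate), i.e. 1 - F(x)."""
    return math.exp(-rate * x)


rate = 0.1  # illustrative rate; any positive value behaves the same way

for s, t in [(1.0, 2.0), (7.0, 3.0), (50.0, 0.5)]:
    conditional = survival(s + t, rate) / survival(s, rate)  # P(X > s+t | X > s)
    unconditional = survival(t, rate)                        # P(X > t)
    print(f"s={s:>4}, t={t:>3}: {conditional:.6f} vs {unconditional:.6f}")
```

However long you have already waited (s), the conditional probability matches the fresh-start probability.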

Worked example — earthquakes: Major earthquakes (≥ 8.0) occur in a region at a rate of λ = 0.002 per year. What is P(earthquake in the next 4 years)?

Let Y ~ Exp(0.002). We want P(Y < 4):

$P(Y < 4) = F(4) = 1 - e^{-0.002 \times 4} = 1 - e^{-0.008} \approx 1 - 0.99203 = 0.00797 \approx 0.8\%$

About a 0.8% chance. Notice how we used the CDF directly — no integral needed.
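
For readers who prefer to check the arithmetic in code, here is a minimal sketch using only the standard library. The function name `exponential_cdf` is an assumption of this example, not an API from the lesson.

```python
import math


def exponential_cdf(x, rate):
    """F(x) = 1 - exp(-rate * x) for x >= 0, and 0 for negative x."""
    return 1 - math.exp(-rate * x) if x >= 0 else 0.0


rate = 0.002                        # major earthquakes per year
p_quake = exponential_cdf(4, rate)  # P(Y < 4)
mean_wait = 1 / rate                # E[Y] = 1/lambda, in years

print(f"P(earthquake within 4 years) ~ {p_quake:.5f}")
print(f"expected wait ~ {mean_wait:.0f} years")
```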

Exponential vs Poisson: These are siblings, not twins. Poisson counts the number of events in a fixed time window (discrete). Exponential measures the time until the next event (continuous). Both assume the same underlying Poisson process with rate λ.
[Interactive simulation: Exponential Distribution, PDF and CDF. A slider sets λ (default 1.0). Higher λ means events happen more frequently, so the density is concentrated near zero.]
Check: A server crashes at rate λ = 0.1/hour. What is P(no crash in 10 hours)?

Chapter 5: The Normal (Gaussian) Distribution

The single most important continuous distribution is the Normal, also called the Gaussian. It appears everywhere (measurement errors, test scores, heights, stock returns) because of a deep mathematical result: when you sum many independent random variables with finite variance, the standardized sum tends toward a Normal, largely regardless of what the individual variables look like. This is the Central Limit Theorem (coming in a later chapter).

A Normal random variable $X \sim N(\mu, \sigma^2)$ is parameterized by its mean μ and variance $\sigma^2$. Its PDF is the famous bell curve:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

When x equals the mean μ, the exponent is zero and $e^0 = 1$, so the PDF is at its maximum. As x moves away from μ in either direction, the density drops off symmetrically, forming the bell shape.

Property | Value
Expectation | E[X] = μ
Variance | $\mathrm{Var}(X) = \sigma^2$
Symmetry | f(μ + d) = f(μ − d) for all d
68-95-99.7 rule | 68.3% within 1σ, 95.4% within 2σ, 99.7% within 3σ
Why the Normal is everywhere: Among all distributions with a given mean and variance, the Normal has the maximum entropy, which makes it the most conservative choice. If all you know about a quantity is its average and spread, modeling it as Normal makes the fewest additional assumptions. It is the mathematically honest "I don't know more than this" distribution.

There is no closed-form CDF for the Normal: you cannot write $\int_{-\infty}^{x} f(t)\,dt$ in terms of elementary functions. Instead, we transform to the Standard Normal and look up values in a table (or use software). The next chapter covers this transformation.

Worked example — the 68% rule: What fraction of a Normal lies within one standard deviation of its mean?

P(μ − σ < X < μ + σ) = P(X < μ + σ) − P(X < μ − σ)

Converting to the Standard Normal (next chapter shows how):

= Φ((μ+σ−μ)/σ) − Φ((μ−σ−μ)/σ) = Φ(1) − Φ(−1) ≈ 0.8413 − 0.1587 = 0.6826 ≈ 68.3%

This holds for every Normal — regardless of μ and σ. About two-thirds of the probability mass always lies within one standard deviation of the mean.
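
The claim that the fraction does not depend on μ and σ is easy to probe in code. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for a table lookup; the three (μ, σ) pairs are arbitrary illustrative choices, and `mass_within` is a name invented for this example.

```python
from statistics import NormalDist


def mass_within(k, mu, sigma):
    """Probability that a N(mu, sigma^2) variable lies within k sigma of mu."""
    dist = NormalDist(mu, sigma)   # second argument is sigma, not variance
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for mu, sigma in [(0, 1), (100, 5), (-3, 0.2)]:
    shares = [mass_within(k, mu, sigma) for k in (1, 2, 3)]
    print(f"mu={mu:>4}, sigma={sigma:>4}: " +
          ", ".join(f"{share:.4f}" for share in shares))
```

Each row prints the same three fractions, matching the 68-95-99.7 rule.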

[Interactive simulation: Normal Distribution, Bell Curve. Sliders set μ (default 0) and σ (default 1.0), which control the location and spread of the bell curve. The shaded region shows the area within one standard deviation (≈68.3%).]
Check: For X ~ N(100, 25), what is P(95 < X < 105)?

Chapter 6: Standard Normal & Z-scores

The Standard Normal is a Normal with mean 0 and variance 1: Z ~ N(0, 1). Its CDF is so important it gets its own Greek letter: Φ(z). Every Normal probability question ultimately reduces to looking up values of Φ.

The key trick: for any Normal $X \sim N(\mu, \sigma^2)$, you can transform it to the Standard Normal by subtracting the mean and dividing by the standard deviation:

Z = (X − μ) / σ,   where Z ~ N(0, 1)

This works because of the linear transform property: if $X \sim N(\mu, \sigma^2)$ and Y = aX + b, then $Y \sim N(a\mu + b, a^2\sigma^2)$. Setting a = 1/σ and b = −μ/σ gives Z ~ N(0, 1).

Key insight: This transform lets you handle every Normal with a single table. No matter what μ and σ are, the CDF of X can be written as: $F_X(x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$. One function Φ rules them all.

The value (x − μ) / σ is called the z-score. It measures "how many standard deviations is x away from the mean?" A z-score of +2 means "two standard deviations above the mean." A z-score of −1.5 means "one and a half standard deviations below."

Φ has a useful symmetry: Φ(−z) = 1 − Φ(z). This means you only need the table for positive z.
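
In place of a printed table, Φ can be evaluated with the error function, using the standard identity Φ(z) = ½(1 + erf(z/√2)). The sketch below is an illustration only; the names `phi` and `z_score` and the sample values are assumptions of this example.

```python
import math


def phi(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma


# Symmetry: phi(-z) should equal 1 - phi(z) for any z.
for z in (0.25, 0.75, 1.5):
    print(f"z={z}: phi(-z)={phi(-z):.4f}, 1 - phi(z)={1 - phi(z):.4f}")

# Illustrative z-score: x = 11 for a Normal with mu = 3, sigma = 4.
print(z_score(11, 3, 4))
```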

Worked example 1: X ~ N(3, 16), so μ = 3 and σ = 4. What is P(X > 0)?

P(X > 0) = 1 − P(X ≤ 0) = 1 − Φ((0 − 3)/4) = 1 − Φ(−0.75)
= 1 − (1 − Φ(0.75)) = Φ(0.75) ≈ 0.7734

Worked example 2: Same X ~ N(3, 16). What is P(2 < X < 5)?

P(2 < X < 5) = Φ((5−3)/4) − Φ((2−3)/4) = Φ(0.5) − Φ(−0.25)
= Φ(0.5) − (1 − Φ(0.25)) = 0.6915 − (1 − 0.5987) = 0.6915 − 0.4013 = 0.2902

Worked example 3 — noisy communication: You send voltage +2 for bit 1 and −2 for bit 0. The received signal is R = X + Y where Y ~ N(0, 1) is noise. The decoder says "1" if R ≥ 0.5. If we send bit 1 (X = 2), what is P(decoding error)?

P(error) = P(R < 0.5) = P(2 + Y < 0.5) = P(Y < −1.5) = Φ(−1.5) ≈ 0.0668

About a 6.7% error rate per bit.
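
The three worked examples can be reproduced in a few lines. This sketch uses `statistics.NormalDist` from the standard library instead of a table; the variable names are illustrative. Like the table method, it takes the standard deviation (4), not the variance (16).

```python
from statistics import NormalDist

X = NormalDist(mu=3, sigma=4)     # N(3, 16): sigma is the square root of 16

p_positive = 1 - X.cdf(0)         # Example 1: P(X > 0)
p_between = X.cdf(5) - X.cdf(2)   # Example 2: P(2 < X < 5)

noise = NormalDist(0, 1)          # Example 3: Y ~ N(0, 1)
p_error = noise.cdf(0.5 - 2)      # P(2 + Y < 0.5) = P(Y < -1.5)

print(f"P(X > 0)     ~ {p_positive:.4f}")
print(f"P(2 < X < 5) ~ {p_between:.4f}")
print(f"P(error)     ~ {p_error:.4f}")
```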

z | Φ(z) | z | Φ(z)
0.00 | 0.5000 | 1.00 | 0.8413
0.25 | 0.5987 | 1.50 | 0.9332
0.50 | 0.6915 | 2.00 | 0.9772
0.75 | 0.7734 | 2.50 | 0.9938
In Python: Use scipy.stats.norm.cdf(x, mu, sigma). Note: the last argument (scipy calls it scale) is the standard deviation σ, not the variance $\sigma^2$. If X ~ N(3, 16), call norm.cdf(x, 3, 4): pass σ = 4, not $\sigma^2$ = 16.
Check: X ~ N(50, 100). What is the z-score for x = 70?

Chapter 7: Normal Approximation to the Binomial

Consider X ~ Bin(10000, 0.5). What is P(X > 5500)? The exact formula requires summing 4500 terms, each a binomial coefficient multiplied by $0.5^{10000}$. This is computationally brutal. But if you squint at the shape of Bin(10000, 0.5), it looks almost exactly like a bell curve.

This is not a coincidence. When n is large and p is not too extreme, a Binomial is well-approximated by a Normal with matching mean and variance:

X ~ Bin(n, p)  ≈  Y ~ N(np, np(1−p))

The rule of thumb: use the Normal approximation when np(1 − p) > 10.

Why this works: A Binomial is a sum of n independent Bernoulli trials. By the Central Limit Theorem, the sum of many independent random variables converges to a Normal. The larger n is, the better the approximation.

Because the Normal is continuous and the Binomial is discrete, we apply a continuity correction — shifting by 0.5 to account for the "width" of each discrete bar:

Discrete question | Continuous equivalent
P(X = 6) | P(5.5 < Y < 6.5)
P(X ≥ 6) | P(Y > 5.5)
P(X > 6) | P(Y > 6.5)
P(X < 6) | P(Y < 5.5)
P(X ≤ 6) | P(Y < 6.5)

Worked example 1 — A/B test: 100 visitors see a new website design. Let X = number who spend more time. If the design has no effect, each visitor is a fair coin flip: X ~ Bin(100, 0.5). The CEO endorses the change if X ≥ 65. What is P(CEO endorses | no effect)?

Compute the parameters: E[X] = np = 50. Var(X) = np(1−p) = 25. So σ = 5.

Normal approximation: Y ~ N(50, 25). With continuity correction:

P(X ≥ 65) ≈ P(Y > 64.5) = P(Z > (64.5 − 50)/5) = P(Z > 2.9) = 1 − Φ(2.9) ≈ 0.0019

Less than 0.2%. Very unlikely to get 65+ by pure chance — which is the whole point of statistical significance.
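
With n = 100 the exact Binomial sum is still cheap, so the approximation can be compared against it directly. The sketch below is an illustration, not part of the original example; it also computes the Normal tail without the continuity correction to show what the 0.5 shift buys.

```python
from math import comb
from statistics import NormalDist

n, p, threshold = 100, 0.5, 65

# Exact tail: sum the Binomial PMF from 65 to 100.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k)
            for k in range(threshold, n + 1))

mu = n * p                           # 50
sigma = (n * p * (1 - p)) ** 0.5     # 5
approximating_normal = NormalDist(mu, sigma)

corrected = 1 - approximating_normal.cdf(threshold - 0.5)   # P(Y > 64.5)
uncorrected = 1 - approximating_normal.cdf(threshold)       # P(Y > 65)

print(f"exact Binomial tail       : {exact:.5f}")
print(f"Normal, with correction   : {corrected:.5f}")
print(f"Normal, without correction: {uncorrected:.5f}")
```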

Worked example 2 — Stanford admissions: Stanford accepts 2480 students, each with a 68% chance of attending. Let X ~ Bin(2480, 0.68). What is P(X > 1745)?

E[X] = 2480 × 0.68 = 1686.4. Var(X) = 2480 × 0.68 × 0.32 = 539.648. σ = √539.648 ≈ 23.23.

P(X > 1745) ≈ P(Y > 1745.5) = P(Z > (1745.5 − 1686.4)/23.23) = P(Z > 2.54) = 1 − Φ(2.54) ≈ 0.0055

About 0.55%. Stanford can be fairly confident that enrollment will not exceed 1745.
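
For n = 2480 the individual PMF terms underflow if computed naively, so one way to get the exact tail for comparison is to work in log space. The sketch below is an assumption-laden illustration: `binomial_log_pmf` is a helper written for this example, built on `math.lgamma` for the log of the binomial coefficient.

```python
import math
from statistics import NormalDist


def binomial_log_pmf(k, n, p):
    """log P(X = k) for X ~ Bin(n, p), computed with log-gamma."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


n, p = 2480, 0.68

# Exact tail P(X > 1745): sum the PMF over k = 1746, ..., 2480.
exact = sum(math.exp(binomial_log_pmf(k, n, p)) for k in range(1746, n + 1))

mu = n * p
sigma = math.sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mu, sigma).cdf(1745.5)   # continuity-corrected

print(f"exact tail (log-space sum): {exact:.5f}")
print(f"Normal approximation      : {approx:.5f}")
```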

Poisson approximation (for rare events): When n is large but p is very small (< 0.05), use a Poisson with λ = np instead of a Normal. Example: 10,000 bits, each corrupted with probability $10^{-6}$, give λ = 0.01, so P(no corruption) = $e^{-0.01}$ ≈ 0.99. The Poisson approximation is for rare events; the Normal approximation is for moderate-probability events.
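
The bit-corruption example is small enough to compare against the exact Binomial answer. A minimal sketch, using only the numbers given above:

```python
import math

n, p = 10_000, 1e-6
lam = n * p                       # Poisson rate, lambda = np

exact_none = (1 - p) ** n         # Binomial: P(X = 0) = (1 - p)^n
poisson_none = math.exp(-lam)     # Poisson:  P(X = 0) = e^(-lambda)

print(f"lambda          = {lam}")
print(f"exact Binomial  = {exact_none:.8f}")
print(f"Poisson approx  = {poisson_none:.8f}")
```

The two values agree to many decimal places, which is what makes the Poisson shortcut attractive when p is tiny.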
Check: To approximate Bin(200, 0.4) with a Normal, what parameters should the Normal have?

Chapter 8: Showcase — Distribution Explorer

Time to bring everything together. The interactive explorer below lets you switch between all three continuous distributions from this chapter — Uniform, Exponential, and Normal — and see their PDF and CDF side by side. Adjust the parameters and watch both curves update in real time.

Use this to build visual intuition: How does increasing λ in the Exponential squeeze the density toward zero? How does increasing σ in the Normal flatten and widen the bell? How does the CDF always climb from 0 to 1?

[Interactive simulation: Distribution Explorer, PDF and CDF. Choose a distribution and adjust its two parameters and the query bounds a and b. The left panel shows the PDF; the right panel shows the CDF. The shaded region on the PDF and the highlighted segment on the CDF correspond to P(a < X < b), which is displayed numerically.]
Things to try: (1) Set Normal with μ=0, σ=1 and query a=−1, b=1; you should see ≈68.3%. (2) Set Exponential λ=1, query a=0, b=1; you should see ≈63.2% (which is $1 - e^{-1}$). (3) Set Uniform α=0, β=10, query a=2, b=7; you should see exactly 50%.
Check: Using the explorer, what happens to the Normal PDF as σ increases?

Chapter 9: Connections

This chapter moved us from the discrete world into the continuous world. The mechanics changed (integrals replace sums, density replaces mass) but the conceptual framework is the same: distributions, expectations, variances, CDFs. Here is what connects to what:

Looking back:

• Chapter 7 gave us discrete distributions (Bernoulli, Binomial, Poisson). Every idea here is the continuous analog.
• The Exponential is the continuous companion to the Poisson — same process, different question.
• The Normal approximation to the Binomial bridges discrete and continuous.

Looking ahead:

• Chapter 9 introduces joint distributions for multiple continuous RVs.
• Chapter 10 covers the Central Limit Theorem, which explains why the Normal appears everywhere.
• The Normal distribution becomes the foundation for statistical inference, confidence intervals, and hypothesis testing.

Distribution | Use when | Key parameter
Uniform(α, β) | All values in a range equally likely | Range [α, β]
Exponential(λ) | Time until next event (Poisson process) | Rate λ
Normal(μ, $\sigma^2$) | Sums of many independent factors | Mean μ, variance $\sigma^2$
The big picture: Continuous distributions unlock the ability to model real-valued measurements. The Uniform is the "I know nothing" distribution. The Exponential is the "events at constant rate" distribution. The Normal is the "sum of many small effects" distribution. These three, plus the discrete distributions from Chapter 7, form the toolkit you will use for the rest of probability and statistics.

What comes next: So far, every random variable has lived alone. But real problems involve multiple quantities at once — height and weight, temperature and humidity, your score and mine. Chapter 9 introduces joint distributions, where two or more random variables live in the same probability space. Joint PDFs, marginals, conditional densities, and covariance all extend naturally from what you have learned here.

"The normal distribution is the most beautiful thing
in all of mathematics."
— Francis Galton
Check: Which distribution would you use to model the time until the next customer arrives at a store?