From discrete counts to smooth curves — PDFs, CDFs, and the distributions that run the world.
You are running to catch the bus. You will arrive at 2:15 pm but you do not know exactly when the bus comes. You want to calculate the probability you will have to wait more than five minutes. The arrival time is a continuous quantity — it could be 2:17 pm and 12.123 seconds, or 2:18 pm and 0.0007 seconds. There are infinitely many possibilities between any two moments.
This is fundamentally different from counting heads on coin flips. With discrete random variables, we could ask "what is the probability of exactly 3 heads?" and get a sensible non-zero answer. But what is the probability the bus arrives at exactly 2:17:12.12333911102389234? That question is absurd. No bus will arrive at precisely that instant. Real-valued measurements have infinite precision, and the probability of hitting any single exact value is zero.
So how do we do probability with continuous quantities like time, weight, height, and temperature?
Think of it as a limiting process. Start by discretizing time into 5-minute chunks and assigning a probability to each chunk. Then refine to 2.5-minute chunks. Keep going. In the limit, the bar chart becomes a smooth curve. The height of that curve at any point is the density, and the area under the curve between two points is the probability.
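The chunk-refinement idea can be sketched numerically. The snippet below is a minimal illustration, assuming a made-up density f(t) = t/450 on [0, 30] (a hypothetical choice, picked only because its area is easy to check by hand). It approximates P(15 ≤ T ≤ 20) by adding up height × width over ever-narrower chunks.

```python
# Hypothetical density used only for illustration: f(t) = t / 450 on [0, 30].
# It is non-negative and its total area is 30**2 / (2 * 450) = 1.
def density(t):
    return t / 450 if 0 <= t <= 30 else 0.0


def chunked_probability(a, b, width):
    """Approximate P(a <= T <= b) by summing height x width over chunks."""
    n_chunks = round((b - a) / width)
    # Use the density at the left edge of each chunk as the bar height.
    return sum(density(a + i * width) * width for i in range(n_chunks))


# Exact area under t/450 between 15 and 20, from the antiderivative t**2 / 900.
exact = (20**2 - 15**2) / 900

for width in [5, 2.5, 1.25, 0.01]:
    approx = chunked_probability(15, 20, width)
    print(f"chunk width {width:>5}: approx = {approx:.5f}, exact = {exact:.5f}")
```

As the chunk width shrinks, the bar-chart total approaches the exact area, which is the sense in which the area under the density curve is the probability.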
This chapter introduces the machinery for continuous random variables — PDFs, CDFs, and the three most important continuous distributions: Uniform, Exponential, and Normal. We will close by showing that the Normal distribution can approximate the Binomial when n is large, connecting continuous and discrete worlds.
| Discrete world | Continuous world |
|---|---|
| PMF: P(X = k) | PDF: f(x) (density, not probability) |
| Summation: ∑ | Integration: ∫ |
| P(X = k) can be > 0 | P(X = exact value) = 0 always |
| CDF: F(a) = ∑_{i ≤ a} P(X = i) | CDF: F(x) = ∫_{−∞}^{x} f(t) dt |
| E[X] = ∑ x·P(X=x) | E[X] = ∫ x·f(x) dx |
In the discrete world, a probability mass function (PMF) tells you the probability that a random variable equals a specific value: P(X = 3) = 0.25, for instance. You can read off probabilities directly from the PMF.
In the continuous world, we instead have a probability density function (PDF), written f(x). The PDF tells you the density of probability at each point, not the probability itself. To get an actual probability, you must integrate the PDF over a range:
P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx
The PDF must satisfy two properties (mirroring the probability axioms):
• Non-negativity: f(x) ≥ 0 for every x.
• Total area of one: ∫_{−∞}^{∞} f(x) dx = 1.
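Worked example (finding a constant): Let X be continuous with PDF:
f(x) = C(4x − 2x²) for 0 < x < 2, and f(x) = 0 otherwise
What is C? Since the PDF must integrate to 1:
∫_{0}^{2} C(4x − 2x²) dx = 1
Compute the integral:
C [2x² − (2/3)x³] evaluated from 0 to 2 = C(8 − 16/3) = (8/3)C = 1
So C = 3/8.
Now: what is P(X > 1)?
P(X > 1) = ∫_{1}^{2} (3/8)(4x − 2x²) dx = (3/8) [2x² − (2/3)x³] evaluated from 1 to 2 = (3/8)(8/3 − 4/3) = 1/2
Exactly half the probability mass lies above x = 1. The PDF is symmetric about x = 1, so this is exactly what we should expect.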
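A quick numeric check of this example, taking the density to be f(x) = C(4x − 2x²) on 0 < x < 2. That form is an assumption here, chosen because it is consistent with C = 3/8 and with the symmetry about x = 1. The `integrate` helper is a plain midpoint rule written for this sketch, not a library routine.

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Assumed shape of the unnormalized density on (0, 2).
def shape(x):
    return 4 * x - 2 * x**2


# The PDF must integrate to 1, so C is one over the area of the shape.
C = 1 / integrate(shape, 0, 2)


def pdf(x):
    return C * shape(x)


p_above_1 = integrate(pdf, 1, 2)
print(f"C = {C:.6f}, P(X > 1) = {p_above_1:.6f}")
```

Under that assumption the script should report C close to 0.375 and P(X > 1) close to 0.5.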
Every time we want a probability, we have to solve an integral. That gets tedious fast. The cumulative distribution function (CDF) saves us by pre-computing the integral from −∞ up to any point x:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
Once you have the CDF, you can answer any probability question without integrating again:
| Question | Answer using CDF | Why |
|---|---|---|
| P(X < a) | F(a) | Definition of CDF |
| P(X ≤ a) | F(a) | P(X = a) = 0 for continuous |
| P(X > a) | 1 − F(a) | Complement rule |
| P(a < X < b) | F(b) − F(a) | Subtract the accumulated area |
The CDF has three important properties:
• F(x) → 0 as x → −∞.
• F(x) → 1 as x → +∞.
• F is non-decreasing: if a ≤ b, then F(a) ≤ F(b).
Worked example: For X ~ Uni(0, 10), F(x) = x/10 for 0 ≤ x ≤ 10. What is P(3 < X < 7)?
P(3 < X < 7) = F(7) − F(3) = 7/10 − 3/10 = 0.4
No integration needed — just two function evaluations and a subtraction.
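The same two-evaluations-and-a-subtraction pattern can be sketched in code. The helper names `uniform_cdf` and `prob_between` are made up for this illustration.

```python
def uniform_cdf(x, alpha, beta):
    """CDF of Uni(alpha, beta): 0 below alpha, 1 above beta, linear between."""
    if x <= alpha:
        return 0.0
    if x >= beta:
        return 1.0
    return (x - alpha) / (beta - alpha)


def prob_between(cdf, a, b):
    """P(a < X < b) = F(b) - F(a) for any continuous CDF."""
    return cdf(b) - cdf(a)


def F(x):
    # CDF of X ~ Uni(0, 10) from the worked example.
    return uniform_cdf(x, 0, 10)


p = prob_between(F, 3, 7)
print(f"P(3 < X < 7) = {p:.4f}")
```

Because `prob_between` only needs a CDF, the same helper works unchanged for any other continuous distribution.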
The simplest continuous distribution is the Uniform. A uniform random variable X ~ Uni(α, β) is equally likely to take any value in the range [α, β]. Think of it as the continuous version of "roll a fair die": every outcome is equally likely. Its PDF is constant on that range: f(x) = 1/(β − α) for α ≤ x ≤ β, and 0 elsewhere.
Why 1/(β − α) and not just 1? Because the area under the curve must equal 1. The rectangle has width (β − α) and height 1/(β − α), so the area is exactly 1.
| Property | Formula |
|---|---|
| PDF | f(x) = 1/(β − α) for x ∈ [α, β] |
| CDF | F(x) = (x − α) / (β − α) for x ∈ [α, β] |
| Expectation | E[X] = (α + β) / 2 |
| Variance | Var(X) = (β − α)² / 12 |
| P(a ≤ X ≤ b) | (b − a) / (β − α) |
Worked example — the bus problem: You arrive at the bus stop at 2:15 pm. The bus arrives uniformly at random between 2:00 and 2:30. Let T ~ Uni(0, 30) be minutes after 2:00. What is the probability you wait less than 5 minutes, i.e., the bus arrives between 2:15 and 2:20?
P(15 < T < 20) = F(20) − F(15) = 20/30 − 15/30 = 1/6 ≈ 0.167
About a 16.7% chance. The full arithmetic: f(t) = 1/30 for 0 ≤ t ≤ 30, so ∫_{15}^{20} (1/30) dt = [t/30] evaluated from 15 to 20 = 20/30 − 15/30 = 5/30 = 1/6. The CDF shortcut gives the same answer without integrating.
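A simulation offers an independent sanity check on the 1/6 figure. This is only a sketch: the seed and the number of trials are arbitrary choices, and the estimate will wobble slightly around the exact value.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n_trials = 200_000

# Draw bus arrival times T ~ Uni(0, 30), in minutes after 2:00,
# and count how often the bus lands between 2:15 and 2:20.
hits = sum(1 for _ in range(n_trials) if 15 <= random.uniform(0, 30) <= 20)

estimate = hits / n_trials
exact = (20 - 15) / (30 - 0)
print(f"simulated: {estimate:.4f}, exact: {exact:.4f}")
```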
[Interactive figure: Uniform PDF with adjustable α and β; the shaded area shows P(a < X < b).]
You are monitoring a server for crashes. On average, a crash occurs once every 10 hours. You want to know: what is the probability of no crash in the next 5 hours? The Exponential distribution is built for exactly this — it models the time until the next event in a Poisson process.
If X ~ Exp(λ), then X measures the waiting time until the next event, where events happen at a constant average rate of λ per unit time. Its PDF is f(x) = λe^(−λx) for x ≥ 0.
The CDF has a beautiful closed form, with no table lookups and no numerical integration:
F(x) = P(X ≤ x) = 1 − e^(−λx) for x ≥ 0
| Property | Formula |
|---|---|
| PDF | f(x) = λe^(−λx) for x ≥ 0 |
| CDF | F(x) = 1 − e^(−λx) for x ≥ 0 |
| Expectation | E[X] = 1/λ |
| Variance | Var(X) = 1/λ² |
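The server-crash question from the start of this section can be answered straight from the CDF. The sketch below assumes time is measured in hours, so "once every 10 hours" becomes a rate of λ = 1/10 per hour. The helper name `exp_cdf` is made up for this illustration.

```python
import math


def exp_cdf(x, lam):
    """CDF of Exp(lam): F(x) = 1 - e^(-lam * x) for x >= 0, else 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0


lam = 1 / 10  # one crash per 10 hours, expressed as crashes per hour

# "No crash in the next 5 hours" means the waiting time exceeds 5.
p_no_crash_5h = 1 - exp_cdf(5, lam)

mean_wait = 1 / lam      # E[X] = 1 / lambda
variance = 1 / lam**2    # Var(X) = 1 / lambda^2

print(f"P(no crash in 5 h) = {p_no_crash_5h:.4f}")
print(f"mean wait = {mean_wait:.1f} h, variance = {variance:.1f} h^2")
```

The probability equals e^(−0.5), a little over 60%.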
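Proof of memorylessness: The memoryless property says that having already waited s time units does not change the distribution of the remaining wait: P(X > s + t | X > s) = P(X > t). Since the event {X > s + t} is contained in {X > s}, and P(X > x) = 1 − F(x) = e^(−λx):
P(X > s + t | X > s) = P(X > s + t) / P(X > s) = e^(−λ(s + t)) / e^(−λs) = e^(−λt) = P(X > t)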
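A numeric spot check of the memoryless property P(X > s + t | X > s) = P(X > t). It is not a substitute for the proof, and the values of λ, s, and t are arbitrary picks for illustration.

```python
import math


def survival(x, lam):
    """P(X > x) for X ~ Exp(lam), i.e. 1 - F(x) = e^(-lam * x)."""
    return math.exp(-lam * x)


lam, s, t = 0.1, 7.0, 5.0  # arbitrary illustrative values

# Conditional probability of waiting t more units, given s units already waited.
conditional = survival(s + t, lam) / survival(s, lam)

# Probability of waiting more than t units starting from scratch.
unconditional = survival(t, lam)

print(f"P(X > s+t | X > s) = {conditional:.6f}")
print(f"P(X > t)           = {unconditional:.6f}")
```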
Worked example — earthquakes: Major earthquakes (≥ 8.0) occur in a region at a rate of λ = 0.002 per year. What is P(earthquake in the next 4 years)?
Let Y ~ Exp(0.002). We want P(Y < 4):
P(Y < 4) = F(4) = 1 − e^(−0.002 × 4) = 1 − e^(−0.008) ≈ 0.008
About a 0.8% chance. Notice how we used the CDF directly — no integral needed.
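The earthquake figure can be reproduced in two ways: from the closed-form CDF, and from a rough simulation using `random.expovariate` from Python's standard library. The seed and sample size are arbitrary, so the simulated value is only approximate.

```python
import math
import random

lam = 0.002   # major earthquakes per year
years = 4

# Closed form: P(Y < 4) = F(4) = 1 - e^(-lam * 4)
p_quake = 1 - math.exp(-lam * years)

# Simulation: draw waiting times and count those shorter than 4 years.
random.seed(1)
n_trials = 200_000
hits = sum(1 for _ in range(n_trials) if random.expovariate(lam) < years)
p_simulated = hits / n_trials

print(f"closed form: {p_quake:.5f}, simulated: {p_simulated:.5f}")
```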
[Interactive figure: Exponential PDF with adjustable rate λ.] Higher λ means events happen more frequently, so the density is concentrated near zero.
The single most important continuous distribution is the Normal, also called the Gaussian. It appears everywhere — measurement errors, test scores, heights, stock returns — because of a deep mathematical result: when you sum many independent random variables, the result tends toward a Normal, regardless of what the individual variables look like. This is the Central Limit Theorem (coming in a later chapter).
A Normal random variable X ~ N(μ, σ²) is parameterized by its mean μ and variance σ². Its PDF is the famous bell curve:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
When x equals the mean μ, the exponent is zero and e⁰ = 1, so the PDF is at its maximum. As x moves away from μ in either direction, the density drops off symmetrically, forming the bell shape.
| Property | Value |
|---|---|
| Expectation | E[X] = μ |
| Variance | Var(X) = σ² |
| Symmetry | f(μ + d) = f(μ − d) for all d |
| 68-95-99.7 rule | 68.3% within 1σ, 95.4% within 2σ, 99.7% within 3σ |
There is no closed-form CDF for the Normal. You cannot write down a simple formula for ∫_{−∞}^{x} f(t) dt. Instead, we transform to the Standard Normal and look up values in a table (or use software). The next section covers this transformation.
Worked example (the 68% rule): What fraction of a Normal lies within one standard deviation of its mean?
Converting to the Standard Normal (the next section shows how):
P(μ − σ < X < μ + σ) = P(−1 < Z < 1) = Φ(1) − Φ(−1) = 2Φ(1) − 1 ≈ 2(0.8413) − 1 ≈ 0.683
This holds for every Normal — regardless of μ and σ. About two-thirds of the probability mass always lies within one standard deviation of the mean.
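The claim that the fraction does not depend on μ or σ is easy to test numerically. The sketch below uses `statistics.NormalDist` from the Python standard library, which takes the standard deviation (not the variance) as its second argument. The (μ, σ) pairs are arbitrary examples.

```python
from statistics import NormalDist


def mass_within(mu, sigma, k):
    """P(mu - k*sigma < X < mu + k*sigma) for X ~ N(mu, sigma^2)."""
    X = NormalDist(mu, sigma)
    return X.cdf(mu + k * sigma) - X.cdf(mu - k * sigma)


# Arbitrary (mu, sigma) pairs chosen only for illustration.
for mu, sigma in [(0, 1), (3, 4), (-10, 0.5), (100, 15)]:
    fractions = [mass_within(mu, sigma, k) for k in (1, 2, 3)]
    print(f"mu={mu:>5}, sigma={sigma:>4}:", [f"{p:.4f}" for p in fractions])
```

Every row should show the same three fractions, matching the 68-95-99.7 rule.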
[Interactive figure: Normal PDF with adjustable μ and σ, which control the location and spread of the bell curve; the shaded region shows the area within one standard deviation (≈68.3%).]
The Standard Normal is a Normal with mean 0 and variance 1: Z ~ N(0, 1). Its CDF is so important it gets its own Greek letter: Φ(z). Every Normal probability question ultimately reduces to looking up values of Φ.
The key trick: for any Normal X ~ N(μ, σ²), you can transform it to the Standard Normal by subtracting the mean and dividing by the standard deviation:
Z = (X − μ) / σ, so that P(X ≤ x) = Φ((x − μ) / σ)
This works because of the linear transform property: if X ~ N(μ, σ²) and Y = aX + b, then Y ~ N(aμ + b, a²σ²). Setting a = 1/σ and b = −μ/σ gives Z ~ N(0, 1).
The value (x − μ) / σ is called the z-score. It measures "how many standard deviations is x away from the mean?" A z-score of +2 means "two standard deviations above the mean." A z-score of −1.5 means "one and a half standard deviations below."
Φ has a useful symmetry: Φ(−z) = 1 − Φ(z). This means you only need the table for positive z.
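A small sketch of the z-score transform and the symmetry of Φ, again using `statistics.NormalDist` from the standard library. The function names `z_score` and `phi` are made up for this illustration.

```python
from statistics import NormalDist

STANDARD = NormalDist(0, 1)


def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma


def phi(z):
    """Standard Normal CDF."""
    return STANDARD.cdf(z)


# Symmetry: phi(-z) and 1 - phi(z) should agree for any z.
for z in (0.25, 0.75, 1.5, 2.0):
    print(f"z = {z}: phi(-z) = {phi(-z):.4f}, 1 - phi(z) = {1 - phi(z):.4f}")

# Standardizing gives the same answer as using N(3, 16) directly (sigma = 4).
direct = NormalDist(3, 4).cdf(5)
via_z = phi(z_score(5, 3, 4))
print(f"direct = {direct:.4f}, via z-score = {via_z:.4f}")
```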
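Worked example 1: X ~ N(3, 16), so μ = 3 and σ = 4. What is P(X > 0)?
P(X > 0) = P(Z > (0 − 3)/4) = P(Z > −0.75) = 1 − Φ(−0.75) = Φ(0.75) ≈ 0.7734
Worked example 2: Same X ~ N(3, 16). What is P(2 < X < 5)?
P(2 < X < 5) = P(−0.25 < Z < 0.5) = Φ(0.5) − Φ(−0.25) = 0.6915 − (1 − 0.5987) ≈ 0.2902
Worked example 3 (noisy communication): You send voltage +2 for bit 1 and −2 for bit 0. The received signal is R = X + Y where Y ~ N(0, 1) is noise. The decoder says "1" if R ≥ 0.5. If we send bit 1 (X = 2), what is P(decoding error)?
An error occurs when the decoder reads 0, that is, when R < 0.5:
P(error) = P(2 + Y < 0.5) = P(Y < −1.5) = Φ(−1.5) = 1 − Φ(1.5) = 1 − 0.9332 ≈ 0.0668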
About a 6.7% error rate per bit.
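The three worked examples can be checked in a few lines. This sketch uses `statistics.NormalDist` from the standard library and, as with `scipy.stats.norm`, passes the standard deviation rather than the variance. The variable names are made up for illustration.

```python
from statistics import NormalDist

X = NormalDist(3, 4)  # X ~ N(3, 16): mean 3, standard deviation 4

# Example 1: P(X > 0)
p_above_zero = 1 - X.cdf(0)

# Example 2: P(2 < X < 5)
p_between = X.cdf(5) - X.cdf(2)

# Example 3: bit 1 is sent as +2, so R = 2 + Y with Y ~ N(0, 1),
# i.e. R ~ N(2, 1). A decoding error means R < 0.5.
R = NormalDist(2, 1)
p_error = R.cdf(0.5)

print(f"P(X > 0)     = {p_above_zero:.4f}")
print(f"P(2 < X < 5) = {p_between:.4f}")
print(f"P(error)     = {p_error:.4f}")
```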
| z | Φ(z) | z | Φ(z) |
|---|---|---|---|
| 0.00 | 0.5000 | 1.00 | 0.8413 |
| 0.25 | 0.5987 | 1.50 | 0.9332 |
| 0.50 | 0.6915 | 2.00 | 0.9772 |
| 0.75 | 0.7734 | 2.50 | 0.9938 |
In Python, use scipy.stats.norm.cdf(x, mu, sigma). Note: the argument after the mean is the standard deviation σ, not the variance σ². If X ~ N(3, 16), call norm.cdf(x, 3, 4): pass σ = 4, not σ² = 16.
Consider X ~ Bin(10000, 0.5). What is P(X > 5500)? The exact formula requires summing 4500 terms, each a binomial coefficient multiplied by 0.5^10000. This is computationally brutal. But if you squint at the shape of Bin(10000, 0.5), it looks almost exactly like a bell curve.
This is not a coincidence. When n is large and p is not too extreme, a Binomial is well-approximated by a Normal with matching mean and variance:
X ~ Bin(n, p) is approximated by Y ~ N(np, np(1 − p))
The rule of thumb: use the Normal approximation when np(1 − p) > 10.
Because the Normal is continuous and the Binomial is discrete, we apply a continuity correction — shifting by 0.5 to account for the "width" of each discrete bar:
| Discrete question | Continuous equivalent |
|---|---|
| P(X = 6) | P(5.5 < Y < 6.5) |
| P(X ≥ 6) | P(Y > 5.5) |
| P(X > 6) | P(Y > 6.5) |
| P(X < 6) | P(Y < 5.5) |
| P(X ≤ 6) | P(Y < 6.5) |
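To see what the continuity correction buys, the sketch below compares the exact Binomial answer with the Normal approximation, with and without the 0.5 shift. The values n = 50, p = 0.5, and k = 28 are arbitrary illustrative choices that satisfy np(1 − p) > 10.

```python
import math
from statistics import NormalDist

n, p, k = 50, 0.5, 28  # illustrative values; n*p*(1-p) = 12.5 > 10
Y = NormalDist(n * p, math.sqrt(n * p * (1 - p)))


def binom_pmf(i):
    """Exact P(X = i) for X ~ Bin(n, p)."""
    return math.comb(n, i) * p**i * (1 - p) ** (n - i)


# P(X <= k): exact sum versus the two Normal approximations.
exact_le = sum(binom_pmf(i) for i in range(0, k + 1))
with_cc = Y.cdf(k + 0.5)   # P(Y < k + 0.5), continuity corrected
without_cc = Y.cdf(k)      # P(Y < k), no correction

# P(X = k): the bar of width 1 centred on k.
exact_eq = binom_pmf(k)
approx_eq = Y.cdf(k + 0.5) - Y.cdf(k - 0.5)

print(f"P(X <= {k}): exact {exact_le:.4f}, corrected {with_cc:.4f}, uncorrected {without_cc:.4f}")
print(f"P(X  = {k}): exact {exact_eq:.4f}, corrected {approx_eq:.4f}")
```

In this case the corrected value sits much closer to the exact sum than the uncorrected one.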
Worked example 1 — A/B test: 100 visitors see a new website design. Let X = number who spend more time. If the design has no effect, each visitor is a fair coin flip: X ~ Bin(100, 0.5). The CEO endorses the change if X ≥ 65. What is P(CEO endorses | no effect)?
Compute the parameters: E[X] = np = 50. Var(X) = np(1−p) = 25. So σ = 5.
Normal approximation: Y ~ N(50, 25). With continuity correction:
P(X ≥ 65) ≈ P(Y > 64.5) = P(Z > (64.5 − 50)/5) = P(Z > 2.9) = 1 − Φ(2.9) ≈ 0.0019
Less than 0.2%. Very unlikely to get 65+ by pure chance — which is the whole point of statistical significance.
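With n = 100 the exact Binomial tail is cheap to compute, so the approximation can be compared against it directly. This is a sketch using only the standard library. The exact sum relies on p = 0.5, so every outcome has probability 1/2^100.

```python
import math
from statistics import NormalDist

n, p, threshold = 100, 0.5, 65
mu = n * p                           # 50
sigma = math.sqrt(n * p * (1 - p))   # 5

# Normal approximation with continuity correction: P(X >= 65) ~ P(Y > 64.5)
approx = 1 - NormalDist(mu, sigma).cdf(threshold - 0.5)

# Exact tail: count the outcomes with at least 65 successes, divide by 2^100.
exact = sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2**n

print(f"normal approximation: {approx:.5f}")
print(f"exact binomial tail:  {exact:.5f}")
```

Both values should come out below 0.2%, in line with the conclusion above.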
Worked example 2 — Stanford admissions: Stanford accepts 2480 students, each with a 68% chance of attending. Let X ~ Bin(2480, 0.68). What is P(X > 1745)?
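E[X] = 2480 × 0.68 = 1686.4. Var(X) = 2480 × 0.68 × 0.32 = 539.648. σ = √539.648 ≈ 23.23.
With continuity correction, using Y ~ N(1686.4, 539.648):
P(X > 1745) ≈ P(Y > 1745.5) = P(Z > (1745.5 − 1686.4)/23.23) ≈ P(Z > 2.54) = 1 − Φ(2.54) ≈ 0.0055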
About 0.55%. Stanford can be fairly confident that enrollment will not exceed 1745.
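For a cross-check, the exact Binomial tail can be summed in log space to avoid floating-point underflow, since terms like 0.32^734 are too small for a float. This is a sketch, not a tuned implementation. The helper `log_binom_pmf` is a made-up name and uses `math.lgamma` for the log of the binomial coefficient.

```python
import math
from statistics import NormalDist

n, p, cutoff = 2480, 0.68, 1745
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

# Normal approximation with continuity correction: P(X > 1745) ~ P(Y > 1745.5)
approx = 1 - NormalDist(mu, sigma).cdf(cutoff + 0.5)


def log_binom_pmf(k):
    """log P(X = k) for X ~ Bin(n, p), computed with log-gamma."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


# Exact tail: sum P(X = k) for k = 1746, ..., 2480.
exact = sum(math.exp(log_binom_pmf(k)) for k in range(cutoff + 1, n + 1))

print(f"normal approximation: {approx:.5f}")
print(f"exact binomial tail:  {exact:.5f}")
```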
Time to bring everything together. The interactive explorer below lets you switch between all three continuous distributions from this chapter — Uniform, Exponential, and Normal — and see their PDF and CDF side by side. Adjust the parameters and watch both curves update in real time.
Use this to build visual intuition: How does increasing λ in the Exponential squeeze the density toward zero? How does increasing σ in the Normal flatten and widen the bell? How does the CDF always climb from 0 to 1?
[Interactive figure: distribution explorer. The left panel shows the PDF and the right panel shows the CDF. The shaded region on the PDF and the highlighted segment on the CDF both correspond to P(a < X < b).]
This chapter moved us from the discrete world into the continuous world. The mechanics changed (integrals replace sums, density replaces mass) but the conceptual framework is the same: distributions, expectations, variances, CDFs. Here is what connects to what:
Looking back:
• Chapter 7 gave us discrete distributions (Bernoulli, Binomial, Poisson). Every idea here is the continuous analog.
• The Exponential is the continuous companion to the Poisson — same process, different question.
• The Normal approximation to the Binomial bridges discrete and continuous.
Looking ahead:
• Chapter 9 introduces joint distributions for multiple continuous RVs.
• Chapter 10 covers the Central Limit Theorem, which explains why the Normal appears everywhere.
• The Normal distribution becomes the foundation for statistical inference, confidence intervals, and hypothesis testing.
| Distribution | Use when | Key parameter |
|---|---|---|
| Uniform(α, β) | All values in a range equally likely | Range [α, β] |
| Exponential(λ) | Time until next event (Poisson process) | Rate λ |
| Normal(μ, σ²) | Sums of many independent factors | Mean μ, variance σ² |
What comes next: So far, every random variable has lived alone. But real problems involve multiple quantities at once — height and weight, temperature and humidity, your score and mine. Chapter 9 introduces joint distributions, where two or more random variables live in the same probability space. Joint PDFs, marginals, conditional densities, and covariance all extend naturally from what you have learned here.