Ch 3: Probability & Information Theory — Goodfellow Deep Learning

Chapter 0: Why Probability?

You show a neural network a blurry photo. Is it a cat or a dog? The network cannot be certain. It should say "82% cat, 18% dog" — a probability distribution over possible answers. If it simply said "cat" with no measure of confidence, you would not know when to trust it.

Deep learning is inherently probabilistic. Training data is a random sample. Dropout randomly silences neurons. The loss function is usually the negative log-probability of the true label. Even the final prediction is a probability distribution output by softmax.

Three sources of uncertainty in ML: (1) Inherent stochasticity — the world itself is noisy. (2) Incomplete observability — we cannot see everything. (3) Incomplete modeling — our model is a simplification. Probability gives us a principled framework for handling all three.

Random Variables: quantities whose values are uncertain

↓

Distributions: PMFs and PDFs describe the likelihood of each value

↓

Bayes' Rule: update beliefs given new evidence

↓

Information Theory: quantify surprise, compare distributions

Why do neural networks output probability distributions instead of single hard predictions?

(a) Because it is computationally cheaper
(b) Because uncertainty is inherent in the world and model, and probabilities communicate confidence
(c) Because they always need exactly two outputs

Chapter 1: Random Variables

A random variable is a variable whose value is determined by a random process. Following the book's notation, the random variable itself is written as a plain lowercase letter (x), and a particular value it takes is written in italic lowercase. It can be discrete (taking values in a finite or countably infinite set, like the outcome of a die roll) or continuous (taking any real value in a range, like a person's height).

A probability distribution describes how likely each value is. For discrete variables, we use a probability mass function (PMF), written P(x). For continuous variables, we use a probability density function (PDF), written p(x). The key requirement: all probabilities must sum (or integrate) to 1.

Discrete: ∑_x P(x) = 1
Continuous: ∫ p(x) dx = 1

Marginal probability: If x and y are jointly distributed, the marginal over x sums out y: P(x) = ∑_y P(x, y). This is called the sum rule. It lets us recover the distribution of one variable when we know the joint distribution of both.

Conditional probability describes how the distribution of x changes when we know y: P(x | y) = P(x, y) / P(y), defined whenever P(y) > 0. This is the product rule P(x, y) = P(x | y) P(y) rearranged. In deep learning, the output distribution P(label | image) is a conditional probability.
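The sum rule and the conditional-probability formula above can be sketched on a small joint table. This is only an illustration: the weather/commute variables and their probabilities are made up, and the helper names `marginal_x` and `conditional_x_given_y` are hypothetical.

```python
# Hypothetical joint distribution P(x, y): x = weather, y = commute outcome.
# The numbers are invented for illustration and sum to 1.
joint = {
    ("rain", "late"): 0.15,
    ("rain", "on_time"): 0.15,
    ("sun", "late"): 0.07,
    ("sun", "on_time"): 0.63,
}


def marginal_x(joint):
    """Sum rule: P(x) = sum over y of P(x, y)."""
    out = {}
    for (x, _y), p in joint.items():
        out[x] = out.get(x, 0.0) + p
    return out


def conditional_x_given_y(joint, y):
    """P(x | y) = P(x, y) / P(y), defined only when P(y) > 0."""
    p_y = sum(p for (_x, y2), p in joint.items() if y2 == y)
    if p_y == 0:
        raise ValueError("P(y) is zero, so P(x | y) is undefined")
    return {x: p / p_y for (x, y2), p in joint.items() if y2 == y}


p_x = marginal_x(joint)                                  # rain: 0.30, sun: 0.70
p_x_given_late = conditional_x_given_y(joint, "late")    # rain: 0.15 / 0.22
```

Knowing the commute was late raises the probability of rain from 0.30 to roughly 0.68, which is exactly the kind of shift a conditional distribution describes.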

What is the difference between a PMF and a PDF?

(a) PMF is for discrete random variables (probabilities sum to 1); PDF is for continuous ones (density integrates to 1)
(b) PMF is always uniform; PDF can have peaks
(c) There is no difference

Chapter 2: Distributions in Action

Let us make this concrete. Suppose you roll a fair die. The PMF is P(x = k) = 1/6 for k = 1, 2, ..., 6. Each outcome is equally likely. Now suppose the die is loaded — maybe P(x = 6) = 1/2 and the other faces share the remaining 1/2. The PMF changes, but still sums to 1.
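As a quick check of the two dice, here is a minimal sketch using exact fractions. One assumption goes beyond the text: the loaded die's remaining 1/2 is split equally across faces 1 to 5 (1/10 each); any other split that sums to 1/2 would be equally valid.

```python
from fractions import Fraction

# Fair die: P(x = k) = 1/6 for k = 1..6.
fair = {k: Fraction(1, 6) for k in range(1, 7)}

# Loaded die: P(x = 6) = 1/2. ASSUMPTION: faces 1..5 share the rest equally.
loaded = {k: Fraction(1, 10) for k in range(1, 6)}
loaded[6] = Fraction(1, 2)

# Both PMFs must sum to exactly 1.
fair_total = sum(fair.values())
loaded_total = sum(loaded.values())

# The loading shifts the average roll upward.
fair_mean = sum(k * p for k, p in fair.items())      # 7/2
loaded_mean = sum(k * p for k, p in loaded.items())  # 9/2
```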

Interactive figure: PMF vs PDF. Left: a discrete PMF (bar heights = probabilities). Right: a continuous PDF (area under curve = probability). A skew slider (default 0.0) reshapes the distribution.

For a continuous PDF, p(x) can be greater than 1 at specific points (it is a density, not a probability). What matters is that the area under the curve between two points gives the probability of falling in that interval: P(a ≤ x ≤ b) = ∫_a^b p(x) dx.

Key insight: For continuous variables, the probability of any single exact value is zero. Only intervals have nonzero probability. This is why we work with densities rather than point probabilities.
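A density above 1 is easy to produce. The sketch below uses a uniform distribution on [0, 0.5] (an example chosen here, not taken from the text), whose density is 2 everywhere on its support, and integrates it with a simple midpoint rule. The helper names are hypothetical.

```python
def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of a uniform distribution on [lo, hi]: 1 / (hi - lo) inside, 0 outside."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h


density_at_point = uniform_pdf(0.25)                 # 2.0, a density greater than 1
prob_interval = integrate(uniform_pdf, 0.1, 0.3)     # about 0.4 = P(0.1 <= x <= 0.3)
total_mass = integrate(uniform_pdf, 0.0, 0.5)        # about 1.0
```

The density is 2, yet every interval probability stays at or below 1 because the support is only 0.5 wide.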

Can a probability density function have values greater than 1?

(a) Yes: it is a density, not a probability; only the integral over an interval gives a probability
(b) No: probabilities must be between 0 and 1
(c) Only for uniform distributions

Chapter 3: Bayes' Rule

Bayes' rule lets us invert conditional probabilities. We know P(symptom | disease) from medical data. But the doctor needs P(disease | symptom) — the reverse. Bayes' rule bridges the gap:

P(y | x) = P(x | y) P(y) / P(x)

P(y) is the prior: what we believed before seeing evidence. P(x | y) is the likelihood: how likely the evidence is under each hypothesis. P(y | x) is the posterior: our updated belief after seeing evidence. P(x) = ∑_y P(x | y) P(y) is the evidence; dividing by it normalizes the posterior so that it sums to 1 over y.
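To see the disease/symptom inversion with numbers, here is a minimal sketch for a binary hypothesis. The prevalence and symptom rates are invented for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood_pos, likelihood_neg):
    """P(disease | symptom) via Bayes' rule for a yes/no hypothesis.

    prior          = P(disease)
    likelihood_pos = P(symptom | disease)
    likelihood_neg = P(symptom | no disease)
    """
    # Evidence P(symptom), obtained by summing over both hypotheses.
    evidence = likelihood_pos * prior + likelihood_neg * (1.0 - prior)
    return likelihood_pos * prior / evidence


# Made-up numbers: 1% prevalence, symptom in 90% of cases, 5% of non-cases.
p_disease_given_symptom = posterior(prior=0.01, likelihood_pos=0.9, likelihood_neg=0.05)
# about 0.154
```

Even though the symptom appears in 90% of cases, the posterior is only about 15% because the prior is so low. That gap is the reason the doctor cannot read P(disease | symptom) directly off P(symptom | disease).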

Deep learning connection: Maximum likelihood training finds model parameters θ that maximize P(data | θ). Bayesian deep learning treats θ as a random variable and uses Bayes' rule: P(θ | data) ∝ P(data | θ) P(θ). In maximum a posteriori (MAP) estimation, the log-prior log P(θ) is added to the log-likelihood and acts as a regularizer. In particular, L2 weight decay corresponds to a zero-mean Gaussian prior on the weights.
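The link between a Gaussian prior and weight decay can be written out in a few lines. This is a sketch under stated assumptions: the prior is an isotropic zero-mean Gaussian with variance τ² (the symbols τ and λ are introduced here and do not appear in the text), and we take the MAP point estimate rather than the full posterior.

```latex
\begin{aligned}
\theta_{\text{MAP}}
  &= \arg\max_{\theta}\; \log P(\text{data} \mid \theta) + \log P(\theta) \\
\log P(\theta)
  &= -\frac{1}{2\tau^{2}} \lVert \theta \rVert_2^{2} + \text{const}
  \quad \text{for } P(\theta) = \mathcal{N}(\theta;\, 0,\, \tau^{2} I) \\
\theta_{\text{MAP}}
  &= \arg\min_{\theta}\; -\log P(\text{data} \mid \theta) + \lambda \lVert \theta \rVert_2^{2},
  \qquad \lambda = \frac{1}{2\tau^{2}}
\end{aligned}
```

A tighter prior (smaller τ) gives a larger λ, that is, stronger weight decay.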

Two variables are independent if knowing one tells you nothing about the other: P(x, y) = P(x) P(y). They are conditionally independent given z if P(x, y | z) = P(x | z) P(y | z). Conditional independence is used everywhere — naive Bayes, graphical models, and many neural network assumptions.
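The factorization test P(x, y) = P(x) P(y) can be checked mechanically on a joint table. The sketch below is illustrative only: `is_independent` is a hypothetical helper, and the two coin examples are made up.

```python
from itertools import product


def is_independent(joint, tol=1e-12):
    """Return True if P(x, y) = P(x) * P(y) for every (x, y) pair.

    `joint` maps (x, y) tuples to probabilities; missing pairs count as 0.
    """
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x, y in product(xs, ys)
    )


# Two separate fair coins: every pair has probability 1/4 = 1/2 * 1/2.
two_coins = {(x, y): 0.25 for x in "HT" for y in "HT"}

# One coin observed twice: the second value always copies the first.
copied_coin = {("H", "H"): 0.5, ("T", "T"): 0.5}

coins_independent = is_independent(two_coins)     # True
copy_independent = is_independent(copied_coin)    # False
```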

In Bayes' rule, what role does the prior P(y) play?

(a) It is the probability of the evidence
(b) It measures how well the model fits the data
(c) It encodes our belief about y before seeing any evidence

Chapter 4: Common Distributions

A few distributions appear everywhere in deep learning. Knowing them is like knowing the standard library of a programming language.

The Bernoulli distribution models a single binary outcome (coin flip): P(x = 1) = φ and P(x = 0) = 1 − φ. The Categorical distribution generalizes this to more than two classes; a softmax layer outputs the parameters of a Categorical distribution over the classes.
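Here is a minimal sketch of how raw scores become a Categorical distribution. The logits and the true label are made-up values, and subtracting the maximum logit is a standard numerical-stability trick rather than something stated in the text.

```python
import math


def softmax(logits):
    """Map real-valued scores to Categorical probabilities that sum to 1."""
    m = max(logits)                       # subtract the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]                 # hypothetical scores for 3 classes
probs = softmax(logits)                   # largest logit gets the largest probability

true_label = 0                            # hypothetical ground-truth class index
nll = -math.log(probs[true_label])        # negative log-probability of the true label
```

The last line is the per-example classification loss mentioned earlier: it is small when the model puts high probability on the true class and grows without bound as that probability approaches 0.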

The Gaussian (Normal) distribution is the workhorse of continuous probability:

N(x; μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))

It is parameterized by mean μ (center) and variance σ² (spread). The Central Limit Theorem says that the suitably normalized sum of many independent random variables with finite variance converges in distribution to a Gaussian, which is one reason noise in data and measurement errors are often modeled as Gaussian. Gaussian draws are also a common choice for weight initialization.
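The density formula and the Central Limit Theorem can both be probed with the standard library. In this sketch the choice of Uniform(0, 1) summands, 30 terms per sum, 20,000 trials, and the seed are all arbitrary assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(x; mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def standardized_uniform_sum(n, rng):
    """Sum n Uniform(0, 1) draws, then center by n/2 and scale by sqrt(n/12)."""
    s = sum(rng.random() for _ in range(n))
    return (s - n / 2.0) / math.sqrt(n / 12.0)


rng = random.Random(0)                    # fixed seed so the run is repeatable
samples = [standardized_uniform_sum(30, rng) for _ in range(20_000)]

# For a standard Gaussian, about 68.3% of the mass lies within one
# standard deviation of the mean; the standardized sums should be close.
frac_within_1sd = sum(abs(z) <= 1.0 for z in samples) / len(samples)

peak_density = gaussian_pdf(0.0)          # 1 / sqrt(2 pi), about 0.399
```

The individual summands are flat, not bell-shaped, yet their standardized sum already behaves much like a standard Gaussian.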

Interactive figure: Gaussian Distribution. Adjust the mean μ (default 0.0) and standard deviation σ (default 1.0) to see how the bell curve changes.

Other important distributions: The Exponential distribution models nonnegative quantities such as the time between events. The Laplace distribution has heavier tails than a Gaussian with the same variance, and as a prior on the weights it corresponds to L1 regularization. The Dirac delta puts all mass at one point; it is used to build the empirical distribution, which places mass 1/m on each of the m training points.

Why is the Gaussian distribution so common in deep learning?

(a) It is the simplest distribution to compute
(b) The Central Limit Theorem: sums of many independent variables converge to Gaussian
(c) It only has positive values

Chapter 5: Expectation & Variance

The expectation (or expected value) of a function f(x) under distribution P is the probability-weighted average:

Discrete: E_x~P[f(x)] = ∑_x P(x) f(x)
Continuous: E_x~p[f(x)] = ∫ p(x) f(x) dx

When f(x) = x, the expectation is the mean — the center of mass of the distribution. The variance measures spread: Var(x) = E[(x − E[x])²]. Its square root is the standard deviation σ.

The covariance measures how two variables move together: Cov(x, y) = E[(x − E[x])(y − E[y])]. Positive covariance means they increase together; negative means one increases while the other decreases. The covariance matrix collects all pairwise covariances.
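The three definitions above can be evaluated exactly for a fair die. The sketch uses exact fractions; the helper names are hypothetical, and the covariance example pairs the top face x with the bottom face 7 − x (opposite faces of a standard die sum to 7), so the two values move in opposite directions.

```python
from fractions import Fraction


def expectation(pmf, f=lambda x: x):
    """E[f(x)] = sum over x of P(x) * f(x)."""
    return sum(p * f(x) for x, p in pmf.items())


def variance(pmf):
    """Var(x) = E[(x - E[x])^2]."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)


def covariance(joint):
    """Cov(x, y) = E[(x - E[x]) (y - E[y])] for a joint PMF keyed by (x, y)."""
    mean_x = sum(p * x for (x, _y), p in joint.items())
    mean_y = sum(p * y for (_x, y), p in joint.items())
    return sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in joint.items())


die = {k: Fraction(1, 6) for k in range(1, 7)}
die_mean = expectation(die)          # 7/2
die_var = variance(die)              # 35/12

# Top face x and bottom face y = 7 - x of the same roll.
top_bottom = {(k, 7 - k): Fraction(1, 6) for k in range(1, 7)}
cov_top_bottom = covariance(top_bottom)   # -35/12, a negative covariance
```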

Why this matters: The loss function in deep learning IS an expectation: L = E_(x,y)~data[loss(model(x), y)]. Training minimizes this expectation. Since we cannot compute it over all possible data, we approximate with a mini-batch average — that is Monte Carlo estimation of the expectation.
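Mini-batch averaging as Monte Carlo estimation can be sketched without any model at all. Everything here is a stand-in: the per-example losses are random numbers playing the role of loss(model(x), y), and the dataset size, batch sizes, and seed are arbitrary.

```python
import random

rng = random.Random(0)                    # fixed seed so the run is repeatable

# Hypothetical per-example losses standing in for loss(model(x), y).
dataset_losses = [rng.uniform(0.0, 2.0) for _ in range(100_000)]

# The quantity we actually care about: the average loss over all the data.
full_mean = sum(dataset_losses) / len(dataset_losses)


def minibatch_estimate(losses, batch_size, rng):
    """Monte Carlo estimate of the mean loss from one random mini-batch."""
    batch = rng.sample(losses, batch_size)
    return sum(batch) / batch_size


small_batch = minibatch_estimate(dataset_losses, 32, rng)      # noisy estimate
large_batch = minibatch_estimate(dataset_losses, 4096, rng)    # much less noisy
```

Both estimates are unbiased, but the error shrinks roughly as one over the square root of the batch size, so larger batches buy accuracy at a steadily worsening price.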

Why does deep learning training use mini-batch averages of the loss?

(a) To save memory only
(b) The true loss is an expectation over all data, and the mini-batch average is a practical approximation
(c) Because individual sample losses are always zero

Chapter 6: Information Theory

How much "surprise" does an event carry? If it rains in Seattle in November, you are not very surprised. If it rains in the Sahara, you are extremely surprised. Information theory quantifies this intuition.

The self-information of an event x with probability P(x) is I(x) = −log P(x). With the natural logarithm it is measured in nats; with base 2 it is measured in bits. Rare events have high information; certain events have zero. The entropy H of a distribution is the expected self-information:

H(P) = −∑_x P(x) log P(x) = E_x~P[−log P(x)]

Entropy measures the average "surprise" or uncertainty in a distribution. Among all coins, a fair coin has maximum entropy (1 bit with a base-2 logarithm, or ln 2 ≈ 0.69 nats); a loaded coin that always lands heads has zero entropy.
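The coin examples can be checked directly. This is a minimal sketch: `entropy` is a hypothetical helper, terms with P(x) = 0 are skipped using the usual convention 0 log 0 = 0, and the 70/30 coin is an extra example not taken from the text.

```python
import math


def entropy(probs, base=2.0):
    """H(P) = -sum of P(x) log P(x); base 2 gives bits, base e gives nats."""
    # Terms with p == 0 contribute nothing (convention: 0 * log 0 = 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)


fair_coin_bits = entropy([0.5, 0.5])              # 1 bit, the maximum for a coin
always_heads_bits = entropy([1.0, 0.0])           # 0 bits, no uncertainty
biased_coin_bits = entropy([0.7, 0.3])            # about 0.881 bits
fair_coin_nats = entropy([0.5, 0.5], base=math.e) # ln 2, about 0.693 nats
```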

The KL divergence measures how different two distributions are:

D_KL(P || Q) = E_x~P[log P(x) − log Q(x)]

KL divergence is always ≥ 0, and equals 0 only when P = Q. It is not symmetric: in general, D_KL(P || Q) ≠ D_KL(Q || P).
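The asymmetry is easy to see numerically. The sketch below uses two coins with P(heads) = 0.5 and Q(heads) = 0.7 and works in nats. `kl_divergence` is a hypothetical helper, and returning infinity when Q assigns zero probability to an outcome that P allows is the usual convention rather than something stated in the text.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum of P(x) * (log P(x) - log Q(x)), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue                      # convention: 0 * log(0 / q) = 0
        if qi == 0:
            return math.inf               # Q rules out an outcome that P allows
        total += pi * (math.log(pi) - math.log(qi))
    return total


P = [0.5, 0.5]                            # fair coin
Q = [0.7, 0.3]                            # biased coin

kl_pq = kl_divergence(P, Q)               # about 0.0872 nats
kl_qp = kl_divergence(Q, P)               # about 0.0823 nats
kl_pp = kl_divergence(P, P)               # 0.0, identical distributions
```

The two directions give different numbers because each one averages the log-ratio under a different distribution.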

The cross-entropy loss is KL divergence in disguise. When we minimize cross-entropy H(P, Q) = −∑ P(x) log Q(x) between the true labels P and the model's predictions Q, we are minimizing D_KL(P || Q) plus a constant (the entropy of P). Since P is fixed, minimizing cross-entropy = minimizing KL divergence = making Q match P.
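The "plus a constant" claim follows in one line by adding and subtracting the entropy term. The derivation uses only the definitions of H(P), H(P, Q), and D_KL(P || Q) given above, written for the discrete case.

```latex
\begin{aligned}
H(P, Q)
  &= -\sum_x P(x) \log Q(x) \\
  &= -\sum_x P(x) \log P(x) \;+\; \sum_x P(x) \bigl[\log P(x) - \log Q(x)\bigr] \\
  &= H(P) + D_{\mathrm{KL}}(P \,\|\, Q)
\end{aligned}
```

Since H(P) does not depend on Q, any Q that lowers the cross-entropy lowers the KL divergence by the same amount.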

Interactive figure: Entropy & KL Divergence. Adjust the probability of heads for two coins, P(heads) (default 0.50) and Q(heads) (default 0.70), and watch entropy and KL divergence change.

Why is minimizing cross-entropy loss equivalent to making the model's predictions match the true distribution?

(a) Because cross-entropy equals KL divergence plus a constant, so minimizing cross-entropy minimizes KL divergence
(b) Because cross-entropy always equals zero
(c) Because cross-entropy only measures the mean

Chapter 7: Distribution Playground

This playground lets you explore the major probability distributions used in deep learning, see how their parameters affect their shapes, and watch entropy and KL divergence change in real time.

Interactive figure: Distribution Explorer. Pick a distribution and adjust its two parameters. The curve shows the PDF/PMF, with mean and standard deviation annotated. Default readout: Gaussian, mean = 0.0, std = 1.0, entropy = 1.42 nats.

Experiments to try: (1) Gaussian: set σ very small — the distribution becomes a spike (low entropy). (2) Laplace: compare to Gaussian with the same σ — Laplace has heavier tails. (3) Beta with both params < 1: the distribution becomes U-shaped, concentrating at the extremes.
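Experiments (1) and (2) can also be reproduced on paper. The sketch below relies on standard closed forms that are not given in the text: the differential entropy of a Gaussian, 0.5 ln(2πeσ²), and the two-sided tail probabilities of a Gaussian and of a Laplace distribution with matching standard deviation σ (Laplace scale b = σ/√2). The function names are hypothetical.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2 pi e sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)


def gaussian_tail(k):
    """P(|x - mu| > k * sigma) for a Gaussian."""
    return math.erfc(k / math.sqrt(2.0))


def laplace_tail(k):
    """P(|x - mu| > k * sigma) for a Laplace with standard deviation sigma.

    Its variance is 2 b^2, so b = sigma / sqrt(2) and the tail is exp(-sqrt(2) k).
    """
    return math.exp(-math.sqrt(2.0) * k)


unit_entropy = gaussian_entropy(1.0)      # about 1.42 nats
spike_entropy = gaussian_entropy(0.01)    # negative: a very narrow spike

gauss_far = gaussian_tail(4.0)            # mass beyond 4 sigma, Gaussian
laplace_far = laplace_tail(4.0)           # mass beyond 4 sigma, Laplace (larger)
```

Two details are worth noticing. Differential entropy can go below zero for a narrow enough spike, unlike the entropy of a discrete distribution. And "heavier tails" is a statement about the far region: beyond 4σ the Laplace holds far more mass, while beyond 1σ the Gaussian actually holds slightly more.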

Which distribution has heavier tails (more probability far from the mean): Gaussian or Laplace with the same variance?

(a) Gaussian
(b) Laplace
(c) They have identical tails

Chapter 8: Connections

Probability and information theory form the second mathematical pillar of deep learning (alongside linear algebra from Chapter 2). Here is how each concept connects to the rest of the book:

Concept	Where It Appears
Probability distributions	Network outputs via softmax (Ch 6), generative models
Bayes' rule	Bayesian deep learning, posterior inference, regularization as prior (Ch 7)
Gaussian distribution	Weight initialization, VAEs, noise modeling
Cross-entropy / KL divergence	The standard classification loss function (Ch 6, 8)
Expectation	Loss = expected error; SGD approximates the expectation (Ch 8)
Maximum likelihood	The principle behind most training objectives (Ch 5, 6)

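What you should take away: Most standard loss functions in deep learning can be read as negative log-likelihoods, and therefore as quantities from information theory. Cross-entropy loss equals the KL divergence D_KL(P || Q) between the data distribution P and the model Q, plus the entropy of P, which does not depend on the model. MSE loss is, up to scaling and an additive constant, the negative log-likelihood under a Gaussian model with fixed variance. Understanding this connection lets you design custom losses and know exactly what they optimize.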
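The MSE claim can be verified by writing out the Gaussian negative log-likelihood. This is a sketch under stated assumptions: the model predicts a mean f_θ(x), the noise variance σ² is fixed and shared across the n training examples, and the symbols f_θ, σ, and n are introduced here for illustration.

```latex
\begin{aligned}
-\log \mathcal{N}\bigl(y;\, f_\theta(x),\, \sigma^{2}\bigr)
  &= \frac{\bigl(y - f_\theta(x)\bigr)^{2}}{2\sigma^{2}}
     + \frac{1}{2}\log\bigl(2\pi\sigma^{2}\bigr) \\
-\sum_{i=1}^{n} \log p(y_i \mid x_i)
  &= \frac{n}{2\sigma^{2}}\,\mathrm{MSE}(\theta)
     + \frac{n}{2}\log\bigl(2\pi\sigma^{2}\bigr),
  \qquad
  \mathrm{MSE}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^{2}
\end{aligned}
```

With σ fixed, the second term is a constant and the first is a positive multiple of the MSE, so both objectives share the same minimizer θ.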

Up next: Chapter 4: Numerical Computation — the practical concerns of implementing these mathematical ideas on real hardware with finite precision.

Minimizing MSE loss is equivalent to maximizing likelihood under which distribution assumption?

(a) Bernoulli
(b) Gaussian (the errors are normally distributed)
(c) Uniform