How neural networks reason under uncertainty. Distributions, Bayes' rule, entropy, and the math behind every loss function.
You show a neural network a blurry photo. Is it a cat or a dog? The network cannot be certain. It should say "82% cat, 18% dog" — a probability distribution over possible answers. If it simply said "cat" with no measure of confidence, you would not know when to trust it.
Deep learning is inherently probabilistic. Training data is a random sample. Dropout randomly silences neurons. For classification, the loss function is usually the negative log-probability of the true label, and the final prediction is itself a probability distribution produced by softmax.
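As a minimal sketch of the last two points, the snippet below turns raw scores into a softmax distribution and computes the negative log-probability of the true label. The logits are made-up values chosen so the output lands close to the 82% / 18% split from the blurry-photo example; the function names are illustrative and not taken from any particular library.

```python
import math

def softmax(logits):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(probs, true_index):
    """Loss for one example: -log of the probability given to the true label."""
    return -math.log(probs[true_index])

# Hypothetical raw scores (logits) for the classes [cat, dog].
logits = [2.0, 0.5]
probs = softmax(logits)

loss_if_cat = negative_log_likelihood(probs, 0)  # small: the model favored "cat"
loss_if_dog = negative_log_likelihood(probs, 1)  # large: the model was confidently wrong

print(probs, loss_if_cat, loss_if_dog)
```

The loss is small when the network puts high probability on the correct label and grows without bound as that probability approaches zero.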
A random variable is a variable whose value is determined by a random process. We write it as x (lowercase, italic). It can be discrete (taking on a finite or countable set of values, like the outcome of a die roll) or continuous (taking on any real value in a range, like a person's height).
A probability distribution describes how likely each value is. For discrete variables, we use a probability mass function (PMF), written P(x). For continuous variables, we use a probability density function (PDF), written p(x). The key requirement: all probabilities must sum (or integrate) to 1.
Conditional probability describes how the distribution of x changes when we know y: P(x | y) = P(x, y) / P(y), defined whenever P(y) > 0. This is the product rule P(x, y) = P(x | y) P(y) rearranged. In deep learning, the output distribution P(label | image) is a conditional probability.
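A small sketch of this definition, using a made-up joint distribution over weather and season (the table values and helper names are assumptions for illustration only):

```python
# Hypothetical joint distribution P(x, y): x is the weather, y is the season.
joint = {
    ("rain", "winter"): 0.30,
    ("dry", "winter"): 0.20,
    ("rain", "summer"): 0.10,
    ("dry", "summer"): 0.40,
}

def marginal_y(joint, y):
    """P(y): sum the joint over every value of x."""
    return sum(p for (_, y_value), p in joint.items() if y_value == y)

def conditional(joint, x, y):
    """P(x | y) = P(x, y) / P(y); only defined when P(y) > 0."""
    p_y = marginal_y(joint, y)
    if p_y == 0:
        raise ValueError("P(y) is zero, so P(x | y) is undefined")
    return joint[(x, y)] / p_y

p_rain_given_winter = conditional(joint, "rain", "winter")  # 0.30 / 0.50
p_rain_given_summer = conditional(joint, "rain", "summer")  # 0.10 / 0.50
print(p_rain_given_winter, p_rain_given_summer)
```

Knowing the season changes the distribution over weather, which is exactly what conditioning expresses.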
Let us make this concrete. Suppose you roll a fair die. The PMF is P(x = k) = 1/6 for k = 1, 2, ..., 6. Each outcome is equally likely. Now suppose the die is loaded — maybe P(x = 6) = 1/2 and the other faces share the remaining 1/2. The PMF changes, but still sums to 1.
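The two dice can be written down directly as PMFs. In this sketch the loaded die is assumed to split the remaining 1/2 equally among faces 1 to 5 (1/10 each); the text does not specify the split, so that choice is an assumption. Exact fractions are used so the sum-to-1 check has no rounding error.

```python
from fractions import Fraction
import random

# Fair die: every face has probability 1/6.
fair = {k: Fraction(1, 6) for k in range(1, 7)}

# Loaded die: face 6 has probability 1/2.
# Assumption: the other five faces share the remaining 1/2 equally.
loaded = {k: Fraction(1, 10) for k in range(1, 6)}
loaded[6] = Fraction(1, 2)

fair_total = sum(fair.values())      # exactly 1
loaded_total = sum(loaded.values())  # exactly 1

def sample(pmf, n, seed=0):
    """Draw n outcomes from a PMF stored as {outcome: probability}."""
    rng = random.Random(seed)
    outcomes = list(pmf)
    weights = [float(p) for p in pmf.values()]
    return rng.choices(outcomes, weights=weights, k=n)

rolls = sample(loaded, 10_000)
freq_six = rolls.count(6) / len(rolls)  # empirical frequency, close to 1/2
print(fair_total, loaded_total, freq_six)
```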
Figure: left, a discrete PMF (bar heights = probabilities); right, a continuous PDF (area under curve = probability).
For a continuous PDF, p(x) can be greater than 1 at specific points (it is a density, not a probability). What matters is that the area under the curve between two points gives the probability of falling in that interval: P(a ≤ x ≤ b) = ∫_a^b p(x) dx.
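To see a density above 1 in practice, consider a uniform distribution on the interval [0, 0.5]. This is an illustrative choice, not an example from the text. Its density is 2 everywhere on the interval, yet every probability computed from it stays at or below 1. The integral is approximated with a simple midpoint rule.

```python
def uniform_pdf(x, lo=0.0, hi=0.5):
    """Density of a uniform distribution on [lo, hi]: constant 1 / (hi - lo)."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

density = uniform_pdf(0.25)                      # 2.0, a density greater than 1
prob_interval = integrate(uniform_pdf, 0.1, 0.3) # P(0.1 <= x <= 0.3), about 0.4
total_area = integrate(uniform_pdf, -1.0, 1.0)   # whole area under the curve, about 1
print(density, prob_interval, total_area)
```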
Bayes' rule lets us invert conditional probabilities. We know P(symptom | disease) from medical data, but the doctor needs the reverse, P(disease | symptom). Bayes' rule bridges the gap:
P(y | x) = P(x | y) P(y) / P(x)
P(y) is the prior — what we believed before seeing evidence. P(x | y) is the likelihood — how likely the evidence is given each hypothesis. P(y | x) is the posterior — our updated belief after seeing evidence. P(x) normalizes everything to sum to 1.
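The disease and symptom example can be worked through numerically. The prevalence and symptom rates below are invented for illustration, and the normalizer P(x) is obtained by summing over both hypotheses (disease and no disease).

```python
def posterior(prior, likelihood, likelihood_if_absent):
    """P(disease | symptom) for a binary hypothesis via Bayes' rule.

    prior                = P(disease)
    likelihood           = P(symptom | disease)
    likelihood_if_absent = P(symptom | no disease)
    """
    # Evidence P(symptom): total probability over both hypotheses.
    evidence = likelihood * prior + likelihood_if_absent * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01
p_symptom_given_disease = 0.90
p_symptom_given_healthy = 0.05

p_disease_given_symptom = posterior(
    p_disease, p_symptom_given_disease, p_symptom_given_healthy
)
print(p_disease_given_symptom)  # roughly 0.15
```

Even with a symptom that appears in 90% of patients, the posterior stays modest because the prior is small. This is the kind of update the prior, likelihood, and posterior terms describe.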
Two variables are independent if knowing one tells you nothing about the other: P(x, y) = P(x) P(y). They are conditionally independent given z if P(x, y | z) = P(x | z) P(y | z). Conditional independence is used everywhere — naive Bayes, graphical models, and many neural network assumptions.
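The factorization test for independence can be checked mechanically on a small joint table. The two joint distributions below are toy examples, and `is_independent` is a hypothetical helper written for this sketch.

```python
import math
from itertools import product

def marginals(joint):
    """Return P(x) and P(y) from a joint PMF stored as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return p_x, p_y

def is_independent(joint, tol=1e-9):
    """True when P(x, y) = P(x) P(y) for every pair of values."""
    p_x, p_y = marginals(joint)
    return all(
        math.isclose(joint.get((x, y), 0.0), p_x[x] * p_y[y], abs_tol=tol)
        for x, y in product(p_x, p_y)
    )

# Two fair coins flipped separately: the joint factorizes.
two_coins = {(a, b): 0.25 for a in "HT" for b in "HT"}

# The second "coin" copies the first: knowing one reveals the other.
copied_coin = {("H", "H"): 0.5, ("T", "T"): 0.5}

print(is_independent(two_coins), is_independent(copied_coin))
```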
A few distributions appear everywhere in deep learning. Knowing them is like knowing the standard library of a programming language.
The Bernoulli distribution models a single binary outcome (coin flip): P(x = 1) = φ. The Categorical distribution generalizes this to multiple classes — it is the output of softmax.
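A short sketch of both distributions: the Bernoulli PMF in its usual closed form, and sampling from a Categorical distribution by walking the cumulative probabilities. The three class probabilities stand in for a softmax output and are made up for illustration.

```python
import random

def bernoulli_pmf(x, phi):
    """P(x) = phi**x * (1 - phi)**(1 - x) for x in {0, 1}."""
    return phi ** x * (1.0 - phi) ** (1 - x)

def sample_categorical(probs, rng):
    """Draw a class index by inverting the cumulative distribution."""
    u = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]  # hypothetical softmax output over three classes
draws = [sample_categorical(probs, rng) for _ in range(20_000)]
freqs = [draws.count(k) / len(draws) for k in range(len(probs))]
print(bernoulli_pmf(1, 0.3), bernoulli_pmf(0, 0.3), freqs)
```

With two classes, the Categorical distribution reduces to the Bernoulli case.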
The Gaussian (Normal) distribution is the workhorse of continuous probability:
N(x; μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
It is parameterized by the mean μ (center) and the variance σ² (spread). The Central Limit Theorem says that the suitably normalized sum of many independent random variables with finite variance is approximately Gaussian, which is one reason noise in data and measurement errors are often modeled as Gaussian. Gaussians are also a common choice for weight initialization.
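The Central Limit Theorem can be observed with a quick simulation. This sketch sums 30 uniform draws, standardizes the sum using the known mean (n/2) and variance (n/12) of that sum, and checks how often the result falls within one standard deviation of zero. The number of terms and samples are arbitrary choices.

```python
import random
import statistics

def standardized_sum(n_terms, rng):
    """Sum n uniform(0, 1) draws, then subtract the mean n/2 and divide by sqrt(n/12)."""
    total = sum(rng.random() for _ in range(n_terms))
    return (total - n_terms / 2) / (n_terms / 12) ** 0.5

rng = random.Random(0)
samples = [standardized_sum(30, rng) for _ in range(20_000)]

sample_mean = statistics.fmean(samples)   # close to 0
sample_std = statistics.pstdev(samples)   # close to 1
# A standard Gaussian puts about 68.3% of its mass within one standard deviation.
within_one_sigma = sum(abs(s) <= 1 for s in samples) / len(samples)
print(sample_mean, sample_std, within_one_sigma)
```

Although each individual term is uniform, the standardized sum already behaves much like a standard Gaussian.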
Figure: the Gaussian bell curve, shown for an adjustable mean and standard deviation.
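The expectation (or expected value) of a function f(x) under distribution P is the probability-weighted average:
E[f(x)] = Σ_x P(x) f(x) for a discrete variable, and E[f(x)] = ∫ p(x) f(x) dx for a continuous one.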
When f(x) = x, the expectation is the mean, the center of mass of the distribution. The variance measures spread: Var(x) = E[(x − E[x])²]. Its square root is the standard deviation σ.
The covariance measures how two variables move together: Cov(x, y) = E[(x − E[x])(y − E[y])]. Positive covariance means they increase together; negative means one increases while the other decreases. The covariance matrix collects all pairwise covariances.
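These three quantities follow directly from their definitions for discrete distributions. The sketch below uses the fair die from earlier and a small made-up joint table over two binary variables; the helper names are illustrative.

```python
def expectation(pmf, f=lambda x: x):
    """E[f(x)] = sum over x of P(x) f(x), for a PMF stored as {x: probability}."""
    return sum(p * f(x) for x, p in pmf.items())

def variance(pmf):
    """Var(x) = E[(x - E[x])^2]."""
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)

def covariance(joint):
    """Cov(x, y) = E[(x - E[x])(y - E[y])] for a joint PMF {(x, y): probability}."""
    mean_x = sum(p * x for (x, _), p in joint.items())
    mean_y = sum(p * y for (_, y), p in joint.items())
    return sum(p * (x - mean_x) * (y - mean_y) for (x, y), p in joint.items())

die = {k: 1 / 6 for k in range(1, 7)}
die_mean = expectation(die)  # 3.5
die_var = variance(die)      # 35/12, about 2.92

# Hypothetical joint table: x and y agree 80% of the time.
joint = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}
cov_xy = covariance(joint)   # positive, since x and y tend to move together
print(die_mean, die_var, cov_xy)
```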
How much "surprise" does an event carry? If it rains in Seattle in November, you are not very surprised. If it rains in the Sahara, you are extremely surprised. Information theory quantifies this intuition.
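The self-information of an event x with probability P(x) is I(x) = −log P(x). Rare events carry high information; certain events carry zero. With the natural logarithm the unit is the nat; with base 2 it is the bit. The entropy H of a distribution is the expected self-information:
H(P) = E[I(x)] = −Σ_x P(x) log P(x)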
Entropy measures the average "surprise" or uncertainty in a distribution. A fair coin has the maximum entropy possible for two outcomes (1 bit, using base-2 logarithms); a loaded coin that always lands heads has zero entropy.
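The coin examples can be computed directly. This sketch defaults to base-2 logarithms so the result is in bits, and treats terms with zero probability as contributing nothing (the usual convention). The 90/10 coin is an extra illustrative case.

```python
import math

def entropy(probs, base=2.0):
    """H = -sum of p log p; outcomes with p = 0 contribute 0 by convention."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_bits = entropy([0.5, 0.5])               # 1 bit
biased_bits = entropy([0.9, 0.1])             # about 0.47 bits
certain_bits = entropy([1.0, 0.0])            # 0 bits
fair_nats = entropy([0.5, 0.5], base=math.e)  # ln 2, about 0.693 nats
print(fair_bits, biased_bits, certain_bits, fair_nats)
```

The more lopsided the coin, the lower the entropy: there is less uncertainty left to resolve.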
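The KL divergence measures how different two distributions P and Q over the same variable are:
D_KL(P || Q) = E_{x∼P}[log P(x) − log Q(x)] = Σ_x P(x) log(P(x) / Q(x))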
KL divergence is always ≥ 0 and equals 0 only when P = Q. It is not symmetric: in general, D_KL(P || Q) ≠ D_KL(Q || P).
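Both properties are easy to confirm on two coins. In this sketch the divergence is computed in nats for discrete distributions listed over the same outcomes; the two coin distributions are arbitrary examples.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue         # 0 * log(0 / q) is taken to be 0
        if q_i == 0:
            return math.inf  # P puts mass where Q puts none
        total += p_i * math.log(p_i / q_i)
    return total

fair = [0.5, 0.5]
biased = [0.9, 0.1]

forward = kl_divergence(fair, biased)  # about 0.51 nats
reverse = kl_divergence(biased, fair)  # about 0.37 nats
same = kl_divergence(fair, fair)       # 0.0
print(forward, reverse, same)
```

The two directions give different numbers, so the order of the arguments matters when KL divergence is used as a training objective.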
Figure: entropy and KL divergence for two coins, as a function of each coin's probability of heads.
Interactive playground: pick one of the major probability distributions used in deep learning and adjust its parameters. The plot shows the PDF or PMF with the mean and standard deviation annotated, along with the resulting entropy and KL divergence.
Gaussian: mean=0.0, std=1.0 | Entropy=1.42 nats
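The 1.42 nats shown for the standard Gaussian can be checked against the closed-form differential entropy of a Gaussian, ½ ln(2πeσ²), which depends only on the standard deviation and not on the mean. The function name is illustrative.

```python
import math

def gaussian_entropy_nats(std):
    """Differential entropy of a Gaussian: 0.5 * ln(2 * pi * e * std^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * std ** 2)

h_standard = gaussian_entropy_nats(1.0)  # rounds to 1.42 nats
h_wider = gaussian_entropy_nats(2.0)     # larger: a wider curve is more uncertain
print(round(h_standard, 2), round(h_wider, 2))
```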
Probability and information theory form the second mathematical pillar of deep learning (alongside linear algebra from Chapter 2). Here is how each concept connects to the rest of the book:
| Concept | Where It Appears |
|---|---|
| Probability distributions | Network outputs via softmax (Ch 6), generative models |
| Bayes' rule | Bayesian deep learning, posterior inference, regularization as prior (Ch 7) |
| Gaussian distribution | Weight initialization, VAEs, noise modeling |
| Cross-entropy / KL divergence | The standard classification loss function (Ch 6, 8) |
| Expectation | Loss = expected error; SGD approximates the expectation (Ch 8) |
| Maximum likelihood | The principle behind most training objectives (Ch 5, 6) |
Up next: Chapter 4: Numerical Computation — the practical concerns of implementing these mathematical ideas on real hardware with finite precision.