Goodfellow et al., Chapter 5

Machine Learning Basics

Every deep learning algorithm is a learning algorithm first. Here are the core ideas — from fitting data to generalizing beyond it.

Prerequisites: Chapters 2-4 (linear algebra, probability, numerical computation).
11
Chapters
3+
Simulations
11
Quizzes

Chapter 0: Why Machine Learning?

You want a computer to recognize faces in photos. You could try writing rules by hand: "if the pixel at (120, 80) is skin-colored and there is a dark region above it..." But faces vary in infinite ways — lighting, angle, expression, skin tone. No finite set of hand-written rules can cover them all.

Machine learning flips the script. Instead of writing rules, you show the computer thousands of examples. It discovers the patterns itself. The more examples it sees, the better it gets. That is learning from experience.

Mitchell's definition (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Every ML system has these three ingredients: a task, a performance measure, and experience (data).
Learning Algorithms
Task, performance, experience — and linear regression
Capacity & Generalization
Overfitting, underfitting, and the bias-variance trade-off
Estimators & MLE
How to derive good learning rules from first principles
SGD & Challenges
Scaling to big data and the curse of dimensionality
Why is hand-writing rules insufficient for tasks like face recognition?

Chapter 1: Learning Algorithms

A learning algorithm has three parts. The task T defines what the system should do — classify images, translate sentences, predict stock prices. The performance measure P quantifies how well it does — accuracy, error rate, log-likelihood. The experience E is the data it learns from.

Common tasks include classification (assign an input to one of k categories), regression (predict a continuous number), transcription (convert unstructured input to text), denoising (recover a clean signal from a corrupted one), and density estimation (learn the probability distribution that generated the data).

Performance on unseen data: We always evaluate on a test set — data the algorithm has never seen during training. A model that scores perfectly on training data but fails on new data has memorized rather than learned. The gap between training and test performance is the central tension in all of ML.

The simplest concrete example is linear regression. Given input features x, predict output y with ŷ = wTx + b. We choose w and b to minimize the mean squared error (MSE) on training data:

MSEtrain = (1/m) ∑i ||ŷ(i) − y(i)||2

For linear regression, we can solve this in closed form via the normal equations: w = (XTX)−1XTy. No iteration needed — just linear algebra. But as models grow more complex, closed-form solutions vanish and we turn to gradient descent.

What are the three components of Mitchell's definition of a learning algorithm?

Chapter 2: Capacity, Overfitting & Underfitting

The central challenge is not fitting the training data — that is easy. The challenge is generalization: performing well on inputs the model has never seen. A model that memorizes every training example but cannot predict anything new is useless.

Underfitting occurs when the model is too simple to capture the structure in the data. A straight line through a sinusoidal dataset misses the waves. Overfitting occurs when the model is so complex that it memorizes the noise in the training set. A degree-20 polynomial through 10 points will pass through every point but oscillate wildly between them.

Generalization error = Training error + Gap

The capacity of a model is its ability to fit a wide variety of functions. A linear model has low capacity. A polynomial of degree 20 has high capacity. The goal is to match capacity to the true complexity of the data — the sweet spot where training error is low but the gap to test error is also small.

Underfitting vs Overfitting

Adjust the polynomial degree to see how capacity affects fit. Low degree = underfitting. High degree = overfitting. Find the sweet spot.

Polynomial degree1
Key insight — the no free lunch theorem: No single ML algorithm is universally best across all possible tasks. A model that excels on one data distribution will fail on another. We must use prior knowledge about the problem to choose the right model family. This is why domain expertise matters even in the age of deep learning.
Regularization: We can control capacity without changing the model family by adding a penalty. For example, weight decay adds λ||w||2 to the cost function, discouraging large weights. This shrinks the effective capacity, pushing the model toward simpler solutions and reducing overfitting.
What is the difference between underfitting and overfitting?

Chapter 3: Hyperparameters & Validation

Most ML algorithms have settings that are not learned from data — they must be chosen by the practitioner. These are hyperparameters. The polynomial degree in our fitting demo is a hyperparameter. The weight decay coefficient λ is another. The learning rate ε is yet another.

Why not just learn them from training data? Because optimizing hyperparameters on training data always leads to maximum capacity and overfitting. If you let the algorithm choose its own polynomial degree to minimize training error, it will always choose the highest degree available.

The validation set: We split our data into three parts: training set (learn the parameters), validation set (choose hyperparameters), and test set (final evaluation). The test set is touched only once — at the very end — to give an unbiased estimate of generalization error. If you tune hyperparameters using the test set, you are overfitting to the test set.

Cross-validation is useful when data is scarce. In k-fold cross-validation, we partition the data into k subsets, train on k−1 of them, and validate on the remaining one. We repeat k times, rotating the validation fold, and average the results. This gives a more reliable estimate of generalization performance.

Why can't we simply optimize hyperparameters on the training set?

Chapter 4: Estimators, Bias & Variance

An estimator is a rule for computing an approximate value of a parameter from data. Given m samples from a distribution with unknown mean μ, the sample mean μ̂ = (1/m)∑x(i) is an estimator of μ. But how good is it?

Two properties matter: bias and variance. The bias of an estimator is the difference between its expected value and the true parameter: bias(θ̂) = E[θ̂] − θ. An unbiased estimator has zero bias — on average, it hits the true value exactly.

bias(θ̂) = E[θ̂] − θ

The variance measures how much the estimator fluctuates across different samples of data. High variance means the estimate changes wildly depending on which data you happened to draw.

The bias-variance trade-off: The mean squared error decomposes as MSE = Bias2 + Variance. A biased estimator with low variance can have lower MSE than an unbiased estimator with high variance. This is the statistical mirror of the underfitting/overfitting trade-off — simple models have high bias but low variance; complex models have low bias but high variance.
Bias-Variance Trade-off

Adjust model capacity to see how bias, variance, and total MSE change. The optimal capacity minimizes total error.

Model capacity10

A consistent estimator converges to the true parameter as the number of samples m approaches infinity. The sample mean is consistent: μ̂m → μ as m → ∞. Consistency means bias vanishes with enough data — but the reverse is not true. An unbiased estimator is not necessarily consistent.

How does the MSE of an estimator decompose?

Chapter 5: Maximum Likelihood Estimation

Where do good estimators come from? Rather than guessing, we need a principle that automatically derives the best estimator for any model. The most important such principle is maximum likelihood estimation (MLE).

The idea is simple. Given data X = {x(1), ..., x(m)} and a parametric model pmodel(x; θ), find the θ that makes the observed data most probable:

θML = arg maxθi pmodel(x(i); θ)

In practice we maximize the log-likelihood instead (products become sums, and we avoid numerical underflow):

θML = arg maxθi log pmodel(x(i); θ)
MLE = minimizing KL divergence: Maximizing the log-likelihood is equivalent to minimizing the KL divergence between the empirical data distribution p̂data and the model distribution pmodel. In other words, MLE tries to make your model's distribution match the data as closely as possible. Any negative log-likelihood loss is a cross-entropy between the data and the model.

Conditional MLE extends this to supervised learning. Instead of modeling p(x), we model p(y|x; θ) and maximize ∑ log p(y(i)|x(i); θ). When we assume p(y|x) is Gaussian, maximizing the log-likelihood is exactly equivalent to minimizing MSE. This justifies MSE as a principled loss function, not an arbitrary choice.

Properties of MLE: Under mild conditions, the MLE is consistent (converges to the true parameter with enough data) and statistically efficient (no consistent estimator has lower MSE for large m, by the Cramér-Rao bound). This is why MLE is the default starting point for designing learning algorithms.
Why do we maximize the log-likelihood rather than the likelihood itself?

Chapter 6: Bayesian Statistics

MLE gives you a single best guess for θ. But what if your data is limited and you are uncertain about the true parameter? The Bayesian approach represents that uncertainty explicitly.

In the frequentist view, θ is a fixed unknown number and the data is random. In the Bayesian view, the data is fixed (you observed it) and θ is a random variable with a probability distribution. Before seeing data, you have a prior p(θ) reflecting your initial beliefs. After observing data, you update to a posterior using Bayes' rule:

p(θ | data) = p(data | θ) · p(θ) / p(data)

Predictions then integrate over all possible θ values, weighted by the posterior. This naturally protects against overfitting — if you are uncertain about θ, that uncertainty is folded into every prediction.

MAP estimation: Full Bayesian inference is often intractable. A practical compromise is Maximum A Posteriori (MAP) estimation: find the single θ that maximizes the posterior. MAP = log-likelihood + log-prior. When the prior is Gaussian, the log-prior term becomes λ||w||2 — which is exactly weight decay. So weight decay is Bayesian inference with a Gaussian prior!
Bayesian vs frequentist, in one sentence: The frequentist asks "what would happen if I repeated this experiment many times?" The Bayesian asks "given the data I actually observed, what do I believe about θ?"
How does MAP estimation relate to regularization?

Chapter 7: Supervised & Unsupervised Learning

Supervised learning means learning from labeled examples — each input x comes with a target y. The algorithm learns to predict y from x. Classification, regression, and transcription are all supervised tasks.

Unsupervised learning means finding structure in unlabeled data. No targets, no teacher. The algorithm must discover patterns on its own. Density estimation, clustering, and dimensionality reduction are classic examples.

The boundary is blurry: Supervised and unsupervised learning are not rigid categories. The chain rule of probability shows that modeling p(x) (unsupervised) can be decomposed into a sequence of supervised problems p(xi | x1, ..., xi-1). Conversely, supervised learning of p(y|x) can be approached by learning the joint p(x, y) unsupervised and then conditioning.

Logistic regression extends linear regression to classification. Pass the linear output through a sigmoid σ(wTx + b) to squash it into [0, 1], and interpret the result as a probability. Unlike linear regression, there is no closed-form solution — we minimize the negative log-likelihood using gradient descent.

PCA (principal components analysis) is the quintessential unsupervised algorithm. It finds a linear transformation that decorrelates the data and projects it onto the directions of maximum variance, giving a lower-dimensional representation that preserves as much information as possible.

k-means clustering partitions data into k groups. Initialize k centroids randomly, assign each point to its nearest centroid, update each centroid to the mean of its assigned points, and repeat. Simple, fast, but the result depends heavily on initialization and k.

What is the key difference between supervised and unsupervised learning?

Chapter 8: Stochastic Gradient Descent

Nearly all of deep learning is powered by one algorithm: stochastic gradient descent (SGD). The cost function typically decomposes as a sum over training examples:

J(θ) = (1/m) ∑i=1m L(x(i), y(i), θ)

Computing the gradient over all m examples costs O(m) per step. When m is in the billions, a single step takes forever. SGD's insight: the gradient is an expectation. We can estimate it with a small random minibatch of m' examples (typically 32-256) drawn from the training set:

g = (1/m') ∑i=1m'θL(x(i), y(i), θ)     θ ← θ − ε g

This estimate is noisy but unbiased. On average, it points in the right direction. The noise even helps — it can bump the optimizer out of sharp local minima into flatter, better-generalizing regions.

Full Batch vs Stochastic Gradient Descent

Compare the smooth path of full-batch GD to the noisy but effective path of SGD. Click "Run" to animate.

Full-batch: smooth path | SGD: noisy but converges

Why SGD scales: The cost per update is O(m'), independent of the total dataset size m. With a billion examples, SGD takes the same time per step as with a thousand. Moreover, the model often converges before SGD has seen every training example even once. Asymptotically, training cost is O(1) in m.
Why does SGD use a minibatch instead of the full training set?

Chapter 9: Challenges Motivating Deep Learning

Traditional ML algorithms work well on many problems. But they fail on the hard ones — recognizing speech, understanding images, translating language. Why? Three fundamental challenges stand in the way.

The Curse of Dimensionality

Imagine dividing a 1D input into 10 bins. That takes 10 regions to cover. In 2D, 10 × 10 = 100 regions. In 3D, 1000. In d dimensions, 10d regions — an exponential explosion. With high-dimensional inputs like images (millions of pixels), the number of possible configurations dwarfs any training set. Traditional algorithms that rely on covering the space with examples (like k-nearest neighbors) drown in this sea of possibilities.

The Curse of Dimensionality

Increase the number of dimensions to see how the required number of regions explodes exponentially. Each dimension multiplies the space by 10.

Dimensions1

The Smoothness Prior Is Not Enough

Most traditional algorithms assume the target function is smooth — nearby inputs should produce similar outputs. This is the local constancy prior. It works when you have enough examples to cover every peak and valley. But for complex functions in high dimensions, the number of distinct regions vastly exceeds the data. Imagine a checkerboard: the pattern is simple but has many alternating regions. A smooth prior cannot discover the checkerboard structure without placing at least one example in every square.

Deep learning's answer: Deep learning assumes the data was generated by the composition of factors, potentially at multiple levels in a hierarchy. This lets a network with O(k) parameters distinguish O(2k) regions — an exponential gain. Composition is the key: by combining simple features into complex ones, deep networks transcend the smoothness prior.

Manifold Learning

High-dimensional data typically concentrates near low-dimensional manifolds. The space of all possible 256×256 images is astronomically large, but the space of natural images (faces, landscapes, objects) is a tiny, structured subset. If we can discover the manifold, we only need to learn a function on its intrinsic coordinates, not on all of Rn.

How does deep learning overcome the curse of dimensionality that defeats traditional ML?

Chapter 10: Connections

This chapter laid the foundation for everything that follows. Every deep learning algorithm is built from the same recipe described here:

IngredientChapter 5 ConceptDeep Learning Application
DatasetTraining / validation / test splitsImageNet, BookCorpus, Common Crawl
Cost functionMLE → negative log-likelihoodCross-entropy, MSE, contrastive loss
ModelCapacity, parameterizationCNNs, Transformers, diffusion models
OptimizerSGD with minibatchesAdam, AdamW, learning rate schedules
RegularizationWeight decay, validation-based tuningDropout, data augmentation, early stopping
What you should take away: ML is applied statistics. MLE gives you principled cost functions. The bias-variance trade-off governs generalization. SGD makes training scalable. And the curse of dimensionality explains why we need depth — composition of features — not just more data or smoother functions.

Up next: Chapter 6: Deep Feedforward Networks — the first real neural network architecture, where these ingredients come together to learn hierarchical representations.

What are the four core components of the ML algorithm recipe described in this chapter?