Every deep learning algorithm is a learning algorithm first. Here are the core ideas — from fitting data to generalizing beyond it.
You want a computer to recognize faces in photos. You could try writing rules by hand: "if the pixel at (120, 80) is skin-colored and there is a dark region above it..." But faces vary in infinite ways — lighting, angle, expression, skin tone. No finite set of hand-written rules can cover them all.
Machine learning flips the script. Instead of writing rules, you show the computer thousands of examples. It discovers the patterns itself. Typically, the more examples it sees, the better it gets. That is learning from experience.
A learning algorithm has three parts. The task T defines what the system should do — classify images, translate sentences, predict stock prices. The performance measure P quantifies how well it does — accuracy, error rate, log-likelihood. The experience E is the data it learns from.
Common tasks include classification (assign an input to one of k categories), regression (predict a continuous number), transcription (convert unstructured input to text), denoising (recover a clean signal from a corrupted one), and density estimation (learn the probability distribution that generated the data).
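The simplest concrete example is linear regression. Given input features x, predict output y with ŷ = wᵀx + b. We choose w and b to minimize the mean squared error (MSE) on the m training examples:

$$\mathrm{MSE}_{\text{train}} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$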
For linear regression, we can solve this in closed form via the normal equations: w = (XᵀX)⁻¹Xᵀy, with the bias b absorbed by appending a constant 1 to every input. No iteration is needed, just linear algebra. But as models grow more complex, closed-form solutions vanish and we turn to gradient descent.
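The closed-form fit can be sketched in a few lines of NumPy. This is a minimal illustration, not a production solver: the function names `fit_linear_regression` and `mse` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit of y ≈ X @ w + b via the normal equations."""
    # Append a column of ones so the bias b is learned as one extra weight.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    # Solve (X^T X) theta = X^T y; solving the system is numerically
    # safer than forming the inverse explicitly.
    theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
    return theta[:-1], theta[-1]

def mse(X, y, w, b):
    """Mean squared error of the linear model on (X, y)."""
    return float(np.mean((X @ w + b - y) ** 2))

# Synthetic data (hypothetical): a known linear rule plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=200)

w, b = fit_linear_regression(X, y)
print("w:", w, "b:", b, "train MSE:", mse(X, y, w, b))
```

The recovered `w` and `b` should land close to the values used to generate the data, and the training MSE should sit near the noise variance.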
The central challenge is not fitting the training data — that is easy. The challenge is generalization: performing well on inputs the model has never seen. A model that memorizes every training example but cannot predict anything new is useless.
Underfitting occurs when the model is too simple to capture the structure in the data. A straight line through a sinusoidal dataset misses the waves. Overfitting occurs when the model is so complex that it memorizes the noise in the training set. A degree-20 polynomial fit to 10 points can pass through every point yet oscillate wildly between them.
The capacity of a model is its ability to fit a wide variety of functions. A linear model has low capacity. A polynomial of degree 20 has high capacity. The goal is to match capacity to the true complexity of the data: the sweet spot where training error is low and the gap to test error is also small.
Adjust the polynomial degree to see how capacity affects fit. Low degree = underfitting. High degree = overfitting. Find the sweet spot.
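A rough stand-in for the demo, assuming a noisy sine wave as the target and only 10 training points. The data generator, the chosen degrees, and the helper `poly_errors` are all illustrative assumptions. Degree 9 is used as the high-capacity case because it is the lowest degree that can interpolate 10 points exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (an assumed toy target)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(200)

def poly_errors(degree):
    """Return (train MSE, test MSE) for a least-squares polynomial fit."""
    # Polynomial.fit rescales x internally, which keeps higher-degree
    # fits better conditioned than a raw Vandermonde solve.
    p = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train = float(np.mean((p(x_train) - y_train) ** 2))
    test = float(np.mean((p(x_test) - y_test) ** 2))
    return train, test

errors = {d: poly_errors(d) for d in (1, 3, 9)}
for d, (train, test) in errors.items():
    print(f"degree {d}: train MSE {train:.2e}, test MSE {test:.2e}")
```

Training error can only fall as the degree rises, reaching essentially zero at degree 9. The test error is the number to watch: it typically drops from degree 1 to degree 3 and then climbs again once the polynomial starts chasing noise.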
Most ML algorithms have settings that are not learned from data — they must be chosen by the practitioner. These are hyperparameters. The polynomial degree in our fitting demo is a hyperparameter. The weight decay coefficient λ is another. The learning rate ε is yet another.
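Why not just learn them from training data? Because a hyperparameter that controls capacity, if tuned on the training set, will always be pushed toward maximum capacity, which leads to overfitting. If you let the algorithm choose its own polynomial degree to minimize training error, it will always choose the highest degree available. The fix is to hold out a validation set: examples that the training procedure never sees, used only to compare hyperparameter settings.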
Cross-validation is useful when data is scarce. In k-fold cross-validation, we partition the data into k subsets, train on k−1 of them, and validate on the remaining one. We repeat k times, rotating the validation fold, and average the results. This gives a more reliable estimate of generalization performance.
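The procedure can be sketched as follows, here used to pick a polynomial degree. The function name `k_fold_mse`, the toy sine data, and the candidate degrees are assumptions for illustration only.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    # Shuffle once, then split the indices into k roughly equal folds.
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        # Train on every fold except the i-th.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        p = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], degree)
        scores.append(np.mean((p(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Toy data (hypothetical): 40 noisy samples of a sine wave.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=40)

cv_scores = {d: k_fold_mse(x, y, d) for d in range(1, 8)}
best_degree = min(cv_scores, key=cv_scores.get)
print("cross-validated MSE by degree:", cv_scores)
print("selected degree:", best_degree)
```

Because every candidate degree is scored with the same fold split (same `seed`), the comparison between degrees is not confounded by a lucky or unlucky partition.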
An estimator is a rule for computing an approximate value of a parameter from data. Given m samples from a distribution with unknown mean μ, the sample mean μ̂ = (1/m)∑x(i) is an estimator of μ. But how good is it?
Two properties matter: bias and variance. The bias of an estimator is the difference between its expected value and the true parameter: bias(θ̂) = E[θ̂] − θ. An unbiased estimator has zero bias — on average, it hits the true value exactly.
The variance measures how much the estimator fluctuates across different samples of data. High variance means the estimate changes wildly depending on which data you happened to draw.
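Both properties can be estimated empirically by drawing many datasets and applying the estimator to each. The sketch below assumes Gaussian data with known μ and σ, and also includes the sample variance, which is not discussed above, as a familiar example of a biased estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 3.0, 2.0, 10, 100_000

# Each row is one independent dataset of m samples.
samples = rng.normal(mu, sigma, size=(trials, m))

# Estimator 1: the sample mean, applied to every dataset.
mean_hat = samples.mean(axis=1)
bias_mean = mean_hat.mean() - mu   # expected to be near 0
variance_mean = mean_hat.var()     # expected to be near sigma^2 / m

# Estimator 2: the sample variance, dividing by m (biased)
# versus dividing by m - 1 (unbiased).
bias_var_biased = samples.var(axis=1, ddof=0).mean() - sigma**2
bias_var_unbiased = samples.var(axis=1, ddof=1).mean() - sigma**2

print("sample mean: bias", bias_mean, "variance", variance_mean)
print("variance estimator bias (divide by m):    ", bias_var_biased)
print("variance estimator bias (divide by m - 1):", bias_var_unbiased)
```

With these settings the sample mean should show a bias near zero and a variance near σ²/m = 0.4, while the divide-by-m variance estimator should be off by roughly −σ²/m.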
Adjust model capacity to see how bias, variance, and total MSE change. The optimal capacity minimizes total error.
A consistent estimator converges to the true parameter as the number of samples m approaches infinity. The sample mean is consistent: μ̂ₘ → μ as m → ∞. Consistency means bias vanishes with enough data, but the reverse is not true. An unbiased estimator is not necessarily consistent: using only the first sample x(1) as the estimate of μ is unbiased, yet it never improves no matter how much data arrives.
Where do good estimators come from? Rather than guessing, we want a principle from which good estimators can be derived for many different models. The most common such principle is maximum likelihood estimation (MLE).
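The idea is simple. Given data X = {x(1), ..., x(m)}, assumed to be drawn independently, and a parametric model p_model(x; θ), find the θ that makes the observed data most probable:

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{m} p_{\text{model}}(x^{(i)}; \theta)$$

In practice we maximize the log-likelihood instead (products become sums, and we avoid numerical underflow):

$$\theta_{\mathrm{ML}} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}(x^{(i)}; \theta)$$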
Conditional MLE extends this to supervised learning. Instead of modeling p(x), we model p(y|x; θ) and maximize ∑ log p(y(i)|x(i); θ). When we assume p(y|x) is Gaussian, with its mean given by the model's prediction and a fixed variance, maximizing the log-likelihood yields the same parameters as minimizing MSE. This justifies MSE as a principled loss function, not an arbitrary choice.
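The link between the Gaussian likelihood and MSE can be written out in a short derivation. It assumes the prediction ŷ(x; θ) is the mean of the Gaussian and that the variance σ² is a fixed constant. The symbol σ is introduced here for the derivation.

```latex
\begin{align*}
p(y \mid x; \theta) &= \mathcal{N}\!\left(y;\ \hat{y}(x; \theta),\ \sigma^2\right) \\
\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
  &= -m \log \sigma - \frac{m}{2} \log(2\pi)
     - \sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)} - y^{(i)}\right)^2}{2\sigma^2} \\
  &= -m \log \sigma - \frac{m}{2} \log(2\pi)
     - \frac{m}{2\sigma^2}\, \mathrm{MSE}_{\text{train}}
\end{align*}
```

Only the last term depends on θ, and it enters with a negative sign, so the θ that maximizes the log-likelihood is the same θ that minimizes the training MSE.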
MLE gives you a single best guess for θ. But what if your data is limited and you are uncertain about the true parameter? The Bayesian approach represents that uncertainty explicitly.
In the frequentist view, θ is a fixed unknown number and the data is random. In the Bayesian view, the data is fixed (you observed it) and θ is a random variable with a probability distribution. Before seeing data, you have a prior p(θ) reflecting your initial beliefs. After observing data X, you update to a posterior using Bayes' rule:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$
Predictions then integrate over all possible θ values, weighted by the posterior. This tends to protect against overfitting: if you are uncertain about θ, that uncertainty is folded into every prediction.
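A coin-flip example makes the contrast with MLE concrete. The sketch assumes a Bernoulli model with a Beta prior, chosen because the posterior then has a closed form. The function names and the three-heads dataset are illustrative.

```python
def beta_posterior(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin flips.

    With a Beta(alpha, beta) prior on the heads probability, the
    posterior is Beta(alpha + heads, beta + tails).
    """
    return alpha + heads, beta + tails

def predictive_heads(alpha, beta):
    """Probability that the next flip is heads, integrating over theta.

    For a Beta(alpha, beta) distribution this integral is its mean.
    """
    return alpha / (alpha + beta)

heads, tails = 3, 0  # tiny dataset: three flips, all heads

# Maximum likelihood commits to a single theta.
mle_estimate = heads / (heads + tails)

# Bayesian prediction with a uniform Beta(1, 1) prior.
post_alpha, post_beta = beta_posterior(1.0, 1.0, heads, tails)
bayes_prediction = predictive_heads(post_alpha, post_beta)

print("MLE:", mle_estimate)                      # 1.0
print("Bayesian predictive:", bayes_prediction)  # 0.8
```

After three heads, MLE claims tails is impossible. The Bayesian prediction stays at 0.8 because the posterior still puts weight on smaller values of θ.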
Supervised learning means learning from labeled examples — each input x comes with a target y. The algorithm learns to predict y from x. Classification, regression, and transcription are all supervised tasks.
Unsupervised learning means finding structure in unlabeled data. No targets, no teacher. The algorithm must discover patterns on its own. Density estimation, clustering, and dimensionality reduction are classic examples.
Logistic regression extends linear regression to classification. Pass the linear output through a sigmoid σ(wᵀx + b) to squash it into the interval (0, 1), and interpret the result as a probability. Unlike linear regression, there is no closed-form solution, so we minimize the negative log-likelihood using gradient descent.
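A minimal sketch of that training loop, using full-batch gradient descent on the average negative log-likelihood. The learning rate, step count, and the two-blob dataset are assumptions picked for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.5, steps=2000):
    """Gradient descent on the mean negative log-likelihood."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)  # predicted P(y = 1 | x)
        # Gradient of the mean negative log-likelihood w.r.t. w and b.
        grad_w = X.T @ (p - y) / m
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (hypothetical): two overlapping Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 2)),  # class 0
    rng.normal(+1.0, 1.0, size=(100, 2)),  # class 1
])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = fit_logistic_regression(X, y)
predictions = (sigmoid(X @ w + b) > 0.5).astype(float)
accuracy = float(np.mean(predictions == y))
print("w:", w, "b:", b, "train accuracy:", accuracy)
```

The gradient has the same `prediction minus target` form as in linear regression, which is a consequence of pairing the sigmoid with the log-likelihood loss.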
PCA (principal components analysis) is the quintessential unsupervised algorithm. It finds a linear transformation that decorrelates the data and projects it onto the directions of maximum variance, giving a lower-dimensional representation that preserves as much information as possible.
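One common way to compute PCA is through the singular value decomposition of the centered data matrix. The sketch below takes that route. The function name `pca` and the synthetic 2D data are assumptions for illustration.

```python
import numpy as np

def pca(X, n_components):
    """PCA via SVD of the centered data matrix."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    Z = X_centered @ components.T  # projected (decorrelated) data
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return Z, components, explained_variance

# Toy data (hypothetical): points near a line in 2D, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(500, 2))

Z, components, explained_variance = pca(X, n_components=2)
cov_Z = np.cov(Z, rowvar=False)
print("first principal direction:", components[0])
print("explained variance:", explained_variance)
print("covariance of projected data:\n", cov_Z)
```

The covariance of the projected data should come out diagonal, which is the decorrelation property, and nearly all the variance should fall on the first component, which lies along the direction (2, 1) up to sign.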
k-means clustering partitions data into k groups. Initialize k centroids randomly, assign each point to its nearest centroid, update each centroid to the mean of its assigned points, and repeat. Simple, fast, but the result depends heavily on initialization and k.
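The loop described above can be sketched directly. The function name `k_means`, the convergence check, and the handling of empty clusters (keep the old centroid) are choices made for this example, not part of the description.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of assigned points (old centroid if none).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, labels

# Toy data (hypothetical): two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(-5.0, 0.5, size=(50, 2)),
    rng.normal(+5.0, 0.5, size=(50, 2)),
])

centroids, labels = k_means(X, k=2)
print("centroids:\n", centroids)
```

On data this cleanly separated the initialization hardly matters. On harder data, a common mitigation is to run the algorithm from several seeds and keep the result with the lowest total within-cluster distance.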
Nearly all of deep learning is powered by one algorithm: stochastic gradient descent (SGD). The cost function typically decomposes as a sum over training examples, where L is the per-example loss:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)$$

Computing the gradient over all m examples costs O(m) per step. When m is in the billions, a single step becomes prohibitively expensive. SGD's insight: the gradient is an expectation. We can estimate it with a small random minibatch of m' examples (typically 32 to 256) drawn from the training set:

$$g = \frac{1}{m'} \nabla_{\theta} \sum_{i=1}^{m'} L(x^{(i)}, y^{(i)}, \theta)$$
This estimate is noisy but unbiased. On average, it points in the right direction. The noise may even help: it can bump the optimizer out of sharp local minima into flatter regions that tend to generalize better.
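A minimal sketch of minibatch SGD, applied to linear regression with an MSE loss so the result is easy to check. The hyperparameter values and synthetic data are assumptions. The loop reshuffles the data every epoch and walks through it in batches, a common practical variant of drawing random minibatches.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=32, epochs=50, seed=0):
    """Minibatch SGD on mean squared error for y ≈ X @ w + b."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]
            # Gradient of the minibatch MSE: a noisy estimate of the
            # full-batch gradient, at a fraction of the cost.
            grad_w = 2.0 * X[idx].T @ err / len(idx)
            grad_b = 2.0 * err.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data (hypothetical): a known linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 4.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=1000)

w, b = sgd_linear_regression(X, y)
print("w:", w, "b:", b)
```

Each update touches only `batch_size` examples, so the cost per step does not grow with the size of the training set.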
Compare the smooth path of full-batch gradient descent with the noisy but convergent path of SGD.
Traditional ML algorithms work well on many problems. But they fail on the hard ones — recognizing speech, understanding images, translating language. Why? Three fundamental challenges stand in the way.
Imagine dividing a 1D input into 10 bins. That takes 10 regions to cover. In 2D, 10 × 10 = 100 regions. In 3D, 1000. In d dimensions, 10^d regions: an exponential explosion. With high-dimensional inputs like images (millions of pixels), the number of possible configurations dwarfs any training set. Traditional algorithms that rely on covering the space with examples (like k-nearest neighbors) drown in this sea of possibilities.
Increase the number of dimensions to see how the required number of regions explodes exponentially. Each dimension multiplies the space by 10.
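The arithmetic is easy to reproduce. The sketch below also reports the best-case fraction of regions that a training set could touch, assuming every example lands in a different region. The one-million-example training set is a made-up figure for illustration.

```python
def regions_needed(dims, bins_per_dim=10):
    """Number of grid cells when each dimension is split into bins."""
    return bins_per_dim ** dims

def max_coverage(n_examples, dims, bins_per_dim=10):
    """Upper bound on the fraction of cells containing an example."""
    return min(1.0, n_examples / regions_needed(dims, bins_per_dim))

n_examples = 1_000_000  # hypothetical training set size
for d in (1, 2, 3, 6, 10, 20):
    print(f"d={d:>2}: regions={regions_needed(d):.3e}, "
          f"best-case coverage={max_coverage(n_examples, d):.3e}")
```

Coverage stays complete up to six dimensions and then collapses: at ten dimensions, a million examples can reach at most one region in ten thousand.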
Most traditional algorithms assume the target function is smooth — nearby inputs should produce similar outputs. This is the local constancy prior. It works when you have enough examples to cover every peak and valley. But for complex functions in high dimensions, the number of distinct regions vastly exceeds the data. Imagine a checkerboard: the pattern is simple but has many alternating regions. A smooth prior cannot discover the checkerboard structure without placing at least one example in every square.
High-dimensional data typically concentrates near low-dimensional manifolds. The space of all possible 256×256 images is astronomically large, but the space of natural images (faces, landscapes, objects) is a tiny, structured subset. If we can discover the manifold, we only need to learn a function on its intrinsic coordinates, not on all of ℝⁿ.
This chapter laid the foundation for everything that follows. Every deep learning algorithm is built from the same recipe described here:
| Ingredient | Chapter 5 Concept | Deep Learning Application |
|---|---|---|
| Dataset | Training / validation / test splits | ImageNet, BookCorpus, Common Crawl |
| Cost function | MLE → negative log-likelihood | Cross-entropy, MSE, contrastive loss |
| Model | Capacity, parameterization | CNNs, Transformers, diffusion models |
| Optimizer | SGD with minibatches | Adam, AdamW, learning rate schedules |
| Regularization | Weight decay, validation-based tuning | Dropout, data augmentation, early stopping |
Up next: Chapter 6: Deep Feedforward Networks — the first real neural network architecture, where these ingredients come together to learn hierarchical representations.