How a single matrix multiply turns pixels into class scores — and how we teach it to get the scores right.
In the previous lesson, we built a k-NN classifier. It works, but it has a fatal flaw: to classify one test image, you compare it against every single training image. With 50,000 training images, that's 50,000 comparisons per prediction. Slow. Memory-hungry. Impractical.
What if, instead of memorizing all the training data, we could distill it into a small set of numbers — a compact summary that captures what each class "looks like"? Then prediction would be instant: just multiply the image by those numbers and read off the scores.
f(x) = Wx + b. Training is slow (learning the weights), but prediction is blazing fast — the exact opposite of k-NN.k-NN stores everything and compares at test time. A linear classifier learns a compact summary (W, b) during training, then prediction is a single multiply.
This shift — from "memorize data" to "learn parameters" — is called a parametric model. The parameters (W, b) are learned during training and then the training data can be thrown away. This is the foundation of all deep learning.
Here's the setup. You have an image, flattened into a column vector x of D numbers. You have C classes. You want C scores — one per class — where the highest score is the predicted class. The score function maps pixels to scores:
Where W is a C×D matrix (C rows, D columns), x is a D×1 vector, and b is a C×1 bias vector. The result is a C×1 vector of scores — one score per class.
A tiny example: 4-pixel image, 3 classes. Each row of W produces one class score. The bias shifts scores up or down. Click "New Weights" to randomize.
Each row of W gets dot-producted with the image x. The dot product measures agreement: if the weight row and the image pixels point in the same direction, the score is high. If they point in opposite directions, the score is low. The bias b is an offset that lets some classes start with a higher or lower baseline score, independent of the data — useful when classes are imbalanced.
Here's a beautiful way to think about what the weight matrix is doing. Each row of W is a D-dimensional vector — the same size as an image. You can literally reshape that row back into an image and look at it. What do you see? A blurry template for that class.
The dot product between a weight row and an image is highest when the image looks like the template. So the classifier is basically asking: "Of all my templates, which one does this image resemble most?" The class whose template best matches wins.
Each row of W reshaped into a small image. Notice how the "template" looks like a blurry average of the class. Input images score high on the template they resemble most.
This reveals a fundamental limitation. A linear classifier learns one template per class. But real classes have multiple modes — a car can face left or right, a horse can be brown or white. The single template must somehow average over all these variations, which is why the learned templates look blurry. A horse template might have two heads (one facing each way).
This is exactly the limitation that neural networks overcome: they learn hierarchical features, not just a single template per class.
We have a score function. But how do we know if the scores are good? If the correct class has the highest score, great. If not, we need a number that tells us how badly we're doing. That number is the loss function.
The multiclass SVM loss (also called hinge loss) says: the correct class score should be higher than every incorrect class score by at least a margin of 1. If it is, no penalty. If not, the penalty grows linearly with how much we're violating the margin.
Where sj is the score for class j, syi is the score for the correct class, and the sum runs over all incorrect classes. Let's unpack this with a worked example.
Drag the sliders to set scores for 3 classes. The correct class is Class 0. Watch the loss change as scores shift. The loss is zero only when the correct class beats all others by at least 1.
The "max(0, ...)" is why it's called a hinge — the function bends at zero like a door hinge. If the margin is satisfied, the loss is flat at zero. Once violated, it ramps up linearly. The loss doesn't care how much the correct class wins by, as long as it wins by at least 1.
The hinge loss only cares whether the correct class wins by a margin. It doesn't interpret the scores as probabilities. The cross-entropy loss (used with softmax) takes a different philosophy: it converts scores into probabilities and then punishes low probability on the correct class.
First, convert raw scores to probabilities using the softmax function:
This exponentiates each score, then normalizes so they sum to 1. Now we have a probability distribution over classes. The loss is the negative log probability of the correct class:
Adjust the scores. Watch the softmax probabilities and cross-entropy loss update. The correct class is Class 0.
A practical trick: before computing softmax, subtract the maximum score from all scores. This doesn't change the probabilities (it cancels in the fraction) but prevents numerical overflow from large exponentials. This is called the log-sum-exp trick.
python def softmax_loss(scores, correct_class): scores -= np.max(scores) # numerical stability exp_scores = np.exp(scores) probs = exp_scores / np.sum(exp_scores) loss = -np.log(probs[correct_class]) return loss, probs
Both loss functions measure "how wrong are the scores," but they have different personalities. The hinge loss is satisfied once the correct class wins by a margin — after that, it stops caring. The cross-entropy loss always wants the correct class probability to be higher, even if it's already winning.
| Property | Hinge (SVM) | Cross-Entropy (Softmax) |
|---|---|---|
| Scores mean... | Arbitrary margins | Unnormalized log probabilities |
| Happy when... | Correct class wins by ≥ 1 | Correct class probability → 1 |
| Gradient behavior | Zero once margin satisfied | Always nonzero (always pushes) |
| Minimum value | 0 (achievable) | 0 (approached but never reached) |
| Probabilistic? | No | Yes — gives calibrated probabilities |
Same scores, two different loss functions. Drag the correct-class score and watch how each loss responds. Notice how hinge hits zero and stays there, while cross-entropy keeps decreasing.
In practice, cross-entropy is more popular in modern deep learning because it provides well-calibrated probabilities and its gradients never vanish (it always has a learning signal). But hinge loss is still used in some settings, especially SVMs and contrastive learning.
Imagine two weight matrices that produce exactly the same scores on the training data. Which one should we prefer? The one with smaller, more spread-out weights. Why? Because it generalizes better. A classifier that relies on a few very large weights is fragile — it's memorizing quirks of the training data rather than learning general patterns.
We encode this preference by adding a regularization penalty to the loss:
The first term is the average data loss (hinge or cross-entropy). The second is the regularization loss, weighted by a hyperparameter λ. The most common choice is L2 regularization:
Both weight vectors give the same dot product with x = [1,1,1,1], but their L2 penalties differ. Drag λ to see how regularization trades off data loss and weight penalty.
Higher λ means stronger regularization: we care more about keeping weights small and less about fitting the training data perfectly. This is a bias-variance tradeoff. Too little regularization: overfitting. Too much: underfitting. The sweet spot is found via the validation set.
Now let's see everything in action. Below is a 2D dataset with 3 classes. You control the weight matrix W and bias vector b directly. Watch the decision boundaries shift as you drag the sliders, and see the loss update in real time.
3 classes in 2D. Background color = predicted class. Drag sliders to adjust W and b. Loss is computed with cross-entropy. Try to minimize it!
Try these experiments:
Before training a linear classifier on real data, there are a few critical preprocessing steps that make a huge difference in practice.
Data preprocessing: mean subtraction. Center your data by subtracting the mean image (computed over the training set) from every image. This ensures the data cloud sits around the origin, which helps the optimizer. Without centering, the bias has to do extra work to shift the decision boundary to where the data actually lives.
Normalization. After centering, optionally divide by the standard deviation so each feature has unit variance. This puts all features on the same scale. Common in practice: normalize to [−1, 1] or zero-mean unit-variance.
Raw data (left) vs centered data (right). Centering shifts the cloud to the origin, making it easier for linear boundaries through the origin to separate classes.
Weight initialization. Initialize W with small random numbers (e.g., drawn from a Gaussian with std 0.01). If W starts at zero, all classes produce the same score and the gradients are symmetric — the weights never differentiate. Small random values break this symmetry. The bias b can safely start at zero.
python # Standard preprocessing pipeline X_train -= np.mean(X_train, axis=0) # mean subtraction X_train /= np.std(X_train, axis=0) # normalization # Apply SAME transform to test data X_test -= train_mean # use training stats! X_test /= train_std # Weight initialization W = 0.01 * np.random.randn(C, D) b = np.zeros(C)
We've built a complete linear classifier: a score function (f = Wx + b), a loss function (hinge or cross-entropy), and a regularizer (L2). This is a powerful framework, but it has a hard limitation: it can only draw linear decision boundaries.
Many real-world datasets aren't linearly separable. Think of a bullseye pattern: class A in the center, class B in a ring around it. No straight line can separate them. Or consider XOR: two classes arranged diagonally. A linear classifier is helpless.
| Property | Linear Classifier | Neural Network |
|---|---|---|
| Decision boundary | Straight lines / hyperplanes | Arbitrary curves |
| Features | Raw pixels (given) | Learned hierarchically |
| Templates per class | 1 (blurry average) | Many (distributed across neurons) |
| Parameters | W (C×D) + b | Multiple W matrices + biases |
| CIFAR-10 accuracy | ~40% | 96%+ |
The key lessons from linear classification: