Computer Vision Foundations

Linear Classification
From Zero

How a single matrix multiply turns pixels into class scores — and how we teach it to get the scores right.

Prerequisites: Basic linear algebra (matrix multiply) + Image Classification lesson. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Linear Classification?

In the previous lesson, we built a k-NN classifier. It works, but it has a fatal flaw: to classify one test image, you compare it against every single training image. With 50,000 training images, that's 50,000 comparisons per prediction. Slow. Memory-hungry. Impractical.

What if, instead of memorizing all the training data, we could distill it into a small set of numbers — a compact summary that captures what each class "looks like"? Then prediction would be instant: just multiply the image by those numbers and read off the scores.

The big idea: A linear classifier compresses all the training data into a weight matrix W and a bias vector b. Prediction is a single matrix multiply: f(x) = Wx + b. Training is slow (learning the weights), but prediction is blazing fast — the exact opposite of k-NN.
k-NN vs Linear Classifier

k-NN stores everything and compares at test time. A linear classifier learns a compact summary (W, b) during training, then prediction is a single multiply.

This shift — from "memorize data" to "learn parameters" — is called a parametric model. The parameters (W, b) are learned during training and then the training data can be thrown away. This is the foundation of all deep learning.

What is the key advantage of a parametric model over k-NN?

Chapter 1: The Score Function — f = Wx + b

Here's the setup. You have an image, flattened into a column vector x of D numbers. You have C classes. You want C scores — one per class — where the highest score is the predicted class. The score function maps pixels to scores:

f(x) = W · x + b

Where W is a C×D matrix (C rows, D columns), x is a D×1 vector, and b is a C×1 bias vector. The result is a C×1 vector of scores — one score per class.

Concrete example: CIFAR-10 has 10 classes and 32×32×3 = 3,072-pixel images. So W is 10×3072 (30,720 numbers), x is 3072×1, b is 10×1. The output is 10 scores. The whole classifier is just 30,730 numbers.
Score Function: f = Wx + b

A tiny example: 4-pixel image, 3 classes. Each row of W produces one class score. The bias shifts scores up or down. Click "New Weights" to randomize.

Each row of W gets dot-producted with the image x. The dot product measures agreement: if the weight row and the image pixels point in the same direction, the score is high. If they point in opposite directions, the score is low. The bias b is an offset that lets some classes start with a higher or lower baseline score, independent of the data — useful when classes are imbalanced.

In f = Wx + b, what are the dimensions of W for 10 classes and 3072-pixel images?

Chapter 2: Weight Templates — What W Learns

Here's a beautiful way to think about what the weight matrix is doing. Each row of W is a D-dimensional vector — the same size as an image. You can literally reshape that row back into an image and look at it. What do you see? A blurry template for that class.

The dot product between a weight row and an image is highest when the image looks like the template. So the classifier is basically asking: "Of all my templates, which one does this image resemble most?" The class whose template best matches wins.

The template interpretation: Row i of W is the classifier's "mental image" of what class i looks like. The score for class i is how well the input image matches that template (dot product), plus a bias offset. Classification = template matching.
Weight Templates

Each row of W reshaped into a small image. Notice how the "template" looks like a blurry average of the class. Input images score high on the template they resemble most.

This reveals a fundamental limitation. A linear classifier learns one template per class. But real classes have multiple modes — a car can face left or right, a horse can be brown or white. The single template must somehow average over all these variations, which is why the learned templates look blurry. A horse template might have two heads (one facing each way).

This is exactly the limitation that neural networks overcome: they learn hierarchical features, not just a single template per class.

Why do the learned weight templates look blurry?

Chapter 3: Hinge Loss — The SVM Way

We have a score function. But how do we know if the scores are good? If the correct class has the highest score, great. If not, we need a number that tells us how badly we're doing. That number is the loss function.

The multiclass SVM loss (also called hinge loss) says: the correct class score should be higher than every incorrect class score by at least a margin of 1. If it is, no penalty. If not, the penalty grows linearly with how much we're violating the margin.

Li = ∑j ≠ yi max(0, sj − syi + 1)

Where sj is the score for class j, syi is the score for the correct class, and the sum runs over all incorrect classes. Let's unpack this with a worked example.

Worked example: Three classes, scores = [3.2, 5.1, -1.7], correct class = 0 (score 3.2).
• Class 1: max(0, 5.1 − 3.2 + 1) = max(0, 2.9) = 2.9
• Class 2: max(0, −1.7 − 3.2 + 1) = max(0, −3.9) = 0
Loss = 2.9 + 0 = 2.9. Class 1 scores too high relative to the correct class. Class 2 is fine (already below by more than the margin).
Hinge Loss Calculator

Drag the sliders to set scores for 3 classes. The correct class is Class 0. Watch the loss change as scores shift. The loss is zero only when the correct class beats all others by at least 1.

s0 (correct)3.2
s15.1
s2-1.7

The "max(0, ...)" is why it's called a hinge — the function bends at zero like a door hinge. If the margin is satisfied, the loss is flat at zero. Once violated, it ramps up linearly. The loss doesn't care how much the correct class wins by, as long as it wins by at least 1.

If scores are [10, 2, 3] and the correct class is 0, what is the SVM loss?

Chapter 4: Cross-Entropy Loss — The Softmax Way

The hinge loss only cares whether the correct class wins by a margin. It doesn't interpret the scores as probabilities. The cross-entropy loss (used with softmax) takes a different philosophy: it converts scores into probabilities and then punishes low probability on the correct class.

First, convert raw scores to probabilities using the softmax function:

P(y = k | x) = esk / ∑j esj

This exponentiates each score, then normalizes so they sum to 1. Now we have a probability distribution over classes. The loss is the negative log probability of the correct class:

Li = −log P(yi | xi) = −log( esyi / ∑j esj )
Why negative log? If the probability of the correct class is 1.0 (perfect), −log(1.0) = 0. If the probability is 0.01 (terrible), −log(0.01) = 4.6. The worse the prediction, the higher the penalty — and it grows logarithmically, punishing confident wrong answers severely.
Softmax & Cross-Entropy

Adjust the scores. Watch the softmax probabilities and cross-entropy loss update. The correct class is Class 0.

s0 (correct)3.2
s11.3
s22.1

A practical trick: before computing softmax, subtract the maximum score from all scores. This doesn't change the probabilities (it cancels in the fraction) but prevents numerical overflow from large exponentials. This is called the log-sum-exp trick.

python
def softmax_loss(scores, correct_class):
    scores -= np.max(scores)           # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores)
    loss = -np.log(probs[correct_class])
    return loss, probs
What does the softmax function do to raw class scores?

Chapter 5: Hinge vs Cross-Entropy

Both loss functions measure "how wrong are the scores," but they have different personalities. The hinge loss is satisfied once the correct class wins by a margin — after that, it stops caring. The cross-entropy loss always wants the correct class probability to be higher, even if it's already winning.

Think of it this way: Hinge loss is a pass/fail exam — you either clear the bar or you don't. Cross-entropy is a continuous score — getting 95% is good but 99% is better. Cross-entropy never stops pushing for improvement.
PropertyHinge (SVM)Cross-Entropy (Softmax)
Scores mean...Arbitrary marginsUnnormalized log probabilities
Happy when...Correct class wins by ≥ 1Correct class probability → 1
Gradient behaviorZero once margin satisfiedAlways nonzero (always pushes)
Minimum value0 (achievable)0 (approached but never reached)
Probabilistic?NoYes — gives calibrated probabilities
Hinge vs Cross-Entropy: Side by Side

Same scores, two different loss functions. Drag the correct-class score and watch how each loss responds. Notice how hinge hits zero and stays there, while cross-entropy keeps decreasing.

scorrect2.0

In practice, cross-entropy is more popular in modern deep learning because it provides well-calibrated probabilities and its gradients never vanish (it always has a learning signal). But hinge loss is still used in some settings, especially SVMs and contrastive learning.

If the correct class already has the highest score by a large margin, which loss function still produces a nonzero gradient?

Chapter 6: Regularization — Prefer Simple Weights

Imagine two weight matrices that produce exactly the same scores on the training data. Which one should we prefer? The one with smaller, more spread-out weights. Why? Because it generalizes better. A classifier that relies on a few very large weights is fragile — it's memorizing quirks of the training data rather than learning general patterns.

We encode this preference by adding a regularization penalty to the loss:

L = (1/N) ∑i Li + λ R(W)

The first term is the average data loss (hinge or cross-entropy). The second is the regularization loss, weighted by a hyperparameter λ. The most common choice is L2 regularization:

R(W) = ∑kl Wk,l2
Occam's razor for weights: L2 regularization penalizes the sum of squared weights. Given two weight vectors [1, 0, 0, 0] and [0.25, 0.25, 0.25, 0.25] that produce the same dot product with x = [1, 1, 1, 1], L2 prefers the diffuse one (sum of squares = 0.25) over the spiky one (sum of squares = 1). Diffuse weights use all the input information, not just one pixel.
L2 Regularization: Spiky vs Diffuse

Both weight vectors give the same dot product with x = [1,1,1,1], but their L2 penalties differ. Drag λ to see how regularization trades off data loss and weight penalty.

λ0.50

Higher λ means stronger regularization: we care more about keeping weights small and less about fitting the training data perfectly. This is a bias-variance tradeoff. Too little regularization: overfitting. Too much: underfitting. The sweet spot is found via the validation set.

Why does L2 regularization prefer diffuse weights [0.25, 0.25, 0.25, 0.25] over spiky weights [1, 0, 0, 0]?

Chapter 7: Linear Classifier Explorer — The Showcase

Now let's see everything in action. Below is a 2D dataset with 3 classes. You control the weight matrix W and bias vector b directly. Watch the decision boundaries shift as you drag the sliders, and see the loss update in real time.

What to look for: Each row of W defines a direction in 2D space. The decision boundary between two classes is a straight line where their scores are equal. Adjusting a weight row rotates or tilts that class's boundary. The bias shifts it without rotating. Try to minimize the loss by getting the boundaries to separate the classes.
Interactive Linear Classifier

3 classes in 2D. Background color = predicted class. Drag sliders to adjust W and b. Loss is computed with cross-entropy. Try to minimize it!

Class 0 (orange): w=[w00, w01], b0
w001.5
w010.5
b00.0
Class 1 (teal): w=[w10, w11], b1
w10-1.0
w111.5
b10.0
Class 2 (purple): w=[w20, w21], b2
w200.0
w21-2.0
b20.0
Loss: —

Try these experiments:

Key observation: The decision boundaries are always straight lines (or hyperplanes in higher dimensions). This is the fundamental limitation of linear classifiers — they can only carve space with flat cuts. Curved or circular boundaries are impossible.

Chapter 8: Practical Concerns

Before training a linear classifier on real data, there are a few critical preprocessing steps that make a huge difference in practice.

Data preprocessing: mean subtraction. Center your data by subtracting the mean image (computed over the training set) from every image. This ensures the data cloud sits around the origin, which helps the optimizer. Without centering, the bias has to do extra work to shift the decision boundary to where the data actually lives.

Why center? If all pixel values are positive (0-255), the gradient on W is always the same sign. Centering (so values range from roughly −128 to +127) lets the weights receive both positive and negative gradients, making optimization much smoother.

Normalization. After centering, optionally divide by the standard deviation so each feature has unit variance. This puts all features on the same scale. Common in practice: normalize to [−1, 1] or zero-mean unit-variance.

Effect of Mean Subtraction

Raw data (left) vs centered data (right). Centering shifts the cloud to the origin, making it easier for linear boundaries through the origin to separate classes.

Weight initialization. Initialize W with small random numbers (e.g., drawn from a Gaussian with std 0.01). If W starts at zero, all classes produce the same score and the gradients are symmetric — the weights never differentiate. Small random values break this symmetry. The bias b can safely start at zero.

python
# Standard preprocessing pipeline
X_train -= np.mean(X_train, axis=0)  # mean subtraction
X_train /= np.std(X_train, axis=0)   # normalization

# Apply SAME transform to test data
X_test -= train_mean                   # use training stats!
X_test /= train_std

# Weight initialization
W = 0.01 * np.random.randn(C, D)
b = np.zeros(C)
Critical rule: Always compute preprocessing statistics (mean, std) on the training set only, then apply them to validation and test sets. If you compute the mean on the full dataset, you're leaking test information into training — the same contamination we warned about with hyperparameter tuning.
Why must you compute the mean from the training set and apply it to the test set, rather than computing a separate mean for each?

Chapter 9: Beyond — Why Linear Is Not Enough

We've built a complete linear classifier: a score function (f = Wx + b), a loss function (hinge or cross-entropy), and a regularizer (L2). This is a powerful framework, but it has a hard limitation: it can only draw linear decision boundaries.

Many real-world datasets aren't linearly separable. Think of a bullseye pattern: class A in the center, class B in a ring around it. No straight line can separate them. Or consider XOR: two classes arranged diagonally. A linear classifier is helpless.

The path forward: What if we stacked multiple linear classifiers, with a nonlinear function (like max(0, x)) in between? The first layer would compute intermediate features; the second would classify based on those features. Congratulations — you've just invented a neural network.
PropertyLinear ClassifierNeural Network
Decision boundaryStraight lines / hyperplanesArbitrary curves
FeaturesRaw pixels (given)Learned hierarchically
Templates per class1 (blurry average)Many (distributed across neurons)
ParametersW (C×D) + bMultiple W matrices + biases
CIFAR-10 accuracy~40%96%+

The key lessons from linear classification:

What's next: We have a loss function, but how do we actually find the W that minimizes it? That's optimization — specifically, gradient descent. We compute the gradient of the loss with respect to W, and take a small step in the opposite direction. Repeat until convergence.
What is the fundamental limitation of a linear classifier that neural networks overcome?