Linear Classification — From Templates to Decision Boundaries

Chapter 0: Why Linear Classification?

In the previous lesson, we built a k-NN classifier. It works, but it has a fatal flaw: to classify one test image, you compare it against every single training image. With 50,000 training images, that's 50,000 comparisons per prediction. Slow. Memory-hungry. Impractical.

What if, instead of memorizing all the training data, we could distill it into a small set of numbers — a compact summary that captures what each class "looks like"? Then prediction would be instant: just multiply the image by those numbers and read off the scores.

The big idea: A linear classifier compresses all the training data into a weight matrix W and a bias vector b. Prediction is a single matrix multiply: f(x) = Wx + b. Training is slow (learning the weights), but prediction is blazing fast — the exact opposite of k-NN.

k-NN vs Linear Classifier

k-NN stores everything and compares at test time. A linear classifier learns a compact summary (W, b) during training, then prediction is a single multiply.

This shift — from "memorize data" to "learn parameters" — is called a parametric model. The parameters (W, b) are learned during training and then the training data can be thrown away. This is the foundation of all deep learning.

What is the key advantage of a parametric model over k-NN?

It uses more data during training It distills training data into parameters, so prediction doesn't need the training set It doesn't require labeled data

Chapter 1: The Score Function — f = Wx + b

Here's the setup. You have an image, flattened into a column vector x of D numbers. You have C classes. You want C scores — one per class — where the highest score is the predicted class. The score function maps pixels to scores:

f(x) = W · x + b

Where W is a C×D matrix (C rows, D columns), x is a D×1 vector, and b is a C×1 bias vector. The result is a C×1 vector of scores — one score per class.

Concrete example: CIFAR-10 has 10 classes and 32×32×3 = 3,072-pixel images. So W is 10×3072 (30,720 numbers), x is 3072×1, b is 10×1. The output is 10 scores. The whole classifier is just 30,730 numbers.

Score Function: f = Wx + b

A tiny example: 4-pixel image, 3 classes. Each row of W produces one class score. The bias shifts scores up or down. Click "New Weights" to randomize.

Each row of W gets dot-producted with the image x. The dot product measures agreement: if the weight row and the image pixels point in the same direction, the score is high. If they point in opposite directions, the score is low. The bias b is an offset that lets some classes start with a higher or lower baseline score, independent of the data — useful when classes are imbalanced.

In f = Wx + b, what are the dimensions of W for 10 classes and 3072-pixel images?

10 × 3072 — one row per class, one column per pixel 3072 × 10 — one row per pixel, one column per class 10 × 10 — one row and column per class

Chapter 2: Weight Templates — What W Learns

Here's a beautiful way to think about what the weight matrix is doing. Each row of W is a D-dimensional vector — the same size as an image. You can literally reshape that row back into an image and look at it. What do you see? A blurry template for that class.

The dot product between a weight row and an image is highest when the image looks like the template. So the classifier is basically asking: "Of all my templates, which one does this image resemble most?" The class whose template best matches wins.

The template interpretation: Row i of W is the classifier's "mental image" of what class i looks like. The score for class i is how well the input image matches that template (dot product), plus a bias offset. Classification = template matching.

Weight Templates

Each row of W reshaped into a small image. Notice how the "template" looks like a blurry average of the class. Input images score high on the template they resemble most.

This reveals a fundamental limitation. A linear classifier learns one template per class. But real classes have multiple modes — a car can face left or right, a horse can be brown or white. The single template must somehow average over all these variations, which is why the learned templates look blurry. A horse template might have two heads (one facing each way).

This is exactly the limitation that neural networks overcome: they learn hierarchical features, not just a single template per class.

Why do the learned weight templates look blurry?

Each template must average over all variations of a class (poses, colors, backgrounds), producing a blurry composite The camera resolution is too low The bias vector adds noise to the templates

Chapter 3: Hinge Loss — The SVM Way

We have a score function. But how do we know if the scores are good? If the correct class has the highest score, great. If not, we need a number that tells us how badly we're doing. That number is the loss function.

The multiclass SVM loss (also called hinge loss) says: the correct class score should be higher than every incorrect class score by at least a margin of 1. If it is, no penalty. If not, the penalty grows linearly with how much we're violating the margin.

L_i = ∑_{j ≠ y_i} max(0, s_j − s_{y_i} + 1)

Where s_j is the score for class j, s_{y_i} is the score for the correct class, and the sum runs over all incorrect classes. Let's unpack this with a worked example.

Worked example: Three classes, scores = [3.2, 5.1, -1.7], correct class = 0 (score 3.2).
• Class 1: max(0, 5.1 − 3.2 + 1) = max(0, 2.9) = 2.9
• Class 2: max(0, −1.7 − 3.2 + 1) = max(0, −3.9) = 0
Loss = 2.9 + 0 = 2.9. Class 1 scores too high relative to the correct class. Class 2 is fine (already below by more than the margin).

Hinge Loss Calculator

Drag the sliders to set scores for 3 classes. The correct class is Class 0. Watch the loss change as scores shift. The loss is zero only when the correct class beats all others by at least 1.

s₀ (correct)3.2

s₁5.1

s₂-1.7

The "max(0, ...)" is why it's called a hinge — the function bends at zero like a door hinge. If the margin is satisfied, the loss is flat at zero. Once violated, it ramps up linearly. The loss doesn't care how much the correct class wins by, as long as it wins by at least 1.

If scores are [10, 2, 3] and the correct class is 0, what is the SVM loss?

0 — the correct class beats all others by more than the margin of 1 5 — the sum of all score differences 10 — the score of the correct class

Chapter 4: Cross-Entropy Loss — The Softmax Way

The hinge loss only cares whether the correct class wins by a margin. It doesn't interpret the scores as probabilities. The cross-entropy loss (used with softmax) takes a different philosophy: it converts scores into probabilities and then punishes low probability on the correct class.

First, convert raw scores to probabilities using the softmax function:

P(y = k | x) = e^s_k / ∑_j e^s_j

This exponentiates each score, then normalizes so they sum to 1. Now we have a probability distribution over classes. The loss is the negative log probability of the correct class:

L_i = −log P(y_i | x_i) = −log( e^s_{y_i} / ∑_j e^s_j )

Why negative log? If the probability of the correct class is 1.0 (perfect), −log(1.0) = 0. If the probability is 0.01 (terrible), −log(0.01) = 4.6. The worse the prediction, the higher the penalty — and it grows logarithmically, punishing confident wrong answers severely.

Softmax & Cross-Entropy

Adjust the scores. Watch the softmax probabilities and cross-entropy loss update. The correct class is Class 0.

s₀ (correct)3.2

s₁1.3

s₂2.1

A practical trick: before computing softmax, subtract the maximum score from all scores. This doesn't change the probabilities (it cancels in the fraction) but prevents numerical overflow from large exponentials. This is called the log-sum-exp trick.

python
def softmax_loss(scores, correct_class):
    scores -= np.max(scores)           # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores)
    loss = -np.log(probs[correct_class])
    return loss, probs

What does the softmax function do to raw class scores?

Subtracts the mean score from each Converts them to a probability distribution that sums to 1 Keeps only the highest score and sets others to zero

Chapter 5: Hinge vs Cross-Entropy

Both loss functions measure "how wrong are the scores," but they have different personalities. The hinge loss is satisfied once the correct class wins by a margin — after that, it stops caring. The cross-entropy loss always wants the correct class probability to be higher, even if it's already winning.

Think of it this way: Hinge loss is a pass/fail exam — you either clear the bar or you don't. Cross-entropy is a continuous score — getting 95% is good but 99% is better. Cross-entropy never stops pushing for improvement.

Property	Hinge (SVM)	Cross-Entropy (Softmax)
Scores mean...	Arbitrary margins	Unnormalized log probabilities
Happy when...	Correct class wins by ≥ 1	Correct class probability → 1
Gradient behavior	Zero once margin satisfied	Always nonzero (always pushes)
Minimum value	0 (achievable)	0 (approached but never reached)
Probabilistic?	No	Yes — gives calibrated probabilities

Hinge vs Cross-Entropy: Side by Side

Same scores, two different loss functions. Drag the correct-class score and watch how each loss responds. Notice how hinge hits zero and stays there, while cross-entropy keeps decreasing.

s_correct2.0

In practice, cross-entropy is more popular in modern deep learning because it provides well-calibrated probabilities and its gradients never vanish (it always has a learning signal). But hinge loss is still used in some settings, especially SVMs and contrastive learning.

If the correct class already has the highest score by a large margin, which loss function still produces a nonzero gradient?

Hinge loss — it keeps pushing even after the margin is satisfied Cross-entropy — it always wants the correct class probability closer to 1 Neither — both are zero when the correct class is winning

Chapter 6: Regularization — Prefer Simple Weights

Imagine two weight matrices that produce exactly the same scores on the training data. Which one should we prefer? The one with smaller, more spread-out weights. Why? Because it generalizes better. A classifier that relies on a few very large weights is fragile — it's memorizing quirks of the training data rather than learning general patterns.

We encode this preference by adding a regularization penalty to the loss:

L = (1/N) ∑_i L_i + λ R(W)

The first term is the average data loss (hinge or cross-entropy). The second is the regularization loss, weighted by a hyperparameter λ. The most common choice is L2 regularization:

R(W) = ∑_k ∑_l W_k,l²

Occam's razor for weights: L2 regularization penalizes the sum of squared weights. Given two weight vectors [1, 0, 0, 0] and [0.25, 0.25, 0.25, 0.25] that produce the same dot product with x = [1, 1, 1, 1], L2 prefers the diffuse one (sum of squares = 0.25) over the spiky one (sum of squares = 1). Diffuse weights use all the input information, not just one pixel.

L2 Regularization: Spiky vs Diffuse

Both weight vectors give the same dot product with x = [1,1,1,1], but their L2 penalties differ. Drag λ to see how regularization trades off data loss and weight penalty.

λ0.50

Higher λ means stronger regularization: we care more about keeping weights small and less about fitting the training data perfectly. This is a bias-variance tradeoff. Too little regularization: overfitting. Too much: underfitting. The sweet spot is found via the validation set.

Why does L2 regularization prefer diffuse weights [0.25, 0.25, 0.25, 0.25] over spiky weights [1, 0, 0, 0]?

The sum of squared weights is smaller for diffuse weights, so the regularization penalty is lower Diffuse weights make the bias vector unnecessary Spiky weights cause numerical overflow

Chapter 7: Linear Classifier Explorer — The Showcase

Now let's see everything in action. Below is a 2D dataset with 3 classes. You control the weight matrix W and bias vector b directly. Watch the decision boundaries shift as you drag the sliders, and see the loss update in real time.

What to look for: Each row of W defines a direction in 2D space. The decision boundary between two classes is a straight line where their scores are equal. Adjusting a weight row rotates or tilts that class's boundary. The bias shifts it without rotating. Try to minimize the loss by getting the boundaries to separate the classes.

Interactive Linear Classifier

3 classes in 2D. Background color = predicted class. Drag sliders to adjust W and b. Loss is computed with cross-entropy. Try to minimize it!

      Class 0 (orange): w=[w00, w01], b0
    

w₀₀1.5

w₀₁0.5

b₀0.0

Class 1 (teal): w=[w10, w11], b1

w₁₀-1.0

w₁₁1.5

b₁0.0

Class 2 (purple): w=[w20, w21], b2

w₂₀0.0

w₂₁-2.0

b₂0.0

      Loss: —
    

Try these experiments:

Set all weights to zero. The boundaries collapse — every point gets the same score.
Try to find weights that separate all three classes with zero misclassifications.
Click "New Data" and notice that some datasets can't be separated by straight lines at all.

Key observation: The decision boundaries are always straight lines (or hyperplanes in higher dimensions). This is the fundamental limitation of linear classifiers — they can only carve space with flat cuts. Curved or circular boundaries are impossible.

Chapter 8: Practical Concerns

Before training a linear classifier on real data, there are a few critical preprocessing steps that make a huge difference in practice.

Data preprocessing: mean subtraction. Center your data by subtracting the mean image (computed over the training set) from every image. This ensures the data cloud sits around the origin, which helps the optimizer. Without centering, the bias has to do extra work to shift the decision boundary to where the data actually lives.

Why center? If all pixel values are positive (0-255), the gradient on W is always the same sign. Centering (so values range from roughly −128 to +127) lets the weights receive both positive and negative gradients, making optimization much smoother.

Normalization. After centering, optionally divide by the standard deviation so each feature has unit variance. This puts all features on the same scale. Common in practice: normalize to [−1, 1] or zero-mean unit-variance.

Effect of Mean Subtraction

Raw data (left) vs centered data (right). Centering shifts the cloud to the origin, making it easier for linear boundaries through the origin to separate classes.

Weight initialization. Initialize W with small random numbers (e.g., drawn from a Gaussian with std 0.01). If W starts at zero, all classes produce the same score and the gradients are symmetric — the weights never differentiate. Small random values break this symmetry. The bias b can safely start at zero.

python
# Standard preprocessing pipeline
X_train -= np.mean(X_train, axis=0)  # mean subtraction
X_train /= np.std(X_train, axis=0)   # normalization

# Apply SAME transform to test data
X_test -= train_mean                   # use training stats!
X_test /= train_std

# Weight initialization
W = 0.01 * np.random.randn(C, D)
b = np.zeros(C)

Critical rule: Always compute preprocessing statistics (mean, std) on the training set only, then apply them to validation and test sets. If you compute the mean on the full dataset, you're leaking test information into training — the same contamination we warned about with hyperparameter tuning.

Why must you compute the mean from the training set and apply it to the test set, rather than computing a separate mean for each?

Using test set statistics leaks information about the test data into preprocessing, compromising evaluation integrity The test set is too small to compute a reliable mean Computing two means is too slow

Chapter 9: Beyond — Why Linear Is Not Enough

We've built a complete linear classifier: a score function (f = Wx + b), a loss function (hinge or cross-entropy), and a regularizer (L2). This is a powerful framework, but it has a hard limitation: it can only draw linear decision boundaries.

Many real-world datasets aren't linearly separable. Think of a bullseye pattern: class A in the center, class B in a ring around it. No straight line can separate them. Or consider XOR: two classes arranged diagonally. A linear classifier is helpless.

The path forward: What if we stacked multiple linear classifiers, with a nonlinear function (like max(0, x)) in between? The first layer would compute intermediate features; the second would classify based on those features. Congratulations — you've just invented a neural network.

Property	Linear Classifier	Neural Network
Decision boundary	Straight lines / hyperplanes	Arbitrary curves
Features	Raw pixels (given)	Learned hierarchically
Templates per class	1 (blurry average)	Many (distributed across neurons)
Parameters	W (C×D) + b	Multiple W matrices + biases
CIFAR-10 accuracy	~40%	96%+

The key lessons from linear classification:

Score function: f = Wx + b maps inputs to class scores
Loss function: measures how wrong the scores are (hinge or cross-entropy)
Regularization: prefer simple (diffuse) weights to prevent overfitting
Optimization: find W and b that minimize total loss (next topic: gradient descent)

What's next: We have a loss function, but how do we actually find the W that minimizes it? That's optimization — specifically, gradient descent. We compute the gradient of the loss with respect to W, and take a small step in the opposite direction. Repeat until convergence.

What is the fundamental limitation of a linear classifier that neural networks overcome?

Linear classifiers can only draw straight decision boundaries, so they fail on non-linearly-separable data Linear classifiers are too slow Linear classifiers can't handle more than two classes

Linear ClassificationFrom Zero