How does a computer look at a grid of numbers and decide "that's a cat"?
You glance at a photo and instantly know: that's a cat. No effort. No thought. You've been doing it since you were a toddler. But how would you write a program to do it?
Here's the brutal truth: to a computer, an image isn't a picture. It's a giant grid of numbers. A 256×256 color image is 256 × 256 × 3 = 196,608 numbers, each between 0 and 255. Somewhere in that grid of numbers, there's a "cat." But the computer has no eyes, no visual cortex, no concept of whiskers or ears. It just sees numbers.
A tiny 8×8 "image." Each cell is a pixel intensity (0=black, 255=white). To you, it might look like a pattern. To the computer, it's just 64 numbers.
You can't write an if-statement for "cat." You can't say if pixel[100][50] > 128 then cat. Cats come in every color, every pose, every background. Instead, we take a data-driven approach: collect thousands of labeled images, and let the algorithm figure out the patterns.
Even though a cat is always a cat, the pixels that represent it change drastically. Think about all the ways the numbers in that grid can shift while the semantic meaning stays the same. Computer vision researchers call these semantic gaps.
Viewpoint variation. A cat photographed from the front vs the side vs above produces completely different pixel patterns. Same cat, totally different numbers.
Illumination. Photograph a white cat in a dark room and the pixels are mostly dark. Same cat in sunlight — pixels are bright. The numbers flip, but it's the same object.
Deformation. Cats are flexible. A sleeping cat looks nothing like a leaping cat. The pixel grid warps and stretches.
Occlusion. A cat behind a fence — you only see pieces. Somehow your brain fills in the gaps. A pixel-comparison algorithm can't.
Background clutter. A tabby cat on a pile of autumn leaves. The colors blend. The object and background have similar pixel values.
Each grid is a tiny "image" of the same class (a diagonal line), but with different transformations applied. Pixel values differ wildly, yet they're all the same thing.
This is why image classification is hard. An algorithm that just compares pixel values will be fooled by every single one of these variations. We need something smarter.
Let's start with the simplest possible idea. You have a training set of labeled images. When a new test image arrives, you compare it to every single training image, find the one that's most similar, and copy its label. That's the Nearest Neighbor (NN) classifier.
But what does "most similar" mean for images? We need a way to measure the distance between two images. The simplest approach: treat each image as a long vector of pixel values, and compare them element by element.
For two images I1 and I2, each with N pixels, the L1 (Manhattan) distance is:
You take the absolute difference at each pixel and sum them all up. A small total means the images have similar pixel values. A large total means they look very different — at the pixel level, at least.
A test point (white ×) finds its nearest neighbor in the training set. The label of the nearest point becomes the prediction. Click to move the test point.
python import numpy as np class NearestNeighbor: def train(self, X, y): self.Xtr = X # just memorize training data self.ytr = y def predict(self, X): num_test = X.shape[0] Ypred = np.zeros(num_test) for i in range(num_test): distances = np.sum(np.abs(self.Xtr - X[i]), axis=1) Ypred[i] = self.ytr[np.argmin(distances)] return Ypred
We used L1 (Manhattan) distance above — sum of absolute differences. But there's another popular choice: L2 (Euclidean) distance, which sums the squared differences and takes the square root.
What's the difference in practice? L2 penalizes large deviations much more than L1. If one pixel is off by 100, L1 adds 100 to the total. L2 adds 10,000 (before the square root). So L2 prefers many small disagreements over one large disagreement.
All points at the same distance from the center. Orange = L1 (diamond). Teal = L2 (circle). Drag the slider to change the radius.
| Property | L1 (Manhattan) | L2 (Euclidean) |
|---|---|---|
| Formula | ∑|differences| | √(∑differences²) |
| Unit ball shape | Diamond | Circle |
| Sensitivity to outlier pixels | Less sensitive | More sensitive |
| When to use | Features are independent | Features are correlated |
In practice, you try both and see which works better on your data. Neither is universally superior. This choice — L1 or L2 — is a hyperparameter: a design decision you make before training, not something the algorithm learns from data.
Nearest Neighbor has a problem: it's too sensitive to noise. If one training example is mislabeled or just weird, it creates a small island of wrong predictions around it. The fix is beautifully simple: instead of asking the one closest neighbor, ask the k closest neighbors and let them vote.
The k-Nearest Neighbors (k-NN) algorithm:
The white × is your test point. The orange and teal dots are two classes. Increase k to see more neighbors vote. Click to move the test point.
Notice how k=1 gives a jagged boundary that traces every individual point, while k=5 or k=7 gives a smoother, more generalizable boundary. The choice of k is another hyperparameter.
We now have choices to make: what value of k? L1 or L2? These are hyperparameters — settings that aren't learned from data, but chosen by the practitioner before training. How do we pick the right ones?
Tempting but catastrophically wrong: try different hyperparameters, evaluate on the test set, pick whichever scores highest. Why is this wrong? Because you've just contaminated your test set. It's no longer a fair evaluation — you've effectively optimized for it. Your reported accuracy will be overly optimistic.
The solution: split your data into three sets.
Validation accuracy for different values of k. The best k on the validation set isn't always 1. Too small = overfitting to noise. Too large = underfitting.
You train with different hyperparameters, evaluate each on the validation set, pick the best, and then report the test set accuracy. This is the honest way to do machine learning.
What if your dataset is small? A single validation split might be unreliable — you might just get lucky (or unlucky) with which examples landed in the validation set. k-fold cross-validation fixes this by using every example for both training and validation.
Here's how it works. Split the training data into k equal folds. For each fold: hold it out as validation, train on the other k−1 folds, measure accuracy. Average the k accuracy scores. This gives you a much more stable estimate of performance.
Each row is one fold rotation. Teal = training data. Orange = validation fold. Every data block serves as validation exactly once.
In practice, 5-fold or 10-fold CV is common. However, it's computationally expensive (5× or 10× more training), so for large datasets with reliable validation splits, a single split is usually fine. Cross-validation is most valuable when data is scarce.
| Folds | Advantage | Disadvantage |
|---|---|---|
| 3 | Fast, good for large datasets | High variance in estimate |
| 5 | Good balance of bias/variance | 5× training cost |
| 10 | Low bias, stable estimate | 10× training cost |
| N (LOO) | Minimal bias | Extremely expensive |
Now let's put it all together. Below is a 2D scatter plot with three colored classes. You are the classifier designer. Adjust k and switch between L1 and L2 distance to see how the decision boundary changes in real time.
Three classes in 2D. The background color shows the predicted class at every point. Adjust k and distance metric to see the boundary reshape live.
Try these experiments:
k-NN is elegant and simple. On CIFAR-10 (60,000 tiny images, 10 classes), the best k-NN achieves about 38% accuracy. That's better than random guessing (10%), but far from the 96%+ that modern neural networks achieve. What goes wrong?
The curse of dimensionality. A CIFAR-10 image is 32×32×3 = 3,072 dimensions. To densely cover a 3,072-dimensional space, you'd need an astronomically large training set. The number of data points needed grows exponentially with dimension. In high dimensions, all points are approximately the same distance apart — "nearest" becomes meaningless.
To cover 1D with 4 points is easy. To cover 2D you need 4²=16. To cover 3D you need 4³=64. The number of points needed grows exponentially.
Pixel distance is semantically meaningless. Consider three images: (A) a cat facing left, (B) the same cat shifted one pixel right, (C) a completely different image with similar overall brightness. B is semantically identical to A, but differs at every single pixel. C is semantically different, but might have a smaller pixel distance. L1/L2 on raw pixels measures the wrong thing.
Prediction is slow. For N training images, each prediction requires N distance computations. With 50,000 training images and 10,000 test images, that's 500 million comparisons. This is fundamentally at odds with real-time applications.
k-NN on raw pixels taught us something important: the representation matters more than the classifier. Comparing images pixel-by-pixel is asking the wrong question. What we need is a way to transform images into a feature space where similar objects are close and different objects are far apart.
This insight leads directly to the next topic: linear classifiers. Instead of memorizing data and comparing at test time, a linear classifier learns a template (weight matrix) during training. Prediction is a single matrix multiply — blazing fast. And from linear classifiers, the road leads to neural networks, which learn the features themselves.
| Method | Features | Train Time | Predict Time | CIFAR-10 Acc. |
|---|---|---|---|---|
| k-NN (raw pixels) | None (raw) | O(1) | O(N) | ~38% |
| Linear Classifier | Raw or hand-crafted | O(epochs) | O(1) | ~40% |
| Neural Network | Learned | O(epochs) | O(1) | 96%+ |
The key lessons from this chapter: