Computer Vision Foundations

Image Classification
From Zero

How does a computer look at a grid of numbers and decide "that's a cat"?

Prerequisites: Basic arithmetic + Coordinate geometry. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: Why Image Classification?

You glance at a photo and instantly know: that's a cat. No effort. No thought. You've been doing it since you were a toddler. But how would you write a program to do it?

Here's the brutal truth: to a computer, an image isn't a picture. It's a giant grid of numbers. A 256×256 color image is 256 × 256 × 3 = 196,608 numbers, each between 0 and 255. Somewhere in that grid of numbers, there's a "cat." But the computer has no eyes, no visual cortex, no concept of whiskers or ears. It just sees numbers.

The core problem: Given a grid of numbers (an image), assign it a label from a fixed set of categories. That's image classification — the most fundamental task in computer vision. Everything else (detection, segmentation, generation) builds on it.
What the Computer Sees

A tiny 8×8 "image." Each cell is a pixel intensity (0=black, 255=white). To you, it might look like a pattern. To the computer, it's just 64 numbers.

You can't write an if-statement for "cat." You can't say if pixel[100][50] > 128 then cat. Cats come in every color, every pose, every background. Instead, we take a data-driven approach: collect thousands of labeled images, and let the algorithm figure out the patterns.

Collect
Gather thousands of labeled images
Train
Algorithm learns patterns from examples
Predict
Given a new image, output a label
Why can't we write explicit rules (if/else) to classify images?

Chapter 1: The Challenge — Why Vision is Hard

Even though a cat is always a cat, the pixels that represent it change drastically. Think about all the ways the numbers in that grid can shift while the semantic meaning stays the same. Computer vision researchers call these semantic gaps.

Key insight: The same object produces wildly different pixel arrays depending on conditions. A classifier must somehow see through all this variation to the underlying identity.

Viewpoint variation. A cat photographed from the front vs the side vs above produces completely different pixel patterns. Same cat, totally different numbers.

Illumination. Photograph a white cat in a dark room and the pixels are mostly dark. Same cat in sunlight — pixels are bright. The numbers flip, but it's the same object.

Deformation. Cats are flexible. A sleeping cat looks nothing like a leaping cat. The pixel grid warps and stretches.

Occlusion. A cat behind a fence — you only see pieces. Somehow your brain fills in the gaps. A pixel-comparison algorithm can't.

Background clutter. A tabby cat on a pile of autumn leaves. The colors blend. The object and background have similar pixel values.

Same Class, Different Pixels

Each grid is a tiny "image" of the same class (a diagonal line), but with different transformations applied. Pixel values differ wildly, yet they're all the same thing.

This is why image classification is hard. An algorithm that just compares pixel values will be fooled by every single one of these variations. We need something smarter.

Which of these is NOT a challenge for image classification?

Chapter 2: Nearest Neighbor — The Simplest Classifier

Let's start with the simplest possible idea. You have a training set of labeled images. When a new test image arrives, you compare it to every single training image, find the one that's most similar, and copy its label. That's the Nearest Neighbor (NN) classifier.

Think of it this way: You've never seen a Samoyed before, but someone shows you a photo of a fluffy white dog and asks "what is this?" You flip through your mental photo album, find the most similar image (a picture of an American Eskimo Dog), and say "probably that." That's nearest neighbor classification.

But what does "most similar" mean for images? We need a way to measure the distance between two images. The simplest approach: treat each image as a long vector of pixel values, and compare them element by element.

For two images I1 and I2, each with N pixels, the L1 (Manhattan) distance is:

dL1(I1, I2) = ∑p |I1p − I2p|

You take the absolute difference at each pixel and sum them all up. A small total means the images have similar pixel values. A large total means they look very different — at the pixel level, at least.

Nearest Neighbor in Action

A test point (white ×) finds its nearest neighbor in the training set. The label of the nearest point becomes the prediction. Click to move the test point.

Two-phase API. Every classifier has two phases: train (memorize the data) and predict (compare new input to memorized data). For NN, training is instant (just store everything) but prediction is painfully slow — you compare against every stored example. This is the worst possible tradeoff: we want fast prediction, not fast training.
python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        self.Xtr = X       # just memorize training data
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test)
        for i in range(num_test):
            distances = np.sum(np.abs(self.Xtr - X[i]), axis=1)
            Ypred[i] = self.ytr[np.argmin(distances)]
        return Ypred
What is the time complexity of predicting ONE test image with Nearest Neighbor?

Chapter 3: L1 vs L2 Distance

We used L1 (Manhattan) distance above — sum of absolute differences. But there's another popular choice: L2 (Euclidean) distance, which sums the squared differences and takes the square root.

dL2(I1, I2) = √( ∑p (I1p − I2p)² )

What's the difference in practice? L2 penalizes large deviations much more than L1. If one pixel is off by 100, L1 adds 100 to the total. L2 adds 10,000 (before the square root). So L2 prefers many small disagreements over one large disagreement.

Geometric intuition: The set of all points at distance 1 from the origin forms a shape. For L2 it's a circle. For L1 it's a diamond (rotated square). This means L1 and L2 carve up space differently, leading to different classification boundaries.
L1 vs L2 Distance Balls

All points at the same distance from the center. Orange = L1 (diamond). Teal = L2 (circle). Drag the slider to change the radius.

Radius80
PropertyL1 (Manhattan)L2 (Euclidean)
Formula∑|differences|√(∑differences²)
Unit ball shapeDiamondCircle
Sensitivity to outlier pixelsLess sensitiveMore sensitive
When to useFeatures are independentFeatures are correlated

In practice, you try both and see which works better on your data. Neither is universally superior. This choice — L1 or L2 — is a hyperparameter: a design decision you make before training, not something the algorithm learns from data.

How does L2 distance differ from L1 when one pixel is very different but all others match?

Chapter 4: k-Nearest Neighbors

Nearest Neighbor has a problem: it's too sensitive to noise. If one training example is mislabeled or just weird, it creates a small island of wrong predictions around it. The fix is beautifully simple: instead of asking the one closest neighbor, ask the k closest neighbors and let them vote.

Democracy beats dictatorship. With k=1, a single noisy training point corrupts an entire region. With k=5, that noisy point gets outvoted by its four sensible neighbors. Higher k = smoother, more robust decision boundaries. But too high k = boundaries so smooth they ignore local structure.

The k-Nearest Neighbors (k-NN) algorithm:

1. Compute distances
Find distance from test point to every training point
2. Find k nearest
Sort by distance, take the top k
3. Vote
Most common label among k neighbors wins
k-NN Voting

The white × is your test point. The orange and teal dots are two classes. Increase k to see more neighbors vote. Click to move the test point.

k1

Notice how k=1 gives a jagged boundary that traces every individual point, while k=5 or k=7 gives a smoother, more generalizable boundary. The choice of k is another hyperparameter.

Why does increasing k tend to produce smoother decision boundaries?

Chapter 5: Hyperparameters — Never Touch the Test Set

We now have choices to make: what value of k? L1 or L2? These are hyperparameters — settings that aren't learned from data, but chosen by the practitioner before training. How do we pick the right ones?

Tempting but catastrophically wrong: try different hyperparameters, evaluate on the test set, pick whichever scores highest. Why is this wrong? Because you've just contaminated your test set. It's no longer a fair evaluation — you've effectively optimized for it. Your reported accuracy will be overly optimistic.

The golden rule: The test set is a sealed envelope. You open it exactly once — at the very end, when you've finalized every design choice. If you peek earlier, your results are meaningless.

The solution: split your data into three sets.

Training Set
The algorithm learns from this data (~70-80%)
Validation Set
Tune hyperparameters against this (~10-15%)
Test Set
Final evaluation only. Touch once. (~10-15%)
Hyperparameter Tuning

Validation accuracy for different values of k. The best k on the validation set isn't always 1. Too small = overfitting to noise. Too large = underfitting.

You train with different hyperparameters, evaluate each on the validation set, pick the best, and then report the test set accuracy. This is the honest way to do machine learning.

Why is tuning hyperparameters on the test set a bad idea?

Chapter 6: Cross-Validation

What if your dataset is small? A single validation split might be unreliable — you might just get lucky (or unlucky) with which examples landed in the validation set. k-fold cross-validation fixes this by using every example for both training and validation.

Here's how it works. Split the training data into k equal folds. For each fold: hold it out as validation, train on the other k−1 folds, measure accuracy. Average the k accuracy scores. This gives you a much more stable estimate of performance.

Why "k-fold"? Each data point gets to be in the validation set exactly once. Nothing is wasted. With 5 folds, you effectively get 5 different train/val splits from the same data, and average the results. More folds = less bias but more computation.
5-Fold Cross-Validation

Each row is one fold rotation. Teal = training data. Orange = validation fold. Every data block serves as validation exactly once.

In practice, 5-fold or 10-fold CV is common. However, it's computationally expensive (5× or 10× more training), so for large datasets with reliable validation splits, a single split is usually fine. Cross-validation is most valuable when data is scarce.

FoldsAdvantageDisadvantage
3Fast, good for large datasetsHigh variance in estimate
5Good balance of bias/variance5× training cost
10Low bias, stable estimate10× training cost
N (LOO)Minimal biasExtremely expensive
In 5-fold cross-validation, how many times is each data point used for validation?

Chapter 7: k-NN Explorer — The Showcase

Now let's put it all together. Below is a 2D scatter plot with three colored classes. You are the classifier designer. Adjust k and switch between L1 and L2 distance to see how the decision boundary changes in real time.

What to look for: At k=1, the boundary hugs every individual point — it's jagged and fragile. As you increase k, the boundary smooths out. Switch between L1 and L2 to see how the diamond-vs-circle geometry changes which regions belong to which class.
Interactive k-NN Decision Boundary

Three classes in 2D. The background color shows the predicted class at every point. Adjust k and distance metric to see the boundary reshape live.

k1
k=1, L2 distance

Try these experiments:

Key observation: There is no single "best" k or distance metric that works for all data. This is why we need validation — the data tells us which hyperparameters work best for this particular problem.

Chapter 8: Limitations — Why k-NN Fails on Images

k-NN is elegant and simple. On CIFAR-10 (60,000 tiny images, 10 classes), the best k-NN achieves about 38% accuracy. That's better than random guessing (10%), but far from the 96%+ that modern neural networks achieve. What goes wrong?

The curse of dimensionality. A CIFAR-10 image is 32×32×3 = 3,072 dimensions. To densely cover a 3,072-dimensional space, you'd need an astronomically large training set. The number of data points needed grows exponentially with dimension. In high dimensions, all points are approximately the same distance apart — "nearest" becomes meaningless.

The Curse of Dimensionality

To cover 1D with 4 points is easy. To cover 2D you need 4²=16. To cover 3D you need 4³=64. The number of points needed grows exponentially.

Dimensions1

Pixel distance is semantically meaningless. Consider three images: (A) a cat facing left, (B) the same cat shifted one pixel right, (C) a completely different image with similar overall brightness. B is semantically identical to A, but differs at every single pixel. C is semantically different, but might have a smaller pixel distance. L1/L2 on raw pixels measures the wrong thing.

The fundamental problem: Pixel distance ≠ semantic similarity. Two images can be pixel-wise close but semantically unrelated, or pixel-wise far apart but semantically identical. We need learned features — representations that capture meaning, not raw pixel values.

Prediction is slow. For N training images, each prediction requires N distance computations. With 50,000 training images and 10,000 test images, that's 500 million comparisons. This is fundamentally at odds with real-time applications.

Why does pixel-level distance fail for image classification?

Chapter 9: Beyond — From Pixels to Learned Features

k-NN on raw pixels taught us something important: the representation matters more than the classifier. Comparing images pixel-by-pixel is asking the wrong question. What we need is a way to transform images into a feature space where similar objects are close and different objects are far apart.

This insight leads directly to the next topic: linear classifiers. Instead of memorizing data and comparing at test time, a linear classifier learns a template (weight matrix) during training. Prediction is a single matrix multiply — blazing fast. And from linear classifiers, the road leads to neural networks, which learn the features themselves.

The progression: k-NN (no learning, compares raw pixels) → Linear Classifier (learns weights, but uses raw pixels) → Neural Network (learns both the features AND the classifier). Each step removes a limitation of the previous one.
MethodFeaturesTrain TimePredict TimeCIFAR-10 Acc.
k-NN (raw pixels)None (raw)O(1)O(N)~38%
Linear ClassifierRaw or hand-craftedO(epochs)O(1)~40%
Neural NetworkLearnedO(epochs)O(1)96%+

The key lessons from this chapter:

What's next: The linear classifier replaces "memorize and compare" with "learn a template." It's the first step toward deep learning.
What is the key advantage of a linear classifier over k-NN?