From recognizing specific objects to understanding entire scenes: classification, detection, segmentation, and vision+language.
You glance at a photo and instantly know: that is a golden retriever sitting on a red couch in a living room. You identified the object class (dog), the specific breed (golden retriever), located it in space (center of frame, on the couch), and understood the scene context (indoors, living room). All in a fraction of a second.
Teaching computers to do this is the central challenge of visual recognition. It spans a hierarchy of tasks, from simple to complex:
| Task | Question Answered | Output |
|---|---|---|
| Instance recognition | Is this the exact same object? | Match / no match |
| Image classification | What category is in this image? | Class label |
| Object detection | Where are the objects? | Bounding boxes + labels |
| Semantic segmentation | What class is each pixel? | Per-pixel labels |
| Instance segmentation | Which pixels belong to which object? | Per-pixel instance masks |
Instance recognition asks: "Is this the exact same object I have seen before?" Not "is this a dog?" but "is this my dog?" This matters for tasks like finding a specific product on a shelf, matching a building from different angles, or verifying an identity.
The classical pipeline has three steps:
Modern approaches use deep learned descriptors. A CNN trained with a contrastive or triplet loss maps each image to a compact vector, so that two images of the same object produce nearby vectors despite changes in viewpoint or lighting. The entire image becomes a single descriptor, and recognition reduces to a nearest-neighbor search over vectors, with no separate local-feature extraction or matching stage.
The triplet loss pulls the anchor a toward a positive match p and pushes it away from a negative n, until the negative is farther from the anchor than the positive by at least a margin α.
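A common way to write this loss is L = max(d(a, p) − d(a, n) + α, 0), where d is a distance between embeddings. The sketch below is a minimal NumPy version under the assumption that d is the squared Euclidean distance; the function name, the toy 2-D embeddings, and the margin of 0.2 are illustrative choices, not values from this chapter.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances (assumed metric)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor-to-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor-to-negative distance
    # Zero once the negative is farther than the positive by at least `margin`.
    return np.maximum(d_pos - d_neg + margin, 0.0)

# Toy embeddings (hypothetical): p is near the anchor, n is far from it.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([0.0, 1.0])

loss_easy = triplet_loss(a, p, n)  # constraint already satisfied
loss_hard = triplet_loss(a, n, p)  # roles swapped, so the constraint is violated
```

When the triplet already satisfies the margin, the loss is zero and contributes no gradient, which is why practical training schemes usually pay attention to which triplets they sample.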
Image classification answers a simpler question: "What category does this image belong to?" Given a photo, output a label like "cat", "car", or "mountain." This is the task that launched the deep learning revolution.
The milestone moment: ImageNet 2012. AlexNet, a deep CNN, cut the top-5 error rate from about 26% to about 16%, eliminating more than a third of the errors in a single year. The strongest models in the years that followed were progressively deeper:
| Year | Model | Top-5 Error | Key Idea |
|---|---|---|---|
| 2012 | AlexNet | 16.4% | Deep CNN + GPU training |
| 2014 | VGGNet | 7.3% | Very deep, small filters |
| 2014 | GoogLeNet | 6.7% | Inception modules |
| 2015 | ResNet | 3.6% | Skip connections (152 layers) |
| 2020 | ViT | ~1.5% | Vision transformer (patches + attention) |
Face recognition is a closely related problem. Modern systems like ArcFace learn embeddings where faces of the same person cluster together. They reach very high accuracy on verification benchmarks, in some cases matching or exceeding human performance, but raise important questions about privacy, bias, and consent.
Features learned for one task transfer to others. Early layers are generic; deep layers specialize.
Classification tells you what is in the image. Object detection tells you where: it outputs a bounding box (rectangle) and a class label for every object instance.
The evolution of detectors:
| Era | Method | Key Idea |
|---|---|---|
| 2001 | Viola-Jones | Haar features + cascaded classifiers. Real-time face detection. |
| 2005 | HOG + SVM | Histogram of gradients + linear SVM. Pedestrian detection. |
| 2014 | R-CNN | Region proposals + CNN features. Two-stage pipeline. |
| 2015 | Faster R-CNN | Learned region proposals (RPN). End-to-end trainable. |
| 2016 | YOLO / SSD | Single-shot: predict boxes and classes in one pass. Real-time. |
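Evaluation: Detection accuracy is measured by mean Average Precision (mAP), the average precision of each class averaged over all classes. Under the common PASCAL VOC convention, a prediction counts as correct if it has the right class and its Intersection over Union (IoU) with a ground-truth box is at least 0.5.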
Intersection over Union measures how well two boxes overlap. An IoU of 1.0 means perfect alignment; 0.0 means no overlap at all.
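As a minimal sketch of the IoU computation, assuming axis-aligned boxes stored as `(x1, y1, x2, y2)` corner coordinates (the function name and box format are illustrative choices):

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 10, 10)
shifted = (5, 0, 15, 10)  # same size, shifted right by half its width
iou_shifted = box_iou(ground_truth, shifted)
```

Here the two boxes share half of their area, yet the IoU is only 1/3, because the union grows as the intersection shrinks. An IoU of 0.5 is therefore a stricter requirement than "half the box overlaps".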
Semantic segmentation assigns a class label to every pixel in the image. Instead of a single label ("street scene") or bounding boxes ("car at position X"), you get a dense map: this pixel is road, this pixel is car, this pixel is sky.
The breakthrough architecture: Fully Convolutional Networks (FCN). Take a classification CNN (like VGGNet), replace the final fully connected layers with convolutional layers, and upsample back to full resolution. Now the network can accept any input size and output a per-pixel prediction.
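One way to see why this conversion works: a fully connected layer applied to a single feature vector performs the same computation as a 1×1 convolution applied at every spatial location. The NumPy sketch below illustrates that equivalence with made-up sizes and random weights (all names are hypothetical). It leaves out the upsampling step, and in a real VGG-based FCN the first converted layer needs a larger kernel than 1×1.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, channels = 3, 8  # illustrative sizes

# Weights of a hypothetical fully connected classification layer.
fc_weight = rng.normal(size=(num_classes, channels))
fc_bias = rng.normal(size=num_classes)

def classify_vector(feature):
    """Fully connected layer on one feature vector of shape (channels,)."""
    return fc_weight @ feature + fc_bias

def classify_dense(feature_map):
    """Same weights used as a 1x1 convolution on a (channels, H, W) map."""
    scores = np.einsum("kc,chw->khw", fc_weight, feature_map)
    return scores + fc_bias[:, None, None]

feature_map = rng.normal(size=(channels, 4, 6))
dense_scores = classify_dense(feature_map)  # shape (num_classes, 4, 6)
label_map = dense_scores.argmax(axis=0)     # per-location class index, shape (4, 6)

# The same function accepts a different spatial size without any change.
other_scores = classify_dense(rng.normal(size=(channels, 7, 5)))
```

Because no weight depends on the spatial size, the same layer produces a score map for any input resolution, which is the property the FCN relies on.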
Key architectures:
In an encoder-decoder with skip connections, the encoder captures semantics and the decoder recovers spatial detail.
Semantic segmentation labels pixels by class, but it cannot distinguish between individual objects of the same class. If two cars overlap, all car pixels get the same label. Instance segmentation solves this: it gives each object its own mask.
Mask R-CNN (2017) extended Faster R-CNN with a mask prediction branch. For each detected object, it predicts not just a bounding box and class, but a pixel-level mask inside the box. The key innovation was RoIAlign, which samples the feature map at exact fractional coordinates using bilinear interpolation, avoiding the misalignment that coordinate rounding introduces in the earlier RoIPool operation.
Pose estimation takes this further by predicting the spatial configuration of a person's body. Instead of just a mask, the network outputs the 2D location of keypoints: shoulders, elbows, wrists, hips, knees, ankles. Systems like OpenPose detect multiple people's poses in real time, enabling applications from sports analytics to sign language recognition.
Images are snapshots. Videos are sequences. Understanding video requires reasoning about time: actions unfold over frames, objects move and interact, scenes change.
Key video understanding tasks:
Architectures for video must process the temporal dimension. Two-stream networks process appearance (RGB frames) and motion (optical flow) separately, then fuse. 3D convolutions (C3D, I3D) extend 2D conv kernels to space+time. Video transformers (TimeSformer, ViViT) apply attention across both spatial and temporal tokens.
Vision does not exist in isolation. Humans describe what they see in words, ask questions about images, and follow instructions that reference visual content. Vision-language models bridge the gap between seeing and saying.
| Task | Input | Output |
|---|---|---|
| Image captioning | Image | "A dog playing fetch in a park" |
| Visual QA | Image + "What color is the car?" | "Red" |
| Visual grounding | Image + "the red car" | Bounding box around the red car |
| Text-to-image | "A cat wearing a hat" | Generated image |
Modern vision-language models (GPT-4V, Gemini, LLaVA) go further. They combine a vision encoder (like ViT) with a large language model, enabling free-form conversation about images. You can ask "What is wrong with this circuit diagram?" and get a detailed technical answer.
CLIP aligns image and text embeddings in a shared space. Matching pairs cluster together.
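The alignment can be sketched as a symmetric contrastive loss over a batch of image-text pairs: each image should score highest against its own caption, and each caption against its own image. The NumPy code below is a simplified illustration, not CLIP's actual implementation. The function names and the toy embeddings are hypothetical, and the temperature is fixed at an illustrative 0.07 here, whereas CLIP treats it as a learned parameter.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy where row i of each input is a matching pair."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Toy batch: four orthogonal embeddings standing in for encoder outputs.
emb = np.eye(4)
loss_aligned = contrastive_loss(emb, emb)           # every pair matches
loss_mismatched = contrastive_loss(emb, emb[::-1])  # captions paired with wrong images
```

The loss is near zero when matching pairs share an embedding and large when they do not, which is the pressure that makes matching pairs cluster in the shared space.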
Let's visualize how a sliding-window detector works. The detector slides across the image at multiple scales, scoring each window. High-scoring windows become detections. Non-maximum suppression (NMS) removes overlapping boxes, keeping only the best one per object.
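The suppression step can be sketched as a greedy loop: keep the highest-scoring box, discard every remaining box that overlaps it too much, and repeat. The NumPy version below assumes boxes in `(x1, y1, x2, y2)` format with positive area; the function name, the 0.5 threshold, and the example boxes are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the boxes to keep."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the kept box by more than the threshold.
        order = rest[iou <= iou_threshold]
    return keep

# Two near-duplicate windows on one object, plus a separate detection.
boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box is suppressed because it overlaps the higher-scoring first box almost entirely, while the distant third box survives. The threshold trades duplicate removal against the risk of merging genuinely adjacent objects.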
Recognition is the crown jewel of computer vision. Nearly every other topic connects to it:
| Concept | Used In |
|---|---|
| CNN backbones (ResNet, ViT) | Ch 5 (architectures), Ch 9 (flow estimation), Ch 12 (depth) |
| Instance recognition / retrieval | Ch 7 (feature matching), Ch 8 (panorama recognition) |
| Object detection | Ch 9 (tracking), Ch 11 (SLAM), autonomous driving |
| Semantic segmentation | Ch 10 (matting), Ch 12 (stereo), Ch 13 (3D reconstruction) |
| Vision-language models | Ch 14 (neural rendering), robotics, embodied AI |
| Pose estimation | Ch 13 (body modeling), Ch 11 (camera pose) |