Machines that see: from pixel grids to 3D understanding. A brief history, four research paradigms, and the grand challenges ahead.
You open your eyes and instantly understand the world. That chair is three feet away. The coffee mug is behind the laptop. Your friend is smiling. You did all of that in a fraction of a second, without any conscious effort. Now imagine asking a computer to do the same thing from a flat grid of numbers.
That is the problem of computer vision: extracting meaningful information about the physical world from images or video. It sounds simple. Marvin Minsky famously assigned it as a summer project to an undergraduate in 1966. Six decades later, we are still working on it.
Why is it so hard? Because an image is a 2D projection of a 3D world. Depth is lost. Shadows look like edges. A white wall under yellow light looks the same as a yellow wall under white light. The information is deeply ambiguous, and the computer must infer what was lost.
Two very different 3D scenes can project to the same 2D image. Click "Switch Scene" to toggle between them. Both produce the identical silhouette.
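To see why this happens, here is a minimal sketch that assumes an ideal pinhole camera at the origin looking down the +Z axis. The camera model, the `project` helper, and the two example squares are illustrative assumptions, not the demo's actual implementation. A square and a copy that is three times larger and three times farther away land on exactly the same image points.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point (X, Y, Z) to 2D image coordinates."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return (focal_length * X / Z, focal_length * Y / Z)

# A small square, 2 units in front of the camera.
near_square = [(-1, -1, 2), (1, -1, 2), (1, 1, 2), (-1, 1, 2)]

# The same square scaled by 3: three times larger AND three times farther away.
far_square = [(3 * X, 3 * Y, 3 * Z) for (X, Y, Z) in near_square]

near_image = [project(p) for p in near_square]
far_image = [project(p) for p in far_square]

print(near_image)
print(far_image)
print("identical projections:", near_image == far_image)
```

Every point along the same ray through the camera center maps to the same image location, so a single image cannot distinguish the two scenes without additional constraints.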
Computer graphics takes a 3D scene description — geometry, materials, lights, camera — and produces a 2D image. This is the forward problem, and it is well-defined: given the inputs, there is exactly one correct output image.
Computer vision is the reverse: given the 2D image, recover the 3D scene. This is the inverse problem, and it is fundamentally ill-posed. A single pixel value could result from infinitely many combinations of surface color, lighting angle, and surface orientation.
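A small numeric sketch makes this ambiguity concrete. It assumes a Lambertian (perfectly matte) surface lit by a single distant light, so brightness is the albedo times the cosine of the angle between the surface normal and the light direction. That shading model, the `shade` helper, and the chosen values are assumptions made for illustration.

```python
import math

def shade(albedo, normal, light):
    """Lambertian brightness: albedo * max(0, n . l) for unit vectors n and l."""
    n_dot_l = sum(n * l for n, l in zip(normal, light))
    return albedo * max(0.0, n_dot_l)

def unit_vector_tilted(degrees):
    """Unit vector in the y-z plane, tilted `degrees` away from the +z axis."""
    t = math.radians(degrees)
    return (0.0, math.sin(t), math.cos(t))

facing_camera = (0.0, 0.0, 1.0)

# A bright surface lit from a steep angle ...
bright_oblique = shade(0.8, facing_camera, unit_vector_tilted(60))
# ... a darker surface lit head-on ...
dark_frontal = shade(0.4, facing_camera, unit_vector_tilted(0))
# ... and a bright surface tilted away from a frontal light.
bright_tilted = shade(0.8, unit_vector_tilted(60), facing_camera)

print(bright_oblique, dark_frontal, bright_tilted)
```

All three configurations produce essentially the same pixel value (up to floating-point rounding), even though the surface color, lighting angle, and surface orientation differ.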
To make progress, we use constraints: physics tells us how light behaves, statistics tells us what scenes look like, and geometry tells us how 3D projects to 2D. Every algorithm in this book is, at its core, a way of adding enough constraints to make the inverse problem tractable.
Computer vision has gone through several distinct eras, each with its own dominant paradigm. Understanding this history helps you see why today's methods look the way they do.
1970s — Blocks World. The field began with toy problems: recognizing polyhedra from line drawings. Many researchers expected vision to be solved within a summer. Edge detection and line labeling algorithms tried to reconstruct 3D shapes from 2D contours.
1980s — Mathematical Foundations. David Marr's influential framework proposed three levels of analysis: computational theory, algorithm, and implementation. Shape-from-X methods (shading, texture, focus) used physics-based models. Regularization and Markov Random Fields gave optimization-based approaches.
1990s — Geometry Renaissance. Multi-view geometry matured. Structure from motion, bundle adjustment, and projective invariants enabled 3D reconstruction from photographs. Feature-based methods like SIFT emerged.
2000s — Learning and Recognition. Bag-of-words, HOG descriptors, deformable part models. ImageNet and benchmark datasets drove progress. SVM classifiers dominated object recognition.
2010s — Deep Learning Revolution. AlexNet (2012) shattered ImageNet records. CNNs replaced hand-crafted features. Object detection (R-CNN, YOLO), semantic segmentation (FCN), and generative models (GANs) transformed every subfield.
Click each era to see its defining contributions.
Szeliski identifies four broad approaches that have shaped the field. Most modern systems blend several of these together.
Scientific. Build mathematical models of image formation. Understand the physics of light, optics, and geometry. Use these models to derive algorithms. Examples: shape from shading, radiometric calibration, multi-view geometry.
Statistical. Treat vision as inference under uncertainty. Use probability, Bayesian estimation, and loss functions. Model noise and prior knowledge. Examples: MRFs for segmentation, Kalman filters for tracking.
Engineering. Build systems that work in practice. Focus on robust algorithms, efficient implementations, and real-world performance. Care about speed, memory, and failure modes. Examples: RANSAC, image pyramids, real-time tracking.
Data-driven. Let the data define the solution. Train neural networks on large datasets. Learn features, representations, and even entire pipelines end-to-end. Examples: CNNs for classification, GANs for synthesis, NeRFs for 3D.
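As one concrete instance of these approaches, RANSAC (listed above as an example of building systems that work in practice) can be sketched in a few lines. This is a simplified version for fitting a 2D line, not a production implementation. The function names, the vertical-residual inlier test, the tolerance, and the toy data are all assumptions chosen for illustration.

```python
import random

def fit_line(p, q):
    """Line y = slope * x + intercept through two points, or None if vertical."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Repeatedly fit a line to 2 random points; keep the one with most inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue  # degenerate sample, try again
        slope, intercept = model
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers

# Toy data: 20 points on y = 2x + 1, plus 5 gross outliers.
inlier_points = [(x, 2 * x + 1) for x in range(20)]
outlier_points = [(3, 40), (7, -5), (12, 0), (15, 60), (18, 2)]
points = inlier_points + outlier_points

model, inliers = ransac_line(points)
print("estimated (slope, intercept):", model)
print("inliers found:", len(inliers), "of", len(points))
```

A least-squares fit over all 25 points would be pulled toward the outliers. Scoring candidate models by inlier count lets the fit ignore them, which is the failure-mode awareness the engineering approach emphasizes.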
What can vision systems actually do? Szeliski organizes the field around a set of core tasks, each representing a fundamental question about the visual world.
| Task | Question | Chapter |
|---|---|---|
| Image classification | What is in this image? | 6 |
| Object detection | Where are the objects? | 6 |
| Semantic segmentation | What is each pixel? | 6 |
| Optical flow | How are things moving? | 9 |
| Stereo matching | How far away is everything? | 12 |
| Structure from motion | What is the 3D shape? | 11 |
| Image stitching | Can we build a panorama? | 8 |
| Neural rendering | Can we synthesize new views? | 14 |
Before any algorithm runs, we need to understand what an image actually is to a computer. It is not a picture — it is a rectangular grid of numbers.
A grayscale image of width W and height H is a 2D array of intensity values, typically integers from 0 (black) to 255 (white). A color image has three such grids — one each for red, green, and blue — stacked into a 3D array of shape H × W × 3.
Each element is called a pixel (picture element). The pixel at position (x, y) stores the light intensity that fell on that sensor cell during exposure. Everything in computer vision starts from this grid of numbers.
Hover over the grid to see pixel values. Each cell is one pixel with an intensity from 0 to 255.
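The same idea can be sketched in code using plain nested lists. The row-major layout, where a pixel at (x, y) is read as `image[y][x]`, is a common convention assumed here rather than something the text specifies. The helper names and sample values are hypothetical.

```python
# A tiny grayscale image: H = 3 rows, W = 4 columns, values in 0..255.
gray = [
    [0,  64, 128, 255],
    [32, 96, 160, 224],
    [16, 80, 144, 208],
]

def shape(image):
    """Return (H, W) for a grayscale image or (H, W, C) for a color image."""
    height, width = len(image), len(image[0])
    first = image[0][0]
    return (height, width, len(first)) if isinstance(first, tuple) else (height, width)

def pixel_at(image, x, y):
    """Read the pixel at column x, row y (rows are stored first)."""
    return image[y][x]

# A color image stores an (R, G, B) triple per pixel, giving shape H x W x 3.
# Here each gray value is simply copied into all three channels.
color = [[(v, v, v) for v in row] for row in gray]

print(shape(gray))                 # grayscale: (H, W)
print(shape(color))                # color: (H, W, 3)
print(pixel_at(gray, x=3, y=0))    # top-right pixel
print(pixel_at(color, x=0, y=2))   # bottom-left pixel as an RGB triple
```

In practice an array library would hold these grids, but the indexing logic is the same: the row (y) comes first, then the column (x), then the channel.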
Humans perceive color through three types of cone cells sensitive to short (S, blue), medium (M, green), and long (L, red) wavelengths. This is why we represent color with three values. But RGB is just one way to organize those three numbers.
RGB stores red, green, and blue intensities directly, which is close to what the sensor measures. HSV (Hue, Saturation, Value) separates color identity from brightness, which is useful for detection under varying illumination. YCbCr separates luminance (Y) from chrominance (Cb, Cr). JPEG exploits this separation by typically storing the chroma channels at lower resolution, because our eyes are less sensitive to color detail than to brightness detail.
Drag the RGB sliders to mix a color. Watch the corresponding HSV and YCbCr values update in real time.
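The conversions described above can be reproduced offline with a short sketch. It uses Python's standard `colorsys` module for HSV. For YCbCr it assumes the full-range BT.601 coefficients used by JFIF/JPEG files; other standards use different constants, and the helper names are hypothetical.

```python
import colorsys

def rgb_to_hsv_degrees(rgb):
    """Convert 8-bit RGB to (hue in degrees, saturation 0..1, value 0..1)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (h * 360.0, s, v)

def rgb_to_ycbcr(rgb):
    """Convert 8-bit RGB to 8-bit YCbCr (assumed: full-range BT.601, as in JFIF)."""
    r, g, b = rgb
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # Round to integers and clamp to the valid 8-bit range.
    return tuple(min(255, max(0, round(c))) for c in (y, cb, cr))

orange = (200, 100, 50)
dim_orange = (100, 50, 25)  # same color at half the brightness

for name, rgb in [("orange", orange), ("dim orange", dim_orange)]:
    print(name, "HSV:", rgb_to_hsv_degrees(rgb), "YCbCr:", rgb_to_ycbcr(rgb))
```

Halving the brightness changes V but leaves the hue and saturation essentially unchanged, which is why HSV is convenient when illumination varies.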
Let's put it all together. Below is an interactive simulation of a simplified computer vision pipeline. You'll load a synthetic scene, watch pixels form, apply basic operations, and see the extracted features.
Step through the pipeline: Scene → Projection → Pixels → Edges → Features. Use the buttons to advance each stage and the noise slider to see how noise affects detection.
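The last stages of such a pipeline (Pixels → Edges → Features) can be approximated in a few lines. This is a rough sketch, not the simulation's actual code: it assumes a synthetic step-edge scene, Gaussian pixel noise, a central-difference horizontal gradient, and a fixed threshold. All function names and parameter values are hypothetical.

```python
import random

def make_step_image(height=8, width=8, low=50, high=200):
    """Synthetic scene: dark left half, bright right half."""
    return [[low if x < width // 2 else high for x in range(width)]
            for _ in range(height)]

def add_noise(image, sigma, seed=0):
    """Add Gaussian noise to every pixel, then round and clamp to 0..255."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(v + rng.gauss(0.0, sigma)))) for v in row]
            for row in image]

def horizontal_gradient(image):
    """Central difference along x; border columns are left at 0."""
    height, width = len(image), len(image[0])
    grad = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(1, width - 1):
            grad[y][x] = (image[y][x + 1] - image[y][x - 1]) / 2.0
    return grad

def detect_edges(image, threshold=40.0):
    """Mark pixels whose horizontal gradient magnitude exceeds the threshold."""
    return [[abs(g) > threshold for g in row] for row in horizontal_gradient(image)]

def count_edge_pixels(edges):
    """A crude 'feature': how many pixels were flagged as edges."""
    return sum(flag for row in edges for flag in row)

clean = make_step_image()
noisy = add_noise(clean, sigma=30.0)

print("edge pixels, clean image:", count_edge_pixels(detect_edges(clean)))
print("edge pixels, noisy image:", count_edge_pixels(detect_edges(noisy)))
```

On the clean image the detector fires only along the boundary between the two halves. With noise added, the count may change as spurious responses appear or true edge pixels fall below the threshold, which is the effect the noise slider is meant to show.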
This chapter introduced the landscape of computer vision. Every subsequent chapter dives deep into one piece of this puzzle.
| Chapter | What You'll Learn |
|---|---|
| Ch 2: Image Formation | How 3D scenes become 2D images: geometry, optics, cameras |
| Ch 3: Image Processing | Filtering, convolution, Fourier transforms, pyramids |
| Ch 5: Deep Learning | Neural networks, CNNs, and the learning revolution |
| Ch 7: Features | Harris corners, SIFT, edges — the building blocks of matching |