Introduction — Szeliski, Chapter 1

Chapter 0: Why Vision?

You open your eyes and instantly understand the world. That chair is three feet away. The coffee mug is behind the laptop. Your friend is smiling. You did all of that in a fraction of a second, without any conscious effort. Now imagine asking a computer to do the same thing from a flat grid of numbers.

That is the problem of computer vision: extracting meaningful information about the physical world from images or video. It sounds simple. Marvin Minsky famously assigned it as a summer project to an undergraduate in 1966. Six decades later, we are still working on it.

Why is it so hard? Because an image is a 2D projection of a 3D world. Depth is lost. Shadows look like edges. A white wall under yellow light looks the same as a yellow wall under white light. The information is deeply ambiguous, and the computer must infer what was lost.

The core challenge: Vision is an inverse problem. Rendering a 3D scene into a 2D image is straightforward (computer graphics). Recovering the 3D scene from a 2D image is massively under-determined — many different scenes can produce the exact same photograph.
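A minimal sketch of this ambiguity, assuming an idealized pinhole camera with focal length f. The function name `project` and the sample coordinates are illustrative choices, not taken from Szeliski's text. Scaling a scene and its distance from the camera by the same factor leaves the image unchanged:

```python
import numpy as np

def project(points_3d, f=1.0):
    """Pinhole projection: (X, Y, Z) -> (f*X/Z, f*Y/Z). Assumes Z > 0."""
    points_3d = np.asarray(points_3d, dtype=float)
    return f * points_3d[:, :2] / points_3d[:, 2:3]

# Scene A: a small triangle close to the camera (hypothetical coordinates).
scene_a = np.array([[0.0, 0.0, 2.0],
                    [1.0, 0.0, 2.0],
                    [0.0, 1.0, 2.0]])

# Scene B: the same triangle, three times larger and three times farther away.
scene_b = 3.0 * scene_a

image_a = project(scene_a)
image_b = project(scene_b)

print(image_a)
print(np.allclose(image_a, image_b))  # True: size and depth cannot be told apart
```

The division by Z is where depth is lost: any point along the same ray through the camera center lands on the same image coordinate.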

Projection Ambiguity

Two very different 3D scenes can project to the same 2D image: both produce the identical silhouette.

Why is computer vision fundamentally difficult?

Cameras are too low resolution
A 2D image is an ambiguous projection of the 3D world; information is lost
Computers cannot process images fast enough

Chapter 1: The Inverse Problem

Computer graphics takes a 3D scene description — geometry, materials, lights, camera — and produces a 2D image. This is the forward problem, and it is well-defined: given the inputs, there is exactly one correct output image.

Computer vision is the reverse: given the 2D image, recover the 3D scene. This is the inverse problem, and it is fundamentally ill-posed. A single pixel value could result from infinitely many combinations of surface color, lighting angle, and surface orientation.
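To make the ambiguity concrete, here is a small sketch that assumes a Lambertian (perfectly diffuse) surface, where the observed intensity is albedo × light intensity × cos(angle between surface normal and light). The Lambertian model, the function name `lambertian_pixel`, and the numeric values are assumptions chosen for illustration; the text above does not commit to a specific reflectance model.

```python
import math

def lambertian_pixel(albedo, light_intensity, angle_deg):
    """Observed intensity of a diffuse surface.

    angle_deg is the angle between the surface normal and the light direction.
    """
    cos_theta = max(0.0, math.cos(math.radians(angle_deg)))
    return albedo * light_intensity * cos_theta

# Three different scene configurations (illustrative values).
bright_surface_tilted = lambertian_pixel(albedo=0.8, light_intensity=1.0, angle_deg=60.0)
dark_surface_facing = lambertian_pixel(albedo=0.4, light_intensity=1.0, angle_deg=0.0)
darker_surface_strong_light = lambertian_pixel(albedo=0.2, light_intensity=2.0, angle_deg=0.0)

print(bright_surface_tilted, dark_surface_facing, darker_surface_strong_light)
```

All three configurations yield approximately the same value (about 0.4), so the pixel alone cannot tell them apart. Extra constraints, such as neighboring pixels or a known light source, are needed to separate the factors.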

To make progress, we use constraints: physics tells us how light behaves, statistics tells us what scenes look like, and geometry tells us how 3D projects to 2D. Every algorithm in this book is, at its core, a way of adding enough constraints to make the inverse problem tractable.

Key insight: The forward model (graphics) is our most powerful tool for solving the inverse problem (vision). If you understand how images are formed, you can work backwards to recover what formed them. This is why Chapter 2 of Szeliski's book is entirely about image formation.

3D Scene: geometry + materials + lights
↓ Forward (Graphics)
2D Image: flat grid of pixel values
↑ Inverse (Vision), ill-posed!
Recovered Scene: depth, objects, motion, semantics

What makes the inverse problem (vision) harder than the forward problem (graphics)?

Many different 3D scenes can produce the same 2D image, so the solution is ambiguous
Graphics requires more computation
Images contain too many pixels

Chapter 2: A Brief History

Computer vision has gone through several distinct eras, each with its own dominant paradigm. Understanding this history helps you see why today's methods look the way they do.

1970s — Blocks World. The field began with toy problems: recognizing polyhedra from line drawings. Many researchers expected vision to be solved in a single summer. Edge detection and line labeling algorithms tried to reconstruct 3D shapes from 2D contours.

1980s — Mathematical Foundations. David Marr's influential framework proposed three levels of analysis: computational theory, algorithm, and implementation. Shape-from-X methods (shading, texture, focus) used physics-based models. Regularization and Markov random fields recast many vision problems as optimization.

1990s — Geometry Renaissance. Multi-view geometry matured. Structure from motion, bundle adjustment, and projective invariants enabled 3D reconstruction from photographs. Feature-based methods like SIFT emerged.

2000s — Learning and Recognition. Bag-of-words, HOG descriptors, deformable part models. ImageNet and benchmark datasets drove progress. SVM classifiers dominated object recognition.

2010s — Deep Learning Revolution. AlexNet (2012) shattered ImageNet records. CNNs replaced hand-crafted features. Object detection (R-CNN, YOLO), semantic segmentation (FCN), and generative models (GANs) transformed every subfield.

Where we are now: Modern computer vision combines classical geometric reasoning with deep learning. Self-supervised methods, vision transformers, neural radiance fields, and foundation models are pushing the frontier — but the fundamental inverse problem remains.

Which event in 2012 triggered the deep learning revolution in computer vision?

The invention of SIFT features
AlexNet winning the ImageNet challenge by a huge margin using a CNN
The first digital camera was released

Chapter 3: Four Approaches

Szeliski identifies four broad approaches that have shaped the field. Most modern systems blend several of these together.

1. Scientific

Build mathematical models of image formation. Understand the physics of light, optics, and geometry. Use these models to derive algorithms. Examples: shape from shading, radiometric calibration, multi-view geometry.

2. Statistical

Treat vision as inference under uncertainty. Use probability, Bayesian estimation, and loss functions. Model noise and prior knowledge. Examples: MRFs for segmentation, Kalman filters for tracking.

3. Engineering

Build systems that work in practice. Focus on robust algorithms, efficient implementations, and real-world performance. Care about speed, memory, and failure modes. Examples: RANSAC, image pyramids, real-time tracking.

4. Learning-Based

Let the data define the solution. Train neural networks on large datasets. Learn features, representations, and even entire pipelines end-to-end. Examples: CNNs for classification, GANs for synthesis, NeRFs for 3D.

The modern synthesis: The best systems today combine all four: physics-based image formation models (scientific), probabilistic loss functions (statistical), efficient GPU implementations (engineering), and deep neural networks (learning). No single approach suffices alone.

Which approach treats vision as inference under uncertainty using probability and priors?

Scientific
Statistical
Engineering

Chapter 4: Grand Challenges

What can vision systems actually do? Szeliski organizes the field around a set of core tasks, each representing a fundamental question about the visual world.

Task	Question	Chapter
Image classification	What is in this image?	6
Object detection	Where are the objects?	6
Semantic segmentation	What is each pixel?	6
Optical flow	How are things moving?	9
Stereo matching	How far away is everything?	12
Structure from motion	What is the 3D shape?	11
Image stitching	Can we build a panorama?	8
Neural rendering	Can we synthesize new views?	14

The book's arc: Szeliski's 14 chapters trace a path from how images are formed (Ch 2), to how we process them (Ch 3), fit models (Ch 4), learn features (Ch 5-6), detect and match features (Ch 7), align images (Ch 8-9), enhance photos (Ch 10), and finally reconstruct and render 3D scenes (Ch 11-14).

Which task asks "what is each pixel?" by assigning a semantic label to every pixel in the image?

Object detection
Image classification
Semantic segmentation

Chapter 5: The Pixel Grid

Before any algorithm runs, we need to understand what an image actually is to a computer. It is not a picture — it is a rectangular grid of numbers.

A grayscale image of width W and height H is a 2D array of intensity values, typically integers from 0 (black) to 255 (white). A color image has three such grids — one each for red, green, and blue — stacked into a 3D array of shape H × W × 3.

Each element is called a pixel (picture element). The pixel at position (x, y) stores the light intensity that fell on that sensor cell during exposure. Everything in computer vision starts from this grid of numbers.
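As a rough illustration of the grid described above, the sketch below builds a tiny grayscale image and a tiny color image as NumPy arrays. The sizes and fill values are arbitrary, and the R, G, B channel order is an assumption for this example (OpenCV, for instance, loads images in B, G, R order).

```python
import numpy as np

H, W = 4, 6  # a tiny image: 4 rows (height), 6 columns (width)

# Grayscale: one 8-bit intensity per pixel, 0 = black, 255 = white.
gray = np.zeros((H, W), dtype=np.uint8)
gray[:, W // 2:] = 255  # right half white, left half black

# Color: three stacked grids, assumed here to be red, green, blue.
color = np.zeros((H, W, 3), dtype=np.uint8)
color[..., 0] = 255  # fill the first (red) channel, giving a pure red image

print(gray.shape, gray.size)    # (4, 6) 24
print(color.shape, color.size)  # (4, 6, 3) 72
print(gray)
```

Note that the array shape lists height before width, so a W × H image is stored as H rows of W values.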

Image as a Number Grid

Each cell of the grid is one pixel with an intensity from 0 to 255.

Convention: In most image libraries (OpenCV, PIL), the origin (0,0) is at the top-left corner. Row index increases downward (y-axis), column index increases rightward (x-axis). This is opposite to the standard math convention where y increases upward.
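The convention above has a practical consequence when images are stored as NumPy arrays: the array is indexed as [row, column], which means [y, x] rather than [x, y]. A short sketch, with arbitrary coordinates chosen for illustration:

```python
import numpy as np

H, W = 3, 5
img = np.zeros((H, W), dtype=np.uint8)

x, y = 4, 1      # column 4, row 1 in (x, y) image coordinates
img[y, x] = 255  # NumPy indexes as [row, column], i.e. [y, x]

print(img)

# Recover the location of the bright pixel from the array itself.
rows, cols = np.nonzero(img)
print(int(rows[0]), int(cols[0]))  # 1 4
```

The bright pixel appears in the second row from the top and the last column. Writing `img[x, y]` here would raise an IndexError, because x = 4 exceeds the number of rows; on a square image the same mistake would silently transpose the coordinates instead.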

A color image of size 640×480 has how many total values stored?

640 × 480 = 307,200
640 + 480 = 1,120
640 × 480 × 3 = 921,600 (three channels: R, G, B)

Chapter 6: Color Spaces

Humans perceive color through three types of cone cells sensitive to short (S, blue), medium (M, green), and long (L, red) wavelengths. This is why we represent color with three values. But RGB is just one way to organize those three numbers.

RGB stores red, green, and blue intensities directly, which is roughly what the sensor measures. HSV (Hue, Saturation, Value) separates color identity from brightness, which is useful for detection under varying illumination. YCbCr separates luminance (Y) from chrominance (Cb, Cr). JPEG exploits this split by typically storing the chrominance channels at reduced resolution, because our eyes are less sensitive to color detail than to brightness detail.
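The conversions between these spaces are simple formulas. The sketch below uses Python's standard `colorsys` module for HSV and, as an assumption, the full-range BT.601 weights used by JPEG/JFIF for YCbCr (other standards use slightly different coefficients). The helper names and the example color are illustrative.

```python
import colorsys

def rgb_to_hsv_degrees(r, g, b):
    """8-bit RGB -> (hue in degrees, saturation in [0, 1], value in [0, 1])."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v

def rgb_to_ycbcr(r, g, b):
    """8-bit RGB -> full-range YCbCr, assuming the BT.601 weights used by JFIF."""
    luma = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return luma, cb, cr

r, g, b = 180, 100, 60  # a warm orange-brown, chosen as an example

hue, sat, val = rgb_to_hsv_degrees(r, g, b)
luma, cb, cr = rgb_to_ycbcr(r, g, b)

print(f"HSV:   H={hue:.1f} deg, S={sat:.3f}, V={val:.3f}")
print(f"YCbCr: Y={luma:.1f}, Cb={cb:.1f}, Cr={cr:.1f}")
```

For this color the hue comes out at about 20 degrees and the luminance at about 119. Scaling all three RGB values by the same factor changes V and Y but leaves the hue unchanged, which is why HSV is convenient under varying illumination.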

Color Space Explorer: mixing R = 180, G = 100, B = 60 gives a single color that can equally be expressed as HSV or YCbCr values.

Why does JPEG separate luminance from chrominance before compression?

Human eyes are less sensitive to color detail than brightness detail, so chrominance can be compressed more aggressively
It makes the file smaller by removing one channel
RGB values are too large to store

Chapter 7: Showcase — Vision Pipeline

Let's put it all together with a simplified computer vision pipeline. A synthetic scene is projected into an image, sampled into pixels, and then processed to extract edges and features.

Vision Pipeline: Scene → Projection → Pixels → Edges → Features. Noise added at the pixel stage degrades the edges and features detected downstream.

The fundamental pipeline: Nearly every classical vision system follows this pattern: capture image → preprocess (denoise, normalize) → detect features (edges, corners, blobs) → match/recognize → output (labels, 3D model, stitched panorama). Deep learning collapses many of these steps into a single end-to-end network, but the conceptual stages remain.
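The first three stages of this pattern can be sketched in a few lines of NumPy. Everything here is an illustrative assumption rather than a method from Szeliski's text: the synthetic "capture" is a bright square with Gaussian noise, the preprocessing is a 3×3 box blur, and the feature detector is a thresholded gradient magnitude. The function names and the threshold are hypothetical.

```python
import numpy as np

def capture(size=32, noise_sigma=10.0, seed=0):
    """Synthetic capture: a bright square on a dark background plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    img = np.full((size, size), 50.0)
    img[8:24, 8:24] = 200.0
    return img + rng.normal(0.0, noise_sigma, img.shape)

def denoise(img):
    """Preprocess: 3x3 box blur built from shifted copies (borders replicated)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    acc = np.zeros_like(img)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def detect_edges(img, threshold=35.0):
    """Detect features: mark pixels whose gradient magnitude exceeds a threshold."""
    gy, gx = np.gradient(img)  # derivatives along rows (y) and columns (x)
    return np.hypot(gx, gy) > threshold

image = capture()
smooth = denoise(image)
edges = detect_edges(smooth)

rows, cols = np.nonzero(edges)
print("edge pixels:", int(edges.sum()))
print("edge rows span", int(rows.min()), "to", int(rows.max()))
print("edge cols span", int(cols.min()), "to", int(cols.max()))
```

The detected edge pixels cluster along the border of the square, while the flat interior and background stay empty. Raising `noise_sigma` or skipping `denoise` produces spurious edge pixels, which is one reason classical pipelines smooth before differentiating.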

Chapter 8: Connections

This chapter introduced the landscape of computer vision. Every subsequent chapter dives deep into one piece of this puzzle.

Next Chapter	What You'll Learn
Ch 2: Image Formation	How 3D scenes become 2D images: geometry, optics, cameras
Ch 3: Image Processing	Filtering, convolution, Fourier transforms, pyramids
Ch 5: Deep Learning	Neural networks, CNNs, and the learning revolution
Ch 7: Features	Harris corners, SIFT, edges — the building blocks of matching

Szeliski's advice: "Pick topics that are fun and can be used on your own photographs, and try to push your creative boundaries to come up with surprising results." The best way to learn vision is to implement algorithms on your own images.

What is the most important concept from this introduction chapter?

Computer vision is an inverse problem: recovering 3D structure from 2D images, which requires constraints to make it tractable
Cameras are complicated machines
Deep learning has replaced all classical methods

What Is Computer Vision?