Szeliski, Chapter 10

Computational Photography

Creating new images from old ones: HDR imaging, super-resolution, image matting, texture synthesis, and inpainting.

Prerequisites: Chapter 3 (image processing), Chapter 5 (deep learning) helpful for modern methods.
10 chapters, 6+ simulations, no prior computer vision knowledge assumed

Chapter 0: Why Computational Photography?

Your camera captures what is there. Computational photography creates what could be there. It uses algorithms to overcome hardware limitations and produce images that no single exposure could capture.

Examples you use every day:

The shift: Traditional photography = optics + sensor. Computational photography = optics + sensor + algorithm. The algorithm becomes part of the camera. Google's Night Sight, Apple's Deep Fusion, Samsung's AI processing — every phone photo you take is already computationally enhanced.
What fundamentally distinguishes computational photography from traditional photography?

Chapter 1: Photometric Calibration

Before combining images, you need to understand how your camera converts light into pixel values. For processed images such as JPEGs, this mapping is generally nonlinear.

The radiometric response function g maps scene irradiance E to pixel intensity I:

I = g(E · Δt)

where Δt is exposure time. Recovering g is essential for HDR, because you need to convert pixel values back to physical irradiance before merging.
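A minimal sketch of the forward model and its inverse, assuming a hypothetical camera whose response g is a gamma curve (exponent 2.2) followed by clipping and 8-bit quantization. Real response curves differ per camera and must be calibrated; the names `GAMMA`, `camera_response`, `inverse_response`, and `irradiance_estimate` are illustrative only.

```python
GAMMA = 2.2  # hypothetical response exponent; real cameras must be calibrated

def camera_response(exposure):
    """Forward model I = g(E * dt): linear exposure (1.0 = saturation) -> 8-bit value."""
    x = min(max(exposure, 0.0), 1.0)  # the sensor clips at saturation
    return round(255 * x ** (1 / GAMMA))

def inverse_response(pixel):
    """g^-1: 8-bit value -> exposure E * dt (undoes the curve, not the clipping)."""
    return (pixel / 255) ** GAMMA

def irradiance_estimate(pixel, dt):
    """Convert a pixel back to physical irradiance: E = g^-1(I) / dt."""
    return inverse_response(pixel) / dt

# The same scene point (E = 0.3) seen at two exposure times gives different
# pixel values but nearly the same irradiance estimate after linearization.
for dt in (0.25, 1.0):
    pixel = camera_response(0.3 * dt)
    print(dt, pixel, irradiance_estimate(pixel, dt))
```

The small residual differences come from 8-bit rounding, which is one reason HDR merging averages several estimates rather than trusting one.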

Other calibration targets:

Effect | What it is | How to correct
Vignetting | Brightness falloff toward the image edges (lens geometry) | Divide by the measured falloff pattern (flat-field correction)
Noise | Signal-dependent shot noise (Poisson) plus read noise (Gaussian) | Model it with a noise level function σ² = aI + b, then denoise accordingly
Optical blur | Point spread function (PSF) that varies across the field | Deconvolution with the measured PSF (inverse filtering)
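The noise level function σ² = aI + b can be estimated with a straight-line fit. This sketch assumes you have photographed several uniform patches (for example, a gray chart) at different brightness levels, so each patch yields one (mean, variance) pair; the measurements below are made up and lie exactly on σ² = 0.02·I + 4. Function names are hypothetical.

```python
def patch_stats(samples):
    """Mean and unbiased variance of the pixel values in one uniform (flat) patch."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, var

def fit_noise_level_function(means, variances):
    """Ordinary least-squares fit of variance = a * mean + b."""
    n = len(means)
    m_bar = sum(means) / n
    v_bar = sum(variances) / n
    cov = sum((m - m_bar) * (v - v_bar) for m, v in zip(means, variances))
    var_m = sum((m - m_bar) ** 2 for m in means)
    a = cov / var_m            # slope: shot-noise (Poisson) term, grows with intensity
    b = v_bar - a * m_bar      # intercept: signal-independent read-noise variance
    return a, b

# Hypothetical (mean, variance) measurements from four gray patches.
means = [10.0, 50.0, 100.0, 200.0]
variances = [4.2, 5.0, 6.0, 8.0]
a, b = fit_noise_level_function(means, variances)
print(f"sigma^2 = {a:.3f} * I + {b:.3f}")
```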
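Debevec's method: To recover the response function g, take multiple exposures of the same scene. The same scene point has different pixel values at different exposures. Because its irradiance is constant while the exposure time varies, each point tracked across the exposures yields equations linking its pixel values to the known exposure times. Solving all of these jointly by least squares, with a smoothness term on the curve, recovers the inverse of g at every pixel value (up to a scale factor). This is the foundation of HDR imaging.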
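A compact least-squares sketch in the style of Debevec and Malik, under stated assumptions: the unknowns are the log inverse response h(z) = ln g⁻¹(z) for every pixel value z (Debevec and Malik call this curve g; it is named `h` here to avoid clashing with the forward g above) and the log irradiance of each tracked point. It uses a triangle weight, a second-difference smoothness term weighted by `lam`, and pins h at the middle value to fix the offset. The solver, the synthetic 64-level gamma camera, and all names are hypothetical; this is not the authors' code.

```python
import math

def hat_weight(z, z_max):
    """Triangle weight: 0 at both ends of the pixel range, largest in the middle."""
    return z if z <= z_max / 2 else z_max - z

def solve_linear(m, v):
    """Solve m x = v by Gaussian elimination with partial pivoting (modifies m and v)."""
    n = len(v)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        v[col], v[pivot] = v[pivot], v[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            if f:
                for c in range(col, n):
                    m[r][c] -= f * m[col][c]
                v[r] -= f * v[col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (v[r] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def recover_response(Z, dts, n_levels=256, lam=1.0):
    """Z[i][j]: pixel value of point i in exposure j; dts[j]: exposure times.

    Returns (h, ln_E): h[z] = ln(E * dt) for pixel value z, and ln_E[i] = log
    irradiance of point i, both up to a shared offset fixed by h[mid] = 0.
    """
    z_max, n = n_levels - 1, n_levels + len(Z)   # unknowns: h[0..z_max], then ln_E
    ata = [[0.0] * n for _ in range(n)]
    atb = [0.0] * n

    def add_equation(entries, rhs):
        # Accumulate one equation sum(coef * x[idx]) = rhs into the normal equations.
        for ia, ca in entries:
            atb[ia] += ca * rhs
            for ib, cb in entries:
                ata[ia][ib] += ca * cb

    for i, row in enumerate(Z):
        for z, dt in zip(row, dts):
            w = hat_weight(z, z_max)             # data term: h(Z_ij) - ln E_i = ln dt_j
            add_equation([(z, w), (n_levels + i, -w)], w * math.log(dt))
    add_equation([(n_levels // 2, 1.0)], 0.0)    # pin the arbitrary offset: h[mid] = 0
    for z in range(1, z_max):                    # smoothness: small second differences
        w = lam * hat_weight(z, z_max)
        add_equation([(z - 1, w), (z, -2 * w), (z + 1, w)], 0.0)

    x = solve_linear(ata, atb)
    return x[:n_levels], x[n_levels:]

# Synthetic data from a hypothetical gamma-2.2 camera with 64 levels.
def camera(x, levels=64):
    return round((levels - 1) * min(x, 1.0) ** (1 / 2.2))

E_true = [2.0 ** k for k in range(-6, -1)]      # 5 scene points
dts = [2.0 ** k for k in range(-2, 2)]          # 4 exposures
Z = [[camera(e * dt) for dt in dts] for e in E_true]
h, ln_E = recover_response(Z, dts, n_levels=64)
print(ln_E[-1] - ln_E[0], math.log(E_true[-1] / E_true[0]))
```

Dense normal equations are fine for this toy size; with 256 levels and many points, a sparse or SVD-based least-squares solver is the usual choice.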
Why must we recover the camera's radiometric response function before creating an HDR image?

Chapter 2: High Dynamic Range Imaging

A real scene can have a brightness range of 10,000:1 or more (a sunlit window vs. a shadowed corner). A single 8-bit image captures only about 256:1. HDR imaging recovers the full range by merging multiple exposures.

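The pipeline:

1. Capture several bracketed exposures of the scene with different exposure times Δt.
2. Recover the response function g (Chapter 1).
3. Convert each pixel to an irradiance estimate E = g⁻¹(I) / Δt.
4. Merge the per-exposure estimates with a weighted average.
5. Tone map the result for display (Chapter 3).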

The weighting function: Pixels near 0 (underexposed) are dominated by noise, and pixels near 255 (overexposed) are clipped, so neither gives a reliable irradiance estimate. A good weighting function therefore peaks in the middle of the range and falls off at both extremes. A common choice is a hat (triangle) function centered at 128.
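A minimal merge sketch for one pixel location, assuming a hypothetical gamma-2.2 camera whose inverse response is known; in practice g⁻¹ comes from calibration. Each exposure contributes g⁻¹(I)/Δt, weighted by a triangle function that is zero at 0 and 255. The names `camera_pixel`, `hat_weight`, and `merge_hdr` are illustrative.

```python
GAMMA = 2.2  # hypothetical camera response (gamma curve, 8-bit); calibrate in practice

def camera_pixel(exposure):
    """Hypothetical camera: I = g(E * dt) with a gamma curve, clipping and 8-bit rounding."""
    return round(255 * min(max(exposure, 0.0), 1.0) ** (1 / GAMMA))

def hat_weight(z):
    """Triangle weight: zero at 0 and 255, largest in the middle of the range."""
    return min(z, 255 - z)

def merge_hdr(pixels, dts):
    """Weighted average of the irradiance estimates g^-1(I_j) / dt_j for one pixel."""
    num = den = 0.0
    for z, dt in zip(pixels, dts):
        w = hat_weight(z)
        num += w * (z / 255) ** GAMMA / dt    # g^-1(z) / dt
        den += w
    return num / den if den > 0 else None     # None: every exposure was black or clipped

dts = [1 / 64, 1 / 8, 1.0]                    # a three-shot bracket, 3 stops apart
for E in (0.01, 0.5, 20.0):                   # a 2000:1 range of scene irradiances
    stack = [camera_pixel(E * dt) for dt in dts]
    print(E, stack, merge_hdr(stack, dts))
```

This averages in the linear domain; averaging log irradiances instead, as Debevec and Malik do, is a common alternative.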
Exposure Bracketing

Three exposures capture different parts of the dynamic range. The merge recovers the full range.

Why are multiple exposures needed for HDR?

Chapter 3: Tone Mapping

An HDR radiance map may span a 100,000:1 range, but a typical display reproduces only around 1,000:1. Tone mapping compresses the HDR range into the displayable range while preserving visual detail.

Approach | How it works | Character
Global | Apply the same curve to every pixel (log, gamma, Reinhard) | Fast and simple; may lose local contrast
Local | Adapt the compression to each pixel's neighborhood (Durand's bilateral filtering, exposure fusion) | Preserves local detail; can create halos
Exposure fusion | Blend the original exposures directly (no HDR merge), weighted by quality metrics | No response function needed; practical
Reinhard's operator: L_out = L / (1 + L). This simple formula compresses high luminances while leaving low ones nearly unchanged (for small L, L_out ≈ L), giving a photographic "shoulder" curve. Adding a local adaptation term gives L_out(x,y) = L(x,y) / (1 + L_local(x,y)), where L_local is the average luminance in a neighborhood of (x,y). This helps preserve local contrast even in scenes with extreme dynamic range.
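A sketch comparing the two formulas on a bright, finely textured region. It assumes luminance has already been scaled (Reinhard's operator first scales by a "key" value, omitted here) and uses a fixed box window for L_local; Reinhard et al.'s published local operator uses Gaussian-weighted averages at an adaptively chosen scale, so the box filter is a simplification. Function names are hypothetical.

```python
def reinhard_global(lum):
    """Global operator L_out = L / (1 + L), applied to every pixel of a 2-D luminance list."""
    return [[v / (1.0 + v) for v in row] for row in lum]

def box_mean(lum, radius):
    """Average luminance over a (2*radius+1)^2 window, truncated at the image border."""
    h, w = len(lum), len(lum[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [lum[yy][xx]
                    for yy in range(max(0, y - radius), min(h, y + radius + 1))
                    for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def reinhard_local(lum, radius=1):
    """Local variant L_out = L / (1 + L_local), with L_local a box-filtered average."""
    local = box_mean(lum, radius)
    return [[v / (1.0 + a) for v, a in zip(row, arow)] for row, arow in zip(lum, local)]

# A bright, finely textured region: luminance alternating between 100 and 110.
scene = [[100.0, 110.0, 100.0, 110.0, 100.0]]
glob_row = reinhard_global(scene)[0]
loc_row = reinhard_local(scene)[0]
print("global contrast:", glob_row[1] - glob_row[0])  # texture nearly flattened
print("local contrast: ", loc_row[1] - loc_row[0])    # texture survives
```

Note that the local version can return values slightly above 1 where a pixel is brighter than its neighborhood; a display pipeline would clip or rescale them.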
Tone Mapping Curves

Different curves compress the HDR range differently. Notice how each preserves or sacrifices detail.

Why do local tone mapping operators often produce better results than global ones?

Chapter 4: Super-Resolution

Super-resolution increases image resolution beyond what the sensor captured. Classical methods combine multiple shifted low-resolution images. Modern methods use deep learning to hallucinate plausible high-frequency detail from a single image.

Approach | How it works | Limitation
Multi-image SR | Align and fuse multiple slightly shifted captures | Needs multiple images; limited gain
Example-based SR | Learn LR → HR patch mappings from training data | Limited to trained domains
Deep SR (SRCNN, EDSR) | End-to-end CNN: input LR, output HR | May hallucinate incorrect detail
Diffusion-based SR | Use a generative model conditioned on the LR image | Photorealistic but not always faithful
The hallucination problem: A deep SR network trained on faces might add eyelashes that were not there, or change a facial expression. The network fills in plausible detail, but "plausible" is not "correct." This matters critically for medical imaging and forensics, where every pixel must be faithful to reality. The tension between perceptual quality and faithfulness is a central issue in modern super-resolution.

Deblurring is related: given a blurred image, estimate the blur kernel and recover the sharp original. Blind deconvolution (unknown kernel) is ill-posed, but deep networks (like DeblurGAN) learn to invert common motion and defocus blurs.
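For contrast with blind and learned deblurring, here is a sketch of Richardson-Lucy deconvolution, a classic non-blind method in which the kernel is assumed known (it is not the deep approach named above). It uses a 1-D signal, circular convolution, and a mild made-up kernel to stay short; real blur kernels often have frequency responses near zero, which is what makes deconvolution ill-posed and noise-sensitive. All names are hypothetical.

```python
def convolve_circular(signal, kernel):
    """Circular convolution with an odd-length kernel centered on each sample."""
    n, half = len(signal), len(kernel) // 2
    return [sum(kernel[j] * signal[(i - j + half) % n] for j in range(len(kernel)))
            for i in range(n)]

def richardson_lucy(observed, kernel, iterations=200):
    """Non-blind Richardson-Lucy deconvolution of a non-negative 1-D signal.

    Assumes the kernel is non-negative and sums to 1.
    """
    flipped = kernel[::-1]                                      # adjoint of the blur
    estimate = [sum(observed) / len(observed)] * len(observed)  # flat, positive start
    for _ in range(iterations):
        reblurred = convolve_circular(estimate, kernel)
        ratio = [o / max(r, 1e-12) for o, r in zip(observed, reblurred)]
        correction = convolve_circular(ratio, flipped)
        estimate = [e * c for e, c in zip(estimate, correction)]  # multiplicative update
    return estimate

sharp = [1.0] * 11
sharp[5] = 5.0                     # a single bright spike
kernel = [0.2, 0.6, 0.2]           # hypothetical known, normalized blur
blurred = convolve_circular(sharp, kernel)
restored = richardson_lucy(blurred, kernel)
print([round(v, 2) for v in blurred])
print([round(v, 2) for v in restored])
```

The multiplicative update keeps the estimate non-negative and preserves total brightness, which is why the method suits photon-counting (Poisson) data.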

What is the fundamental risk of using deep learning for super-resolution?

Chapter 5: Image Matting

Image matting extracts a foreground object with a soft alpha matte — not just a binary mask, but a per-pixel opacity value between 0 and 1. This captures semi-transparent elements like hair, fur, smoke, and glass.

The compositing equation:

I = αF + (1 − α)B

Given the observed image I, we want to recover the foreground color F, the background color B, and α at each pixel. That is 7 unknowns per pixel (3 for F, 3 for B, 1 for α) from just 3 observations (the RGB values of I), so the problem is severely underdetermined.

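Blue/green screen matting simplifies the problem enormously. With a known constant background color B, only F and α remain: 4 unknowns from 3 equations per pixel. That is still underdetermined in general, but simple assumptions about the foreground (for example, that it contains little of the screen color) or shooting over two different known backgrounds make it solvable. This is why movie studios use green screens: the known background eliminates most unknowns.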
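To see why a second known background resolves the ambiguity, here is a sketch of triangulation matting (Smith and Blinn's technique, named plainly since the text above does not describe it). Photographing the same foreground over two different known backgrounds gives 6 equations for the 4 unknowns; subtracting the two compositing equations cancels F and leaves α. The colors and function names are hypothetical.

```python
def composite(alpha, fg, bg):
    """Compositing equation I = alpha * F + (1 - alpha) * B for one RGB pixel."""
    return tuple(alpha * f + (1 - alpha) * b for f, b in zip(fg, bg))

def triangulation_matte(i1, i2, b1, b2):
    """Recover (alpha, F) for one pixel photographed over two known backgrounds.

    Subtracting the two compositing equations cancels F:
    I1 - I2 = (1 - alpha) * (B1 - B2), solved for alpha by least squares.
    Requires b1 != b2.
    """
    di = [x - y for x, y in zip(i1, i2)]
    db = [x - y for x, y in zip(b1, b2)]
    alpha = 1 - sum(p * q for p, q in zip(di, db)) / sum(q * q for q in db)
    alpha = min(max(alpha, 0.0), 1.0)
    if alpha == 0:
        return 0.0, (0.0, 0.0, 0.0)          # pure background: F is undefined
    fg = tuple((i - (1 - alpha) * b) / alpha for i, b in zip(i1, b1))
    return alpha, fg

green, blue = (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
true_fg, true_alpha = (0.8, 0.4, 0.2), 0.6  # e.g. a semi-transparent strand of hair
i_green = composite(true_alpha, true_fg, green)
i_blue = composite(true_alpha, true_fg, blue)
alpha, fg = triangulation_matte(i_green, i_blue, green, blue)
print(alpha, fg)
```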

For natural image matting (no controlled background), the user provides a trimap: definite foreground, definite background, and an unknown transition region. The algorithm estimates α only in the unknown region. Modern deep learning methods predict alpha mattes from minimal user input (ViTMatte, for example, takes a trimap), and some trimap-free methods work fully automatically.

Why is natural image matting an underdetermined problem?

Chapter 6: Texture Synthesis

Texture synthesis generates arbitrarily large images from a small example patch, maintaining the visual appearance and statistical properties of the original.

Classic approaches:

The Gram matrix captures texture statistics. For a CNN feature map with activation F_ik of channel i at spatial position k, the Gram matrix G_ij = Σ_k F_ik F_jk measures which features co-activate. Because the sum runs over all positions, G discards where features occur, so two images with similar Gram matrices "feel" like the same texture even if their spatial layout is completely different. This insight powers neural style transfer: match one image's Gram matrices while keeping another's content.
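A small sketch of G_ij = Σ_k F_ik F_jk on a toy two-channel feature map, showing that shuffling spatial positions (the same permutation for every channel) leaves the Gram matrix unchanged. The feature values are made up; style-transfer implementations usually also divide by the number of positions, which is omitted here.

```python
def gram_matrix(features):
    """G[i][j] = sum over positions k of F[i][k] * F[j][k].

    features[i] is channel i of a feature map, flattened over spatial positions k.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]

# Two channels over three positions, then the same activations with positions shuffled.
original = [[1.0, 0.0, 2.0],
            [0.0, 3.0, 1.0]]
shuffled = [[2.0, 1.0, 0.0],
            [1.0, 0.0, 3.0]]   # same permutation of positions applied to both channels
print(gram_matrix(original))   # [[5.0, 2.0], [2.0, 10.0]]
print(gram_matrix(shuffled))   # identical: the spatial layout has been summed away
```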
Patch-Based Texture Growth

Watch a small texture patch grow into a larger region by copying and blending patches.

What does the Gram matrix of CNN features capture about an image?

Chapter 7: Inpainting

Inpainting fills in missing or removed regions of an image with plausible content. Think of it as the reverse of matting: instead of extracting an object, you remove it and fill the hole.

Evolution of inpainting:

From filling to creating: Early inpainting strictly copied existing content. Deep methods can imagine what should be there. Remove a person from a photo, and the network generates the occluded background — a wall, a tree, a horizon — that was never visible. This blurs the line between restoration and generation, raising questions about image authenticity.
What capability do deep inpainting methods have that classical patch-based methods lack?

Chapter 8: Showcase — HDR Merge Simulator

Simulate merging three exposures into an HDR image. Adjust the exposure slider to see how each capture covers a different brightness range.

HDR Exposure Merge

Each bar represents a scene region's brightness. Different exposures capture different ranges.

Dynamic range in stops: Each "stop" doubles the light, so a scene with 10 stops of dynamic range has a 2^10 = 1024:1 brightness ratio. A typical camera sensor captures 8-10 stops in one exposure, while bright sunlit scenes easily exceed 15 stops, too many to capture in a single shot.
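The stop arithmetic is just powers of two, which makes it easy to relate the ratios quoted in this chapter. The `brackets_needed` helper is a hypothetical rule of thumb, not a standard formula: it assumes k shots spaced s stops apart cover the camera's range plus (k − 1)·s extra stops, ignoring the noisy ends of each shot, so treat it as a lower bound.

```python
import math

def stops_to_ratio(stops):
    """Each stop doubles the light, so n stops is a 2**n : 1 brightness ratio."""
    return 2.0 ** stops

def ratio_to_stops(ratio):
    """Inverse conversion: a brightness ratio expressed in stops."""
    return math.log2(ratio)

def brackets_needed(scene_stops, camera_stops, spacing_stops):
    """Rough lower bound on bracketed shots (hypothetical rule of thumb)."""
    extra = max(0.0, scene_stops - camera_stops)
    return 1 + math.ceil(extra / spacing_stops)

print(stops_to_ratio(10))                 # 1024.0
print(round(ratio_to_stops(10_000), 1))   # 13.3, the sunlit-window example
print(round(ratio_to_stops(256), 1))      # 8.0, a single 8-bit image
print(brackets_needed(15, 8, 2))          # 5
```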

Chapter 9: Connections

Concept | Used in
HDR / tone mapping | Ch 2 (photometric image formation), photography, display technology
Super-resolution | Ch 5 (deep learning), satellite imaging, medical imaging
Image matting | Ch 6 (segmentation), Ch 14 (IBR), film VFX
Texture synthesis | Ch 13 (texture maps), Ch 14 (rendering), game development
Inpainting | Ch 6 (semantic understanding), photo editing, restoration
Neural style transfer | Ch 5 (CNNs), Ch 14 (neural rendering), artistic tools
Szeliski's perspective: "Computational photography is where vision meets creation. The same algorithms that understand images can also synthesize them. The line between analysis and synthesis is disappearing — modern generative models can both understand what's in an image and create entirely new visual content."
Which computational photography technique is essential for extracting actors from green screen footage?