Szeliski, Chapter 14

Image-Based Rendering

Synthesizing novel views from captured images: view interpolation, layered depth images, light fields, and neural radiance fields (NeRFs).

Prerequisites: Chapter 11 (SfM), Chapter 12 (depth estimation), Chapter 13 (3D reconstruction).
10 chapters · 5+ simulations · 0 assumed CV knowledge

Chapter 0: Why Image-Based Rendering?

You have 50 photos of a cathedral taken from different angles. You want to create a smooth, photorealistic fly-through as if a virtual camera glided between the viewpoints. Traditional computer graphics would require a perfect 3D model with accurate materials and lighting. But what if you could just blend between the images themselves?

That is the promise of image-based rendering (IBR): synthesize novel views by combining existing photographs, using 3D geometry as a guide but leaning on the captured pixels for photorealism.

Why it matters: IBR powers Google Street View (smooth transitions between panoramas), Photo Tourism / Photosynth (exploring landmarks from tourist photos), movie VFX (bullet-time, virtual camera moves), real estate walkthroughs, and the newest frontier — neural rendering (NeRFs, 3D Gaussian Splatting) for VR/AR content.
The IBR Spectrum

IBR methods trade off between how much 3D geometry they need and how many images they require.

What is the core idea of image-based rendering?

Chapter 1: View Interpolation

The seminal IBR technique (Chen and Williams, 1993). You have two reference images with their depth maps and camera poses. To render a novel view "in between," warp both images to the virtual camera using the depth maps, then blend them.

1. Forward warp: Project each pixel from the reference image to the novel view using its depth.
2. Z-buffer resolve: When multiple pixels land on the same location, keep the closest one.
3. Merge two views: Where one view has a hole (disocclusion), use the other view's data.
4. Fill remaining holes: Inpaint small cracks from forward warping with background color.
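The four steps can be sketched on a single scanline. This is a minimal illustration rather than the Chen and Williams implementation: it assumes a purely sideways camera move, so each pixel shifts by a disparity of baseline_shift / depth, and the names forward_warp, merge_views, and fill_holes are hypothetical.

```python
def forward_warp(colors, depths, baseline_shift):
    """Forward-warp one scanline; returns (warped colors, warped depths).

    Assumption: the camera moves sideways, so a pixel shifts by
    round(baseline_shift / depth). None marks a hole.
    """
    width = len(colors)
    out_color = [None] * width
    out_depth = [float("inf")] * width
    for x in range(width):
        target = x + round(baseline_shift / depths[x])
        # Z-buffer resolve: only the closest surface survives at each target.
        if 0 <= target < width and depths[x] < out_depth[target]:
            out_color[target] = colors[x]
            out_depth[target] = depths[x]
    return out_color, out_depth


def merge_views(warped_a, warped_b):
    """Prefer view A; fall back to view B where A has a hole."""
    return [a if a is not None else b for a, b in zip(warped_a, warped_b)]


def fill_holes(colors, background):
    """Fill any holes that remain with a background color."""
    return [background if c is None else c for c in colors]


colors = ["bg", "bg", "fg", "bg", "bg"]
depths = [10.0, 10.0, 1.0, 10.0, 10.0]
warped, _ = forward_warp(colors, depths, baseline_shift=2.0)
print(warped)  # ['bg', 'bg', None, 'bg', 'fg']
```

The near pixel moves two positions while the far pixels stay put, which opens a hole where the foreground used to be. That hole is the disocclusion that the second view, or inpainting, has to fill.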
View-dependent texture maps: Instead of associating a separate depth map with each image, Debevec's Facade system (1996) builds one coarse 3D model and paints it with different source images depending on the virtual camera's position. The weighting is inversely proportional to the angle between the virtual view and each source view. Even with crude geometry, the blended textures create a strong illusion of detailed 3D — because the parallax between views supplies the missing geometric detail.
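As a rough sketch of the angle-based weighting described above, the snippet below gives each source view a weight proportional to the inverse of its angle to the virtual view, then normalizes. The function names and the eps term (which avoids division by zero when a source view coincides with the virtual view) are assumptions for illustration, not details of the Facade system.

```python
import math


def angle_between(a, b):
    """Angle in radians between two 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def view_weights(virtual_dir, source_dirs, eps=1e-6):
    """Normalized blending weights, inversely proportional to view angle."""
    raw = [1.0 / (angle_between(virtual_dir, d) + eps) for d in source_dirs]
    total = sum(raw)
    return [w / total for w in raw]


# Two source views placed symmetrically around the virtual view share the weight.
print(view_weights((0.0, 0.0, 1.0), [(1.0, 0.0, 1.0), (-1.0, 0.0, 1.0)]))
```

A source view that lines up with the virtual camera takes almost all of the weight, so the rendering reproduces that photograph nearly exactly at its own viewpoint.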
Photo Tourism: Snavely, Seitz, and Szeliski (2006) applied view interpolation to thousands of unordered tourist photos of landmarks. SfM recovers camera poses and a sparse point cloud. Each photo gets a planar proxy. Transitions between photos are stabilized by warping through 3D. This became Microsoft's Photosynth and later influenced Google's Photo Tours.
View Interpolation

Blend between two views. At t=0 you see the left view; at t=1, the right. The in-between is synthesized by depth-based warping and blending.


View morphing: When the scene is non-rigid (a person smiling in one image, frowning in another), pure geometric warping fails. View morphing (Seitz and Dyer, 1996) combines geometric warping (from depth) with image morphing (from correspondences) to create plausible in-between views of deformable objects. First rectify both images, morph in rectified space, then un-rectify. The result: a smooth transition that respects both geometry and appearance.

What causes "holes" in a view-interpolated image?

Chapter 2: Layered Depth Images

A standard depth map stores one depth per pixel. But when you warp to a novel view, the area behind a foreground object is revealed — and there is no data there. A layered depth image (LDI) stores multiple depth-color pairs at each pixel, capturing the hidden layers.

Think of it as a stack of transparent cards with cutouts: the front card has the foreground; peeking through the holes, you see the card behind.
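One way to picture the data structure is a list of (color, depth) samples per pixel. The sketch below is illustrative only: it assumes a sideways camera move with disparity baseline_shift / depth, and render_ldi_scanline is a hypothetical name.

```python
def render_ldi_scanline(ldi, baseline_shift, background=None):
    """Warp a one-row layered depth image to a shifted camera.

    ldi[x] is a list of (color, depth) samples, so a pixel can carry
    hidden layers behind its front surface.
    """
    width = len(ldi)
    out_color = [background] * width
    out_depth = [float("inf")] * width
    for x, samples in enumerate(ldi):
        for color, depth in samples:
            target = x + round(baseline_shift / depth)
            # Keep the closest sample that lands on each output pixel.
            if 0 <= target < width and depth < out_depth[target]:
                out_color[target] = color
                out_depth[target] = depth
    return out_color


ldi = [
    [("bg0", 10.0)],
    [("bg1", 10.0)],
    [("fg", 1.0), ("bg2", 10.0)],  # hidden background stored behind the foreground
    [("bg3", 10.0)],
    [("bg4", 10.0)],
]
print(render_ldi_scanline(ldi, baseline_shift=2.0))
# ['bg0', 'bg1', 'bg2', 'bg3', 'fg']
```

When the foreground sample slides away, the stored bg2 sample is already there to fill the gap, so no hole appears.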

Sprites with depth: An LDI can be organized into layers (sprites), each a planar image with a per-pixel depth offset. The foreground person is one sprite; the background wall is another. Rendering is just alpha-compositing the layers in depth order from the novel view. This is far cheaper than re-running stereo for every virtual camera position.
Representation | What It Stores | Trade-off
Single depth map | One (color, depth) per pixel | Simple but holes on warp
LDI | Variable-length list of (color, depth) per pixel | Handles disocclusion, complex data structure
Multi-plane image (MPI) | Fixed set of RGBA planes at discrete depths | GPU-friendly, used in modern neural IBR
Layered sprites | Separated foreground/background RGBA + depth | Compact, editable, fast rendering
Multi-plane images (MPIs): The modern evolution of LDIs. An MPI is a stack of RGBA images at fixed depth planes. A neural network predicts the MPI from a stereo pair (Zhou et al., 2018) or from a small set of input views (Flynn et al., 2019). Rendering a novel view means warping each plane into the target camera (for a sideways camera move, a per-plane disparity shift) and alpha-compositing the planes from back to front. Every step is differentiable and GPU-friendly, which has made MPIs a building block for real-time neural view synthesis.
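A minimal sketch of MPI rendering on one scanline follows. It assumes grayscale planes, a sideways camera move with integer disparity round(baseline_shift / depth) per plane, and the standard "over" operator. The name render_mpi_scanline is hypothetical, and a real renderer would warp each plane with a homography on the GPU.

```python
def render_mpi_scanline(planes, depths, baseline_shift):
    """Composite shifted RGBA-style planes from back to front.

    planes[k][x] is a (gray, alpha) pair; depths[k] is the depth of plane k.
    """
    width = len(planes[0])
    out = [0.0] * width
    # Visit the farthest plane first so nearer planes are composited over it.
    order = sorted(range(len(planes)), key=lambda k: depths[k], reverse=True)
    for k in order:
        shift = round(baseline_shift / depths[k])
        for x in range(width):
            src = x - shift
            if 0 <= src < width:
                gray, alpha = planes[k][src]
                out[x] = alpha * gray + (1.0 - alpha) * out[x]  # "over" operator
    return out


far = [(0.2, 1.0)] * 4                                   # opaque gray wall
near = [(1.0, 0.0), (1.0, 1.0), (1.0, 0.0), (1.0, 0.0)]  # one opaque white pixel
print(render_mpi_scanline([far, near], [10.0, 1.0], baseline_shift=0.0))
# [0.2, 1.0, 0.2, 0.2]
print(render_mpi_scanline([far, near], [10.0, 1.0], baseline_shift=2.0))
# [0.2, 0.2, 0.2, 1.0]
```

The near plane slides two pixels while the far plane stays fixed, and the wall behind the white pixel is revealed without any hole filling.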
What problem does a layered depth image solve compared to a standard depth map?

Chapter 3: Light Fields

Every ray of light in a scene can be parameterized by where it crosses two parallel planes: (u, v) on the camera plane and (s, t) on the focal plane. The complete collection of all such rays is the 4D light field L(u, v, s, t).

If you capture a dense enough light field, you can synthesize any novel view by simply looking up the right rays — no 3D geometry needed at all.

Two-plane parameterization: Levoy and Hanrahan (1996) and Gortler et al. (1996) independently proposed this idea. A camera array captures a grid of views. Each view contributes rays parameterized by (u, v) = camera position and (s, t) = pixel coordinates. To render a novel view, resample the 4D function. With enough cameras, interpolation produces photorealistic results even for complex scenes with reflections and transparency.
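Resampling the 4D function can be sketched as an interpolated table lookup. The example assumes the light field is stored as nested lists L[u][v][s][t] with at least a 2×2 camera grid, interpolates bilinearly over the fractional camera coordinates (u, v), and takes the pixel (s, t) as given. A full renderer would interpolate over all four coordinates, and the function name is hypothetical.

```python
import math


def sample_light_field(L, u, v, s, t):
    """Bilinearly interpolate L[u][v][s][t] over fractional camera coords (u, v)."""
    # Clamp the base index so that u0 + 1 and v0 + 1 stay inside the grid.
    u0 = max(0, min(int(math.floor(u)), len(L) - 2))
    v0 = max(0, min(int(math.floor(v)), len(L[0]) - 2))
    fu, fv = u - u0, v - v0
    top = (1.0 - fu) * L[u0][v0][s][t] + fu * L[u0 + 1][v0][s][t]
    bottom = (1.0 - fu) * L[u0][v0 + 1][s][t] + fu * L[u0 + 1][v0 + 1][s][t]
    return (1.0 - fv) * top + fv * bottom


# A 2x2 camera grid, each camera holding a single-pixel image.
L = [
    [[[0.0]], [[2.0]]],  # cameras (u=0, v=0) and (u=0, v=1)
    [[[1.0]], [[3.0]]],  # cameras (u=1, v=0) and (u=1, v=1)
]
print(sample_light_field(L, 0.5, 0.5, 0, 0))  # 1.5
```

A virtual camera halfway between the four real cameras receives the average of their rays, with no depth or geometry involved.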
Light Field Sampling

A camera array captures rays. Each camera position (u,v) sees each scene point through a different pixel (s,t). Move the virtual camera to see different slices of the 4D light field.

Variant | Key Idea
Light field | Dense camera grid, two-plane parameterization (Levoy & Hanrahan, 1996)
Lumigraph | Same parameterization but uses approximate geometry to improve interpolation (Gortler et al., 1996)
Unstructured Lumigraph | Arbitrary (non-grid) camera positions, geometry-aware blending (Buehler et al., 2001)
Surface light field | Store view-dependent appearance on the surface of a 3D model. Captures specular effects.
Concentric mosaics | Camera on a rotating arm captures a ring of views. 3D parameterization (Shum & He, 1999).
Synthetic refocusing: A light field encodes every ray, including those from slightly different viewpoints. By integrating rays that converge on a specific depth plane, you can synthetically refocus the image after capture. This is exactly what Lytro cameras did — and what computational photography on smartphones now approximates with dual cameras.
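Refocusing by integrating rays is often implemented as shift-and-add: shift each view in proportion to its camera offset, then average. The sketch below assumes a single row of cameras with one scanline each; refocus and its slope parameter (pixels of shift per camera step, which selects the focal depth) are illustrative names.

```python
def refocus(views, slope):
    """Shift-and-add refocusing for one row of cameras.

    views[u][s] is the intensity at pixel s in camera u. Points whose
    disparity per camera step equals `slope` line up and appear sharp.
    """
    n, width = len(views), len(views[0])
    center = (n - 1) / 2.0
    out = []
    for s in range(width):
        total, count = 0.0, 0
        for u, view in enumerate(views):
            src = s + round(slope * (u - center))
            if 0 <= src < width:
                total += view[src]
                count += 1
        out.append(total / count if count else 0.0)
    return out


# A bright point that moves one pixel per camera step.
views = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]
print(refocus(views, slope=1.0))  # [0.0, 0.0, 1.0, 0.0, 0.0]  (in focus)
print(refocus(views, slope=0.0))  # the point is smeared over three pixels
```

Changing slope after capture moves the plane of focus, which is the effect the paragraph above describes.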

Practical challenges: Capturing a dense light field requires hundreds or thousands of cameras (or a moving camera on a gantry). Storage is enormous: a 100×100 camera grid at 1 megapixel per camera is 10 billion rays. Compression exploits the redundancy between nearby views (most of the scene is the same), but the data demands remain a key limitation for consumer applications.

Sparse light fields: Real-world capture is typically sparse (tens of views, not thousands). The Lumigraph (Gortler et al., 1996) uses approximate geometry to improve interpolation between sparse views. The Unstructured Lumigraph (Buehler et al., 2001) handles arbitrary camera positions by blending views weighted by angular proximity, penalizing views that see the surface at grazing angles.

Why does a dense light field not require 3D geometry for novel view synthesis?

Chapter 4: Environment Mattes

An environment matte captures how an object interacts with light from its surroundings. Place a glass vase in front of different backgrounds — it refracts, reflects, and transmits light differently each time. An environment matte records this mapping: for each pixel of the object, which background rays contribute to its appearance?

Capture process: photograph the object in front of a set of known background patterns (for example, striped or sinusoidal patterns at several scales). From the response at each pixel, decode which background rays map to that pixel (Zongker et al., 1999).

Relighting transparent objects: Once you have the environment matte, you can composite the object onto any new background. The glass vase will correctly refract the new scene — without any ray tracing. The matte encodes the object's optical transfer function.
Higher-dimensional light fields: An environment matte is really a 2D-to-2D mapping: for each output pixel (x, y), it records which input ray (s, t) arrives there. For a moving object, this becomes a higher-dimensional function. The spectrum of approaches from full 3D geometry to pure image-based rendering forms a continuum — more geometry means fewer images needed, and vice versa.
What does an environment matte capture?

Chapter 5: Video-Based Rendering

Still images are one thing. But what if your source data is video? Video-based rendering creates novel video experiences by analyzing, re-arranging, and re-synthesizing video footage.

Technique | What It Does
Video textures | Find good loop points in a video and seamlessly stitch them. A candle flame, a waterfall, or a flag in the wind loops forever without a visible cut (Schödl et al., 2000).
Video-based animation | Re-order video segments to match new audio. Video Rewrite (Bregler et al., 1997) re-animated a person's lips to say new words by stitching mouth segments from existing footage.
Cinemagraphs | A still photo with one region looping as video, such as water flowing or hair blowing. Freeze the static parts, loop the dynamic parts.
3D video | Multiple synchronized video cameras reconstruct a time-varying 3D model. Watch a performance from any angle (Kanade et al., 1997).
Video walkthroughs | Street-level video stitched into a navigable experience, the technology behind Google Street View transitions.
Video textures in detail: The key insight is finding where a video "almost repeats." Compute a frame-to-frame similarity matrix (matching cost between all pairs of frames). Good loop points appear as low-cost entries away from the diagonal. During playback, when approaching a good transition point, randomly jump to the matching frame. The viewer perceives an infinite, non-repeating video of a naturally dynamic scene.
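A toy version of the loop search is sketched below. It assumes each frame is a short feature vector and uses a common formulation in which jumping from frame i to frame j is cheap when frame j resembles frame i + 1, the frame that would have played next. The function names are hypothetical, and the playback is a deterministic simplification of the random jumps described above.

```python
def frame_distance(a, b):
    """Sum of squared differences between two frames (lists of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_loop(frames, min_length=2):
    """Return (i, j, cost): after showing frame i, jump back to frame j."""
    best = None
    for i in range(len(frames) - 1):
        # Only backward jumps that keep at least min_length frames in the loop.
        for j in range(0, i - min_length + 2):
            cost = frame_distance(frames[i + 1], frames[j])
            if best is None or cost < best[2]:
                best = (i, j, cost)
    return best


def play(loop, steps):
    """Play frames in order, always taking the chosen backward jump."""
    i_end, j_start, _ = loop
    frame, shown = 0, []
    for _ in range(steps):
        shown.append(frame)
        frame = j_start if frame == i_end else frame + 1
    return shown


# A signal that almost repeats after four frames.
frames = [[0.0], [1.0], [2.0], [3.0], [0.25], [1.5]]
loop = best_loop(frames)
print(loop)           # (3, 0, 0.0625)
print(play(loop, 6))  # [0, 1, 2, 3, 0, 1]
```

Frame 4 looks almost like frame 0, so cutting from frame 3 straight back to frame 0 is nearly invisible. A fuller implementation would pick among several low-cost transitions at random so that the playback never settles into a fixed cycle.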
From Video Rewrite to deepfakes: Bregler's 1997 Video Rewrite was among the first systems to re-animate a person's face from existing footage. Today's face-swap and lip-sync deepfakes are direct descendants. The ethical implications (misinformation, consent violations) have become one of the most urgent challenges in computer vision.
How does a video texture create the illusion of an infinitely looping natural scene?

Chapter 6: Neural Rendering

Classical IBR uses explicit geometry (meshes, depth maps, voxels) and hand-crafted blending. Neural rendering replaces one or more of these components with learned neural networks. The result: photorealistic novel views even from sparse, noisy input.

The spectrum of neural rendering approaches:

Category | What the Network Does | Example
Learned blending | Network predicts blending weights for source views | DeepBlending (Hedman et al., 2018)
Learned geometry | Network predicts depth or MPI from images | Stereo Magnification (Zhou et al., 2018)
Neural volumes | 3D volume of learned features, decoded to color per ray | Neural Volumes (Lombardi et al., 2019)
Implicit scene | MLP maps (x, y, z, direction) to (color, density) | NeRF (Mildenhall et al., 2020)
Explicit primitives | Millions of learned 3D Gaussians, splatted to screen | 3D Gaussian Splatting (Kerbl et al., 2023)
The key shift: In classical IBR, the scene representation is designed by humans (mesh, voxel grid, light field). In neural rendering, it is learned from data. The network discovers the best way to encode geometry, appearance, and view-dependent effects. This is why neural methods handle reflections, translucency, and fine detail far better than classical approaches.

DeepBlending (Hedman et al., 2018) keeps the classical rendering pipeline but replaces the hand-tuned blending weights with a learned network. Given multiple source views rendered to the novel viewpoint, a CNN predicts per-pixel blending weights. The network learns to trust close, frontal views and distrust distant, grazing ones — a principled version of the heuristics used in unstructured Lumigraphs.

Differentiable rendering: The key enabler of neural rendering is making the rendering process differentiable. If you can compute the gradient of the rendered image with respect to scene parameters (geometry, appearance, camera), you can optimize those parameters using gradient descent to match the observed photographs. This is how NeRFs and Gaussian Splatting train: minimize the photometric loss between rendered and captured views.
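The optimization loop can be illustrated with the smallest possible "scene": one slab of density sigma in front of a background. This is a toy sketch with a hand-derived gradient. Real systems rely on automatic differentiation over millions of parameters, and render, render_grad, and fit_density are hypothetical names.

```python
import math


def render(sigma, delta=1.0, fg=1.0, bg=0.0):
    """Pixel value for a slab of density sigma and thickness delta over a background."""
    alpha = 1.0 - math.exp(-sigma * delta)
    return alpha * fg + (1.0 - alpha) * bg


def render_grad(sigma, delta=1.0, fg=1.0, bg=0.0):
    """Derivative of render() with respect to sigma."""
    return delta * math.exp(-sigma * delta) * (fg - bg)


def fit_density(target, sigma=0.1, lr=0.5, steps=500):
    """Gradient descent on the photometric loss (render(sigma) - target)^2."""
    for _ in range(steps):
        residual = render(sigma) - target
        sigma -= lr * 2.0 * residual * render_grad(sigma)
    return sigma


sigma = fit_density(target=0.5)
print(sigma, render(sigma))  # sigma approaches ln 2, rendered value approaches 0.5
```

The loop never inspects the scene directly. It only compares a rendered pixel with an observed one and follows the gradient, which is the same pattern NeRF and Gaussian Splatting training use at scale.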
What distinguishes neural rendering from classical image-based rendering?

Chapter 7: NeRFs and Beyond

Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) represent a scene as a continuous volumetric function mapping 3D position and viewing direction to color and density:

F_Θ(x, y, z, θ, φ) → (r, g, b, σ)

where (x, y, z) is the 3D position, (θ, φ) is the viewing direction, (r, g, b) is the emitted color, and σ is the volume density. The function is an MLP whose weights are written Θ to keep them distinct from the viewing angle θ.

How NeRF renders: To render a pixel, cast a ray from the camera through that pixel. Sample points along the ray. Query the MLP at each sample point to get color and density. Accumulate color using volume rendering (alpha compositing front-to-back): C = Σ_i T_i · α_i · c_i, where α_i = 1 − exp(−σ_i · δ_i) is the opacity of sample i, δ_i is the distance from sample i to the next sample, and T_i = Π_{j<i} (1 − α_j) is the transmittance (the fraction of light that has not been absorbed before reaching sample i).
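The compositing sum translates almost line for line into code. In this sketch the densities and colors are passed in as plain lists, standing in for MLP outputs at each sample, and composite_ray is an illustrative name.

```python
import math


def composite_ray(sigmas, colors, deltas):
    """Front-to-back volume rendering for samples ordered near to far.

    Returns (pixel_color, weights), where weights[i] = T_i * alpha_i.
    """
    transmittance = 1.0  # T_1 = 1: nothing has absorbed light yet
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha  # light remaining for farther samples
    return pixel, weights


red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
pixel, weights = composite_ray([0.0, 1e9, 5.0], [red, green, blue], [1.0, 1.0, 1.0])
print(pixel)    # [0.0, 1.0, 0.0]
print(weights)  # [0.0, 1.0, 0.0]
```

Empty space (σ = 0) contributes nothing, the effectively opaque green sample takes all of the weight, and the blue sample behind it is fully occluded.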
NeRF Volume Rendering

A ray passes through a volume. At each sample, the network predicts color and density. High-density regions contribute more to the final pixel color.

Method | Speed | Key Idea
NeRF (2020) | Hours to train, seconds/frame | MLP + positional encoding + hierarchical sampling
Instant-NGP (2022) | Minutes to train, real-time | Multi-resolution hash grid encoding feeding a small MLP
3D Gaussian Splatting (2023) | Minutes to train, 100+ FPS | Explicit 3D Gaussians, rasterized not ray-marched
Zip-NeRF (2023) | ~30 min train | Anti-aliased NeRF with hash grid, state-of-the-art quality
3D Gaussian Splatting: Instead of an implicit MLP, represent the scene as millions of 3D Gaussian primitives, each with a position, covariance (shape), color, and opacity. Render by projecting ("splatting") each Gaussian onto the screen and alpha-compositing. Training optimizes the Gaussians via gradient descent on photometric loss. Result: NeRF-quality rendering at 100+ FPS, editable, and exportable. This has rapidly become the dominant approach for real-time neural view synthesis.
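The per-pixel work of splatting can be sketched once the Gaussians have been projected to the screen. The example below starts from 2D means and 2×2 covariances and composites front to back. It leaves out the 3D-to-2D projection, view-dependent color, and the tile-based sorting that a real implementation relies on, and the dictionary layout and function names are assumptions for illustration.

```python
import math


def gaussian_alpha(pixel, mean, cov, opacity):
    """Opacity contributed at `pixel` by one projected 2D Gaussian.

    cov is a symmetric 2x2 covariance ((a, b), (b, c)).
    """
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = pixel[0] - mean[0], pixel[1] - mean[1]
    # Squared Mahalanobis distance, using the closed-form 2x2 inverse.
    m = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return opacity * math.exp(-0.5 * m)


def splat_pixel(pixel, splats):
    """Alpha-composite depth-sorted splats at a single pixel."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for s in sorted(splats, key=lambda s: s["depth"]):  # nearest first
        alpha = gaussian_alpha(pixel, s["mean"], s["cov"], s["opacity"])
        color = [c + transmittance * alpha * k for c, k in zip(color, s["color"])]
        transmittance *= 1.0 - alpha
    return color


identity = ((1.0, 0.0), (0.0, 1.0))
front = {"depth": 1.0, "mean": (0.0, 0.0), "cov": identity, "opacity": 0.5, "color": (1.0, 0.0, 0.0)}
back = {"depth": 2.0, "mean": (0.0, 0.0), "cov": identity, "opacity": 1.0, "color": (0.0, 0.0, 1.0)}
print(splat_pixel((0.0, 0.0), [back, front]))  # [0.5, 0.0, 0.5]
```

The compositing rule is the same front-to-back accumulation used for ray samples. The difference is that the samples come from rasterized primitives rather than from network queries along a ray.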
How does NeRF render a single pixel?

Chapter 8: Showcase — Volume Rendering Along a Ray

This simulation shows the core of NeRF rendering. A ray travels through a volume. At each sample, the "network" reports a density (how opaque) and a color. Watch how the accumulated pixel color builds up as the ray passes through dense regions.

Ray Marching Through a NeRF

Green bars show density at each sample. The accumulated color bar fills as the ray integrates. Adjust object positions and density to see how the rendering changes.

What to observe: The first dense object absorbs most of the transmittance, so the second object contributes less to the final color even if equally dense. This is exactly how real volume rendering works: closer opaque surfaces dominate. Move object 1 and 2 to see the occlusion effect. Reduce density to make both objects semi-transparent.
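A short worked example makes the occlusion effect concrete. It assumes each object is hit by a single sample and that both objects have the same opacity α, with weights w_i = T_i · α_i as in the volume rendering sum. The specific values 0.9 and 0.3 are chosen purely for illustration.

```latex
\begin{aligned}
\alpha_1 = \alpha_2 = 0.9:\quad & w_1 = T_1 \alpha_1 = 1 \cdot 0.9 = 0.9,
  \quad T_2 = 1 - \alpha_1 = 0.1,
  \quad w_2 = T_2 \alpha_2 = 0.1 \cdot 0.9 = 0.09 \\
\alpha_1 = \alpha_2 = 0.3:\quad & w_1 = 1 \cdot 0.3 = 0.3,
  \quad T_2 = 0.7,
  \quad w_2 = 0.7 \cdot 0.3 = 0.21
\end{aligned}
```

With nearly opaque objects the second one contributes only a tenth as much as the first. Once both are semi-transparent, the second contributes 70% as much, so its color shows through clearly.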

Chapter 9: Connections

Concept | Used In
View interpolation | Google Street View, VR walkthroughs, stabilized image transitions
Light fields | Lytro cameras, computational refocusing, VR video
Layered depth / MPI | Real-time neural view synthesis, 3D photos, portrait mode
Video textures | Cinemagraphs, dynamic wallpapers, game environments
NeRF / neural rendering | VR/AR content creation, digital twins, autonomous driving simulation
3D Gaussian Splatting | Real-time novel view synthesis, game-ready 3D capture, live events
Volume rendering equation | Medical imaging (CT), scientific visualization, atmospheric rendering
Szeliski's perspective: "Image-based rendering began as a way to cheat: avoid expensive 3D modeling by leaning on captured photographs. With neural radiance fields, the line between 'image-based' and 'model-based' has dissolved entirely. The representation is the rendering algorithm is the optimization objective. We have come full circle."
What recent method achieves NeRF-quality novel view synthesis at over 100 FPS using explicit primitives?