Synthesizing novel views from captured images: view interpolation, layered depth images, light fields, and neural radiance fields (NeRFs).
You have 50 photos of a cathedral taken from different angles. You want to create a smooth, photorealistic fly-through as if a virtual camera glided between the viewpoints. Traditional computer graphics would require a perfect 3D model with accurate materials and lighting. But what if you could just blend between the images themselves?
That is the promise of image-based rendering (IBR): synthesize novel views by combining existing photographs, using 3D geometry as a guide but leaning on the captured pixels for photorealism.
IBR methods trade off between how much 3D geometry they need and how many images they require.
View interpolation (Chen and Williams, 1993) is the seminal IBR technique. You start with two reference images, their depth maps, and their camera poses. To render a novel view "in between," warp both images to the virtual camera using the depth maps, then blend them.
Blend between two views. At t=0 you see the left view; at t=1, the right. The in-between is synthesized by depth-based warping and blending.
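The warp-then-blend idea can be sketched in a few lines of NumPy. This is a minimal illustration, not Chen and Williams' implementation: it assumes pinhole cameras that share one intrinsics matrix `K`, uses nearest-pixel splatting with a z-buffer, and the names `forward_warp` and `blend_views` are made up for this example.

```python
import numpy as np

def forward_warp(color, depth, K, R, trans):
    """Splat every source pixel into the target view, keeping the nearest sample.

    color: (H, W, 3) image, depth: (H, W) depth along the source optical axis.
    K: 3x3 intrinsics, assumed shared by both cameras (a simplification).
    R, trans: rotation and translation from source-camera to target-camera coordinates.
    Returns the warped image and a mask of target pixels that received data.
    """
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(float)  # (3, N)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # back-project to 3D
    pts = R @ pts + np.asarray(trans, float).reshape(3, 1)  # move into target frame
    proj = K @ pts
    z = proj[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)
    x = np.round(proj[0] / z_safe).astype(int)
    y = np.round(proj[1] / z_safe).astype(int)
    ok = in_front & (x >= 0) & (x < W) & (y >= 0) & (y < H)
    src = np.flatnonzero(ok)
    dst = y[ok] * W + x[ok]
    # Sort by target pixel, then by depth, and keep the first (nearest) sample per pixel.
    order = np.lexsort((z[ok], dst))
    _, first = np.unique(dst[order], return_index=True)
    winners = order[first]
    out = np.zeros((H * W, 3))
    hit = np.zeros(H * W, dtype=bool)
    out[dst[winners]] = color.reshape(-1, 3)[src[winners]]
    hit[dst[winners]] = True
    return out.reshape(H, W, 3), hit.reshape(H, W)

def blend_views(img0, hit0, img1, hit1, t):
    """Blend two warped views; t=0 favors view 0, t=1 favors view 1.

    Where only one view has data, that view gets full weight.
    """
    w0 = np.where(hit1, 1.0 - t, 1.0) * hit0
    w1 = np.where(hit0, t, 1.0) * hit1
    out = w0[..., None] * img0 + w1[..., None] * img1
    return out, hit0 | hit1
```

Pixels that neither warped view covers stay empty (the returned mask is `False` there). These are the disocclusion holes that motivate layered representations.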
View morphing: When the scene is non-rigid (a person smiling in one image, frowning in another), pure geometric warping fails. View morphing (Seitz and Dyer, 1996) combines geometric warping (from depth) with image morphing (from correspondences) to create plausible in-between views of deformable objects. First rectify both images, morph in rectified space, then un-rectify. The result: a smooth transition that respects both geometry and appearance.
A standard depth map stores one depth per pixel. But when you warp to a novel view, the area behind a foreground object is revealed — and there is no data there. A layered depth image (LDI) stores multiple depth-color pairs at each pixel, capturing the hidden layers.
Think of it as a stack of transparent cards with cutouts: the front card has the foreground; peeking through the holes, you see the card behind.
| Representation | What It Stores | Trade-off |
|---|---|---|
| Single depth map | One (color, depth) per pixel | Simple but holes on warp |
| LDI | Variable-length list of (color, depth) per pixel | Handles disocclusion but needs a more complex data structure |
| Multi-plane image (MPI) | Fixed set of RGBA planes at discrete depths | GPU-friendly, used in modern neural IBR |
| Layered sprites | Separated foreground/background RGBA + depth | Compact, editable, fast rendering |
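To make the difference between a single depth map and an LDI concrete, here is a toy one-scanline sketch. It assumes a purely sideways camera move, so each sample shifts by an amount inversely proportional to its depth; `render_ldi_row` and the constant `shift_per_inv_depth` (standing in for focal length times baseline) are hypothetical names for this example only.

```python
import numpy as np

def render_ldi_row(ldi_row, shift_per_inv_depth, width):
    """Warp one LDI scanline sideways and keep the nearest sample per target pixel.

    ldi_row: one list per source pixel, each holding (depth, color) samples.
    Target pixels that receive no sample are left as NaN to mark holes.
    """
    out = np.full(width, np.nan)
    zbuf = np.full(width, np.inf)
    for x, samples in enumerate(ldi_row):
        for depth, color in samples:
            xt = int(round(x + shift_per_inv_depth / depth))
            if 0 <= xt < width and depth < zbuf[xt]:
                zbuf[xt] = depth
                out[xt] = color
    return out

# A gray wall (depth 10, color 0.2) with a bright object (depth 2, color 0.9)
# covering pixels 2 and 3.
single_layer = [[(10.0, 0.2)], [(10.0, 0.2)], [(2.0, 0.9)],
                [(2.0, 0.9)], [(10.0, 0.2)], [(10.0, 0.2)]]
# The LDI also stores the wall samples hidden behind the object.
layered = [list(s) for s in single_layer]
layered[2].append((10.0, 0.2))
layered[3].append((10.0, 0.2))

holes = render_ldi_row(single_layer, shift_per_inv_depth=4.0, width=6)
filled = render_ldi_row(layered, shift_per_inv_depth=4.0, width=6)
```

With the single layer, the object slides two pixels to the right and leaves NaN holes at pixels 2 and 3. With the layered version, the stored wall samples fill those pixels.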
A ray of light traveling through the free space around a scene can be parameterized by where it crosses two parallel planes: (u, v) on the camera plane and (s, t) on the focal plane. (Rays parallel to the planes are the one exception.) The complete collection of all such rays is the 4D light field L(u, v, s, t).
If you capture a dense enough light field, you can synthesize any novel view by simply looking up the right rays — no 3D geometry needed at all.
A camera array captures rays. Each camera position (u,v) sees each scene point through a different pixel (s,t). Move the virtual camera to see different slices of the 4D light field.
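The two-plane lookup can be sketched as follows. The plane positions (z = 0 for the camera plane, z = 1 for the focal plane), the array layout `L[u, v, s, t]`, and both function names are assumptions chosen for this example. The renderer only handles a virtual camera that lies on the camera plane; a general viewpoint would instead convert each of its rays with `ray_to_uvst` and interpolate in all four dimensions.

```python
import numpy as np

def ray_to_uvst(origin, direction, z_cam=0.0, z_focal=1.0):
    """Intersect a ray with the camera plane (z=z_cam) and the focal plane (z=z_focal)."""
    origin = np.asarray(origin, float)
    direction = np.asarray(direction, float)
    if abs(direction[2]) < 1e-12:
        raise ValueError("a ray parallel to the planes has no two-plane coordinates")
    a = (z_cam - origin[2]) / direction[2]
    b = (z_focal - origin[2]) / direction[2]
    u, v = origin[:2] + a * direction[:2]
    s, t = origin[:2] + b * direction[:2]
    return u, v, s, t

def render_on_camera_plane(L, u, v):
    """Bilinearly interpolate the four nearest camera images at fractional grid position (u, v).

    L has shape (U, V, S, T): camera index (u, v), pixel index (s, t).
    """
    U, V = L.shape[:2]
    u = float(np.clip(u, 0, U - 1))
    v = float(np.clip(v, 0, V - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, U - 1), min(v0 + 1, V - 1)
    a, b = u - u0, v - v0
    return ((1 - a) * (1 - b) * L[u0, v0] + a * (1 - b) * L[u1, v0]
            + (1 - a) * b * L[u0, v1] + a * b * L[u1, v1])
```

No depth or mesh appears anywhere in this code, which is the point of the representation: given dense enough sampling, interpolation between stored rays is all the renderer does.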
| Variant | Key Idea |
|---|---|
| Light field | Dense camera grid, two-plane parameterization (Levoy & Hanrahan, 1996) |
| Lumigraph | Same parameterization but uses approximate geometry to improve interpolation (Gortler et al., 1996) |
| Unstructured Lumigraph | Arbitrary (non-grid) camera positions, geometry-aware blending (Buehler et al., 2001) |
| Surface light field | Store view-dependent appearance on the surface of a 3D model. Captures specular effects. |
| Concentric mosaics | Camera on a rotating arm captures a ring of views. 3D parameterization (Shum & He, 1999). |
Practical challenges: Capturing a dense light field requires hundreds or thousands of cameras (or a moving camera on a gantry). Storage is enormous: a 100×100 camera grid at 1 megapixel per camera is 10 billion rays. Compression exploits the redundancy between nearby views (most of the scene is the same), but the data demands remain a key limitation for consumer applications.
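The ray count above is simple arithmetic, spelled out here. The bytes-per-ray figure is an assumption (uncompressed 8-bit RGB), used only to give a rough sense of the raw storage.

```python
cameras = 100 * 100                 # 100 x 100 camera grid
pixels_per_camera = 1_000_000       # 1 megapixel each
rays = cameras * pixels_per_camera  # one ray per pixel per camera

bytes_per_ray = 3                   # assumption: 8-bit RGB, no compression
gigabytes = rays * bytes_per_ray / 1e9

print(rays, gigabytes)  # 10000000000 30.0
```

Under that assumption, a single static light field takes about 30 GB before compression.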
Sparse light fields: Real-world capture is typically sparse (tens of views, not thousands). The Lumigraph (Gortler et al., 1996) uses approximate geometry to improve interpolation between sparse views. The Unstructured Lumigraph (Buehler et al., 2001) handles arbitrary camera positions by blending views weighted by angular proximity, penalizing views that see the surface at grazing angles.
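A simplified sketch of angle-based blending weights in the spirit of the Unstructured Lumigraph follows. It is not the published algorithm: the penalty here is the angle to the desired viewing ray plus a grazing term, and the `grazing_weight` parameter and function name are assumptions for illustration. The k best cameras receive weights that fall to zero at the penalty of the next-best camera and are then normalized.

```python
import numpy as np

def angular_blend_weights(point, normal, target_cam, source_cams, k=2, grazing_weight=1.0):
    """Per-camera blending weights for one surface point (weights sum to 1)."""
    def unit(vec):
        return vec / np.linalg.norm(vec)

    point = np.asarray(point, float)
    n = unit(np.asarray(normal, float))
    to_target = unit(np.asarray(target_cam, float) - point)

    penalties = []
    for cam in source_cams:
        to_cam = unit(np.asarray(cam, float) - point)
        angle = np.arccos(np.clip(to_cam @ to_target, -1.0, 1.0))
        grazing = 1.0 - max(to_cam @ n, 0.0)   # 0 for a frontal view, 1 for edge-on
        penalties.append(angle + grazing_weight * grazing)
    penalties = np.array(penalties)

    order = np.argsort(penalties)
    k = min(k, len(order))
    # Threshold: penalty of the first camera that is NOT used, so weights fade out smoothly.
    thresh = penalties[order[k]] if len(order) > k else penalties.max() + 1e-8
    weights = np.zeros(len(penalties))
    weights[order[:k]] = np.maximum(1.0 - penalties[order[:k]] / max(thresh, 1e-8), 0.0)
    if weights.sum() == 0.0:
        weights[order[0]] = 1.0                # degenerate tie: fall back to the best camera
    return weights / weights.sum()
```

For example, with a surface point at the origin facing +z and the virtual camera directly above it, a source camera that is also directly above gets the largest weight, and a camera near the horizon is excluded.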
An environment matte captures how an object interacts with light from its surroundings. Place a glass vase in front of different backgrounds — it refracts, reflects, and transmits light differently each time. An environment matte records this mapping: for each pixel of the object, which background rays contribute to its appearance?
Capture process: photograph the object in front of a set of known background patterns (often sinusoidal patterns at different frequencies). From the response at each pixel, decode which background rays map to that pixel (Zongker et al., 1999).
Still images are one thing. But what if your source data is video? Video-based rendering creates novel video experiences by analyzing, re-arranging, and re-synthesizing video footage.
| Technique | What It Does |
|---|---|
| Video textures | Find good loop points in a video and seamlessly stitch them. A candle flame, a waterfall, or a flag in the wind loops forever without a visible cut (Schödl et al., 2000). |
| Video-based animation | Re-order video segments to match new audio. Video Rewrite (Bregler et al., 1997) re-animated a person's lips to say new words by stitching mouth segments from existing footage. |
| Cinemagraphs | A still photo with one region looping as video — water flowing, hair blowing. Freeze the static parts, loop the dynamic parts. |
| 3D video | Multiple synchronized video cameras reconstruct a time-varying 3D model. Watch a performance from any angle (Kanade et al., 1997). |
| Video walkthroughs | Street-level video stitched into a navigable experience — the technology behind Google Street View transitions. |
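Finding a loop point for a video texture can be sketched with a frame-distance matrix. This toy version only picks the single best loop using raw pixel differences. The published video-textures method goes further (for example, it preserves motion dynamics and can jump between many transition points), and the function name and `min_length` parameter are made up for this sketch.

```python
import numpy as np

def best_loop(frames, min_length=2):
    """Return (start, end): play frames start..end, then jump back to start.

    The jump from `end` to `start` is least visible when `start` looks like
    frame end+1, the frame the viewer would naturally have seen next.
    """
    flat = frames.reshape(len(frames), -1).astype(float)
    diff = flat[:, None, :] - flat[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # D[i, j] = distance between frames i and j

    best, best_cost = None, np.inf
    for end in range(len(frames) - 1):
        for start in range(0, end - min_length + 2):
            cost = D[end + 1, start]
            if cost < best_cost:
                best, best_cost = (start, end), cost
    return best
```

The all-pairs distance matrix is fine for short clips; longer footage would need downsampled frames or a chunked computation.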
Classical IBR uses explicit geometry (meshes, depth maps, voxels) and hand-crafted blending. Neural rendering replaces one or more of these components with learned neural networks. The result: photorealistic novel views even from sparse, noisy input.
The spectrum of neural rendering approaches:
| Category | What the Network Does | Example |
|---|---|---|
| Learned blending | Network predicts blending weights for source views | DeepBlending (Hedman et al., 2018) |
| Learned geometry | Network predicts depth or MPI from images | Stereo Magnification (Zhou et al., 2018) |
| Neural volumes | 3D volume of learned features, decoded to color per ray | Neural Volumes (Lombardi et al., 2019) |
| Implicit scene | MLP maps (x, y, z, direction) to (color, density) | NeRF (Mildenhall et al., 2020) |
| Explicit primitives | Millions of learned 3D Gaussians, splatted to screen | 3D Gaussian Splatting (Kerbl et al., 2023) |
DeepBlending (Hedman et al., 2018) keeps the classical rendering pipeline but replaces the hand-tuned blending weights with a learned network. Given multiple source views rendered to the novel viewpoint, a CNN predicts per-pixel blending weights. The network learns to trust close, frontal views and distrust distant, grazing ones — a principled version of the heuristics used in unstructured Lumigraphs.
Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) represent a scene as a continuous volumetric function mapping 3D position and viewing direction to color and density:
$$F_\Theta : (x, y, z, \theta, \phi) \mapsto (r, g, b, \sigma)$$
where (x, y, z) is the 3D position, (θ, φ) is the viewing direction, (r, g, b) is the emitted color, and σ is the volume density. The function is parameterized by an MLP with weights Θ (capitalized to distinguish them from the viewing angle θ).
A ray passes through a volume. At each sample, the network predicts color and density. High-density regions contribute more to the final pixel color.
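The shape of that function is easy to see in code. The following is a toy, untrained stand-in, not the NeRF architecture: the layer sizes, frequency counts, and random weights are arbitrary assumptions, and `ToyRadianceField` is a hypothetical name. It keeps two properties of the real model: inputs pass through a sinusoidal positional encoding, and density depends on position only while color also depends on viewing direction.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Map each coordinate x to sin(2^k * pi * x) and cos(2^k * pi * x), k = 0..num_freqs-1."""
    p = np.asarray(p, float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = p[..., None] * freqs                       # (..., D, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)               # (..., D * 2 * num_freqs)

class ToyRadianceField:
    """(x, y, z, theta, phi) -> (r, g, b, sigma) with random, untrained weights."""

    def __init__(self, pos_freqs=4, dir_freqs=2, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        self.W_pos = rng.normal(0.0, 0.5, (3 * 2 * pos_freqs, hidden))
        self.w_sigma = rng.normal(0.0, 0.5, hidden)
        self.W_rgb = rng.normal(0.0, 0.5, (hidden + 3 * 2 * dir_freqs, 3))

    def __call__(self, xyz, theta, phi):
        # Convert the two viewing angles to a unit direction vector.
        d = np.stack([np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi),
                      np.cos(theta)], axis=-1)
        h = np.maximum(positional_encoding(xyz, self.pos_freqs) @ self.W_pos, 0.0)
        sigma = np.maximum(h @ self.w_sigma, 0.0)       # density: position only, non-negative
        feats = np.concatenate([h, positional_encoding(d, self.dir_freqs)], axis=-1)
        rgb = 1.0 / (1.0 + np.exp(-(feats @ self.W_rgb)))  # sigmoid keeps color in [0, 1]
        return rgb, sigma
```

Calling `ToyRadianceField()(xyz, theta, phi)` with `xyz` of shape (N, 3) and angle arrays of shape (N,) returns colors of shape (N, 3) and densities of shape (N,). Training would adjust the weights so that rendered rays match the input photographs.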
| Method | Speed | Key Idea |
|---|---|---|
| NeRF (2020) | Hours to train, seconds/frame | MLP + positional encoding + hierarchical sampling |
| Instant-NGP (2022) | Minutes to train, real-time | Multi-resolution hash grid encoding feeds a tiny MLP, replacing most of the large MLP's work |
| 3D Gaussian Splatting (2023) | Minutes to train, 100+ FPS | Explicit 3D Gaussians, rasterized not ray-marched |
| Zip-NeRF (2023) | ~30 min train | Anti-aliased NeRF with hash grid, state-of-the-art quality |
The core of NeRF rendering: a ray travels through a volume. At each sample, the "network" reports a density (how opaque the point is) and a color, and the accumulated pixel color builds up as the ray passes through dense regions.
[Interactive demo: green bars show the density at each sample, and a color bar fills as the ray integrates. Object positions and densities are adjustable.]
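The accumulation along a ray can be written as a short function using the standard discrete volume rendering sum: each sample gets opacity 1 - exp(-σδ), weighted by the fraction of light that survived all earlier samples. The function name and argument layout are choices made for this sketch.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Accumulate color along one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) distances between samples.
    Returns the pixel color and the per-sample contribution weights.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                   # opacity of each segment
    # Transmittance: fraction of light that reaches sample i without being absorbed earlier.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    pixel = (weights[:, None] * colors).sum(axis=0)
    return pixel, weights

# Empty space, then a dense red region, then a green region hidden behind it.
sigmas = np.array([0.0, 1e3, 1e3])
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pixel, weights = composite_ray(sigmas, colors, np.full(3, 0.1))
```

In this example the empty first sample contributes nothing, the dense red sample absorbs essentially all of the light, and the green sample behind it is hidden, so the pixel comes out red.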
| Concept | Used In |
|---|---|
| View interpolation | Google Street View, VR walkthroughs, stabilized image transitions |
| Light fields | Lytro cameras, computational refocusing, VR video |
| Layered depth / MPI | Real-time neural view synthesis, 3D photos, portrait mode |
| Video textures | Cinemagraphs, dynamic wallpapers, game environments |
| NeRF / neural rendering | VR/AR content creation, digital twins, autonomous driving simulation |
| 3D Gaussian Splatting | Real-time novel view synthesis, game-ready 3D capture, live events |
| Volume rendering equation | Medical imaging (CT), scientific visualization, atmospheric rendering |