How neural networks conjure 3D scenes from ordinary photographs — and why one approach uses rays while the other throws paint.
You take 50 photos of a statue from different angles. Your brain can reconstruct the 3D shape. Can a neural network do the same? This is the problem of novel view synthesis: given a set of images and their camera poses, render the scene from any new viewpoint.
Traditional approaches (structure from motion, multi-view stereo) extract explicit geometry like point clouds or meshes. Neural approaches like NeRF and 3D Gaussian Splatting take a radically different path: they learn an implicit or parametric representation of the scene that can be rendered directly.
Multiple cameras observe a 3D object from different angles. The goal: reconstruct what the object looks like from any viewpoint.
To render a pixel, you shoot a ray from the camera through the scene. Along this ray, you sample points and ask: "What color and density exist here?" Then you composite these samples from front to back using the volume rendering equation:
$$C = \int_0^{t_f} T(t)\,\sigma(t)\,c(t)\,dt, \qquad T(t) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$$
σ is density (how opaque the material is), c is color, and T is transmittance (how much light makes it through to this point). The integral runs along the ray from the camera (t = 0) to a far bound $t_f$. Dense regions block light; empty space passes it through.
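In practice, we discretize this integral. For N sample points along a ray, the discrete volume rendering equation is:
$$C = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i, \qquad \alpha_i = 1 - \exp(-\sigma_i \delta_i), \qquad T_i = \prod_{j<i} (1 - \alpha_j)$$
Let's work a concrete example with 3 samples. Suppose σ = [0.1, 5.0, 0.2] and the step size δ = 0.1 for all samples. Colors are $c_1$ = red, $c_2$ = green, $c_3$ = blue.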
| Sample | σ | α = 1−exp(−σδ) | T (accumulated) | Weight = T·α |
|---|---|---|---|---|
| 1 (red) | 0.1 | 0.010 | 1.000 | 0.010 |
| 2 (green) | 5.0 | 0.393 | 0.990 | 0.390 |
| 3 (blue) | 0.2 | 0.020 | 0.600 | 0.012 |
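Sample 2 dominates: it has high density (σ=5.0) and most of the light hasn't been absorbed yet (T=0.99). The three weights sum to about 0.41, so the remaining ~59% of the light passes through to whatever lies behind; of the color these samples contribute, almost all is green. This is how NeRF renders: evaluate every sample, weight by density and transmittance, sum up.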
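The worked example can be recomputed in a few lines. This is a minimal sketch using only the standard library; the function name `composite_weights` is an illustrative choice, not something from the original article.

```python
import math

def composite_weights(sigmas, delta):
    """Return per-sample (alpha, transmittance, weight) plus the leftover light."""
    rows = []
    transmittance = 1.0  # T_1 = 1: nothing has absorbed light yet
    for sigma in sigmas:
        alpha = 1.0 - math.exp(-sigma * delta)
        rows.append((alpha, transmittance, transmittance * alpha))
        transmittance *= 1.0 - alpha  # light left for the next sample
    return rows, transmittance

rows, leftover = composite_weights([0.1, 5.0, 0.2], delta=0.1)
for name, (alpha, T, weight) in zip(["red", "green", "blue"], rows):
    print(f"{name:5s} alpha={alpha:.3f} T={T:.3f} weight={weight:.3f}")
print(f"light reaching the background: {leftover:.3f}")
```

The weights and the leftover transmittance always sum to 1, which is a handy sanity check when debugging a renderer.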
A ray travels through a scene, sampling density and color at each point. Denser regions contribute more to the final pixel color.
The volume rendering integral uses transmittance $T(t) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$. This isn't arbitrary: it comes from the Beer-Lambert law of light absorption through a medium.
Your task: Starting from "a thin slab of thickness dt with density σ absorbs a fraction σ·dt of the remaining light," derive the continuous transmittance formula $T(t) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$ and then discretize it into $T_i = \prod_{j<i} (1 - \alpha_j)$.
Step 1: Beer-Lambert law says a thin slab absorbs proportionally to its density: $dI = -\sigma(t)\,I(t)\,dt$
Step 2: Rearrange: $dI/I = -\sigma(t)\,dt$. Integrate from 0 to t: $\ln I(t) - \ln I(0) = -\int_0^t \sigma(s)\,ds$
Step 3: Exponentiate: $I(t)/I(0) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$. This ratio IS the transmittance: $T(t) = \exp\left(-\int_0^t \sigma(s)\,ds\right)$
Step 4 (discretize): Break the ray into steps of size $\delta_i$. Each step transmits a fraction $\exp(-\sigma_i \delta_i) = 1 - \alpha_i$ of the light. The total transmittance to sample i is the product of all preceding pass-through fractions: $T_i = \prod_{j<i} (1 - \alpha_j)$
The key insight: The exponential decay form is not a choice — it's the unique solution to "each slab absorbs proportionally to its density." The product form in discrete rendering is just the discretized version of the same physics.
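One way to convince yourself of the derivation is to simulate the thin slabs directly and watch the result approach the exponential. The sketch below assumes a constant density along the ray; `transmittance_by_slabs` is a hypothetical helper written for this check.

```python
import math

def transmittance_by_slabs(sigma, distance, num_slabs):
    """Pass light through num_slabs thin slabs; each absorbs sigma*dt of what remains."""
    dt = distance / num_slabs
    intensity = 1.0
    for _ in range(num_slabs):
        intensity -= sigma * intensity * dt  # dI = -sigma * I * dt
    return intensity

sigma, distance = 5.0, 0.1
exact = math.exp(-sigma * distance)  # closed-form Beer-Lambert transmittance
for n in (1, 10, 100, 10_000):
    approx = transmittance_by_slabs(sigma, distance, n)
    print(f"{n:6d} slabs: T={approx:.6f}  (exact {exact:.6f})")
```

As the slabs get thinner, the simulated value converges to `exp(-sigma * distance)`, which is the same quantity as $1 - \alpha$ for a single rendering step of that length.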
```python
import numpy as np

def volume_render(sigmas, deltas, colors):
    sigmas = np.array(sigmas)
    deltas = np.array(deltas)
    colors = np.array(colors)

    # Step 1: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)

    # Step 2: T_i = product of (1-alpha_j) for j < i
    # T[0] = 1, T[1] = (1-a[0]), T[2] = (1-a[0])(1-a[1]), ...
    transmittance = np.cumprod(1.0 - alphas)
    transmittance = np.concatenate([[1.0], transmittance[:-1]])

    # Step 3: weights and final color
    weights = transmittance * alphas  # [N]
    pixel_color = (weights[:, None] * colors).sum(axis=0)  # [3]
    return pixel_color, weights
```
$T_i$ decreases because each sample absorbs some light. If sample 2 has σ=100 with δ=0.1, then $\alpha_2 = 1 - \exp(-10) \approx 1.0$ (almost fully opaque). $T_3 = T_2 (1 - \alpha_2) \approx T_2 \cdot 0 \approx 0$. So sample 3 contributes NOTHING to the pixel: all light was absorbed before reaching it. This is exactly how a solid wall works: you can't see what's behind it because the wall absorbs all incoming light. Volume rendering naturally handles occlusion through this exponential decay of transmittance.
NeRF represents a 3D scene as a continuous function: given a 3D point (x, y, z) and viewing direction (θ, φ), it outputs the color (r, g, b) and density (σ) at that point. This function is parameterized by a simple MLP (multilayer perceptron).
The training loop is straightforward: randomly pick a training image, randomly sample rays from it, render those rays using volume rendering, compare the rendered pixel colors to the ground-truth pixels using MSE loss, and backpropagate. Training from ~50–200 posed images takes about 1–2 days on a single GPU.
MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. Positional encoding solves this by mapping the input coordinates to a higher-dimensional space using sinusoidal functions.
Without positional encoding, the MLP can only learn smooth blobs. With it, sharp edges and fine detail emerge.
NeRF uses L=10 frequency bands for position (x,y,z) and L=4 for viewing direction (θ,φ). For each coordinate, the encoding produces 2L values (sin and cos at each frequency), and the raw coordinate is kept alongside them. So a 3D position [3] becomes [3 + 3×2×10] = [63] values, and a 2D direction [2] becomes [2 + 2×2×4] = [18] values, for 63+18 = 81 encoded dimensions in total. (The reference implementation passes the direction as a 3D unit vector instead, which gives 3 + 3×2×4 = 27 values.) Why fewer bands for direction? View-dependent effects (like specular highlights) are lower-frequency than geometry, so you don't need sharp angular precision.
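The dimension count above can be checked with a small encoder. This is a sketch under the text's conventions (raw coordinates kept, 2D direction); the ordering of the sin/cos features is an arbitrary choice here and differs between implementations.

```python
import math

def positional_encoding(coords, num_bands):
    """Map each coordinate p to [p, sin(2^k pi p), cos(2^k pi p)] for k = 0..num_bands-1."""
    encoded = list(coords)  # keep the raw coordinates, as the count above assumes
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        for p in coords:
            encoded.append(math.sin(freq * p))
            encoded.append(math.cos(freq * p))
    return encoded

position = [0.25, -0.5, 0.1]   # (x, y, z), assumed normalized to [-1, 1]
direction = [0.3, 1.2]         # (theta, phi), following the text's 2D convention
pos_features = positional_encoding(position, num_bands=10)
dir_features = positional_encoding(direction, num_bands=4)
print(len(pos_features), len(dir_features), len(pos_features) + len(dir_features))
```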
The highest band, $\sin(2^9 \pi x)$, oscillates 512 times across the scene, which is plenty of resolution for fine detail.
An MLP with ReLU activations is biased toward low-frequency functions (the "spectral bias" of neural networks). Positional encoding fixes this by mapping inputs to sinusoids at increasing frequencies.
Your task: Show that for two nearby points x and x+ε, their raw inputs are nearly identical (|x − (x+ε)| = ε), but after positional encoding with frequency band $2^k$, the encoded inputs can differ by up to $2^k \pi \varepsilon$. Explain why this means the MLP can now distinguish points that are $1/2^k$ apart.
Step 1: Without PE, two points at x=0.500 and x=0.501 give the MLP inputs that differ by 0.001. The MLP must squeeze all scene detail into this tiny input range; it's like painting the Mona Lisa with a 3-pixel brush.
Step 2: With PE at frequency band k: $\gamma_k(x) = \sin(2^k \pi x)$. The difference between our two points: $\sin(2^k \pi \cdot 0.501) - \sin(2^k \pi \cdot 0.500) \approx 2^k \pi \cdot 0.001 \cdot \cos(2^k \pi \cdot 0.500)$.
Step 3: At k=9 (the highest band in NeRF), the phase shifts by 512π·0.001 ≈ 1.6 radians, about a quarter of a cycle. Here the linear estimate overshoots: the encoded value actually moves from sin(256π) = 0 to sin(256π + 1.6) ≈ 1.0, half of the sine's full range. A change of that size in the MLP's input space is HUGE, comparable to moving across the entire scene in the raw coordinate. The MLP now "sees" these nearby points as far apart in its feature space.
Step 4: The multi-scale structure matters. Low frequencies (k=0,1) capture coarse structure. High frequencies (k=8,9) capture fine edges. The MLP learns to combine these scales, much like Fourier analysis decomposes signals into frequency components.
The key insight: Positional encoding is not "adding information" — the input already contains the position. It's AMPLIFYING the difference between nearby positions so the MLP's smooth learned function can capture sharp transitions. It converts a spatial resolution problem into a function approximation problem the MLP can actually solve.
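The amplification is easy to see numerically. The sketch below compares the two points from the exercise, x = 0.500 and x = 0.501, at every band; `encoded_gap` is an illustrative helper name.

```python
import math

def encoded_gap(x_a, x_b, k):
    """Absolute difference between sin(2^k pi x) evaluated at two nearby points."""
    freq = (2.0 ** k) * math.pi
    return abs(math.sin(freq * x_b) - math.sin(freq * x_a))

x_a, x_b = 0.500, 0.501
print(f"raw gap: {x_b - x_a:.3f}")
for k in range(10):
    bound = (2.0 ** k) * math.pi * (x_b - x_a)  # the 2^k * pi * eps upper bound
    print(f"k={k}: encoded gap={encoded_gap(x_a, x_b, k):.4f}  bound={bound:.4f}")
```

The gap never exceeds the bound, and at low bands it can be far below it when the cosine factor is near zero, which is one reason the encoding uses many bands and both sin and cos.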
Evaluating the MLP at every point along a ray is expensive. NeRF uses a hierarchical sampling strategy: first, sample uniformly (coarse pass), then concentrate more samples where density is high (fine pass). This focuses compute where it matters.
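The fine pass can be sketched as inverse-transform sampling: treat the coarse weights as a piecewise-constant probability density along the ray and draw new depths from it. The function below is a simplified illustration, not the reference implementation; the bin layout, the small `eps`, and the example weights are assumptions.

```python
import bisect
import random

def sample_fine(bin_edges, coarse_weights, num_fine, rng, eps=1e-5):
    """Draw num_fine depths from the piecewise-constant PDF given by coarse weights."""
    weights = [w + eps for w in coarse_weights]  # eps avoids an all-zero PDF
    total = sum(weights)
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)
    samples = []
    for _ in range(num_fine):
        u = rng.random()
        # Find the bin whose CDF interval contains u, then interpolate inside it.
        i = min(bisect.bisect_right(cdf, u) - 1, len(weights) - 1)
        frac = (u - cdf[i]) / (cdf[i + 1] - cdf[i])
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)

bin_edges = [2.0, 2.5, 3.0, 3.5, 4.0]      # depths bounding 4 coarse bins
coarse_weights = [0.01, 0.02, 0.90, 0.05]  # the surface sits in the third bin
fine = sample_fine(bin_edges, coarse_weights, num_fine=64, rng=random.Random(0))
near_surface = sum(1 for t in fine if 3.0 <= t <= 3.5)
print(f"{near_surface} of {len(fine)} fine samples fall in the high-weight bin")
```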
Blue dots = coarse uniform samples. Green dots = fine importance-weighted samples near the surface.
Original NeRF takes a day or more to train and tens of seconds to render a single frame. Instant-NGP (NVIDIA, 2022) slashed training to seconds and rendering to real-time by moving most of the scene's capacity out of the MLP and into a multi-resolution hash table of trainable feature vectors.
Instead of pushing every point through a deep 8-layer MLP, Instant-NGP looks up learned features in a hash table indexed by spatial position, interpolates them, and feeds the result to a much smaller MLP. The lookups are massively parallel and cache-friendly.
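To make the lookup concrete, here is a simplified sketch of a hash-grid encoding: hash the eight grid vertices around a point, fetch their feature vectors, and blend them trilinearly at each resolution. The per-axis multipliers are the ones reported in the Instant-NGP paper; the resolutions, table size, feature width, and function names are illustrative assumptions, and the tables here hold random values rather than trained ones.

```python
import math
import random

PRIMES = (1, 2654435761, 805459861)  # per-axis multipliers for the spatial hash

def hash_vertex(ix, iy, iz, table_size):
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])) % table_size

def lookup(table, point, resolution):
    """Trilinearly interpolate the features at the 8 grid vertices around point."""
    scaled = [c * resolution for c in point]  # point assumed to lie in [0, 1]^3
    base = [int(math.floor(s)) for s in scaled]
    frac = [s - b for s, b in zip(scaled, base)]
    out = [0.0] * len(table[0])
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((frac[0] if dx else 1.0 - frac[0])
                          * (frac[1] if dy else 1.0 - frac[1])
                          * (frac[2] if dz else 1.0 - frac[2]))
                slot = hash_vertex(base[0] + dx, base[1] + dy, base[2] + dz, len(table))
                for f, value in enumerate(table[slot]):
                    out[f] += weight * value
    return out

def encode(tables, resolutions, point):
    """Concatenate the interpolated features from every resolution level."""
    features = []
    for table, resolution in zip(tables, resolutions):
        features.extend(lookup(table, point, resolution))
    return features

rng = random.Random(0)
resolutions = [16, 32, 64]  # illustrative; real configurations use more levels
tables = [[[rng.uniform(-1e-4, 1e-4) for _ in range(2)] for _ in range(2 ** 14)]
          for _ in resolutions]
features = encode(tables, resolutions, (0.3, 0.7, 0.2))
print(len(features))  # 3 levels x 2 features per level
```

In the full method these concatenated features are the input to the small MLP, and the table entries are trained by backpropagation along with it.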
| Method | Training Time | Render Speed | Key Technique |
|---|---|---|---|
| Original NeRF | ~1 day | ~30s/frame | Deep MLP |
| Instant-NGP | ~5 seconds | Real-time | Hash grid encoding |
| TensoRF | ~30 min | ~1s/frame | Tensor factorization |
| Plenoxels | ~11 min | ~15fps | Sparse voxel grid |
Compare the two approaches: a deep MLP requires sequential layers, while a hash grid is a fast parallel lookup.
3DGS takes a completely different approach from NeRF. Instead of an implicit function queried along rays, it represents the scene as millions of 3D Gaussians — each one a colored, oriented ellipsoid with position, covariance, color, and opacity.
To render: project each Gaussian onto the screen (splatting), sort by depth, and alpha-composite them front to back. No ray marching, no MLP evaluation — just rasterization. This is extremely fast.
Each ellipse is a 3D Gaussian projected to 2D. More Gaussians = more detail.
| Per-Gaussian Parameter | Meaning | Count |
|---|---|---|
| Position μ | 3D center of the Gaussian | 3 |
| Covariance Σ | Shape, size, orientation (via quaternion + scale) | 7 |
| Color | Spherical harmonics coefficients | 48 |
| Opacity α | Transparency | 1 |
Initialization comes from Structure from Motion (SfM) — a classical algorithm that estimates a sparse point cloud (~100K–1M points) from the input images. Each SfM point becomes one Gaussian. During training, 3DGS applies adaptive densification: Gaussians with large gradients in under-reconstructed regions are split (large ones become two smaller ones) or cloned (small ones are duplicated nearby). Gaussians with opacity below 0.005 are pruned. A typical trained scene has 1–5 million Gaussians.
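The densify-and-prune step can be sketched as a single pass over the Gaussians. This is a heavily simplified illustration: it uses one isotropic scale per Gaussian, offsets new Gaussians along the x axis instead of sampling or following the gradient, and treats `grad_threshold`, `size_threshold`, and `shrink` as assumed tuning values. Only the 0.005 opacity cutoff comes from the text above.

```python
from dataclasses import dataclass, replace

@dataclass
class Gaussian:
    position: tuple
    scale: float       # single isotropic size, for brevity
    opacity: float
    grad_norm: float   # accumulated positional gradient magnitude

def densify_and_prune(gaussians, grad_threshold, size_threshold,
                      min_opacity=0.005, shrink=1.6):
    result = []
    for g in gaussians:
        x, y, z = g.position
        if g.opacity < min_opacity:
            continue  # prune: nearly transparent
        if g.grad_norm <= grad_threshold:
            result.append(g)  # already well fit: keep unchanged
        elif g.scale > size_threshold:
            # split: replace one large Gaussian with two smaller ones
            for sign in (-1.0, 1.0):
                result.append(replace(g, position=(x + sign * g.scale, y, z),
                                      scale=g.scale / shrink, grad_norm=0.0))
        else:
            # clone: keep the original and add a copy nearby
            result.append(replace(g, grad_norm=0.0))
            result.append(replace(g, position=(x + g.scale, y, z), grad_norm=0.0))
    return result

scene = [
    Gaussian((0.0, 0.0, 0.0), scale=0.10, opacity=0.001, grad_norm=0.0),  # pruned
    Gaussian((1.0, 0.0, 0.0), scale=0.10, opacity=0.8, grad_norm=0.0),    # kept
    Gaussian((2.0, 0.0, 0.0), scale=1.00, opacity=0.8, grad_norm=1.0),    # split
    Gaussian((3.0, 0.0, 0.0), scale=0.05, opacity=0.8, grad_norm=1.0),    # cloned
]
updated = densify_and_prune(scene, grad_threshold=0.5, size_threshold=0.5)
print(len(scene), "->", len(updated))
```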
Each Gaussian stores position [3], scale [3], rotation quaternion [4], opacity [1], and spherical harmonics color coefficients [48] (degree 3, which gives 16 coefficients × 3 RGB channels). That's 59 parameters per Gaussian. For 1M Gaussians: 59M parameters, or about 236 MB at float32. Rendering is pure rasterization: project all Gaussians to 2D, sort by depth per tile, alpha-composite front-to-back. No neural network at inference, just GPU rasterization. This is why it hits 100+ FPS.
Training takes ~20–40 minutes on a single GPU. The loss combines a pixel-level L1 term with a structural similarity (SSIM) term: $\mathcal{L} = (1-\lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{SSIM}}$ with $\lambda = 0.2$. Both NeRF and 3DGS optimize per scene: you train a separate model for each scene from its specific images. There is no generalization across scenes (that's what later methods like feed-forward 3DGS tackle).
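The storage arithmetic is worth writing out, since it drives most of the engineering trade-offs in 3DGS. A minimal sketch, with helper names chosen for illustration:

```python
def sh_coefficients(degree, channels=3):
    """Spherical harmonics up to `degree` have (degree + 1)^2 terms per channel."""
    return channels * (degree + 1) ** 2

def params_per_gaussian(sh_degree=3):
    position, scale, quaternion, opacity = 3, 3, 4, 1
    return position + scale + quaternion + opacity + sh_coefficients(sh_degree)

def scene_megabytes(num_gaussians, sh_degree=3, bytes_per_param=4):
    return num_gaussians * params_per_gaussian(sh_degree) * bytes_per_param / 1e6

print(params_per_gaussian())       # parameters per Gaussian at SH degree 3
print(scene_megabytes(1_000_000))  # float32 storage for one million Gaussians, in MB
```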
In 3DGS, each Gaussian has an opacity $\alpha_i$ that depends on its intrinsic opacity AND how much the pixel overlaps with the Gaussian's 2D projection. The final pixel color uses front-to-back alpha compositing, which is structurally identical to NeRF's volume rendering.
Your task: Show that 3DGS's per-pixel compositing $C = \sum_i c_i\,\alpha_i \prod_{j<i} (1 - \alpha_j)$ is mathematically equivalent to the discrete volume rendering formula from Ch1. Then explain why the effective $\alpha_i$ for a Gaussian depends on the Gaussian's 2D distance from the pixel center.
The equivalence: Both NeRF and 3DGS compute pixels the same way: $C = \sum_i \text{color}_i \times \text{opacity}_i \times \text{accumulated\_transparency}_i$, summed front to back. The math is identical. Only the source of (color, opacity) differs.
In NeRF: $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$, coming from the MLP's density output integrated over a ray step.
In 3DGS: $\alpha_i = o_i \cdot \exp\left(-\tfrac{1}{2}(p - \mu'_i)^\top \Sigma_i'^{-1} (p - \mu'_i)\right)$. This is the Gaussian's intrinsic opacity TIMES its spatial falloff at this pixel. The 2D covariance $\Sigma'$ comes from projecting the 3D covariance using the Jacobian of the camera projection: $\Sigma' = J W \Sigma W^\top J^\top$.
The deep connection: Front-to-back alpha compositing IS discrete volume rendering. Whether you discretize by sampling points along a ray (NeRF) or by sorting primitives by depth (3DGS), you're computing the same integral. 3DGS just uses a different basis — Gaussians instead of point samples.
The key insight: This shared mathematical structure is why both methods are differentiable and can be optimized with the same kind of photometric loss against the training images. They're two parameterizations of the same rendering equation. The speed difference comes entirely from the representation (MLP queries vs rasterization), not the compositing math.
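The 3DGS side of the equivalence can be sketched for a single pixel: compute each splat's effective alpha from its 2D Gaussian falloff, sort by depth, and run the same front-to-back loop used for ray samples. The dictionary layout and function names below are assumptions made for illustration; a real renderer does this per tile on the GPU.

```python
import math

def splat_alpha(pixel, mean2d, cov2d, opacity):
    """alpha = opacity * exp(-0.5 * d^T cov^-1 d) for a 2x2 covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov2d
    det = a * c - b * b
    dx, dy = pixel[0] - mean2d[0], pixel[1] - mean2d[1]
    mahalanobis_sq = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return opacity * math.exp(-0.5 * mahalanobis_sq)

def composite_pixel(pixel, splats):
    """Front-to-back alpha compositing of depth-sorted splats at one pixel."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for s in sorted(splats, key=lambda s: s["depth"]):  # nearest first
        alpha = splat_alpha(pixel, s["mean2d"], s["cov2d"], s["opacity"])
        for ch in range(3):
            color[ch] += transmittance * alpha * s["color"][ch]
        transmittance *= 1.0 - alpha
    return color, transmittance

splats = [
    {"depth": 5.0, "mean2d": (10.0, 10.0), "cov2d": ((4.0, 0.0), (0.0, 4.0)),
     "opacity": 0.9, "color": (0.0, 0.0, 1.0)},  # far, blue
    {"depth": 2.0, "mean2d": (10.0, 10.0), "cov2d": ((4.0, 0.0), (0.0, 4.0)),
     "opacity": 0.9, "color": (1.0, 0.0, 0.0)},  # near, red
]
center_color, center_T = composite_pixel((10.0, 10.0), splats)
edge_color, edge_T = composite_pixel((14.0, 10.0), splats)
print(center_color, center_T)
print(edge_color, edge_T)
```

At the splat center the near red Gaussian hides most of the blue one; two standard deviations away both alphas fall off and far more light passes through.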
The fundamental difference between NeRF and 3DGS is how they render:
| | NeRF (Ray Marching) | 3DGS (Rasterization) |
|---|---|---|
| Approach | Shoot rays, sample points, query MLP | Project primitives to screen, sort, blend |
| Representation | Implicit (continuous function) | Explicit (millions of Gaussians) |
| Training speed | Hours to days | Minutes |
| Render speed | Seconds per frame | 100+ FPS |
| Memory | Low (small MLP) | High (millions of Gaussians) |
| Quality | Excellent | Excellent (often better) |
The industry approach (2024-2025):
Static scene: Compressed 3DGS. 3DGS is the only option that can hit 90fps. But raw 3DGS is too large. Solutions: (1) Quantize SH coefficients to int8 (48 floats → 48 bytes, 4× compression). (2) Use fewer SH bands (degree 1 instead of 3, 12 coefficients instead of 48). (3) Prune low-opacity Gaussians aggressively. Result: ~50-80MB for a room, 100+ FPS on mobile GPUs.
Dynamic objects: Deformable 3DGS or tracked mesh + texture. For the cat: a per-frame deformation field that moves Gaussian centers (D-3DGS), or fall back to classical mesh tracking with neural texture. Some systems use a small MLP that predicts Gaussian offsets from a pose code, trained on a short capture sequence.
Memory: Level-of-detail (LOD). Only load full-detail Gaussians for nearby surfaces. Distant geometry uses fewer, larger Gaussians. Tile-based culling skips Gaussians outside the current view frustum — crucial for 360° scenes where only ~90° is visible at once.
Foveated rendering: The eye's periphery has much lower acuity. Render the foveal region (central 10°) at full resolution with all Gaussians. Periphery: 4× fewer pixels, skip small Gaussians. This cuts total rasterization work by ~3-4×.
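The static-scene compression options above can be compared with a back-of-the-envelope calculation. The sketch assumes only the SH coefficients are quantized and ignores any per-scene quantization metadata; `bytes_per_gaussian` is a hypothetical helper.

```python
def bytes_per_gaussian(sh_degree, sh_bytes_per_coeff, other_bytes_per_param=4):
    """Storage for one Gaussian: float32 geometry and opacity plus SH color."""
    geometry_and_opacity = 3 + 3 + 4 + 1  # position, scale, quaternion, opacity
    sh_coeffs = 3 * (sh_degree + 1) ** 2  # RGB x (degree + 1)^2
    return geometry_and_opacity * other_bytes_per_param + sh_coeffs * sh_bytes_per_coeff

num_gaussians = 1_000_000
for label, degree, sh_bytes in [("float32, SH degree 3", 3, 4),
                                ("int8 SH, degree 3", 3, 1),
                                ("int8 SH, degree 1", 1, 1)]:
    megabytes = num_gaussians * bytes_per_gaussian(degree, sh_bytes) / 1e6
    print(f"{label:22s} {megabytes:6.0f} MB")
```

Pruning then reduces `num_gaussians` itself, which scales every row linearly.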
What if you could generate a 3D object from a text prompt or a single image? Generative 3D combines NeRF/3DGS with diffusion models to create 3D content without any multi-view input.
| Method | Input | Key Idea |
|---|---|---|
| DreamFusion | Text prompt | Score Distillation Sampling (SDS) from 2D diffusion |
| Zero-1-to-3 | Single image | Viewpoint-conditioned diffusion |
| Magic3D | Text prompt | Coarse-to-fine with mesh extraction |
| LGM | 4 images | Feed-forward 3DGS generation |
| GaussianDreamer | Text prompt | SDS applied to Gaussian splatting |
A 2D diffusion model guides the optimization of a 3D NeRF representation. The NeRF renders views, the diffusion model critiques them.
The deep connection: SDS uses a 2D diffusion model as a critic for renders of a single 3D representation; the shared representation is what forces the views to agree. The score function $\nabla_x \log p(x)$ from diffusion becomes a SUPERVISION signal for the 3D representation. You're distilling the 2D prior into 3D geometry. This is the same "iterative refinement guided by a learned model" pattern seen in diffusion denoising, RLHF reward optimization, and even Kalman filtering (using a model to refine estimates).
Where else in ML do you see "use a pretrained model's gradients to optimize something else entirely"? (Hint: think about CLIP-guided generation, neural style transfer, adversarial training...)
Neural 3D representations are transforming robotics. Robots need to understand 3D geometry to grasp objects, navigate spaces, and plan motions. NeRF and 3DGS provide rich scene representations that go beyond flat depth maps.
| Application | How NeRF/3DGS Helps |
|---|---|
| Grasp planning | Dense 3D geometry for contact-rich manipulation |
| Navigation | Photorealistic simulation for training navigation policies |
| Sim-to-real | Reconstruct real environments as training scenes |
| Language-guided | Pair 3D features with CLIP for "find the mug" queries |
| Deformables | Model soft objects and cloth for manipulation |
A robot uses a 3DGS representation to plan grasps. Colored regions show semantic understanding overlaid on 3D structure.
The deep connection: 2D foundation models (CLIP, DINO, SAM) learn powerful per-pixel features from billions of images. NeRF/3DGS provides the 3D scaffolding to "lift" these 2D features into consistent 3D representations. This is the bridge between 2D perception (what VLMs do) and 3D understanding (what robots need). The pattern: use geometry to aggregate 2D information into 3D — the same principle behind multi-view stereo, Structure from Motion, and visual SLAM.
VLAs (Vision-Language-Action models) need 3D spatial reasoning but are trained on 2D images. How might 3DGS + feature fields give VLAs implicit 3D understanding without explicit 3D supervision?
You now understand how neural networks reconstruct 3D worlds from photographs. From NeRF's elegant ray marching to 3DGS's blazing-fast splatting, these techniques are redefining what's possible in graphics, VR, and robotics.