The Complete Beginner's Path

Understand NeRF & 3D Gaussian Splatting

How neural networks conjure 3D scenes from ordinary photographs — and why one approach uses rays while the other throws paint.

Prerequisites: Basic intuition for 3D space + Curiosity about graphics. That's it.
10 Chapters · 9+ Simulations · 0 Graphics Background Needed

Chapter 0: 3D from 2D

You take 50 photos of a statue from different angles. Your brain can reconstruct the 3D shape. Can a neural network do the same? This is the problem of novel view synthesis: given a set of images and their camera poses, render the scene from any new viewpoint.

Traditional approaches (structure from motion, multi-view stereo) extract explicit geometry like point clouds or meshes. Neural approaches like NeRF and 3D Gaussian Splatting take a radically different path: they learn an implicit or parametric representation of the scene that can be rendered directly.

The setup: Input = photos + camera positions. Output = a representation that can render the scene from any new camera position. No 3D scanner needed — just ordinary photographs.
Multi-View Input

Multiple cameras observe a 3D object from different angles. The goal: reconstruct what the object looks like from any viewpoint.

Check: What is the input to a NeRF or 3DGS system?

Chapter 1: Volume Rendering

To render a pixel, you shoot a ray from the camera through the scene. Along this ray, you sample points and ask: "What color and density exists here?" Then you composite these samples from front to back using the volume rendering equation.

C(r) = ∫ T(t) · σ(t) · c(t) dt    where T(t) = exp(−∫_0^t σ(s) ds)

σ is density (how opaque the material is), c is color, and T is transmittance (how much light makes it through to this point). Dense regions block light; empty space passes it through.

In practice, we discretize this integral. For N sample points along a ray, the discrete volume rendering equation is:

C = Σ_i T_i · α_i · c_i    where α_i = 1 − exp(−σ_i · δ_i)    and T_i = ∏_{j<i} (1 − α_j)

Let's work a concrete example with 3 samples. Suppose σ = [0.1, 5.0, 0.2] and the step size δ = 0.1 for all samples. Colors are c1 = red, c2 = green, c3 = blue.

| Sample | σ | α = 1 − exp(−σδ) | T (accumulated) | Weight = T·α |
| --- | --- | --- | --- | --- |
| 1 (red) | 0.1 | 0.010 | 1.000 | 0.010 |
| 2 (green) | 5.0 | 0.393 | 0.990 | 0.390 |
| 3 (blue) | 0.2 | 0.020 | 0.600 | 0.012 |

Sample 2 dominates — it has high density (σ=5.0) and most of the light hasn't been absorbed yet (T=0.99). So the final pixel is mostly green. This is how NeRF renders: evaluate every sample, weight by density and transmittance, sum up.

Ray Marching Through a Volume

A ray travels through a scene, sampling density and color at each point. Denser regions contribute more to the final pixel color.

The key insight: Volume rendering is differentiable. This means we can use gradient descent to optimize whatever produces the density and color values — and that "whatever" is the neural network in NeRF.
Check: In volume rendering, what does transmittance T(t) represent?
🔨 Derivation Derive the Transmittance Formula from Beer-Lambert Law

The volume rendering integral uses transmittance T(t) = exp(−∫_0^t σ(s) ds). This isn't arbitrary: it comes from the Beer-Lambert law of light absorption through a medium.

Your task: Starting from "a thin slab of thickness dt with density σ absorbs a fraction σ·dt of the remaining light," derive the continuous transmittance formula T(t) = exp(−∫_0^t σ(s) ds) and then discretize it into T_i = ∏_{j<i} (1 − α_j).

If I(t) is the light intensity remaining after traveling distance t through a medium with density σ(t), and a thin slab absorbs a fraction σ(t)·dt of the light: write dI/dt = ?. What's the relationship between I lost and the density?
You should have dI/I = −σ(t)dt. Integrate both sides. The left side gives ln(I(t)/I(0)). What does T(t) = I(t)/I(0) equal?
For a discrete step from t_i to t_{i+1}, the fraction of light absorbed in that step is 1 − exp(−σ_i·δ_i). Call this α_i. What fraction passes through? How does the product of all pass-through fractions give you T_i?

Step 1: Beer-Lambert law says a thin slab absorbs proportionally to its density: dI = −σ(t) · I(t) · dt

Step 2: Rearrange: dI/I = −σ(t) dt. Integrate from 0 to t: ln(I(t)) − ln(I(0)) = −∫_0^t σ(s) ds

Step 3: Exponentiate: I(t)/I(0) = exp(−∫_0^t σ(s) ds). This ratio IS the transmittance: T(t) = exp(−∫_0^t σ(s) ds)

Step 4 (discretize): Break the ray into steps of size δ_i. Each step transmits a fraction exp(−σ_i·δ_i) = (1 − α_i) of the light. The total transmittance to sample i is the product of all preceding pass-through fractions: T_i = ∏_{j<i} (1 − α_j)

The key insight: The exponential decay form is not a choice — it's the unique solution to "each slab absorbs proportionally to its density." The product form in discrete rendering is just the discretized version of the same physics.
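
Steps 3 and 4 can be checked numerically. The sketch below is only an illustration: the density profile `sigma` is a made-up example rather than anything from the lesson, and the integral is approximated with the midpoint rule. It shows that the product of per-step pass-through fractions matches the exponential of the integrated density.

```python
import numpy as np

def sigma(t):
    # Hypothetical density profile: a soft "surface" centred at t = 0.5.
    return 4.0 * np.exp(-((t - 0.5) ** 2) / 0.02)

# Discretize the ray segment [0, 1] into N equal steps.
N = 1000
edges = np.linspace(0.0, 1.0, N + 1)
deltas = np.diff(edges)                  # step sizes delta_i
mids = 0.5 * (edges[:-1] + edges[1:])    # one sample at each step's midpoint
sigmas = sigma(mids)

# Discrete form: product of per-step pass-through fractions (1 - alpha_j).
alphas = 1.0 - np.exp(-sigmas * deltas)
T_product = np.prod(1.0 - alphas)

# Continuous form: exp(-integral of sigma), with a midpoint-rule integral.
T_continuous = np.exp(-np.sum(sigmas * deltas))

print(T_product, T_continuous)
```

The two numbers agree because a product of exponentials is the exponential of a sum, which is exactly the step from Step 3 to Step 4.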

💻 Build It Implement Volume Rendering from Scratch
You've seen the discrete volume rendering equation: C = Σ_i T_i·α_i·c_i. Now implement it. Given arrays of densities, step sizes, and colors along a ray, compute the final pixel color.
Signature:
```python
def volume_render(sigmas, deltas, colors):
    """
    Args:
        sigmas: np.array of shape [N] - density at each sample
        deltas: np.array of shape [N] - step size between samples
        colors: np.array of shape [N, 3] - RGB color at each sample
    Returns:
        pixel_color: np.array of shape [3] - final composited RGB
        weights: np.array of shape [N] - per-sample weight (T_i * alpha_i)
    """
```
Test case
sigmas = [0.1, 5.0, 0.2], deltas = [0.1, 0.1, 0.1]
colors = [[1,0,0], [0,1,0], [0,0,1]]
Expected weights ≈ [0.010, 0.390, 0.012]
Expected pixel ≈ [0.010, 0.390, 0.012] (mostly green)
T_0 = 1 (no light absorbed before the first sample). T_1 = (1−α_0). T_2 = (1−α_0)(1−α_1). Use np.cumprod on (1−α) and prepend a 1 at the front, dropping the last element.
```python
import numpy as np

def volume_render(sigmas, deltas, colors):
    sigmas = np.array(sigmas)
    deltas = np.array(deltas)
    colors = np.array(colors)

    # Step 1: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)

    # Step 2: T_i = product of (1-alpha_j) for j < i
    # T[0] = 1, T[1] = (1-a[0]), T[2] = (1-a[0])(1-a[1]), ...
    transmittance = np.cumprod(1.0 - alphas)
    transmittance = np.concatenate([[1.0], transmittance[:-1]])

    # Step 3: weights and final color
    weights = transmittance * alphas  # [N]
    pixel_color = (weights[:, None] * colors).sum(axis=0)  # [3]

    return pixel_color, weights
```
Bonus: What happens to the weights if you have a single extremely dense sample (σ=100)? What if ALL samples have density 0? Trace through the math to verify your implementation handles these edge cases.
Checkpoint — Before you move on
Explain in your own words: why does the transmittance T_i DECREASE along the ray, and what happens to later samples' contributions when an early sample has very high density? Why does this make physical sense?
Model Answer

T_i decreases because each sample absorbs some light. If sample 2 has σ=100 with δ=0.1, then α_2 = 1−exp(−10) ≈ 1.0 (almost fully opaque). T_3 = T_2·(1−α_2) ≈ T_2·0 ≈ 0. So sample 3 contributes NOTHING to the pixel: all light was absorbed before reaching it. This is exactly how a solid wall works: you can't see what's behind it because the wall absorbs all incoming light. Volume rendering naturally handles occlusion through this exponential decay of transmittance.
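
The occlusion argument, and the two edge cases from the bonus question, can be traced in a few lines. This is a minimal sketch: `render_weights` is a helper name chosen here for illustration, and it repeats the same weight formula as the chapter so the snippet runs on its own.

```python
import numpy as np

def render_weights(sigmas, deltas):
    """Per-sample weights T_i * alpha_i for one ray (same math as the chapter)."""
    sigmas = np.asarray(sigmas, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i = product of (1 - alpha_j) over j < i, with T_0 = 1.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return trans * alphas

deltas = [0.1, 0.1, 0.1]

# Case 1: an opaque "wall" at the second sample hides the third.
wall = render_weights([0.1, 100.0, 0.2], deltas)

# Case 2: empty space. Every weight is zero, so the pixel stays black.
empty = render_weights([0.0, 0.0, 0.0], deltas)

print(wall)
print(empty, empty.sum())
```

In the wall case nearly all of the weight lands on the second sample and the third is left with a negligible share, which matches the model answer above.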

Chapter 2: Neural Radiance Fields (NeRF)

NeRF represents a 3D scene as a continuous function: given a 3D point (x, y, z) and viewing direction (θ, φ), it outputs the color (r, g, b) and density σ at that point. This function is parameterized by a simple MLP (multilayer perceptron).

F_Θ(x, y, z, θ, φ) → (r, g, b, σ)    (Θ denotes the network weights, not the viewing angle θ)
Input: (x, y, z) position + (θ, φ) view direction
MLP Network: 8 layers, 256 channels, skip connections
Output: (r, g, b) color + σ density
NeRF: Query the Scene Function

Click anywhere in the scene to query the NeRF network at that point. It returns color and density.

Why viewing direction? The same surface looks different from different angles (specular highlights, reflections). By conditioning on view direction, NeRF can represent view-dependent effects like the shine on a glossy object.
Realization: the computational cost. To render a single 800×800 image, you shoot one ray per pixel: 640,000 rays. Each ray samples 64 points (coarse) + 128 points (fine) = 192 MLP evaluations per ray. Total: 640,000 × 192 = ~123 million MLP evaluations per image. Each evaluation passes through 8 fully connected layers, each 256 units wide. That's why original NeRF takes ~30 seconds per frame. The MLP itself is tiny (~1.2M parameters), but you evaluate it 100+ million times.
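
The arithmetic behind that estimate is easy to reproduce. The snippet below only restates the chapter's own numbers; the derived "evaluations per second" figure simply divides by the quoted ~30 seconds per frame.

```python
# Rendering cost for one frame, using the chapter's numbers.
width, height = 800, 800
coarse_samples, fine_samples = 64, 128
seconds_per_frame = 30                               # rough figure quoted in the text

rays = width * height                                # one ray per pixel
samples_per_ray = coarse_samples + fine_samples      # MLP queries per ray
mlp_evals = rays * samples_per_ray
evals_per_second = mlp_evals / seconds_per_frame

print(f"{rays:,} rays x {samples_per_ray} samples = {mlp_evals:,} MLP evaluations")
print(f"about {evals_per_second / 1e6:.1f} million evaluations per second")
```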

The training loop is straightforward: randomly pick a training image, randomly sample rays from it, render those rays using volume rendering, compare the rendered pixel colors to the ground-truth pixels using MSE loss, and backpropagate. Training from ~50–200 posed images takes about 1–2 days on a single GPU.

Check: What does a NeRF MLP take as input?

Chapter 3: Positional Encoding & the MLP

MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. Positional encoding solves this by mapping the input coordinates to a higher-dimensional space using sinusoidal functions.

γ(p) = [sin(2^0 πp), cos(2^0 πp), ..., sin(2^(L−1) πp), cos(2^(L−1) πp)]
Positional Encoding: Low vs High Frequency

Without positional encoding, the MLP can only learn smooth blobs. With it, sharp edges and fine detail emerge. Adjust L (number of frequency bands).


NeRF uses L=10 frequency bands for position (x,y,z) and L=4 for viewing direction (θ,φ). For each coordinate, the encoding produces 2L values (sin and cos at each frequency), and the raw coordinate is concatenated alongside them. So a 3D position [3] becomes [3 + 3×2×10] = [63] values, and a 2D direction [2] becomes [2 + 2×2×4] = [18] values. The MLP input is 63+18 = 81 dimensions. (The original NeRF implementation passes the direction as a 3D unit vector rather than two angles, which gives 3 + 3×2×4 = 27 values and 90 in total.) Why fewer bands for direction? View-dependent effects (like specular highlights) are lower-frequency than geometry, so you don't need sharp angular precision.
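
As a sketch of how those dimension counts arise, here is one possible NumPy version of the encoding. The function name `positional_encoding` and the `include_input` flag are choices made for this example; the output groups all sines before all cosines for each coordinate, which differs from the interleaved order in the formula but carries the same values.

```python
import numpy as np

def positional_encoding(p, num_bands, include_input=True):
    """Encode each coordinate of p with sin/cos at frequencies 2^0*pi ... 2^(L-1)*pi.

    p: array of shape [..., D]. Returns shape [..., D + 2*D*num_bands]
    when include_input is True, else [..., 2*D*num_bands].
    """
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi                     # [L]
    angles = p[..., None] * freqs                                     # [..., D, L]
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)   # [..., D, 2L]
    enc = enc.reshape(*p.shape[:-1], -1)                              # [..., 2*D*L]
    return np.concatenate([p, enc], axis=-1) if include_input else enc

xyz = np.array([0.25, -0.5, 0.75])   # a 3D position (example values)
view = np.array([0.3, 1.2])          # (theta, phi), as in the text

print(positional_encoding(xyz, 10).shape)   # (63,)
print(positional_encoding(view, 4).shape)   # (18,)
```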

Same idea as transformers: This is the same positional encoding concept used in transformers, but applied to 3D spatial coordinates. Higher frequencies let the network distinguish between nearby points, which is essential for sharp geometry. Without PE, two points 1mm apart produce nearly identical MLP inputs, so the network can't tell them apart. With PE at L=10, the highest frequency (sin(2^9 πx)) completes 512 oscillations across a scene normalized to [−1, 1], which is plenty of resolution for fine detail.
Check: Why does NeRF need positional encoding?
🔨 Derivation Why Sinusoidal Encoding Enables High-Frequency Learning

An MLP with ReLU activations is biased toward low-frequency functions (the "spectral bias" of neural networks). Positional encoding fixes this by mapping inputs to sinusoids at increasing frequencies.

Your task: Show that for two nearby points x and x+ε, their raw inputs are nearly identical (|x − (x+ε)| = ε), but after positional encoding at frequency 2^k π, the encoded inputs can differ by up to 2^k πε. Explain why this means the MLP can now distinguish points that are roughly 1/2^k apart.

d/dx sin(2^k πx) = 2^k π cos(2^k πx). The maximum rate of change of the k-th frequency band is 2^k π. So even for tiny ε, the encoded value changes by up to 2^k πε.
sin(2^k πx) changes fastest when cos(2^k πx) = ±1 (crossing zero) and is stationary when cos(2^k πx) = 0 (at peaks/troughs). That's why we use BOTH sin and cos: when one is flat, the other is steep. Together they always provide discrimination.
The Nyquist theorem says we can distinguish features at half the period of our highest frequency. The period of sin(2^k πx) is 2/2^k. So with L=10, the highest frequency (k=9) has period 1/256, and we can resolve features about 1/512 of a unit length apart. For a 1-meter scene, that's ~2mm detail.

Step 1: Without PE, two points at x=0.500 and x=0.501 give the MLP inputs that differ by 0.001. The MLP must squeeze all scene detail into this tiny input range — it's like painting the Mona Lisa with a 3-pixel brush.

Step 2: With PE at frequency band k: γ_k(x) = sin(2^k πx). The difference between our two points, to first order: sin(2^k π·0.501) − sin(2^k π·0.500) ≈ 2^k π·0.001·cos(2^k π·0.500).

Step 3: At k=9 (the highest band in NeRF), the first-order estimate gives 512π·0.001 ≈ 1.6. The step is no longer small compared with the period here, so the exact difference saturates at sin(0.512π) − sin(256π) ≈ 1.0. Either way, a change of order 1 in the MLP's input space is HUGE: comparable to moving across the entire scene in the raw coordinate. The MLP now "sees" these nearby points as far apart in its feature space.

Step 4: The multi-scale structure matters. Low frequencies (k=0,1) capture coarse structure. High frequencies (k=8,9) capture fine edges. The MLP learns to combine these scales, much like Fourier analysis decomposes signals into frequency components.

The key insight: Positional encoding is not "adding information" — the input already contains the position. It's AMPLIFYING the difference between nearby positions so the MLP's smooth learned function can capture sharp transitions. It converts a spatial resolution problem into a function approximation problem the MLP can actually solve.
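
To see the amplification band by band, the short script below evaluates both the exact difference and the first-order estimate from Step 2 for the two points used in the derivation. It is a plain numerical check with no assumptions beyond the lesson's own values (x = 0.500, x = 0.501, bands k = 0 to 9).

```python
import numpy as np

x_a, x_b = 0.500, 0.501              # the two nearby points from Step 1
ks = np.arange(10)                   # bands k = 0 .. 9 (L = 10)
freqs = (2.0 ** ks) * np.pi          # angular frequencies 2^k * pi

exact = np.sin(freqs * x_b) - np.sin(freqs * x_a)
linear = freqs * (x_b - x_a) * np.cos(freqs * x_a)   # first-order estimate

for k, e, l in zip(ks, exact, linear):
    print(f"k={k}: exact diff = {e:+.4f}, linear estimate = {l:+.4f}")
```

The low bands barely move, while the top band changes by roughly 1. At k = 9 the linear estimate (about 1.6) overshoots the exact value (about 1.0) because the step is no longer small relative to the period, so the sine has already curved over.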

💥 Break-It Lab What Dies When You Remove NeRF Components?
A working NeRF produces sharp, view-dependent renderings. Each component serves a specific purpose. Toggle them off and observe what dies.
Remove Positional Encoding
Failure mode: Without PE, the MLP can only learn smooth, low-frequency functions. Sharp edges disappear. Fine textures blur into uniform color blobs. The rendered image looks like it was shot through frosted glass — the overall structure is visible but all detail is gone. PSNR drops 3-5 dB.
Remove View-Dependent Color
Failure mode: Without view direction input, color cannot change with viewing angle. Specular highlights, reflections, and glossy surfaces become flat and matte. A shiny car looks like matte plastic. The model can still reconstruct diffuse (Lambertian) surfaces perfectly, but anything glossy or reflective is wrong.
Reduce to 8 Samples/Ray
Failure mode: With too few samples, the discrete volume rendering approximation becomes coarse. Thin structures (fences, hair, wires) disappear because no sample lands on them. Surfaces get "banded" artifacts where the step size is visible. The noise is worst in regions with rapid density changes (object edges).

Chapter 4: Sampling Strategies

Evaluating the MLP at every point along a ray is expensive. NeRF uses a hierarchical sampling strategy: first, sample uniformly (coarse pass), then concentrate more samples where density is high (fine pass). This focuses compute where it matters.

Coarse Pass: 64 uniform samples along the ray
Fine Pass: 128 additional samples concentrated near surfaces
Hierarchical Sampling

Blue dots = coarse uniform samples. Green dots = fine importance-weighted samples near the surface.

Importance sampling intuition: Why waste evaluations in empty air? After the coarse pass reveals where the surface is, the fine pass puts most of its budget right at the surface boundary — exactly where the details matter.
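
One way to implement the fine pass is inverse-transform sampling: treat the coarse weights as a probability distribution along the ray and draw new depths from it. The sketch below follows that idea. The function name `sample_fine`, the near/far depths, and the synthetic coarse weights are all assumptions for illustration, not values from the lesson or from any particular NeRF codebase.

```python
import numpy as np

def sample_fine(bin_edges, coarse_weights, num_fine, rng):
    """Draw fine sample depths by inverse-transform sampling the coarse weights.

    bin_edges: [N+1] depths bounding the N coarse intervals.
    coarse_weights: [N] weights T_i * alpha_i from the coarse pass.
    """
    w = np.asarray(coarse_weights, dtype=float) + 1e-5   # avoid an all-zero PDF
    pdf = w / w.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])        # [N+1], rising from 0 to 1
    u = rng.uniform(0.0, 1.0, size=num_fine)
    # Piecewise-linear inverse CDF: map each uniform u back to a depth.
    return np.sort(np.interp(u, cdf, bin_edges))

rng = np.random.default_rng(0)
bin_edges = np.linspace(2.0, 6.0, 65)                    # 64 coarse bins (made-up near/far)
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
# Hypothetical coarse weights: a surface near depth 4.4 (60% along the ray).
coarse_weights = np.exp(-((centers - 4.4) ** 2) / 0.01)

fine = sample_fine(bin_edges, coarse_weights, 128, rng)
near_surface = np.mean(np.abs(fine - 4.4) < 0.3)
print(f"{near_surface:.0%} of fine samples fall within 0.3 of the surface")
```

The small constant added to the weights keeps the CDF strictly increasing, so empty regions still receive an occasional sample instead of none at all.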
Check: Why does NeRF use hierarchical sampling?

Chapter 5: Speed — Instant-NGP

Original NeRF takes a day or more to train and tens of seconds to render a single frame. Instant-NGP (NVIDIA, 2022) slashed training to seconds and rendering to real-time by replacing most of the large MLP with a multi-resolution hash table of learned features, followed by a much smaller MLP.

Instead of pushing every point through 8 wide layers, Instant-NGP looks up learned feature vectors in hash tables indexed by spatial position, interpolates them, and feeds the result to the tiny network. The lookups are massively parallel and cache-friendly.

| Method | Training Time | Render Speed | Key Technique |
| --- | --- | --- | --- |
| Original NeRF | ~1 day | ~30 s/frame | Deep MLP |
| Instant-NGP | ~5 seconds | Real-time | Hash grid encoding |
| TensoRF | ~30 min | ~1 s/frame | Tensor factorization |
| Plenoxels | ~11 min | ~15 FPS | Sparse voxel grid |
MLP vs Hash Grid Lookup

Compare the two approaches: deep MLP requires sequential layers, hash grid is a fast parallel lookup. Toggle to compare.

The hash trick: At each resolution level, spatial positions are hashed to a fixed-size table. Collisions are resolved by the neural network learning to work around them. This trades a tiny bit of quality for massive speed gains.
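
A minimal sketch of the hashing step, for one resolution level: integer grid coordinates are multiplied by per-axis constants, XORed together, and reduced modulo the table size. The constants are the ones given in the Instant-NGP paper; the table size, feature width, and the function name `hash_grid_index` are example choices, and the real system adds interpolation between the eight surrounding grid corners plus several resolution levels.

```python
import numpy as np

# Per-axis multipliers as given in the Instant-NGP paper's spatial hash.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_grid_index(ijk, table_size):
    """Map integer grid coordinates [..., 3] to slots in a table of table_size entries."""
    ijk = np.asarray(ijk, dtype=np.uint64)
    mixed = ijk * PRIMES                          # per-axis multiply
    h = mixed[..., 0] ^ mixed[..., 1] ^ mixed[..., 2]
    return (h % np.uint64(table_size)).astype(np.int64)

table_size = 2 ** 14                                       # slots per level (example value)
features = np.zeros((table_size, 2), dtype=np.float32)     # 2 learned features per slot

corners = np.array([[10, 20, 30], [10, 20, 31], [500, 7, 3]])
slots = hash_grid_index(corners, table_size)
print(slots, features[slots].shape)
```

Two different grid cells can land in the same slot. Nothing in the lookup detects that; as the text says, training is left to work around such collisions.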
Check: What makes Instant-NGP so much faster than original NeRF?

Chapter 6: 3D Gaussian Splatting

3DGS takes a completely different approach from NeRF. Instead of an implicit function queried along rays, it represents the scene as millions of 3D Gaussians — each one a colored, oriented ellipsoid with position, covariance, color, and opacity.

To render: project each Gaussian onto the screen (splatting), sort by depth, and alpha-composite them front to back. No ray marching, no MLP evaluation — just rasterization. This is extremely fast.

G(x) = exp(−½ (x − μ)^T Σ^(−1) (x − μ))
Gaussian Splatting Visualization

Each ellipse is a 3D Gaussian projected to 2D. More Gaussians = more detail. Adjust the count and see them splat!

| Per-Gaussian Parameter | Meaning | Count |
| --- | --- | --- |
| Position μ | 3D center of the Gaussian | 3 |
| Covariance Σ | Shape, size, orientation (via quaternion + scale) | 7 |
| Color | Spherical harmonics coefficients | 48 |
| Opacity α | Transparency | 1 |
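
The "quaternion + scale" entry in the table means the covariance is not stored directly. A common construction, used here as a sketch, is Σ = R S Sᵀ Rᵀ, where R is the rotation from the quaternion and S is a diagonal matrix of the three scales (4 + 3 = 7 numbers). The helper names and the example values below are made up for illustration.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion q = (w, x, y, z); normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Sigma = R S S^T R^T: symmetric and positive semi-definite by construction."""
    M = quat_to_rotmat(quat) @ np.diag(scale)
    return M @ M.T

# Example values (made up): a flat, elongated Gaussian rotated 45 degrees about z.
scale = np.array([0.5, 0.1, 0.02])
quat = np.array([np.cos(np.pi / 8), 0.0, 0.0, np.sin(np.pi / 8)])
Sigma = covariance(scale, quat)

print(np.round(Sigma, 4))
print(np.linalg.eigvalsh(Sigma))   # eigenvalues equal the squared scales
```

Optimizing scale and rotation separately keeps Σ a valid covariance after every gradient step, which would not be guaranteed if the six unique matrix entries were updated directly.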

Initialization comes from Structure from Motion (SfM) — a classical algorithm that estimates a sparse point cloud (~100K–1M points) from the input images. Each SfM point becomes one Gaussian. During training, 3DGS applies adaptive densification: Gaussians with large gradients in under-reconstructed regions are split (large ones become two smaller ones) or cloned (small ones are duplicated nearby). Gaussians with opacity below 0.005 are pruned. A typical trained scene has 1–5 million Gaussians.

Realization — parameter budget: Each Gaussian stores: mean [3], scale [3], rotation quaternion [4], opacity [1], and spherical harmonics color coefficients [48] (degree 3, which gives 16 coefficients × 3 RGB channels). That's ~59 parameters per Gaussian. For 1M Gaussians: ~59M parameters, or about 230 MB at float32. Rendering is pure rasterization: project all Gaussians to 2D, sort by depth per tile, alpha-composite front-to-back. No neural network at inference — just GPU rasterization. This is why it hits 100+ FPS.
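
The parameter budget can be tallied directly. The counts below come from the breakdown in the text; the Gaussian counts in the loop are example sizes within the "1–5 million" range mentioned earlier.

```python
# Per-Gaussian parameter count, following the breakdown in the text.
params = {
    "mean": 3,
    "scale": 3,
    "rotation_quaternion": 4,
    "opacity": 1,
    "sh_color": 16 * 3,   # degree-3 SH: (3 + 1)^2 = 16 coefficients per RGB channel
}
per_gaussian = sum(params.values())
bytes_per_float = 4       # float32

for count in (1_000_000, 3_000_000, 5_000_000):
    megabytes = count * per_gaussian * bytes_per_float / 1e6
    print(f"{count:>9,} Gaussians: {per_gaussian} params each, {megabytes:,.0f} MB")
```

At one million Gaussians this comes to 236 MB, in line with the roughly 230 MB quoted above.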
Why Gaussians? They're differentiable (can be optimized with gradient descent), fast to project (closed-form 2D projection), and naturally handle smooth surfaces. Plus, they tile-sort beautifully on GPUs.

Training takes ~20–40 minutes on a single GPU. The loss combines a pixel-level L1 term with a structural-similarity term (D-SSIM): L = (1−λ)·L_1 + λ·L_D-SSIM with λ=0.2. Both NeRF and 3DGS optimize per-scene: you train a separate model for each scene from its specific images. There is no generalization across scenes (that's what later methods like feed-forward 3DGS tackle).

Check: How does 3DGS render an image?
🔨 Derivation 3DGS Alpha Compositing: From Gaussians to Pixels

In 3DGS, each Gaussian has an opacity αi that depends on its intrinsic opacity AND how much the pixel overlaps with the Gaussian's 2D projection. The final pixel color uses front-to-back alpha compositing, which is structurally identical to NeRF's volume rendering.

Your task: Show that 3DGS's per-pixel compositing C = Σ_i c_i · α_i · ∏_{j<i} (1−α_j) is mathematically equivalent to the discrete volume rendering formula from Ch1. Then explain why the effective α_i for a Gaussian depends on the Gaussian's 2D distance from the pixel center.

NeRF: C = Σ_i T_i α_i c_i where T_i = ∏_{j<i} (1−α_j). 3DGS: C = Σ_i c_i α_i ∏_{j<i} (1−α_j). They're identical! The ∏ term IS the transmittance. The only difference is what determines α_i.
A 2D Gaussian centered at μ' with covariance Σ' evaluated at pixel position p gives: G(p) = exp(−½ (p−μ')^T Σ'^(−1) (p−μ')). The effective alpha at that pixel is: α_eff = o_i · G(p), where o_i is the Gaussian's learned opacity. Far from the center, G(p) → 0, so the Gaussian doesn't affect that pixel.
The ∏_{j<i} (1−α_j) term means ORDER matters. Gaussian j must be in front of i (smaller depth) to block it. If you sort wrong, a Gaussian that is farther away could "occlude" one in front of it. 3DGS sorts per 16×16 tile, not globally: O(N log N) per tile, not per pixel.

The equivalence: Both NeRF and 3DGS compute pixels the same way: C = Σ_i (front-to-back) color_i × opacity_i × accumulated_transparency_i. The math is identical. Only the source of (color, opacity) differs.

In NeRF: α_i = 1−exp(−σ_i δ_i), coming from the MLP's density output integrated over a ray step.

In 3DGS: α_i = o_i · exp(−½ (p−μ'_i)^T Σ'_i^(−1) (p−μ'_i)). This is the Gaussian's intrinsic opacity TIMES its spatial falloff at this pixel. The 2D covariance Σ' comes from projecting the 3D covariance: Σ' = J W Σ W^T J^T, where W is the world-to-camera (viewing) transform and J is the Jacobian of the camera projection.

The deep connection: Front-to-back alpha compositing IS discrete volume rendering. Whether you discretize by sampling points along a ray (NeRF) or by sorting primitives by depth (3DGS), you're computing the same integral. 3DGS just uses a different basis — Gaussians instead of point samples.

The key insight: This shared mathematical structure is why both methods are differentiable and can be optimized with the same kind of pixel-wise photometric loss. They're two parameterizations of the same rendering equation. The speed difference comes entirely from the representation (MLP queries vs rasterization), not the compositing math.
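
The derivation can be made concrete for a single pixel. The sketch below evaluates the effective alpha o_i · G(p) of each projected Gaussian and composites front to back; the two splats and their values are invented for the example, and a real renderer would do this per tile on the GPU rather than in a Python loop.

```python
import numpy as np

def gaussian_alpha(pixel, mean2d, cov2d, opacity):
    """Effective alpha o_i * G(p) of one projected Gaussian at one pixel."""
    d = np.asarray(pixel, dtype=float) - mean2d
    return opacity * np.exp(-0.5 * d @ np.linalg.inv(cov2d) @ d)

def composite(pixel, gaussians):
    """Front-to-back alpha compositing; `gaussians` must be sorted near to far."""
    color = np.zeros(3)
    transmittance = 1.0                      # T_i = prod_{j<i} (1 - alpha_j)
    for g in gaussians:
        alpha = gaussian_alpha(pixel, g["mean2d"], g["cov2d"], g["opacity"])
        color += transmittance * alpha * np.asarray(g["color"], dtype=float)
        transmittance *= 1.0 - alpha
    return color, transmittance

# Two made-up splats, already depth-sorted: a red one in front of a blue one.
splats = [
    {"mean2d": np.array([8.0, 8.0]), "cov2d": np.diag([4.0, 4.0]),
     "opacity": 0.8, "color": [1, 0, 0]},
    {"mean2d": np.array([9.0, 8.0]), "cov2d": np.diag([9.0, 2.0]),
     "opacity": 0.9, "color": [0, 0, 1]},
]
color, remaining = composite([8.0, 8.0], splats)
print(color, remaining)
```

The loop body is the same weight computation as the volume-rendering code in Chapter 1, with `alpha` coming from a Gaussian falloff instead of 1 − exp(−σδ).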

Chapter 7: Rasterization vs Ray Marching

The fundamental difference between NeRF and 3DGS is how they render:

| Aspect | NeRF (Ray Marching) | 3DGS (Rasterization) |
| --- | --- | --- |
| Approach | Shoot rays, sample points, query MLP | Project primitives to screen, sort, blend |
| Representation | Implicit (continuous function) | Explicit (millions of Gaussians) |
| Training speed | Hours to days | Minutes |
| Render speed | Seconds per frame | 100+ FPS |
| Memory | Low (small MLP) | High (millions of Gaussians) |
| Quality | Excellent | Excellent (often better) |
Ray Marching vs Rasterization

Toggle between the two rendering paradigms to see the conceptual difference.

Realization — why 1000× faster: The speed gap isn't about better algorithms — it's about what hardware does well. NeRF renders per-pixel: for each of 640K pixels, march a ray, evaluate an MLP ~192 times. That's 123M sequential-ish neural network forward passes — compute-bound on matrix multiplications. 3DGS renders per-primitive: project 1M Gaussians to 2D (embarrassingly parallel), sort them per 16×16 tile, alpha-composite. This is pure rasterization — the exact workload GPUs were designed for over 30 years. No neural network runs at inference time. The bottleneck shifts from compute to memory bandwidth, and modern GPUs have enormous bandwidth.
The verdict (2024+): 3DGS has largely overtaken NeRF for most practical applications due to its speed advantage. But NeRF-style implicit representations still win for memory-constrained scenarios and certain generative tasks.
Check: What is the main practical advantage of 3DGS over NeRF?
⚔ Adversarial: Your NeRF produces sharp images from training views but completely fails on novel views 30° away. Training loss is excellent. What's wrong?
You've trained a NeRF on 50 images of a shiny car. From the training camera positions, renders look photorealistic (PSNR > 32dB). But when you move the virtual camera 30° from any training view, the rendering shows severe artifacts: floating blobs, incorrect specular highlights, and geometry that seems to "follow" the camera.
🏗 Design Challenge You're the Architect: Real-Time VR 3D Reconstruction
You're building a 3D reconstruction system for a VR headset. Users walk around a room, and the headset must render photorealistic views of previously captured objects from any angle, in real time. The scene includes both static objects (furniture) and dynamic objects (a pet cat moving around).
Frame rate: 90 FPS stereo (two eyes = 180 renders/sec)
Latency: <11ms motion-to-photon
Resolution: 2160×2160 per eye
Scene: 360° room, ~20m², static + dynamic objects
Hardware: Mobile GPU (Quest 3 class: ~2 TFLOPS)
Memory: 6 GB VRAM shared with OS
1. NeRF vs 3DGS vs hybrid for the static environment? Consider render budget: 11ms for ~4.7M pixels per eye at 2 TFLOPS.
2. How do you handle the dynamic cat? You can't retrain per-frame. What representation allows real-time updates?
3. 1M Gaussians at 230MB won't fit alongside the OS. How do you compress? What quality do you sacrifice?
4. The user can look anywhere in 360°. Do you render the full scene or use foveated/culled rendering? How?

The industry approach (2024-2025):

Static scene: Compressed 3DGS. 3DGS is the only option that can hit 90fps. But raw 3DGS is too large. Solutions: (1) Quantize SH coefficients to int8 (48 floats → 48 bytes, 4× compression). (2) Use fewer SH bands (degree 1 instead of 3, 12 coefficients instead of 48). (3) Prune low-opacity Gaussians aggressively. Result: ~50-80MB for a room, 100+ FPS on mobile GPUs.
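
A rough per-Gaussian byte count shows how far options (1) and (2) go. This is a back-of-the-envelope sketch under one stated assumption: position, scale, rotation, and opacity stay at float32 and only the SH color coefficients are quantized or truncated. The function name is made up for the example.

```python
def bytes_per_gaussian(sh_degree, sh_bytes, other_bytes=4):
    """Bytes for one Gaussian: 11 geometry/opacity values plus SH color."""
    geometry = 3 + 3 + 4 + 1                    # mean, scale, quaternion, opacity
    sh_coeffs = 3 * (sh_degree + 1) ** 2        # RGB x (degree + 1)^2
    return geometry * other_bytes + sh_coeffs * sh_bytes

configs = {
    "float32, SH degree 3": bytes_per_gaussian(3, sh_bytes=4),
    "int8 SH, degree 3": bytes_per_gaussian(3, sh_bytes=1),
    "int8 SH, degree 1": bytes_per_gaussian(1, sh_bytes=1),
}
for name, nbytes in configs.items():
    print(f"{name}: {nbytes} B per Gaussian, {nbytes} MB per million Gaussians")
```

Under these assumptions a million Gaussians shrink from 236 MB to 92 MB or 56 MB, so reaching the quoted 50-80 MB for a room also depends on how many Gaussians survive pruning.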

Dynamic objects: Deformable 3DGS or tracked mesh + texture. For the cat: a per-frame deformation field that moves Gaussian centers (D-3DGS), or fall back to classical mesh tracking with neural texture. Some systems use a small MLP that predicts Gaussian offsets from a pose code, trained on a short capture sequence.

Memory: Level-of-detail (LOD). Only load full-detail Gaussians for nearby surfaces. Distant geometry uses fewer, larger Gaussians. Tile-based culling skips Gaussians outside the current view frustum — crucial for 360° scenes where only ~90° is visible at once.

Foveated rendering: The eye's periphery has much lower acuity. Render the foveal region (central 10°) at full resolution with all Gaussians. Periphery: 4× fewer pixels, skip small Gaussians. This cuts total rasterization work by ~3-4×.

Chapter 8: Generative 3D

What if you could generate a 3D object from a text prompt or a single image? Generative 3D combines NeRF/3DGS with diffusion models to create 3D content without any multi-view input.

| Method | Input | Key Idea |
| --- | --- | --- |
| DreamFusion | Text prompt | Score Distillation Sampling (SDS) from 2D diffusion |
| Zero-1-to-3 | Single image | Viewpoint-conditioned diffusion |
| Magic3D | Text prompt | Coarse-to-fine with mesh extraction |
| LGM | 4 images | Feed-forward 3DGS generation |
| GaussianDreamer | Text prompt | SDS applied to Gaussian splatting |
Score Distillation: Text to 3D

A 2D diffusion model guides the optimization of a 3D NeRF representation. The NeRF renders views, the diffusion model critiques them.

SDS intuition: Render the NeRF from a random viewpoint. Add noise. Ask the diffusion model "does this look like [the text prompt]?" Use its gradient to update the NeRF. Repeat from many angles. The NeRF converges to a 3D object that looks right from every direction.
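
Written as a formula, the update described above is usually stated as below. This is the SDS gradient in the form given in the DreamFusion paper; the symbols are not defined elsewhere in this lesson, so read them as follows: x = g(θ) is the image rendered from the 3D parameters θ, x_t is that image noised to timestep t with noise ε, ε̂_φ is the frozen diffusion model's noise prediction given the text prompt y, and w(t) is a timestep-dependent weight.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad x = g(\theta), \quad x_t = \alpha_t\, x + \sigma_t\, \epsilon
```

The term in parentheses is the "critique": it is the gap between the noise the diffusion model thinks was added and the noise that actually was, and it is pushed back through the renderer via ∂x/∂θ without differentiating through the diffusion model itself.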
Check: How does DreamFusion create 3D from text?
🔗 Pattern Recognition
Score Distillation = Iterative Refinement via Learned Prior
This Lesson (SDS): Render NeRF → add noise → diffusion model predicts clean version → gradient back to NeRF. Repeat from many angles until the 3D representation matches the diffusion model's learned distribution.
Diffusion Lesson: Start from noise → score function predicts denoising direction → take step → repeat. The score function encodes a learned prior of "what images look like."

The deep connection: SDS uses a 2D diffusion model as a 3D-consistent critic. The score function ∇_x log p(x) from diffusion becomes a SUPERVISION signal for the 3D representation. You're distilling the 2D prior into 3D geometry. This is the same "iterative refinement guided by a learned model" pattern seen in diffusion denoising, RLHF reward optimization, and even Kalman filtering (using a model to refine estimates).

Where else in ML do you see "use a pretrained model's gradients to optimize something else entirely"? (Hint: think about CLIP-guided generation, neural style transfer, adversarial training...)

Chapter 9: 3D in Robotics

Neural 3D representations are transforming robotics. Robots need to understand 3D geometry to grasp objects, navigate spaces, and plan motions. NeRF and 3DGS provide rich scene representations that go beyond flat depth maps.

| Application | How NeRF/3DGS Helps |
| --- | --- |
| Grasp planning | Dense 3D geometry for contact-rich manipulation |
| Navigation | Photorealistic simulation for training navigation policies |
| Sim-to-real | Reconstruct real environments as training scenes |
| Language-guided | Pair 3D features with CLIP for "find the mug" queries |
| Deformables | Model soft objects and cloth for manipulation |
3D Scene Understanding for Robots

A robot uses a 3DGS representation to plan grasps. Colored regions show semantic understanding overlaid on 3D structure.

Feature Fields: LERF and similar methods distill CLIP features into NeRF/3DGS representations. This creates a 3D scene where you can point at any location and get a language-aligned feature vector — enabling queries like "where is the coffee mug?" directly in 3D space.
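
At query time, a feature field boils down to a similarity search between a text embedding and the features stored in 3D. The sketch below illustrates only that last step. Everything in it is a stand-in: the features and the "text embedding" are random vectors rather than real CLIP outputs, and one matching point is planted by hand so the query has something to find.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins (random, NOT real CLIP outputs): one feature per 3D point and a text embedding.
num_points, dim = 1000, 512
points = rng.uniform(-1.0, 1.0, size=(num_points, 3))     # 3D locations
point_features = rng.normal(size=(num_points, dim))        # distilled features
text_embedding = rng.normal(size=dim)                      # e.g. for "coffee mug"

# Plant a matching point so the query has something to find.
point_features[42] = text_embedding + 0.1 * rng.normal(size=dim)

def relevancy(features, query):
    """Cosine similarity between every point feature and the query embedding."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return f @ q

scores = relevancy(point_features, text_embedding)
best = int(np.argmax(scores))
print(best, points[best], round(float(scores[best]), 3))
```

In a real system the per-point features would be rendered or looked up from the trained field, and the highest-scoring 3D locations would be handed to a grasp or navigation planner.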
🔗 Pattern Recognition
Multi-View Geometry + 2D Features = 3D Understanding
This Lesson (Feature Fields): Distill CLIP/DINO features from 2D images into a 3D representation. Each point in 3D space gets a language-aligned feature vector derived from multiple 2D views.
VLM Lesson: Vision-Language Models encode 2D images into feature vectors aligned with text. The same CLIP encoder that powers feature fields also powers VLMs.

The deep connection: 2D foundation models (CLIP, DINO, SAM) learn powerful per-pixel features from billions of images. NeRF/3DGS provides the 3D scaffolding to "lift" these 2D features into consistent 3D representations. This is the bridge between 2D perception (what VLMs do) and 3D understanding (what robots need). The pattern: use geometry to aggregate 2D information into 3D — the same principle behind multi-view stereo, Structure from Motion, and visual SLAM.

VLAs (Vision-Language-Action models) need 3D spatial reasoning but are trained on 2D images. How might 3DGS + feature fields give VLAs implicit 3D understanding without explicit 3D supervision?

⚔ Adversarial: Your 3DGS model renders beautifully on your RTX 4090 (24GB VRAM), but when you deploy to a customer's RTX 3060 (12GB), the scene loads but rendering produces black frames at 2 FPS. What went wrong?
The trained model has 3.2 million Gaussians. Each stores 59 parameters at float32. Loading seems to work (no OOM error), but the tile-sorting step takes 400ms per frame and the alpha compositing produces mostly black pixels.
"We don't see the world as it is, we see it as we render it."
— Adapted from Anaïs Nin

You now understand how neural networks reconstruct 3D worlds from photographs. From NeRF's elegant ray marching to 3DGS's blazing-fast splatting, these techniques are redefining what's possible in graphics, VR, and robotics.

Check: How do Feature Fields (like LERF) help robots?