microNeRF/3DGS — Neural 3D from 2D Images

Chapter 1: Volume Rendering

To render a pixel, you shoot a ray from the camera through the scene. Along this ray, you sample points and ask: "What color and density exists here?" Then you composite these samples from front to back using the volume rendering equation.

C(r) = ∫ T(t) · σ(t) · c(t) dt where T(t) = exp(−∫σ(s)ds)

σ is density (how opaque the material is), c is color, and T is transmittance (how much light makes it through to this point). Dense regions block light; empty space passes it through.

In practice, we discretize this integral. For N sample points along a ray, the discrete volume rendering equation is:

C = ∑_i T_i · α_i · c_i where α_i = 1 − exp(−σ_i · δ_i) and T_i = ∏_j<i (1 − α_j)

Let's work a concrete example with 3 samples. Suppose σ = [0.1, 5.0, 0.2] and the step size δ = 0.1 for all samples. Colors are c₁ = red, c₂ = green, c₃ = blue.

Sample	σ	α = 1−exp(−σδ)	T (accumulated)	Weight = T·α
1 (red)	0.1	0.010	1.000	0.010
2 (green)	5.0	0.393	0.990	0.390
3 (blue)	0.2	0.020	0.600	0.012

Sample 2 dominates — it has high density (σ=5.0) and most of the light hasn't been absorbed yet (T=0.99). So the final pixel is mostly green. This is how NeRF renders: evaluate every sample, weight by density and transmittance, sum up.

Ray Marching Through a Volume

A ray travels through a scene, sampling density and color at each point. Denser regions contribute more to the final pixel color.

The key insight: Volume rendering is differentiable. This means we can use gradient descent to optimize whatever produces the density and color values — and that "whatever" is the neural network in NeRF.

Check: In volume rendering, what does transmittance T(t) represent?

The color at point t
The density at point t
How much light has passed through without being absorbed up to point t

🔨 Derivation: Derive the Transmittance Formula from Beer-Lambert Law

The volume rendering integral uses transmittance T(t) = exp(−∫₀^t σ(s)ds). This isn't arbitrary — it comes from the Beer-Lambert law of light absorption through a medium.

Your task: Starting from "a thin slab of thickness dt with density σ absorbs a fraction σ·dt of the remaining light," derive the continuous transmittance formula T(t) = exp(−∫σ(s)ds) and then discretize it into T_i = ∏_j<i(1 − α_j).

If I(t) is the light intensity remaining after traveling distance t through a medium with density σ(t), and a thin slab absorbs a fraction σ(t)·dt of the light: write dI/dt = ?. What's the relationship between I lost and the density?

You should have dI/I = −σ(t)dt. Integrate both sides. The left side gives ln(I(t)/I(0)). What does T(t) = I(t)/I(0) equal?

For a discrete step from t_i to t_i+δ_i, the fraction of light absorbed in that step is 1 − exp(−σ_i·δ_i). Call this α_i. What fraction passes through? How does the product of all pass-through fractions give you T_i?

Step 1: Beer-Lambert law says a thin slab absorbs proportionally to its density: dI = −σ(t) · I(t) · dt

Step 2: Rearrange: dI/I = −σ(t)dt. Integrate from 0 to t: ln(I(t)) − ln(I(0)) = −∫₀^tσ(s)ds

Step 3: Exponentiate: I(t)/I(0) = exp(−∫₀^tσ(s)ds). This ratio IS the transmittance: T(t) = exp(−∫₀^tσ(s)ds)

Step 4 (discretize): Break the ray into steps of size δ_i. Each step transmits a fraction exp(−σ_i·δ_i) = (1 − α_i) of the light. The total transmittance to sample i is the product of all preceding pass-through fractions: T_i = ∏_j<i(1 − α_j)

The key insight: The exponential decay form is not a choice — it's the unique solution to "each slab absorbs proportionally to its density." The product form in discrete rendering is just the discretized version of the same physics.
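
The equivalence between the exponential form and the product form can be checked numerically. The sketch below assumes piecewise-constant density within each step; the density and step values are made up for illustration.

```python
import numpy as np

# Hypothetical piecewise-constant densities and step sizes along one ray.
sigmas = np.array([0.1, 5.0, 0.2, 1.5])
deltas = np.array([0.1, 0.1, 0.1, 0.1])

# Continuous form: T = exp(-integral of sigma). For piecewise-constant density
# the integral up to sample i is the sum of sigma_j * delta_j over j < i.
optical_depth = np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]])
T_continuous = np.exp(-optical_depth)

# Discrete form: T_i = prod_{j<i} (1 - alpha_j), alpha_j = 1 - exp(-sigma_j * delta_j).
alphas = 1.0 - np.exp(-sigmas * deltas)
T_discrete = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])

print(T_continuous)
print(T_discrete)
```

Both arrays agree to floating-point precision, because each factor (1 − α_j) is exactly exp(−σ_j·δ_j).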

💻 Build It: Implement Volume Rendering from Scratch

You've seen the discrete volume rendering equation: C = ∑ T_i·α_i·c_i. Now implement it. Given arrays of densities, step sizes, and colors along a ray, compute the final pixel color.

Signature

def volume_render(sigmas, deltas, colors):
    """
    Args:
        sigmas: np.array of shape [N] - density at each sample
        deltas: np.array of shape [N] - step size between samples
        colors: np.array of shape [N, 3] - RGB color at each sample
    Returns:
        pixel_color: np.array of shape [3] - final composited RGB
        weights: np.array of shape [N] - per-sample weight (T_i * alpha_i)
    """

Test case

sigmas = [0.1, 5.0, 0.2], deltas = [0.1, 0.1, 0.1]
colors = [[1,0,0], [0,1,0], [0,0,1]]
Expected weights ≈ [0.010, 0.390, 0.012]
Expected pixel ≈ [0.010, 0.390, 0.012] (mostly green)

T₀ = 1 (no light absorbed before the first sample). T₁ = (1−α₀). T₂ = (1−α₀)(1−α₁). Use np.cumprod on (1−α) and prepend a 1 at the front, dropping the last element.

python
import numpy as np

def volume_render(sigmas, deltas, colors):
    sigmas = np.array(sigmas)
    deltas = np.array(deltas)
    colors = np.array(colors)

    # Step 1: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)

    # Step 2: T_i = product of (1-alpha_j) for j < i
    # T[0] = 1, T[1] = (1-a[0]), T[2] = (1-a[0])(1-a[1]), ...
    transmittance = np.cumprod(1.0 - alphas)
    transmittance = np.concatenate([[1.0], transmittance[:-1]])

    # Step 3: weights and final color
    weights = transmittance * alphas  # [N]
    pixel_color = (weights[:, None] * colors).sum(axis=0)  # [3]

    return pixel_color, weights

Bonus: What happens to the weights if you have a single extremely dense sample (σ=100)? What if ALL samples have density 0? Trace through the math to verify your implementation handles these edge cases.
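
One way to trace the two edge cases from the bonus question is to compute only the weights. This is a minimal sketch; `render_weights` is a helper name introduced here, and the densities are illustrative.

```python
import numpy as np

def render_weights(sigmas, deltas):
    """Per-sample weights T_i * alpha_i for one ray."""
    sigmas = np.asarray(sigmas, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return trans * alphas

deltas = [0.1, 0.1, 0.1]

# Case 1: one extremely dense sample in the middle of the ray.
opaque = render_weights([0.1, 100.0, 0.2], deltas)
# Case 2: completely empty space.
empty = render_weights([0.0, 0.0, 0.0], deltas)

print(opaque, opaque.sum())
print(empty, empty.sum())
```

In the first case the dense sample takes almost all of the weight and the sample behind it gets essentially none. In the second case every weight is zero, so the pixel comes out black unless a background color is added separately. The weights always sum to 1 minus the transmittance left at the end of the ray.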

Checkpoint — Before you move on

Explain in your own words: why does the transmittance T_i DECREASE along the ray, and what happens to later samples' contributions when an early sample has very high density? Why does this make physical sense?

Model Answer

T_i decreases because each sample absorbs some light. If sample 2 has σ=100 with δ=0.1, then α₂ = 1−exp(−10) ≈ 1.0 (almost fully opaque). T₃ = T₂·(1−α₂) ≈ T₂·0 ≈ 0. So sample 3 contributes NOTHING to the pixel — all light was absorbed before reaching it. This is exactly how a solid wall works: you can't see what's behind it because the wall absorbs all incoming light. Volume rendering naturally handles occlusion through this exponential decay of transmittance.

Chapter 2: Neural Radiance Fields (NeRF)

NeRF represents a 3D scene as a continuous function: given a 3D point (x, y, z) and viewing direction (theta, phi), it outputs the color (r, g, b) and density (sigma) at that point. This function is parameterized by a simple MLP (multilayer perceptron).

F_θ(x, y, z, θ, φ) → (r, g, b, σ)

Input: (x, y, z) position + (θ, φ) view direction
↓
MLP network: 8 layers, 256 channels, skip connections
↓
Output: (r, g, b) color + σ density

NeRF: Query the Scene Function

Click anywhere in the scene to query the NeRF network at that point. It returns color and density.

Why viewing direction? The same surface looks different from different angles (specular highlights, reflections). By conditioning on view direction, NeRF can represent view-dependent effects like the shine on a glossy object.
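
The shape of the scene function can be sketched in NumPy with random, untrained weights. This loosely follows the layout of the original architecture and is not a reference implementation: the encoded input sizes (63 for position, 27 for a direction given as a 3D unit vector), the skip location, and all helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """Random (untrained) weights and zero bias for one fully connected layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

POS_DIM, DIR_DIM, WIDTH, SKIP_AT = 63, 27, 256, 4  # assumed sizes

trunk = []
for i in range(8):
    n_in = POS_DIM if i == 0 else WIDTH
    if i == SKIP_AT + 1:
        n_in += POS_DIM  # the layer after the skip sees the input again
    trunk.append(dense(n_in, WIDTH))
sigma_head = dense(WIDTH, 1)
feature = dense(WIDTH, WIDTH)
color_hidden = dense(WIDTH + DIR_DIM, WIDTH // 2)
color_head = dense(WIDTH // 2, 3)

def nerf_mlp(pos_enc, dir_enc):
    """Map encoded positions [B, 63] and directions [B, 27] to rgb [B, 3] and sigma [B]."""
    h = pos_enc
    for i, (w, b) in enumerate(trunk):
        h = np.maximum(h @ w + b, 0.0)                 # ReLU
        if i == SKIP_AT:
            h = np.concatenate([h, pos_enc], axis=-1)  # skip connection
    # Density depends on position only and is kept non-negative.
    sigma = np.maximum(h @ sigma_head[0] + sigma_head[1], 0.0)[..., 0]
    h = h @ feature[0] + feature[1]
    h = np.concatenate([h, dir_enc], axis=-1)          # direction enters only here
    h = np.maximum(h @ color_hidden[0] + color_hidden[1], 0.0)
    rgb = 1.0 / (1.0 + np.exp(-(h @ color_head[0] + color_head[1])))  # sigmoid
    return rgb, sigma

rgb, sigma = nerf_mlp(rng.normal(size=(5, POS_DIM)), rng.normal(size=(5, DIR_DIM)))
print(rgb.shape, sigma.shape)
```

Feeding the direction in only after the density has been computed is one way to make geometry view-independent while still allowing view-dependent color.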

Realization (the computational cost): To render a single 800×800 image, you shoot one ray per pixel: 640,000 rays. Each ray samples 64 points (coarse) plus 128 more (fine), so the fine pass evaluates 192 points per ray. Total: 640,000 × 192 ≈ 123 million MLP evaluations per image, and more if you also count the coarse pass's own 64 evaluations per ray. Each evaluation runs through an 8-layer, 256-wide MLP. That's why original NeRF takes ~30 seconds per frame. The MLP itself is tiny (~1.2M parameters), but you evaluate it 100+ million times.

The training loop is straightforward: randomly pick a training image, randomly sample rays from it, render those rays using volume rendering, compare the rendered pixel colors to the ground-truth pixels using MSE loss, and backpropagate. Training from ~50–200 posed images takes about 1–2 days on a single GPU.

Check: What does a NeRF MLP take as input?

A 3D position and viewing direction
An image patch
A triangle mesh

Chapter 3: Positional Encoding & the MLP

MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. Positional encoding solves this by mapping the input coordinates to a higher-dimensional space using sinusoidal functions.

γ(p) = [sin(2⁰πp), cos(2⁰πp), ..., sin(2^(L−1)πp), cos(2^(L−1)πp)]

Positional Encoding: Low vs High Frequency

Without positional encoding, the MLP can only learn smooth blobs. With it, sharp edges and fine detail emerge. Adjust L (number of frequency bands).

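NeRF uses L=10 frequency bands for position (x,y,z) and L=4 for viewing direction. For each coordinate, the encoding produces 2L values (sin and cos at each frequency), and the raw coordinate is kept alongside them. So a 3D position [3] becomes [3 + 3×2×10] = [63] values. Written as two angles (θ,φ), a direction [2] would become [2 + 2×2×4] = [18] values; the original implementation instead passes the direction as a 3D unit vector, which gives [3 + 3×2×4] = [27]. The encoded position feeds the first layer, while the encoded direction is injected late in the network, just before the color output, so density depends on position only. Why fewer bands for direction? View-dependent effects (like specular highlights) vary more slowly than geometry, so you don't need sharp angular precision.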
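
The encoding and its output sizes can be sketched as follows. The function name and the `include_input` flag are choices made here for illustration; the example inputs are arbitrary.

```python
import numpy as np

def positional_encoding(p, num_bands, include_input=True):
    """Encode each coordinate with sin/cos at frequencies 2^0*pi ... 2^(L-1)*pi.

    p: array [..., D]. Returns [..., D + 2*D*num_bands], or [..., 2*D*num_bands]
    when include_input is False.
    """
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi              # [L]
    angles = p[..., None] * freqs                              # [..., D, L]
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # [..., D, L, 2]
    enc = enc.reshape(*p.shape[:-1], -1)                       # [..., D*L*2]
    return np.concatenate([p, enc], axis=-1) if include_input else enc

xyz = np.array([0.25, -0.5, 0.8])
two_angles = np.array([0.3, 1.2])        # direction written as (theta, phi)
unit_dir = np.array([0.0, 0.6, 0.8])     # direction written as a 3D unit vector

print(positional_encoding(xyz, 10).shape)        # (63,)
print(positional_encoding(two_angles, 4).shape)  # (18,)
print(positional_encoding(unit_dir, 4).shape)    # (27,)
```

The direction count depends on whether the direction is stored as two angles or as a three-component unit vector, which is why both 18 and 27 show up in discussions of NeRF's input size.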

Same idea as transformers: This is the same positional encoding concept used in transformers, but applied to 3D spatial coordinates. Higher frequencies let the network distinguish between nearby points, which is essential for sharp geometry. Without PE, two points 1mm apart produce nearly identical MLP inputs, so the network struggles to tell them apart. With PE at L=10, the highest band, sin(2⁹πx), completes 256 full cycles per unit of x (512 across a scene normalized to [−1, 1]), which is plenty of resolution for fine detail.

Check: Why does NeRF need positional encoding?

To reduce the number of parameters
MLPs are biased toward smooth functions; PE enables high-frequency detail
To compress the 3D coordinates

🔨 Derivation: Why Sinusoidal Encoding Enables High-Frequency Learning

An MLP with ReLU activations is biased toward low-frequency functions (the "spectral bias" of neural networks). Positional encoding fixes this by mapping inputs to sinusoids at increasing frequencies.

Your task: Show that for two nearby points x and x+ε, their raw inputs are nearly identical (|x − (x+ε)| = ε), but after positional encoding with frequency band 2^k, the encoded inputs can differ by up to 2^kπε. Explain why this means the MLP can now distinguish points that are 1/2^k apart.

d/dx sin(2^kπx) = 2^kπ cos(2^kπx). The maximum rate of change of the k-th frequency band is 2^kπ. So even for tiny ε, the encoded value changes by up to 2^kπε.

sin(2^kπx) changes fastest when cos(2^kπx) = ±1 (crossing zero) and is stationary when cos(2^kπx) = 0 (at peaks/troughs). That's why we use BOTH sin and cos — when one is flat, the other is steep. Together they always provide discrimination.

A Nyquist-style argument says we can distinguish features down to about half the period of our highest frequency. The period of sin(2^kπx) is 2/2^k. So with L=10, the highest frequency (k=9) has period 1/256, and we can resolve features that are about 1/512 of a unit apart. For a 1-meter scene, that's ~2mm detail.

Step 1: Without PE, two points at x=0.500 and x=0.501 give the MLP inputs that differ by 0.001. The MLP must squeeze all scene detail into this tiny input range — it's like painting the Mona Lisa with a 3-pixel brush.

Step 2: With PE at frequency band k: γ_k(x) = sin(2^kπx). The difference between our two points: sin(2^kπ·0.501) − sin(2^kπ·0.500) ≈ 2^kπ·0.001·cos(2^kπ·0.500).

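Step 3: At k=9 (the highest band in NeRF): the phase shifts by 512π·0.001 ≈ 1.6 radians, about a quarter of a cycle. The first-order estimate from Step 2 is no longer accurate at that size, but the exact values tell the same story: sin(2⁹π·0.500) = 0 while sin(2⁹π·0.501) ≈ 1.0. A change of order 1 in the MLP's input space is HUGE compared with the raw difference of 0.001; it is comparable to moving across a large part of the scene in the raw coordinate. The MLP now "sees" these nearby points as far apart in its feature space.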

Step 4: The multi-scale structure matters. Low frequencies (k=0,1) capture coarse structure. High frequencies (k=8,9) capture fine edges. The MLP learns to combine these scales, much like Fourier analysis decomposes signals into frequency components.

The key insight: Positional encoding is not "adding information" — the input already contains the position. It's AMPLIFYING the difference between nearby positions so the MLP's smooth learned function can capture sharp transitions. It converts a spatial resolution problem into a function approximation problem the MLP can actually solve.
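
The amplification can be measured directly for the two points used in the derivation. This sketch computes, for each band, the distance between the (sin, cos) pairs of x = 0.500 and x = 0.501; `band_distance` is a helper name introduced here.

```python
import numpy as np

def band_distance(x0, x1, k):
    """Distance between the (sin, cos) encodings of x0 and x1 at band k."""
    freq = (2.0 ** k) * np.pi
    d_sin = np.sin(freq * x1) - np.sin(freq * x0)
    d_cos = np.cos(freq * x1) - np.cos(freq * x0)
    return float(np.hypot(d_sin, d_cos))

x0, x1 = 0.500, 0.501
for k in range(10):  # bands k = 0..9, i.e. L = 10
    print(f"k={k}: raw diff={x1 - x0:.3f}, encoded diff={band_distance(x0, x1, k):.4f}")
```

The encoded difference roughly doubles with each band while the phase shift stays small, growing from about 0.003 at k=0 to more than 1 at k=9. It can never exceed 2, since both points lie on the unit circle in (sin, cos) space.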

💥 Break-It Lab: What Dies When You Remove NeRF Components?

A working NeRF produces sharp, view-dependent renderings. Each component serves a specific purpose. Toggle them off and observe what dies.

Remove Positional Encoding

Failure mode: Without PE, the MLP can only learn smooth, low-frequency functions. Sharp edges disappear. Fine textures blur into uniform color blobs. The rendered image looks like it was shot through frosted glass: the overall structure is visible but all detail is gone. PSNR typically drops by a few dB.

Remove View-Dependent Color

Failure mode: Without view direction input, color cannot change with viewing angle. Specular highlights, reflections, and glossy surfaces become flat and matte. A shiny car looks like matte plastic. The model can still reconstruct diffuse (Lambertian) surfaces well, but anything glossy or reflective is wrong.

Reduce to 8 Samples/Ray

Failure mode: With too few samples, the discrete volume rendering approximation becomes coarse. Thin structures (fences, hair, wires) disappear because no sample lands on them. Surfaces get "banded" artifacts where the step size is visible. The noise is worst in regions with rapid density changes (object edges).

Chapter 4: Sampling Strategies

Evaluating the MLP at every point along a ray is expensive. NeRF uses a hierarchical sampling strategy: first, sample uniformly (coarse pass), then concentrate more samples where density is high (fine pass). This focuses compute where it matters.

Coarse Pass

64 uniform samples along the ray

↓

Fine Pass

128 additional samples concentrated near surfaces

Hierarchical Sampling

Blue dots = coarse uniform samples. Green dots = fine importance-weighted samples near the surface.

Importance sampling intuition: Why waste evaluations in empty air? After the coarse pass reveals where the surface is, the fine pass puts most of its budget right at the surface boundary — exactly where the details matter.
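
The fine pass can be sketched as inverse-transform sampling from the coarse weights. This is a simplified version (the original implementation builds its bins slightly differently); the near/far bounds, the toy weight profile, and the function name are assumptions for illustration.

```python
import numpy as np

def sample_fine(bin_edges, weights, n_fine, rng):
    """Draw n_fine depths from the piecewise-constant PDF given by coarse weights.

    bin_edges: [N+1] depths bounding the N coarse intervals
    weights:   [N]   coarse weights T_i * alpha_i for those intervals
    """
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)  # small floor avoids an all-zero PDF
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])    # [N+1], rises from 0 to 1
    u = rng.uniform(0.0, 1.0, n_fine)
    # Find the bin each u falls into, then place the sample linearly inside it.
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])
    return np.sort(bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx]))

rng = np.random.default_rng(0)
edges = np.linspace(2.0, 6.0, 65)                        # 64 coarse bins, assumed near/far
centers = 0.5 * (edges[:-1] + edges[1:])
coarse_w = np.exp(-0.5 * ((centers - 4.4) / 0.1) ** 2)   # toy weights peaked at a surface
fine_t = sample_fine(edges, coarse_w, 128, rng)

print(np.mean(np.abs(fine_t - 4.4) < 0.3))  # fraction of fine samples near the surface
```

Almost all of the 128 fine samples land within a short distance of the toy surface at depth 4.4, while the small floor on the PDF still leaves a tiny chance of sampling elsewhere.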

Check: Why does NeRF use hierarchical sampling?

To focus samples near surfaces where detail matters, saving compute
To make the network smaller
To handle moving objects

Chapter 5: Speed — Instant-NGP

Original NeRF takes hours to train and seconds to render a single frame. Instant-NGP (NVIDIA, 2022) slashed training to seconds and rendering to real-time by replacing the deep MLP with a multi-resolution hash table of learned features followed by a much smaller MLP.

Instead of pushing every point through 8 wide layers, Instant-NGP looks up learned feature vectors in a hash table indexed by spatial position, interpolates them, and feeds the result to a tiny MLP. The lookups are massively parallel and cache-friendly.

Method	Training Time	Render Speed	Key Technique
Original NeRF	~1 day	~30s/frame	Deep MLP
Instant-NGP	~5 seconds	Real-time	Hash grid encoding
TensoRF	~30 min	~1s/frame	Tensor factorization
Plenoxels	~11 min	~15fps	Sparse voxel grid

MLP vs Hash Grid Lookup

Compare the two approaches: deep MLP requires sequential layers, hash grid is a fast parallel lookup. Toggle to compare.

The hash trick: At each resolution level, spatial positions are hashed to a fixed-size table. Collisions are not resolved explicitly; during training the entries that matter most dominate the updates, and the neural network learns to work around the rest. This trades a tiny bit of quality for massive speed gains.
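
A minimal sketch of the indexing step: integer grid coordinates are combined with large primes via XOR and reduced modulo the table size. The prime constants, table size, and resolution range below are assumptions in the style of spatial hashing schemes, and the function names are introduced here. A full lookup would also hash all eight corners of the enclosing cell and interpolate their features.

```python
# Primes of the kind used for spatial hashing; treat the exact constants as an assumption.
PRIMES = (1, 2654435761, 805459861)

def spatial_hash(ix, iy, iz, table_size):
    """Hash non-negative integer grid coordinates to a slot in a fixed-size table."""
    h = (ix * PRIMES[0]) ^ (iy * PRIMES[1]) ^ (iz * PRIMES[2])
    return h % table_size

def level_resolution(level, n_min=16, n_max=512, n_levels=16):
    """Grid resolution per level, growing geometrically from n_min toward n_max."""
    growth = (n_max / n_min) ** (1.0 / (n_levels - 1))
    return int(n_min * growth ** level)

TABLE_SIZE = 2 ** 14          # hypothetical; real tables are typically larger
point = (0.37, 0.52, 0.81)    # a position in the unit cube

for level in (0, 8, 15):
    res = level_resolution(level)
    ix, iy, iz = (int(c * res) for c in point)  # lower corner of the enclosing cell
    print(level, res, spatial_hash(ix, iy, iz, TABLE_SIZE))
```

At coarse levels the grid has fewer vertices than table slots, so collisions are rare; at fine levels many vertices share a slot, which is where the trade of quality for speed happens.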

Check: What makes Instant-NGP so much faster than original NeRF?

It uses fewer training images
It renders at lower resolution
It replaces the deep MLP with fast hash table lookups

Chapter 6: 3D Gaussian Splatting

3DGS takes a completely different approach from NeRF. Instead of an implicit function queried along rays, it represents the scene as millions of 3D Gaussians — each one a colored, oriented ellipsoid with position, covariance, color, and opacity.

To render: project each Gaussian onto the screen (splatting), sort by depth, and alpha-composite them front to back. No ray marching, no MLP evaluation — just rasterization. This is extremely fast.

G(x) = exp(−½ (x − μ)^T Σ⁻¹ (x − μ))

Gaussian Splatting Visualization

Each ellipse is a 3D Gaussian projected to 2D. More Gaussians = more detail. Adjust the count and see them splat!

Per-Gaussian Parameter	Meaning	Count
Position μ	3D center of the Gaussian	3
Covariance Σ	Shape, size, orientation (via quaternion + scale)	7
Color	Spherical harmonics coefficients	48
Opacity α	Transparency	1
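
The covariance row can be made concrete: storing a quaternion and a scale vector and building Σ = R S Sᵀ Rᵀ from them guarantees a valid (symmetric, positive semi-definite) covariance during optimization. A sketch, with function names and example values chosen here for illustration:

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(quat, scale):
    """Sigma = R S S^T R^T, built from 4 rotation + 3 scale parameters."""
    m = quat_to_rotmat(quat) @ np.diag(scale)
    return m @ m.T

def gaussian_value(x, mu, sigma):
    """Unnormalized Gaussian G(x) = exp(-0.5 (x - mu)^T Sigma^-1 (x - mu))."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

# A Gaussian stretched along its local x axis, then rotated 90 degrees about z.
quat = (np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4))
sigma3d = covariance(quat, scale=(0.3, 0.1, 0.05))
print(np.round(sigma3d, 4))
print(gaussian_value([0.0, 0.3, 0.0], [0.0, 0.0, 0.0], sigma3d))
```

After the rotation the long axis points along y, so the largest variance (0.3² = 0.09) appears in the middle diagonal entry, and a point one standard deviation away along y evaluates to exp(−0.5).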

Initialization comes from Structure from Motion (SfM) — a classical algorithm that estimates a sparse point cloud (~100K–1M points) from the input images. Each SfM point becomes one Gaussian. During training, 3DGS applies adaptive densification: Gaussians with large gradients in under-reconstructed regions are split (large ones become two smaller ones) or cloned (small ones are duplicated nearby). Gaussians with opacity below 0.005 are pruned. A typical trained scene has 1–5 million Gaussians.

Realization — parameter budget: Each Gaussian stores: mean [3], scale [3], rotation quaternion [4], opacity [1], and spherical harmonics color coefficients [48] (degree 3, which gives 16 coefficients × 3 RGB channels). That's ~59 parameters per Gaussian. For 1M Gaussians: ~59M parameters, or about 230 MB at float32. Rendering is pure rasterization: project all Gaussians to 2D, sort by depth per tile, alpha-composite front-to-back. No neural network at inference — just GPU rasterization. This is why it hits 100+ FPS.

Why Gaussians? They're differentiable (can be optimized with gradient descent), fast to project (closed-form 2D projection), and naturally handle smooth surfaces. Plus, they tile-sort beautifully on GPUs.

Training takes ~20–40 minutes on a single GPU. The loss combines a pixel-level L1 term with a D-SSIM (structural dissimilarity) term: L = (1−λ)·L₁ + λ·L_D-SSIM with λ=0.2. Both NeRF and 3DGS optimize per-scene: you train a separate model for each scene from its specific images. There is no generalization across scenes (that's what later methods like feed-forward 3DGS tackle).

Check: How does 3DGS render an image?

By evaluating an MLP at each pixel
By projecting 3D Gaussians to 2D and alpha-compositing
By ray tracing through a voxel grid

🔨 Derivation: 3DGS Alpha Compositing, From Gaussians to Pixels

In 3DGS, each Gaussian has an opacity α_i that depends on its intrinsic opacity AND how much the pixel overlaps with the Gaussian's 2D projection. The final pixel color uses front-to-back alpha compositing, which is structurally identical to NeRF's volume rendering.

Your task: Show that 3DGS's per-pixel compositing C = ∑_i c_i · α_i · ∏_j<i(1−α_j) is mathematically equivalent to the discrete volume rendering formula from Ch1. Then explain why the effective α_i for a Gaussian depends on the Gaussian's 2D distance from the pixel center.

NeRF: C = ∑ T_iα_ic_i where T_i=∏_j<i(1−α_j). 3DGS: C = ∑ c_iα_i∏_j<i(1−α_j). They're identical! The ∏ term IS the transmittance. The only difference is what determines α_i.

A 2D Gaussian centered at μ' with covariance Σ' evaluated at pixel position p gives: G(p) = exp(−½(p−μ')^TΣ'⁻¹(p−μ')). The effective alpha at that pixel is: α_eff = o_i · G(p), where o_i is the Gaussian's learned opacity. Far from the center, G(p) → 0, so the Gaussian doesn't affect that pixel.

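The ∏_j<i(1−α_j) term means ORDER matters. Gaussian j must be in front of i (smaller depth) to block it. If you sort wrong, a Gaussian farther back could "occlude" one in front. 3DGS orders Gaussians once per 16×16 tile rather than once per pixel: the reference implementation runs a single GPU radix sort over keys that combine tile ID and depth, so every pixel in a tile shares the same depth-ordered list.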

The equivalence: Both NeRF and 3DGS compute pixels the same way: C = ∑_{i (front-to-back)} color_i × opacity_i × accumulated_transparency_i. The math is identical. Only the source of (color, opacity) differs.

In NeRF: α_i = 1−exp(−σ_iδ_i), coming from the MLP's density output integrated over a ray step.

In 3DGS: α_i = o_i · exp(−½(p−μ'_i)^TΣ'_i⁻¹(p−μ'_i)). This is the Gaussian's intrinsic opacity TIMES its spatial falloff at this pixel. The 2D covariance Σ' comes from projecting the 3D covariance using the Jacobian of the camera projection: Σ' = J W Σ W^T J^T.
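
The projection Σ' = J W Σ Wᵀ Jᵀ can be sketched for a simple pinhole camera. Here J is the Jacobian of the perspective projection evaluated at the Gaussian's center in camera coordinates, and W is the world-to-camera rotation. The function name and camera values are assumptions, and practical implementations add further details (such as a small screen-space blur) that are omitted here.

```python
import numpy as np

def project_covariance(cov3d, mean_cam, world_to_cam_rot, fx, fy):
    """2D covariance Sigma' = J W Sigma W^T J^T for a pinhole camera.

    mean_cam: Gaussian center in camera coordinates (x, y, z), with z > 0.
    """
    x, y, z = mean_cam
    # Jacobian of (fx * x / z, fy * y / z) with respect to (x, y, z).
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    JW = J @ world_to_cam_rot
    return JW @ cov3d @ JW.T

# An isotropic Gaussian (std 0.1) on the optical axis, 5 units from the camera.
cov2d = project_covariance(0.01 * np.eye(3), (0.0, 0.0, 5.0), np.eye(3), fx=500.0, fy=500.0)
print(cov2d)
```

For this on-axis case the result is isotropic with standard deviation f·s/z = 500 × 0.1 / 5 = 10 pixels, matching the intuition that a splat shrinks in proportion to its depth.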

The deep connection: Front-to-back alpha compositing IS discrete volume rendering. Whether you discretize by sampling points along a ray (NeRF) or by sorting primitives by depth (3DGS), you're computing the same integral. 3DGS just uses a different basis — Gaussians instead of point samples.

The key insight: This shared mathematical structure is why both methods are differentiable and can be optimized with the same kind of photometric loss on rendered pixels (MSE for NeRF, L1 plus SSIM for 3DGS). They're two parameterizations of the same rendering equation. The speed difference comes entirely from the representation (MLP queries vs rasterization), not the compositing math.
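
The per-pixel compositing from this derivation can be sketched for a handful of already-projected Gaussians. The function name and the two example Gaussians are made up for illustration; a real renderer does this per tile on the GPU.

```python
import numpy as np

def splat_pixel(pixel, means2d, covs2d, opacities, colors, depths):
    """Composite projected Gaussians at one pixel, front to back."""
    order = np.argsort(depths)                 # nearest Gaussian first
    color = np.zeros(3)
    transmittance = 1.0
    for i in order:
        d = pixel - means2d[i]
        falloff = np.exp(-0.5 * d @ np.linalg.solve(covs2d[i], d))
        alpha = opacities[i] * falloff         # alpha_i = o_i * G_i(p)
        color += transmittance * alpha * colors[i]
        transmittance *= 1.0 - alpha           # T seen by the next Gaussian
    return color, transmittance

pixel = np.array([10.0, 10.0])
means2d = np.array([[10.0, 10.0], [12.0, 10.0]])
covs2d = np.array([4.0 * np.eye(2), 4.0 * np.eye(2)])
opacities = np.array([0.8, 0.9])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # red in front, green behind
depths = np.array([1.0, 2.0])

color, remaining = splat_pixel(pixel, means2d, covs2d, opacities, colors, depths)
print(color, remaining)
```

The red Gaussian sits exactly on the pixel, so it contributes its full opacity of 0.8. The green one is behind it and two pixels off-center, so it contributes only 0.2 × 0.9 × exp(−0.5) ≈ 0.109.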

Chapter 7: Rasterization vs Ray Marching

The fundamental difference between NeRF and 3DGS is how they render:

	NeRF (Ray Marching)	3DGS (Rasterization)
Approach	Shoot rays, sample points, query MLP	Project primitives to screen, sort, blend
Representation	Implicit (continuous function)	Explicit (millions of Gaussians)
Training speed	Hours to days	Minutes
Render speed	Seconds per frame	100+ FPS
Memory	Low (small MLP)	High (millions of Gaussians)
Quality	Excellent	Excellent (often better)

Ray Marching vs Rasterization

Toggle between the two rendering paradigms to see the conceptual difference.

Realization — why 1000× faster: The speed gap isn't about better algorithms — it's about what hardware does well. NeRF renders per-pixel: for each of 640K pixels, march a ray, evaluate an MLP ~192 times. That's 123M sequential-ish neural network forward passes — compute-bound on matrix multiplications. 3DGS renders per-primitive: project 1M Gaussians to 2D (embarrassingly parallel), sort them per 16×16 tile, alpha-composite. This is pure rasterization — the exact workload GPUs were designed for over 30 years. No neural network runs at inference time. The bottleneck shifts from compute to memory bandwidth, and modern GPUs have enormous bandwidth.

The verdict (2024+): 3DGS has largely overtaken NeRF for most practical applications due to its speed advantage. But NeRF-style implicit representations still win for memory-constrained scenarios and certain generative tasks.

Check: What is the main practical advantage of 3DGS over NeRF?

Much faster rendering (100+ FPS vs seconds per frame)
Uses less memory
Needs fewer input images

⚔ Adversarial: Your NeRF produces sharp images from training views but completely fails on novel views 30° away. Training loss is excellent. What's wrong?

You've trained a NeRF on 50 images of a shiny car. From the training camera positions, renders look photorealistic (PSNR > 32dB). But when you move the virtual camera 30° from any training view, the rendering shows severe artifacts: floating blobs, incorrect specular highlights, and geometry that seems to "follow" the camera.

The learning rate was too high, causing training instability
The model overfits view-dependent effects: it "bakes" appearance into geometry, encoding specular highlights as density clouds rather than view-dependent color
The positional encoding frequency is too low

🏗 Design Challenge: You're the Architect: Real-Time VR 3D Reconstruction

You're building a 3D reconstruction system for a VR headset. Users walk around a room, and the headset must render photorealistic views of previously captured objects from any angle, in real time. The scene includes both static objects (furniture) and dynamic objects (a pet cat moving around).

Frame rate: 90 FPS stereo (two eyes = 180 renders/sec)
Latency: <11ms motion-to-photon
Resolution: 2160×2160 per eye
Scene: 360° room, ~20m², static + dynamic objects
Hardware: Mobile GPU (Quest 3 class: ~2 TFLOPS)
Memory: 6 GB VRAM shared with OS

1. NeRF vs 3DGS vs hybrid for the static environment? Consider render budget: 11ms for ~4.7M pixels per eye at 2 TFLOPS.

2. How do you handle the dynamic cat? You can't retrain per-frame. What representation allows real-time updates?

3. 1M Gaussians at 230MB won't fit alongside the OS. How do you compress? What quality do you sacrifice?

4. The user can look anywhere in 360°. Do you render the full scene or use foveated/culled rendering? How?
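
For question 1, a back-of-the-envelope budget is a useful starting point. The sketch below assumes the GPU's full 2 TFLOPS is available for rendering and that one NeRF-style MLP evaluation costs about 2 FLOPs per weight for a network of roughly 0.6M weights; both are rough assumptions.

```python
pixels_per_eye = 2160 * 2160
pixels_per_second = pixels_per_eye * 2 * 90   # two eyes at 90 FPS
flops_available = 2e12                        # ~2 TFLOPS, assumed fully usable

budget_per_pixel = flops_available / pixels_per_second

mlp_weights = 0.6e6                           # assumed size of one NeRF-style MLP
flops_per_eval = 2 * mlp_weights              # one multiply-add per weight
nerf_flops_per_pixel = 192 * flops_per_eval   # 192 samples per ray

print(f"pixels per eye:        {pixels_per_eye:,}")
print(f"FLOP budget per pixel: {budget_per_pixel:,.0f}")
print(f"NeRF cost per pixel:   {nerf_flops_per_pixel:,.0f}")
print(f"over budget by:        {nerf_flops_per_pixel / budget_per_pixel:,.0f}x")
```

Under these assumptions there are only a couple of thousand FLOPs available per pixel, while a straightforward NeRF ray needs hundreds of millions, which is why the answer has to come from a rasterization-style method.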

The industry approach (2024-2025):

Static scene: Compressed 3DGS. Of the options here, 3DGS is the only one with a realistic path to 90 FPS, but raw 3DGS is too large. Solutions: (1) Quantize SH coefficients to int8 (48 floats → 48 bytes, 4× compression). (2) Use fewer SH bands (degree 1 instead of 3, 12 coefficients instead of 48). (3) Prune low-opacity Gaussians aggressively. Result: roughly 50-80MB for a room, with 100+ FPS as the target on mobile GPUs.

Dynamic objects: Deformable 3DGS or tracked mesh + texture. For the cat: a per-frame deformation field that moves Gaussian centers (D-3DGS), or fall back to classical mesh tracking with neural texture. Some systems use a small MLP that predicts Gaussian offsets from a pose code, trained on a short capture sequence.

Memory: Level-of-detail (LOD). Only load full-detail Gaussians for nearby surfaces. Distant geometry uses fewer, larger Gaussians. Tile-based culling skips Gaussians outside the current view frustum — crucial for 360° scenes where only ~90° is visible at once.

Foveated rendering: The eye's periphery has much lower acuity. Render the foveal region (central 10°) at full resolution with all Gaussians. Periphery: 4× fewer pixels, skip small Gaussians. This cuts total rasterization work by ~3-4×.
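
The compression options for the static scene can be sized with simple arithmetic. This sketch assumes the 11 non-color parameters (mean, scale, rotation, opacity) stay at float32 and only the spherical harmonics storage changes; the function name is introduced here.

```python
def gaussian_bytes(sh_degree, sh_bytes_per_coeff, other_bytes_per_param=4):
    """Bytes for one Gaussian: 11 geometry/opacity params plus SH color coefficients."""
    n_sh = 3 * (sh_degree + 1) ** 2   # (degree + 1)^2 coefficients per RGB channel
    return 11 * other_bytes_per_param + n_sh * sh_bytes_per_coeff

n = 1_000_000
options = {
    "float32 SH, degree 3 (raw)": gaussian_bytes(3, 4),
    "int8 SH, degree 3": gaussian_bytes(3, 1),
    "float32 SH, degree 1": gaussian_bytes(1, 4),
    "int8 SH, degree 1": gaussian_bytes(1, 1),
}
for name, b in options.items():
    print(f"{name}: {b} B per Gaussian, {n * b / 1e6:.0f} MB per million")
```

Quantizing and truncating the SH coefficients together brings a million Gaussians from 236 MB down to 56 MB before any pruning, because color dominates the per-Gaussian footprint.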

Chapter 8: Generative 3D

What if you could generate a 3D object from a text prompt or a single image? Generative 3D combines NeRF/3DGS with diffusion models to create 3D content without any multi-view input.

Method	Input	Key Idea
DreamFusion	Text prompt	Score Distillation Sampling (SDS) from 2D diffusion
Zero-1-to-3	Single image	Viewpoint-conditioned diffusion
Magic3D	Text prompt	Coarse-to-fine with mesh extraction
LGM	4 images	Feed-forward 3DGS generation
GaussianDreamer	Text prompt	SDS applied to Gaussian splatting

Score Distillation: Text to 3D

A 2D diffusion model guides the optimization of a 3D NeRF representation. The NeRF renders views, the diffusion model critiques them.

SDS intuition: Render the NeRF from a random viewpoint. Add noise. Ask the diffusion model "does this look like [the text prompt]?" Use its gradient to update the NeRF. Repeat from many angles. The NeRF converges to a 3D object that looks right from every direction.
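
Written out, the update in this intuition takes roughly the following form (notation as in the score distillation literature; the symbols are introduced here and are not defined elsewhere in this lesson). Here θ are the NeRF parameters, x = g(θ) is a rendered view, y is the text prompt, ε is the added noise, ε̂_φ is the diffusion model's noise prediction, and w(t) is a timestep-dependent weight:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
    \frac{\partial x}{\partial \theta} \right],
\qquad x = g(\theta), \quad x_t = \alpha_t\, x + \sigma_t\, \epsilon .
```

The difference between predicted and injected noise plays the role of the "critique": it points from the current render toward images the diffusion model finds more plausible for the prompt, and the renderer's Jacobian carries that signal back to the 3D parameters.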

Check: How does DreamFusion create 3D from text?

Uses a 2D diffusion model's gradients to optimize a 3D NeRF
Retrieves 3D models from a database
Directly trains a 3D diffusion model

🔗 Pattern Recognition

Score Distillation = Iterative Refinement via Learned Prior

This Lesson (SDS)

Render NeRF → add noise → diffusion model predicts clean version → gradient back to NeRF. Repeat from many angles until the 3D representation matches the diffusion model's learned distribution.

Diffusion Lesson

Start from noise → score function predicts denoising direction → take step → repeat. The score function encodes a learned prior of "what images look like."

The deep connection: SDS uses a 2D diffusion model as a 3D-consistent critic. The score function ∇_xlog p(x) from diffusion becomes a SUPERVISION signal for the 3D representation. You're distilling the 2D prior into 3D geometry. This is the same "iterative refinement guided by a learned model" pattern seen in diffusion denoising, RLHF reward optimization, and even Kalman filtering (using a model to refine estimates).

Where else in ML do you see "use a pretrained model's gradients to optimize something else entirely"? (Hint: think about CLIP-guided generation, neural style transfer, adversarial training...)

Chapter 9: 3D in Robotics

Neural 3D representations are transforming robotics. Robots need to understand 3D geometry to grasp objects, navigate spaces, and plan motions. NeRF and 3DGS provide rich scene representations that go beyond flat depth maps.

Application	How NeRF/3DGS Helps
Grasp planning	Dense 3D geometry for contact-rich manipulation
Navigation	Photorealistic simulation for training navigation policies
Sim-to-real	Reconstruct real environments as training scenes
Language-guided	Pair 3D features with CLIP for "find the mug" queries
Deformables	Model soft objects and cloth for manipulation

3D Scene Understanding for Robots

A robot uses a 3DGS representation to plan grasps. Colored regions show semantic understanding overlaid on 3D structure.

Feature Fields: LERF and similar methods distill CLIP features into NeRF/3DGS representations. This creates a 3D scene where you can point at any location and get a language-aligned feature vector — enabling queries like "where is the coffee mug?" directly in 3D space.
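
The query step can be sketched as a cosine-similarity search over per-point features. Everything below is a stand-in: the features and the text embedding are random vectors, whereas a real system would take them from a trained feature field and a text encoder such as CLIP, and methods like LERF use a more elaborate relevancy score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for illustration only (random data, hypothetical sizes).
points = rng.uniform(-1.0, 1.0, size=(1000, 3))   # 3D sample locations
features = rng.normal(size=(1000, 512))           # per-point feature vectors
# Fake "coffee mug" query: a noisy copy of the feature stored at point 123.
text_embedding = features[123] + 0.1 * rng.normal(size=512)

def relevancy(features, query):
    """Cosine similarity between each point feature and the query embedding."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return f @ q

scores = relevancy(features, text_embedding)
best = int(np.argmax(scores))
print(best, points[best])
```

The highest-scoring point is the one whose feature the query was derived from, which is the 3D analogue of asking "where is the coffee mug?" and getting back a location.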

🔗 Pattern Recognition

Multi-View Geometry + 2D Features = 3D Understanding

This Lesson (Feature Fields)

Distill CLIP/DINO features from 2D images into a 3D representation. Each point in 3D space gets a language-aligned feature vector derived from multiple 2D views.

VLM Lesson

Vision-Language Models encode 2D images into feature vectors aligned with text. The same CLIP encoder that powers feature fields also powers VLMs.

The deep connection: 2D foundation models (CLIP, DINO, SAM) learn powerful per-pixel features from billions of images. NeRF/3DGS provides the 3D scaffolding to "lift" these 2D features into consistent 3D representations. This is the bridge between 2D perception (what VLMs do) and 3D understanding (what robots need). The pattern: use geometry to aggregate 2D information into 3D — the same principle behind multi-view stereo, Structure from Motion, and visual SLAM.

VLAs (Vision-Language-Action models) need 3D spatial reasoning but are trained on 2D images. How might 3DGS + feature fields give VLAs implicit 3D understanding without explicit 3D supervision?

⚔ Adversarial: Your 3DGS model renders beautifully on your RTX 4090 (24GB VRAM), but when you deploy to a customer's RTX 3060 (12GB), the scene loads but rendering produces black frames at 2 FPS. What went wrong?

The trained model has 3.2 million Gaussians. Each stores 59 parameters at float32. Loading seems to work (no OOM error), but the tile-sorting step takes 400ms per frame and the alpha compositing produces mostly black pixels.

The 3060 doesn't support the required CUDA compute capability
The Gaussians' spherical harmonics coefficients overflow float16
The per-tile sort buffer can't hold all overlapping Gaussians: the tile sort overflows, silently dropping Gaussians and causing missing/black pixels, while the sorting of 3.2M primitives exceeds the GPU's parallel sort capacity at this memory bandwidth

"We don't see the world as it is, we see it as we render it."

— Adapted from Anaïs Nin

You now understand how neural networks reconstruct 3D worlds from photographs. From NeRF's elegant ray marching to 3DGS's blazing-fast splatting, these techniques are redefining what's possible in graphics, VR, and robotics.

Check: How do Feature Fields (like LERF) help robots?

They make rendering faster
They reduce memory usage
They embed language-aligned features in 3D space for semantic queries
