CS 231n — 3D Vision: From Pixels to Geometry

Roadmap

What You'll Master

01 Why 3D? From Pixels to Geometry
02 Explicit: Points, Meshes, Splines
03 Implicit: SDFs, Voxels, Level Sets
04 Learning on Point Clouds: PointNet
05 Learning 3D from 2D Images
06 Deep Implicit Functions
07 Neural Radiance Fields (NeRF)
08 3D Gaussian Splatting
09 Showcase: 3D Representation Explorer
10 Structure & Connections

Chapter 01

Why 3D? — From Pixels to Geometry

Hold a coffee mug in front of you and look at it from one side. You see a cylinder with a handle. Now look at it from the top — you see a circle. The same 3D object produces completely different 2D images depending on your viewpoint. A 2D image classifier might correctly label both as "mug," but it has no idea that they're the same mug, or that the handle is on the back, or that the mug is 10cm tall.

This is the fundamental problem: a photograph destroys depth. The 3D world has occlusions (objects hiding behind other objects), scale ambiguity (a small object nearby looks identical to a large object far away), and viewpoint dependence (the same scene looks different from different angles). A flat 2D image captures none of this.

Inverse Graphics

Rendering is the forward problem: given a 3D scene (geometry, materials, lights, camera), produce a 2D image. Graphics engines do this in milliseconds. Inverse rendering is the backward problem: given a 2D image, recover the 3D scene. This is catastrophically harder.

Why? Because the mapping is many-to-one. Infinitely many 3D configurations can produce the same 2D image. A flat photograph of a sphere looks identical to a photograph of a painted circle, or a hemisphere, or an ellipsoid viewed at just the right angle. The information is simply gone.
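The many-to-one claim can be checked numerically. The sketch below assumes an idealized pinhole camera at the origin looking along +Z with focal length 1; the `project` helper and the point coordinates are illustrative, not taken from the lecture. Scaling a scene by any factor k while moving it k times farther away leaves every pixel unchanged.

```python
import numpy as np

def project(points, focal=1.0):
    """Pinhole projection (X, Y, Z) -> (f*X/Z, f*Y/Z). Assumes Z > 0."""
    points = np.asarray(points, dtype=float)
    return focal * points[:, :2] / points[:, 2:3]

# A small object close to the camera (hypothetical coordinates).
near = np.array([[0.1, 0.2, 1.0],
                 [-0.3, 0.1, 1.5],
                 [0.0, -0.2, 2.0]])

# The same object made 10x larger and placed 10x farther away.
far = 10.0 * near

pixels_near = project(near)
pixels_far = project(far)

# Both scenes land on exactly the same pixels: depth and scale are lost.
print(np.allclose(pixels_near, pixels_far))
```

This is the scale ambiguity in its simplest form: the division by Z removes the one number that would tell the two scenes apart.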

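The Rendering Equation (Simplified)
I(x, y) = ∫ L(p, ω_i) · f_r(p, ω_i, ω_o) · cos(θ) dω_i
Pixel intensity = integral over all incoming directions ω_i of incoming light × material reflectance × a geometric cosine factor, where p is the surface point seen through pixel (x, y) and ω_o points toward the camera. Forward is tractable; backward is ill-posed.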

The rendering equation tells us that each pixel's color depends on geometry (the cos(θ) term — surface orientation), material (the reflectance function f_r), and lighting (the incoming radiance L). From a single image, you can't disentangle these three factors. That's why 3D vision from 2D images requires strong priors or multiple views.

Why It Matters

Robotics: A robot arm picking up objects must know where things are in 3D space. A 2D bounding box says "the mug is somewhere in this rectangle of pixels" — useless for grasping.

Autonomous driving: A car needs to know that the pedestrian is 15 meters away, not just "somewhere in the image." Depth is life or death.

AR/VR: Augmented reality must place virtual objects at correct 3D positions. If the geometry is wrong, the illusion breaks immediately.

Medical imaging: A CT scanner captures a stack of 2D cross-sectional slices. Reconstructing the 3D organ from these slices is literally 3D vision.

The Central Question

How do we represent 3D shapes computationally? The choice of representation determines what operations are easy, what's hard, how much memory we use, and what neural networks can learn. There is no single best representation — only tradeoffs. That's what this lecture is about.

The Ill-Posed Nature of 3D from 2D

Given a single 2D image, recovering 3D is mathematically under-determined. There are infinitely many 3D scenes that project to the same image. We need additional constraints: multiple views (stereo, multi-view), depth sensors (LiDAR, structured light), learned priors (neural networks trained on 3D data), or physics-based assumptions (smoothness, symmetry).

Chapter 02

Explicit Representations: Points, Meshes, Splines

An explicit representation directly encodes the surface: you store the actual 3D coordinates of points on or near the surface. Think of it like drawing a shape by placing dots, connecting them with lines, or writing a formula that traces out the curve. You can always answer "give me a point on the surface" easily. But asking "is this arbitrary point inside or outside?" is harder.

Point Clouds

The simplest 3D representation: an unordered set of (x, y, z) coordinates, sometimes with normals (n_x, n_y, n_z) or colors. That's it — no connectivity, no topology, just a bag of points floating in space.

Definition

Point Cloud

A set P = {p₁, p₂, ..., p_N} where each p_i ∈ ℝ³ (or ℝ⁶ with normals, or ℝ⁶⁺ with color/features). The set is unordered — there is no "first" or "last" point.

Where do point clouds come from? LiDAR sensors on self-driving cars shoot laser pulses and measure time-of-flight to get depth. Depth cameras (like the iPhone's Face ID sensor) use structured light or time-of-flight. 3D scanners use triangulation. Structure from Motion (SfM) algorithms triangulate points from multiple 2D images.

Point clouds are beautifully simple, but they have real limitations. There's no notion of a surface between points — you can't smoothly render them or compute volumes. You don't know which points are "connected." And they have no topology: is the shape a sphere or a torus? A point cloud alone can't tell you.

Polygon Meshes

A polygon mesh adds connectivity to points. You store a list of vertices (3D coordinates) and a list of faces (triplets of vertex indices forming triangles). Now you have a surface: the triangles tile together, and when the mesh is closed and manifold they form a watertight boundary.

Definition

Triangle Mesh

A pair (V, F) where V = {v₁, ..., v_N} is a set of vertices (v_i ∈ ℝ³) and F = {f₁, ..., f_M} is a set of triangular faces (f_j = (i, k, l) referencing three vertices). The mesh defines a piecewise-linear surface.
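As a minimal sketch of the (V, F) definition, a mesh can be held in two NumPy arrays. The tetrahedron data and the helper names `face_areas` and `signed_volume` are illustrative assumptions. The volume formula only makes sense for a closed mesh whose faces are consistently oriented with outward normals.

```python
import numpy as np

# V: one row per vertex (a unit tetrahedron, chosen for illustration).
V = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# F: one row per triangle, each entry an index into V.
# Vertex order is chosen so every face normal points outward.
F = np.array([[0, 2, 1],
              [0, 1, 3],
              [0, 3, 2],
              [1, 2, 3]])

def face_areas(V, F):
    """Area of each triangle: half the norm of the edge cross product."""
    a, b, c = V[F[:, 0]], V[F[:, 1]], V[F[:, 2]]
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1)

def signed_volume(V, F):
    """Enclosed volume as a sum of signed tetrahedra (closed meshes only)."""
    a, b, c = V[F[:, 0]], V[F[:, 1]], V[F[:, 2]]
    return np.sum(np.einsum("ij,ij->i", a, np.cross(b, c))) / 6.0

print(face_areas(V, F).sum(), signed_volume(V, F))
```

Surface area and volume are exactly the quantities a bare point cloud cannot provide, because they depend on the connectivity stored in F.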

Meshes are the workhorse of computer graphics. The Digital Michelangelo Project scanned Michelangelo's David at 28 million vertices. Google Earth uses trillions of triangles to represent the planet. Every video game character, every Pixar movie, every CAD model — meshes.

Three fundamental mesh operations:

Operation	What It Does	When to Use
Subdivision	Splits each triangle into smaller triangles (upsample). Adds detail.	Close-up rendering, smooth surface approximation
Simplification	Merges triangles to reduce count (downsample). Removes detail.	Level-of-detail for distant objects, storage reduction
Regularization	Makes triangles more uniform in size/shape. Improves quality.	Numerical simulation (FEM), better rendering

Parametric Surfaces

Instead of storing discrete points, what if we stored a formula? A parametric surface is a function f: ℝ² → ℝ³ that maps two parameters (u, v) to a 3D point (x, y, z).

Parametric Surface
f(u, v) = (x(u,v), y(u,v), z(u,v))
Example: a sphere with radius R: f(θ, φ) = (R sinφ cosθ, R sinφ sinθ, R cosφ)
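The sphere parameterization can be evaluated directly. This sketch assumes θ ∈ [0, 2π) and φ ∈ [0, π]; the function name `sphere` and the sample count are illustrative. Every sampled point should sit at distance R from the origin.

```python
import numpy as np

def sphere(theta, phi, R=1.0):
    """f(theta, phi) = (R sin(phi) cos(theta), R sin(phi) sin(theta), R cos(phi))."""
    x = R * np.sin(phi) * np.cos(theta)
    y = R * np.sin(phi) * np.sin(theta)
    z = R * np.cos(phi)
    return np.stack([x, y, z], axis=-1)

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=1000)
phi = rng.uniform(0.0, np.pi, size=1000)

pts = sphere(theta, phi, R=2.0)
radii = np.linalg.norm(pts, axis=1)
print(radii.min(), radii.max())
```

One caveat: drawing (θ, φ) uniformly does not give points that are uniform on the surface, since samples bunch up near the poles. The parameterization guarantees that points are on the surface, not that they are evenly spread.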

Bézier curves are parametric curves controlled by a set of control points. The curve is a weighted average of the control points, using Bernstein polynomials as weights. Bézier surfaces extend this to patches controlled by a grid of control points. Splines chain multiple Bézier patches together with continuity constraints at the joints.
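The "weighted average with Bernstein weights" description translates almost line for line into code. The sketch below evaluates a Bézier curve of arbitrary degree; the function name `bezier` and the control points are illustrative assumptions.

```python
import numpy as np
from math import comb

def bezier(control_points, t):
    """Evaluate a Bezier curve at parameters t in [0, 1].

    control_points: (n+1, d) array; t: scalar or (m,) array.
    Returns an (m, d) array of curve points.
    """
    P = np.asarray(control_points, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]   # (m, 1)
    n = len(P) - 1
    i = np.arange(n + 1)                                      # (n+1,)
    binom = np.array([comb(n, k) for k in i], dtype=float)
    # Bernstein weights B_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i); each row sums to 1.
    B = binom * t**i * (1.0 - t)**(n - i)                     # (m, n+1)
    return B @ P

ctrl = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 0.0]])  # quadratic curve
curve = bezier(ctrl, [0.0, 0.5, 1.0])
print(curve)
```

The curve passes through the first and last control points and is only pulled toward the middle one, which is why a handful of control points can shape a smooth curve.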

The appeal: parametric surfaces are infinitely smooth and compact (a few control points define a complex curve). The limitation: representing complex topology (a torus, a figure-eight) requires careful patching, and the mapping from parameters to geometry can be unintuitive.

Explicit = Easy to Sample, Hard to Test

All explicit representations share a common trait: it's trivial to generate points on the surface (just evaluate the formula, or read from the list). But testing "is this point inside or outside the shape?" is hard — you'd need to cast rays and count intersections, or do other geometric computation. Implicit representations flip this tradeoff exactly.

[Interactive figure: Explicit vs. Implicit Representations. Explicit mode samples random points on the shape's surface: easy to generate, but the dots alone cannot say whether a query point is inside or outside. Implicit mode queries the implicit function f(x, y): f < 0 is inside, f > 0 is outside, and the surface is the zero-crossing f = 0.]

Chapter 03

Implicit Representations: SDFs, Voxels, Level Sets

Flip the perspective. Instead of storing where the surface is, store a function that tells you, for any point in space, whether it's inside, outside, or on the surface. The surface is defined as the zero set of this function.

Definition

Implicit Surface

A surface S defined by a function f: ℝ³ → ℝ such that S = {(x,y,z) : f(x,y,z) = 0}. Points with f < 0 are inside, f > 0 are outside. The function f is called the implicit function.

Algebraic Surfaces

The simplest implicit surfaces: the zero set of a polynomial. A sphere of radius r: f(x,y,z) = x² + y² + z² − r². A torus: f(x,y,z) = (x² + y² + z² + R² − r²)² − 4R²(x² + y²). Clean, compact, but limited in what shapes they can describe.

Constructive Solid Geometry (CSG)

CSG builds complex shapes from simple primitives using Boolean operations. You start with spheres, cubes, cylinders — each with its own implicit function — and combine them:

CSG Boolean Operations
Union: f_{A∪B} = min(f_A, f_B)
Intersection: f_{A∩B} = max(f_A, f_B)
Difference: f_{A\B} = max(f_A, −f_B)
min/max elegantly implement set operations on implicit functions.

CAD software uses CSG heavily. "Take a cylinder, subtract a smaller cylinder" gives you a pipe. "Intersect a sphere with a cube" gives you a rounded cube. The beauty is composability — each Boolean operation is just a min or max.
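A short sketch of the three Boolean operations, using two overlapping unit spheres as primitives. The sphere function ‖p − c‖ − r, the helper names, and the query points are assumptions made for illustration; the sign convention (negative inside) follows the definition above.

```python
import numpy as np

def sphere_f(p, center, r):
    """Implicit function of a sphere: negative inside, positive outside."""
    return np.linalg.norm(p - center, axis=-1) - r

def union(fa, fb):
    return np.minimum(fa, fb)

def intersection(fa, fb):
    return np.maximum(fa, fb)

def difference(fa, fb):
    return np.maximum(fa, -fb)

# Query points: only in A, only in B, in both, in neither.
p = np.array([[-0.5, 0.0, 0.0],
              [1.5, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [3.0, 0.0, 0.0]])

fa = sphere_f(p, np.array([0.0, 0.0, 0.0]), 1.0)  # A: unit sphere at the origin
fb = sphere_f(p, np.array([1.0, 0.0, 0.0]), 1.0)  # B: unit sphere at x = 1

in_union = union(fa, fb) < 0
in_intersection = intersection(fa, fb) < 0
in_difference = difference(fa, fb) < 0
print(in_union, in_intersection, in_difference)
```

Because each operation returns another implicit function, results can be fed straight back into further operations, which is the composability the text describes.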

Signed Distance Functions (SDFs)

A signed distance function is a special implicit function where the value at any point equals the distance to the nearest surface, with a sign indicating inside (negative) or outside (positive).

Signed Distance Function
SDF(x) = ± min_{y ∈ S} ||x − y||
Negative inside, zero on surface, positive outside. The gradient has unit norm, ||∇SDF|| = 1, wherever the SDF is differentiable (Eikonal equation).

SDFs have a magical property: you can smoothly blend shapes by interpolating their distance fields. Where CSG gives hard Boolean edges, SDF blending produces organic, smooth transitions — like two water droplets merging.

Worked Example — SDF of a Sphere

Sphere centered at origin with radius r: SDF(x,y,z) = √(x² + y² + z²) − r. At the origin, SDF = −r (deepest inside). At the surface, SDF = 0. At a point outside the sphere at distance d from the surface, SDF = d; at a point inside at depth d, SDF = −d. This works for any point in space, with no mesh intersection and no ray casting.
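The sphere SDF and the unit-gradient property can be checked numerically. This sketch uses central finite differences; `numerical_gradient`, the step size, and the random query points are assumptions for illustration.

```python
import numpy as np

def sdf_sphere(p, r=1.0):
    """Signed distance to a sphere of radius r centered at the origin."""
    return np.linalg.norm(p, axis=-1) - r

def numerical_gradient(f, p, eps=1e-5):
    """Central-difference gradient of a scalar field f at points p of shape (N, 3)."""
    grad = np.zeros_like(p)
    for k in range(p.shape[-1]):
        step = np.zeros(p.shape[-1])
        step[k] = eps
        grad[..., k] = (f(p + step) - f(p - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
pts = rng.uniform(-2.0, 2.0, size=(100, 3))

values = sdf_sphere(pts)
grad_norms = np.linalg.norm(numerical_gradient(sdf_sphere, pts), axis=1)
print(grad_norms.min(), grad_norms.max())
```

The gradient norms should all come out very close to 1. The one place this fails for a sphere is the center itself, where the distance function has a kink and the gradient is undefined.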

Level Sets

Instead of a closed-form function, store the implicit function on a 3D grid. Each voxel holds a floating-point value. The surface is the set of points where the stored value crosses zero — the zero level set. This is how medical imaging (CT scans, MRIs) naturally represents anatomy: each voxel stores tissue density, and the organ boundary is a level set.

Voxels

The simplest volumetric representation: divide space into a regular 3D grid of cubes. Each voxel (volume pixel) stores a binary value: 1 if the shape is present, 0 otherwise. Simple, intuitive, and catastrophically expensive. A 256³ grid has 16.7 million voxels. A 512³ grid has 134 million. Memory scales as O(N³) in the per-axis resolution N, which makes fine detail prohibitive.

Marching Cubes

Given a volumetric field (SDF, level set, or voxel grid), how do you extract a triangle mesh for rendering? The Marching Cubes algorithm walks through the grid one cube at a time. Each cube has 8 corners, each either inside or outside. That's 2⁸ = 256 configurations, which reduce to 15 unique cases by symmetry. For each configuration, there's a pre-computed template of triangles that approximate the surface within that cube.

Marching Cubes

For each cube in the 3D grid (8 corners per cube):
1. Classify corners: f < 0 is inside, f ≥ 0 is outside. This gives an 8-bit index (256 possibilities).
2. Look up the triangle template for this configuration (one of 15 unique patterns).
3. Interpolate vertex positions along edges where sign changes (linear interpolation between the two f-values).
4. Output triangles. Stitch them with neighbors to form a watertight mesh.
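The corner classification and edge interpolation steps can be sketched without the 256-entry triangle table, which is omitted here. The corner ordering in `CORNERS` is an assumption of this sketch; real implementations fix their own corner and edge numbering to match their lookup table.

```python
import numpy as np

# Corner offsets of one grid cube (ordering chosen for this sketch only).
CORNERS = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                    [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=float)

def cube_index(values):
    """8-bit configuration index: bit k is set when corner k is inside (f < 0)."""
    index = 0
    for k, v in enumerate(values):
        if v < 0:
            index |= 1 << k
    return index

def edge_crossing(p0, p1, f0, f1):
    """Point on edge p0-p1 where f crosses zero, by linear interpolation.

    Only valid when f0 and f1 have opposite signs.
    """
    t = f0 / (f0 - f1)
    return p0 + t * (p1 - p0)

# Field: a sphere of radius 0.5 centered on corner 0 of the cube.
values = np.linalg.norm(CORNERS, axis=1) - 0.5

idx = cube_index(values)
crossing = edge_crossing(CORNERS[0], CORNERS[1], values[0], values[1])
print(idx, crossing)
```

Only corner 0 is inside, so only bit 0 of the index is set, and the interpolated vertex on the edge from corner 0 to corner 1 lies on the sphere. A full implementation would use the index to look up which edges to connect into triangles.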

Implicit = Easy to Test, Hard to Sample

Implicit representations flip the explicit tradeoff: testing "is this point inside or outside?" is instant (evaluate f and check the sign). But generating a point on the surface requires root-finding — you must search for where f = 0. This is why we need marching cubes: it systematically finds the zero-crossings by scanning the grid.

Representation	Sample Surface	Inside/Outside	Memory	Topology
Point cloud	Trivial	Hard	O(N)	No info
Triangle mesh	Easy	Ray casting	O(N)	Explicit
Parametric	Evaluate f(u,v)	Hard	O(1) per patch	Patch-based
Voxel grid	March cubes	Lookup	O(N³)	Implicit
SDF	March cubes	Sign of f	Continuous/grid	Implicit
Level set	March cubes	Sign of f	O(N³)	Implicit

Chapter 04

Learning on Point Clouds: PointNet

You've got a point cloud — a bag of 3D coordinates. You want to classify it ("is this a chair or a table?") or segment it ("which points belong to the seat vs. the leg?"). The challenge: a point cloud is a set, not a sequence. There's no canonical ordering. If you shuffle the points, the answer shouldn't change.

Definition

Permutation Invariance

A function f is permutation invariant if for any permutation π of the input set: f({x₁, ..., x_N}) = f({x_π(1), ..., x_π(N)}). The output doesn't depend on the order of the inputs.

A standard neural network (MLP, CNN) takes an ordered vector as input. Feed it [point1, point2, point3] and it gives one answer; feed it [point3, point1, point2] and it gives a different answer. That's wrong for point clouds. We need a symmetric function — one that treats all permutations identically.

The PointNet Insight

Here's the key theorem: any continuous symmetric function on a set can be approximated by a function of the form:

PointNet Architecture
h({x₁, ..., x_N}) = g( MAX(f(x₁), f(x₂), ..., f(x_N)) )

where:
  f: ℝ³ → ℝ^K   — per-point MLP (shared weights)
  MAX: element-wise max over N points → ℝ^K
  g: ℝ^K → output   — classification MLP

Read it from the inside out. Each point x_i ∈ ℝ³ is independently passed through the same MLP f, producing a K-dimensional feature vector. Then you take the element-wise maximum across all N points, collapsing the set into a single K-dimensional vector. This is the global feature. Finally, another MLP g maps this global feature to the output (class probabilities).
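A minimal NumPy sketch of h = g(MAX(f(x₁), ..., f(x_N))) makes the permutation invariance concrete. The weights are random and untrained, and the layer sizes, `K`, and `NUM_CLASSES` are hypothetical values chosen for illustration; this is not the published PointNet configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 64            # size of the global feature (hypothetical)
NUM_CLASSES = 4   # number of output classes (hypothetical)

# Random, untrained weights: enough to demonstrate the structure.
W1, b1 = rng.normal(size=(3, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, K)), np.zeros(K)
W3, b3 = rng.normal(size=(K, NUM_CLASSES)), np.zeros(NUM_CLASSES)

def relu(x):
    return np.maximum(x, 0.0)

def f(points):
    """Shared per-point MLP: (N, 3) -> (N, K). Each row is processed independently."""
    return relu(relu(points @ W1 + b1) @ W2 + b2)

def g(global_feature):
    """Classifier head: (K,) -> (NUM_CLASSES,) logits."""
    return global_feature @ W3 + b3

def pointnet(points):
    global_feature = f(points).max(axis=0)   # element-wise max over the N points
    return g(global_feature)

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(len(cloud))]

print(np.allclose(pointnet(cloud), pointnet(shuffled)))
```

Shuffling the rows of `cloud` only reorders the rows of `f(points)`, and a column-wise max does not care about row order, so the logits are unchanged.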

Why Max Pooling?

Max is a symmetric function: max(a,b,c) = max(c,a,b) = max(b,c,a). It doesn't matter what order you feed the points in — the max is the same. Sum and mean also work, but max performs best empirically because it acts like a "voting" mechanism: each dimension of the K-vector captures whether any point has a particular geometric feature.

Universal Approximation

PointNet isn't just a hack. Qi et al. proved that this architecture can approximate any continuous symmetric function to arbitrary accuracy, given enough dimensions K. The max-pooling layer is the key: it's a universal symmetric aggregator. The per-point MLP f learns to map each point into a high-dimensional space where max-pooling extracts all the geometric information.

Proof Sketch

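Consider the set of continuous symmetric functions on sets of up to N points. For any such function h and any tolerance ε > 0, there exist continuous functions f: ℝ³ → ℝ^K and g: ℝ^K → ℝ such that |h(S) − g(MAX_{x ∈ S} f(x))| < ε for every point set S, provided K is large enough. The argument in Qi et al. is constructive: partition the input domain into K cells, let each coordinate of f act as a soft indicator for one cell so that MAX records which cells are occupied, and let g evaluate h on that occupancy pattern. Continuity of h with respect to the Hausdorff distance between point sets bounds the error.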

Segmentation

For classification, the global feature (after max pooling) goes to a classifier. For part segmentation (labeling each point), PointNet concatenates the global feature with each point's local feature, then runs a per-point classifier. Each point sees both its own local geometry AND the global shape context.

PointNet Segmentation
Per-point output: g_seg( [f(x_i) ; global_feature] )
Concatenate local feature with global feature, then per-point MLP.

PointNet++ and Beyond

PointNet's weakness: it processes each point independently before pooling, so it misses local structure. Two points that are spatially close don't interact until the global max pool. PointNet++ fixes this with hierarchical grouping: apply PointNet to local neighborhoods (like a convolution), pool within each neighborhood, then apply PointNet again to the resulting points, and so on. It's like building a spatial hierarchy — local features first, then progressively more global ones.

DGCNN (Dynamic Graph CNN) builds a k-nearest-neighbor graph in feature space (not just 3D space) and applies graph convolutions. The graph is recomputed at each layer, so the connectivity adapts as features evolve.

Distance Metrics for Point Clouds

To train generative models or evaluate reconstructions, we need to measure how "close" two point clouds are. Two standard metrics:

Chamfer Distance
CD(S₁, S₂) = ∑_{x ∈ S₁} min_{y ∈ S₂} ||x − y||² + ∑_{y ∈ S₂} min_{x ∈ S₁} ||x − y||²
For each point in S1, find nearest point in S2, and vice versa. Cheap but can miss global structure.

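Earth Mover's Distance (EMD)
EMD(S₁, S₂) = min_{φ:S₁→S₂} ∑_{x ∈ S₁} ||x − φ(x)||²
Find the optimal bijection φ between the two sets, which requires |S₁| = |S₂|. Because every point must be matched exactly once, EMD is more sensitive to global structure than Chamfer, but solving the assignment exactly costs O(N³).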
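The Chamfer distance is short enough to write out in full. This sketch follows the summed (unnormalized) formula given above and builds the full N × M distance matrix, which is fine for small sets; the function name and test points are illustrative. EMD is left out because it needs an assignment solver.

```python
import numpy as np

def chamfer_distance(S1, S2):
    """Sum of squared nearest-neighbor distances in both directions.

    S1: (N, 3) array, S2: (M, 3) array. Memory use is O(N * M).
    """
    diff = S1[:, None, :] - S2[None, :, :]      # (N, M, 3)
    d2 = np.sum(diff**2, axis=-1)               # squared distances, (N, M)
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

A = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0]])

print(chamfer_distance(A, A), chamfer_distance(A, B))
```

Note that many points in one set may share the same nearest neighbor in the other. That is what makes Chamfer cheap, and also why it can overlook differences in how points are distributed.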

[Interactive figure: PointNet permutation-invariant processing. Each point passes through the shared MLP, then max pooling aggregates everything into one global feature vector. Shuffling the input leaves the global feature identical, which is permutation invariance.]

Chapter 05

Learning 3D from 2D Images

Here's a surprising result from 2015: if you want to classify 3D shapes, don't bother with fancy 3D architectures. Just render the shape from many viewpoints, run a standard 2D CNN on each view, and combine the results. This approach — Multi-View CNN (MVCNN) — beat every direct 3D method on the ModelNet benchmark.

Multi-View CNN Architecture

Multi-View CNN

1. Render the 3D shape from N viewpoints (e.g., 12 views around the object, evenly spaced on a circle).
2. Run a CNN (e.g., VGG, ResNet) on each rendered image independently, extracting a feature vector per view.
3. View pooling: Take the element-wise maximum across all N feature vectors (same idea as PointNet!).
4. Classify: Feed the pooled feature vector into a final classifier (fully connected layers + softmax).

Result: 89.9% accuracy on ModelNet40 (40-class 3D shape classification), versus ~77% for the best voxel-based 3D CNN at the time. The 2D CNN had seen millions of natural images during pretraining (ImageNet); the 3D CNN was trained from scratch on a much smaller dataset.

Why 2D Beats 3D (For Now)

2D CNNs benefit from massive pretraining on ImageNet. They've learned rich feature hierarchies — edges, textures, parts, objects. 3D CNNs start from scratch with orders of magnitude less training data. The rendered views effectively transfer all that 2D knowledge to the 3D task. As 3D datasets grow, this gap narrows.

3D Convolutions on Voxels

The direct approach: voxelize the 3D shape into a 3D grid (like an image, but with depth), then run 3D convolutions. A 3D conv kernel is a cube (e.g., 3×3×3) that slides through the volume.

The problem: O(N³) memory. A 32³ voxel grid is manageable (~32K voxels). A 128³ grid has 2 million voxels. A 256³ grid has 16.7 million. Most of those voxels are empty — the surface occupies only a thin shell. You're paying cubic memory for quadratic information.

3D GANs

3D-GANs extend generative adversarial networks to voxel grids. The generator takes a random latent vector z ∈ ℝ²⁰⁰ and produces a 64³ voxel grid via 3D transposed convolutions. The discriminator classifies voxel grids as real or generated. The result: you can sample random 3D shapes from a learned distribution.

Octree Representations

The fix for cubic memory: octrees. An octree recursively subdivides space into 8 octants, but only subdivides where the surface is present. Empty regions stay as large, coarse blocks. The surface region gets fine subdivisions.

Octree Efficiency

For a sphere at resolution N: uniform voxel grid uses O(N³) voxels. An octree uses O(N² log N) — only the surface shell is refined. At N=256, that's 16.7M vs. ~1.5M. At N=1024, it's 1.07B vs. ~30M. The savings grow with resolution.
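The saving can be measured on a toy case. The sketch below counts the leaf cells of an octree that refines only the cells crossed by a sphere's surface, down to a finest level equivalent to an N³ grid. The domain [−1, 1]³, the radius 0.8, and the function name are assumptions for illustration; exact counts depend on these choices, so none are quoted here.

```python
import numpy as np

def count_octree_leaves(lo, size, depth, max_depth, radius):
    """Count leaves of an octree refined only where a sphere's surface passes.

    The cell is the cube [lo, lo + size]^3; the sphere is centered at the origin.
    """
    hi = lo + size
    # Smallest and largest distance from the origin to any point of the cell.
    nearest = np.linalg.norm(np.maximum(np.maximum(lo, -hi), 0.0))
    farthest = np.linalg.norm(np.maximum(np.abs(lo), np.abs(hi)))
    touches_surface = nearest <= radius <= farthest
    if depth == max_depth or not touches_surface:
        return 1                      # empty, full, or finest-level cell: keep as a leaf
    half = size / 2.0
    total = 0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                child_lo = lo + half * np.array([dx, dy, dz])
                total += count_octree_leaves(child_lo, half, depth + 1, max_depth, radius)
    return total

root = np.array([-1.0, -1.0, -1.0])
leaves_16 = count_octree_leaves(root, 2.0, 0, 4, 0.8)   # finest cells match a 16^3 grid
leaves_32 = count_octree_leaves(root, 2.0, 0, 5, 0.8)   # finest cells match a 32^3 grid

print(leaves_16, 16**3)
print(leaves_32, 32**3)
```

Doubling the resolution multiplies the dense grid by 8, while the octree leaf count should grow by a noticeably smaller factor because only the thin shell around the surface is refined.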

Differentiable Rendering

The key enabler for learning 3D from 2D: make the rendering process differentiable. If you can compute gradients of the rendered image with respect to the 3D shape parameters, you can train a neural network to predict 3D shapes by comparing rendered views to ground-truth images.

Standard rasterization is not differentiable (a pixel is either covered by a triangle or not — there's no gradient). Differentiable renderers soften this binary decision: each pixel gets a smooth probability of being covered, and gradients flow through. Neural mesh renderer, SoftRasterizer, and PyTorch3D all implement variants of this idea.

The Supervision Gap

3D supervision (ground-truth meshes, point clouds, SDFs) is expensive to acquire. 2D supervision (photographs) is essentially free. Differentiable rendering bridges this gap: train on 2D images, learn 3D geometry. This philosophy drives NeRF, 3D Gaussian Splatting, and most modern 3D reconstruction methods.

Chapter 06

Deep Implicit Functions: DeepSDF & Occupancy Networks

Voxels are too expensive (O(N³)). Point clouds have no surface. Meshes are hard to generate with neural networks (you need to predict both vertices and connectivity, which is a combinatorial problem). What if you trained a neural network to be the implicit function?

DeepSDF

DeepSDF (Park et al., 2019) trains an MLP to be a signed distance function. The input is a 3D coordinate plus a learnable latent code z that encodes the shape identity. The output is the signed distance at that point.

DeepSDF
f_θ(x, z) → SDF value

where:
  x ∈ ℝ³: query point in space
  z ∈ ℝ^D: latent shape code (a vector, not the third spatial coordinate)
  θ: MLP parameters (shared across all shapes)

During training, you have ground-truth SDF samples from known shapes. For each shape, you jointly optimize the MLP weights θ and that shape's latent code z to minimize:

DeepSDF Loss
L(θ, z₁, ..., z_K) = ∑_{i=1}^{K} ∑_{(x,s) ∈ X_i} |clamp(f_θ(x, z_i), δ) − clamp(s, δ)|
clamp(x, δ) = max(−δ, min(δ, x)). Clamping focuses learning near the surface.
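The clamped L1 loss for one shape is a few lines of NumPy. In this sketch `pred` stands in for the network outputs f_θ(x, z_i) and `target` for the ground-truth distances s; the default δ = 0.1 and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def clamp(x, delta):
    """clamp(x, delta) = max(-delta, min(delta, x))."""
    return np.clip(x, -delta, delta)

def deepsdf_loss(pred, target, delta=0.1):
    """Clamped L1 loss summed over the SDF samples of one shape."""
    return np.sum(np.abs(clamp(pred, delta) - clamp(target, delta)))

# Hypothetical predictions and ground-truth distances at three query points.
pred = np.array([0.5, 0.02, -0.3])
target = np.array([0.9, 0.05, -0.25])

loss = deepsdf_loss(pred, target)
print(loss)
```

The first and third samples are far from the surface, so both prediction and target are clamped to the same value and contribute nothing. Only the sample near the surface produces a gradient, which is how clamping concentrates capacity where the geometry is.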

Auto-Decoder

DeepSDF doesn't use an encoder. There's no network that maps a shape to its latent code. Instead, latent codes are directly optimized as free parameters during training — this is called an auto-decoder. At test time, given a new shape (as a partial point cloud), you freeze θ and optimize z to fit the observations. This is slower than a feed-forward encoder but produces better reconstructions.

Occupancy Networks

Occupancy Networks (Mescheder et al., 2019) take a different approach: instead of predicting signed distance, predict the probability of occupancy.

Occupancy Network
f_θ(x, z) → o ∈ [0, 1]
o = probability that the query point x ∈ ℝ³ is inside the shape described by latent code z. Surface is where o = 0.5.

Trained with binary cross-entropy: for each query point, the ground truth is 1 (inside) or 0 (outside). Unlike DeepSDF, Occupancy Networks use an encoder — a PointNet or ResNet that maps the input observation to the latent code z in a single forward pass.

Why Deep Implicit Functions Matter

Property	Voxels (64³)	Point Cloud	DeepSDF / OccNet
Resolution	Fixed, coarse	Depends on N	Continuous (infinite)
Memory	O(N³)	O(N)	O(|θ|) network params
Surface quality	Blocky	No surface	Smooth
Topology	Implicit	None	Arbitrary (learned)
Mesh extraction	Marching cubes	Poisson, ball pivot	Marching cubes on grid eval

The Deep Implicit Revolution

A single 8-layer MLP with 256 hidden units can represent complex 3D geometry at arbitrary resolution. The network weights are the shape. You evaluate it at any point you want — no grid, no predefined resolution. This is the conceptual shift: from storing geometry in data structures to encoding geometry in neural network weights.

Inference Pipeline

Step 1: Given a new observation (image, partial point cloud), compute latent code z (via encoder or optimization).
Step 2: Evaluate f_θ(x, z) on a dense 3D grid (e.g., 256³).
Step 3: Run Marching Cubes to extract the zero level set as a triangle mesh.
Step 4: Render the mesh.
Total: a photograph in, a 3D mesh out.

Chapter 07

Neural Radiance Fields: NeRF

DeepSDF learns geometry. But what about appearance? A 3D scene isn't just shape — it's shape plus material plus lighting. What if a neural network could learn both simultaneously, trained from nothing but ordinary photographs?

That's NeRF (Mildenhall et al., 2020). It represents a scene as a continuous volumetric function that maps a 3D position and viewing direction to a color and density. No mesh. No point cloud. Just an MLP that you can query at any point in space to ask: "What does this look like from this direction?"

NeRF: Scene as a Function
F_Θ: (x, y, z, θ, φ) → (R, G, B, σ)

where:
  Θ: MLP weights (capitalized to avoid a clash with the viewing angle θ)
  (x, y, z): 3D position
  (θ, φ): viewing direction (azimuth, elevation)
  (R, G, B): emitted color at that point from that direction
  σ: volume density (how opaque this point is)

Density σ depends only on position (geometry is view-independent). Color depends on both position and direction (materials reflect differently from different angles — think of specular highlights on a glossy surface).

Volume Rendering

To render a pixel, shoot a ray from the camera through that pixel into the scene. Sample N points along the ray. At each sample, query the network for color and density. Then composite the colors using the volume rendering equation:

Volume Rendering Equation
C(r) = ∑_{i=1}^{N} T_i · α_i · c_i

where:
  T_i = ∏_{j=1}^{i−1} (1 − α_j): transmittance (light NOT blocked by earlier samples)
  α_i = 1 − exp(−σ_i · δ_i): opacity of sample i (δ_i = distance between samples)
  c_i: color at sample i

Read it step by step. T_i tracks how much light has survived from the camera to sample i — it starts at 1 (nothing blocking) and decreases as it hits dense regions. α_i is how opaque this particular segment is (high density σ or large step size δ means more blocking). c_i is the color we see at this point. Multiply all three: this sample contributes its color, weighted by its opacity and the accumulated transparency.
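The compositing step for a single ray is a cumulative product and a weighted sum. In this sketch the densities, step sizes, and colors are made-up inputs standing in for network outputs; the function name `composite` is an assumption.

```python
import numpy as np

def composite(sigmas, deltas, colors):
    """Discrete volume rendering along one ray.

    sigmas: (N,) densities, deltas: (N,) step sizes, colors: (N, 3).
    Returns the pixel color and the per-sample weights T_i * alpha_i.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i is the product of (1 - alpha_j) over earlier samples; T_1 = 1.
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    return weights @ colors, weights

# Three samples: empty space, a nearly opaque green surface, then blue behind it.
sigmas = np.array([0.0, 1e3, 5.0])
deltas = np.full(3, 0.1)
colors = np.eye(3)            # rows: red, green, blue

pixel, weights = composite(sigmas, deltas, colors)
print(pixel, weights)
```

The empty sample gets zero weight, the dense green sample absorbs essentially all the light, and the blue sample behind it is occluded because its transmittance has dropped to zero. The weights never sum to more than 1.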

Why Volume Rendering Is Magical

Every operation in the volume rendering equation — multiplication, addition, exponentiation — is differentiable. That means you can backpropagate from a rendered pixel all the way back through the volume rendering integral to the MLP weights. This is the key insight: you train NeRF using only 2D photographs + camera poses. No 3D supervision at all.

Positional Encoding

MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. NeRF fixes this with positional encoding: before feeding (x,y,z) into the MLP, map it to a higher-dimensional space using sinusoids at multiple frequencies.

Positional Encoding
γ(p) = (sin(2⁰πp), cos(2⁰πp), sin(2¹πp), cos(2¹πp), ..., sin(2^{L−1}πp), cos(2^{L−1}πp))
Applied independently to each coordinate. L=10 for position (3 × 2 × 10 = 60D), L=4 for direction (3 × 2 × 4 = 24D, with the direction expressed as a 3D unit vector).

Without positional encoding, NeRF produces blurry renderings. With it, the network can represent high-frequency detail — individual bricks on a wall, text on a sign, the weave of a fabric.
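A sketch of the encoding, applied independently to each coordinate with frequencies 2⁰π through 2^(L−1)π. The function name and the example point are illustrative; the output ordering (all frequencies for the first coordinate, then the second, and so on) is one reasonable choice, not necessarily the one used in any particular codebase.

```python
import numpy as np

def positional_encoding(p, L):
    """Map each coordinate of p to (sin(2^k pi p), cos(2^k pi p)) for k = 0..L-1.

    p: array of shape (..., D). Returns an array of shape (..., 2 * L * D).
    """
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(L)) * np.pi                        # (L,)
    angles = p[..., None] * freqs                                # (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)    # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)

position = np.array([0.5, 0.0, -0.25])
direction = np.array([0.0, 0.0, 1.0])      # unit viewing direction

enc_pos = positional_encoding(position, L=10)
enc_dir = positional_encoding(direction, L=4)
print(enc_pos.shape, enc_dir.shape)
```

Two nearby positions that differ by a tiny amount have almost identical raw coordinates, but their highest-frequency sinusoids differ substantially, which gives the MLP something to latch onto for fine detail.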

Hierarchical Sampling

Naive approach: sample N points uniformly along each ray. Wasteful — most of the ray passes through empty space. NeRF uses a coarse-to-fine strategy:

Hierarchical Volume Sampling

1. Coarse network: Sample N_c = 64 points uniformly along the ray. Evaluate the coarse MLP. Compute approximate weights w_i = T_iα_i.
2. Importance sampling: Normalize {w_i} so it forms a probability distribution along the ray. Sample N_f = 128 additional points concentrated where the coarse network predicts high density.
3. Fine network: Evaluate all N_c + N_f = 192 points through the fine MLP. Compute the final color using all samples.
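The importance-sampling step is inverse-transform sampling from a piecewise-constant distribution. The sketch below simplifies by assuming one weight per depth bin; `sample_fine`, the bin edges, and the weights are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def sample_fine(bin_edges, weights, n_fine, rng):
    """Draw n_fine depths from the piecewise-constant PDF defined by the weights.

    bin_edges: (B + 1,) increasing depths; weights: (B,) non-negative coarse weights.
    """
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)   # small floor avoids empty bins
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])     # (B + 1,)
    u = rng.uniform(size=n_fine)
    # Find the bin whose CDF interval contains each u.
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    # Place the sample proportionally inside that bin.
    frac = (u - cdf[idx]) / pdf[idx]
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

rng = np.random.default_rng(0)
bin_edges = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
weights = np.array([0.0, 0.0, 1.0, 0.0])     # coarse pass found a surface in [4, 5]

fine_depths = sample_fine(bin_edges, weights, n_fine=128, rng=rng)
print(np.mean((fine_depths >= 4.0) & (fine_depths <= 5.0)))
```

Almost all fine samples should fall in the bin where the coarse weights are large, so the fine network spends its evaluations near the surface instead of in empty space.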

Training NeRF

Input: a set of photographs with known camera poses (position and orientation for each image). Loss: the L2 difference between the rendered pixel color and the ground-truth pixel color.

NeRF Training Loss
L = ∑_{r ∈ R} ||Ĉ(r) − C(r)||²
Sum over rays r in a batch. Ĉ is the rendered color, C is the ground truth.

That's it. No 3D supervision, no depth maps, no segmentation masks. Just "make the rendered images match the real photographs." The 3D geometry and appearance emerge as a byproduct of matching 2D observations.

[Interactive figure: Volume Rendering Along a Ray. A ray travels through a 1D "scene" with colored density regions. At each sample point the figure shows the transmittance T (how much light survived so far), the opacity α (how much this sample blocks), and the weighted color contribution, which accumulate into the final pixel color. More samples make the rendering converge; higher density makes objects more opaque.]

Limitations and Extensions

NeRF is slow. Training a single scene takes 1-2 days. Rendering a single image takes 30+ seconds (hundreds of MLP evaluations per pixel, millions of pixels per image). The scene is baked into the network weights — no generalization to new scenes.

Extension	Key Idea	Improvement
Instant-NGP	Hash-based feature grids instead of positional encoding	Training in seconds, not hours
Mip-NeRF	Integrated positional encoding over cone frustums	Anti-aliased rendering, multi-scale
NeRF-W	Per-image appearance/transient embeddings	Works with internet photos (varying lighting)
Plenoxels	Explicit voxel grid with spherical harmonics	No MLP needed, 100× faster training
3DGS	Gaussian blobs instead of dense volume	Real-time rendering (see next chapter)

Chapter 08

3D Gaussian Splatting

NeRF treats the scene as a dense implicit field: to render a pixel, you must query the MLP at hundreds of points along every ray. That's slow. 3D Gaussian Splatting (3DGS, Kerbl et al., 2023) inverts this: instead of querying a dense field, scatter a collection of explicit 3D blobs onto the image. The blobs are 3D Gaussians — each one a fuzzy, colored ellipsoid floating in space.

Definition

3D Gaussian Primitive

Each Gaussian is parameterized by: μ ∈ ℝ³ (position), Σ ∈ ℝ^3×3 (covariance — shape/orientation of the ellipsoid), α ∈ [0,1] (opacity), and color (represented as spherical harmonics coefficients for view-dependent appearance). A typical scene uses 1-5 million Gaussians.

Rendering: Splatting

NeRF shoots rays into the scene (ray marching). 3DGS does the opposite: it projects Gaussians onto the image plane (splatting). Each 3D Gaussian projects to a 2D Gaussian on the screen. Then you sort the 2D Gaussians by depth and alpha-composite them front-to-back, exactly like stacking semi-transparent colored blobs.

Gaussian Projection
Σ' = J W Σ W^T J^T

where:
  Σ' — 2D covariance in image space
  Σ — 3D covariance in world space
  W — viewing transformation matrix
  J — Jacobian of the projective transformation
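A sketch of the covariance projection for a pinhole camera. It assumes the camera looks along +z, that W is the 3×3 rotation part of the world-to-camera transform, and that J is the Jacobian of (x, y, z) → (f_x·x/z, f_y·y/z) evaluated at the Gaussian's center in camera coordinates. The function name and all numeric values are illustrative.

```python
import numpy as np

def project_covariance(Sigma, W, mean_cam, fx, fy):
    """2D image-space covariance: Sigma' = J W Sigma W^T J^T.

    Sigma: (3, 3) world-space covariance.
    W: (3, 3) rotation part of the world-to-camera transform.
    mean_cam: Gaussian center in camera coordinates (z > 0).
    """
    x, y, z = mean_cam
    # Jacobian of the perspective projection at the Gaussian center.
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    return J @ W @ Sigma @ W.T @ J.T

# An isotropic blob with standard deviation 0.1, two units in front of the camera.
Sigma = (0.1**2) * np.eye(3)
W = np.eye(3)
Sigma_2d = project_covariance(Sigma, W, mean_cam=(0.0, 0.0, 2.0), fx=100.0, fy=100.0)
print(Sigma_2d)
```

For a blob on the optical axis the result is isotropic with standard deviation f·s/z, here 100 × 0.1 / 2 = 5 pixels. Off-axis or anisotropic Gaussians project to tilted ellipses through the same formula.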

Alpha Compositing (Front-to-Back)
C(p) = ∑_{i=1}^{N} c_i · α_i · G_i(p) · ∏_{j=1}^{i−1} (1 − α_j · G_j(p))
G_i(p) = Gaussian evaluated at pixel p. Sort by depth, composite front-to-back.
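The compositing sum for one pixel can be written as a simple loop over depth-sorted Gaussians. This is a per-pixel sketch for clarity; real implementations process tiles of pixels in parallel on the GPU. The function names and test values are assumptions.

```python
import numpy as np

def gaussian_2d(p, mean, cov):
    """Unnormalized 2D Gaussian falloff, equal to 1 at the mean."""
    d = p - mean
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d))

def composite_pixel(p, means, covs, opacities, colors, depths):
    """Front-to-back alpha compositing of projected Gaussians at pixel p."""
    color = np.zeros(3)
    transmittance = 1.0
    for i in np.argsort(depths):                      # nearest Gaussian first
        a = opacities[i] * gaussian_2d(p, means[i], covs[i])
        color += transmittance * a * colors[i]
        transmittance *= 1.0 - a
    return color

p = np.array([0.0, 0.0])
means = np.zeros((2, 2))                              # both splats centered on the pixel
covs = np.stack([np.eye(2), np.eye(2)])
colors = np.array([[1.0, 0.0, 0.0],                   # Gaussian 0: red
                   [0.0, 0.0, 1.0]])                  # Gaussian 1: blue

half = composite_pixel(p, means, covs, np.array([0.5, 1.0]), colors, np.array([1.0, 2.0]))
print(half)
```

With a half-transparent red splat in front of an opaque blue one, the pixel receives half of each color. The running `transmittance` plays the same role as T_i in NeRF's volume rendering equation.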

This looks almost identical to NeRF's volume rendering equation — and it is. The difference: NeRF samples densely along rays; 3DGS only evaluates at the Gaussian positions. Since most of space is empty, this is dramatically faster.

Initialization

3DGS starts from a Structure-from-Motion (SfM) point cloud. Run COLMAP on your input images to get a sparse 3D reconstruction and camera poses. Each SfM point becomes the initial center μ of a Gaussian. The initial covariance is isotropic (spherical blob), initial opacity is low, and initial color is the average color of the corresponding 2D point.

Training: Differentiable Splatting

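The entire splatting pipeline is differentiable. The loss combines an L1 term and a D-SSIM (structural dissimilarity) term between rendered and ground-truth images: L = (1 − λ) · L1 + λ · L_D-SSIM, with λ = 0.2 in the original paper. Gradients flow back to all Gaussian parameters: position, covariance, color, opacity.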
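
The photometric loss can be sketched as follows, assuming a weighted sum (1 − λ) · L1 + λ · D-SSIM with λ = 0.2 and D-SSIM taken as 1 − SSIM. To keep the example short, SSIM is computed from whole-image statistics; standard SSIM averages the same expression over small sliding windows, so treat this as a simplification and not a drop-in replacement.

```python
import numpy as np

def ssim_global(a, b, c1=0.01**2, c2=0.03**2):
    """SSIM from whole-image statistics (simplified: no sliding windows).

    Inputs are float arrays with values in [0, 1].
    """
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, with D-SSIM taken as 1 - SSIM here."""
    l1 = np.abs(rendered - target).mean()
    d_ssim = 1.0 - ssim_global(rendered, target)
    return (1.0 - lam) * l1 + lam * d_ssim

rng = np.random.default_rng(0)
target = rng.uniform(size=(8, 8))
rendered = np.clip(target + rng.normal(0.0, 0.1, size=(8, 8)), 0.0, 1.0)
loss = photometric_loss(rendered, target)
```

The L1 term penalizes per-pixel color error, while the SSIM term rewards matching local structure, which tends to reduce blur in the optimized result.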

3DGS Training Loop

Initialize Gaussians from SfM point cloud.
For each training image: project all Gaussians, sort by depth, alpha-composite, render image.
Compute loss vs. ground truth. Backprop to all Gaussian parameters.
Adaptive density control every 100 iterations:
- Gaussians with large position gradients but small scale → clone (duplicate the Gaussian to cover an under-reconstructed region)
- Gaussians with large position gradients and large scale → split (replace with two smaller Gaussians)
- Gaussians with very low opacity → prune (delete)
Repeat for ~30,000 iterations.
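
The density-control rules above amount to three boolean masks over the Gaussian set. The sketch below shows only that decision step; the three thresholds are illustrative placeholders, and creating or deleting the Gaussians themselves is left out.

```python
import numpy as np

def density_control(grad_norm, max_scale, opacity,
                    grad_thresh=2e-4, scale_thresh=0.01, opacity_thresh=0.005):
    """Return boolean masks (clone, split, prune) over N Gaussians.

    grad_norm: (N,) average magnitude of the position gradient.
    max_scale: (N,) largest axis scale of each Gaussian.
    All three thresholds are illustrative placeholders.
    """
    prune = opacity < opacity_thresh
    needs_detail = (grad_norm > grad_thresh) & ~prune
    clone = needs_detail & (max_scale <= scale_thresh)   # small blob, region under-covered
    split = needs_detail & (max_scale > scale_thresh)    # large blob, region over-smoothed
    return clone, split, prune

grad_norm = np.array([5e-4, 5e-4, 1e-5, 5e-4])
max_scale = np.array([0.001, 0.5, 0.001, 0.5])
opacity = np.array([0.9, 0.9, 0.9, 0.001])
clone, split, prune = density_control(grad_norm, max_scale, opacity)
```

In this example the first Gaussian is cloned, the second is split, the third is left alone (its gradient is small), and the fourth is pruned regardless of its gradient.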

NeRF vs. 3DGS

Property	NeRF	3D Gaussian Splatting
Representation	Dense implicit (MLP)	Sparse explicit (Gaussian set)
Rendering	Ray marching (sample along rays)	Splatting (project blobs onto image)
Speed	~30s per frame	Real-time (100+ FPS)
Training	Hours to days	Minutes to hours
Editability	Hard (weights encode everything)	Easy (move/delete Gaussians)
Memory	Small (MLP weights)	Large (millions of Gaussians)
Quality	Excellent	Competitive or better
Supervision	2D images + poses	2D images + poses + SfM init

The Representation Duality

NeRF and 3DGS represent two poles: dense implicit vs. sparse explicit. NeRF stores the scene in network weights (compact, but slow to query). 3DGS stores it in millions of explicit primitives (large, but fast to render). Both are trained from the same input (2D photographs) and produce the same output (photorealistic novel views). The tradeoff is speed vs. compactness.
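
A back-of-the-envelope calculation makes the "large" side of the tradeoff concrete. The layout below is an assumption for illustration: float32 storage, covariance stored as a 3-vector scale plus a quaternion, and degree-3 spherical harmonics (16 coefficients per color channel).

```python
# Floats per Gaussian under the assumed layout:
# mean (3) + scale (3) + quaternion (4) + opacity (1) + SH color (16 coeffs x 3 channels)
floats_per_gaussian = 3 + 3 + 4 + 1 + 16 * 3
bytes_per_gaussian = floats_per_gaussian * 4      # float32

def scene_megabytes(num_gaussians):
    """Uncompressed parameter storage in megabytes (10^6 bytes)."""
    return num_gaussians * bytes_per_gaussian / 1e6

small_scene = scene_megabytes(1_000_000)
large_scene = scene_megabytes(5_000_000)
```

Under these assumptions each Gaussian costs 236 bytes, so a scene with 1 to 5 million Gaussians needs roughly 0.24 to 1.2 GB before any compression.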

Chapter 09

Showcase: 3D Representation Explorer

Let's bring everything together. Below is an interactive visualization of the same 3D shape in four different representations: point cloud, wireframe mesh, voxel grid, and SDF heatmap. Toggle between them to see the tradeoffs firsthand.

[Interactive figure: 3D Representation Explorer. Controls: Rotate, Detail.]

Each representation captures the same torus, but notice the tradeoffs:

Representation	What You See	Sampling	Inside/Outside	Memory
Point Cloud	Scattered dots on surface	Trivial	Hard	Low
Wireframe Mesh	Connected triangle edges	Easy	Ray cast	Medium
Voxels	Filled cubes	Marching cubes	Lookup	O(N³)
SDF Heatmap	Color = distance to surface	Root-find	Sign check	Continuous

The Tradeoff Triangle

You can't have everything. Point clouds are memory-efficient but have no surface. Meshes have surfaces but are hard to generate from neural networks. Voxels are easy to generate but blow up in memory. Implicit functions (SDFs) are compact and smooth but slow to render (they need sphere tracing, or mesh extraction with marching cubes). Every modern method picks a point in this tradeoff space.
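
Three of these representations can be generated for the same torus in a few lines, which makes the tradeoffs easy to probe. This is a sketch with illustrative radii and grid bounds, not the code behind the explorer.

```python
import numpy as np

R_MAJOR, R_MINOR = 1.0, 0.4   # torus radii (illustrative values)

def torus_sdf(p):
    """Signed distance to a torus around the z axis; negative inside."""
    q = np.sqrt(p[..., 0]**2 + p[..., 1]**2) - R_MAJOR
    return np.sqrt(q**2 + p[..., 2]**2) - R_MINOR

def torus_point_cloud(n, rng):
    """Sample n points on the surface (uniform in angle, not in area)."""
    u, v = rng.uniform(0.0, 2.0 * np.pi, (2, n))
    r = R_MAJOR + R_MINOR * np.cos(v)
    return np.stack([r * np.cos(u), r * np.sin(u), R_MINOR * np.sin(v)], axis=-1)

def torus_voxels(res):
    """Occupancy grid: one sign check per cell, res**3 cells in total."""
    axis = np.linspace(-1.5, 1.5, res)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    return torus_sdf(grid) < 0

points = torus_point_cloud(2048, np.random.default_rng(0))   # 2048 x 3 floats
voxels = torus_voxels(32)                                    # 32^3 = 32768 cells
```

The SDF is a closed-form function with no storage cost beyond two radii, the point cloud grows linearly with the number of samples, and doubling the voxel resolution multiplies the cell count by eight.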

Chapter 10

Structure & Connections

Beyond Geometry: Structure-Aware Representations

Everything we've discussed represents geometry — the raw shape. But objects have structure: a chair has legs, a seat, a back. A car has wheels, doors, windows. Understanding parts and their relationships is a higher level of 3D understanding.

Part segmentation labels each point or face with a semantic part (e.g., "leg 1," "seat," "back"). PointNet and its successors can learn this from labeled data.
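
The PointNet-style recipe for part segmentation is to compute per-point features, max-pool them into a global feature, and concatenate the two before classifying each point. The sketch below shows only the forward pass with random, untrained weights, so the predicted labels are meaningless; the layer sizes and the number of part classes are hypothetical.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights for a small MLP with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """Apply the same ReLU MLP to every row of x (shared across points)."""
    for W, b in weights[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = weights[-1]
    return x @ W + b

def segment_parts(points, local_net, seg_net):
    """Per-point features + max-pooled global feature -> per-point part label."""
    local = mlp(points, local_net)                       # (N, F) per-point features
    global_feat = local.max(axis=0, keepdims=True)       # (1, F) order-invariant summary
    tiled = np.repeat(global_feat, len(points), axis=0)  # copy the summary to every point
    fused = np.concatenate([local, tiled], axis=1)       # (N, 2F)
    logits = mlp(fused, seg_net)                         # (N, num_parts)
    return logits.argmax(axis=1)

rng = np.random.default_rng(0)
local_net = init_mlp([3, 32, 64], rng)
seg_net = init_mlp([128, 64, 4], rng)     # 4 hypothetical part classes
cloud = rng.normal(size=(100, 3))
labels = segment_parts(cloud, local_net, seg_net)
```

The global feature gives every point access to the overall shape, which is what lets the network tell a chair leg from a table leg even though the local geometry is similar.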

StructureNet (Mo et al., 2019) represents shapes as hierarchical graphs: the root is the whole object, children are parts, and edges encode spatial relationships (adjacency, symmetry). Each node stores a bounding box and a part type. This lets you generate shapes by generating graph structures.
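
One way to picture such a hierarchy is as a small tree data structure. The class below is a hypothetical illustration of the idea (nodes with a part type, a bounding box, children, and relations among children), not StructureNet's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class PartNode:
    part_type: str
    bbox: tuple                                     # (cx, cy, cz, sx, sy, sz): center and size
    children: list = field(default_factory=list)
    relations: list = field(default_factory=list)   # (kind, i, j) between children i and j

def leaf_parts(node):
    """Depth-first traversal returning the part types of all leaves."""
    if not node.children:
        return [node.part_type]
    return [p for child in node.children for p in leaf_parts(child)]

legs = [PartNode("leg", (x, 0.25, z, 0.1, 0.5, 0.1))
        for x in (-0.4, 0.4) for z in (-0.4, 0.4)]
base = PartNode("base", (0, 0.25, 0, 0.9, 0.5, 0.9), children=legs,
                relations=[("symmetry", 0, 1), ("symmetry", 2, 3)])
seat = PartNode("seat", (0, 0.55, 0, 1.0, 0.1, 1.0))
back = PartNode("back", (0, 1.0, -0.45, 1.0, 0.8, 0.1))
chair = PartNode("chair", (0, 0.7, 0, 1.0, 1.4, 1.0), children=[base, seat, back],
                 relations=[("adjacency", 0, 1), ("adjacency", 1, 2)])
```

Editing at the structure level becomes straightforward: removing the "back" child turns the chair into a stool without touching any geometry.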

Shape programs are the most expressive representation: they describe a 3D shape as a program (a sequence of operations such as "place a cylinder at position p with radius r and height h"). A shape program subsumes all other representations, because you can write a program that outputs points, meshes, voxels, or SDFs. The challenge is learning to write programs from data.
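
As a toy illustration, the program below builds a table from one box and four cylinders, and evaluates to an SDF. The primitive set, the function names, and the dimensions are all hypothetical; this is not the syntax of ShapeAssembly or any other published shape language.

```python
import numpy as np

def cylinder(center, radius, height):
    """SDF of a y-aligned cylinder centered at `center`."""
    c = np.asarray(center, dtype=float)
    def sdf(p):
        d = p - c
        radial = np.sqrt(d[..., 0]**2 + d[..., 2]**2) - radius
        axial = np.abs(d[..., 1]) - height / 2
        outside = np.sqrt(np.maximum(radial, 0)**2 + np.maximum(axial, 0)**2)
        inside = np.minimum(np.maximum(radial, axial), 0)
        return outside + inside
    return sdf

def box(center, size):
    """SDF of an axis-aligned box with the given center and edge lengths."""
    c = np.asarray(center, dtype=float)
    half = np.asarray(size, dtype=float) / 2
    def sdf(p):
        q = np.abs(p - c) - half
        return np.linalg.norm(np.maximum(q, 0), axis=-1) + np.minimum(q.max(axis=-1), 0)
    return sdf

def union(*shapes):
    """Union of shapes: the minimum of their signed distances."""
    return lambda p: np.min([s(p) for s in shapes], axis=0)

# The "program": a table top plus four legs.
table = union(
    box((0, 1.0, 0), (2.0, 0.1, 1.0)),
    *[cylinder((x, 0.5, z), 0.05, 1.0) for x in (-0.9, 0.9) for z in (-0.4, 0.4)],
)
```

Because `table` accepts arrays of shape (..., 3), evaluating it on a regular grid and thresholding at zero gives a voxel grid, and running marching cubes on the same grid would give a mesh. That is the sense in which a program can output the other representations.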

The Full Landscape

Representation	Type	Resolution	Learning	Key Method
Point cloud	Explicit	Discrete	PointNet, PointNet++	Per-point MLP + max pool
Mesh	Explicit	Discrete	Mesh R-CNN, Pixel2Mesh	Deform template mesh
Voxels	Explicit	Discrete (coarse)	3D CNN, 3D-GAN	3D convolutions
Multi-view	Implicit (2D)	Image resolution	MVCNN	Render + 2D CNN + pool
SDF/Occupancy	Implicit	Continuous	DeepSDF, OccNet	MLP as implicit function
Radiance field	Implicit	Continuous	NeRF, Instant-NGP	MLP + volume rendering
Gaussians	Explicit	Adaptive	3DGS	Differentiable splatting
Octree	Explicit	Adaptive	OctNet	Sparse 3D convolutions
Parts/programs	Structured	Semantic	StructureNet, ShapeAssembly	Hierarchical graphs

The Arc of Progress

The field has moved through a clear progression: voxels (simple but cubic) → point clouds (efficient but no surface) → meshes (surfaces but hard to generate) → deep implicit functions (continuous but slow) → NeRF (photorealistic but slow) → 3DGS (real-time). Each step addressed the previous step's biggest weakness.

The Unifying Theme

Every method in this lecture answers the same question: "How do I represent 3D geometry so that a neural network can learn it, and a renderer can display it?" The choice of representation determines the inductive bias, the computational cost, and the achievable quality. There is no universal best — only the right tool for the right task.

References

Qi et al. "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation." CVPR 2017.
Qi et al. "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space." NeurIPS 2017.
Su et al. "Multi-view Convolutional Neural Networks for 3D Shape Recognition." ICCV 2015.
Park et al. "DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation." CVPR 2019.
Mescheder et al. "Occupancy Networks: Learning 3D Reconstruction in Function Space." CVPR 2019.
Mildenhall et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." ECCV 2020.
Müller et al. "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding." SIGGRAPH 2022.
Kerbl et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." SIGGRAPH 2023.
Mo et al. "StructureNet: Hierarchical Graph Networks for 3D Shape Generation." SIGGRAPH Asia 2019.
Wang et al. "Dynamic Graph CNN for Learning on Point Clouds." ACM TOG 2019.