A photograph crushes the world into a flat grid of pixels. How do we get the 3D back? That's the deepest challenge in computer vision — and the foundation for robotics, AR, and autonomous driving.
Hold a coffee mug in front of you and look at it from one side. You see a cylinder with a handle. Now look at it from the top — you see a circle. The same 3D object produces completely different 2D images depending on your viewpoint. A 2D image classifier might correctly label both as "mug," but it has no idea that they're the same mug, or that the handle is on the back, or that the mug is 10cm tall.
This is the fundamental problem: a photograph destroys depth. The 3D world has occlusions (objects hiding behind other objects), scale ambiguity (a small object nearby looks identical to a large object far away), and viewpoint dependence (the same scene looks different from different angles). A flat 2D image captures none of this.
Rendering is the forward problem: given a 3D scene (geometry, materials, lights, camera), produce a 2D image. Graphics engines do this in milliseconds. Inverse rendering is the backward problem: given a 2D image, recover the 3D scene. This is catastrophically harder.
Why? Because the mapping is many-to-one. Infinitely many 3D configurations can produce the same 2D image. A flat photograph of a sphere looks identical to a photograph of a painted circle, or a hemisphere, or an ellipsoid viewed at just the right angle. The information is simply gone.
The rendering equation tells us that each pixel's color depends on geometry (the cos(θ) term — surface orientation), material (the reflectance function fr), and lighting (the incoming radiance L). From a single image, you can't disentangle these three factors. That's why 3D vision from 2D images requires strong priors or multiple views.
Robotics: A robot arm picking up objects must know where things are in 3D space. A 2D bounding box says "the mug is somewhere in this rectangle of pixels" — useless for grasping.
Autonomous driving: A car needs to know that the pedestrian is 15 meters away, not just "somewhere in the image." Depth is life or death.
AR/VR: Augmented reality must place virtual objects at correct 3D positions. If the geometry is wrong, the illusion breaks immediately.
Medical imaging: A CT scanner captures 3D slices. Reconstructing the 3D organ from these slices is literally 3D vision.
How do we represent 3D shapes computationally? The choice of representation determines what operations are easy, what's hard, how much memory we use, and what neural networks can learn. There is no single best representation — only tradeoffs. That's what this lecture is about.
Given a single 2D image, recovering 3D is mathematically under-determined. There are infinitely many 3D scenes that project to the same image. We need additional constraints: multiple views (stereo, multi-view), depth sensors (LiDAR, structured light), learned priors (neural networks trained on 3D data), or physics-based assumptions (smoothness, symmetry).
An explicit representation directly encodes the surface: you store the actual 3D coordinates of points on or near the surface. Think of it like drawing a shape by placing dots, connecting them with lines, or writing a formula that traces out the curve. You can always answer "give me a point on the surface" easily. But asking "is this arbitrary point inside or outside?" is harder.
The simplest 3D representation: an unordered set of (x, y, z) coordinates, sometimes with normals (nx, ny, nz) or colors. That's it — no connectivity, no topology, just a bag of points floating in space.
A set P = {p1, p2, ..., pN} where each pi ∈ ℝ3 (or ℝ6 with normals, or ℝ6+ with color/features). The set is unordered — there is no "first" or "last" point.
Where do point clouds come from? LiDAR sensors on self-driving cars shoot laser pulses and measure time-of-flight to get depth. Depth cameras (like the iPhone's Face ID sensor) use structured light or time-of-flight. 3D scanners use triangulation. Structure from Motion (SfM) algorithms triangulate points from multiple 2D images.
Point clouds are beautifully simple, but they have real limitations. There's no notion of a surface between points — you can't smoothly render them or compute volumes. You don't know which points are "connected." And they have no topology: is the shape a sphere or a torus? A point cloud alone can't tell you.
A polygon mesh adds connectivity to points. You store a list of vertices (3D coordinates) and a list of faces (triplets of vertex indices forming triangles). Now you have a surface: the triangles tile together to form a watertight boundary.
A pair (V, F) where V = {v1, ..., vN} is a set of vertices (vi ∈ ℝ3) and F = {f1, ..., fM} is a set of triangular faces (fj = (i, k, l) referencing three vertices). The mesh defines a piecewise-linear surface.
Meshes are the workhorse of computer graphics. The Digital Michelangelo Project scanned Michelangelo's David at 28 million vertices. Google Earth uses trillions of triangles to represent the planet. Every video game character, every Pixar movie, every CAD model — meshes.
Three fundamental mesh operations:
| Operation | What It Does | When to Use |
|---|---|---|
| Subdivision | Splits each triangle into smaller triangles (upsample). Adds detail. | Close-up rendering, smooth surface approximation |
| Simplification | Merges triangles to reduce count (downsample). Removes detail. | Level-of-detail for distant objects, storage reduction |
| Regularization | Makes triangles more uniform in size/shape. Improves quality. | Numerical simulation (FEM), better rendering |
Instead of storing discrete points, what if we stored a formula? A parametric surface is a function f: ℝ2 → ℝ3 that maps two parameters (u, v) to a 3D point (x, y, z).
Bézier curves are parametric curves controlled by a set of control points. The curve is a weighted average of the control points, using Bernstein polynomials as weights. Bézier surfaces extend this to patches controlled by a grid of control points. Splines chain multiple Bézier patches together with continuity constraints at the joints.
The appeal: parametric surfaces are infinitely smooth and compact (a few control points define a complex curve). The limitation: representing complex topology (a torus, a figure-eight) requires careful patching, and the mapping from parameters to geometry can be unintuitive.
All explicit representations share a common trait: it's trivial to generate points on the surface (just evaluate the formula, or read from the list). But testing "is this point inside or outside the shape?" is hard — you'd need to cast rays and count intersections, or do other geometric computation. Implicit representations flip this tradeoff exactly.
Explicit mode: Click anywhere to sample random points on the shape's surface. Points appear as dots — easy to generate, but you can't tell if a point is "inside" or "outside" just from the dots. Implicit mode: Hover to query the implicit function f(x,y). Green means inside (f<0), red means outside (f>0), and the surface is the zero-crossing (f=0) shown as a gold contour.
Flip the perspective. Instead of storing where the surface is, store a function that tells you, for any point in space, whether it's inside, outside, or on the surface. The surface is defined as the zero set of this function.
A surface S defined by a function f: ℝ3 → ℝ such that S = {(x,y,z) : f(x,y,z) = 0}. Points with f < 0 are inside, f > 0 are outside. The function f is called the implicit function.
The simplest implicit surfaces: the zero set of a polynomial. A sphere of radius r: f(x,y,z) = x2 + y2 + z2 − r2. A torus: f(x,y,z) = (x2 + y2 + z2 + R2 − r2)2 − 4R2(x2 + y2). Clean, compact, but limited in what shapes they can describe.
CSG builds complex shapes from simple primitives using Boolean operations. You start with spheres, cubes, cylinders — each with its own implicit function — and combine them:
CAD software uses CSG heavily. "Take a cylinder, subtract a smaller cylinder" gives you a pipe. "Intersect a sphere with a cube" gives you a rounded cube. The beauty is composability — each Boolean operation is just a min or max.
A signed distance function is a special implicit function where the value at any point equals the distance to the nearest surface, with a sign indicating inside (negative) or outside (positive).
SDFs have a magical property: you can smoothly blend shapes by interpolating their distance fields. Where CSG gives hard Boolean edges, SDF blending produces organic, smooth transitions — like two water droplets merging.
Sphere centered at origin with radius r: SDF(x,y,z) = √(x2 + y2 + z2) − r. At the origin, SDF = −r (deepest inside). At the surface, SDF = 0. At distance d from surface, SDF = d. This works for any point in space — no mesh intersection, no ray casting.
Instead of a closed-form function, store the implicit function on a 3D grid. Each voxel holds a floating-point value. The surface is the set of points where the stored value crosses zero — the zero level set. This is how medical imaging (CT scans, MRIs) naturally represents anatomy: each voxel stores tissue density, and the organ boundary is a level set.
The simplest volumetric representation: divide space into a regular 3D grid of cubes. Each voxel (volume pixel) stores a binary value: 1 if the shape is present, 0 otherwise. Simple, intuitive — and catastrophically expensive. A 2563 grid has 16.7 million voxels. A 5123 grid has 134 million. Resolution scales as O(N3), which makes fine detail prohibitive.
Given a volumetric field (SDF, level set, or voxel grid), how do you extract a triangle mesh for rendering? The Marching Cubes algorithm walks through the grid one cube at a time. Each cube has 8 corners, each either inside or outside. That's 28 = 256 configurations, which reduce to 15 unique cases by symmetry. For each configuration, there's a pre-computed template of triangles that approximate the surface within that cube.
Implicit representations flip the explicit tradeoff: testing "is this point inside or outside?" is instant (evaluate f and check the sign). But generating a point on the surface requires root-finding — you must search for where f = 0. This is why we need marching cubes: it systematically finds the zero-crossings by scanning the grid.
| Representation | Sample Surface | Inside/Outside | Memory | Topology |
|---|---|---|---|---|
| Point cloud | Trivial | Hard | O(N) | No info |
| Triangle mesh | Easy | Ray casting | O(N) | Explicit |
| Parametric | Evaluate f(u,v) | Hard | O(1) per patch | Patch-based |
| Voxel grid | March cubes | Lookup | O(N3) | Implicit |
| SDF | March cubes | Sign of f | Continuous/grid | Implicit |
| Level set | March cubes | Sign of f | O(N3) | Implicit |
You've got a point cloud — a bag of 3D coordinates. You want to classify it ("is this a chair or a table?") or segment it ("which points belong to the seat vs. the leg?"). The challenge: a point cloud is a set, not a sequence. There's no canonical ordering. If you shuffle the points, the answer shouldn't change.
A function f is permutation invariant if for any permutation π of the input set: f({x1, ..., xN}) = f({xπ(1), ..., xπ(N)}). The output doesn't depend on the order of the inputs.
A standard neural network (MLP, CNN) takes an ordered vector as input. Feed it [point1, point2, point3] and it gives one answer; feed it [point3, point1, point2] and it gives a different answer. That's wrong for point clouds. We need a symmetric function — one that treats all permutations identically.
Here's the key theorem: any continuous symmetric function on a set can be approximated by a function of the form:
Read it from the bottom up. Each point xi ∈ ℝ3 is independently passed through the same MLP f, producing a K-dimensional feature vector. Then you take the element-wise maximum across all N points, collapsing the set into a single K-dimensional vector. This is the global feature. Finally, another MLP g maps this global feature to the output (class probabilities).
Max is a symmetric function: max(a,b,c) = max(c,a,b) = max(b,c,a). It doesn't matter what order you feed the points in — the max is the same. Sum and mean also work, but max performs best empirically because it acts like a "voting" mechanism: each dimension of the K-vector captures whether any point has a particular geometric feature.
PointNet isn't just a hack. Qi et al. proved that this architecture can approximate any continuous symmetric function to arbitrary accuracy, given enough dimensions K. The max-pooling layer is the key: it's a universal symmetric aggregator. The per-point MLP f learns to map each point into a high-dimensional space where max-pooling extracts all the geometric information.
Consider the set of continuous symmetric functions on sets of up to N points. For any such function h, there exist continuous functions f: ℝ3 → ℝK and g: ℝK → ℝ such that h = g ∘ MAX ∘ f, provided K is large enough. The proof constructs f and g explicitly using the Stone-Weierstrass theorem on the quotient space of point sets under permutations.
For classification, the global feature (after max pooling) goes to a classifier. For part segmentation (labeling each point), PointNet concatenates the global feature with each point's local feature, then runs a per-point classifier. Each point sees both its own local geometry AND the global shape context.
PointNet's weakness: it processes each point independently before pooling, so it misses local structure. Two points that are spatially close don't interact until the global max pool. PointNet++ fixes this with hierarchical grouping: apply PointNet to local neighborhoods (like a convolution), pool within each neighborhood, then apply PointNet again to the resulting points, and so on. It's like building a spatial hierarchy — local features first, then progressively more global ones.
DGCNN (Dynamic Graph CNN) builds a k-nearest-neighbor graph in feature space (not just 3D space) and applies graph convolutions. The graph is recomputed at each layer, so the connectivity adapts as features evolve.
To train generative models or evaluate reconstructions, we need to measure how "close" two point clouds are. Two standard metrics:
Click Process Points to animate: each point passes through the shared MLP, then max pooling aggregates everything into one global feature vector. Now click Shuffle Input — notice that the output (global feature) is identical. That's permutation invariance. Adjust the slider to see how more points give richer features.
Here's a surprising result from 2015: if you want to classify 3D shapes, don't bother with fancy 3D architectures. Just render the shape from many viewpoints, run a standard 2D CNN on each view, and combine the results. This approach — Multi-View CNN (MVCNN) — beat every direct 3D method on the ModelNet benchmark.
Result: 89.9% accuracy on ModelNet40 (40-class 3D shape classification), versus ~77% for the best voxel-based 3D CNN at the time. The 2D CNN had seen millions of natural images during pretraining (ImageNet); the 3D CNN was trained from scratch on a much smaller dataset.
2D CNNs benefit from massive pretraining on ImageNet. They've learned rich feature hierarchies — edges, textures, parts, objects. 3D CNNs start from scratch with orders of magnitude less training data. The rendered views effectively transfer all that 2D knowledge to the 3D task. As 3D datasets grow, this gap narrows.
The direct approach: voxelize the 3D shape into a 3D grid (like an image, but with depth), then run 3D convolutions. A 3D conv kernel is a cube (e.g., 3×3×3) that slides through the volume.
The problem: O(N3) memory. A 323 voxel grid is manageable (~32K voxels). A 1283 grid has 2 million voxels. A 2563 grid has 16.7 million. Most of those voxels are empty — the surface occupies only a thin shell. You're paying cubic memory for quadratic information.
3D-GANs extend generative adversarial networks to voxel grids. The generator takes a random latent vector z ∈ ℝ200 and produces a 643 voxel grid via 3D transposed convolutions. The discriminator classifies voxel grids as real or generated. The result: you can sample random 3D shapes from a learned distribution.
The fix for cubic memory: octrees. An octree recursively subdivides space into 8 octants, but only subdivides where the surface is present. Empty regions stay as large, coarse blocks. The surface region gets fine subdivisions.
For a sphere at resolution N: uniform voxel grid uses O(N3) voxels. An octree uses O(N2 log N) — only the surface shell is refined. At N=256, that's 16.7M vs. ~1.5M. At N=1024, it's 1.07B vs. ~30M. The savings grow with resolution.
The key enabler for learning 3D from 2D: make the rendering process differentiable. If you can compute gradients of the rendered image with respect to the 3D shape parameters, you can train a neural network to predict 3D shapes by comparing rendered views to ground-truth images.
Standard rasterization is not differentiable (a pixel is either covered by a triangle or not — there's no gradient). Differentiable renderers soften this binary decision: each pixel gets a smooth probability of being covered, and gradients flow through. Neural mesh renderer, SoftRasterizer, and PyTorch3D all implement variants of this idea.
3D supervision (ground-truth meshes, point clouds, SDFs) is expensive to acquire. 2D supervision (photographs) is essentially free. Differentiable rendering bridges this gap: train on 2D images, learn 3D geometry. This philosophy drives NeRF, 3D Gaussian Splatting, and most modern 3D reconstruction methods.
Voxels are too expensive (O(N3)). Point clouds have no surface. Meshes are hard to generate with neural networks (you need to predict both vertices and connectivity, which is a combinatorial problem). What if you trained a neural network to be the implicit function?
DeepSDF (Park et al., 2019) trains an MLP to be a signed distance function. The input is a 3D coordinate plus a learnable latent code z that encodes the shape identity. The output is the signed distance at that point.
During training, you have ground-truth SDF samples from known shapes. For each shape, you jointly optimize the MLP weights θ and that shape's latent code z to minimize:
DeepSDF doesn't use an encoder. There's no network that maps a shape to its latent code. Instead, latent codes are directly optimized as free parameters during training — this is called an auto-decoder. At test time, given a new shape (as a partial point cloud), you freeze θ and optimize z to fit the observations. This is slower than a feed-forward encoder but produces better reconstructions.
Occupancy Networks (Mescheder et al., 2019) take a different approach: instead of predicting signed distance, predict the probability of occupancy.
Trained with binary cross-entropy: for each query point, the ground truth is 1 (inside) or 0 (outside). Unlike DeepSDF, Occupancy Networks use an encoder — a PointNet or ResNet that maps the input observation to the latent code z in a single forward pass.
| Property | Voxels (643) | Point Cloud | DeepSDF / OccNet |
|---|---|---|---|
| Resolution | Fixed, coarse | Depends on N | Continuous (infinite) |
| Memory | O(N3) | O(N) | O(|θ|) network params |
| Surface quality | Blocky | No surface | Smooth |
| Topology | Implicit | None | Arbitrary (learned) |
| Mesh extraction | Marching cubes | Poisson, ball pivot | Marching cubes on grid eval |
A single 8-layer MLP with 256 hidden units can represent complex 3D geometry at arbitrary resolution. The network weights are the shape. You evaluate it at any point you want — no grid, no predefined resolution. This is the conceptual shift: from storing geometry in data structures to encoding geometry in neural network weights.
Step 1: Given a new observation (image, partial point cloud), compute latent code z (via encoder or optimization). Step 2: Evaluate fθ(x, z) on a dense 3D grid (e.g., 2563). Step 3: Run Marching Cubes to extract the zero level set as a triangle mesh. Step 4: Render the mesh. Total: a photograph in, a 3D mesh out.
DeepSDF learns geometry. But what about appearance? A 3D scene isn't just shape — it's shape plus material plus lighting. What if a neural network could learn both simultaneously, trained from nothing but ordinary photographs?
That's NeRF (Mildenhall et al., 2020). It represents a scene as a continuous volumetric function that maps a 3D position and viewing direction to a color and density. No mesh. No point cloud. Just an MLP that you can query at any point in space to ask: "What does this look like from this direction?"
Density σ depends only on position (geometry is view-independent). Color depends on both position and direction (materials reflect differently from different angles — think of specular highlights on a glossy surface).
To render a pixel, shoot a ray from the camera through that pixel into the scene. Sample N points along the ray. At each sample, query the network for color and density. Then composite the colors using the volume rendering equation:
Read it step by step. Ti tracks how much light has survived from the camera to sample i — it starts at 1 (nothing blocking) and decreases as it hits dense regions. αi is how opaque this particular segment is (high density σ or large step size δ means more blocking). ci is the color we see at this point. Multiply all three: this sample contributes its color, weighted by its opacity and the accumulated transparency.
Every operation in the volume rendering equation — multiplication, addition, exponentiation — is differentiable. That means you can backpropagate from a rendered pixel all the way back through the volume rendering integral to the MLP weights. This is the key insight: you train NeRF using only 2D photographs + camera poses. No 3D supervision at all.
MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. NeRF fixes this with positional encoding: before feeding (x,y,z) into the MLP, map it to a higher-dimensional space using sinusoids at multiple frequencies.
Without positional encoding, NeRF produces blurry renderings. With it, the network can represent high-frequency detail — individual bricks on a wall, text on a sign, the weave of a fabric.
Naive approach: sample N points uniformly along each ray. Wasteful — most of the ray passes through empty space. NeRF uses a coarse-to-fine strategy:
Input: a set of photographs with known camera poses (position and orientation for each image). Loss: the L2 difference between the rendered pixel color and the ground-truth pixel color.
That's it. No 3D supervision, no depth maps, no segmentation masks. Just "make the rendered images match the real photographs." The 3D geometry and appearance emerge as a byproduct of matching 2D observations.
A ray shoots from left to right through a 1D "scene" with colored density regions. At each sample point, the visualization shows the transmittance T (how much light survived so far), the opacity α (how much this sample blocks), and the weighted color contribution. The final pixel color at the right accumulates all contributions. Increase samples to see the rendering converge; increase density to make objects more opaque.
NeRF is slow. Training a single scene takes 1-2 days. Rendering a single image takes 30+ seconds (hundreds of MLP evaluations per pixel, millions of pixels per image). The scene is baked into the network weights — no generalization to new scenes.
| Extension | Key Idea | Improvement |
|---|---|---|
| Instant-NGP | Hash-based feature grids instead of positional encoding | Training in seconds, not hours |
| Mip-NeRF | Integrated positional encoding over cone frustums | Anti-aliased rendering, multi-scale |
| NeRF-W | Per-image appearance/transient embeddings | Works with internet photos (varying lighting) |
| Plenoxels | Explicit voxel grid with spherical harmonics | No MLP needed, 100× faster training |
| 3DGS | Gaussian blobs instead of dense volume | Real-time rendering (see next chapter) |
NeRF treats the scene as a dense implicit field: to render a pixel, you must query the MLP at hundreds of points along every ray. That's slow. 3D Gaussian Splatting (3DGS, Kerbl et al., 2023) inverts this: instead of querying a dense field, scatter a collection of explicit 3D blobs onto the image. The blobs are 3D Gaussians — each one a fuzzy, colored ellipsoid floating in space.
Each Gaussian is parameterized by: μ ∈ ℝ3 (position), Σ ∈ ℝ3×3 (covariance — shape/orientation of the ellipsoid), α ∈ [0,1] (opacity), and color (represented as spherical harmonics coefficients for view-dependent appearance). A typical scene uses 1-5 million Gaussians.
NeRF shoots rays into the scene (ray marching). 3DGS does the opposite: it projects Gaussians onto the image plane (splatting). Each 3D Gaussian projects to a 2D Gaussian on the screen. Then you sort the 2D Gaussians by depth and alpha-composite them front-to-back, exactly like stacking semi-transparent colored blobs.
This looks almost identical to NeRF's volume rendering equation — and it is. The difference: NeRF samples densely along rays; 3DGS only evaluates at the Gaussian positions. Since most of space is empty, this is dramatically faster.
3DGS starts from a Structure-from-Motion (SfM) point cloud. Run COLMAP on your input images to get a sparse 3D reconstruction and camera poses. Each SfM point becomes the initial center μ of a Gaussian. The initial covariance is isotropic (spherical blob), initial opacity is low, and initial color is the average color of the corresponding 2D point.
The entire splatting pipeline is differentiable. Loss = L1 + D-SSIM between rendered and ground-truth images. Gradients flow back to all Gaussian parameters: position, covariance, color, opacity.
| Property | NeRF | 3D Gaussian Splatting |
|---|---|---|
| Representation | Dense implicit (MLP) | Sparse explicit (Gaussian set) |
| Rendering | Ray marching (sample along rays) | Splatting (project blobs onto image) |
| Speed | ~30s per frame | Real-time (100+ FPS) |
| Training | Hours to days | Minutes to hours |
| Editability | Hard (weights encode everything) | Easy (move/delete Gaussians) |
| Memory | Small (MLP weights) | Large (millions of Gaussians) |
| Quality | Excellent | Competitive or better |
| Supervision | 2D images + poses | 2D images + poses + SfM init |
NeRF and 3DGS represent two poles: dense implicit vs. sparse explicit. NeRF stores the scene in network weights (compact, but slow to query). 3DGS stores it in millions of explicit primitives (large, but fast to render). Both are trained from the same input (2D photographs) and produce the same output (photorealistic novel views). The tradeoff is speed vs. compactness.
Let's bring everything together. Below is an interactive visualization of the same 3D shape in four different representations: point cloud, wireframe mesh, voxel grid, and SDF heatmap. Toggle between them to see the tradeoffs firsthand.
Each representation captures the same torus, but notice the tradeoffs:
| Representation | What You See | Sampling | Inside/Outside | Memory |
|---|---|---|---|---|
| Point Cloud | Scattered dots on surface | Trivial | Hard | Low |
| Wireframe Mesh | Connected triangle edges | Easy | Ray cast | Medium |
| Voxels | Filled cubes | March cubes | Lookup | O(N3) |
| SDF Heatmap | Color = distance to surface | Root-find | Sign check | Continuous |
You can't have everything. Point clouds are memory-efficient but have no surface. Meshes have surfaces but are hard to generate from neural networks. Voxels are easy to generate but blow up in memory. Implicit functions (SDFs) are compact and smooth but slow to render (need marching cubes). Every modern method picks a point in this tradeoff space.
Everything we've discussed represents geometry — the raw shape. But objects have structure: a chair has legs, a seat, a back. A car has wheels, doors, windows. Understanding parts and their relationships is a higher level of 3D understanding.
Part segmentation labels each point or face with a semantic part (e.g., "leg 1," "seat," "back"). PointNet and its successors can learn this from labeled data.
StructureNet (Mo et al., 2019) represents shapes as hierarchical graphs: the root is the whole object, children are parts, and edges encode spatial relationships (adjacency, symmetry). Each node stores a bounding box and a part type. This lets you generate shapes by generating graph structures.
Shape programs are the most expressive representation: describe a 3D shape as a program (sequence of operations like "place a cylinder at position p with radius r and height h"). A shape program subsumes all other representations — you can write a program that outputs points, meshes, voxels, or SDFs. The challenge is learning to write programs from data.
| Representation | Type | Resolution | Learning | Key Method |
|---|---|---|---|---|
| Point cloud | Explicit | Discrete | PointNet, PointNet++ | Per-point MLP + max pool |
| Mesh | Explicit | Discrete | Mesh R-CNN, Pixel2Mesh | Deform template mesh |
| Voxels | Explicit | Discrete (coarse) | 3D CNN, 3D-GAN | 3D convolutions |
| Multi-view | Implicit (2D) | Image resolution | MVCNN | Render + 2D CNN + pool |
| SDF/Occupancy | Implicit | Continuous | DeepSDF, OccNet | MLP as implicit function |
| Radiance field | Implicit | Continuous | NeRF, Instant-NGP | MLP + volume rendering |
| Gaussians | Explicit | Adaptive | 3DGS | Differentiable splatting |
| Octree | Explicit | Adaptive | OctNet | Sparse 3D convolutions |
| Parts/programs | Structured | Semantic | StructureNet, ShapeAssembly | Hierarchical graphs |
The field has moved through a clear progression: voxels (simple but cubic) → point clouds (efficient but no surface) → meshes (surfaces but hard to generate) → deep implicit functions (continuous but slow) → NeRF (photorealistic but slow) → 3DGS (real-time). Each step addressed the previous step's biggest weakness.
Every method in this lecture answers the same question: "How do I represent 3D geometry so that a neural network can learn it, and a renderer can display it?" The choice of representation determines the inductive bias, the computational cost, and the achievable quality. There is no universal best — only the right tool for the right task.