How neural networks conjure 3D scenes from ordinary photographs — and why one approach uses rays while the other throws paint.
You take 50 photos of a statue from different angles. Your brain can reconstruct the 3D shape. Can a neural network do the same? This is the problem of novel view synthesis: given a set of images and their camera poses, render the scene from any new viewpoint.
Traditional approaches (structure from motion, multi-view stereo) extract explicit geometry like point clouds or meshes. Neural approaches like NeRF and 3D Gaussian Splatting take a radically different path: they learn an implicit or parametric representation of the scene that can be rendered directly.
Multiple cameras observe a 3D object from different angles. The goal: reconstruct what the object looks like from any viewpoint.
To render a pixel, you shoot a ray from the camera through the scene. Along this ray, you sample points and ask: "What color and density exists here?" Then you composite these samples from front to back using the volume rendering equation.
σ is density (how opaque the material is), c is color, and T is transmittance (how much light makes it through to this point). Dense regions block light; empty space passes it through.
A ray travels through a scene, sampling density and color at each point. Denser regions contribute more to the final pixel color.
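The front-to-back compositing described above can be sketched in a few lines of NumPy. This is a toy, not any paper's implementation: `alpha_i = 1 - exp(-sigma_i * delta_i)` is the probability the ray terminates in sample i, and the transmittance T_i is the product of the survival probabilities of all earlier samples.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Discrete volume rendering: alpha-composite ray samples front to back.

    sigmas: (N,) densities, colors: (N, 3) RGB, deltas: (N,) sample spacing.
    """
    # alpha_i = 1 - exp(-sigma_i * delta_i): chance the ray stops in sample i
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i: transmittance, fraction of light surviving all earlier samples
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = T * alphas
    return (weights[:, None] * colors).sum(axis=0), weights

# Empty space followed by a dense red surface: the pixel comes out red,
# and the blue samples in empty space contribute nothing.
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
deltas = np.full(4, 0.1)
rgb, w = composite_ray(sigmas, colors, deltas)
```

Note how the zero-density samples get zero weight: empty space passes light through, exactly as described above.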
NeRF (Neural Radiance Fields) represents a 3D scene as a continuous function: given a 3D point (x, y, z) and viewing direction (θ, φ), it outputs the color (r, g, b) and density σ at that point. This function is parameterized by a simple MLP (multilayer perceptron).
Click anywhere in the scene to query the NeRF network at that point. It returns color and density.
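As a toy illustration of that interface, here is the scene function with random, untrained weights (a real NeRF uses roughly 8 layers of width 256 plus positional encoding on the inputs; the shapes here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for NeRF's MLP: one hidden layer with random weights.
W1 = rng.normal(size=(5, 64)); b1 = np.zeros(64)
W2 = rng.normal(size=(64, 4)); b2 = np.zeros(4)

def query_field(x, y, z, theta, phi):
    """Map (position, view direction) -> (rgb, density), NeRF's core interface."""
    h = np.maximum(0.0, np.array([x, y, z, theta, phi]) @ W1 + b1)  # ReLU layer
    out = h @ W2 + b2
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))    # sigmoid keeps color in [0, 1]
    sigma = np.maximum(0.0, out[3])         # density must be non-negative
    return rgb, sigma

rgb, sigma = query_field(0.1, -0.2, 0.5, 0.0, 1.0)
```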
MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. Positional encoding solves this by mapping the input coordinates to a higher-dimensional space using sinusoidal functions.
Without positional encoding, the MLP can only learn smooth blobs. With it, sharp edges and fine detail emerge. Adjust L (number of frequency bands).
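A minimal NumPy version of the encoding from the NeRF paper, which maps each scalar coordinate p to 2L sinusoidal features (sin(2^k·π·p), cos(2^k·π·p)) for k = 0..L-1:

```python
import numpy as np

def positional_encoding(p, L=10):
    """Map each input coordinate to 2*L sinusoidal features."""
    p = np.atleast_1d(p)
    freqs = 2.0 ** np.arange(L) * np.pi        # 2^k * pi for k = 0..L-1
    angles = p[:, None] * freqs[None, :]       # (num_coords, L)
    # Interleave sin and cos per frequency band, then flatten
    return np.stack([np.sin(angles), np.cos(angles)], axis=-1).reshape(-1)

# 3 coordinates * 10 bands * (sin, cos) = 60 features
enc = positional_encoding(np.array([0.3, -0.7, 0.5]), L=10)
```

The higher frequency bands oscillate rapidly in space, which is what lets the downstream MLP represent sharp edges that a raw (x, y, z) input cannot.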
Evaluating the MLP at every point along a ray is expensive. NeRF uses a hierarchical sampling strategy: first, sample uniformly (coarse pass), then concentrate more samples where density is high (fine pass). This focuses compute where it matters.
Blue dots = coarse uniform samples. Green dots = fine importance-weighted samples near the surface.
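The fine pass is standard inverse-CDF (importance) sampling over the coarse weights. A NumPy sketch, with the bin layout and weight values chosen purely for illustration:

```python
import numpy as np

def importance_sample(bins, weights, n_fine, rng):
    """Draw fine samples proportional to coarse weights (inverse-CDF sampling)."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    cdf[-1] = 1.0                              # guard against float round-off
    u = rng.uniform(size=n_fine)
    idx = np.searchsorted(cdf, u, side="right") - 1
    # Place each sample uniformly within its chosen coarse bin
    t = (u - cdf[idx]) / pdf[idx]
    return bins[idx] + t * (bins[idx + 1] - bins[idx])

rng = np.random.default_rng(0)
bins = np.linspace(0.0, 1.0, 9)                 # 8 coarse intervals along the ray
weights = np.array([0, 0, 0, 5, 5, 0, 0, 0.1])  # density concentrated mid-ray
fine = importance_sample(bins, weights, 64, rng)
```

Almost all fine samples land in [0.375, 0.625], the two bins where the coarse pass found density, so the expensive MLP evaluations are spent near the surface.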
Original NeRF takes hours to train and seconds to render a single frame. Instant-NGP (NVIDIA, 2022) slashed training to seconds and rendering to real-time by replacing the MLP with a multi-resolution hash table.
Instead of a deep MLP that must process every point through 8 layers, Instant-NGP looks up precomputed features in a hash table indexed by spatial position. This is massively parallel and cache-friendly.
| Method | Training Time | Render Speed | Key Technique |
|---|---|---|---|
| Original NeRF | ~1 day | ~30s/frame | Deep MLP |
| Instant-NGP | ~5 seconds | Real-time | Hash grid encoding |
| TensoRF | ~30 min | ~1s/frame | Tensor factorization |
| Plenoxels | ~11 min | ~15fps | Sparse voxel grid |
Compare the two approaches: deep MLP requires sequential layers, hash grid is a fast parallel lookup. Toggle to compare.
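A simplified single-level lookup using the XOR-of-primes spatial hash from the Instant-NGP paper. The real system stacks many resolution levels, trilinearly interpolates the 8 surrounding grid vertices, and learns the table entries; the table size and feature width here are illustrative:

```python
import numpy as np

# Spatial hash from Instant-NGP: XOR of coordinates times large primes.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_coords(ijk, table_size):
    """Hash integer grid coordinates into a table index."""
    ijk = ijk.astype(np.uint64)
    h = ijk[..., 0] * PRIMES[0]
    h ^= ijk[..., 1] * PRIMES[1]
    h ^= ijk[..., 2] * PRIMES[2]
    return h % np.uint64(table_size)

def lookup_features(xyz, table, resolution):
    """Nearest-vertex feature lookup at one grid level (no interpolation here)."""
    ijk = np.floor(xyz * resolution).astype(np.int64)
    return table[hash_coords(ijk, table.shape[0])]

rng = np.random.default_rng(0)
table = rng.normal(size=(2**14, 2)).astype(np.float32)  # 16K entries, 2 features
feat = lookup_features(np.array([0.25, 0.5, 0.75]), table, resolution=64)
```

The key property is that the lookup is O(1) per level with no sequential layers to traverse, which is why it parallelizes so well on GPUs.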
3D Gaussian Splatting (3DGS) takes a completely different approach from NeRF. Instead of an implicit function queried along rays, it represents the scene as millions of 3D Gaussians: each a colored, oriented ellipsoid with its own position, covariance, color, and opacity.
To render: project each Gaussian onto the screen (splatting), sort by depth, and alpha-composite them front to back. No ray marching, no MLP evaluation — just rasterization. This is extremely fast.
Each ellipse is a 3D Gaussian projected to 2D. More Gaussians = more detail. Adjust the count and see them splat!
| Per-Gaussian Parameter | Meaning | Count |
|---|---|---|
| Position μ | 3D center of the Gaussian | 3 |
| Covariance Σ | Shape, size, orientation (via quaternion + scale) | 7 |
| Color | Spherical harmonics coefficients | 48 |
| Opacity α | Transparency | 1 |
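The per-pixel blend over sorted splats can be sketched as follows. This is a toy: real 3DGS rasterizes screen tiles on the GPU and derives each splat's alpha from its projected 2D Gaussian footprint, but the sort-then-composite logic is the same:

```python
import numpy as np

def blend_splats(depths, colors, alphas):
    """Sort splats covering a pixel by depth, alpha-composite front to back."""
    order = np.argsort(depths)                 # nearest splat first
    colors, alphas = colors[order], alphas[order]
    rgb = np.zeros(3)
    T = 1.0                                    # remaining transmittance
    for c, a in zip(colors, alphas):
        rgb += T * a * c
        T *= 1.0 - a
        if T < 1e-4:                           # early exit: pixel is opaque
            break
    return rgb

# A nearly opaque green splat in front of a red one: pixel is mostly green.
depths = np.array([2.0, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
alphas = np.array([0.9, 0.9])
pixel = blend_splats(depths, colors, alphas)
```

The early-exit test is one reason splatting is so fast: once a pixel saturates, the remaining splats behind it are skipped entirely.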
The fundamental difference between NeRF and 3DGS is how they render:
| | NeRF (Ray Marching) | 3DGS (Rasterization) |
|---|---|---|
| Approach | Shoot rays, sample points, query MLP | Project primitives to screen, sort, blend |
| Representation | Implicit (continuous function) | Explicit (millions of Gaussians) |
| Training speed | Hours to days | Minutes |
| Render speed | Seconds per frame | 100+ FPS |
| Memory | Low (small MLP) | High (millions of Gaussians) |
| Quality | Excellent | Excellent (often better) |
Toggle between the two rendering paradigms to see the conceptual difference.
What if you could generate a 3D object from a text prompt or a single image? Generative 3D combines NeRF/3DGS with diffusion models to create 3D content without any multi-view input.
| Method | Input | Key Idea |
|---|---|---|
| DreamFusion | Text prompt | Score Distillation Sampling (SDS) from 2D diffusion |
| Zero-1-to-3 | Single image | Viewpoint-conditioned diffusion |
| Magic3D | Text prompt | Coarse-to-fine with mesh extraction |
| LGM | 4 images | Feed-forward 3DGS generation |
| GaussianDreamer | Text prompt | SDS applied to Gaussian splatting |
A 2D diffusion model guides the optimization of a 3D NeRF representation. The NeRF renders views, the diffusion model critiques them.
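Schematically, the SDS loop alternates render, critique, update. The sketch below is a heavily simplified stand-in: the "scene" is just three numbers, and the "diffusion prior" is a hand-made critic that nudges renders toward a target color. Only the loop structure mirrors real SDS, which backpropagates a pretrained denoiser's noise-prediction error into the 3D parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.5, 0.5])    # parameters of the toy "3D scene"
target = np.array([0.9, 0.1, 0.1])   # what the toy prior considers likely

def render(theta):
    return theta                      # trivially "render" a view

def score(img, sigma=0.1):
    """Toy score function: direction toward higher prior likelihood."""
    noisy = img + sigma * rng.normal(size=3)  # diffuse the rendering
    return target - noisy                     # critic's suggested correction

for _ in range(200):
    grad = -score(render(theta))      # descend against the score direction
    theta -= 0.05 * grad              # update the 3D parameters
```

Even with a noisy critic, repeated small corrections pull the scene parameters toward what the prior considers a plausible image, which is the core intuition behind distilling a 2D diffusion model into 3D.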
Neural 3D representations are transforming robotics. Robots need to understand 3D geometry to grasp objects, navigate spaces, and plan motions. NeRF and 3DGS provide rich scene representations that go beyond flat depth maps.
| Application | How NeRF/3DGS Helps |
|---|---|
| Grasp planning | Dense 3D geometry for contact-rich manipulation |
| Navigation | Photorealistic simulation for training navigation policies |
| Sim-to-real | Reconstruct real environments as training scenes |
| Language-guided | Pair 3D features with CLIP for "find the mug" queries |
| Deformables | Model soft objects and cloth for manipulation |
A robot uses a 3DGS representation to plan grasps. Colored regions show semantic understanding overlaid on 3D structure.
You now understand how neural networks reconstruct 3D worlds from photographs. From NeRF's elegant ray marching to 3DGS's blazing-fast splatting, these techniques are redefining what's possible in graphics, VR, and robotics.