The Complete Beginner's Path

Understand NeRF & 3D Gaussian Splatting

How neural networks conjure 3D scenes from ordinary photographs — and why one approach uses rays while the other throws paint.

Prerequisites: Basic intuition for 3D space + Curiosity about graphics. That's it.
10 chapters · 9+ interactive simulations · 0 graphics background needed

Chapter 0: 3D from 2D

You take 50 photos of a statue from different angles. Your brain can reconstruct the 3D shape. Can a neural network do the same? This is the problem of novel view synthesis: given a set of images and their camera poses, render the scene from any new viewpoint.

Traditional approaches (structure from motion, multi-view stereo) extract explicit geometry like point clouds or meshes. Neural approaches like NeRF and 3D Gaussian Splatting take a radically different path: they learn an implicit or parametric representation of the scene that can be rendered directly.

The setup: Input = photos + camera positions. Output = a representation that can render the scene from any new camera position. No 3D scanner needed — just ordinary photographs.
Multi-View Input

Multiple cameras observe a 3D object from different angles. The goal: reconstruct what the object looks like from any viewpoint.

Camera count: 8
Check: What is the input to a NeRF or 3DGS system?

Chapter 1: Volume Rendering

To render a pixel, you shoot a ray from the camera through the scene. Along this ray, you sample points and ask: "What color and density exists here?" Then you composite these samples from front to back using the volume rendering equation.

C(r) = ∫ T(t) · σ(t) · c(t) dt    where T(t) = exp(−∫σ(s)ds)

σ is density (how opaque the material is), c is color, and T is transmittance (how much light makes it through to this point). Dense regions block light; empty space passes it through.
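In practice the integral is evaluated by numerical quadrature over discrete samples along the ray. A minimal NumPy sketch of that compositing step (the densities, colors, and sample spacings below are made up for illustration):

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Discrete volume rendering quadrature:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    where T_i = exp(-sum_{j<i} sigma_j * delta_j)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas                            # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0), weights

# A toy ray: two samples of empty space, then a dense red surface
sigmas = np.array([0.0, 0.0, 5.0, 5.0])
colors = np.array([[0.0, 0, 0], [0, 0, 0], [1.0, 0, 0], [1.0, 0, 0]])
deltas = np.full(4, 0.5)
pixel, weights = composite_ray(sigmas, colors, deltas)  # pixel comes out mostly red
```

The empty-space samples get zero weight and the first dense sample dominates; the per-sample `weights` are also what later importance-sampling schemes reuse to decide where a surface sits along the ray.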

Ray Marching Through a Volume

A ray travels through a scene, sampling density and color at each point. Denser regions contribute more to the final pixel color.

Density scale: 8
The key insight: Volume rendering is differentiable. This means we can use gradient descent to optimize whatever produces the density and color values — and that "whatever" is the neural network in NeRF.
Check: In volume rendering, what does transmittance T(t) represent?

Chapter 2: Neural Radiance Fields (NeRF)

NeRF represents a 3D scene as a continuous function: given a 3D point (x, y, z) and viewing direction (θ, φ), it outputs the color (r, g, b) and density (σ) at that point. This function is parameterized by a simple MLP (multilayer perceptron).

Fθ(x, y, z, θ, φ) → (r, g, b, σ)
Input
(x, y, z) position + (θ, φ) view direction
MLP Network
8 layers, 256 channels, skip connections
Output
(r, g, b) color + σ density
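The architecture above can be sketched in a few lines of NumPy. This is a toy forward pass with random, untrained weights, just to make the shapes and the skip connection concrete; the 60- and 24-dimensional inputs stand in for positionally encoded position and direction:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def init_params(in_dim, dir_dim, width=256, depth=8):
    """Random (untrained) weights for a NeRF-style trunk."""
    params = {"trunk": []}
    d = in_dim
    for i in range(depth):
        if i == 4:                      # layer 4 also receives the skip input
            d += in_dim
        params["trunk"].append((rng.normal(0, 0.1, (width, d)), np.zeros(width)))
        d = width
    params["sigma"] = (rng.normal(0, 0.1, (1, width)), np.zeros(1))
    params["rgb"] = (rng.normal(0, 0.1, (3, width + dir_dim)), np.zeros(3))
    return params

def nerf_mlp(xyz_feat, dir_feat, params):
    """Toy forward pass: 8 ReLU layers of width 256 with a skip
    connection at layer 4. Density depends on position only;
    color is additionally conditioned on the view direction."""
    h = xyz_feat
    for i, (W, b) in enumerate(params["trunk"]):
        if i == 4:
            h = np.concatenate([h, xyz_feat])   # re-inject position features
        h = relu(W @ h + b)
    Ws, bs = params["sigma"]
    sigma = relu(Ws @ h + bs)                   # density is non-negative
    Wc, bc = params["rgb"]
    logits = Wc @ np.concatenate([h, dir_feat]) + bc
    rgb = 1.0 / (1.0 + np.exp(-logits))         # color squashed into [0, 1]
    return rgb, sigma

rgb, sigma = nerf_mlp(rng.normal(size=60), rng.normal(size=24), init_params(60, 24))
```

Note the asymmetry: σ is read off before the view direction enters the network, so geometry cannot change with viewpoint, only appearance can.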
NeRF: Query the Scene Function

Click anywhere in the scene to query the NeRF network at that point. It returns color and density.

Why viewing direction? The same surface looks different from different angles (specular highlights, reflections). By conditioning on view direction, NeRF can represent view-dependent effects like the shine on a glossy object.
Check: What does a NeRF MLP take as input?

Chapter 3: Positional Encoding & the MLP

MLPs are biased toward learning smooth, low-frequency functions. But scenes have sharp edges, fine textures, and high-frequency detail. Positional encoding solves this by mapping the input coordinates to a higher-dimensional space using sinusoidal functions.

γ(p) = [sin(2⁰πp), cos(2⁰πp), ..., sin(2ᴸ⁻¹πp), cos(2ᴸ⁻¹πp)]
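A NumPy sketch of the encoding, showing how two points that are nearly identical in raw coordinates become well separated in encoded space:

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p): map each scalar coordinate to 2L sinusoidal features,
    sin(2^k * pi * p) and cos(2^k * pi * p) for k = 0 .. L-1."""
    freqs = (2.0 ** np.arange(L)) * np.pi        # 2^k * pi
    angles = np.outer(np.atleast_1d(p), freqs)   # shape (num_points, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Two nearby points: 0.001 apart in raw space...
a = positional_encoding(0.500, L=10)
b = positional_encoding(0.501, L=10)
# ...but far apart in encoded space, because the high-frequency
# bands amplify the tiny difference into large phase shifts.
```

This is why higher L recovers sharper detail in the simulation above: the top frequency band 2ᴸ⁻¹π sets how finely the network can distinguish neighboring points.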
Positional Encoding: Low vs High Frequency

Without positional encoding, the MLP can only learn smooth blobs. With it, sharp edges and fine detail emerge. Adjust L (number of frequency bands).

Frequency bands L: 0
Same idea as transformers: This is the same positional encoding concept used in transformers, but applied to 3D spatial coordinates. Higher frequencies let the network distinguish between nearby points — essential for sharp geometry.
Check: Why does NeRF need positional encoding?

Chapter 4: Sampling Strategies

Evaluating the MLP at every point along a ray is expensive. NeRF uses a hierarchical sampling strategy: first, sample uniformly (coarse pass), then concentrate more samples where density is high (fine pass). This focuses compute where it matters.

Coarse Pass
64 uniform samples along the ray
Fine Pass
128 additional samples concentrated near surfaces
Hierarchical Sampling

Blue dots = coarse uniform samples. Green dots = fine importance-weighted samples near the surface.

Surface position: 60%
Importance sampling intuition: Why waste evaluations in empty air? After the coarse pass reveals where the surface is, the fine pass puts most of its budget right at the surface boundary — exactly where the details matter.
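The fine pass draws its samples through the inverse CDF of the coarse weights. A small NumPy sketch (the bin weights are made up to mimic a coarse pass that found a surface around t ≈ 0.7):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pdf(bin_edges, weights, n_fine):
    """Inverse-CDF sampling: draw fine samples in proportion to the
    coarse pass's per-bin weights (the importance-sampling step)."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_fine)                 # uniform draws in [0, 1)
    return np.interp(u, cdf, bin_edges)          # map through the inverse CDF

edges = np.linspace(0.0, 1.0, 9)                 # 8 coarse bins along the ray
w = np.array([0.01, 0.01, 0.01, 0.01, 0.01, 1.0, 1.0, 0.01])  # surface in bins 5-6
fine = sample_pdf(edges, w, n_fine=128)          # most samples land near t ~ 0.7
```

Nearly all 128 fine samples fall inside the two heavy bins; the empty first half of the ray gets almost none.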
Check: Why does NeRF use hierarchical sampling?

Chapter 5: Speed — Instant-NGP

Original NeRF takes roughly a day to train and seconds to render a single frame. Instant-NGP (NVIDIA, 2022) slashed training to seconds and rendering to real-time by replacing most of the deep MLP with a multi-resolution hash-table encoding that feeds a much smaller network.

Instead of pushing every point through 8 wide layers, Instant-NGP looks up learned feature vectors in hash tables indexed by spatial position, then runs them through a tiny MLP. The lookup is massively parallel and cache-friendly.

Method        | Training Time | Render Speed | Key Technique
Original NeRF | ~1 day        | ~30 s/frame  | Deep MLP
Instant-NGP   | ~5 seconds    | Real-time    | Hash grid encoding
TensoRF       | ~30 min       | ~1 s/frame   | Tensor factorization
Plenoxels     | ~11 min       | ~15 fps      | Sparse voxel grid
MLP vs Hash Grid Lookup

Compare the two approaches: deep MLP requires sequential layers, hash grid is a fast parallel lookup. Toggle to compare.

The hash trick: At each resolution level, spatial positions are hashed to a fixed-size table. Collisions are resolved by the neural network learning to work around them. This trades a tiny bit of quality for massive speed gains.
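A single-level sketch of the lookup, using the spatial hash from the Instant-NGP paper. The real system uses many resolution levels, trilinear interpolation between grid corners, and a tiny MLP on top, all omitted here:

```python
import numpy as np

TABLE_SIZE = 2 ** 14          # hash table entries (real systems use ~2^19)
FEAT_DIM = 2                  # learned feature channels per entry
rng = np.random.default_rng(2)
table = rng.normal(0, 1e-2, (TABLE_SIZE, FEAT_DIM))   # trainable features

# Large primes from the Instant-NGP spatial hash
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_lookup(ijk):
    """XOR the integer grid coordinates, each multiplied by a large
    prime, then take the result modulo the table size."""
    ijk = np.asarray(ijk, dtype=np.uint64)
    h = ijk[0] * PRIMES[0] ^ ijk[1] * PRIMES[1] ^ ijk[2] * PRIMES[2]
    return table[int(h % TABLE_SIZE)]

feat = hash_lookup((12, 7, 99))   # O(1) lookup instead of an 8-layer MLP
```

Nothing resolves collisions explicitly: two far-apart cells may share a table entry, and training simply pushes the shared features toward whichever cell matters more for the loss.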
Check: What makes Instant-NGP so much faster than original NeRF?

Chapter 6: 3D Gaussian Splatting

3DGS takes a completely different approach from NeRF. Instead of an implicit function queried along rays, it represents the scene as millions of 3D Gaussians — each one a colored, oriented ellipsoid with position, covariance, color, and opacity.

To render: project each Gaussian onto the screen (splatting), sort by depth, and alpha-composite them front to back. No ray marching, no MLP evaluation — just rasterization. This is extremely fast.

G(x) = exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Gaussian Splatting Visualization

Each ellipse is a 3D Gaussian projected to 2D. More Gaussians = more detail. Adjust the count and see them splat!

Gaussians: 100
Per-Gaussian Parameter | Meaning                                            | Count
Position μ             | 3D center of the Gaussian                          | 3
Covariance Σ           | Shape, size, orientation (via quaternion + scale)  | 7
Color                  | Spherical harmonics coefficients                   | 48
Opacity α              | Transparency                                       | 1
Why Gaussians? They're differentiable (can be optimized with gradient descent), fast to project (closed-form 2D projection), and naturally handle smooth surfaces. Plus, they tile-sort beautifully on GPUs.
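The per-pixel blending can be sketched directly. This toy version inverts each projected 2×2 covariance on the spot, where the real tile-based rasterizer precomputes conic parameters and processes pixels in parallel; the two-Gaussian scene below is made up for illustration:

```python
import numpy as np

def splat_pixel(px, means2d, covs2d, colors, opacities, depths):
    """Front-to-back alpha compositing of projected Gaussians at one
    pixel. Each Gaussian contributes alpha = opacity * Gaussian falloff."""
    order = np.argsort(depths)                    # sort by depth, nearest first
    color, transmittance = np.zeros(3), 1.0
    for i in order:
        d = px - means2d[i]
        falloff = np.exp(-0.5 * d @ np.linalg.inv(covs2d[i]) @ d)
        alpha = opacities[i] * falloff
        color += transmittance * alpha * colors[i]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:                  # early termination, as in the real rasterizer
            break
    return color

# A red Gaussian in front of a blue one, both centered on the pixel
px = np.array([0.0, 0.0])
means = np.zeros((2, 2))
covs = np.stack([np.eye(2), np.eye(2)])
cols = np.array([[1.0, 0, 0], [0, 0, 1.0]])
out = splat_pixel(px, means, covs, cols, np.array([0.8, 0.8]), np.array([1.0, 2.0]))
```

The nearer red Gaussian contributes 0.8 of the pixel; the blue one only gets the remaining 0.2 of transmittance, so the result is red with a faint blue tint. Note the formula is the same front-to-back compositing as Chapter 1's volume rendering, just driven by sorted primitives instead of ray samples.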
Check: How does 3DGS render an image?

Chapter 7: Rasterization vs Ray Marching

The fundamental difference between NeRF and 3DGS is how they render:

               | NeRF (Ray Marching)                  | 3DGS (Rasterization)
Approach       | Shoot rays, sample points, query MLP | Project primitives to screen, sort, blend
Representation | Implicit (continuous function)       | Explicit (millions of Gaussians)
Training speed | Hours to days                        | Minutes
Render speed   | Seconds per frame                    | 100+ FPS
Memory         | Low (small MLP)                      | High (millions of Gaussians)
Quality        | Excellent                            | Excellent (often better)
Ray Marching vs Rasterization

Toggle between the two rendering paradigms to see the conceptual difference.

The verdict (2024+): 3DGS has largely overtaken NeRF for most practical applications due to its speed advantage. But NeRF-style implicit representations still win for memory-constrained scenarios and certain generative tasks.
Check: What is the main practical advantage of 3DGS over NeRF?

Chapter 8: Generative 3D

What if you could generate a 3D object from a text prompt or a single image? Generative 3D combines NeRF/3DGS with diffusion models to create 3D content without any multi-view input.

Method          | Input        | Key Idea
DreamFusion     | Text prompt  | Score Distillation Sampling (SDS) from 2D diffusion
Zero-1-to-3     | Single image | Viewpoint-conditioned diffusion
Magic3D         | Text prompt  | Coarse-to-fine with mesh extraction
LGM             | 4 images     | Feed-forward 3DGS generation
GaussianDreamer | Text prompt  | SDS applied to Gaussian splatting
Score Distillation: Text to 3D

A 2D diffusion model guides the optimization of a 3D NeRF representation. The NeRF renders views, the diffusion model critiques them.

SDS intuition: Render the NeRF from a random viewpoint. Add noise. Ask the diffusion model "does this look like [the text prompt]?" Use its gradient to update the NeRF. Repeat from many angles. The NeRF converges to a 3D object that looks right from every direction.
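A heavily simplified sketch of that loop. `render_view` and `predict_noise` are placeholder stand-ins (a real system uses a differentiable NeRF renderer and a frozen text-conditioned diffusion model), and the noising schedule is simplified; only the structure of the update mirrors DreamFusion:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-ins for the real components:
def render_view(theta, camera):
    return np.tanh(theta)                      # placeholder "rendered image"

def predict_noise(noisy_image, prompt, t):
    return noisy_image * 0.1                   # placeholder epsilon-hat

def sds_step(theta, prompt, lr=0.01):
    """One Score Distillation Sampling update: render from a random
    viewpoint, add noise, and nudge the 3D parameters toward what the
    diffusion model expects for the prompt. The key trick is that the
    gradient uses (eps_hat - eps) directly, never backpropagating
    through the diffusion model itself."""
    camera = rng.uniform(0.0, 2 * np.pi)       # random viewpoint each step
    image = render_view(theta, camera)
    t = rng.uniform(0.02, 0.98)                # random diffusion timestep
    eps = rng.normal(size=image.shape)
    noisy = image + t * eps                    # simplified forward noising
    eps_hat = predict_noise(noisy, prompt, t)
    grad = eps_hat - eps                       # SDS gradient w.r.t. the render
    return theta - lr * grad                   # chain rule through the renderer omitted

theta = rng.normal(size=16)                    # stand-in for NeRF parameters
theta = sds_step(theta, "a ceramic mug")
```

Because the viewpoint is resampled every step, no single camera can "cheat": the parameters only converge once the object looks right from all directions.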
Check: How does DreamFusion create 3D from text?

Chapter 9: 3D in Robotics

Neural 3D representations are transforming robotics. Robots need to understand 3D geometry to grasp objects, navigate spaces, and plan motions. NeRF and 3DGS provide rich scene representations that go beyond flat depth maps.

Application     | How NeRF/3DGS Helps
Grasp planning  | Dense 3D geometry for contact-rich manipulation
Navigation      | Photorealistic simulation for training navigation policies
Sim-to-real     | Reconstruct real environments as training scenes
Language-guided | Pair 3D features with CLIP for "find the mug" queries
Deformables     | Model soft objects and cloth for manipulation
3D Scene Understanding for Robots

A robot uses a 3DGS representation to plan grasps. Colored regions show semantic understanding overlaid on 3D structure.

Feature Fields: LERF and similar methods distill CLIP features into NeRF/3DGS representations. This creates a 3D scene where you can point at any location and get a language-aligned feature vector — enabling queries like "where is the coffee mug?" directly in 3D space.
"We don't see the world as it is, we see it as we render it."
— Adapted from Anaïs Nin

You now understand how neural networks reconstruct 3D worlds from photographs. From NeRF's elegant ray marching to 3DGS's blazing-fast splatting, these techniques are redefining what's possible in graphics, VR, and robotics.

Check: How do Feature Fields (like LERF) help robots?