Kerbl, Kopanas, Leimkühler, Drettakis — 2023

3D Gaussian Splatting

Represent scenes as millions of learnable 3D Gaussians, render them via differentiable tile-based splatting at 100+ fps with quality matching NeRF — no neural network at render time.

Prerequisites: NeRF basics + Alpha compositing + Multivariate Gaussians
10 chapters, 6 simulations

Chapter 0: The Problem

You have 100 photos of a room. You want to render what the room looks like from any new viewpoint — smoothly, in real time, at 1080p. This is novel view synthesis.

Neural Radiance Fields (NeRF) solved this beautifully in 2020. Train an MLP to map (x, y, z, direction) → (color, density), then render by marching rays through the scene and querying the network hundreds of times per pixel. The results are stunning, but they come at a brutal cost.

The fundamental bottleneck is the implicit representation. NeRF stores the scene inside network weights. To know what color a point in space is, you must run a forward pass. Every pixel requires hundreds of such queries along its ray. No matter how clever the acceleration, you are married to per-ray, per-sample network evaluation.
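
To make the cost concrete, here is a back-of-the-envelope count of MLP queries per frame. It is only a sketch: it assumes one ray per pixel and a fixed sample count per ray, and the function name is illustrative.

```python
def mlp_queries_per_frame(width: int, height: int, samples_per_ray: int) -> int:
    """Number of MLP evaluations for one frame, with one ray per pixel."""
    return width * height * samples_per_ray


# 1080p with 64 samples per ray (an illustrative sample count)
queries = mlp_queries_per_frame(1920, 1080, 64)
print(f"{queries:,} queries per frame")  # 132,710,400 queries per frame
```

Even at a modest 64 samples per ray, a single 1080p frame needs over a hundred million network evaluations, and real-time rendering repeats that at least 30 times per second.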

The challenge: Can we match Mip-NeRF360's quality while rendering at 100+ fps? The answer requires abandoning implicit neural representations entirely and going back to an old idea: explicit point primitives — but with a critical twist.
NeRF Rendering Cost

Each pixel casts a ray and samples the MLP many times, so cost grows with the sample count. At 1080p with 64 samples per ray, a single frame needs over 130 million MLP queries.

Why can't NeRF render in real time, even with acceleration structures like hash grids?

Chapter 1: The Key Insight

What if the scene representation were explicit instead of implicit? Instead of querying a neural network to ask "what is at position x?", what if we stored millions of little geometric primitives that directly are the scene?

Point clouds are explicit, but raw points have problems: holes between points, no notion of extent or shape, no smooth gradients for optimization. Voxel grids are explicit too, but they waste enormous memory on empty space and are locked to a fixed resolution.

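The insight of 3D Gaussian Splatting is to represent the scene as a collection of 3D Gaussians: soft, fuzzy ellipsoids, each with a position, shape (covariance), color (via spherical harmonics), and opacity. These primitives address the problems above: they have extent and shape, so they can close the holes between points; they are smooth, so they give usable gradients for optimization; and they are placed only where the scene has content, so no memory is spent on empty space.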

The paradigm shift: NeRF asks "what is the color and density at this 3D point?" for every sample along every ray. Gaussian Splatting asks "where does each Gaussian land on the image?" and splats them all at once. Ray marching is O(pixels × samples). Splatting is O(Gaussians). With a fast tile-based rasterizer and GPU sorting, this is radically faster.
Implicit vs Explicit

Left: NeRF queries an MLP per sample along each ray. Right: 3DGS projects Gaussians to the image and blends.

Why are 3D Gaussians a better scene primitive than raw points or voxels?

Chapter 2: Anatomy of a 3D Gaussian

Each Gaussian in the scene is defined by four groups of learnable parameters. Let's walk through each one.

Position (mean) μ

A 3D vector (x, y, z) specifying the center of the Gaussian in world space. Initialized from the sparse SfM point cloud.

Covariance Σ (shape)

A 3×3 positive semi-definite matrix that defines the Gaussian's shape and orientation — think of it as an ellipsoid. But optimizing a raw 3×3 matrix is dangerous: gradient descent can easily produce a matrix that isn't positive semi-definite (and thus not a valid covariance).

The solution: decompose Σ into a rotation R (stored as a unit quaternion q, 4 parameters) and a diagonal scale matrix S = diag(s), where the 3D vector s holds the three axis lengths:

$$\Sigma = R\,S\,S^{\top} R^{\top}$$

This is always positive semi-definite by construction. We optimize s and q independently, and reconstruct Σ when needed. Anisotropic (non-spherical) Gaussians can represent thin structures, flat surfaces, and complex geometry compactly — a flat wall might need just one highly elongated Gaussian instead of hundreds of tiny spheres.
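
A minimal sketch of this reconstruction in plain Python follows. It assumes the quaternion is ordered (w, x, y, z) and normalizes it before use; the function names are illustrative and not taken from any particular implementation.

```python
import math


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix, normalizing first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance_from_scale_rotation(s, q):
    """Build Sigma = R S S^T R^T with S = diag(s)."""
    R = quat_to_rotation(q)
    # M = R S: column k of R scaled by s[k]
    M = [[R[i][k] * s[k] for k in range(3)] for i in range(3)]
    # Sigma = M M^T is symmetric positive semi-definite for any s and q
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


# A thin, wall-like Gaussian rotated 90 degrees about the z axis
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
sigma = covariance_from_scale_rotation((2.0, 0.5, 0.01), q)
```

Because Σ is built as M Mᵀ, no gradient step on s or q can produce an invalid covariance. In the example, the rotation swaps the x and y extents, so the diagonal of `sigma` is approximately (0.25, 4.0, 0.0001).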

Color via Spherical Harmonics (SH)

Real-world appearance is view-dependent: a shiny surface looks different from different angles. Rather than storing a single RGB color, each Gaussian stores spherical harmonic coefficients — a compact basis for functions defined on the sphere of viewing directions.

With SH degree ℓ, we store (ℓ+1)² coefficients per color channel. The paper uses up to degree 3, which gives 16 coefficients × 3 channels = 48 parameters for color. For a given view direction d, the color is:

$$c(\mathbf{d}) = \sum_{\ell=0}^{3} \sum_{m=-\ell}^{\ell} c_{\ell m}\, Y_{\ell}^{m}(\mathbf{d})$$

where $Y_{\ell}^{m}$ are the spherical harmonic basis functions. Degree 0 = diffuse (constant color). Higher degrees capture specular highlights and view-dependent effects.
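
The sketch below evaluates this sum for a single color channel, truncated at degree 1 to keep it short. The sign convention for the degree-1 terms is an assumption (real spherical harmonics with the usual phase), and implementations may also add an offset and clamp the result; neither is shown here.

```python
import math

SH_C0 = 0.5 * math.sqrt(1.0 / math.pi)      # Y_0^0, about 0.2821
SH_C1 = math.sqrt(3.0 / (4.0 * math.pi))    # magnitude of the degree-1 terms, about 0.4886


def num_sh_coeffs(degree: int) -> int:
    """Coefficients per color channel for SH up to the given degree."""
    return (degree + 1) ** 2


def sh_color(coeffs, d):
    """Evaluate one color channel from SH coefficients, truncated at degree 1.

    coeffs: [c_00, c_1(-1), c_10, c_11]; d: unit view direction (x, y, z).
    """
    x, y, z = d
    basis = [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]
    return sum(c * b for c, b in zip(coeffs, basis))


# A purely diffuse Gaussian: only the degree-0 coefficient is non-zero
diffuse = [1.0, 0.0, 0.0, 0.0]
front = sh_color(diffuse, (0.0, 0.0, 1.0))
side = sh_color(diffuse, (1.0, 0.0, 0.0))
```

With only the degree-0 coefficient set, `front` and `side` are equal, which is the "constant color" case. Non-zero degree-1 coefficients make the value change with the view direction.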

Opacity α

A single scalar in [0, 1] controlling how opaque the Gaussian is. Passed through a sigmoid activation for smooth gradients. Transparent Gaussians (α near 0) can be pruned during training.

Parameter count per Gaussian: Position (3) + Scale (3) + Rotation quaternion (4) + Opacity (1) + SH coefficients (48 at degree 3) = 59 parameters. A typical scene uses 1–5 million Gaussians, so 59M–295M floats total. This is stored explicitly in GPU memory — no network weights needed at render time.
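
The parameter count and the resulting memory footprint can be checked with a few lines of arithmetic. This is a sketch that assumes every parameter is stored as an uncompressed 32-bit float; the function names are illustrative.

```python
def params_per_gaussian(sh_degree: int = 3) -> int:
    """Position + scale + rotation quaternion + opacity + SH coefficients."""
    position, scale, rotation, opacity = 3, 3, 4, 1
    sh = 3 * (sh_degree + 1) ** 2  # (degree + 1)^2 coefficients per color channel
    return position + scale + rotation + opacity + sh


def model_size_bytes(num_gaussians: int, sh_degree: int = 3, bytes_per_float: int = 4) -> int:
    """Uncompressed size of all Gaussian parameters."""
    return num_gaussians * params_per_gaussian(sh_degree) * bytes_per_float


for n in (1_000_000, 5_000_000):
    print(f"{n:,} Gaussians: {model_size_bytes(n) / 1e6:.0f} MB")
# 1,000,000 Gaussians: 236 MB
# 5,000,000 Gaussians: 1180 MB
```

Most of the footprint comes from the 48 SH coefficients, which is why lowering the SH degree is a common first step when memory is tight.
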
3D Gaussian Anatomy

A single Gaussian with its four parameter groups. Scale and rotation together determine the shape and orientation of the ellipsoid; opacity controls how strongly it contributes.
Why is the covariance matrix Σ decomposed into rotation R and scale S instead of being optimized directly?

Chapter 3: Differentiable Splatting

We have millions of 3D Gaussians floating in world space. Now we need to render them into an image. This is where the "splatting" happens — and it needs to be both fast and differentiable.

Step 1: Project 3D Gaussians to 2D

Each 3D Gaussian is an ellipsoid in world space. To render it, we project it onto the image plane, producing a 2D ellipse (a "splat"). The 3D covariance Σ transforms under the camera's viewing transformation W and the projective Jacobian J:

$$\Sigma' = J\,W\,\Sigma\,W^{\top} J^{\top}$$

Drop the third row and column of Σ′ to get a 2×2 covariance matrix for the projected 2D Gaussian. This is the same math used in the classical EWA splatting framework (Zwicker et al., 2001).
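
As a sketch of this projection, the code below uses the 2×3 Jacobian of a pinhole projection, which yields the 2×2 covariance directly (equivalent to dropping the third row and column). It assumes W is the 3×3 rotation part of the world-to-camera transform and that the Gaussian's mean `t` is already expressed in camera space; all names are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def project_covariance(sigma, W, t, fx, fy):
    """Return the 2x2 image-space covariance J W Sigma W^T J^T.

    sigma: 3x3 world-space covariance; W: 3x3 world-to-camera rotation;
    t: Gaussian mean in camera space (tz > 0); fx, fy: focal lengths in pixels.
    """
    tx, ty, tz = t
    # Jacobian of (fx * x / z, fy * y / z) evaluated at t
    J = [
        [fx / tz, 0.0, -fx * tx / (tz * tz)],
        [0.0, fy / tz, -fy * ty / (tz * tz)],
    ]
    T = matmul(J, W)  # 2x3
    return matmul(matmul(T, sigma), transpose(T))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [[0.04, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.04]]
cov2d = project_covariance(sigma, identity, (0.0, 0.0, 2.0), 100.0, 100.0)
```

For an isotropic Gaussian on the optical axis the result is simply the world-space variance scaled by (f / z)², here (100 / 2)² × 0.04 = 100 on the diagonal. Off-axis Gaussians pick up shear from the third column of J.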

Step 2: Alpha compositing (front-to-back)

For each pixel, we blend all the Gaussians that overlap it, sorted by depth, front to back:

$$C = \sum_{i=1}^{N} T_i\,\alpha_i\,c_i \quad \text{where} \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j)$$

Here α_i for each pixel is the Gaussian's learned opacity times the 2D Gaussian evaluated at that pixel's location. T_i is the transmittance: how much light from Gaussian i actually reaches the camera after being partially blocked by Gaussians 1 through i−1.

This is exactly the same image formation model as NeRF's volume rendering equation, just with explicit Gaussians instead of neural density samples along a ray.
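
The compositing sum for a single pixel can be sketched as a short loop. The early-exit threshold is an assumed value for illustration, and the function names are not from any specific codebase.

```python
import math


def splat_alpha(opacity, pixel, center, inv_cov):
    """alpha_i at a pixel: learned opacity times the 2D Gaussian falloff.

    inv_cov is the inverse of the projected 2x2 covariance.
    """
    dx, dy = pixel[0] - center[0], pixel[1] - center[1]
    (a, b), (c, d) = inv_cov
    power = -0.5 * (a * dx * dx + (b + c) * dx * dy + d * dy * dy)
    return opacity * math.exp(power)


def composite_front_to_back(splats, stop_transmittance=1e-4):
    """Blend (alpha, (r, g, b)) pairs that are already sorted nearest first.

    Returns the pixel color and the remaining transmittance.
    """
    color = [0.0, 0.0, 0.0]
    T = 1.0  # nothing blocks the first Gaussian
    for alpha, c in splats:
        for k in range(3):
            color[k] += T * alpha * c[k]
        T *= 1.0 - alpha
        if T < stop_transmittance:  # pixel is effectively opaque (assumed threshold)
            break
    return tuple(color), T


# A half-transparent red splat in front of a half-transparent green one
color, T = composite_front_to_back([(0.5, (1.0, 0.0, 0.0)), (0.5, (0.0, 1.0, 0.0))])
print(color, T)  # (0.5, 0.25, 0.0) 0.25
```

The green splat contributes only half as much as the red one because the red splat has already absorbed half of the light.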

Step 3: Tile-based rasterization

The paper's key engineering contribution is a custom CUDA rasterizer that makes this fast:

  1. Tile the screen into 16×16 pixel blocks
  2. Cull Gaussians against the view frustum (keep only those whose 99% confidence interval intersects it)
  3. Assign each Gaussian to the tiles it overlaps, creating (tile_id, depth) key pairs
  4. Sort all key pairs with a single GPU radix sort — one global sort for the entire image, not per-pixel
  5. Rasterize each tile in parallel: one CUDA thread block per tile, loading Gaussians into shared memory and blending front-to-back until the pixel saturates (α → 1)

Why this is so fast: NeRF does O(pixels × samples_per_ray) MLP evaluations. 3DGS does O(Gaussians × tiles_per_Gaussian) simple arithmetic operations. The radix sort runs entirely on the GPU and, for fixed-width keys, takes time linear in the number of keys. No neural network is evaluated at render time: the rasterizer only evaluates 2D Gaussians and blends them. This is why 3DGS achieves 100+ fps.
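
Steps 1, 3, and 4 above can be sketched on the CPU as follows. This is an illustration under simplifying assumptions: each splat is reduced to a hypothetical bounding radius in pixels, and Python's built-in sort stands in for the GPU radix sort.

```python
TILE = 16  # tile edge in pixels


def build_tile_keys(splats, width, height):
    """Return (tile_id, depth, splat_index) keys, sorted by tile and then by depth.

    splats: list of (center_x, center_y, radius_px, depth) tuples.
    """
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    keys = []
    for index, (cx, cy, radius, depth) in enumerate(splats):
        # Range of tiles touched by the splat's bounding square, clamped to the screen
        x0 = max(0, int((cx - radius) // TILE))
        x1 = min(tiles_x - 1, int((cx + radius) // TILE))
        y0 = max(0, int((cy - radius) // TILE))
        y1 = min(tiles_y - 1, int((cy + radius) // TILE))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                keys.append((ty * tiles_x + tx, depth, index))
    # One global sort over all keys, not one sort per pixel
    keys.sort()
    return keys


# Splat 0 straddles tiles 0 and 1; splat 1 sits in tile 0 and is closer to the camera
splats = [(16.0, 8.0, 4.0, 2.0), (8.0, 8.0, 2.0, 1.0)]
keys = build_tile_keys(splats, width=64, height=32)
print(keys)  # [(0, 1.0, 1), (0, 2.0, 0), (1, 2.0, 0)]
```

After the sort, all entries for a tile are contiguous and already ordered by depth, so each tile's thread block can read its slice and blend front-to-back without any further sorting.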

Backward pass

For training, we need gradients of the rendered image with respect to all Gaussian parameters. The rasterizer traverses the per-tile Gaussian lists back-to-front and recovers each intermediate transmittance from the value stored at the end of the forward pass, dividing out one (1 − α_i) factor per Gaussian. This avoids storing per-pixel lists of arbitrary length: only one float (the final accumulated value) is stored per pixel.
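
The recovery trick can be illustrated for one pixel. The sketch assumes the stored value is the final transmittance and that the division is by (1 − α_i), which is what inverts the forward product; it also assumes every alpha is strictly below 1 (implementations typically clamp it), so the division is safe.

```python
def forward_transmittances(alphas):
    """Forward pass: T_i in front of each Gaussian, plus the final value after all of them."""
    T, before = 1.0, []
    for a in alphas:
        before.append(T)
        T *= 1.0 - a
    return before, T


def recover_transmittances(alphas, T_final):
    """Backward pass: walk back to front, undoing one (1 - alpha) factor per step."""
    T, recovered = T_final, []
    for a in reversed(alphas):
        T /= 1.0 - a
        recovered.append(T)
    recovered.reverse()
    return recovered


alphas = [0.3, 0.5, 0.2]              # front to back
before, T_final = forward_transmittances(alphas)
recovered = recover_transmittances(alphas, T_final)
```

Here `recovered` matches `before` up to floating-point rounding, even though only `T_final` was kept from the forward pass.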

Splatting Pipeline

3D Gaussians are projected to 2D ellipses and alpha-composited front-to-back. Each colored ellipse is one Gaussian's splat.
What makes the tile-based rasterizer so much faster than NeRF's ray marching?

Chapter 4: Adaptive Density Control

Training starts from a sparse SfM point cloud — maybe 50,000 points for a complex scene. That is nowhere near enough to represent fine geometry. The optimization needs to grow and refine the Gaussian set during training. This is adaptive density control, and it has three operations: clone, split, and prune.

When to densify

Every 100 training iterations (after a warm-up period), the system checks each Gaussian's average view-space positional gradient. A large positional gradient means the optimizer is struggling: the current placement isn't capturing the geometry well. A Gaussian is densified when this gradient exceeds the threshold τ_pos = 0.0002.

Clone (under-reconstruction)

When a small Gaussian has large gradients, the scene needs more coverage in that area. The system creates a copy of the Gaussian and moves it in the direction of the positional gradient. This increases both the number of Gaussians and the total volume they cover.

Split (over-reconstruction)

When a large Gaussian has large gradients, it is trying to cover too much detail with one blob. The system replaces it with two smaller Gaussians, each with scale divided by φ = 1.6, positioned by sampling from the original Gaussian's distribution. This preserves total volume but increases resolution.
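
A simplified sketch of the split step is shown below. To stay short it assumes an axis-aligned Gaussian (identity rotation), so sampling from the original distribution reduces to one independent normal draw per axis; the function name is illustrative.

```python
import random

PHI = 1.6  # scale reduction factor for the two children


def split_gaussian(mean, scale, rng=random):
    """Replace one axis-aligned Gaussian with two smaller ones.

    mean, scale: 3-tuples; scale holds the per-axis standard deviations.
    Returns a list of two (mean, scale) pairs.
    """
    children = []
    for _ in range(2):
        # New center drawn from the original Gaussian's own distribution
        new_mean = tuple(m + rng.gauss(0.0, s) for m, s in zip(mean, scale))
        new_scale = tuple(s / PHI for s in scale)
        children.append((new_mean, new_scale))
    return children


children = split_gaussian((0.0, 0.0, 0.0), (1.6, 0.8, 0.4), rng=random.Random(0))
```

Both children land somewhere inside the region the parent covered, each with scale (1.0, 0.5, 0.25), so the same area is now described by two finer primitives.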

Prune (cleanup)

Gaussians with opacity α below a threshold ε_α are removed, since they are effectively transparent and contribute nothing. Additionally, every 3,000 iterations, all opacities are reset close to zero. The optimization then restores opacity for the Gaussians that are actually needed, while the ones that stay transparent get pruned. This prevents "floater" artifacts near the training cameras.

The elegance of adaptive density control: Unlike voxel grids or hash tables that have a fixed spatial resolution, the Gaussian count adapts to scene complexity. A plain white wall might use a few large Gaussians. A detailed bookshelf might use tens of thousands of tiny ones. The optimization discovers this allocation automatically via gradient signals.
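
The decision logic of the three operations can be summarized in one function. This is a sketch: the gradient threshold comes from the text, while the size threshold that separates "small" from "large" and the default opacity threshold are assumed values for illustration.

```python
TAU_POS = 0.0002  # view-space positional gradient threshold from the text


def density_action(grad_norm, max_scale, opacity, size_threshold, min_opacity=0.005):
    """Decide what adaptive density control does with one Gaussian.

    size_threshold and min_opacity are hypothetical values chosen by the caller.
    """
    if opacity < min_opacity:
        return "prune"   # effectively transparent
    if grad_norm <= TAU_POS:
        return "keep"    # the optimizer is not struggling here
    if max_scale > size_threshold:
        return "split"   # large Gaussian over detailed geometry
    return "clone"       # small Gaussian in an under-covered area


print(density_action(0.001, 0.02, 0.9, size_threshold=0.1))   # clone
print(density_action(0.001, 0.50, 0.9, size_threshold=0.1))   # split
print(density_action(0.001, 0.50, 0.001, size_threshold=0.1)) # prune
```

The same gradient signal drives both clone and split; only the Gaussian's size decides which of the two is applied.
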
Clone, Split, Prune

The three density control operations: Clone duplicates small Gaussians, Split breaks up large ones, and Prune removes transparent ones.

What signal does the system use to decide which Gaussians need to be cloned or split?

Chapter 5: Training

Training 3D Gaussian Splatting follows a straightforward render-and-compare loop, but there are several important design choices that make it work.

Initialization

Start with the sparse point cloud from Structure-from-Motion (SfM) — the same camera calibration step that NeRF uses. Each SfM point becomes one Gaussian. The initial covariance is set to an isotropic (spherical) Gaussian whose radius equals the average distance to the three nearest neighbors. On synthetic scenes (NeRF-synthetic dataset), even random initialization works.
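
The initial radius rule can be sketched with a brute-force nearest-neighbor search. This is fine for a handful of points; a real SfM cloud with tens of thousands of points would need a spatial index, which is omitted here. The function name is illustrative.

```python
import math


def initial_radii(points, k=3):
    """Mean distance from each point to its k nearest neighbors (brute force)."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        radii.append(sum(nearest) / len(nearest))
    return radii


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
radii = initial_radii(points)
print(radii[0])  # 2.0
```

Points in dense regions start as small Gaussians and isolated points start as large ones, so the initial set covers the scene without large gaps.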

Loss function

The training loss combines L1 photometric loss with a structural similarity term:

$$\mathcal{L} = (1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}} \quad \text{with} \quad \lambda = 0.2$$

L1 compares pixel colors directly. D-SSIM (Structural Dissimilarity) captures perceptual differences — it penalizes blurriness and structural mismatches that L1 alone might miss. The combination yields both sharp edges and accurate colors.
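
The weighting can be sketched as below. Computing SSIM itself is out of scope for this example, so the D-SSIM term is passed in as a precomputed number; the function names are illustrative.

```python
def l1_loss(rendered, target):
    """Mean absolute difference between two flat lists of pixel values."""
    return sum(abs(r - t) for r, t in zip(rendered, target)) / len(rendered)


def total_loss(l1, d_ssim, lam=0.2):
    """Weighted sum (1 - lambda) * L1 + lambda * D-SSIM."""
    return (1.0 - lam) * l1 + lam * d_ssim


l1 = l1_loss([0.2, 0.8], [0.0, 1.0])   # 0.2
loss = total_loss(l1, d_ssim=0.5)      # 0.8 * 0.2 + 0.2 * 0.5
```

With λ = 0.2 the photometric term dominates, and the structural term acts as a correction that discourages blurry solutions.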

Optimizer

Training uses the standard Adam optimizer with learning rate scheduling. Positions use an exponential decay schedule (starting at 1.6×10⁻⁴ and decaying to 1.6×10⁻⁶). Other parameters use fixed learning rates: opacity at 0.05, scales at 0.005, rotation at 0.001, SH coefficients at 0.0025.
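
One common way to realize such a schedule is log-linear interpolation between the two endpoints. The sketch below assumes that form and a 30,000-step horizon; the function name and the exact interpolation are assumptions, not a statement about the reference implementation.

```python
import math


def position_lr(step, lr_init=1.6e-4, lr_final=1.6e-6, max_steps=30_000):
    """Exponential decay: interpolate linearly in log space between the endpoints."""
    t = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))


for step in (0, 15_000, 30_000):
    print(step, f"{position_lr(step):.2e}")
# 0 1.60e-04
# 15000 1.60e-05
# 30000 1.60e-06
```

The rate drops by the same factor over every equal span of steps, so positions move freely early on and settle into fine adjustments near the end.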

Activation functions
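
As a minimal sketch, raw unconstrained parameters can be mapped to valid values as follows. The sigmoid on opacity and the inverse sigmoid at initialization are mentioned elsewhere in this text; the exponential on scale follows the original paper, and normalizing the quaternion is an assumption that simply keeps the rotation valid. Function names are illustrative.

```python
import math


def sigmoid(x):
    """Map a raw value to an opacity in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p):
    """Raw value whose sigmoid is p; used to initialize opacity at a chosen level."""
    return math.log(p / (1.0 - p))


def activate_scale(raw_scale):
    """The exponential keeps every axis length strictly positive."""
    return tuple(math.exp(v) for v in raw_scale)


def activate_rotation(raw_q):
    """Normalize so the four stored numbers form a unit quaternion."""
    n = math.sqrt(sum(v * v for v in raw_q))
    return tuple(v / n for v in raw_q)


raw_opacity = inverse_sigmoid(0.1)          # start at 10% opacity (illustrative value)
opacity = sigmoid(raw_opacity)
scale = activate_scale((-2.0, 0.0, 1.5))
rotation = activate_rotation((2.0, 0.0, 0.0, 0.0))
```

The optimizer works on the raw values, so gradient steps never have to be clipped or projected back into a valid range.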

Training schedule

Init: Create Gaussians from SfM points. Set isotropic covariance, random SH, α via inverse sigmoid.
Warm-up: 500 iterations of parameter optimization only (no densification).
Main loop: Render a random training view → compute L1 + D-SSIM loss → backprop → update parameters. Every 100 iterations: densify (clone/split). Every 3,000 iterations: reset opacities.
Convergence: Up to 30,000 iterations. Stopping early after about 6 min gives lower quality; the full run takes about 51 min. Final: 1–5M Gaussians.
Training time comparison: 6 minutes of 3DGS training matches InstantNGP quality (PSNR ~22). 51 minutes matches or exceeds Mip-NeRF360 (PSNR ~25). Mip-NeRF360 itself takes 48 hours. That is a 56× speedup at equal quality.
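
The quoted speedup follows directly from the two training times given above, as this short check shows.

```python
mip_nerf360_minutes = 48 * 60   # 48 hours
gaussian_splatting_minutes = 51

speedup = mip_nerf360_minutes / gaussian_splatting_minutes
print(f"{speedup:.1f}x")  # 56.5x
```
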
Why does the loss function combine L1 with D-SSIM instead of using L1 alone?

Chapter 6: Results

3D Gaussian Splatting was evaluated on three established benchmarks: Mip-NeRF360 (complex unbounded real scenes), Tanks&Temples (large-scale indoor/outdoor), and Deep Blending (indoor with challenging lighting). The results are striking.

Quality (PSNR / SSIM / LPIPS)

On Mip-NeRF360 scenes at full training time (51 min):

Speed

At 1080p resolution:

The headline result: 3DGS is the first method to achieve real-time (≥30 fps) novel view synthesis at 1080p with quality that matches the best offline method. It renders 1,300–1,900× faster than Mip-NeRF360 while matching or exceeding its quality. Training is also 56× faster.

Ablation studies

The paper systematically removes components to measure their contribution:

Speed vs Quality

Comparing methods on the Mip-NeRF360 benchmark. 3DGS (teal) achieves the best combination of speed and quality.

How does 3DGS rendering speed compare to Mip-NeRF360 at 1080p?

Chapter 7: Comparison with NeRF

3DGS and NeRF solve the same problem (novel view synthesis from multi-view images) but make fundamentally different design choices. Understanding these differences clarifies when to use which.

Scene representation

NeRF: Implicit. The scene is encoded in MLP weights. To query any point, run a forward pass. The representation is continuous and compact (a few MB of weights), but every render query is expensive.

3DGS: Explicit. The scene is millions of Gaussian primitives stored as parameter arrays. No network query needed. The representation is large (hundreds of MB) but every render operation is trivially cheap (evaluate a 2D Gaussian, multiply, add).

Rendering

NeRF: Ray marching. Cast a ray per pixel, sample 64–256 points per ray, query the MLP at each sample. Computational cost scales with resolution × samples per ray.

3DGS: Rasterization (splatting). Project each Gaussian to the image, sort by depth, blend in tiles. Cost scales with number of Gaussians × tiles per Gaussian. Massive parallelism on GPUs.

Editability

NeRF: Hard to edit. The scene is entangled in network weights. Moving an object means retraining.

3DGS: Directly editable. Gaussians are explicit objects with positions. You can select, move, delete, or recolor subsets of Gaussians. This enables scene editing, composition, and animation.

The shared image formation model

Despite different rendering pipelines, both methods use the same alpha compositing equation:

$$C = \sum_{i} T_i\,\alpha_i\,c_i \quad \text{where} \quad T_i = \prod_{j<i} (1 - \alpha_j)$$

In NeRF, the samples come from points along a ray with density σ. In 3DGS, they come from Gaussians overlapping a pixel. Same math, different sources of (c, α).
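
The sketch below feeds both kinds of samples through the same blending loop. The NeRF alpha uses the standard volume-rendering relation α = 1 − exp(−σδ) with step length δ; colors are reduced to one grayscale channel to keep the example short, and the sample values are made up for illustration.

```python
import math


def nerf_alpha(sigma, delta):
    """Alpha of one ray sample from its density sigma and step length delta."""
    return 1.0 - math.exp(-sigma * delta)


def composite(samples):
    """Blend (alpha, color) pairs, nearest first, with the shared equation."""
    color, T = 0.0, 1.0
    for alpha, c in samples:
        color += T * alpha * c
        T *= 1.0 - alpha
    return color


# NeRF: alphas derived from densities sampled along a ray
nerf_samples = [(nerf_alpha(0.0, 0.1), 0.9), (nerf_alpha(50.0, 0.1), 0.4)]
# 3DGS: alphas already given by opacity times the 2D Gaussian falloff
splat_samples = [(0.6, 0.9), (0.8, 0.4)]

nerf_pixel = composite(nerf_samples)
splat_pixel = composite(splat_samples)
```

The `composite` function does not need to know where its inputs came from, which is the sense in which the two methods share one image formation model.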

When to use which: 3DGS wins on speed (real-time), editability, and training time. NeRF wins on compactness (model size) and may generalize better to unseen regions due to the implicit prior of the MLP. For practical applications requiring real-time rendering (VR, games, telepresence), 3DGS is the clear choice.
What is the fundamental reason 3DGS is faster than NeRF at render time?

Chapter 8: Limitations

3DGS is a breakthrough, but it has real limitations that subsequent work has been addressing.

Storage

A trained 3DGS model stores 1–5 million Gaussians, each with 59 parameters as 32-bit floats (236 bytes per Gaussian). That is roughly 0.2–1.2 GB per scene, compared to ~5 MB for a NeRF MLP. Compression techniques (quantization, codebooks, pruning) can reduce this by 10–50×, but storage remains a concern for mobile and web deployment.

Aliasing and zoom

When the camera zooms far out, many Gaussians project to sub-pixel size, causing aliasing artifacts (flickering, moiré patterns). The original 3DGS has no multi-scale or mip-mapping equivalent. Mip-Splatting (2024) addresses this by adding 3D and 2D smoothing filters that prevent Gaussians from becoming smaller than a pixel.
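
A rough way to see when this happens is to estimate a Gaussian's on-screen footprint with the pinhole relation size_px ≈ f · size / depth. The sketch below uses that approximation; the function names and the numbers are illustrative.

```python
def projected_size_px(world_size, depth, focal_px):
    """Approximate on-screen extent of an object of the given world size."""
    return focal_px * world_size / depth


def is_subpixel(world_size, depth, focal_px):
    """True when the footprint is smaller than one pixel."""
    return projected_size_px(world_size, depth, focal_px) < 1.0


# A 1 cm Gaussian seen with a 1000 px focal length
print(projected_size_px(0.01, 2.0, 1000.0))   # 5.0 pixels up close
print(projected_size_px(0.01, 20.0, 1000.0))  # 0.5 pixels after zooming out
```

Once the footprint drops below a pixel, whether a Gaussian is hit depends on exactly where the pixel center falls, which is what shows up as flicker.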

Dynamic scenes

The original paper handles only static scenes. Extending to dynamics requires modeling Gaussian motion over time. 4D Gaussian Splatting and deformable variants have since addressed this, but it was not in the original work.

Artifacts in under-observed regions

Regions seen by few training views can develop "floater" Gaussians — semi-transparent blobs floating in mid-air. The opacity reset heuristic mitigates this but doesn't eliminate it entirely. Regularization techniques (depth supervision, normal consistency) help.

No learned priors

Because 3DGS is purely per-scene optimization (no pretrained network), it cannot hallucinate plausible content in unobserved regions. NeRF's MLP provides a weak inductive bias toward smooth, natural-looking completions. 3DGS needs all regions to be well-observed in the training images.

The storage vs. speed tradeoff: NeRF's implicit representation is compact but slow to render. 3DGS's explicit representation is fast to render but large to store. This is a fundamental tradeoff between compression (implicit/neural) and speed (explicit/primitive). Subsequent work on Gaussian compression aims to have both.
Why does 3DGS produce aliasing artifacts when the camera zooms far out?

Chapter 9: Connections

What 3DGS built on

NeRF (Mildenhall et al., 2020): The neural radiance field that started it all. 3DGS keeps the same image formation model (alpha compositing along viewing directions) but replaces the implicit MLP with explicit Gaussians.

Mip-NeRF 360 (Barron et al., 2022): The quality benchmark that 3DGS aimed to match. Handles unbounded scenes with anti-aliased conical frustum rendering. Takes 48 hours to train.

EWA Splatting (Zwicker et al., 2001): The mathematical framework for projecting 3D Gaussians to 2D via the Jacobian of the projective transformation. 3DGS directly uses this 20-year-old technique.

Plenoxels / InstantNGP (2022): Showed that explicit or hybrid representations could dramatically speed up NeRF training. 3DGS pushes this further by using unstructured primitives instead of grid-based structures.

What 3DGS enabled

4D Gaussian Splatting (Wu et al., 2024): Extends to dynamic scenes by modeling Gaussian deformation over time, enabling real-time rendering of videos and dynamic content.

Mip-Splatting (Yu et al., 2024): Fixes the aliasing problem with 3D smoothing and 2D Mip filters that prevent Gaussians from falling below pixel scale. Alias-free 3DGS.

GaussianSLAM / SplaTAM (2024): Uses Gaussian Splatting as the scene representation for real-time SLAM (Simultaneous Localization and Mapping). The explicit, differentiable representation is a natural fit for tracking and mapping.

DUSt3R / MASt3R (2024): Dense point cloud reconstruction from uncalibrated images. Combined with Gaussian Splatting, enables reconstruction without SfM preprocessing.

Gaussian-based generation (2024+): Text-to-3D and image-to-3D methods like DreamGaussian and LGM generate 3D Gaussians directly, enabling fast 3D content creation.

The impact: 3DGS triggered an explosion of follow-up work: 500+ papers in the first year. It made real-time novel view synthesis practical for the first time, enabling applications in VR/AR, robotics, autonomous driving, and digital content creation. The explicit Gaussian representation turned out to be a universal primitive for 3D vision — from SLAM to generation to simulation.

Cheat sheet

Representation: Millions of 3D Gaussians: position μ, covariance $\Sigma = R S S^{\top} R^{\top}$, SH color, opacity α
Rendering: Project to 2D ellipses, tile-based sort, alpha-composite front-to-back. No neural network
Training: L1 + D-SSIM loss, Adam, adaptive density control (clone/split/prune), 30K iterations
Performance: 100+ fps at 1080p, matching Mip-NeRF360 quality, 56× faster training
Limitation: Large storage (200 MB–1 GB), aliasing on zoom-out, static scenes only
What fundamental technique from 2001 does 3DGS reuse for projecting 3D Gaussians to 2D?