3D Gaussian Splatting

Chapter 0: The Problem

You have 100 photos of a room. You want to render what the room looks like from any new viewpoint — smoothly, in real time, at 1080p. This is novel view synthesis.

Neural Radiance Fields (NeRF) solved this beautifully in 2020. Train an MLP to map (x, y, z, direction) → (color, density), then render by marching rays through the scene and querying the network hundreds of times per pixel. The results are stunning. But there is a brutal cost:

Training: Mip-NeRF360, the best-quality variant, takes up to 48 hours per scene
Rendering: 0.07 fps at 1080p. That is 14 seconds per frame. The network must be queried millions of times per image
The speed-quality tradeoff: Faster methods like InstantNGP (hash grids) and Plenoxels (sparse voxels) reach interactive rates of roughly 10 fps, but they sacrifice quality and still fall short of real-time

The fundamental bottleneck is the implicit representation. NeRF stores the scene inside network weights. To know what color a point in space is, you must run a forward pass. Every pixel requires hundreds of such queries along its ray. No matter how clever the acceleration, you are married to per-ray, per-sample network evaluation.
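As a rough back-of-the-envelope sketch of that cost, the snippet below counts MLP queries for one 1080p frame. The samples-per-ray values are illustrative assumptions, not measurements from any particular NeRF implementation.

```python
# Rough count of MLP queries needed to render one 1080p frame with ray marching.
# The samples-per-ray values are illustrative assumptions.
WIDTH, HEIGHT = 1920, 1080


def mlp_queries_per_frame(samples_per_ray: int) -> int:
    """One ray per pixel, `samples_per_ray` network evaluations per ray."""
    return WIDTH * HEIGHT * samples_per_ray


for samples in (64, 128, 256):
    print(f"{samples:>3} samples/ray -> {mlp_queries_per_frame(samples):,} queries")
```

Even at the low end, a single frame needs well over a hundred million network evaluations, which is why acceleration structures alone do not close the gap to real time.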

The challenge: Can we match Mip-NeRF360's quality while rendering at 100+ fps? The answer requires abandoning implicit neural representations entirely and going back to an old idea: explicit point primitives — but with a critical twist.

NeRF Rendering Cost

Each pixel casts a ray and samples the MLP many times, so cost grows linearly with the number of samples per ray. At 1080p with 64 samples per ray, that is more than 130 million MLP queries per frame.

Why can't NeRF render in real time, even with acceleration structures like hash grids?

The scene is stored implicitly in network weights, so rendering every pixel requires many forward passes through the MLP, creating an inescapable per-ray computation bottleneck
The images are too high resolution
GPUs are not fast enough for ray tracing

Chapter 1: The Key Insight

What if the scene representation were explicit instead of implicit? Instead of querying a neural network to ask "what is at position x?", what if we stored millions of little geometric primitives that directly are the scene?

Point clouds are explicit, but raw points have problems: holes between points, no notion of extent or shape, no smooth gradients for optimization. Voxel grids are explicit too, but they waste enormous memory on empty space and are locked to a fixed resolution.

The insight of 3D Gaussian Splatting: represent the scene as a collection of 3D Gaussians — soft, fuzzy ellipsoids, each with a position, shape (covariance), color (via spherical harmonics), and opacity. These primitives are:

Explicit: Each Gaussian is a concrete object with position and properties. No network query needed
Differentiable: Gaussians are smooth functions, so we can backpropagate through the rendering process and optimize all parameters with gradient descent
Rasterizable: Project each 3D Gaussian to a 2D ellipse on the image plane, then alpha-blend front-to-back. This is classic rasterization — the thing GPUs were born to do — not ray marching
Adaptive: We can add, remove, split, and clone Gaussians during training. No fixed grid resolution

The paradigm shift: NeRF asks "what is the color and density at this 3D point?" for every sample along every ray. Gaussian Splatting asks "where does each Gaussian land on the image?" and splats them all at once. Ray marching costs O(pixels × samples) network evaluations. Splatting costs roughly O(Gaussians × tiles each one touches) cheap arithmetic operations. With a fast tile-based rasterizer and GPU sorting, this is radically faster.

Implicit vs Explicit

Left: NeRF queries an MLP per sample along each ray. Right: 3DGS projects Gaussians to the image and blends.

Why are 3D Gaussians a better scene primitive than raw points or voxels?

They are differentiable (enabling gradient-based optimization), have extent and shape (no holes), can be rasterized efficiently (GPU-friendly), and their count adapts during training (no fixed resolution)
They use less memory than points
They produce sharper images than neural networks

Chapter 2: Anatomy of a 3D Gaussian

Each Gaussian in the scene is defined by four groups of learnable parameters. Let's walk through each one.

Position (mean) μ

A 3D vector (x, y, z) specifying the center of the Gaussian in world space. Initialized from the sparse SfM point cloud.

Covariance Σ (shape)

A 3×3 positive semi-definite matrix that defines the Gaussian's shape and orientation — think of it as an ellipsoid. But optimizing a raw 3×3 matrix is dangerous: gradient descent can easily produce a matrix that isn't positive semi-definite (and thus not a valid covariance).

The solution: decompose Σ into a rotation R (stored as a unit quaternion q, 4 parameters) and a scale S (a 3D vector s for the three axis lengths):

Σ = R S S^T R^T

This is always positive semi-definite by construction. We optimize s and q independently, and reconstruct Σ when needed. Anisotropic (non-spherical) Gaussians can represent thin structures, flat surfaces, and complex geometry compactly — a flat wall might need just one highly elongated Gaussian instead of hundreds of tiny spheres.
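A minimal sketch of that reconstruction in plain Python follows. The function names and the (w, x, y, z) quaternion ordering are assumptions made for illustration; only the formula Σ = R S S^T R^T comes from the text.

```python
import math


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix, normalizing first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(q, s):
    """Build Sigma = R S S^T R^T from a quaternion q and per-axis scales s."""
    R = quat_to_rotation(q)
    # M = R S with S = diag(s), so Sigma = M M^T is symmetric PSD by construction.
    M = [[R[i][k] * s[k] for k in range(3)] for i in range(3)]
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


# A thin, plate-like Gaussian rotated 90 degrees about the z axis.
q = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
sigma = covariance(q, (2.0, 0.5, 0.1))
```

Because Σ is formed as M M^T, any real values of q (nonzero) and s yield a valid covariance, which is the property the optimizer relies on. In this example the rotation swaps the x and y extents, so the diagonal comes out as approximately (0.25, 4.0, 0.01).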

Color via Spherical Harmonics (SH)

Real-world appearance is view-dependent: a shiny surface looks different from different angles. Rather than storing a single RGB color, each Gaussian stores spherical harmonic coefficients — a compact basis for functions defined on the sphere of viewing directions.

With SH degree ℓ, we store (ℓ+1)² coefficients per color channel. The paper uses up to degree 3, which gives 16 coefficients × 3 channels = 48 parameters for color. For a given view direction d, the color is:

c(d) = ∑_{ℓ,m} c_{ℓm} Y_ℓ^m(d)

where Y_ℓ^m are the spherical harmonic basis functions. Degree 0 = diffuse (constant color). Higher degrees capture specular highlights and view-dependent effects.
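The sketch below evaluates one color channel for degrees 0 and 1 only, to keep it short. It assumes one common real-valued SH sign convention and a hypothetical coefficient ordering [c00, c1,-1, c10, c11]; other conventions differ in signs and ordering.

```python
# Real spherical harmonic constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814  # 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199   # sqrt(3 / (4 * pi))


def num_sh_coeffs(degree: int) -> int:
    """Number of SH coefficients per color channel up to `degree`."""
    return (degree + 1) ** 2


def sh_color(coeffs, d):
    """Evaluate one color channel from SH coefficients (degree <= 1 in this sketch).

    coeffs: [c00] or [c00, c1m1, c10, c11] (assumed ordering).
    d: unit view direction (x, y, z).
    """
    x, y, z = d
    value = SH_C0 * coeffs[0]  # degree 0: view-independent (diffuse) term
    if len(coeffs) >= 4:       # degree 1: linear variation with direction
        value += SH_C1 * (-y * coeffs[1] + z * coeffs[2] - x * coeffs[3])
    return value


print(num_sh_coeffs(3), "coefficients per channel,", 3 * num_sh_coeffs(3), "for RGB")
```

With only the degree-0 coefficient the result is identical for every direction, matching the "degree 0 = diffuse" statement above.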

Opacity α

A single scalar in [0, 1) controlling how opaque the Gaussian is. The stored raw parameter is passed through a sigmoid activation, which keeps the value in range and gives smooth gradients. Transparent Gaussians (α near 0) can be pruned during training.

Parameter count per Gaussian: Position (3) + Scale (3) + Rotation quaternion (4) + Opacity (1) + SH coefficients (48 at degree 3) = 59 parameters. A typical scene uses 1–5 million Gaussians, so 59M–295M floats total. This is stored explicitly in GPU memory — no network weights needed at render time.

3D Gaussian Anatomy

A single Gaussian with its four parameter groups. Changing the scale and rotation changes the shape and orientation of the ellipsoid.

Why is the covariance matrix Σ decomposed into rotation R and scale S instead of being optimized directly?

Direct optimization can produce matrices that aren't positive semi-definite (invalid covariances). The RSS^TR^T decomposition is always valid by construction
It reduces the number of parameters
Quaternions are faster to compute

Chapter 3: Differentiable Splatting

We have millions of 3D Gaussians floating in world space. Now we need to render them into an image. This is where the "splatting" happens — and it needs to be both fast and differentiable.

Step 1: Project 3D Gaussians to 2D

Each 3D Gaussian is an ellipsoid in world space. To render it, we project it onto the image plane, producing a 2D ellipse (a "splat"). The 3D covariance Σ transforms under the camera's viewing transformation W and the projective Jacobian J:

Σ′ = J W Σ W^T J^T

Drop the third row and column of Σ′ to get a 2×2 covariance matrix for the projected 2D Gaussian. This is the same math used in the classical EWA splatting framework (Zwicker et al., 2001).
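A small sketch of this projection, under stated assumptions: a pinhole camera with focal lengths fx and fy in pixels, W taken as the 3×3 rotation part of the world-to-camera transform, and the Jacobian written directly as its first two rows (equivalent to computing the 3×3 product and dropping the third row and column). The helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def project_covariance(sigma, W, t_cam, fx, fy):
    """Return the 2x2 image-space covariance of a 3D Gaussian.

    sigma: 3x3 world-space covariance.
    W:     3x3 world-to-camera rotation.
    t_cam: Gaussian center in camera coordinates (x, y, z), z > 0.
    """
    x, y, z = t_cam
    # First two rows of the Jacobian of (x, y, z) -> (fx * x / z, fy * y / z).
    J = [[fx / z, 0.0, -fx * x / (z * z)],
         [0.0, fy / z, -fy * y / (z * z)]]
    T = matmul(J, W)                                # 2x3
    return matmul(matmul(T, sigma), transpose(T))   # 2x2


# An isotropic Gaussian (std 0.1) on the optical axis, 2 units from the camera.
sigma = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cov2d = project_covariance(sigma, identity, (0.0, 0.0, 2.0), 1000.0, 1000.0)
```

In this example the splat has variance (fx / z)² × 0.01 = 2500 px² along each image axis, a standard deviation of 50 pixels, which matches the intuition that screen-space size scales with focal length over depth.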

Step 2: Alpha compositing (front-to-back)

For each pixel, we blend all the Gaussians that overlap it, sorted by depth, front to back:

C = ∑_{i=1}^{N} T_i α_i c_i, where T_i = ∏_{j=1}^{i−1} (1 − α_j)

Here α_i for each pixel is the Gaussian's learned opacity times the 2D Gaussian evaluated at that pixel's location. T_i is the transmittance — how much light from Gaussian i actually reaches the camera after being partially blocked by Gaussians 1 through i−1.

This is exactly the same image formation model as NeRF's volume rendering equation, just with explicit Gaussians instead of neural density samples along a ray.
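The compositing sum can be sketched for a single pixel as follows. The early-exit threshold is an assumed value chosen for illustration, and the input is taken to be already sorted front to back.

```python
def composite(splats, stop_threshold=1e-4):
    """Blend (rgb, alpha) pairs sorted front to back.

    Returns the pixel color and the remaining transmittance T.
    `stop_threshold` is an assumed cutoff: once almost no light can pass,
    later Gaussians cannot change the pixel and are skipped.
    """
    color = [0.0, 0.0, 0.0]
    T = 1.0  # transmittance before the first Gaussian
    for rgb, alpha in splats:
        for ch in range(3):
            color[ch] += T * alpha * rgb[ch]  # the T_i * alpha_i * c_i term
        T *= 1.0 - alpha                      # light left for Gaussians behind
        if T < stop_threshold:
            break
    return color, T


# Half-transparent red in front of opaque blue; the green splat is never reached.
pixel, remaining = composite([
    ((1.0, 0.0, 0.0), 0.5),
    ((0.0, 0.0, 1.0), 1.0),
    ((0.0, 1.0, 0.0), 0.9),
])
```

Here the red splat contributes 0.5, the blue splat receives the remaining transmittance of 0.5, and the loop stops because nothing behind an opaque splat is visible.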

Step 3: Tile-based rasterization

The paper's key engineering contribution is a custom CUDA rasterizer that makes this fast:

Tile the screen into 16×16 pixel blocks
Cull Gaussians against the view frustum (keep only those whose 99% confidence interval intersects it)
Assign each Gaussian to the tiles it overlaps, creating (tile_id, depth) key pairs
Sort all key pairs with a single GPU radix sort — one global sort for the entire image, not per-pixel
Rasterize each tile in parallel: one CUDA thread block per tile, loading Gaussians into shared memory and blending front-to-back until the pixel saturates (α → 1)
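The binning and sorting steps above can be sketched on the CPU as follows. This is an illustration only: the splat footprint is approximated by a circle of a given pixel radius, the tuple layout is a hypothetical choice, and Python's built-in sort stands in for the GPU radix sort.

```python
TILE = 16  # tile edge in pixels, as in the text


def build_sorted_keys(splats, width, height):
    """Return (tile_id, depth, splat_index) keys sorted by tile, then depth.

    splats: list of (cx, cy, radius, depth) in pixel units (assumed layout).
    """
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    keys = []
    for idx, (cx, cy, radius, depth) in enumerate(splats):
        # Range of tiles touched by the splat's bounding square, clamped to the screen.
        x0 = max(0, int((cx - radius) // TILE))
        x1 = min(tiles_x - 1, int((cx + radius) // TILE))
        y0 = max(0, int((cy - radius) // TILE))
        y1 = min(tiles_y - 1, int((cy + radius) // TILE))
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                keys.append((ty * tiles_x + tx, depth, idx))
    keys.sort()  # one global sort; stand-in for the GPU radix sort
    return keys


splats = [(8, 8, 4, 5.0), (20, 8, 10, 2.0), (10, 10, 3, 1.0)]
keys = build_sorted_keys(splats, width=64, height=32)
tile0_order = [idx for tile, _, idx in keys if tile == 0]
```

After the single sort, every tile's Gaussians sit in one contiguous, depth-ordered run, so each tile can be blended independently. In the example, tile 0 sees splats 2, 1, 0 in that order (nearest first), and splat 1 also lands in three neighboring tiles.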

Why this is so fast: NeRF does O(pixels × samples_per_ray) MLP evaluations. 3DGS does O(Gaussians × tiles_per_Gaussian) simple arithmetic operations. The radix sort runs in time linear in the number of keys (for a fixed key width) and executes entirely on the GPU. No neural network is evaluated at render time: the rasterizer only evaluates 2D Gaussians and blends them. This is why 3DGS achieves 100+ fps.

Backward pass

For training, we need gradients of the rendered image with respect to all Gaussian parameters. The rasterizer traverses the per-tile Gaussian lists back-to-front and recovers each intermediate transmittance T_i from the value accumulated at the end of the forward pass, dividing out one (1 − α_i) factor per Gaussian. This avoids storing per-pixel lists of arbitrary length: only the final accumulated value is kept per pixel, rather than one value per blended Gaussian.
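A minimal sketch of that recovery for one pixel, assuming each intermediate transmittance is obtained by dividing out one (1 − α) factor at a time during the back-to-front walk, and that every α is strictly below 1 so the division is safe:

```python
def forward_transmittance(alphas):
    """Front-to-back pass: return each T_i and the final transmittance."""
    T, per_gaussian = 1.0, []
    for a in alphas:
        per_gaussian.append(T)  # T_i: light reaching Gaussian i
        T *= 1.0 - a
    return per_gaussian, T


def recover_transmittance(alphas, T_final):
    """Back-to-front pass: rebuild every T_i from the single stored value.

    Assumes every alpha is < 1, otherwise the division is undefined.
    """
    T, recovered = T_final, []
    for a in reversed(alphas):
        T /= 1.0 - a            # undo this Gaussian's attenuation
        recovered.append(T)
    return recovered[::-1]


alphas = [0.5, 0.25, 0.8]
stored, T_final = forward_transmittance(alphas)
rebuilt = recover_transmittance(alphas, T_final)
```

Only `T_final` has to survive between the forward and backward passes; the list `stored` is kept here purely to check that the rebuilt values agree.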

Splatting Pipeline

3D Gaussians are projected to 2D ellipses and alpha-composited front-to-back. Each colored ellipse is one Gaussian's splat.

What makes the tile-based rasterizer so much faster than NeRF's ray marching?

It replaces per-ray MLP queries with a single GPU radix sort plus simple per-tile alpha blending of pre-projected 2D Gaussians, with no neural network at render time
It uses fewer Gaussians than NeRF uses samples
It renders at lower resolution

Chapter 4: Adaptive Density Control

Training starts from a sparse SfM point cloud — maybe 50,000 points for a complex scene. That is nowhere near enough to represent fine geometry. The optimization needs to grow and refine the Gaussian set during training. This is adaptive density control, and it has three operations: clone, split, and prune.

When to densify

Every 100 training iterations (after a warm-up period), the system checks each Gaussian's average view-space positional gradient. If a Gaussian has large positional gradients, it means the optimizer is struggling — the current Gaussian placement isn't capturing the geometry well. The threshold is τ_pos = 0.0002.

Clone (under-reconstruction)

When a small Gaussian has large gradients, the scene needs more coverage in that area. The system creates a copy of the Gaussian and moves it in the direction of the positional gradient. This increases both the number of Gaussians and the total volume they cover.

Split (over-reconstruction)

When a large Gaussian has large gradients, it is trying to cover too much detail with one blob. The system replaces it with two smaller Gaussians, each with scale divided by φ = 1.6, positioned by sampling from the original Gaussian's distribution. This preserves total volume but increases resolution.

Prune (cleanup)

Gaussians with opacity α below a threshold ε_α are removed — they are effectively transparent and contribute nothing. Additionally, every 3,000 iterations, all opacities are reset close to zero. The optimization then naturally restores opacity for Gaussians that are actually needed, while newly transparent ones get pruned. This prevents "floater" artifacts near the training cameras.
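The three rules can be summarized as a small decision function. The gradient threshold and split factor come from the text; the opacity threshold and the size boundary between "small" and "large" are hypothetical placeholder values chosen only to make the sketch runnable.

```python
TAU_POS = 0.0002        # positional-gradient threshold (from the text)
PHI = 1.6               # scale divisor used when splitting (from the text)
EPS_ALPHA = 0.005       # assumed opacity threshold (hypothetical value)
SIZE_THRESHOLD = 0.01   # assumed small/large boundary (hypothetical value)


def density_action(avg_grad, max_scale, opacity):
    """Decide what happens to one Gaussian at a densification step."""
    if opacity < EPS_ALPHA:
        return "prune"              # effectively transparent
    if avg_grad <= TAU_POS:
        return "keep"               # optimizer is not struggling here
    # Large gradient: too big means over-reconstruction, too small means under.
    return "split" if max_scale > SIZE_THRESHOLD else "clone"


def split_scales(scales):
    """Scales of each of the two Gaussians that replace a split one."""
    return [s / PHI for s in scales]


print(density_action(avg_grad=0.001, max_scale=0.001, opacity=0.5))  # small, high gradient
print(density_action(avg_grad=0.001, max_scale=0.5, opacity=0.5))    # large, high gradient
```

A real implementation would apply these tests to all Gaussians at once as array masks rather than one at a time.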

The elegance of adaptive density control: Unlike voxel grids or hash tables that have a fixed spatial resolution, the Gaussian count adapts to scene complexity. A plain white wall might use a few large Gaussians. A detailed bookshelf might use tens of thousands of tiny ones. The optimization discovers this allocation automatically via gradient signals.

Clone, Split, Prune

Clone duplicates small Gaussians, Split breaks up large ones, and Prune removes transparent ones.

What signal does the system use to decide which Gaussians need to be cloned or split?

The average magnitude of view-space positional gradients: large gradients mean the optimizer is struggling to represent that region, indicating under- or over-reconstruction
The rendering loss at each Gaussian's location
The number of training views that see each Gaussian

Chapter 5: Training

Training 3D Gaussian Splatting follows a straightforward render-and-compare loop, but there are several important design choices that make it work.

Initialization

Start with the sparse point cloud from Structure-from-Motion (SfM) — the same camera calibration step that NeRF uses. Each SfM point becomes one Gaussian. The initial covariance is set to an isotropic (spherical) Gaussian whose radius equals the average distance to the three nearest neighbors. On synthetic scenes (NeRF-synthetic dataset), even random initialization works.
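A brute-force sketch of the initial radius computation described above. It is quadratic in the number of points and meant only to illustrate the rule; a practical version would use a spatial index or a GPU nearest-neighbor search.

```python
import math


def initial_radii(points, k=3):
    """Mean distance from each point to its k nearest neighbors.

    points: list of (x, y, z) tuples; needs at least two points.
    """
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        radii.append(sum(nearest) / len(nearest))
    return radii


points = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3), (10, 10, 10)]
radii = initial_radii(points)
```

The first point has neighbors at distances 1, 2, and 3, so its initial radius is 2. Isolated points such as the last one start large, while points in dense clusters start small.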

Loss function

The training loss combines L1 photometric loss with a structural similarity term:

L = (1 − λ) L₁ + λ L_D-SSIM with λ = 0.2

L₁ compares pixel colors directly. D-SSIM (Structural Dissimilarity) captures perceptual differences — it penalizes blurriness and structural mismatches that L₁ alone might miss. The combination yields both sharp edges and accurate colors.
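The combination can be sketched as below, with two loud simplifications: SSIM is computed once over the whole signal instead of over local windows as real SSIM does, and D-SSIM is assumed to be 1 − SSIM. Images are flat lists of intensities in [0, 1]; all function names are hypothetical.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized signals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over the whole signal (simplification: real SSIM uses local windows)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def training_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, with D-SSIM assumed to be 1 - SSIM."""
    d_ssim = 1.0 - global_ssim(rendered, target)
    return (1.0 - lam) * l1_loss(rendered, target) + lam * d_ssim


target = [0.0, 0.5, 1.0, 0.5]
blurred = [0.25, 0.5, 0.75, 0.5]  # same mean, reduced contrast
loss = training_loss(blurred, target)
```

The blurred signal has the same mean as the target but less contrast; the structural term penalizes that loss of variance in addition to the per-sample L₁ error.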

Optimizer

Standard Adam optimizer with learning rate scheduling. Positions use an exponential decay schedule (starting at 1.6×10⁻⁴, decaying to 1.6×10⁻⁶). Other parameters use fixed learning rates: opacity at 0.05, scales at 0.005, rotation at 0.001, SH coefficients at 0.0025.
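One way to realize such a decay is log-linear interpolation between the two endpoint rates, sketched below. The interpolation shape and the 30,000-step horizon are assumptions for illustration; only the endpoint values come from the text.

```python
import math


def position_lr(step, max_steps=30_000, lr_init=1.6e-4, lr_final=1.6e-6):
    """Exponentially decay the learning rate from lr_init to lr_final."""
    t = min(max(step / max_steps, 0.0), 1.0)  # training progress in [0, 1]
    # Interpolating linearly in log space gives an exponential decay.
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))


for step in (0, 15_000, 30_000):
    print(step, f"{position_lr(step):.2e}")
```

Halfway through training the rate sits at the geometric mean of the endpoints, 1.6×10⁻⁵, so positions take large steps early and only fine adjustments late.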

Activation functions

Opacity: Sigmoid activation to constrain to [0, 1) with smooth gradients
Scale: Exponential activation to ensure positive values
Rotation: Quaternion normalization to ensure valid rotations
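These three activations map unconstrained raw parameters to valid ones, as in the sketch below. The function names and the raw parameter layout are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p):
    """Raw value whose sigmoid equals p (useful for initialization), 0 < p < 1."""
    return math.log(p / (1.0 - p))


def activate(raw_opacity, raw_scale, raw_quat):
    """Turn unconstrained optimizer variables into valid Gaussian parameters."""
    opacity = sigmoid(raw_opacity)                 # bounded opacity
    scale = [math.exp(v) for v in raw_scale]       # strictly positive axis lengths
    norm = math.sqrt(sum(c * c for c in raw_quat))
    quat = [c / norm for c in raw_quat]            # unit quaternion
    return opacity, scale, quat


opacity, scale, quat = activate(0.0, [0.0, -1.0, 1.0], [2.0, 0.0, 0.0, 0.0])
```

The optimizer updates the raw values freely, and the constraints hold after every step without any projection or clipping.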

Training schedule

Init

Create Gaussians from SfM points. Set isotropic covariance, SH color from the SfM point colors, α via inverse sigmoid

↓

Warm-up

500 iterations: optimize parameters only (no densification)

↓

Main loop

Render random training view → compute L₁ + D-SSIM loss → backprop → update parameters. Every 100 iters: densify (clone/split). Every 3000 iters: reset opacities

↓

Convergence

30,000 iterations (~51 min in the headline example; stopping early at 7,000 iterations takes ~6 min at lower quality). Final: 1–5M Gaussians

Training time comparison: 6 minutes of 3DGS training matches InstantNGP quality (PSNR ~22). 51 minutes matches or exceeds Mip-NeRF360 (PSNR ~25). Mip-NeRF360 itself takes 48 hours. That is a 56× speedup at equal quality.

Why does the loss function combine L₁ with D-SSIM instead of using L₁ alone?

L₁ alone can produce blurry results because it doesn't penalize structural differences; D-SSIM captures perceptual quality like edge sharpness and local contrast
D-SSIM trains faster
L₁ doesn't work with Gaussians

Chapter 6: Results

3D Gaussian Splatting was evaluated on three established benchmarks: Mip-NeRF360 (complex unbounded real scenes), Tanks&Temples (large-scale indoor/outdoor), and Deep Blending (indoor with challenging lighting). The results are striking.

Quality (PSNR / SSIM / LPIPS)

On Mip-NeRF360 scenes at full training time (51 min):

3DGS: PSNR 25.2, SSIM 0.811 — slightly better than Mip-NeRF360
Mip-NeRF360: PSNR 24.3, SSIM 0.792 — after 48 hours of training
InstantNGP: PSNR 22.1 — after 7 minutes
Plenoxels: PSNR 21.9 — after 26 minutes

Speed

At 1080p resolution:

3DGS: 93–135 fps (real-time)
InstantNGP: 9.2 fps (interactive but not real-time)
Plenoxels: 8.2 fps
Mip-NeRF360: 0.071 fps (14 seconds per frame)

The headline result: 3DGS is the first method to achieve real-time (≥30 fps) novel view synthesis at 1080p with quality that matches the best offline method. It renders 1,300–1,900× faster than Mip-NeRF360 while matching or exceeding its quality. Training is also 56× faster.
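The ratios quoted above follow directly from the numbers reported in this chapter, as this short check shows:

```python
# Figures quoted in this chapter.
MIP_NERF360_FPS = 0.071
GS_FPS_LOW, GS_FPS_HIGH = 93, 135
MIP_NERF360_TRAIN_MIN = 48 * 60   # 48 hours in minutes
GS_TRAIN_MIN = 51

seconds_per_frame = 1 / MIP_NERF360_FPS
render_speedup = (GS_FPS_LOW / MIP_NERF360_FPS, GS_FPS_HIGH / MIP_NERF360_FPS)
train_speedup = MIP_NERF360_TRAIN_MIN / GS_TRAIN_MIN

print(f"Mip-NeRF360: {seconds_per_frame:.1f} s per frame")
print(f"Render speedup: {render_speedup[0]:.0f}x to {render_speedup[1]:.0f}x")
print(f"Training speedup: {train_speedup:.1f}x")
```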

Ablation studies

The paper systematically removes components to measure their contribution:

Without anisotropic covariance (isotropic only): PSNR drops by ~1 dB. Thin structures and flat surfaces degrade significantly
Without SfM initialization (random init): Works on synthetic scenes; on real scenes quality degrades without the geometric prior, mainly in the background and in poorly observed regions
Without adaptive density control: Quality degrades, especially in fine-detail regions
Without D-SSIM loss: Minor quality drop; L₁ alone is decent but blurrier

Speed vs Quality

Comparing methods on the Mip-NeRF360 benchmark. 3DGS achieves the best combination of speed and quality.

How does 3DGS rendering speed compare to Mip-NeRF360 at 1080p?

3DGS renders at 93–135 fps vs Mip-NeRF360's 0.071 fps, roughly 1,300–1,900× faster, while matching or exceeding quality
About 10× faster
About the same speed but better quality

Chapter 7: Comparison with NeRF

3DGS and NeRF solve the same problem (novel view synthesis from multi-view images) but make fundamentally different design choices. Understanding these differences clarifies when to use which.

Scene representation

NeRF: Implicit. The scene is encoded in MLP weights. To query any point, run a forward pass. The representation is continuous and compact (a few MB of weights), but every render query is expensive.

3DGS: Explicit. The scene is millions of Gaussian primitives stored as parameter arrays. No network query needed. The representation is large (hundreds of MB) but every render operation is trivially cheap (evaluate a 2D Gaussian, multiply, add).

Rendering

NeRF: Ray marching. Cast a ray per pixel, sample 64–256 points per ray, query the MLP at each sample. Computational cost scales with resolution × samples per ray.

3DGS: Rasterization (splatting). Project each Gaussian to the image, sort by depth, blend in tiles. Cost scales with number of Gaussians × tiles per Gaussian. Massive parallelism on GPUs.

Editability

NeRF: Hard to edit. The scene is entangled in network weights. Moving an object means retraining.

3DGS: Directly editable. Gaussians are explicit objects with positions. You can select, move, delete, or recolor subsets of Gaussians. This enables scene editing, composition, and animation.

The shared image formation model

Despite different rendering pipelines, both methods use the same alpha compositing equation:

C = ∑_i T_i α_i c_i, where T_i = ∏_{j<i} (1 − α_j)

In NeRF, the samples come from points along a ray with density σ. In 3DGS, they come from Gaussians overlapping a pixel. Same math, different sources of (c, α).

When to use which: 3DGS wins on speed (real-time), editability, and training time. NeRF wins on compactness (model size) and may generalize better to unseen regions due to the implicit prior of the MLP. For practical applications requiring real-time rendering (VR, games, telepresence), 3DGS is the clear choice.

What is the fundamental reason 3DGS is faster than NeRF at render time?

3DGS uses rasterization (project and blend explicit primitives) which requires no neural network evaluation, while NeRF must run an MLP hundreds of times per pixel via ray marching
3DGS uses a smaller neural network
3DGS renders at lower resolution

Chapter 8: Limitations

3DGS is a breakthrough, but it has real limitations that subsequent work has been addressing.

Storage

A trained 3DGS model stores 1–5 million Gaussians, each with 59 parameters as 32-bit floats. At 236 bytes per Gaussian, that is roughly 240 MB to 1.2 GB per scene, compared to ~5 MB for a NeRF MLP. Compression techniques (quantization, codebooks, pruning) can reduce this by 10–50×, but it remains a concern for mobile and web deployment.
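The uncompressed size is simple arithmetic over the parameter layout. The sketch below assumes every parameter is stored as a 4-byte float with no padding or file-format overhead.

```python
# Parameter layout from Chapter 2: position, scale, quaternion, opacity, SH color.
PARAMS_PER_GAUSSIAN = 3 + 3 + 4 + 1 + 48
BYTES_PER_FLOAT = 4  # 32-bit floats, assuming no padding or container overhead


def model_size_mb(num_gaussians: int) -> float:
    """Uncompressed model size in megabytes (1 MB = 10**6 bytes)."""
    return num_gaussians * PARAMS_PER_GAUSSIAN * BYTES_PER_FLOAT / 1e6


for count in (1_000_000, 5_000_000):
    print(f"{count:,} Gaussians -> {model_size_mb(count):,.0f} MB")
```

The 48 SH coefficients account for about 81% of each Gaussian's footprint, which is why color is a natural first target for compression.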

Aliasing and zoom

When the camera zooms far out, many Gaussians project to sub-pixel size, causing aliasing artifacts (flickering, moire patterns). The original 3DGS has no multi-scale or mip-mapping equivalent. Mip-Splatting (2024) addresses this by adding 3D and 2D smoothing filters that prevent Gaussians from becoming smaller than a pixel.

Dynamic scenes

The original paper handles only static scenes. Extending to dynamics requires modeling Gaussian motion over time. 4D Gaussian Splatting and deformable variants have since addressed this, but it was not in the original work.

Artifacts in under-observed regions

Regions seen by few training views can develop "floater" Gaussians — semi-transparent blobs floating in mid-air. The opacity reset heuristic mitigates this but doesn't eliminate it entirely. Regularization techniques (depth supervision, normal consistency) help.

No learned priors

Because 3DGS is purely per-scene optimization (no pretrained network), it cannot hallucinate plausible content in unobserved regions. NeRF's MLP provides a weak inductive bias toward smooth, natural-looking completions. 3DGS needs all regions to be well-observed in the training images.

The storage vs. speed tradeoff: NeRF's implicit representation is compact but slow to render. 3DGS's explicit representation is fast to render but large to store. This is a fundamental tradeoff between compression (implicit/neural) and speed (explicit/primitive). Subsequent work on Gaussian compression aims to have both.

Why does 3DGS produce aliasing artifacts when the camera zooms far out?

Many Gaussians project to sub-pixel size, and unlike texture mip-mapping, the original 3DGS has no multi-scale filtering to handle this (Mip-Splatting later addressed this)
The resolution is too low
The Gaussians are too large

Chapter 9: Connections

What 3DGS built on

NeRF (Mildenhall et al., 2020): The neural radiance field that started it all. 3DGS keeps the same image formation model (alpha compositing along viewing directions) but replaces the implicit MLP with explicit Gaussians.

Mip-NeRF 360 (Barron et al., 2022): The quality benchmark that 3DGS aimed to match. Handles unbounded scenes with anti-aliased conical frustum rendering. Takes 48 hours to train.

EWA Splatting (Zwicker et al., 2001): The mathematical framework for projecting 3D Gaussians to 2D via the Jacobian of the projective transformation. 3DGS directly uses this 20-year-old technique.

Plenoxels / InstantNGP (2022): Showed that explicit or hybrid representations could dramatically speed up NeRF training. 3DGS pushes this further by using unstructured primitives instead of grid-based structures.

What 3DGS enabled

4D Gaussian Splatting (Wu et al., 2024): Extends to dynamic scenes by modeling Gaussian deformation over time, enabling real-time rendering of videos and dynamic content.

Mip-Splatting (Yu et al., 2024): Fixes the aliasing problem with 3D smoothing and 2D Mip filters that prevent Gaussians from falling below pixel scale. Alias-free 3DGS.

GaussianSLAM / SplaTAM (2024): Uses Gaussian Splatting as the scene representation for real-time SLAM (Simultaneous Localization and Mapping). The explicit, differentiable representation is a natural fit for tracking and mapping.

DUSt3R / MASt3R (2024): Dense point cloud reconstruction from uncalibrated images. Combined with Gaussian Splatting, enables reconstruction without SfM preprocessing.

Gaussian-based generation (2024+): Text-to-3D and image-to-3D methods like DreamGaussian and LGM generate 3D Gaussians directly, enabling fast 3D content creation.

The impact: 3DGS triggered an explosion of follow-up work: 500+ papers in the first year. It made real-time novel view synthesis practical for the first time, enabling applications in VR/AR, robotics, autonomous driving, and digital content creation. The explicit Gaussian representation turned out to be a universal primitive for 3D vision — from SLAM to generation to simulation.

Cheat sheet

Representation

Millions of 3D Gaussians: position μ, covariance RSS^TR^T, SH color, opacity α

Rendering

Project to 2D ellipses, tile-based sort, alpha-composite front-to-back. No neural network

Training

L₁ + D-SSIM loss, Adam, adaptive density control (clone/split/prune), 30K iterations

Performance

100+ fps at 1080p, matching Mip-NeRF360 quality, 56× faster training

Limitation

Large storage (roughly 0.2–1.2 GB uncompressed), aliasing on zoom-out, static scenes only

What fundamental technique from 2001 does 3DGS reuse for projecting 3D Gaussians to 2D?

EWA (Elliptical Weighted Average) Splatting by Zwicker et al., which projects 3D Gaussians to 2D using the Jacobian of the projective transformation, a 20-year-old technique
Ray marching from volume rendering
Convolutional neural networks