Represent scenes as millions of learnable 3D Gaussians, render them via differentiable tile-based splatting at 100+ fps with quality matching NeRF — no neural network at render time.
You have 100 photos of a room. You want to render what the room looks like from any new viewpoint — smoothly, in real time, at 1080p. This is novel view synthesis.
Neural Radiance Fields (NeRF) solved this beautifully in 2020. Train an MLP to map (x, y, z, direction) → (color, density), then render by marching rays through the scene and querying the network hundreds of times per pixel. The results are stunning. But there is a brutal cost:
The fundamental bottleneck is the implicit representation. NeRF stores the scene inside network weights. To know what color a point in space is, you must run a forward pass. Every pixel requires hundreds of such queries along its ray. No matter how clever the acceleration, you are married to per-ray, per-sample network evaluation.
Each pixel casts a ray and samples the MLP many times, so cost grows linearly with the sample count. At 1080p (about 2.07 million pixels), hundreds of millions of MLP queries are needed per frame.
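The per-frame query count is simple arithmetic. The sketch below assumes one ray per pixel and a fixed number of samples per ray; the function name is made up for illustration, and real NeRF variants often use hierarchical or adaptive sampling that changes the totals.

```python
def mlp_queries_per_frame(width: int, height: int, samples_per_ray: int) -> int:
    """One ray per pixel, one MLP forward pass per sample on that ray."""
    return width * height * samples_per_ray


for samples in (64, 128, 256):
    queries = mlp_queries_per_frame(1920, 1080, samples)
    print(f"{samples:>3} samples/ray -> {queries:,} MLP queries per frame")
```

At 30 fps, each of these totals has to be multiplied by another factor of 30 per second.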
What if the scene representation were explicit instead of implicit? Instead of querying a neural network to ask "what is at position x?", what if we stored millions of little geometric primitives that directly are the scene?
Point clouds are explicit, but raw points have problems: holes between points, no notion of extent or shape, no smooth gradients for optimization. Voxel grids are explicit too, but they waste enormous memory on empty space and are locked to a fixed resolution.
The insight of 3D Gaussian Splatting: represent the scene as a collection of 3D Gaussians — soft, fuzzy ellipsoids, each with a position, shape (covariance), color (via spherical harmonics), and opacity. These primitives are:
Left: NeRF queries an MLP per sample along each ray. Right: 3DGS projects Gaussians to the image and blends.
Each Gaussian in the scene is defined by four groups of learnable parameters. Let's walk through each one.
A 3D vector (x, y, z) specifying the center of the Gaussian in world space. Initialized from the sparse SfM point cloud.
A 3×3 positive semi-definite matrix that defines the Gaussian's shape and orientation — think of it as an ellipsoid. But optimizing a raw 3×3 matrix is dangerous: gradient descent can easily produce a matrix that isn't positive semi-definite (and thus not a valid covariance).
The solution: decompose Σ into a rotation R (stored as a unit quaternion q, 4 parameters) and a scaling matrix S = diag(s) built from a 3D vector s holding the three axis lengths:
Σ = R S Sᵀ Rᵀ
This is always positive semi-definite by construction. We optimize s and q independently, and reconstruct Σ when needed. Anisotropic (non-spherical) Gaussians can represent thin structures, flat surfaces, and complex geometry compactly — a flat wall might need just one highly elongated Gaussian instead of hundreds of tiny spheres.
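A minimal sketch of rebuilding the covariance from a scale vector and a quaternion, in plain Python. The function names and the (w, x, y, z) quaternion ordering are assumptions made for illustration. The quaternion is normalized inside the function, so the optimizer can update its four components freely.

```python
import math


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The quaternion is normalized first, so any nonzero 4-vector is accepted.
    """
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance_from_scale_rotation(s, q):
    """Build Sigma = R S S^T R^T, written as M M^T with M = R S."""
    R = quat_to_rotation(q)
    # S is diagonal, so M = R S scales column k of R by s[k].
    M = [[R[i][k] * s[k] for k in range(3)] for i in range(3)]
    # M M^T is symmetric positive semi-definite for any real M.
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]
```

With the identity quaternion `(1, 0, 0, 0)` and `s = (1, 2, 3)`, the result is the diagonal matrix with entries 1, 4, 9. Rotating by 90 degrees about z swaps the first two diagonal entries.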
Real-world appearance is view-dependent: a shiny surface looks different from different angles. Rather than storing a single RGB color, each Gaussian stores spherical harmonic coefficients — a compact basis for functions defined on the sphere of viewing directions.
With SH degree ℓ, we store (ℓ+1)² coefficients per color channel. The paper uses up to degree 3, which gives 16 coefficients × 3 channels = 48 parameters for color. For a given view direction d, the color is:
c(d) = ∑_ℓ ∑_m c_ℓm · Y_ℓm(d), with ℓ from 0 to 3 and m from −ℓ to ℓ
where Y_ℓm are the spherical harmonic basis functions and c_ℓm are the learned RGB coefficients. Degree 0 is diffuse (constant color). Higher degrees capture specular highlights and view-dependent effects.
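To make the coefficient count and the view dependence concrete, here is a small sketch that evaluates a color truncated to degree 1. The sign and ordering convention of the degree-1 terms is an assumption (one common real-SH convention), and the function names are illustrative only.

```python
import math

SH_C0 = 1.0 / (2.0 * math.sqrt(math.pi))     # constant degree-0 basis value
SH_C1 = math.sqrt(3.0 / (4.0 * math.pi))     # magnitude of the degree-1 basis terms


def num_sh_coeffs(degree: int) -> int:
    """Number of SH coefficients per color channel up to the given degree."""
    return (degree + 1) ** 2


def eval_sh_deg1(coeffs, d):
    """Evaluate an RGB color from 4 coefficient triples and a unit view direction d.

    coeffs[0] is the degree-0 (diffuse) term; coeffs[1..3] are the degree-1 terms.
    """
    x, y, z = d
    basis = [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]  # assumed convention
    return tuple(sum(b * c[ch] for b, c in zip(basis, coeffs)) for ch in range(3))
```

`num_sh_coeffs(3)` gives 16, matching the 48 color parameters per Gaussian. If only `coeffs[0]` is nonzero, the returned color is the same for every direction, which is the diffuse case.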
A single scalar in [0, 1] controlling how opaque the Gaussian is. The stored parameter is an unconstrained real number that is passed through a sigmoid, which keeps α in range and gives smooth gradients. Gaussians that become nearly transparent (α near 0) can be pruned during training.
A single Gaussian with its four parameter groups. Changing the scale and rotation changes the shape and orientation of the ellipsoid.
We have millions of 3D Gaussians floating in world space. Now we need to render them into an image. This is where the "splatting" happens — and it needs to be both fast and differentiable.
Each 3D Gaussian is an ellipsoid in world space. To render it, we project it onto the image plane, producing a 2D ellipse (a "splat"). The 3D covariance Σ transforms under the camera's viewing transformation W and the Jacobian J of the affine approximation of the projective transformation:
Σ′ = J W Σ Wᵀ Jᵀ
Drop the third row and column of Σ′ to get a 2×2 covariance matrix for the projected 2D Gaussian. This is the same math used in the classical EWA splatting framework (Zwicker et al., 2001).
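The projection can be sketched in a few lines of plain Python. This assumes a pinhole camera with focal lengths `fx` and `fy` in pixels and a Gaussian center already expressed in camera coordinates; all names are illustrative. Only the first two rows of J are built, which produces the 2×2 result directly.

```python
def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def project_covariance(cov3d, W, t, fx, fy):
    """Return the 2x2 image-space covariance of one Gaussian.

    cov3d: 3x3 world-space covariance.
    W:     3x3 rotation part of the world-to-camera transform.
    t:     Gaussian center in camera coordinates (x, y, z), with z > 0.
    """
    x, y, z = t
    # Jacobian of (u, v) = (fx * x / z, fy * y / z) with respect to (x, y, z).
    J = [[fx / z, 0.0, -fx * x / (z * z)],
         [0.0, fy / z, -fy * y / (z * z)]]
    T = matmul(J, W)                                   # 2x3
    return matmul(matmul(T, cov3d), transpose(T))      # 2x2
```

For a unit-covariance Gaussian on the optical axis at depth 2 with `fx = fy = 100`, the result is a circular splat with variance 2500 square pixels, that is, a standard deviation of 50 pixels.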
For each pixel, we blend all the Gaussians that overlap it, sorted by depth, front to back:
C = ∑_i c_i · α_i · T_i, where T_i = ∏_{j<i} (1 − α_j)
Here c_i is the color of Gaussian i, and α_i for each pixel is the Gaussian's learned opacity times the 2D Gaussian evaluated at that pixel's location. T_i is the transmittance: the fraction of light from Gaussian i that actually reaches the camera after being partially blocked by Gaussians 1 through i−1.
This is exactly the same image formation model as NeRF's volume rendering equation, just with explicit Gaussians instead of neural density samples along a ray.
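Front-to-back compositing for a single pixel can be sketched as a short loop. The inputs are assumed to be already sorted near to far, and the early-exit threshold `min_transmittance` is a hypothetical value added for illustration.

```python
def composite_front_to_back(colors, alphas, min_transmittance=1e-4):
    """Blend depth-sorted splats for one pixel: C = sum_i c_i * alpha_i * T_i.

    Returns the blended RGB color and the remaining transmittance.
    """
    pixel = [0.0, 0.0, 0.0]
    T = 1.0  # nothing blocks the first splat
    for c, a in zip(colors, alphas):
        weight = a * T
        for ch in range(3):
            pixel[ch] += weight * c[ch]
        T *= 1.0 - a  # light remaining for splats further back
        if T < min_transmittance:
            break  # pixel is effectively opaque; later splats are invisible
    return tuple(pixel), T
```

A red splat with α = 0.5 in front of a blue splat with α = 0.5 yields the color (0.5, 0, 0.25) and leaves a transmittance of 0.25 for the background.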
The paper's key engineering contribution is a custom CUDA rasterizer that makes this fast:
For training, we need gradients of the rendered image with respect to all Gaussian parameters. The backward pass traverses the per-tile Gaussian lists back to front. Rather than storing every intermediate transmittance, it keeps only the final accumulated value per pixel and recovers each earlier T_i by repeatedly dividing by (1 − α_i). This avoids storing per-pixel lists of arbitrary length: only one float (the total accumulated opacity, or equivalently the final transmittance) is stored per pixel.
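The recovery trick can be illustrated for one pixel. This is a sketch that assumes every α is strictly below 1 so the division is safe; the function names are illustrative and do not come from the paper's CUDA code.

```python
def forward_final_transmittance(alphas):
    """Forward pass: return only the final transmittance, discarding the T_i."""
    T = 1.0
    for a in alphas:
        T *= 1.0 - a
    return T


def recover_transmittances(alphas, T_final):
    """Backward pass: walk back to front and rebuild each T_i.

    T_i is the transmittance in front of splat i, so dividing the running
    value by (1 - alpha_i) undoes splat i's attenuation.
    """
    T = T_final
    recovered = [0.0] * len(alphas)
    for i in reversed(range(len(alphas))):
        T = T / (1.0 - alphas[i])
        recovered[i] = T
    return recovered
```

For `alphas = [0.5, 0.25, 0.2]` the forward pass stores 0.3, and the backward pass recovers the transmittances 1.0, 0.5 and 0.375 without a per-pixel list having been saved.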
3D Gaussians are projected to 2D ellipses and alpha-composited front to back. Each colored ellipse is one Gaussian's splat.
Training starts from a sparse SfM point cloud — maybe 50,000 points for a complex scene. That is nowhere near enough to represent fine geometry. The optimization needs to grow and refine the Gaussian set during training. This is adaptive density control, and it has three operations: clone, split, and prune.
Every 100 training iterations (after a warm-up period), the system checks each Gaussian's average view-space positional gradient. A large positional gradient means the optimizer is struggling: the current Gaussian placement isn't capturing the geometry well. The threshold is τ_pos = 0.0002.
When a small Gaussian has large gradients, the scene needs more coverage in that area. The system creates a copy of the Gaussian and moves it in the direction of the positional gradient. This increases both the number of Gaussians and the total volume they cover.
When a large Gaussian has large gradients, it is trying to cover too much detail with one blob. The system replaces it with two smaller Gaussians, each with scale divided by φ = 1.6, positioned by sampling from the original Gaussian's distribution. This preserves total volume but increases resolution.
Gaussians with opacity α below a threshold ε_α are removed, since they are effectively transparent and contribute nothing. Additionally, every 3,000 iterations, all opacities are reset close to zero. The optimization then restores opacity for Gaussians that are actually needed, while those that stay transparent get pruned. This helps remove "floater" artifacts near the training cameras.
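The decision logic for the three operations can be sketched as a small classifier. The gradient threshold and split factor come from the text; `SIZE_THRESHOLD` and `MIN_OPACITY` are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
GRAD_THRESHOLD = 0.0002   # tau_pos from the text
SPLIT_FACTOR = 1.6        # phi from the text
SIZE_THRESHOLD = 0.01     # hypothetical boundary between "small" and "large"
MIN_OPACITY = 0.005       # hypothetical value for epsilon_alpha


def densify_action(avg_grad, max_scale, opacity):
    """Decide what happens to one Gaussian at a densification step."""
    if opacity < MIN_OPACITY:
        return "prune"
    if avg_grad <= GRAD_THRESHOLD:
        return "keep"
    # High gradient: small Gaussians are cloned, large ones are split.
    return "split" if max_scale > SIZE_THRESHOLD else "clone"


def split_scales(scale):
    """Scale of each of the two children produced by a split."""
    return tuple(s / SPLIT_FACTOR for s in scale)
```

A full implementation would also sample the two child positions from the parent Gaussian and shift a clone along its positional gradient; both steps are omitted here.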
The three density control operations transform Gaussians in different ways: Clone duplicates small Gaussians, Split breaks large ones, and Prune removes transparent ones.
Training 3D Gaussian Splatting follows a straightforward render-and-compare loop, but there are several important design choices that make it work.
Start with the sparse point cloud from Structure-from-Motion (SfM) — the same camera calibration step that NeRF uses. Each SfM point becomes one Gaussian. The initial covariance is set to an isotropic (spherical) Gaussian whose radius equals the average distance to the three nearest neighbors. On synthetic scenes (NeRF-synthetic dataset), even random initialization works.
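The initial radius rule is easy to sketch with a brute-force neighbor search. This is an illustration only: it is quadratic in the number of points, whereas a real pipeline would use a spatial index or a GPU nearest-neighbor routine. The function name is made up.

```python
import math


def initial_scales(points):
    """For each point, return the mean distance to its three nearest neighbors.

    points: list of (x, y, z) tuples with at least two entries.
    """
    scales = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:3]  # fewer than three if the cloud is tiny
        scales.append(sum(nearest) / len(nearest))
    return scales
```

Dense regions of the point cloud therefore start with small Gaussians and sparse regions with large ones, so the initial set covers the scene without large gaps.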
The training loss combines L1 photometric loss with a structural similarity term:
L = (1 − λ) · L1 + λ · L_D-SSIM, with λ = 0.2
L1 compares pixel colors directly. D-SSIM (Structural Dissimilarity) captures perceptual differences — it penalizes blurriness and structural mismatches that L1 alone might miss. The combination yields both sharp edges and accurate colors.
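A simplified sketch of the combined loss on a flat list of grayscale pixel values in [0, 1]. Two assumptions are made here: SSIM is computed over a single window covering the whole patch rather than with a sliding Gaussian window, and D-SSIM is taken to be 1 − SSIM. The weight `lam = 0.2` is the value reported in the paper.

```python
def mean(values):
    return sum(values) / len(values)


def l1_loss(pred, target):
    return mean([abs(p - t) for p, t in zip(pred, target)])


def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over one window spanning the whole patch (values in [0, 1])."""
    mx, my = mean(x), mean(y)
    vx = mean([(a - mx) ** 2 for a in x])
    vy = mean([(b - my) ** 2 for b in y])
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def training_loss(pred, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, with D-SSIM assumed to be 1 - SSIM."""
    d_ssim = 1.0 - global_ssim(pred, target)
    return (1.0 - lam) * l1_loss(pred, target) + lam * d_ssim
```

The loss is zero when the rendered patch equals the target and grows with both per-pixel error and structural mismatch.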
Standard Adam optimizer with learning rate scheduling. Positions use an exponential decay schedule (starting at 1.6×10⁻⁴, decaying to 1.6×10⁻⁶). Other parameters use fixed learning rates: opacity at 0.05, scales at 0.005, rotation at 0.001, SH coefficients at 0.0025.
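An exponential decay between two learning rates is a straight line in log space. The sketch below uses the start and end values from the text; the total step count `max_steps=30_000` is an assumption, and any warm-up or delay term a real implementation might add is left out.

```python
import math


def position_lr(step, lr_init=1.6e-4, lr_final=1.6e-6, max_steps=30_000):
    """Log-linear interpolation from lr_init (step 0) to lr_final (max_steps)."""
    t = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))


for step in (0, 15_000, 30_000):
    print(step, position_lr(step))
```

Halfway through, the rate is the geometric mean of the two endpoints (1.6×10⁻⁵), not the arithmetic mean, which is the defining property of exponential decay.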
3D Gaussian Splatting was evaluated on three established benchmarks: Mip-NeRF360 (complex unbounded real scenes), Tanks&Temples (large-scale indoor/outdoor), and Deep Blending (indoor with challenging lighting). The results are striking.
On Mip-NeRF360 scenes at full training time (51 min):
At 1080p resolution:
The paper systematically removes components to measure their contribution:
Comparing methods on the Mip-NeRF360 benchmark. 3DGS (teal) achieves the best combination of speed and quality.
3DGS and NeRF solve the same problem (novel view synthesis from multi-view images) but make fundamentally different design choices. Understanding these differences clarifies when to use which.
NeRF: Implicit. The scene is encoded in MLP weights. To query any point, run a forward pass. The representation is continuous and compact (a few MB of weights), but every render query is expensive.
3DGS: Explicit. The scene is millions of Gaussian primitives stored as parameter arrays. No network query needed. The representation is large (hundreds of MB) but every render operation is trivially cheap (evaluate a 2D Gaussian, multiply, add).
NeRF: Ray marching. Cast a ray per pixel, sample 64–256 points per ray, query the MLP at each sample. Computational cost scales with resolution × samples per ray.
3DGS: Rasterization (splatting). Project each Gaussian to the image, sort by depth, blend in tiles. Cost scales with number of Gaussians × tiles per Gaussian. Massive parallelism on GPUs.
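The "sort by depth, blend in tiles" step can be sketched as building one (tile, depth) key per Gaussian-tile overlap and sorting all keys at once. The 16-pixel tile size and the circular bounding radius per splat are assumptions made for illustration, as are the function names; a GPU implementation would use a parallel radix sort in place of Python's `sort`.

```python
TILE = 16  # assumed tile edge length in pixels


def tiles_touched(cx, cy, radius, width, height):
    """Return the ids of all tiles overlapped by a splat's bounding square."""
    tiles_x = (width + TILE - 1) // TILE
    tiles_y = (height + TILE - 1) // TILE
    x0 = max(0, int((cx - radius) // TILE))
    x1 = min(tiles_x - 1, int((cx + radius) // TILE))
    y0 = max(0, int((cy - radius) // TILE))
    y1 = min(tiles_y - 1, int((cy + radius) // TILE))
    return [ty * tiles_x + tx
            for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]


def build_tile_lists(splats, width, height):
    """splats: list of (cx, cy, radius, depth) tuples.

    Returns {tile_id: [splat indices ordered near to far]}.
    """
    keys = []
    for idx, (cx, cy, r, depth) in enumerate(splats):
        for tile_id in tiles_touched(cx, cy, r, width, height):
            keys.append((tile_id, depth, idx))
    keys.sort()  # a single sort groups by tile, then orders by depth
    lists = {}
    for tile_id, _, idx in keys:
        lists.setdefault(tile_id, []).append(idx)
    return lists
```

The number of keys equals the sum over Gaussians of the tiles each one touches, which is the "number of Gaussians × tiles per Gaussian" cost mentioned above.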
NeRF: Hard to edit. The scene is entangled in network weights. Moving an object means retraining.
3DGS: Directly editable. Gaussians are explicit objects with positions. You can select, move, delete, or recolor subsets of Gaussians. This enables scene editing, composition, and animation.
Despite different rendering pipelines, both methods use the same alpha compositing equation:
C = ∑_i c_i · α_i · ∏_{j<i} (1 − α_j)
In NeRF, the samples come from points along a ray, where density σ_i and step length δ_i give α_i = 1 − exp(−σ_i δ_i). In 3DGS, they come from Gaussians overlapping a pixel. Same math, different sources of (c, α).
3DGS is a breakthrough, but it has real limitations that subsequent work has been addressing.
A trained 3DGS model stores 1–5 million Gaussians, each with 59 parameters (3 position, 3 scale, 4 rotation, 1 opacity, 48 SH) as 32-bit floats, or 236 bytes per Gaussian. That is roughly 240 MB to 1.2 GB per scene, compared to ~5 MB for a NeRF MLP. Compression techniques (quantization, codebooks, pruning) can reduce this by 10–50×, but it remains a concern for mobile and web deployment.
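The storage estimate follows directly from the parameter counts given earlier. A minimal sketch, assuming uncompressed 32-bit floats and no file-format overhead; the dictionary and function names are illustrative.

```python
PARAMS_PER_GAUSSIAN = {
    "position": 3,
    "scale": 3,
    "rotation": 4,    # quaternion
    "opacity": 1,
    "sh_color": 48,   # 16 coefficients x 3 channels at SH degree 3
}
BYTES_PER_FLOAT = 4


def scene_size_mb(num_gaussians: int) -> float:
    """Uncompressed size in megabytes (1 MB = 10**6 bytes)."""
    per_gaussian = sum(PARAMS_PER_GAUSSIAN.values()) * BYTES_PER_FLOAT
    return num_gaussians * per_gaussian / 1e6


for n in (1_000_000, 5_000_000):
    print(f"{n:,} Gaussians -> {scene_size_mb(n):.0f} MB")
```

The SH color terms account for 48 of the 59 floats, which is why lowering the SH degree or quantizing the color coefficients is an obvious target for compression.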
When the camera zooms far out, many Gaussians project to sub-pixel size, causing aliasing artifacts (flickering, moiré patterns). The original 3DGS has no multi-scale or mip-mapping equivalent. Mip-Splatting (2024) addresses this by adding 3D and 2D smoothing filters that prevent Gaussians from becoming smaller than a pixel.
The original paper handles only static scenes. Extending to dynamics requires modeling Gaussian motion over time. 4D Gaussian Splatting and deformable variants have since addressed this, but it was not in the original work.
Regions seen by few training views can develop "floater" Gaussians — semi-transparent blobs floating in mid-air. The opacity reset heuristic mitigates this but doesn't eliminate it entirely. Regularization techniques (depth supervision, normal consistency) help.
Because 3DGS is purely per-scene optimization (no pretrained network), it cannot hallucinate plausible content in unobserved regions. NeRF's MLP provides a weak inductive bias toward smooth, natural-looking completions. 3DGS needs all regions to be well-observed in the training images.
NeRF (Mildenhall et al., 2020): The neural radiance field that started it all. 3DGS keeps the same image formation model (alpha compositing along viewing directions) but replaces the implicit MLP with explicit Gaussians.
Mip-NeRF 360 (Barron et al., 2022): The quality benchmark that 3DGS aimed to match. Handles unbounded scenes with anti-aliased conical frustum rendering. Takes 48 hours to train.
EWA Splatting (Zwicker et al., 2001): The mathematical framework for projecting 3D Gaussians to 2D via the Jacobian of the projective transformation. 3DGS directly uses this 20-year-old technique.
Plenoxels / InstantNGP (2022): Showed that explicit or hybrid representations could dramatically speed up NeRF training. 3DGS pushes this further by using unstructured primitives instead of grid-based structures.
4D Gaussian Splatting (Wu et al., 2024): Extends to dynamic scenes by modeling Gaussian deformation over time, enabling real-time rendering of videos and dynamic content.
Mip-Splatting (Yu et al., 2024): Fixes the aliasing problem with 3D smoothing and 2D Mip filters that prevent Gaussians from falling below pixel scale. Alias-free 3DGS.
GaussianSLAM / SplaTAM (2024): Uses Gaussian Splatting as the scene representation for real-time SLAM (Simultaneous Localization and Mapping). The explicit, differentiable representation is a natural fit for tracking and mapping.
DUSt3R / MASt3R (2024): Dense point cloud reconstruction from uncalibrated images. Combined with Gaussian Splatting, enables reconstruction without SfM preprocessing.
Gaussian-based generation (2024+): Text-to-3D and image-to-3D methods like DreamGaussian and LGM generate 3D Gaussians directly, enabling fast 3D content creation.