One photo in. 1.2 million 3D Gaussians out. Rendered at 100+ FPS. No per-scene optimization, no diffusion sampling — pure feedforward regression from a frozen vision backbone.
You take a photo of your living room. Just one shot from your phone. Now you want to walk through that scene in VR, or peek around the corner of the couch you couldn't quite capture. Can a computer reconstruct the full 3D scene from that single image?
This is the problem of monocular view synthesis: given one photo, generate photorealistic images from nearby viewpoints. It sounds like it should be impossible — a flat image throws away depth information entirely. But humans do it effortlessly. When you look at a photo of a room, you immediately have a sense of which objects are close, which are far, and what's probably hiding just outside the frame.
The computational approaches before SHARP fell into a brutal quality-speed tradeoff:
SHARP lands in a completely different spot: 25–34% better perceptual quality than diffusion methods, 1000× faster synthesis, and 100+ FPS rendering, all from a single feedforward pass taking under a second.
X-axis: synthesis time (log scale, seconds). Y-axis: perceptual quality (lower LPIPS = better). SHARP achieves the best quality and is among the fastest.
Every diffusion-based view synthesis method frames the problem as: "generate a plausible new image that looks like it came from a nearby camera." This is a generation problem — the model hallucinates content. It's slow because diffusion is iterative (many denoising steps), and it's stochastic because the model can produce different outputs each time.
SHARP reframes the problem entirely: "predict the 3D Gaussian representation of the scene directly." This is a regression problem — the model estimates something that already exists in the world. One forward pass. Deterministic. Fast.
The full pipeline is five stages, each with a specific job:
Every expert system is only as good as its perceptual foundation. SHARP's foundation is Depth Pro, Apple's state-of-the-art monocular depth estimator, already pretrained to extract rich spatial features from images. Rather than training a vision backbone from scratch, SHARP inherits Depth Pro's understanding of scene geometry.
Depth Pro uses a dual-ViT architecture: one Vision Transformer processes a low-resolution global view of the entire image (to understand overall scene structure), and another processes high-resolution patches (to capture fine surface details). The two streams are fused into a feature pyramid — four feature maps at different spatial resolutions, each carrying different levels of geometric abstraction.
Here's a critical engineering decision: SHARP fine-tunes Depth Pro for the view synthesis task, but it does not fine-tune everything equally.
| Component | Treatment | Why |
|---|---|---|
| High-res patch ViT | Frozen | Pretrained features are already excellent; fine-tuning could destroy them |
| Low-res global ViT | Fine-tuned | Needs to adapt global understanding to view synthesis cues |
| Depth decoder | Fine-tuned | Needs to output depth optimized for view synthesis, not for depth-estimation accuracy alone |
This is a classic partial fine-tuning strategy. The patch encoder has learned universal visual features (edges, textures, materials) that transfer directly. The global encoder and decoder need to learn what makes a good initialization for 3D Gaussians — which is subtly different from what makes a good depth estimate.
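The freeze plan in the table can be sketched as a small bookkeeping routine. This is a minimal illustration, not SHARP's training code: the `Component` class, the `apply_freeze_plan` helper, and the component names are all hypothetical stand-ins for the real sub-networks.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Stand-in for a sub-network; only tracks whether it receives gradient updates."""
    name: str
    trainable: bool = True


def apply_freeze_plan(components, frozen_names):
    """Mark each component as frozen or trainable and return the trainable ones."""
    for component in components:
        component.trainable = component.name not in frozen_names
    return [c for c in components if c.trainable]


# Hypothetical component names mirroring the table above.
backbone = [Component("patch_vit"), Component("global_vit"), Component("depth_decoder")]
trainable = apply_freeze_plan(backbone, frozen_names={"patch_vit"})
trainable_names = [c.name for c in trainable]
print(trainable_names)  # ['global_vit', 'depth_decoder']
```

In a framework such as PyTorch, the same plan would typically be expressed by setting `requires_grad` to `False` on the frozen module's parameters and handing only the remaining parameters to the optimizer.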
The four feature maps output by Depth Pro at different resolutions. Each level captures different geometric information.
If you take a photo of a table with a mug on it, and then move your camera slightly to the left, you'll suddenly see a bit of table that was previously hidden behind the mug. This is called disocclusion: surfaces that were occluded from the original viewpoint become visible in the new viewpoint.
A single depth map can't handle this. It knows where the mug is (foreground) and where the visible table is, but it has no information about the table surface hidden underneath. When you render from a new angle, there's a hole where the mug used to be. This creates the ghosting artifacts common in naive view synthesis.
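The hole can be reproduced on a single scanline. The sketch below is a toy, assuming a simple forward warp in which each pixel shifts by a disparity of `baseline_px / depth` and the nearest surface wins; the function name and all values are illustrative, not taken from SHARP.

```python
def forward_warp_scanline(colors, depths, baseline_px):
    """Shift each pixel by its disparity; the nearest surface wins on collisions."""
    width = len(colors)
    warped = [None] * width                 # None marks a hole (disoccluded pixel)
    warped_depth = [float("inf")] * width
    for x, (color, depth) in enumerate(zip(colors, depths)):
        target = x + round(baseline_px / depth)   # near pixels move further
        if 0 <= target < width and depth < warped_depth[target]:
            warped[target] = color
            warped_depth[target] = depth
    return warped


# 'T' = table (far, depth 8), 'M' = mug (near, depth 1).
colors = list("TTTMMTTT")
depths = [8.0, 8.0, 8.0, 1.0, 1.0, 8.0, 8.0, 8.0]
warped = forward_warp_scanline(colors, depths, baseline_px=2.0)
print(warped)  # ['T', 'T', 'T', None, None, 'M', 'M', 'T']
```

The mug moves two pixels while the table barely moves, so the two pixels the mug used to cover have no source color. A second depth layer exists to supply content for exactly those positions.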
There's a complication: during training on synthetic data, we have ground-truth depth. But during inference on real images, we don't. SHARP needs to handle both cases gracefully.
The depth adjustment module is a small U-Net (2M parameters) that takes the predicted depth plus — when available — ground-truth depth, and outputs a scale map S that refines the depth prediction. The inspiration is a Conditional VAE: the scale map resolves depth ambiguity when additional information is available.
Where D̂ is the raw depth prediction and S is the scale map. During inference on real images (no GT depth), S = 1 everywhere — the module is bypassed entirely. During training, S learns to correct depth errors when it can peek at the ground truth.
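A minimal sketch of how the scale map might be applied, assuming the adjustment is an element-wise multiplication of the raw prediction by S (which is what "S = 1 means bypassed" suggests). The U-Net itself is not modeled here; `adjust_depth` and the scale values are hypothetical.

```python
def adjust_depth(pred_depth, scale_map=None):
    """Scale predicted depth per pixel; scale_map=None means S = 1 everywhere (bypass)."""
    if scale_map is None:
        return [row[:] for row in pred_depth]   # inference path: depth passes through unchanged
    return [[d * s for d, s in zip(d_row, s_row)]
            for d_row, s_row in zip(pred_depth, scale_map)]


pred = [[2.0, 4.0], [2.0, 4.0]]
train_scale = [[1.5, 0.5], [1.0, 1.0]]   # made-up values standing in for the U-Net output

adjusted_train = adjust_depth(pred, train_scale)   # [[3.0, 2.0], [2.0, 4.0]]
adjusted_infer = adjust_depth(pred)                # identical to pred
```

Keeping the bypass as an explicit branch makes the training and inference paths share everything downstream of the depth map.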
A simple scene: foreground objects (warm) occlude background surfaces (teal). Moving the viewpoint reveals the regions that Layer 2 fills in.
Here's where the geometry comes in. Once we have a depth map, we can mathematically reconstruct where every pixel lives in 3D space. This process is called unprojection: taking a 2D depth value and a pixel location, and computing the 3D point in camera space.
Starting from the predicted depth map D':
Step 1: Subsample. The depth map is at 1536×1536, but we subsample by 2× to 768×768. Why? 768×768 ≈ 590K Gaussians per depth layer, so the two layers together account for the ~1.2M total. That's already a dense representation. Going to full resolution would quadruple the count and be computationally prohibitive.
Step 2: Unproject to 3D. For each pixel (i, j) with depth value D'(i,j), compute the 3D position:
This places each Gaussian at the correct 3D location. The (i,j) pixel coordinates, scaled by depth, give the x,y position in camera space. D' itself is the z depth.
Step 3: Set scale proportional to depth. Far-away Gaussians should be larger (they cover more 3D space per pixel):
Where s0 is a small base scale factor. This prevents holes in the rendered output — near objects have fine-grained small Gaussians, distant objects have coarser large ones.
Step 4: Color from image. Each Gaussian's initial color is just the corresponding pixel color from the subsampled image I'(i,j).
Step 5: Default rotation and opacity. Rotation = identity quaternion [1,0,0,0] (axis-aligned Gaussian). Opacity = 0.5 (neutral starting point).
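The five steps above can be sketched on a toy depth map. Several things here are assumptions rather than details from the paper: a pinhole camera with focal length `focal_px` and the principal point at the image center, strided subsampling, an isotropic scale, and the value of `s0`. The function name is hypothetical.

```python
def init_base_gaussians(depth, image, focal_px, s0=0.001, stride=2):
    """Build base Gaussian attributes from a depth map and an RGB image (toy sketch)."""
    height, width = len(depth), len(depth[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0   # assumed principal point: image center
    gaussians = []
    for i in range(0, height, stride):               # Step 1: keep every stride-th row
        for j in range(0, width, stride):            # ... and every stride-th column
            z = depth[i][j]
            x = (j - cx) * z / focal_px              # Step 2: pinhole unprojection (assumed model)
            y = (i - cy) * z / focal_px
            gaussians.append({
                "position": (x, y, z),
                "scale": (s0 * z,) * 3,              # Step 3: scale grows linearly with depth
                "color": image[i][j],                # Step 4: color copied from the pixel
                "rotation": (1.0, 0.0, 0.0, 0.0),    # Step 5: identity quaternion
                "opacity": 0.5,                      # Step 5: neutral opacity
            })
    return gaussians


# Toy 4x4 inputs: constant depth of 2.0 and a constant RGB color.
depth = [[2.0] * 4 for _ in range(4)]
image = [[(0.1, 0.2, 0.3)] * 4 for _ in range(4)]
gaussians = init_base_gaussians(depth, image, focal_px=4.0)
print(len(gaussians), gaussians[0]["position"])  # 4 (-0.75, -0.75, 2.0)
```

Each Gaussian carries 3 + 3 + 3 + 4 + 1 = 14 numbers, and at the real grid size one layer holds 768 × 768 = 589,824 of them.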
The decoder is a U-Net-style network operating on the 768×768 grid. It takes the base Gaussian attributes plus the feature pyramid from Depth Pro, and outputs attribute deltas ΔG for all 14 attributes.
The composer applies these deltas through careful activation functions to prevent extreme values:
G = γ(γ⁻¹(G0) + η · ΔG)
Where γ is the attribute-specific activation (e.g., sigmoid for opacity, exp for scale), γ⁻¹ is its inverse, η is a small step size, and G0 is the base attribute. Because the delta is added in pre-activation space, ΔG = 0 returns G0 exactly, and the output always stays within the valid range of γ. This composer pattern ensures the base initialization is respected: the decoder adjusts rather than replaces.
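The composer can be sketched in a few lines, assuming it has the form G = γ(γ⁻¹(G0) + η·ΔG) that the definitions above suggest. The `ACTIVATIONS` table, the `compose` helper, and the step size `eta=0.1` are illustrative choices, not values from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


# Attribute name -> (activation, inverse activation).
ACTIVATIONS = {
    "opacity": (sigmoid, logit),      # keeps opacity inside (0, 1)
    "scale": (math.exp, math.log),    # keeps scale strictly positive
}


def compose(attribute, base_value, delta, eta=0.1):
    """Apply a decoder delta in pre-activation space, then map back to the valid range."""
    activation, inverse = ACTIVATIONS[attribute]
    return activation(inverse(base_value) + eta * delta)


unchanged = compose("opacity", 0.5, 0.0)     # zero delta returns the base value, 0.5
pushed_up = compose("opacity", 0.5, 50.0)    # large delta still stays below 1
shrunk = compose("scale", 0.002, -50.0)      # large negative delta still stays above 0
```

Working in pre-activation space means no clamping is needed: any finite delta maps to a legal attribute value, and a small η keeps the result close to the base initialization.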
A toy scene with objects at different depths, showing how Gaussians are placed in 3D for both the base initialization and the refined result.
SHARP faces a classic problem in 3D vision: multi-view training data is expensive to collect. If you want to train a model to synthesize novel views, you need images of the same scene from multiple cameras — paired ground truth. This exists in synthetic datasets but is scarce for real-world scenes.
SHARP uses a two-stage training strategy that sidesteps this problem for real data entirely.
Train on large synthetic datasets where everything is known perfectly: ground-truth depth, ground-truth multi-view images, ground-truth camera poses. The model learns the fundamentals of 3D reconstruction — how to unproject depth, how to refine Gaussians, how to render accurately.
This is the "learn the physics" stage. The model develops a strong internal model of how 3D geometry relates to 2D images.
Now we want the model to work on real photos. But real photos don't have multi-view GT. So how do we train?
Why does this work? Because the rendered novel view looks like a real image (it has the camera noise and appearance characteristics of the rendered scene), but the ground truth (the original real photo) is a genuine real image that the model must reproduce. This forces the model to close the domain gap between synthetic rendering and real photography.
The compute split is telling: Stage 1 needs 128 A100s because the synthetic datasets are enormous and the model must learn 3D reconstruction from the ground up, with only the Depth Pro backbone arriving pretrained. Stage 2 needs only 32 A100s: the model already understands 3D and just has to adapt its output distribution to real images.
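As a rough back-of-the-envelope check, the step counts quoted in the summary table at the end of this article can be combined with the GPU counts. GPU-steps are only a crude proxy for compute, since they assume each step costs the same per GPU in both stages.

```python
# Step counts and GPU counts as quoted in this article's summary table.
stage1_steps, stage1_gpus = 100_000, 128
stage2_steps, stage2_gpus = 60_000, 32

stage1_gpu_steps = stage1_steps * stage1_gpus   # 12,800,000
stage2_gpu_steps = stage2_steps * stage2_gpus   # 1,920,000
ratio = stage1_gpu_steps / stage2_gpu_steps

print(f"Stage 1 uses about {ratio:.1f}x the GPU-steps of Stage 2")
# Stage 1 uses about 6.7x the GPU-steps of Stage 2
```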
Stage 1 (synthetic) vs Stage 2 (self-supervised).
SHARP uses ten loss terms. Each addresses a specific failure mode. Together they force the model to produce Gaussians that are accurate, sharp, geometrically valid, and well-behaved.
Lcolor (L1 loss on rendered vs GT pixels): The most direct supervision. Minimizing L1 pixel error pushes the model toward accurate color reconstruction. L1 is preferred over L2 because it's more robust to occasional large errors (outliers from floating Gaussians).
Lpercep (perceptual loss for inpainted regions): When Gaussians fill in occluded regions (Layer 2), there's no GT pixel to compare against. Instead, a VGG-based perceptual loss compares feature-level similarity — the inpainted content doesn't need to be pixel-perfect, just perceptually plausible.
Lα (binary cross-entropy on opacity): Penalizes Gaussians with intermediate opacity. Pushes Gaussians to be either fully opaque (at a real surface) or fully transparent (empty space). This suppresses the model's tendency to place semi-transparent Gaussians everywhere as a hedge.
Ldepth (L1 on predicted vs GT disparity, Layer 1 only): Directly supervises the depth decoder. Uses disparity (1/depth) rather than depth because disparity is roughly linear in image coordinates — it's easier for a network to predict uniformly.
Ltv (total variation on Layer 2 depth): The second depth layer should be smooth — occluded surfaces are typically continuous. Total variation penalizes sharp discontinuities in Layer 2, preventing it from hallucinating physically implausible depth discontinuities.
Lgrad (gradient loss to suppress floaters): Penalizes Gaussians whose depth gradient is inconsistent with the image gradient. If there's no visible edge in the image at a location, there shouldn't be a sudden depth change (which would cause a floating Gaussian in space).
Lδ (L2 on position offsets): The Gaussian decoder outputs position deltas relative to the unprojected base positions. Penalizing large deltas forces the decoder to trust the geometric initialization and make only small corrections. This is an information bottleneck — the decoder can't override geometry entirely.
Lsplat (projected variance bound): Each Gaussian projects onto the image plane as an ellipse. If a Gaussian is very large or oddly oriented, its projected ellipse can cover a huge image area, causing blurry rendering. This loss bounds the projected 2D variance of each Gaussian.
Lscale and L∇scale: Penalize the scale map S from deviating far from 1 and from being spatially non-smooth. This prevents the adjustment module from simply memorizing the GT depth rather than producing a generally useful correction.
Where the three groups are depth losses (d), rendering losses (r), and scale/regularization losses (s), each with their own weighting coefficients λ.
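A few of the simpler terms, and the weighted sum that combines them, can be sketched in plain Python. This is an illustration only: the helper names are hypothetical, the reductions (means) are assumptions, and the weights are made-up placeholders rather than the paper's coefficients.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized flat lists (as in Lcolor)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def disparity_l1(pred_depth, gt_depth):
    """L1 on disparity (1/depth) rather than on depth itself (as in Ldepth)."""
    return l1_loss([1.0 / d for d in pred_depth], [1.0 / d for d in gt_depth])


def total_variation(grid):
    """Mean absolute difference between adjacent cells (as in Ltv); needs at least 2 cells."""
    h, w = len(grid), len(grid[0])
    diffs = [abs(grid[i][j + 1] - grid[i][j]) for i in range(h) for j in range(w - 1)]
    diffs += [abs(grid[i + 1][j] - grid[i][j]) for i in range(h - 1) for j in range(w)]
    return sum(diffs) / len(diffs)


def offset_l2(deltas):
    """Mean squared length of per-Gaussian position offsets (as in Lδ)."""
    return sum(dx * dx + dy * dy + dz * dz for dx, dy, dz in deltas) / len(deltas)


def weighted_total(terms, weights):
    """Combine named loss terms with their weighting coefficients."""
    return sum(weights[name] * value for name, value in terms.items())


terms = {
    "color": l1_loss([0.0, 1.0], [0.5, 0.5]),               # 0.5
    "depth": disparity_l1([2.0], [4.0]),                     # |0.5 - 0.25| = 0.25
    "tv": total_variation([[0.0, 1.0], [0.0, 1.0]]),         # 0.5
    "delta": offset_l2([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),  # 0.5
}
weights = {"color": 1.0, "depth": 1.0, "tv": 0.1, "delta": 0.01}  # placeholder values
total = weighted_total(terms, weights)
```

Keeping each term as a separate named entry makes it easy to log them individually, which is how ablations like the ones discussed later are usually diagnosed.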
Relative contribution of each loss term.
SHARP is evaluated zero-shot on six benchmark datasets that it was never trained on: Middlebury (stereo), Booster (indoor high-res), ScanNet++ (indoor RGB-D), WildRGBD (in-the-wild objects), ETH3D (outdoor scenes), Tanks and Temples (outdoor large-scale).
Zero-shot evaluation is the hardest test — no fine-tuning on test distribution, no domain-specific adaptation. Just train, deploy, and see.
| Metric | SHARP vs Gen3C (best prior) | SHARP vs Flash3D |
|---|---|---|
| LPIPS improvement | 25–34% | Massive on all datasets |
| DISTS improvement | 21–43% | Consistent improvement |
| Synthesis speed | 1000× faster | Similar speed advantage |
| Rendering FPS | 100+ FPS vs ~0.1 FPS diffusion | — |
Gen3C is a video diffusion model fine-tuned for view synthesis. It was the state-of-the-art before SHARP — better quality than all previous methods. But it takes 100+ seconds per synthesis because it runs many diffusion steps. SHARP beats Gen3C on quality and is 1000× faster.
The diffusion methods (ViewCrafter, SVC) can be impressive for large viewpoint changes — they can hallucinate plausible content for regions completely invisible from the input view. SHARP is constrained to what can be geometrically inferred from depth, so it doesn't hallucinate as freely. For nearby views (the typical use case), SHARP's geometric grounding gives better results.
LPIPS scores across methods, per dataset. Lower is better. SHARP consistently wins.
An ablation study removes components one at a time to verify they contribute. SHARP's ablations show that each design decision — the two-layer depth, the SSFT, the specific loss terms — meaningfully improves results.
| Removed Component | What Breaks | Why |
|---|---|---|
| Perceptual loss (Lpercep) | Inpainted regions (Layer 2) become blurry | Without perceptual loss, L1 pixel loss over-smooths the inpainted areas |
| Depth adjustment module | Artifacts on reflective surfaces, mirrors | Reflective surfaces violate Lambertian assumptions; the depth adjustment handles the depth ambiguity these cases create |
| SSFT (Stage 2 training) | Reduced sharpness on real images | Without adapting to real data, the model's outputs are optimized for synthetic rendering style |
| Layer 2 depth | Ghosting holes when camera moves laterally | No occluded surface information to fill disoccluded regions |
| Fine-tuning depth decoder | Worse geometry → worse Gaussian placement | A frozen depth decoder predicts depth for the depth-estimation task, not for view-synthesis initialization |
| Position-offset regularizer (Lδ) | Floaters, inaccurate geometry | Without regularization, the decoder moves Gaussians arbitrarily far from the geometric init |
Some components interact non-trivially. The depth adjustment module and SSFT together produce a larger improvement than either alone — the adjustment module prevents depth errors from corrupting the pseudo-GT in SSFT, making the self-supervision signal cleaner. Synergy between components is common in well-designed systems.
Relative quality score as components are removed. All components on = SHARP's full performance.
SHARP is a node in a larger ecosystem of 3D reconstruction and novel view synthesis methods. Understanding its relationships clarifies both what it does well and where it might struggle.
| Method | Speed | Views needed | Real-time render | Limitation |
|---|---|---|---|---|
| NeRF | Minutes per scene | Many (50–200) | No (costly ray march) | Per-scene optimization |
| 3DGS | Minutes per scene | Many (30–100) | Yes (100+ FPS) | Per-scene optimization |
| Flash3D | <1 sec | One | Yes | Lower quality, no Layer 2 |
| Gen3C | 100+ seconds | One | No | Slow diffusion, stochastic |
| ViewCrafter | 100+ seconds | One | No | Slow diffusion; hallucinated content is not geometrically grounded |
| SHARP | <1 sec | One | Yes (100+ FPS) | Limited to nearby views |
Nearby views only. SHARP can't synthesize far-off views (e.g., 90° rotation). Its Gaussians are seeded from visible geometry; surfaces that are completely out of frame have no initialization and can't be hallucinated reliably. Diffusion models can hallucinate plausible content here, at the cost of geometric accuracy.
Dynamic scenes. SHARP assumes a static scene. If the photo contains motion blur or the scene changes between capture and viewing, the 3D representation won't be valid.
Reflective and transparent surfaces. Glass and mirrors violate the Lambertian assumption that a surface looks the same from any viewpoint. The depth adjustment module partially compensates, but these remain challenging.
The most exciting open question: can you combine SHARP's speed and geometric grounding with diffusion's hallucination ability for large viewpoint changes? One approach: use SHARP to initialize a sparse 3D representation, then use diffusion to fill in unseen regions. This hybrid would get accurate geometry where data exists and plausible completion where it doesn't.
Another direction: extend SHARP to video input. Given a short video clip instead of a single image, the model has multiple views available and could produce much more accurate 3D. This is essentially the feedforward version of 3DGS — fast reconstruction from sparse known-pose frames.
| Component | Key Detail |
|---|---|
| Input | 1536×1536 single RGB image |
| Backbone | Depth Pro (dual ViT, patch-frozen) |
| Depth layers | 2 (visible + occluded) |
| Gaussians | ~1.2M, 14 attributes each |
| Inference time | <1 second on GPU |
| Rendering FPS | 100+ |
| vs Gen3C (LPIPS) | 25–34% improvement |
| Training Stage 1 | 100K steps, 128 A100s, synthetic data |
| Training Stage 2 | 60K steps, 32 A100s, SSFT on real data |
| Paper | arXiv 2512.10685 |
| Code | github.com/apple/ml-sharp |