SHARP: Single-image 3D Reconstruction

Chapter 0: The Problem

You take a photo of your living room. Just one shot from your phone. Now you want to walk through that scene in VR, or peek around the corner of the couch you couldn't quite capture. Can a computer reconstruct the full 3D scene from that single image?

This is the problem of monocular view synthesis: given one photo, generate photorealistic images from nearby viewpoints. It sounds like it should be impossible, since a flat image stores no explicit depth. But humans do it effortlessly. When you look at a photo of a room, you immediately have a sense of which objects are close, which are far, and what's probably hiding just outside the frame.

The computational approaches before SHARP fell into a brutal quality-speed tradeoff:

Optimization-based (NeRF / 3DGS): Give the model many photos of the same scene and let it optimize a neural representation over minutes or hours. Result: stunning quality. Cost: completely impractical for single images.

Diffusion-based (Gen3C, ViewCrafter): Use a video diffusion model to hallucinate plausible novel views. Result: reasonable quality, stochastic. Cost: 100–200 seconds per synthesis, not real-time.

SHARP lands in a completely different spot: 25–34% better perceptual quality than diffusion methods, 1000× faster synthesis, and 100+ FPS rendering, all from a single feedforward pass taking under a second.

The core tension: Diffusion models are expressive but slow and stochastic. Optimization methods are precise but need many images and minutes. SHARP asks: can we just directly predict the 3D structure from one image, the same way a depth estimator directly predicts depth?

Latency vs Quality: Where SHARP Lands

X-axis: synthesis time (log scale, seconds). Y-axis: perceptual quality (lower LPIPS = better). SHARP achieves the best quality and is among the fastest.

Why can't a standard NeRF or 3DGS model solve single-image view synthesis?

They can only handle static scenes, not dynamic ones
They require too much GPU memory for a single forward pass
They require many images from different viewpoints and minutes of per-scene optimization; neither is available with just one photo

Chapter 1: The Key Insight — Regression, Not Generation

Every diffusion-based view synthesis method frames the problem as: "generate a plausible new image that looks like it came from a nearby camera." This is a generation problem — the model hallucinates content. It's slow because diffusion is iterative (many denoising steps), and it's stochastic because the model can produce different outputs each time.

SHARP reframes the problem entirely: "predict the 3D Gaussian representation of the scene directly." This is a regression problem — the model estimates something that already exists in the world. One forward pass. Deterministic. Fast.

The key realization: If you can predict accurate depth from a single image (which modern monocular depth estimators can do remarkably well), you can unproject that depth map into 3D space and initialize a cloud of 3D Gaussians. Then a decoder refines those Gaussians to be photorealistic. No iteration required.

Architecture Overview

The input image passes through five stages, each with a specific job:

Input Image

1536×1536 RGB photo

↓

Depth Pro Encoder

Pretrained ViT backbone → 4 multi-resolution feature maps

↓

Two-Layer Depth Decoder

Predicts TWO depth maps: visible surfaces + occluded regions

↓

Gaussian Initializer

Unprojects depth to 3D → 768×768 ≈ 590K base Gaussians

↓

Gaussian Decoder

Refines all 14 attributes per Gaussian → 1.2M final Gaussians

↓

3DGS Renderer

Novel views at 100+ FPS via differentiable rasterization

Why 3D Gaussians? A 3D Gaussian is defined by 14 numbers: 3 for position (μ), 3 for scale (s), 4 for rotation (quaternion q), 3 for color (c), 1 for opacity (α). Each Gaussian is like a fuzzy blob in 3D space. Millions of them together form a complete scene representation that can be rasterized (rendered) in real-time — unlike NeRF, which requires expensive ray marching.
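To make the 14-number count concrete, here is a minimal sketch that packs one Gaussian into a flat array. The field order and the names `GAUSSIAN_LAYOUT` and `pack_gaussian` are assumptions made for this illustration, not SHARP's actual storage format.

```python
import numpy as np

# Hypothetical flat layout: attribute name -> number of floats.
GAUSSIAN_LAYOUT = {"position": 3, "scale": 3, "rotation": 4, "color": 3, "opacity": 1}

def pack_gaussian(position, scale, rotation, color, opacity):
    """Concatenate the attributes of one Gaussian into a single 14-vector."""
    parts = [position, scale, rotation, color, [opacity]]
    return np.concatenate([np.asarray(p, dtype=np.float32) for p in parts])

g = pack_gaussian(
    position=[0.0, 0.0, 2.0],
    scale=[0.01, 0.01, 0.01],
    rotation=[1.0, 0.0, 0.0, 0.0],  # identity quaternion, (w, x, y, z) order assumed
    color=[0.8, 0.6, 0.4],
    opacity=0.5,
)
assert g.shape == (sum(GAUSSIAN_LAYOUT.values()),)  # 3 + 3 + 4 + 3 + 1 = 14
```

A whole scene is then just an (N, 14) array, which is part of why this representation is convenient for a network to regress.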

What is the fundamental difference between SHARP and diffusion-based view synthesis?

SHARP directly regresses the 3D scene representation in one feedforward pass; diffusion models iteratively generate plausible images
SHARP uses 3D Gaussian Splatting; diffusion models use NeRF
SHARP is trained on more data than diffusion models

Chapter 2: The Depth Pro Backbone

Every vision system is only as good as its perceptual foundation. SHARP's foundation is Depth Pro, Apple's state-of-the-art monocular depth estimator, already pretrained to extract rich spatial features from images. Rather than training a vision backbone from scratch, SHARP inherits Depth Pro's understanding of scene geometry.

Depth Pro uses a dual-ViT architecture: one Vision Transformer processes a low-resolution global view of the entire image (to understand overall scene structure), and another processes high-resolution patches (to capture fine surface details). The two streams are fused into a feature pyramid — four feature maps at different spatial resolutions, each carrying different levels of geometric abstraction.

Why a feature pyramid? A single feature map would have to balance global context (is this a wall or sky?) against local detail (where exactly is the edge?). A pyramid separates these concerns — coarse maps capture structure, fine maps capture boundaries. This is the same principle behind FPN in object detection.

What Gets Frozen, What Gets Trained

Here's a critical engineering decision: SHARP fine-tunes Depth Pro for the view synthesis task, but it does not fine-tune everything equally.

Component	Treatment	Why
High-res patch ViT	Frozen	Pretrained features are already excellent; fine-tuning could destroy them
Low-res global ViT	Fine-tuned	Needs to adapt global understanding to view synthesis cues
Depth decoder	Fine-tuned	Needs to output view-synthesis-optimized depth, not just depth estimation depth

This is a classic partial fine-tuning strategy. The patch encoder has learned universal visual features (edges, textures, materials) that transfer directly. The global encoder and decoder need to learn what makes a good initialization for 3D Gaussians — which is subtly different from what makes a good depth estimate.

Input resolution: 1536×1536. This is quite high for a vision model. Depth Pro uses patch-based processing to handle this efficiently: the image is split into overlapping patches, each processed independently by the patch ViT, then recombined. Attention is still quadratic within each patch, but the cost no longer grows quadratically with the token count of the full image, so high-resolution detail stays affordable.
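A back-of-the-envelope calculation shows why tiling helps. The numbers below are assumptions for illustration only: 16×16-pixel tokens, 384×384 tiles, and non-overlapping tiling (the real pipeline uses overlapping patches, and the text does not state Depth Pro's exact tile size).

```python
def attention_pairs(image_side, token_side=16):
    """Token-to-token pairs for full self-attention over a square image."""
    tokens = (image_side // token_side) ** 2
    return tokens * tokens

# One ViT pass over the whole 1536x1536 image.
full = attention_pairs(1536)

# The same image cut into non-overlapping 384x384 tiles (a simplification).
tiles_per_side = 1536 // 384
tiled = tiles_per_side ** 2 * attention_pairs(384)

print(full, tiled, full // tiled)
```

Under these assumptions the tiled version needs 16 times fewer attention pairs, and the gap widens as the input resolution grows.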

Feature Pyramid Visualization

The four feature maps output by Depth Pro at different resolutions. Each level captures different geometric information.

Why does SHARP freeze the high-resolution patch ViT but fine-tune the low-resolution global ViT?

The patch ViT is too large to fine-tune given GPU memory constraints
The patch ViT's pretrained local features transfer directly to view synthesis; the global ViT needs to adapt its scene-level understanding to predict Gaussian-optimal structure
The patch ViT uses a different architecture that doesn't support gradients

Chapter 3: Two-Layer Depth — Seeing Behind Objects

If you take a photo of a table with a mug on it, and then move your camera slightly to the left, you'll suddenly see a bit of table that was previously hidden behind the mug. This is called disocclusion: surfaces that were occluded from the original viewpoint become visible in the new viewpoint.

A single depth map can't handle this. It knows where the mug is (foreground) and where the visible table is, but it has no information about the table surface hidden underneath. When you render from a new angle, there's a hole where the mug used to be. This creates the ghosting artifacts common in naive view synthesis.

SHARP's solution: Predict TWO depth layers. Layer 1 captures visible surfaces (what the camera sees). Layer 2 captures occluded surfaces (what's hiding behind foreground objects). Together, they give Gaussians on both layers, so when you move the camera, Layer 2 fills in the disoccluded regions.
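A one-dimensional toy version of the mug-on-table scene shows the effect. This is only a sketch: the scanline, the depths, and the helper `covered_columns` are made up for illustration, and real rendering splats Gaussians rather than shifting pixels.

```python
import numpy as np

def covered_columns(depth, shift_gain):
    """Forward-warp a 1D depth layer and mark which target columns get hit.

    Each source column moves by round(shift_gain / depth) pixels, so nearer
    surfaces (small depth) move further than distant ones.
    """
    width = depth.shape[0]
    target = np.arange(width) + np.rint(shift_gain / depth).astype(int)
    covered = np.zeros(width, dtype=bool)
    in_frame = (target >= 0) & (target < width)
    covered[target[in_frame]] = True
    return covered

# Layer 1: background at depth 4 with a foreground object (depth 1) in columns 4..6.
layer1 = np.full(12, 4.0)
layer1[4:7] = 1.0
# Layer 2: the background surface continuing behind the object.
layer2 = np.full(12, 4.0)

one_layer = covered_columns(layer1, shift_gain=4.0)
two_layers = one_layer | covered_columns(layer2, shift_gain=4.0)

holes_one = np.flatnonzero(~one_layer).tolist()
holes_two = np.flatnonzero(~two_layers).tolist()
print(holes_one, holes_two)  # [0, 5, 6, 7] [0]
```

With a single layer, columns 5 to 7 are disocclusion holes left behind by the foreground object. The second layer fills them; only column 0 stays empty, because that content lies outside the original frame.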

The Depth Adjustment Module

There's a complication: during training on synthetic data, we have ground-truth depth. But during inference on real images, we don't. SHARP needs to handle both cases gracefully.

The depth adjustment module is a small U-Net (2M parameters) that takes the predicted depth plus — when available — ground-truth depth, and outputs a scale map S that refines the depth prediction. The inspiration is a Conditional VAE: the scale map resolves depth ambiguity when additional information is available.

D'(i,j) = S(i,j) · D̂(i,j)

Where D̂ is the raw depth prediction and S is the scale map. During inference on real images (no GT depth), S = 1 everywhere — the module is bypassed entirely. During training, S learns to correct depth errors when it can peek at the ground truth.
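The two modes can be written in a few lines. This is a minimal sketch: in SHARP the scale map comes from the small U-Net, whereas here it is a hand-written array, and `adjust_depth` is a name invented for the example.

```python
import numpy as np

def adjust_depth(pred_depth, scale_map=None):
    """Apply D' = S * D_hat; with no scale map, S is 1 everywhere."""
    if scale_map is None:
        scale_map = np.ones_like(pred_depth)
    return scale_map * pred_depth

pred = np.array([[2.0, 4.0], [1.0, 8.0]])

# Inference on a real image: no ground truth, so the depth passes through unchanged.
inference_depth = adjust_depth(pred)

# Training: a (made-up) scale map nudges individual pixels up or down.
training_depth = adjust_depth(pred, np.array([[1.1, 1.0], [0.9, 1.0]]))
```

Because the correction is multiplicative, S = 1 is the natural "do nothing" value, which is what makes the inference-time bypass clean.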

Why this design matters: Without the adjustment module, the model would need to produce perfect depth from the image alone during training — which is impossible for ambiguous regions (mirrors, glass, thin structures). The adjustment module acts as a safety valve during training, letting the depth decoder focus on getting the overall structure right rather than every pixel perfect.

Two-Layer Depth — Disocclusion Demo

A simple scene: foreground objects (warm) occlude background surfaces (teal). Shifting the viewpoint reveals what Layer 2 fills in.

During inference on a real image (no ground-truth depth available), what does the depth adjustment scale map S equal?

S = 0 (depth adjustment is disabled by zeroing out)
S = 1 everywhere (the scale map is an identity: no correction applied)
S = the predicted disparity map

Chapter 4: From Depth to Gaussians — The Initialization

Here's where the geometry comes in. Once we have a depth map, we can mathematically reconstruct where every pixel lives in 3D space. This process is called unprojection: taking a 2D depth value and a pixel location, and computing the 3D point in camera space.

Step by Step: Building Base Gaussians

Starting from the predicted depth map D':

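Step 1: Subsample. The depth map is at 1536×1536, but we subsample by 2× to 768×768. Why? 768×768 = 589,824, or about 590K Gaussians, which is already a dense representation. With one Gaussian per pixel in each of the two depth layers, that comes to roughly 1.18M, consistent with the ~1.2M final count. Going to full resolution would quadruple that number.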

Step 2: Unproject to 3D. For each pixel (i, j) with depth value D'(i,j), compute the 3D position:

μ(i,j) = [i · D'(i,j), j · D'(i,j), D'(i,j)]^T

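This places each Gaussian at the correct 3D location. For the formula to hold under the standard pinhole model, (i, j) must be read as normalized image-plane coordinates (pixel coordinates with the principal point subtracted, divided by the focal length). Scaled by depth, they give the x, y position in camera space. D' itself is the z depth.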

Step 3: Set scale proportional to depth. Far-away Gaussians should be larger (they cover more 3D space per pixel):

s(i,j) = s₀ · D'(i,j)

Where s₀ is a small base scale factor. This prevents holes in the rendered output — near objects have fine-grained small Gaussians, distant objects have coarser large ones.

Step 4: Color from image. Each Gaussian's initial color is just the corresponding pixel color from the subsampled image I'(i,j).

Step 5: Default rotation and opacity. Rotation = identity quaternion [1,0,0,0] (axis-aligned Gaussian). Opacity = 0.5 (neutral starting point).

These are BASE Gaussians — not the final answer. They're a structured initialization that captures approximate 3D geometry. The Gaussian decoder then refines all 14 attributes to make the scene photorealistic.
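Steps 1 to 5 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not SHARP's implementation: a pinhole camera with the principal point at the image center, isotropic scale, a (w, x, y, z) quaternion order, and a hypothetical function name and attribute layout.

```python
import numpy as np

def init_base_gaussians(image, depth, focal, s0, stride=2):
    """Build base Gaussians from an (H, W, 3) image and an (H, W) depth map.

    Assumes a pinhole camera with focal length `focal` (in pixels) and the
    principal point at the image center. Returns an (N, 14) array laid out as
    position(3), scale(3), rotation(4), color(3), opacity(1).
    """
    h, w = depth.shape
    d = depth[::stride, ::stride]                       # Step 1: subsample
    rgb = image[::stride, ::stride]                     # Step 4: color from image
    rows, cols = np.mgrid[0:h:stride, 0:w:stride]
    x = (cols - (w - 1) / 2.0) / focal                  # normalized image coords
    y = (rows - (h - 1) / 2.0) / focal
    position = np.stack([x * d, y * d, d], axis=-1)     # Step 2: unproject
    scale = np.repeat((s0 * d)[..., None], 3, axis=-1)  # Step 3: s = s0 * depth
    rotation = np.zeros(d.shape + (4,))                 # Step 5: identity quaternion
    rotation[..., 0] = 1.0
    opacity = np.full(d.shape + (1,), 0.5)              # Step 5: opacity 0.5
    attrs = np.concatenate([position, scale, rotation, rgb, opacity], axis=-1)
    return attrs.reshape(-1, 14)

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
depth = np.full((8, 8), 2.0)
gaussians = init_base_gaussians(image, depth, focal=8.0, s0=0.25)
print(gaussians.shape)  # (16, 14)
```

The toy 8×8 input yields 4×4 = 16 Gaussians; the same code on a 1536×1536 depth map would yield 768×768 = 589,824 per layer. Choosing s0 close to the pixel spacing divided by the focal length makes neighboring Gaussians roughly touch, which is the hole-avoidance argument from Step 3.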

The Gaussian Decoder and Composer

The decoder is a U-Net-style network operating on the 768×768 grid. It takes the base Gaussian attributes plus the feature pyramid from Depth Pro, and outputs attribute deltas ΔG for all 14 attributes.

The composer applies these deltas with careful activation functions to prevent extreme values:

G_attr = γ(γ⁻¹(G₀) + η · ΔG)

Where γ is the attribute-specific activation (e.g., sigmoid for opacity, exp for scale), γ⁻¹ is its inverse, η is a small step size, and G₀ is the base attribute. This composer pattern ensures the base initialization is respected — the decoder adjusts rather than replaces.
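Here is a small sketch of that composer pattern for two attributes, using sigmoid for opacity and exp for scale as in the text. The step size `eta=0.1` and the function names are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p) - np.log1p(-p)

def compose(base, delta, activation, inverse, eta=0.1):
    """G = gamma(gamma^-1(G0) + eta * dG): update in pre-activation space."""
    return activation(inverse(base) + eta * delta)

# Even a very large raw delta cannot push opacity outside (0, 1) ...
opacity = compose(np.array([0.5]), np.array([100.0]), sigmoid, logit)
# ... and a negative delta shrinks the scale but keeps it positive.
scale = compose(np.array([0.02]), np.array([-3.0]), np.exp, np.log)
# A zero delta returns the base initialization unchanged.
unchanged = compose(np.array([0.5]), np.array([0.0]), sigmoid, logit)
```

Working in pre-activation space means the decoder can output unconstrained numbers while every composed attribute stays in its valid range.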

Why compose rather than predict from scratch? Predicting 14 Gaussian attributes per pixel purely from image features is an extremely hard regression problem — the network would have to re-derive geometry it already knows. By starting from the geometrically correct initialization and predicting small corrections, the decoder can focus on appearance (color, opacity, sharpness) rather than position.

Interactive Unprojection — Base vs Refined Gaussians

A toy scene with objects at different depths, showing how Gaussians are placed in 3D for the base initialization and for the refined result.

Why is Gaussian scale set proportional to depth (s = s₀ · D)?

Larger Gaussians produce sharper renderings at all distances
The training loss penalizes small Gaussians more than large ones
Each pixel covers more 3D space at greater depth, so a larger Gaussian is needed to avoid holes in the rendered image

Chapter 5: Training Strategy — Synthetic Then Self-Supervised

SHARP faces a classic problem in 3D vision: multi-view training data is expensive to collect. If you want to train a model to synthesize novel views, you need images of the same scene from multiple cameras — paired ground truth. This exists in synthetic datasets but is scarce for real-world scenes.

SHARP uses a two-stage training strategy that sidesteps this problem for real data entirely.

Stage 1: Synthetic Pretraining (100K steps, 128 A100s)

Train on large synthetic datasets where everything is known perfectly: ground-truth depth, ground-truth multi-view images, ground-truth camera poses. The model learns the fundamentals of 3D reconstruction — how to unproject depth, how to refine Gaussians, how to render accurately.

This is the "learn the physics" stage. The model develops a strong internal model of how 3D geometry relates to 2D images.

Stage 2: Self-Supervised Fine-Tuning (SSFT, 60K steps, 32 A100s)

Now we want the model to work on real photos. But real photos don't have multi-view GT. So how do we train?

The SSFT trick: Use the trained model to generate its own supervision. Take a real image → predict 3D → render a novel view from a nearby camera → swap the roles: use the rendered novel view as the INPUT and the original real image as the GROUND TRUTH. Now you have a training pair on real-image characteristics.

Why does this work? The rendered novel view is derived from a real photo, so it carries real-image appearance (textures, lighting, sensor noise) together with whatever artifacts the model's reconstruction introduces. The target, the original real photo, is a genuine real image that the model must reproduce. This forces the model to close the domain gap between synthetic rendering and real photography.

Real Image I

No GT depth, no GT novel views

↓ Run trained model

Predicted 3D Gaussians

Reconstruct scene from I

↓ Render from nearby camera

Pseudo Novel View I'

Looks like a real image of the same scene

↓ Swap: I' → input, I → target

Training Pair

Input: I' (rendered view) → GT: I (real photo)

The compute split is telling: Stage 1 needs 128 A100s because the synthetic datasets are enormous and the model must learn everything from scratch. Stage 2 only needs 32 A100s — the model already understands 3D; it just needs to adapt its output distribution to real images.

SSFT is principled, not hacky. It's related to self-distillation and pseudo-label training in semi-supervised learning. The key insight: the model's own outputs, when used as inputs, naturally represent the distribution of images the model will encounter at inference time.
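The role swap at the heart of SSFT fits in a few lines. In this sketch, `make_ssft_pair` and the stand-in callables are hypothetical; a real setup would plug in the Stage 1 model and a 3DGS rasterizer, and the "pose" here is just an integer pixel shift.

```python
import numpy as np

def make_ssft_pair(real_image, predict_gaussians, render, nearby_pose):
    """Return (input, target) for one SSFT step: rendered view in, real photo out."""
    gaussians = predict_gaussians(real_image)      # single-image reconstruction
    pseudo_view = render(gaussians, nearby_pose)   # pseudo novel view I'
    return pseudo_view, real_image                 # swapped roles

# Stand-in callables so the sketch runs end to end.
def fake_predict(image):
    return {"rgb": image}

def fake_render(gaussians, pose):
    # Crude "camera shift": roll the image sideways by `pose` pixels.
    return np.roll(gaussians["rgb"], shift=pose, axis=1)

photo = np.random.default_rng(0).random((4, 6, 3))
net_input, target = make_ssft_pair(photo, fake_predict, fake_render, nearby_pose=1)
```

Note that the target is the untouched real photo, so the supervision signal itself never contains rendering artifacts; only the input does.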

Training Pipeline Visualization

Stage 1 (synthetic) vs Stage 2 (self-supervised).

In SSFT, what plays the role of "ground truth" in the training pair?

The original real photograph: the model renders a nearby view and must predict the real photo from that render
The rendered novel view itself (the model supervises itself)
Depth maps from a separate depth sensor

Chapter 6: Loss Functions — A Careful Recipe

SHARP combines the ten loss terms listed below, grouped into rendering, depth, regularization, and depth-adjustment losses. Each addresses a specific failure mode. Together they force the model to produce Gaussians that are accurate, sharp, geometrically valid, and well-behaved.

Rendering Losses

L_color (L1 loss on rendered vs GT pixels): The most direct supervision. Minimizing L1 pixel error pushes the model toward accurate color reconstruction. L1 is preferred over L2 because it's more robust to occasional large errors (outliers from floating Gaussians).

L_percep (perceptual loss for inpainted regions): When Gaussians fill in occluded regions (Layer 2), there's no GT pixel to compare against. Instead, a VGG-based perceptual loss compares feature-level similarity — the inpainted content doesn't need to be pixel-perfect, just perceptually plausible.

L_α (binary cross-entropy on opacity): Penalizes Gaussians with intermediate opacity. Pushes Gaussians to be either fully opaque (at a real surface) or fully transparent (empty space). This suppresses the model's tendency to place semi-transparent Gaussians everywhere as a hedge.

Depth Loss

L_depth (L1 on predicted vs GT disparity, Layer 1 only): Directly supervises the depth decoder. Uses disparity (1/depth) rather than depth because the pixel displacement caused by a small camera translation is proportional to disparity. A disparity error therefore maps directly to a reprojection error in pixels, and very distant regions do not dominate the loss.
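A quick numerical check makes the disparity argument tangible. This is a sketch with made-up depth values; the function name and the `eps` guard are assumptions, and the text does not specify how SHARP normalizes this loss.

```python
import numpy as np

def disparity_l1(pred_depth, gt_depth, eps=1e-6):
    """Mean absolute error between predicted and ground-truth disparity (1/depth)."""
    pred_disp = 1.0 / np.maximum(pred_depth, eps)
    gt_disp = 1.0 / np.maximum(gt_depth, eps)
    return np.mean(np.abs(pred_disp - gt_disp))

# The same absolute depth error of 0.1 units, close to the camera and far away.
near = disparity_l1(np.array([1.1]), np.array([1.0]))
far = disparity_l1(np.array([50.1]), np.array([50.0]))
print(near, far)
```

The near-field error is penalized far more heavily than the identical far-field error, which matches how much each would shift pixels in a nearby novel view.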

Regularization Losses

L_tv (total variation on Layer 2 depth): The second depth layer should be smooth — occluded surfaces are typically continuous. Total variation penalizes sharp discontinuities in Layer 2, preventing it from hallucinating physically implausible depth discontinuities.

L_grad (gradient loss to suppress floaters): Penalizes Gaussians whose depth gradient is inconsistent with the image gradient. If there's no visible edge in the image at a location, there shouldn't be a sudden depth change (which would cause a floating Gaussian in space).

L_δ (L2 on position offsets): The Gaussian decoder outputs position deltas relative to the unprojected base positions. Penalizing large deltas forces the decoder to trust the geometric initialization and make only small corrections. This is an information bottleneck — the decoder can't override geometry entirely.

L_splat (projected variance bound): Each Gaussian projects onto the image plane as an ellipse. If a Gaussian is very large or oddly oriented, its projected ellipse can cover a huge image area, causing blurry rendering. This loss bounds the projected 2D variance of each Gaussian.
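Two of these regularizers are simple enough to sketch directly. The forms below (anisotropic total variation, mean squared offset length) are common textbook choices and are assumptions here; the text does not give SHARP's exact norms or weights.

```python
import numpy as np

def total_variation(depth):
    """Mean absolute difference between neighboring pixels (anisotropic TV)."""
    dy = np.abs(np.diff(depth, axis=0)).mean()
    dx = np.abs(np.diff(depth, axis=1)).mean()
    return dx + dy

def offset_penalty(position_delta):
    """Mean squared length of the per-Gaussian position offsets."""
    return np.mean(np.sum(position_delta ** 2, axis=-1))

# A gently sloping Layer 2 depth map versus one with alternating jumps.
smooth = np.tile(np.linspace(2.0, 3.0, 8), (8, 1))
jagged = smooth.copy()
jagged[:, ::2] += 1.0

print(total_variation(smooth), total_variation(jagged))
print(offset_penalty(np.array([[3.0, 4.0, 0.0]])))  # 25.0
```

The jagged map scores several times higher than the smooth ramp, so minimizing the TV term steers Layer 2 toward continuous surfaces, while the offset penalty grows quadratically as a Gaussian drifts from its unprojected position.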

Depth Adjustment Regularizers

L_scale and L_∇scale: Penalize the scale map S from deviating far from 1 and from being spatially non-smooth. This prevents the adjustment module from simply memorizing the GT depth rather than producing a generally useful correction.

L = ∑_d λ_d L_d + ∑_r λ_r L_r + ∑_s λ_s L_s

Where the three groups are depth losses (d), rendering losses (r), and scale/regularization losses (s), each with their own weighting coefficients λ.
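In code, the total is just a weighted sum over named terms. The term values and weights below are invented for illustration; the text does not list SHARP's actual coefficients.

```python
def total_loss(terms, weights):
    """Weighted sum of named loss terms; every term must have a weight."""
    missing = set(terms) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-term values and weights, for illustration only.
terms = {"color": 0.20, "percep": 0.50, "depth": 0.10, "tv": 0.05}
weights = {"color": 1.0, "percep": 0.5, "depth": 2.0, "tv": 0.1}
loss = total_loss(terms, weights)
```

Failing loudly on a missing weight is a deliberate choice: with this many terms, silently dropping one is an easy mistake to make and a hard one to spot.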

Why so many losses? Each one prevents a specific pathology. Without L_α, the model places semi-transparent clouds everywhere. Without L_splat, a few huge Gaussians dominate. Without L_δ, the decoder ignores geometry entirely. The recipe is empirically derived from ablations — each term was tested and shown to matter.

Loss Components Visualization

Relative contribution of each loss term.

What is the purpose of the L_α (opacity) loss?

To maximize the total number of visible Gaussians in the rendered image
To ensure Gaussians near the camera are brighter than distant ones
To push Gaussians toward fully opaque or fully transparent, suppressing semi-transparent floaters that blur the rendering

Chapter 7: Results — SOTA Across 6 Datasets

SHARP is evaluated zero-shot on six benchmark datasets that it was never trained on: Middlebury (stereo), Booster (indoor high-res), ScanNet++ (indoor RGB-D), WildRGBD (in-the-wild objects), ETH3D (indoor and outdoor scenes), Tanks and Temples (outdoor large-scale).

Zero-shot evaluation is the hardest test — no fine-tuning on test distribution, no domain-specific adaptation. Just train, deploy, and see.

Key Numbers

Metric	SHARP vs Gen3C (best prior)	SHARP vs Flash3D
LPIPS improvement	25–34%	Large on all datasets
DISTS improvement	21–43%	Consistent improvement
Synthesis speed	1000× faster	Comparable (both under 1 second)
Rendering FPS	100+ FPS vs ~0.1 FPS diffusion	—

LPIPS over PSNR: SHARP's paper deliberately uses LPIPS and DISTS (perceptual similarity metrics) rather than PSNR/SSIM. Why? PSNR penalizes any pixel shift equally, including sub-pixel misalignments that are invisible to humans. Perceptual metrics measure what actually matters: does the image look right? This is especially important for view synthesis, where slight viewpoint differences cause unavoidable pixel shifts.
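The PSNR problem is easy to reproduce. In this sketch a random texture stands in for fine image detail, and a whole-pixel shift stands in for the sub-pixel misalignments mentioned above; both are simplifications chosen to keep the example short.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
texture = rng.random((64, 64))            # stand-in for fine-grained texture
shifted = np.roll(texture, 1, axis=1)     # same texture, moved one pixel right
noisy = np.clip(texture + rng.normal(0.0, 0.05, texture.shape), 0.0, 1.0)

print(f"one-pixel shift: {psnr(texture, shifted):.1f} dB")
print(f"added noise:     {psnr(texture, noisy):.1f} dB")
```

The shifted copy contains exactly the same texture, yet it scores far worse than the copy with added noise. Perceptual metrics such as LPIPS and DISTS compare features rather than raw pixels, which makes them much less sensitive to this kind of misalignment.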

Why Gen3C is the Meaningful Comparison

Gen3C is a video diffusion model fine-tuned for view synthesis. It was the state-of-the-art before SHARP — better quality than all previous methods. But it takes 100+ seconds per synthesis because it runs many diffusion steps. SHARP beats Gen3C on quality and is 1000× faster.

The diffusion methods (ViewCrafter, SVC) can be impressive for large viewpoint changes — they can hallucinate plausible content for regions completely invisible from the input view. SHARP is constrained to what can be geometrically inferred from depth, so it doesn't hallucinate as freely. For nearby views (the typical use case), SHARP's geometric grounding gives better results.

Results Comparison — Toggle Dataset

LPIPS scores across methods. Lower is better. SHARP consistently wins.

Why does SHARP's paper use LPIPS and DISTS rather than PSNR as the primary evaluation metric?

Perceptual metrics measure visible image quality; PSNR treats sub-pixel misalignments from viewpoint differences as severe errors even when the image looks correct to humans
PSNR is too slow to compute on the large test datasets
LPIPS is the standard metric for all 3D reconstruction papers and must be used for fair comparison

Chapter 8: Ablations — Every Component Earns Its Place

An ablation study removes components one at a time to verify they contribute. SHARP's ablations show that each design decision — the two-layer depth, the SSFT, the specific loss terms — meaningfully improves results.

Removed Component	What Breaks	Why
Perceptual loss (L_percep)	Inpainted regions (Layer 2) become blurry	Without perceptual loss, L1 pixel loss over-smooths the inpainted areas
Depth adjustment module	Artifacts on reflective surfaces, mirrors	Reflective surfaces violate Lambertian assumptions; the depth adjustment handles depth ambiguity these cases create
SSFT (Stage 2 training)	Reduced sharpness on real images	Without adapting to real data, the model's outputs are optimized for synthetic rendering style
Layer 2 depth	Ghosting holes when camera moves laterally	No occluded surface information to fill disoccluded regions
Fine-tuning depth decoder	Worse geometry → worse Gaussian placement	Frozen depth decoder predicts depth for depth estimation task, not for view synthesis initialization
Gaussian position offsets (L_δ removed)	Floaters, inaccurate geometry	Without regularization, decoder moves Gaussians arbitrarily far from geometric init

Ablations as design validation. Reading an ablation table is not just about checking that components help — it reveals the failure mode each component was designed to prevent. Every ablation corresponds to a real artifact that appeared during development and motivated that design choice.

Component Interaction

Some components interact non-trivially. The depth adjustment module and SSFT together produce a larger improvement than either alone — the adjustment module prevents depth errors from corrupting the pseudo-GT in SSFT, making the self-supervision signal cleaner. Synergy between components is common in well-designed systems.

Removing the SSFT stage makes SHARP less sharp on real images. What does this reveal about what SSFT actually teaches?

SSFT teaches the model to match the appearance statistics of real photographs, closing the gap between synthetic training and real-world deployment
SSFT teaches the model to predict depth more accurately
SSFT teaches the model to handle larger viewpoint changes

Chapter 9: Connections — Where SHARP Fits

SHARP is a node in a larger ecosystem of 3D reconstruction and novel view synthesis methods. Understanding its relationships clarifies both what it does well and where it might struggle.

SHARP vs Related Approaches

Method	Speed	Views needed	Real-time render	Limitation
NeRF	Minutes to hours per scene	Many (50–200)	No (costly ray march)	Per-scene optimization
3DGS	Minutes to hours per scene	Many (30–100)	Yes (100+ FPS)	Per-scene optimization
Flash3D	<1 sec	One	Yes	Lower quality, no Layer 2
Gen3C	100+ seconds	One	No	Slow diffusion, stochastic
ViewCrafter	100+ seconds	One	No	Slow diffusion, stochastic (though it can hallucinate large view changes)
SHARP	<1 sec	One	Yes (100+ FPS)	Limited to nearby views

SHARP's Known Limitations

Nearby views only. SHARP can't synthesize far-off views (e.g., 90° rotation). Its Gaussians are seeded from visible geometry; surfaces that are completely out of frame have no initialization and can't be hallucinated reliably. Diffusion models can hallucinate plausible content here, at the cost of geometric accuracy.

Dynamic scenes. SHARP assumes a static scene. If the photo contains motion blur or the scene changes between capture and viewing, the 3D representation won't be valid.

Reflective and transparent surfaces. Glass and mirrors violate the Lambertian assumption that a surface looks the same from any viewpoint. The depth adjustment module partially compensates, but these remain challenging.

Future Directions

The most exciting open question: can you combine SHARP's speed and geometric grounding with diffusion's hallucination ability for large viewpoint changes? One approach: use SHARP to initialize a sparse 3D representation, then use diffusion to fill in unseen regions. This hybrid would get accurate geometry where data exists and plausible completion where it doesn't.

Another direction: extend SHARP to video input. Given a short video clip instead of a single image, the model has multiple views available and could produce much more accurate 3D. This is essentially the feedforward version of 3DGS — fast reconstruction from sparse known-pose frames.

Related lessons to explore:

NeRF & 3D Gaussian Splatting — the representation SHARP builds on
Diffusion Models — the generation approach SHARP replaces
Vision-Language Models — the backbone architecture family

Cheat Sheet

Component	Key Detail
Input	1536×1536 single RGB image
Backbone	Depth Pro (dual ViT, patch-frozen)
Depth layers	2 (visible + occluded)
Gaussians	~1.2M, 14 attributes each
Inference time	<1 second on GPU
Rendering FPS	100+
vs Gen3C (LPIPS)	25–34% improvement
Training Stage 1	100K steps, 128 A100s, synthetic data
Training Stage 2	60K steps, 32 A100s, SSFT on real data
Paper	arXiv 2512.10685
Code	github.com/apple/ml-sharp

The big picture: SHARP demonstrates a general principle — when a generation problem has enough structure (known geometry from depth, known rendering from Gaussian splatting), reframe it as a regression problem. You trade hallucination flexibility for determinism, speed, and geometric accuracy. For nearby view synthesis, that trade is overwhelmingly worth it.

SHARP: Single-Image 3D in Under a Second