Fine-tune Stable Diffusion to jointly output depth and surface normals from a single image. Generative models already "know" 3D — teach them to show it.
You have a single photograph. Maybe it's a living room, maybe it's a painting of a castle, maybe it's an underwater shot of a coral reef. You want to know the depth (how far away is each pixel?) and the surface normal (which direction does the surface face at each pixel?).
This is called monocular geometry estimation — recovering 3D structure from a single 2D image. It's fundamentally ill-posed: infinitely many 3D scenes can produce the same 2D photograph. You need strong priors — knowledge about how the world works — to resolve the ambiguity.
Until recently, the dominant strategy was discriminative: train a CNN or Vision Transformer to directly regress depth/normal maps from images. Models like MiDaS, DPT, and DepthAnything take this approach. Feed in an image, get out a depth map. Simple, fast, and it works — on images that look like the training data.
The problem? These models are trained on specific datasets: indoor scenes (NYUv2), driving scenes (KITTI), or a mix. When you show them something they've never seen — a painting, an anime frame, a satellite photo, a rendered game scene — they fail. The depth maps lose detail, foreground and background blend together, and surface normals become smeared and inaccurate.
Even DepthAnything, trained on 63.5 million images, shows significant performance drops on "unreal" images — artwork, renders, and generated images — because its discriminative backbone can only interpolate within its training distribution, not extrapolate beyond it.
Here's the core idea behind GeoWizard, and it's beautifully simple:
Stable Diffusion was trained on billions of internet images to generate photorealistic scenes. To do that, it had to learn how the 3D world works — lighting, occlusion, perspective, surface orientation. That knowledge is encoded in its weights. We just need to redirect the output from "generate an RGB image" to "generate a depth map and a normal map."
Why is this better than discriminative models?
Think of it this way: a discriminative model is like a student who memorized answers to specific exam questions. A generative model is like a student who truly understands the subject — they can answer questions they've never seen before because they understand the underlying principles.
Before we see how GeoWizard modifies Stable Diffusion, let's quickly recap how the original model works. If you're already fluent in latent diffusion, skim this and jump to Chapter 3.
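Raw images are huge: a 512×512 RGB image has 786,432 pixel values, and running diffusion directly in pixel space is painfully slow. Stable Diffusion solves this with a VAE (Variational Autoencoder): an encoder compresses the image by 8× along each side into a 4-channel latent (512×512×3 becomes 64×64×4), and a decoder maps latents back to pixels.
All the diffusion happens in this compact latent space, which is much cheaper to run and costs little in output quality.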
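To make the compression concrete, here is a small sketch of the shape arithmetic. It assumes the usual Stable Diffusion VAE settings (8× spatial downsampling, 4 latent channels), which match the 576×768 to 72×96×4 example later in this article. `latent_shape` and `num_values` are hypothetical helpers, not library functions.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (H, W, C) of the VAE latent for an image (assumed SD settings)."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (height // downsample, width // downsample, latent_channels)


def num_values(shape):
    """Total number of scalar entries in a tensor of the given shape."""
    total = 1
    for s in shape:
        total *= s
    return total


pixels = num_values((512, 512, 3))
latents = num_values(latent_shape(512, 512))
print(pixels, latents, pixels / latents)
```

Under these assumptions the U-Net sees roughly 48 times fewer values than it would in pixel space.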
Take a clean latent z0. At each timestep t, mix it with Gaussian noise ε ~ N(0, I):
$$z_t = \alpha_t z_0 + \sigma_t \epsilon$$
Where αt and σt are schedule parameters that control the noise level. At t=T, it's pure noise. At t=0, it's the clean latent.
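A minimal NumPy sketch of this noising step follows. The variance-preserving relation between αt and σt (their squares sum to 1) is an assumption made for the demo, and the latent is random stand-in data rather than a real VAE encoding.

```python
import numpy as np


def add_noise(z0, alpha_t, sigma_t, rng):
    """Forward process: z_t = alpha_t * z0 + sigma_t * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(z0.shape)
    return alpha_t * z0 + sigma_t * eps, eps


rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 72, 96))  # stand-in latent, not a real encoding

# Assumed variance-preserving schedule: alpha_t**2 + sigma_t**2 == 1.
for alpha_t in (1.0, 0.7, 0.0):
    sigma_t = np.sqrt(1.0 - alpha_t**2)
    zt, eps = add_noise(z0, alpha_t, sigma_t, rng)
    print(alpha_t, float(np.abs(zt - z0).mean()))
```

With `alpha_t = 1` the latent is untouched, and with `alpha_t = 0` nothing of the original latent remains.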
A U-Net learns to predict the noise that was added. Given a noisy latent zt and the timestep t, it predicts ε̂. Subtract the predicted noise, and you get a slightly cleaner latent zt-1. Repeat T times: pure noise → clean image.
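One concrete way to "subtract the predicted noise" is the deterministic DDIM-style update sketched below. This illustrates the general idea rather than the exact sampler configuration of Stable Diffusion or GeoWizard. The schedule values are made up, and the true noise stands in for the U-Net's prediction.

```python
import numpy as np


def ddim_step(zt, eps_hat, alpha_t, sigma_t, alpha_prev, sigma_prev):
    """One deterministic DDIM-style update from z_t to a less noisy latent."""
    z0_hat = (zt - sigma_t * eps_hat) / alpha_t  # estimate of the clean latent
    return alpha_prev * z0_hat + sigma_prev * eps_hat


rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal((4, 8, 8))

# Made-up schedule values with alpha**2 + sigma**2 == 1.
alpha_t, alpha_prev = 0.6, 0.8
sigma_t, sigma_prev = np.sqrt(1 - alpha_t**2), np.sqrt(1 - alpha_prev**2)
zt = alpha_t * z0 + sigma_t * eps

# With a perfect noise prediction, the step lands exactly on the latent
# that the forward process would have produced at the lower noise level.
z_prev = ddim_step(zt, eps, alpha_t, sigma_t, alpha_prev, sigma_prev)
expected = alpha_prev * z0 + sigma_prev * eps
print(np.allclose(z_prev, expected))
```

In practice the prediction is imperfect, which is why the update is repeated over many small steps.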
Stable Diffusion conditions on text prompts via CLIP embeddings fed through cross-attention layers in the U-Net. For GeoWizard, the conditioning will be the input image instead of text — we want geometry conditioned on what the camera sees.
Now the clever part. Stable Diffusion generates RGB images. We want it to generate depth maps and normal maps instead. How?
The same VAE that encodes RGB images can encode depth and normal maps too. A depth map is just a single-channel image (repeated to 3 channels). A normal map is a 3-channel image where R, G, B encode the x, y, z components of the surface normal vector. The VAE doesn't care what the pixels "mean" — it just compresses spatial patterns.
Applying the VAE encoder E to each input gives three latents, Zx = E(x), Zd = E(d), and Zn = E(n), where x is the input image, d is the ground-truth depth, and n is the ground-truth normal map.
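As a rough sketch of the preprocessing this implies, the helpers below turn a depth map and a normal map into 3-channel arrays in the [-1, 1] range that image VAEs typically expect. The min-max normalisation is an assumption for illustration; the source does not say how GeoWizard normalises depth. Both function names are hypothetical.

```python
import numpy as np


def depth_to_vae_input(depth):
    """Min-max normalise an (H, W) depth map to [-1, 1], repeat to 3 channels."""
    d_min, d_max = depth.min(), depth.max()
    norm = (depth - d_min) / max(d_max - d_min, 1e-8)  # -> [0, 1]
    norm = norm * 2.0 - 1.0                            # -> [-1, 1]
    return np.repeat(norm[None, :, :], 3, axis=0)      # (3, H, W)


def normals_to_vae_input(normals):
    """Unit normals (H, W, 3) already lie in [-1, 1]; move channels first."""
    return np.transpose(normals, (2, 0, 1))


depth = np.array([[1.0, 2.0], [3.0, 5.0]])
x_depth = depth_to_vae_input(depth)
print(x_depth.shape, x_depth.min(), x_depth.max())
```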
GeoWizard conditions the diffusion process on the input image in two ways:
GeoWizard starts from the pre-trained Stable Diffusion V2 weights and fine-tunes the entire U-Net. This is critical — models like DDP that train diffusion from scratch need orders of magnitude more data and compute, and still don't capture the rich priors that SD learned from LAION-5B (5 billion image-text pairs).
The fine-tuning is surprisingly lightweight: 20,000 steps on 280K images, 2 days on 8 A100 GPUs (about 16 GPU-days). Compare that to the far larger compute budget baked into Stable Diffusion's pre-training.
This is GeoWizard's signature contribution: depth and normals are generated jointly in a single diffusion process, not independently.
Depth and normals are two views of the same 3D geometry. The normal at a surface point is determined by the local gradient of the depth field (together with the camera geometry). If you estimate them separately, you can get contradictions: the depth says "flat wall" but the normals say "bumpy surface." Joint estimation pushes the two toward geometric consistency.
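The link between the two quantities can be sketched with finite differences. This simplified version assumes an orthographic camera and unit pixel spacing. A real pipeline would account for camera intrinsics, and the sign convention for the normal axes varies between datasets.

```python
import numpy as np


def normals_from_depth(depth):
    """Approximate unit normals from an (H, W) depth map via finite differences.

    Simplifying assumptions: orthographic camera, unit pixel spacing.
    """
    dz_dy, dz_dx = np.gradient(depth)  # gradients along rows, then columns
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)


flat = np.full((4, 4), 2.0)                        # fronto-parallel wall
ramp = np.tile(np.arange(4, dtype=float), (4, 1))  # depth grows along x

print(normals_from_depth(flat)[0, 0])
print(normals_from_depth(ramp)[0, 0])
```

A flat wall gives normals pointing straight along the z axis, while the ramp tilts them away from the direction in which depth increases.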
Rather than two separate U-Nets (expensive, no interaction), GeoWizard uses a single U-Net with a geometry switcher, a small signal that tells the network whether to output depth or normal on each forward pass.
The switcher takes one of two values, sd for depth and sn for normal, both one-dimensional indicator vectors. They're encoded via positional encoding and added to the time embedding in the U-Net. The network shares all its weights; only this tiny signal switches between depth and normal mode.
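A toy sketch of this mechanism: encode the switcher with a sinusoidal positional encoding and add it to the time embedding. The indicator values (0 for depth, 1 for normal), the embedding width, and the absence of any learned projection are all assumptions for illustration, not details confirmed by the source.

```python
import numpy as np


def sinusoidal_encoding(value, dim):
    """Sinusoidal positional encoding of a scalar into a dim-vector (dim even)."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def conditioned_time_embedding(t, switch, dim=8):
    """Time embedding plus the encoded switcher (assumed: 0 = depth, 1 = normal)."""
    return sinusoidal_encoding(t, dim) + sinusoidal_encoding(switch, dim)


emb_depth = conditioned_time_embedding(t=500, switch=0.0)
emb_normal = conditioned_time_embedding(t=500, switch=1.0)
print(np.abs(emb_depth - emb_normal).max())
```

The two embeddings share the same timestep component and differ only by the switcher term, which is all the shared U-Net needs to tell the two modes apart.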
The real magic: during training, GeoWizard processes both depth and normal latents through the U-Net and modifies the self-attention layers so depth can attend to normal features and vice versa. In standard self-attention, queries, keys, and values all come from the same domain.
Here, the depth query instead attends to keys and values that concatenate both depth and normal features, and likewise for normal queries. This cross-domain attention lets depth inform normals and normals inform depth, a mutual guidance that encourages geometric consistency.
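The attention pattern described above can be sketched in a few lines of NumPy. For brevity this omits the learned query, key, and value projections and the multi-head split that a real U-Net attention layer has, so it shows only the data flow, not GeoWizard's actual layer.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for (tokens, dim) arrays."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v


def cross_domain_attention(depth_tokens, normal_tokens):
    """Each domain queries keys/values concatenated from both domains."""
    kv = np.concatenate([depth_tokens, normal_tokens], axis=0)
    return attention(depth_tokens, kv, kv), attention(normal_tokens, kv, kv)


rng = np.random.default_rng(0)
d_tok = rng.standard_normal((5, 16))
n_tok = rng.standard_normal((5, 16))
out_d, out_n = cross_domain_attention(d_tok, n_tok)
print(out_d.shape, out_n.shape)
```

Because the depth queries see normal tokens among their keys and values, changing the normal features changes the depth output, which is the interaction that two separate U-Nets would lack.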
Let's trace the full data flow through GeoWizard, from input image to output depth and normal maps.
The input image x (576×768) is encoded by the Stable Diffusion VAE into a latent Zx (72×96×4). Ground-truth depth d and normal n are similarly encoded to Zd and Zn during training.
Random Gaussian noise is added to Zd and Zn at a shared timestep t, producing noisy latents Ztd and Ztn. Crucially, both use the same timestep — this simplifies learning when the model must handle two modalities simultaneously.
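Each training step runs two forward passes through the shared U-Net, one with the geometry switcher set to depth and one with it set to normal.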
The scene distribution decoupler is GeoWizard's second key innovation. Indoor scenes, outdoor scenes, and isolated objects have very different depth distributions.
If you train on all three mixed together, the diffusion model gets confused: it can't tell whether a mid-range depth value means "close indoor wall" or "distant outdoor building." The scene decoupler resolves this with a one-hot scene-type indicator (indoor/outdoor/object), which is passed through positional encoding and added to the time embedding, just like the geometry switcher.
The full architecture: the image is encoded by the VAE, concatenated with the noisy geometric latents, and processed by a shared U-Net with the geometry switcher and scene decoupler. Cross-domain attention connects the depth and normal branches.
After DDIM denoising (10–50 steps at inference), the clean latents Z0d and Z0n are decoded back to pixel space by the VAE decoder, producing the final depth map d̂ and normal map n̂.
GeoWizard's training is remarkably efficient given its performance. Let's break down the ingredients.
| Domain | Dataset | Samples | Source |
|---|---|---|---|
| Indoor | Hypersim + Replica | 76,347 | Synthetic renders with GT depth/normal |
| Outdoor | 3D Ken Burns + custom simulation | 115,678 | Stereo pairs + synthetic city scenes |
| Objects | Objaverse (filtered) | 85,997 | High-quality 3D objects rendered with GT |
Notice: all training data is synthetic with perfect ground-truth depth and normals. No noisy pseudo-labels. The generalization comes from Stable Diffusion's pre-training, not from diverse real-world depth data.
GeoWizard uses v-prediction as the learning target rather than epsilon-prediction:
V-prediction blends the noise and the signal, which is more numerically stable and converges faster for this task.
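In the notation of the forward process, where the noisy latent is $z_t = \alpha_t z_0 + \sigma_t \epsilon$, the v-prediction target is commonly defined as shown below. The two inversion formulas assume a variance-preserving schedule, $\alpha_t^2 + \sigma_t^2 = 1$, which the text here does not state explicitly.

```latex
\begin{aligned}
v_t      &= \alpha_t\,\epsilon - \sigma_t\,z_0, \\
z_0      &= \alpha_t\,z_t - \sigma_t\,v_t, \\
\epsilon &= \sigma_t\,z_t + \alpha_t\,v_t .
\end{aligned}
```

Substituting $z_t$ and $v_t$ into the second line gives $(\alpha_t^2 + \sigma_t^2)\,z_0 = z_0$, so a network that predicts $v_t$ yields both the clean latent and the noise.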
They also use multi-resolution noise — noise at multiple spatial frequencies rather than a single scale. This is critical because depth and normal maps have many similar values in local regions (think: a flat wall). Single-scale noise struggles with these uniform regions; multi-scale noise preserves low-frequency structure better.
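One way to build such noise is to sum Gaussian noise drawn at progressively coarser resolutions, as sketched below. The number of levels, the per-level weights, and the nearest-neighbour upsampling are illustrative assumptions; the source does not give GeoWizard's exact recipe.

```python
import numpy as np


def multi_res_noise(shape, levels=4, strength=0.5, rng=None):
    """Sum Gaussian noise drawn at progressively coarser resolutions.

    shape: (C, H, W), with H and W divisible by 2 ** (levels - 1).
    """
    if rng is None:
        rng = np.random.default_rng()
    c, h, w = shape
    noise = np.zeros(shape)
    for i in range(levels):
        f = 2 ** i
        coarse = rng.standard_normal((c, h // f, w // f))
        # Nearest-neighbour upsampling back to full resolution.
        up = coarse.repeat(f, axis=1).repeat(f, axis=2)
        noise += (strength ** i) * up
    return noise / noise.std()  # rescale to unit standard deviation


noise = multi_res_noise((4, 16, 16), rng=np.random.default_rng(0))
print(noise.shape, float(noise.std()))
```

The coarse levels add components that stay constant over blocks of pixels, so the noise also perturbs low spatial frequencies instead of only pixel-level detail.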
GeoWizard is evaluated zero-shot on six depth benchmarks and five normal benchmarks. "Zero-shot" means the model was never trained on these datasets — it must generalize from its synthetic training data + diffusion priors.
On standard benchmarks (NYUv2, KITTI, ETH3D, ScanNet), GeoWizard matches or slightly beats Marigold (the closest generative competitor) and comes close to DepthAnything, which was trained on 63.5M images — 227× more data.
But the real story is in generalization. On out-of-distribution images — paintings, anime, renders, underwater photos — DepthAnything's depth maps lose detail and produce incorrect spatial layouts. GeoWizard maintains faithful geometry because its generative prior understands 3D structure, not just pattern matching.
GeoWizard achieves state-of-the-art on zero-shot normal estimation across all benchmarks, beating DSINE (the previous best discriminative method) and Omnidata v2. The improvement is especially visible in fine-grained details: hairlines, architectural textures, and thin structures like chair legs.
A unique advantage of joint estimation: GeoWizard's depth and normals are geometrically consistent with each other. The mean angular error between normals computed from the depth map and the directly predicted normals is 16.2° for GeoWizard's full model, compared to 18.1° without cross-domain attention and 19.1° with separate models.
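The consistency figure is a mean angular error between two normal maps. A minimal sketch of that measurement follows, assuming both maps are given as (H, W, 3) arrays. How the normals are derived from depth, and any masking of invalid pixels, is left out.

```python
import numpy as np


def mean_angular_error_deg(n_a, n_b):
    """Mean angle in degrees between two (H, W, 3) normal maps."""
    n_a = n_a / np.linalg.norm(n_a, axis=-1, keepdims=True)
    n_b = n_b / np.linalg.norm(n_b, axis=-1, keepdims=True)
    cos = np.clip(np.sum(n_a * n_b, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())


facing_z = np.zeros((2, 2, 3))
facing_z[..., 2] = 1.0
facing_x = np.zeros((2, 2, 3))
facing_x[..., 0] = 1.0

print(mean_angular_error_deg(facing_z, facing_z))
print(mean_angular_error_deg(facing_z, facing_x))
```

Identical maps score 0° and perpendicular ones score 90°, so lower values mean the depth-derived and predicted normals agree more closely.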
Comparison of depth (AbsRel, lower is better) and normal (mean angular error, lower is better) across methods. GeoWizard trains on only 0.28M images yet comes close to models trained on 63.5M.
| Configuration | AbsRel ↓ | Normal Mean ↓ | GC ↓ |
|---|---|---|---|
| Separate models (2 U-Nets) | 8.5 | 16.9 | 19.1 |
| w/o Geometry Switcher | 6.9 | 15.0 | 18.1 |
| w/o Scene Decoupler | 7.5 | 16.1 | 16.5 |
| Full GeoWizard | 6.7 | 14.8 | 16.2 |
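The AbsRel column is the mean absolute relative depth error. When the predicted depth is only defined up to scale and shift, a common evaluation protocol first fits both to the ground truth by least squares. The sketch below assumes that protocol and is not taken from GeoWizard's evaluation code. Values in the table appear to be reported as percentages.

```python
import numpy as np


def align_scale_shift(pred, gt):
    """Least-squares s, t minimising ||s * pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s, t


def abs_rel(pred, gt):
    """Mean absolute relative error after scale-shift alignment."""
    s, t = align_scale_shift(pred, gt)
    return float(np.mean(np.abs(s * pred + t - gt) / gt))


rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, (8, 8))
pred = (gt - 2.0) / 3.0  # correct up to an unknown scale and shift

print(align_scale_shift(pred, gt), abs_rel(pred, gt))
```

A prediction that is correct up to scale and shift scores essentially zero, which is why this metric suits affine-invariant depth.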
High-quality depth and normals unlock a cascade of downstream tasks. GeoWizard demonstrates three: 3D reconstruction, geometry-conditioned image generation, and novel view synthesis.
With estimated depth and normals, you can reconstruct a 3D mesh from a single photo. But there's a catch: GeoWizard predicts affine-invariant depth — the relative ordering is correct, but the absolute scale and shift are unknown.
Solution: optimize a scale ŝ and shift t̂ so that the normals computed from the scaled depth match the directly predicted normals. The objective minimizes the angular difference, measured in spherical coordinates, between n̂d and n̂.
Here n̂d is the normal derived from the rescaled depth via least-squares fitting, and n̂ is GeoWizard's predicted normal. This aligns the depth scale with the normals, then the BiNI algorithm fuses both for high-quality surface reconstruction.
The result: detailed 3D meshes from a single photo, capturing fine features like the beard of a stone lion or the folds of a cloak.
GeoWizard's depth and normal maps can condition ControlNet for image generation. The geometry acts as a structural scaffold: "generate a futuristic version of this scene, keeping the same 3D layout." Because GeoWizard's geometry captures fine details, the generated images maintain spatial coherence — doorways stay doorways, chair legs stay thin.
By projecting pixels to 3D using the estimated depth, then re-rendering from a new camera angle, you get a novel view of the scene. GeoWizard's accurate depth produces cleaner warps with fewer artifacts than MiDaS, especially on thin structures and edges.
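The projection step can be sketched with a pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`) are assumed to be known, and the depth is assumed to have had its scale and shift recovered already, since GeoWizard's raw output is affine-invariant. Hole filling and occlusion handling, which a real renderer needs, are omitted.

```python
import numpy as np


def unproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H, W, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel columns, rows
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def project(points, fx, fy, cx, cy):
    """Project camera-space points back to (u, v) pixel coordinates."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)


depth = np.full((4, 6), 2.0)
points = unproject(depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)

# Express the points in a camera moved 0.1 units to the right, then project.
shifted = points - np.array([0.1, 0.0, 0.0])
print(project(shifted, 100.0, 100.0, 3.0, 2.0)[2, 3])
```

Moving the camera to the right shifts every pixel to the left by an amount inversely proportional to its depth, which is why depth errors on thin structures show up as warping artifacts.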
GeoWizard sits at the intersection of diffusion models and 3D geometry estimation. Let's map where it fits.
Marigold is the closest precursor — it also fine-tunes Stable Diffusion for monocular depth estimation. But Marigold has two limitations GeoWizard addresses: (1) it only predicts depth, not normals, so no geometric consistency guarantee; (2) it trains on mixed scene types without decoupling, leading to layout ambiguities where foreground objects get flattened into backgrounds.
DepthAnything takes the opposite approach: massive discriminative training on 63.5M images. It achieves the best quantitative numbers on standard benchmarks but degrades noticeably on out-of-distribution images. GeoWizard shows you can get comparable quality on those benchmarks with about 227× less training data by leveraging generative priors, plus markedly better out-of-distribution generalization.
GeoWizard demonstrates that Stable Diffusion's latent representations encode rich 3D knowledge — a finding with broad implications. If diffusion models understand geometry well enough to produce depth and normals, what else do they know? This opens the door to extracting other physical properties (material, lighting, albedo) from the same pre-trained model.
Metric3D takes yet another approach: using camera intrinsics to produce metric (absolute-scale) depth. GeoWizard produces affine-invariant depth that needs scale recovery. They're complementary — GeoWizard has better detail and generalization, Metric3D has absolute scale.
| Aspect | GeoWizard |
|---|---|
| Base model | Stable Diffusion V2 |
| Input | Single RGB image + scene type prompt |
| Output | Affine-invariant depth + surface normal map |
| Key mechanism | Joint generation via geometry switcher + cross-domain attention |
| Scene handling | Distribution decoupler (indoor/outdoor/object) |
| Training data | 280K synthetic images (Hypersim, Replica, Objaverse, etc.) |
| Training cost | 20K steps, 2 days on 8×A100 |
| Inference | DDIM, 10–50 steps, optional ensemble |
| Key result | SOTA zero-shot depth + normal, robust OOD generalization |