Fu, Yin, Hu, Wang, Ma, Tan, Shen, Lin, Long — CUHK, Adelaide, HKUST, ShanghaiTech, HKU, Light Illusions, 2024

GeoWizard: Unleashing Diffusion Priors for 3D Geometry

Fine-tune Stable Diffusion to jointly output depth and surface normals from a single image. Generative models already "know" 3D — teach them to show it.

Prerequisites: Diffusion models (basics) + Stable Diffusion / latent diffusion + Depth & surface normals
10 chapters · 4+ simulations

Chapter 0: The Problem

You have a single photograph. Maybe it's a living room, maybe it's a painting of a castle, maybe it's an underwater shot of a coral reef. You want to know the depth (how far away is each pixel?) and the surface normal (which direction does the surface face at each pixel?).

This is called monocular geometry estimation — recovering 3D structure from a single 2D image. It's fundamentally ill-posed: infinitely many 3D scenes can produce the same 2D photograph. You need strong priors — knowledge about how the world works — to resolve the ambiguity.
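To make the ambiguity concrete, here is a minimal sketch under an idealized pinhole camera (the `project` helper and the three sample points are made up for illustration). A scene that is twice as large and twice as far away lands on exactly the same pixels, so the image alone cannot tell the two apart.

```python
import numpy as np

def project(points, focal=1.0):
    """Pinhole projection of Nx3 camera-space points to Nx2 image coordinates."""
    return focal * points[:, :2] / points[:, 2:3]

# Three hypothetical 3D points in front of the camera (Z > 0).
scene = np.array([[0.5, 0.2, 2.0],
                  [-1.0, 0.4, 4.0],
                  [0.3, -0.6, 3.0]])

# The same scene, twice as large and twice as far away.
scaled_scene = 2.0 * scene

image_a = project(scene)
image_b = project(scaled_scene)
same_image = np.allclose(image_a, image_b)  # both scenes produce identical 2D points
```

This is only the simplest member of the family of ambiguities; priors are what pick one plausible scene out of the many that fit.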

The discriminative approach

Until recently, the dominant strategy was discriminative: train a CNN or Vision Transformer to directly regress depth/normal maps from images. Models like MiDaS, DPT, and DepthAnything take this approach. Feed in an image, get out a depth map. Simple, fast, and it works — on images that look like the training data.

The problem? These models are trained on specific datasets: indoor scenes (NYUv2), driving scenes (KITTI), or a mix. When you show them something they've never seen — a painting, an anime frame, a satellite photo, a rendered game scene — they fail. The depth maps lose detail, foreground and background blend together, and surface normals become smeared and inaccurate.

The generalization gap: Discriminative depth models are pattern matchers. They learn "indoor scene with furniture = these depth patterns." But they can't hallucinate plausible geometry for an image type they've never encountered. A watercolor painting of a mountain? A screenshot from Minecraft? They have no prior for that.

Even DepthAnything, trained on 63.5 million images, shows significant performance drops on "unreal" images — artwork, renders, and generated images — because its discriminative backbone can only interpolate within its training distribution, not extrapolate beyond it.

Discriminative vs Generative Failure Modes

[Interactive figure: for each selectable image type, compares the depth quality of discriminative models on out-of-distribution inputs against diffusion-based generative models. The depth quality bars are based on GeoWizard's benchmark results.]

Why do discriminative depth models (CNNs/ViTs) fail on unusual image types like artwork or game renders?

Chapter 1: The Key Insight

Here's the core idea behind GeoWizard, and it's beautifully simple:

Stable Diffusion was trained on billions of internet images to generate photorealistic scenes. To do that, it had to learn how the 3D world works — lighting, occlusion, perspective, surface orientation. That knowledge is encoded in its weights. We just need to redirect the output from "generate an RGB image" to "generate a depth map and a normal map."

The leap: A generative model that can create realistic images of a kitchen must understand that countertops are flat, walls are vertical, and objects have depth. That geometric understanding is a diffusion prior — learned implicitly from billions of image-text pairs. GeoWizard fine-tunes Stable Diffusion to make this implicit knowledge explicit.

Why is this better than discriminative models?

Think of it this way: a discriminative model is like a student who memorized answers to specific exam questions. A generative model is like a student who truly understands the subject — they can answer questions they've never seen before because they understand the underlying principles.

Discriminative Approach
Train a CNN/ViT to regress depth directly from labeled depth datasets. Fast inference, but limited to the training distribution. No intrinsic 3D understanding.
↓ vs ↓
GeoWizard's Generative Approach
Fine-tune Stable Diffusion. Inherits 3D priors learned from billions of images. Generates depth + normals via iterative denoising. Generalizes across visual domains (photos, artwork, renders).
Why can Stable Diffusion serve as a strong prior for 3D geometry estimation?

Chapter 2: Stable Diffusion Recap

Before we see how GeoWizard modifies Stable Diffusion, let's quickly recap how the original model works. If you're already fluent in latent diffusion, skim this and jump to Chapter 3.

The latent space trick

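Raw images are huge: a 512×512 RGB image has 786,432 pixel values (512 × 512 × 3). Running diffusion directly in pixel space is painfully slow. Stable Diffusion solves this with a VAE (Variational Autoencoder): the encoder compresses the image into a much smaller latent (64×64×4 for a 512×512 input, an 8× reduction along each side), and the decoder maps latents back to pixels.

All of the diffusion happens in this compact latent space, which is much faster at nearly the same output quality.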
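A quick back-of-the-envelope check of the savings, assuming the usual Stable Diffusion VAE layout of 8× spatial downsampling and 4 latent channels (the helper name `num_values` is just for illustration):

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels

pixel_values = num_values(512, 512, 3)             # RGB image
latent_values = num_values(512 // 8, 512 // 8, 4)  # 64 x 64 x 4 latent
compression = pixel_values / latent_values         # how many times fewer values
```

Under these assumptions the U-Net works on 48× fewer values per image than it would in pixel space.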

Forward process: adding noise

Take a clean latent z_0. At each timestep t, add Gaussian noise:

z_t = α_t z_0 + σ_t ε,    ε ~ N(0, I)

Where α_t and σ_t are schedule parameters that control the noise level. At t=T, it's pure noise. At t=0, it's the clean latent.
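Here is a minimal NumPy sketch of the forward process. The cosine schedule in `schedule` is an assumption chosen because it makes the two endpoints easy to see; Stable Diffusion's actual schedule is different, and the function names are illustrative.

```python
import numpy as np

def schedule(t, T=1000):
    """Assumed cosine schedule with alpha_t^2 + sigma_t^2 = 1 (not SD's real one)."""
    angle = 0.5 * np.pi * t / T
    return np.cos(angle), np.sin(angle)

def add_noise(z0, t, rng, T=1000):
    """Sample z_t = alpha_t * z0 + sigma_t * eps and return it with the noise used."""
    alpha_t, sigma_t = schedule(t, T)
    eps = rng.standard_normal(z0.shape)
    return alpha_t * z0 + sigma_t * eps, eps

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 64, 64))            # stand-in for a clean latent

z_start, _ = add_noise(z0, t=0, rng=rng)         # t = 0: still the clean latent
z_end, eps_end = add_noise(z0, t=1000, rng=rng)  # t = T: essentially pure noise
```

With this schedule, `z_start` equals `z0` and `z_end` equals the sampled noise, matching the two endpoints described above.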

Reverse process: denoising

A U-Net learns to predict the noise that was added. Given a noisy latent z_t and the timestep t, it predicts ε̂. Subtract the predicted noise, and you get a slightly cleaner latent z_{t-1}. Repeat T times: pure noise → clean image.

Conditioning

Stable Diffusion conditions on text prompts via CLIP embeddings fed through cross-attention layers in the U-Net. For GeoWizard, the conditioning will be the input image instead of text — we want geometry conditioned on what the camera sees.

Key components to remember: (1) A VAE that maps images to/from a latent space. (2) A U-Net that denoises in latent space. (3) A noise schedule that controls the diffusion process. (4) Cross-attention layers for conditioning. GeoWizard keeps the VAE as-is and adapts the U-Net, the noise, and the conditioning in subtle ways.
Why does Stable Diffusion work in latent space rather than pixel space?

Chapter 3: Adapting Diffusion for Geometry

Now the clever part. Stable Diffusion generates RGB images. We want it to generate depth maps and normal maps instead. How?

Step 1: Encode everything into latent space

The same VAE that encodes RGB images can encode depth and normal maps too. A depth map is just a single-channel image (repeated to 3 channels). A normal map is a 3-channel image where R, G, B encode the x, y, z components of the surface normal vector. The VAE doesn't care what the pixels "mean" — it just compresses spatial patterns.

Z^x = VAE(x),   Z^d = VAE(d),   Z^n = VAE(n)

Where x is the input image, d is the ground-truth depth, and n is the ground-truth normal map.
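Before the VAE sees them, depth and normals have to look like ordinary 3-channel images. The sketch below shows one plausible way to prepare them. The min-max normalization to [-1, 1] is an assumption (the exact normalization GeoWizard uses is not specified here), and both helper names are hypothetical.

```python
import numpy as np

def depth_to_image(depth):
    """Min-max normalize an HxW depth map to [-1, 1] and repeat it to 3 channels."""
    d_min, d_max = depth.min(), depth.max()
    normalized = 2.0 * (depth - d_min) / (d_max - d_min) - 1.0
    return np.repeat(normalized[None, :, :], 3, axis=0)

def normals_to_image(normals):
    """Rescale 3xHxW normals to unit length; x, y, z then lie in [-1, 1] like RGB."""
    norm = np.linalg.norm(normals, axis=0, keepdims=True)
    return normals / np.clip(norm, 1e-8, None)

depth = np.linspace(1.0, 5.0, 12).reshape(3, 4)   # toy depth map
normals = np.zeros((3, 3, 4))
normals[2] = 1.0                                  # toy normals, all along +z

depth_img = depth_to_image(depth)                 # shape (3, 3, 4), identical channels
normal_img = normals_to_image(normals)            # shape (3, 3, 4)
```

Both outputs have the same shape and value range as a normalized RGB image, which is all the VAE encoder needs.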

Step 2: Condition on the input image

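GeoWizard conditions the diffusion process on the input image in two ways:

  1. CLIP image embedding: the image is encoded by CLIP and fed to the U-Net through its cross-attention layers, in place of the text embedding.
  2. Latent concatenation: the image latent Z^x is concatenated with the noisy geometry latent along the channel dimension at the U-Net input.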

Why both? CLIP embeddings offer global context ("this is an outdoor scene with mountains"). Latent concatenation offers precise spatial alignment ("this pixel is on a wall edge"). Together, they reduce randomness and ensure the generated geometry faithfully matches the input image.

Step 3: Fine-tune, don't train from scratch

GeoWizard starts from the pre-trained Stable Diffusion V2 weights and fine-tunes the entire U-Net. This is critical — models like DDP that train diffusion from scratch need orders of magnitude more data and compute, and still don't capture the rich priors that SD learned from LAION-5B (5 billion image-text pairs).

The fine-tuning is surprisingly lightweight: 20,000 steps on 280K images, taking 2 days on 8 A100 GPUs (about 384 GPU-hours). Compare that to the GPU-years of compute baked into Stable Diffusion's pre-training.

How does GeoWizard condition the U-Net on the input image?

Chapter 4: Joint Depth-Normal Generation

This is GeoWizard's signature contribution: depth and normals are generated jointly in a single diffusion process, not independently.

Why joint estimation matters

Depth and normals are two views of the same 3D geometry. The normal at a surface point is determined by the local gradient of the depth field: it is the unit vector perpendicular to the surface that the depth map traces out. If you estimate them separately, you can get contradictions: the depth says "flat wall" but the normals say "bumpy surface." Joint estimation ensures geometric consistency.
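The link between the two can be sketched in a few lines. This version assumes an orthographic camera and the sign convention n ∝ (−∂z/∂x, −∂z/∂y, 1); a real pipeline would use perspective back-projection, and `normals_from_depth` is an illustrative name.

```python
import numpy as np

def normals_from_depth(depth):
    """Per-pixel unit normals of the surface z = depth(x, y), orthographic assumption."""
    dz_dy, dz_dx = np.gradient(depth)   # gradients along rows (y) and columns (x)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=0)
    return n / np.linalg.norm(n, axis=0, keepdims=True)

# A fronto-parallel wall: constant depth, so every normal lies along the viewing axis.
flat = normals_from_depth(np.full((4, 5), 2.0))

# A plane tilted along x: depth grows by 1 per pixel column.
ys, xs = np.mgrid[0:4, 0:5]
tilted = normals_from_depth(xs.astype(float))
```

A flat wall yields (0, 0, 1) everywhere, while the tilted plane yields (−1, 0, 1)/√2. A normal map that disagrees with values like these for the same depth map is exactly the kind of contradiction joint estimation is meant to avoid.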

The geometry switcher

Rather than two separate U-Nets (expensive, no interaction), GeoWizard uses a single U-Net with a geometry switcher — a small signal that tells the network whether to output depth or normal at each forward pass:

d̂ = f(x, s_d),    n̂ = f(x, s_n)

Where s_d and s_n are one-dimensional indicator vectors. They're encoded via positional encoding and added to the time embedding in the U-Net. The network shares all its weights — only a tiny signal switches between depth and normal mode.
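As a rough sketch of the switcher, the snippet below encodes a scalar flag with a standard sinusoidal positional encoding and adds it to a time embedding. The flag values 0 and 1, the embedding width of 8, and the function names are all assumptions for illustration; a real U-Net would also pass the time embedding through learned layers.

```python
import numpy as np

def sinusoidal_encoding(value, dim=8):
    """Standard sinusoidal positional encoding of a scalar into a dim-vector."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

S_DEPTH, S_NORMAL = 0.0, 1.0   # hypothetical indicator values for the two modes

def conditioned_embedding(timestep, switch, dim=8):
    # The switcher encoding is added to the time embedding, so the same
    # U-Net weights see a different conditioning vector per output domain.
    return sinusoidal_encoding(timestep, dim) + sinusoidal_encoding(switch, dim)

emb_depth = conditioned_embedding(500, S_DEPTH)
emb_normal = conditioned_embedding(500, S_NORMAL)
```

The two embeddings differ only by the encoded flag, which is the "tiny signal" that tells the shared network which map to produce.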

Cross-domain geometric self-attention

The real magic: during training, GeoWizard processes both depth and normal latents through the U-Net, and modifies the self-attention layers so depth can attend to normal features and vice versa. In standard self-attention, queries, keys, and values all come from the same domain. Here, the depth branch instead computes:

q_d = Q · ẑ_d,   k_d = K · (ẑ_d ⊕ ẑ_n),   v_d = V · (ẑ_d ⊕ ẑ_n)

The depth query attends to keys and values that concatenate both depth and normal features. Similarly for normal queries. This cross-domain attention lets depth inform normals and normals inform depth — mutual guidance that enforces geometric consistency.
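A single-head NumPy sketch of the depth branch makes the shapes explicit. It assumes the concatenation ⊕ is along the token axis and uses random matrices in place of learned projections; `cross_domain_attention` is an illustrative name, not GeoWizard's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_domain_attention(z_d, z_n, Wq, Wk, Wv):
    """Depth tokens attend over the concatenation of depth and normal tokens."""
    both = np.concatenate([z_d, z_n], axis=0)           # (2N, C): depth then normal tokens
    q = z_d @ Wq                                        # (N, C)  queries from depth only
    k = both @ Wk                                       # (2N, C) keys from both domains
    v = both @ Wv                                       # (2N, C) values from both domains
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, 2N)
    return weights @ v, weights

rng = np.random.default_rng(0)
N, C = 6, 4
z_d, z_n = rng.standard_normal((N, C)), rng.standard_normal((N, C))
Wq, Wk, Wv = (rng.standard_normal((C, C)) for _ in range(3))

out_d, attn = cross_domain_attention(z_d, z_n, Wq, Wk, Wv)
normal_share = attn[:, N:].sum(axis=1)   # attention mass each depth token puts on normal tokens
```

Every depth token spends part of its attention on normal tokens (`normal_share` is strictly positive), which is the channel through which normals can inform depth. The normal branch would mirror this with the roles swapped.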

The payoff: In ablations, removing cross-domain attention increases the geometric consistency error from 16.2 to 18.1 degrees. The depth and normal maps produced without it show visible contradictions, especially in distant regions. Joint estimation isn't a nice-to-have — it's essential.

Diffusion Denoising for Geometry (SHOWCASE)

[Interactive figure: a timestep slider, from t = T (pure noise) down to t = 0, shows noise gradually resolving into depth and normal maps. A mode toggle switches between depth-only and joint output to show how joint generation produces more consistent geometry.]

What is the purpose of cross-domain geometric self-attention in GeoWizard?

Chapter 5: The Architecture

Let's trace the full data flow through GeoWizard, from input image to output depth and normal maps.

Encoding

The input image x (576×768) is encoded by the Stable Diffusion VAE into a latent Z^x (72×96×4). Ground-truth depth d and normal n are similarly encoded to Z^d and Z^n during training.

Noising (training only)

Random Gaussian noise is added to Z^d and Z^n at a shared timestep t, producing noisy latents Z_t^d and Z_t^n. Crucially, both use the same timestep, which simplifies learning when the model must handle two modalities simultaneously.

U-Net forward pass

Two forward passes per training step:

  1. Depth pass: Concatenate [Z^x, Z_t^d] along channels. Feed through the U-Net with the geometry switcher set to s_d and scene prompt s_i. Cross-domain attention accesses Z_t^n.
  2. Normal pass: Concatenate [Z^x, Z_t^n] along channels. Feed through the U-Net with the geometry switcher set to s_n and scene prompt s_i. Cross-domain attention accesses Z_t^d.
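The input assembly for the two passes can be sketched with random stand-in latents. The shapes follow the 576×768 input described above; the noise level (0.8, 0.6) is an arbitrary illustrative choice, and no actual U-Net is called.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 576 // 8, 768 // 8                 # 72 x 96 latent grid
z_x = rng.standard_normal((4, H, W))      # image latent (kept clean)
z_d = rng.standard_normal((4, H, W))      # stand-in depth latent
z_n = rng.standard_normal((4, H, W))      # stand-in normal latent

alpha_t, sigma_t = 0.8, 0.6               # one shared noise level for both domains
z_t_d = alpha_t * z_d + sigma_t * rng.standard_normal(z_d.shape)
z_t_n = alpha_t * z_n + sigma_t * rng.standard_normal(z_n.shape)

depth_input = np.concatenate([z_x, z_t_d], axis=0)    # U-Net input for the depth pass
normal_input = np.concatenate([z_x, z_t_n], axis=0)   # U-Net input for the normal pass
```

Each pass therefore feeds the U-Net an 8-channel tensor: 4 clean image channels for spatial alignment plus 4 noisy geometry channels to denoise.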

Scene distribution decoupler

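This is GeoWizard's second key innovation. Different scene types have very different depth distributions, and GeoWizard's training data spans three of them: indoor scenes, outdoor scenes, and isolated objects.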

If you train on all three mixed together, the diffusion model gets confused — it doesn't know if a mid-range depth value means "close indoor wall" or "distant outdoor building." The scene decoupler adds a one-hot scene type indicator (indoor/outdoor/object) as another positional encoding added to the time embedding.

Why this works: By explicitly telling the model what type of scene it's looking at, you split one hard distribution (all scenes mixed) into three simpler sub-distributions that are each easier to learn. The ablation confirms: removing the decoupler increases AbsRel error from 6.7 to 7.5 and consistency error from 16.2 to 16.5.
GeoWizard Pipeline

The full architecture: image is encoded by VAE, concatenated with noisy geometric latents, processed by a shared U-Net with geometry switcher and scene decoupler. Cross-domain attention connects depth and normal branches.

Decoding

After DDIM denoising (10–50 steps at inference), the clean latents Z_0^d and Z_0^n are decoded back to pixel space by the VAE decoder, producing the final depth map d̂ and normal map n̂.
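To see what a DDIM loop does mechanically, here is a toy 10-step sampler. It makes several simplifying assumptions: a cosine schedule, epsilon-prediction rather than GeoWizard's v-prediction, and an oracle that returns the true noise in place of the U-Net. The names `schedule` and `ddim_step` are illustrative.

```python
import numpy as np

def schedule(t, T=1000):
    """Assumed cosine schedule with alpha_t^2 + sigma_t^2 = 1."""
    angle = 0.5 * np.pi * t / T
    return np.cos(angle), np.sin(angle)

def ddim_step(z_t, eps_hat, t, t_next):
    """One deterministic DDIM update from timestep t to t_next < t."""
    alpha_t, sigma_t = schedule(t)
    alpha_n, sigma_n = schedule(t_next)
    z0_hat = (z_t - sigma_t * eps_hat) / alpha_t   # current estimate of the clean latent
    return alpha_n * z0_hat + sigma_n * eps_hat    # re-noise to the lower noise level

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 72, 96))              # stand-in clean geometry latent
eps = rng.standard_normal(z0.shape)

timesteps = np.linspace(900, 0, 11)                # 10 steps from t=900 down to t=0
alpha, sigma = schedule(timesteps[0])
z = alpha * z0 + sigma * eps
for t, t_next in zip(timesteps[:-1], timesteps[1:]):
    z = ddim_step(z, eps, t, t_next)               # oracle: true noise stands in for the U-Net
```

With a perfect noise estimate the loop lands exactly on `z0`. In practice the U-Net's estimate is imperfect, which is why more steps generally help.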

Why does GeoWizard use a scene distribution decoupler instead of training on all scene types jointly?

Chapter 6: Training

GeoWizard's training is remarkably efficient given its performance. Let's break down the ingredients.

Training data: 280K images from three domains

| Domain | Dataset | Samples | Source |
|---|---|---|---|
| Indoor | Hypersim + Replica | 76,347 | Synthetic renders with GT depth/normal |
| Outdoor | 3D Ken Burns + custom simulation | 115,678 | Stereo pairs + synthetic city scenes |
| Objects | Objaverse (filtered) | 85,997 | High-quality 3D objects rendered with GT |

Notice: all training data is synthetic with perfect ground-truth depth and normals. No noisy pseudo-labels. The generalization comes from Stable Diffusion's pre-training, not from diverse real-world depth data.

Loss function: v-prediction with multi-scale noise

GeoWizard uses v-prediction as the learning target rather than epsilon-prediction:

v_t^d = α_t ε_t^d − σ_t Z^d

V-prediction blends the noise and the signal, which is more numerically stable and converges faster for this task.
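A short numerical check shows why v is a convenient target: when α_t² + σ_t² = 1, a prediction of v is enough to recover both the clean latent and the noise. The values 0.8 and 0.6 and the random latent are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 72, 96))     # stand-in clean depth latent
eps = rng.standard_normal(z0.shape)
alpha_t, sigma_t = 0.8, 0.6               # satisfies alpha^2 + sigma^2 = 1

z_t = alpha_t * z0 + sigma_t * eps        # noisy latent
v_target = alpha_t * eps - sigma_t * z0   # what the U-Net is trained to predict

# Given a perfect v prediction, both the signal and the noise can be recovered.
z0_rec = alpha_t * z_t - sigma_t * v_target
eps_rec = sigma_t * z_t + alpha_t * v_target
```

Expanding the two recovery lines with the definitions of `z_t` and `v_target` gives back `z0` and `eps` exactly, because the cross terms cancel and α_t² + σ_t² = 1.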

They also use multi-resolution noise — noise at multiple spatial frequencies rather than a single scale. This is critical because depth and normal maps have many similar values in local regions (think: a flat wall). Single-scale noise struggles with these uniform regions; multi-scale noise preserves low-frequency structure better.
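One common way to build such noise is a pyramid: draw Gaussian noise on progressively coarser grids, upsample, and add it to the fine-scale noise. The sketch below is an assumption-laden illustration of that idea (nearest-neighbor upsampling, 4 levels, a decay of 0.9 per level), not GeoWizard's exact recipe.

```python
import numpy as np

def multi_res_noise(shape, rng, levels=4, strength=0.9):
    """Sum Gaussian noise drawn at progressively coarser grids, then renormalize."""
    c, h, w = shape
    noise = rng.standard_normal(shape)
    for i in range(1, levels):
        f = 2 ** i
        if h % f or w % f:
            break                                        # stop if the grid no longer divides evenly
        coarse = rng.standard_normal((c, h // f, w // f))
        up = coarse.repeat(f, axis=1).repeat(f, axis=2)  # nearest-neighbor upsampling
        noise += (strength ** i) * up
    return noise / noise.std()                           # back to unit standard deviation

def neighbor_corr(x):
    """Correlation between horizontally adjacent values."""
    return np.corrcoef(x[..., :-1].ravel(), x[..., 1:].ravel())[0, 1]

rng = np.random.default_rng(0)
white = rng.standard_normal((4, 72, 96))
pyramid = multi_res_noise((4, 72, 96), rng)
```

Comparing `neighbor_corr(white)` with `neighbor_corr(pyramid)` shows the difference: ordinary Gaussian noise has essentially uncorrelated neighbors, while the pyramid noise is spatially correlated, so it also perturbs the low-frequency content that dominates flat regions.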

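Training details

Fine-tuning starts from the pre-trained Stable Diffusion V2 weights and runs for 20,000 steps on 8 A100 GPUs (about 2 days), using 576×768 inputs.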

Ensemble trick at inference: GeoWizard can run multiple denoising passes with different random noise initializations and average the results. This reduces noise and improves quality at the cost of proportionally more compute. In benchmarks, ensembling 3 passes typically gives the best quality-speed tradeoff.
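
The averaging step can be illustrated with a toy stand-in for the model. Here `fake_denoise` is a hypothetical placeholder that returns the true map plus independent random error per seed; real denoising passes will not have perfectly independent errors, so treat the size of the gain as illustrative only.

```python
import numpy as np

def fake_denoise(truth, rng, noise_level=0.1):
    """Stand-in for one full denoising pass: the true map plus seed-dependent error."""
    return truth + noise_level * rng.standard_normal(truth.shape)

rng = np.random.default_rng(0)
truth = np.linspace(0.0, 1.0, 72 * 96).reshape(72, 96)   # toy ground-truth depth

single = fake_denoise(truth, rng)
passes = np.stack([fake_denoise(truth, rng) for _ in range(3)])
ensembled = passes.mean(axis=0)                           # average the 3 passes per pixel

err_single = np.abs(single - truth).mean()
err_ensembled = np.abs(ensembled - truth).mean()
```

In this toy setup the ensembled error is smaller than the single-pass error, at three times the compute.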
Why does GeoWizard use multi-resolution noise instead of standard single-scale Gaussian noise?

Chapter 7: Results

GeoWizard is evaluated zero-shot on six depth benchmarks and five normal benchmarks. "Zero-shot" means the model was never trained on these datasets — it must generalize from its synthetic training data + diffusion priors.

Depth estimation

On standard benchmarks (NYUv2, KITTI, ETH3D, ScanNet), GeoWizard matches or slightly beats Marigold (the closest generative competitor) and comes close to DepthAnything, which was trained on 63.5M images — 227× more data.

But the real story is in generalization. On out-of-distribution images — paintings, anime, renders, underwater photos — DepthAnything's depth maps lose detail and produce incorrect spatial layouts. GeoWizard maintains faithful geometry because its generative prior understands 3D structure, not just pattern matching.

Normal estimation

GeoWizard achieves state-of-the-art on zero-shot normal estimation across all benchmarks, beating DSINE (the previous best discriminative method) and Omnidata v2. The improvement is especially visible in fine-grained details: hairlines, architectural textures, and thin structures like chair legs.

Geometric consistency

A unique advantage of joint estimation: GeoWizard's depth and normals are geometrically consistent with each other. The mean angular error between normals computed from the depth map and the directly predicted normals is 16.2° for GeoWizard's full model, compared to 18.1° without cross-domain attention and 19.1° with separate models.
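The consistency number is a mean angular error between two normal fields. A minimal sketch of that metric, using toy fields where the answer is known in advance (`mean_angular_error_deg` is an illustrative name):

```python
import numpy as np

def mean_angular_error_deg(n_a, n_b):
    """Mean angle in degrees between two 3xHxW fields of unit normals."""
    cos = np.clip((n_a * n_b).sum(axis=0), -1.0, 1.0)   # per-pixel dot product
    return np.degrees(np.arccos(cos)).mean()

# Two toy normal fields: one along +z, one tilted 30 degrees toward +x.
h, w = 4, 5
n_pred = np.zeros((3, h, w))
n_pred[2] = 1.0
theta = np.radians(30.0)
n_from_depth = np.zeros((3, h, w))
n_from_depth[0] = np.sin(theta)
n_from_depth[2] = np.cos(theta)

gc_error = mean_angular_error_deg(n_from_depth, n_pred)   # 30 degrees by construction
```

In the real evaluation, `n_from_depth` would be computed from the predicted depth map and `n_pred` would be the directly predicted normals.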

Benchmark Results

Comparison of depth (AbsRel, lower is better) and normal (Mean angular error, lower is better) across methods. GeoWizard trains on only 0.28M images yet matches models trained on 63.5M.
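
For reference, AbsRel is the mean of |prediction − ground truth| / ground truth. Affine-invariant predictions are typically aligned to the ground truth with a least-squares scale and shift before scoring; the sketch below assumes that protocol and uses synthetic data, with illustrative function names.

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t such that s * pred + t best matches gt."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return np.mean(np.abs(pred - gt) / gt)

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(72, 96))   # toy metric depth, strictly positive
affine_pred = 0.1 * gt - 0.05                # right structure, wrong scale and shift

raw_error = abs_rel(affine_pred, gt)
aligned_error = abs_rel(align_scale_shift(affine_pred, gt), gt)
```

The raw prediction scores badly even though its structure is perfect; after alignment the error drops to essentially zero, which is why the alignment step matters when comparing affine-invariant methods.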


Ablation summary

| Configuration | AbsRel ↓ | Normal Mean ↓ | GC ↓ |
|---|---|---|---|
| Separate models (2 U-Nets) | 8.5 | 16.9 | 19.1 |
| w/o Geometry Switcher | 6.9 | 15.0 | 18.1 |
| w/o Scene Decoupler | 7.5 | 16.1 | 16.5 |
| Full GeoWizard | 6.7 | 14.8 | 16.2 |

What is GeoWizard's key advantage over DepthAnything in real-world usage?

Chapter 8: Applications

High-quality depth and normals unlock a cascade of downstream tasks. GeoWizard demonstrates three:

1. Single-image 3D reconstruction

With estimated depth and normals, you can reconstruct a 3D mesh from a single photo. But there's a catch: GeoWizard predicts affine-invariant depth — the relative ordering is correct, but the absolute scale and shift are unknown.

Solution: optimize a scale ŝ and shift t̂ so that the normals computed from the rescaled depth match the directly predicted normals. Minimize the angular difference D between the two, measured in spherical coordinates:

min_{ŝ, t̂} D(n̂_d, n̂)

Where n̂_d is the normal derived from the rescaled depth ŝ·d̂ + t̂ via least-squares fitting, and n̂ is GeoWizard's predicted normal. This aligns the depth scale with the normals; the BiNI algorithm then fuses both for high-quality surface reconstruction.

The result: detailed 3D meshes from a single photo, capturing fine features like the beard of a stone lion or the folds of a cloak.

2. Depth-guided image generation

GeoWizard's depth and normal maps can condition ControlNet for image generation. The geometry acts as a structural scaffold: "generate a futuristic version of this scene, keeping the same 3D layout." Because GeoWizard's geometry captures fine details, the generated images maintain spatial coherence — doorways stay doorways, chair legs stay thin.

3. Novel view synthesis

By projecting pixels to 3D using the estimated depth, then re-rendering from a new camera angle, you get a novel view of the scene. GeoWizard's accurate depth produces cleaner warps with fewer artifacts than MiDaS, especially on thin structures and edges.
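The project-then-re-render idea can be sketched with a pinhole camera. The intrinsics, the toy depth map, and the 0.1-unit camera shift are all hypothetical, and the sketch assumes the depth has already been brought to a usable scale. It also stops at computing where each pixel lands, leaving out occlusion handling and hole filling.

```python
import numpy as np

def unproject(depth, f, cx, cy):
    """Lift an HxW depth map to an (H*W)x3 point cloud using pinhole intrinsics."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) / f * depth
    y = (v - cy) / f * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def project(points, f, cx, cy):
    """Project Nx3 camera-space points back to pixel coordinates."""
    u = f * points[:, 0] / points[:, 2] + cx
    v = f * points[:, 1] / points[:, 2] + cy
    return np.stack([u, v], axis=-1)

f, cx, cy = 100.0, 2.0, 1.5                   # hypothetical intrinsics
depth = np.full((4, 5), 2.0)                  # toy depth: a wall 2 units away
points = unproject(depth, f, cx, cy)

same_view = project(points, f, cx, cy)        # reproduces the original pixel grid
shifted = points - np.array([0.1, 0.0, 0.0])  # camera moved 0.1 units to the right
new_view = project(shifted, f, cx, cy)        # every pixel slides 5 px to the left
```

The shift of f × 0.1 / depth = 5 pixels depends directly on depth, so errors in the depth map translate straight into warping artifacts in the new view.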

A unified geometric foundation: All three applications benefit from the same two outputs: a high-quality depth map and a geometrically consistent normal map. GeoWizard acts as a geometry backbone that any downstream pipeline can build on.
How does GeoWizard recover absolute depth scale from its affine-invariant depth prediction?

Chapter 9: Connections

GeoWizard sits at the intersection of diffusion models and 3D geometry estimation. Let's map where it fits.

Relation to Marigold

Marigold is the closest precursor — it also fine-tunes Stable Diffusion for monocular depth estimation. But Marigold has two limitations GeoWizard addresses: (1) it only predicts depth, not normals, so no geometric consistency guarantee; (2) it trains on mixed scene types without decoupling, leading to layout ambiguities where foreground objects get flattened into backgrounds.

Relation to DepthAnything

DepthAnything takes the opposite approach: massive discriminative training on 63.5M images. It achieves the best quantitative numbers on standard benchmarks but degrades noticeably on out-of-distribution images such as artwork and renders. GeoWizard shows you can achieve comparable in-distribution quality with 227× less data by leveraging generative priors, plus markedly better out-of-distribution generalization.

Relation to Stable Diffusion

GeoWizard demonstrates that Stable Diffusion's latent representations encode rich 3D knowledge — a finding with broad implications. If diffusion models understand geometry well enough to produce depth and normals, what else do they know? This opens the door to extracting other physical properties (material, lighting, albedo) from the same pre-trained model.

Relation to Metric3D / Metric3D v2

Metric3D takes yet another approach: using camera intrinsics to produce metric (absolute-scale) depth. GeoWizard produces affine-invariant depth that needs scale recovery. They're complementary — GeoWizard has better detail and generalization, Metric3D has absolute scale.

Cheat Sheet

| Aspect | GeoWizard |
|---|---|
| Base model | Stable Diffusion V2 |
| Input | Single RGB image + scene type prompt |
| Output | Affine-invariant depth + surface normal map |
| Key mechanism | Joint generation via geometry switcher + cross-domain attention |
| Scene handling | Distribution decoupler (indoor/outdoor/object) |
| Training data | 280K synthetic images (Hypersim, Replica, Objaverse, etc.) |
| Training cost | 20K steps, 2 days on 8×A100 |
| Inference | DDIM, 10–50 steps, optional ensemble |
| Key result | SOTA zero-shot depth + normal, robust OOD generalization |

The broader lesson: Large generative models pre-trained on web-scale data contain surprisingly rich representations of 3D geometry. Rather than building specialized geometry models from scratch, we can unlock this knowledge through lightweight fine-tuning — getting better results with orders of magnitude less data and compute.
What are GeoWizard's two key innovations beyond simply fine-tuning Stable Diffusion for depth prediction?