GeoWizard — Veanors

Chapter 0: The Problem

You have a single photograph. Maybe it's a living room, maybe it's a painting of a castle, maybe it's an underwater shot of a coral reef. You want to know the depth (how far away is each pixel?) and the surface normal (which direction does the surface face at each pixel?).

This is called monocular geometry estimation — recovering 3D structure from a single 2D image. It's fundamentally ill-posed: infinitely many 3D scenes can produce the same 2D photograph. You need strong priors — knowledge about how the world works — to resolve the ambiguity.

The discriminative approach

Until recently, the dominant strategy was discriminative: train a CNN or Vision Transformer to directly regress depth/normal maps from images. Models like MiDaS, DPT, and DepthAnything take this approach. Feed in an image, get out a depth map. Simple, fast, and it works — on images that look like the training data.

The problem? These models are trained on specific datasets: indoor scenes (NYUv2), driving scenes (KITTI), or a mix. When you show them something they've never seen — a painting, an anime frame, a satellite photo, a rendered game scene — they fail. The depth maps lose detail, foreground and background blend together, and surface normals become smeared and inaccurate.

The generalization gap: Discriminative depth models are pattern matchers. They learn "indoor scene with furniture = these depth patterns." But they can't hallucinate plausible geometry for an image type they've never encountered. A watercolor painting of a mountain? A screenshot from Minecraft? They have no prior for that.

Even DepthAnything, trained on 63.5 million images, shows significant performance drops on "unreal" images — artwork, renders, and generated images — because its discriminative backbone can only interpolate within its training distribution, not extrapolate beyond it.

Discriminative vs Generative Failure Modes

[Interactive figure: for each image type, compares how discriminative models struggle with out-of-distribution inputs while generative (diffusion-based) models maintain quality. The depth quality bars are based on GeoWizard's benchmark results.]

Why do discriminative depth models (CNNs/ViTs) fail on unusual image types like artwork or game renders?

They can only interpolate within their training distribution: they have no learned prior for image types they've never seen, so they can't produce plausible geometry
They run too slowly on high-resolution images
They can only predict depth, not surface normals

Chapter 1: The Key Insight

Here's the core idea behind GeoWizard, and it's beautifully simple:

Stable Diffusion was trained on billions of internet images to generate photorealistic scenes. To do that, it had to learn how the 3D world works — lighting, occlusion, perspective, surface orientation. That knowledge is encoded in its weights. We just need to redirect the output from "generate an RGB image" to "generate a depth map and a normal map."

The leap: A generative model that can create realistic images of a kitchen must understand that countertops are flat, walls are vertical, and objects have depth. That geometric understanding is a diffusion prior — learned implicitly from billions of image-text pairs. GeoWizard fine-tunes Stable Diffusion to make this implicit knowledge explicit.

Why is this better than discriminative models?

Generalization: Stable Diffusion has seen paintings, renders, photos, anime, sketches, satellite imagery — virtually every visual domain. Its geometric priors generalize to all of them.
Detail preservation: Diffusion models are trained to generate fine-grained details. When repurposed for geometry, they capture intricate surface details that discriminative models smear over.
Efficiency: Instead of training a geometry model from scratch on expensive 3D datasets, you fine-tune an existing model that already carries billions of images' worth of knowledge. GeoWizard trains on only 280K images.

Think of it this way: a discriminative model is like a student who memorized answers to specific exam questions. A generative model is like a student who truly understands the subject — they can answer questions they've never seen before because they understand the underlying principles.

Discriminative Approach

Train a CNN/ViT to regress geometry directly from labeled depth datasets. Fast inference, but limited to the training distribution. No intrinsic 3D understanding.

↓ vs ↓

GeoWizard's Generative Approach

Fine-tune Stable Diffusion. Inherits 3D priors learned from billions of images. Generates depth + normals via iterative denoising. Generalizes to any visual domain.

Why can Stable Diffusion serve as a strong prior for 3D geometry estimation?

Because it was trained on a depth estimation dataset
Because generating photorealistic images requires implicitly learning 3D structure (lighting, occlusion, perspective) from billions of images, creating rich transferable geometric priors
Because it uses a Vision Transformer backbone

Chapter 2: Stable Diffusion Recap

Before we see how GeoWizard modifies Stable Diffusion, let's quickly recap how the original model works. If you're already fluent in latent diffusion, skim this and jump to Chapter 3.

The latent space trick

Raw images are huge — a 512×512 image has 786,432 pixel values. Diffusion directly in pixel space is painfully slow. Stable Diffusion solves this with a VAE (Variational Autoencoder):

Encoder: compresses a 512×512×3 image into a 64×64×4 latent tensor — 48× fewer values
Decoder: reconstructs the image from the latent

All the diffusion magic happens in this compact latent space, which is much faster at comparable visual quality.
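The 48× figure follows directly from the tensor shapes quoted above. A quick arithmetic check, using only the sizes given in the text:

```python
# Count the values in a 512x512 RGB image and in its 64x64x4 latent.
pixel_values = 512 * 512 * 3
latent_values = 64 * 64 * 4
compression = pixel_values // latent_values

print(pixel_values, latent_values, compression)  # 786432 16384 48
```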

Forward process: adding noise

Take a clean latent z₀. At each timestep t, add Gaussian noise:

z_t = α_t z₀ + σ_t ε, ε ~ N(0, I)

Where α_t and σ_t are schedule parameters that control the noise level: α_t shrinks and σ_t grows as t increases. At t=T, it's essentially pure noise. At t=0, it's the clean latent.
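A minimal sketch of the forward process on a toy three-value "latent". The cosine schedule in `schedule` is a hypothetical choice made for illustration (it satisfies α_t² + σ_t² = 1); it is not claimed to be the schedule Stable Diffusion uses.

```python
import math
import random

def schedule(t, T):
    """Hypothetical cosine schedule with alpha_t^2 + sigma_t^2 = 1."""
    angle = 0.5 * math.pi * t / T
    return math.cos(angle), math.sin(angle)

def add_noise(z0, t, T, rng):
    """Return (z_t, eps) with z_t = alpha_t * z0 + sigma_t * eps."""
    alpha_t, sigma_t = schedule(t, T)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [alpha_t * z + sigma_t * e for z, e in zip(z0, eps)]
    return z_t, eps

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]                                   # toy clean latent
z_clean, _ = add_noise(z0, t=0, T=1000, rng=rng)        # alpha = 1, sigma = 0
z_noise, eps = add_noise(z0, t=1000, T=1000, rng=rng)   # alpha ~ 0, sigma = 1
```

At t=0 the output equals the clean latent, and at t=T it is numerically indistinguishable from the sampled noise.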

Reverse process: denoising

A U-Net learns to predict the noise that was added. Given a noisy latent z_t and the timestep t, it predicts ε̂. Remove part of the predicted noise, and you get a slightly cleaner latent z_{t-1}. Repeat step by step: pure noise → clean image.
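One way to make "remove the predicted noise" concrete is a deterministic DDIM-style update: estimate the clean latent from ε̂, then re-noise it to the next, lower noise level. The sketch below assumes a hypothetical cosine schedule and replaces the U-Net with an oracle that returns the true noise, so it illustrates the update rule only, not a trained model.

```python
import math

def schedule(t, T):
    """Hypothetical cosine schedule with alpha_t^2 + sigma_t^2 = 1."""
    angle = 0.5 * math.pi * t / T
    return math.cos(angle), math.sin(angle)

def ddim_step(z_t, eps_hat, t, t_prev, T):
    """One deterministic DDIM-style update from timestep t to t_prev."""
    alpha_t, sigma_t = schedule(t, T)
    alpha_p, sigma_p = schedule(t_prev, T)
    # Estimate the clean latent implied by the predicted noise.
    z0_hat = [(z - sigma_t * e) / alpha_t for z, e in zip(z_t, eps_hat)]
    # Re-noise that estimate to the lower noise level t_prev.
    return [alpha_p * z0 + sigma_p * e for z0, e in zip(z0_hat, eps_hat)]

# Toy check: with a perfect noise prediction, two steps recover z0.
T = 1000
z0 = [0.5, -1.0, 2.0]
eps = [0.3, -0.7, 1.1]
alpha, sigma = schedule(800, T)
z = [alpha * a + sigma * e for a, e in zip(z0, eps)]
for t, t_prev in [(800, 400), (400, 0)]:
    z = ddim_step(z, eps, t, t_prev, T)
```

With a real network the prediction ε̂ changes at every step, which is why more steps usually give cleaner results.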

Conditioning

Stable Diffusion conditions on text prompts via CLIP embeddings fed through cross-attention layers in the U-Net. For GeoWizard, the conditioning will be the input image instead of text — we want geometry conditioned on what the camera sees.

Key components to remember: (1) A VAE that maps images to/from a latent space. (2) A U-Net that denoises in latent space. (3) A noise schedule that controls the diffusion process. (4) Cross-attention layers for conditioning. GeoWizard reuses or adapts each of them.

Why does Stable Diffusion work in latent space rather than pixel space?

Because the latent space is 48× smaller (64×64×4 vs 512×512×3), making diffusion dramatically faster while preserving image quality
Because pixel-space diffusion produces blurry images
Because the VAE adds noise more efficiently than pixel-level Gaussian noise

Chapter 3: Adapting Diffusion for Geometry

Now the clever part. Stable Diffusion generates RGB images. We want it to generate depth maps and normal maps instead. How?

Step 1: Encode everything into latent space

The same VAE that encodes RGB images can encode depth and normal maps too. A depth map is just a single-channel image (repeated to 3 channels). A normal map is a 3-channel image where R, G, B encode the x, y, z components of the surface normal vector. The VAE doesn't care what the pixels "mean" — it just compresses spatial patterns.

Z_x = VAE(x), Z_d = VAE(d), Z_n = VAE(n)

Where x is the input image, d is the ground-truth depth, and n is the ground-truth normal map.

Step 2: Condition on the input image

GeoWizard conditions the diffusion process on the input image in two ways:

CLIP embedding via cross-attention: The image is encoded by CLIP and fed through the U-Net's cross-attention layers, providing global semantic understanding.
Latent concatenation: The image latent Z_x is concatenated with the noisy geometric latent along the channel dimension. This gives the U-Net pixel-level spatial information — it can see exactly what each region of the image looks like.

Why both? CLIP embeddings offer global context ("this is an outdoor scene with mountains"). Latent concatenation offers precise spatial alignment ("this pixel is on a wall edge"). Together, they reduce randomness and ensure the generated geometry faithfully matches the input image.
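To see what latent concatenation does to tensor shapes, the sketch below stacks two 4-channel latents along the channel axis. The nested-list layout `[C][H][W]` and the helper names are illustrative assumptions; the 64×64×4 latent size is the one from Chapter 2.

```python
def zeros(c, h, w):
    """Channel-first tensor of zeros stored as nested lists [C][H][W]."""
    return [[[0.0] * w for _ in range(h)] for _ in range(c)]

def concat_channels(a, b):
    """Concatenate two [C][H][W] tensors along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b

z_x = zeros(4, 64, 64)      # image latent
z_geo = zeros(4, 64, 64)    # noisy depth or normal latent
unet_input = concat_channels(z_x, z_geo)
shape = (len(unet_input), len(unet_input[0]), len(unet_input[0][0]))
print(shape)  # (8, 64, 64)
```

The spatial grid is unchanged, so every latent position of the geometry is paired with the image latent at the same position.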

Step 3: Fine-tune, don't train from scratch

GeoWizard starts from the pre-trained Stable Diffusion V2 weights and fine-tunes the entire U-Net. This is critical — models like DDP that train diffusion from scratch need orders of magnitude more data and compute, and still don't capture the rich priors that SD learned from LAION-5B (5 billion image-text pairs).

The fine-tuning is surprisingly lightweight: 20,000 steps on 280K images, 2 days on 8 A100 GPUs. Compare that to the years of compute baked into Stable Diffusion's pre-training.

How does GeoWizard condition the U-Net on the input image?

Only through CLIP text embeddings
Only through latent-space concatenation
Both: CLIP embeddings via cross-attention for global context, and latent concatenation for pixel-level spatial alignment

Chapter 4: Joint Depth-Normal Generation

This is GeoWizard's signature contribution: depth and normals are generated jointly in a single diffusion process, not independently.

Why joint estimation matters

Depth and normals are two views of the same 3D geometry. The normal at a surface point is determined by the local gradient of the depth field, since it is perpendicular to the surface the depth map traces out. If you estimate them separately, you can get contradictions: the depth says "flat wall" but the normals say "bumpy surface." Joint estimation encourages geometric consistency.
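The link between depth and normals can be sketched with finite differences. This example assumes an orthographic camera and unit pixel spacing (a simplification, not the paper's procedure), so the surface is z = depth[y][x] and its unnormalized normal is (−∂z/∂x, −∂z/∂y, 1).

```python
import math

def normals_from_depth(depth):
    """Per-pixel unit normals from central differences of a depth grid.

    Assumes an orthographic camera and unit pixel spacing.
    Border pixels are skipped.
    """
    h, w = len(depth), len(depth[0])
    normals = {}
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            n = (-dzdx, -dzdy, 1.0)
            length = math.sqrt(sum(c * c for c in n))
            normals[(y, x)] = tuple(c / length for c in n)
    return normals

# A tilted plane z = 0.5 * x + 2 should give the same normal everywhere.
plane = [[0.5 * x + 2.0 for x in range(5)] for _ in range(5)]
n = normals_from_depth(plane)
```

A flat but tilted wall therefore has one constant normal; any variation in predicted normals across that wall would contradict the depth.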

The geometry switcher

Rather than two separate U-Nets (expensive, no interaction), GeoWizard uses a single U-Net with a geometry switcher — a small signal that tells the network whether to output depth or normal at each forward pass:

d̂ = f(x, s_d), n̂ = f(x, s_n)

Where s_d and s_n are one-dimensional indicator vectors. They're encoded via positional encoding and added to the time embedding in the U-Net. The network shares all its weights — only a tiny signal switches between depth and normal mode.
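A rough sketch of how a switcher signal could be folded into the time embedding. The scalar values chosen for s_d and s_n, the 8-dimensional embedding, and the helper names are all assumptions for illustration; the actual model may use different values and extra projection layers.

```python
import math

def sinusoidal_encoding(value, dim):
    """Standard sinusoidal encoding of a scalar into a vector of even length dim."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

def conditioned_embedding(time_emb, switch_value):
    """Add the encoded geometry switcher to the time embedding."""
    switch_emb = sinusoidal_encoding(switch_value, len(time_emb))
    return [t + s for t, s in zip(time_emb, switch_emb)]

DEPTH, NORMAL = 0.0, 1.0                    # hypothetical values for s_d and s_n
time_emb = sinusoidal_encoding(250.0, 8)    # toy embedding of timestep 250
emb_depth = conditioned_embedding(time_emb, DEPTH)
emb_normal = conditioned_embedding(time_emb, NORMAL)
```

Both calls share the same timestep embedding and differ only by the added switcher term, which is the sense in which a tiny signal selects the output domain.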

Cross-domain geometric self-attention

The real magic: during training, GeoWizard processes both depth and normal latents through the U-Net, and modifies the self-attention layers so depth can attend to normal features and vice versa. In standard self-attention, queries, keys, and values all come from the same domain. Here, the depth branch instead computes:

q_d = Q · ẑ^d, k_d = K · (ẑ^d ⊕ ẑ^n), v_d = V · (ẑ^d ⊕ ẑ^n)

where ẑ^d and ẑ^n are the depth and normal features, ⊕ denotes concatenation, and Q, K, V are the query, key, and value projection matrices. The depth query attends to keys and values that concatenate both depth and normal features, and normal queries do the same in the other direction. This cross-domain attention lets depth inform normals and normals inform depth: mutual guidance that promotes geometric consistency.
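The attention pattern can be sketched in a few lines: queries come from one domain, keys and values from both. The example uses tiny hand-written matrices and identity projections, and it takes the concatenation along the token axis, which is an assumption about the layout rather than a confirmed implementation detail.

```python
import math

def matmul(a, b):
    """Matrix product of nested lists a [n][k] and b [k][m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_domain_attention(z_query, z_other, Wq, Wk, Wv):
    """Attention with queries from one domain and keys/values from both.

    z_query, z_other: token matrices of shape [tokens][channels].
    """
    both = z_query + z_other                   # concatenate tokens of both domains
    q = matmul(z_query, Wq)
    k = matmul(both, Wk)
    v = matmul(both, Wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]       # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
z_depth = [[1.0, 0.0], [0.0, 1.0]]             # two toy depth tokens
z_normal = [[1.0, 1.0], [0.5, -0.5]]           # two toy normal tokens
out_d, w_d = cross_domain_attention(z_depth, z_normal, identity, identity, identity)
out_n, w_n = cross_domain_attention(z_normal, z_depth, identity, identity, identity)
```

Each depth token ends up with four attention weights, two over depth tokens and two over normal tokens, so its output mixes information from both domains.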

The payoff: In ablations, removing cross-domain attention increases the geometric consistency error from 16.2 to 18.1 degrees. The depth and normal maps produced without it show visible contradictions, especially in distant regions. Joint estimation isn't a nice-to-have — it's essential.

Diffusion Denoising for Geometry

[Interactive figure: as the timestep t moves from T (pure noise) toward 0, the noise gradually resolves into depth and normal maps. A mode toggle compares depth-only output with joint output to show how joint generation produces more consistent geometry.]

What is the purpose of cross-domain geometric self-attention in GeoWizard?

It lets depth features attend to normal features and vice versa, enabling mutual guidance that ensures geometric consistency between the two outputs
It replaces the CLIP conditioning with a more efficient mechanism
It reduces the number of parameters by sharing attention weights

Chapter 5: The Architecture

Let's trace the full data flow through GeoWizard, from input image to output depth and normal maps.

Encoding

The input image x (576×768) is encoded by the Stable Diffusion VAE into a latent Z_x (72×96×4). Ground-truth depth d and normal n are similarly encoded to Z_d and Z_n during training.

Noising (training only)

Random Gaussian noise is added to Z_d and Z_n at a shared timestep t, producing noisy latents Z_t^d and Z_t^n. Crucially, both use the same timestep, which simplifies learning when the model must handle two modalities simultaneously.

U-Net forward pass

Two forward passes per training step:

Depth pass: Concatenate [Z_x, Z_t^d] along channels. Feed through U-Net with geometry switcher set to s_d and scene prompt s_i. Cross-domain attention accesses Z_t^n.
Normal pass: Concatenate [Z_x, Z_t^n] along channels. Feed through U-Net with geometry switcher set to s_n and scene prompt s_i. Cross-domain attention accesses Z_t^d.

Scene distribution decoupler

This is GeoWizard's second key innovation. Different scene types have very different depth distributions:

Outdoor: depths range from 1m to infinity, with a long tail
Indoor: depths are constrained, typically 0.5–10m
Background-free objects: very narrow depth range around the object

If you train on all three mixed together, the diffusion model gets confused — it doesn't know if a mid-range depth value means "close indoor wall" or "distant outdoor building." The scene decoupler adds a one-hot scene type indicator (indoor/outdoor/object) as another positional encoding added to the time embedding.

Why this works: By explicitly telling the model what type of scene it's looking at, you split one hard distribution (all scenes mixed) into three simpler sub-distributions that are each easier to learn. The ablation confirms: removing the decoupler increases AbsRel error from 6.7 to 7.5 and consistency error from 16.2 to 16.5.

GeoWizard Pipeline

The full architecture: image is encoded by VAE, concatenated with noisy geometric latents, processed by a shared U-Net with geometry switcher and scene decoupler. Cross-domain attention connects depth and normal branches.

Decoding

After DDIM denoising (10–50 steps at inference), the clean latents Z₀^d and Z₀^n are decoded back to pixel space by the VAE decoder, producing the final depth map d̂ and normal map n̂.

Why does GeoWizard use a scene distribution decoupler instead of training on all scene types jointly?

Because outdoor, indoor, and object scenes have fundamentally different depth distributions: mixing them creates an ambiguous distribution that's harder to learn, causing the model to produce incorrect spatial layouts
Because different scene types need different U-Net architectures
Because the VAE cannot encode all scene types in the same latent space

Chapter 6: Training

GeoWizard's training is remarkably efficient given its performance. Let's break down the ingredients.

Training data: 280K images from three domains

Domain	Dataset	Samples	Source
Indoor	Hypersim + Replica	76,347	Synthetic renders with GT depth/normal
Outdoor	3D Ken Burns + custom simulation	115,678	Stereo pairs + synthetic city scenes
Objects	Objaverse (filtered)	85,997	High-quality 3D objects rendered with GT

Notice: all training data is synthetic with perfect ground-truth depth and normals. No noisy pseudo-labels. The generalization comes from Stable Diffusion's pre-training, not from diverse real-world depth data.

Loss function: v-prediction with multi-scale noise

GeoWizard uses v-prediction as the learning target rather than epsilon-prediction:

v_t^d = α_t ε_t^d − σ_t Z^d

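Here ε_t^d is the noise added to the clean depth latent Z^d at timestep t, and the normal target is defined analogously. V-prediction blends the noise and the signal, which tends to be more numerically stable and to converge faster for this task.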
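A short sketch of the v-prediction target and how it maps back to the clean latent and the noise. The inversion formulas assume α_t² + σ_t² = 1; the specific α_t, σ_t, and toy vectors are illustrative values.

```python
import math

def v_target(z0, eps, alpha_t, sigma_t):
    """v-prediction target: v = alpha_t * eps - sigma_t * z0."""
    return [alpha_t * e - sigma_t * z for z, e in zip(z0, eps)]

def recover_from_v(z_t, v, alpha_t, sigma_t):
    """Invert v back to (z0, eps); valid when alpha_t^2 + sigma_t^2 = 1."""
    z0 = [alpha_t * z - sigma_t * u for z, u in zip(z_t, v)]
    eps = [sigma_t * z + alpha_t * u for z, u in zip(z_t, v)]
    return z0, eps

alpha_t, sigma_t = math.cos(0.6), math.sin(0.6)   # any pair on the unit circle
z0 = [0.5, -1.0, 2.0]
eps = [0.3, -0.7, 1.1]
z_t = [alpha_t * z + sigma_t * e for z, e in zip(z0, eps)]
v = v_target(z0, eps, alpha_t, sigma_t)
z0_rec, eps_rec = recover_from_v(z_t, v, alpha_t, sigma_t)
```

Because both the clean latent and the noise can be recovered from v and z_t, a network that predicts v carries the same information as one that predicts ε.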

GeoWizard also uses multi-resolution noise: noise at multiple spatial frequencies rather than a single scale. This is critical because depth and normal maps have many similar values in local regions (think: a flat wall). Single-scale noise struggles with these uniform regions; multi-scale noise preserves low-frequency structure better.
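One common way to build multi-resolution noise is a pyramid: draw Gaussian noise at several coarser resolutions, upsample each level, and sum them with decaying weights. The sketch below follows that recipe; the number of levels, the decay factor, and nearest-neighbour upsampling are assumptions, not GeoWizard's confirmed settings.

```python
import random

def gaussian_grid(h, w, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(h)]

def upsample_nearest(grid, h, w):
    """Nearest-neighbour resize of a 2D grid to h x w."""
    gh, gw = len(grid), len(grid[0])
    return [[grid[y * gh // h][x * gw // w] for x in range(w)] for y in range(h)]

def multi_resolution_noise(h, w, levels, decay, rng):
    """Sum Gaussian noise drawn at progressively coarser resolutions.

    Level i is sampled at (h / 2**i, w / 2**i), upsampled, and weighted by
    decay**i. The result is standardized to zero mean and unit variance.
    """
    total = [[0.0] * w for _ in range(h)]
    for i in range(levels):
        lh, lw = max(1, h >> i), max(1, w >> i)
        coarse = upsample_nearest(gaussian_grid(lh, lw, rng), h, w)
        weight = decay ** i
        for y in range(h):
            for x in range(w):
                total[y][x] += weight * coarse[y][x]
    flat = [v for row in total for v in row]
    mean = sum(flat) / len(flat)
    std = (sum((v - mean) ** 2 for v in flat) / len(flat)) ** 0.5
    return [[(v - mean) / std for v in row] for row in total]

noise = multi_resolution_noise(16, 16, levels=4, decay=0.6, rng=random.Random(0))
```

The coarse levels add spatially correlated components, so large flat regions of a depth or normal map are perturbed at low frequencies as well as at the pixel level.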

Training details

Base model: Stable Diffusion V2 (image-conditioned variant)
Resolution: 576 × 768
Steps: 20,000
Batch size: 256 (across 8 A100 GPUs)
Optimizer: Adam, lr = 1×10^-5
Augmentation: horizontal flip, random crop, photometric distortion
Inference: DDIM sampling with 10–50 denoising steps

Ensemble trick at inference: GeoWizard can run multiple denoising passes with different random noise initializations and average the results. This reduces noise and improves quality at the cost of proportionally more compute. In benchmarks, ensembling 3 passes typically gives the best quality-speed tradeoff.
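A minimal sketch of ensembling, assuming each pass returns a flat list of per-pixel values. Depth passes are averaged directly; for normals, renormalizing the averaged vector to unit length is an added assumption, since a plain mean of unit vectors is generally shorter than one. Real pipelines may also align depth passes in scale and shift before averaging.

```python
import math

def ensemble_depth(predictions):
    """Per-pixel mean of several depth predictions (flat lists of equal length)."""
    n = len(predictions)
    return [sum(vals) / n for vals in zip(*predictions)]

def ensemble_normals(predictions):
    """Per-pixel mean of unit normals, renormalized to unit length."""
    merged = []
    for vectors in zip(*predictions):
        mean = [sum(c) / len(vectors) for c in zip(*vectors)]
        length = math.sqrt(sum(c * c for c in mean))
        merged.append(tuple(c / length for c in mean))
    return merged

depth_runs = [[0.20, 0.80], [0.30, 0.70], [0.25, 0.75]]   # three toy passes, two pixels
normal_runs = [[(0.0, 0.0, 1.0)], [(0.0, 0.6, 0.8)]]       # two toy passes, one pixel
depth = ensemble_depth(depth_runs)
normals = ensemble_normals(normal_runs)
```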

Why does GeoWizard use multi-resolution noise instead of standard single-scale Gaussian noise?

Because depth and normal maps contain large uniform regions where single-scale noise fails to preserve low-frequency structure — multi-scale noise captures both fine details and broad spatial patterns Because multi-resolution noise trains faster with fewer GPU hours Because the VAE encoder requires noise at multiple resolutions

Chapter 7: Results

GeoWizard is evaluated zero-shot on six depth benchmarks and five normal benchmarks. "Zero-shot" means the model was never trained on these datasets — it must generalize from its synthetic training data + diffusion priors.

Depth estimation

On standard benchmarks (NYUv2, KITTI, ETH3D, ScanNet), GeoWizard matches or slightly beats Marigold (the closest generative competitor) and comes close to DepthAnything, which was trained on 63.5M images — 227× more data.

But the real story is in generalization. On out-of-distribution images — paintings, anime, renders, underwater photos — DepthAnything's depth maps lose detail and produce incorrect spatial layouts. GeoWizard maintains faithful geometry because its generative prior understands 3D structure, not just pattern matching.

Normal estimation

GeoWizard achieves state-of-the-art on zero-shot normal estimation across all benchmarks, beating DSINE (the previous best discriminative method) and Omnidata v2. The improvement is especially visible in fine-grained details: hairlines, architectural textures, and thin structures like chair legs.

Geometric consistency

A unique advantage of joint estimation: GeoWizard's depth and normals are geometrically consistent with each other. The mean angular error between normals computed from the depth map and the directly predicted normals is 16.2° for GeoWizard's full model, compared to 18.1° without cross-domain attention and 19.1° with separate models.
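The consistency number is a mean angular error between two normal maps. A small sketch of that metric, assuming both maps are given as lists of unit vectors (the toy inputs are made up for illustration):

```python
import math

def mean_angular_error_deg(normals_a, normals_b):
    """Mean angle in degrees between paired unit normals."""
    total = 0.0
    for a, b in zip(normals_a, normals_b):
        dot = sum(x * y for x, y in zip(a, b))
        dot = max(-1.0, min(1.0, dot))     # guard against rounding outside [-1, 1]
        total += math.degrees(math.acos(dot))
    return total / len(normals_a)

from_depth = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]   # normals computed from a depth map
predicted = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]    # directly predicted normals
error = mean_angular_error_deg(from_depth, predicted)
```

Here the first pair agrees exactly and the second pair is perpendicular, so the mean is about 45 degrees.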

Benchmark Results

Comparison of depth (AbsRel, lower is better) and normal (Mean angular error, lower is better) across methods. GeoWizard trains on only 0.28M images yet matches models trained on 63.5M.

Ablation summary

Configuration	AbsRel ↓	Normal Mean ↓	GC ↓
Separate models (2 U-Nets)	8.5	16.9	19.1
w/o Geometry Switcher	6.9	15.0	18.1
w/o Scene Decoupler	7.5	16.1	16.5
Full GeoWizard	6.7	14.8	16.2
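For reference, AbsRel is the mean of |d̂ − d| / d over pixels. Because affine-invariant predictions have no fixed scale or shift, evaluations of such models commonly fit both by least squares against the ground truth first. The sketch below assumes that protocol; it is not a reproduction of the paper's evaluation code, and the depth values are toy data.

```python
def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimizing sum((s * pred + t - gt)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var = sum((p - mean_p) ** 2 for p in pred)
    s = cov / var
    t = mean_g - s * mean_p
    return s, t

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over all pixels (gt must be positive)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

gt = [1.0, 2.0, 4.0, 8.0]          # toy ground-truth depths
pred = [0.1, 0.3, 0.7, 1.5]        # exactly (gt - 0.5) / 5
s, t = align_scale_shift(pred, gt)
aligned = [s * p + t for p in pred]
error = abs_rel(aligned, gt)
```

In this toy case the prediction is a perfect affine transform of the ground truth, so the fit returns roughly s = 5 and t = 0.5 and the error is essentially zero.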

What is GeoWizard's key advantage over DepthAnything in real-world usage?

GeoWizard generalizes to out-of-distribution images (artwork, renders, unusual scenes) where DepthAnything's discriminative approach fails, despite training on 227× less data
GeoWizard runs faster at inference
GeoWizard uses a smaller model

Chapter 8: Applications

High-quality depth and normals unlock a cascade of downstream tasks. GeoWizard demonstrates three:

1. Single-image 3D reconstruction

With estimated depth and normals, you can reconstruct a 3D mesh from a single photo. But there's a catch: GeoWizard predicts affine-invariant depth — the relative ordering is correct, but the absolute scale and shift are unknown.

Solution: optimize a scale ŝ and shift t̂ so that the normals computed from the scaled depth match the directly predicted normals. Minimize the angular difference in spherical coordinates:

min_{ŝ, t̂} D(n̂_d, n̂)

Where n̂_d is the normal derived from the depth via least-squares fitting, n̂ is GeoWizard's predicted normal, and D measures the angular difference between them. This aligns the depth scale with the normals, then the BiNI algorithm fuses both for high-quality surface reconstruction.
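A heavily simplified toy version of this alignment, to show the idea only. It assumes an orthographic camera, where a constant shift does not change the depth-derived normals, so only the scale is fitted; it uses the plain angle between normals rather than a spherical-coordinate distance; and it uses a ternary search instead of whatever optimizer the authors use. All function names and values are hypothetical.

```python
import math

def unit_normal(dzdx, dzdy):
    """Unit normal of the surface z(x, y) with the given slopes."""
    n = (-dzdx, -dzdy, 1.0)
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)

def mean_angle(scale, slopes, targets):
    """Mean angle in radians between normals of the scaled depth and the targets."""
    total = 0.0
    for (dzdx, dzdy), target in zip(slopes, targets):
        n = unit_normal(scale * dzdx, scale * dzdy)
        dot = sum(a * b for a, b in zip(n, target))
        total += math.acos(max(-1.0, min(1.0, dot)))
    return total / len(slopes)

def fit_scale(slopes, targets, lo=1e-2, hi=1e2, iters=100):
    """Ternary search in log-space for the scale with the smallest mean angle."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3.0
        m2 = b - (b - a) / 3.0
        if mean_angle(math.exp(m1), slopes, targets) < mean_angle(math.exp(m2), slopes, targets):
            b = m2
        else:
            a = m1
    return math.exp(0.5 * (a + b))

# Depth slopes of the unscaled prediction, and normals consistent with scale 3.
slopes = [(0.1, 0.0), (0.0, 0.2), (0.3, -0.1)]
true_scale = 3.0
targets = [unit_normal(true_scale * dx, true_scale * dy) for dx, dy in slopes]
estimated = fit_scale(slopes, targets)
```

Under a perspective camera the shift also changes the derived normals, which is why the full method optimizes both parameters.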

The result: detailed 3D meshes from a single photo, capturing fine features like the beard of a stone lion or the folds of a cloak.

2. Depth-guided image generation

GeoWizard's depth and normal maps can condition ControlNet for image generation. The geometry acts as a structural scaffold: "generate a futuristic version of this scene, keeping the same 3D layout." Because GeoWizard's geometry captures fine details, the generated images maintain spatial coherence — doorways stay doorways, chair legs stay thin.

3. Novel view synthesis

By projecting pixels to 3D using the estimated depth, then re-rendering from a new camera angle, you get a novel view of the scene. GeoWizard's accurate depth produces cleaner warps with fewer artifacts than MiDaS, especially on thin structures and edges.
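The projection step can be sketched with a pinhole camera model. The intrinsics (fx, fy, cx, cy), the camera translation, and the depth values below are illustrative assumptions; in practice the affine-invariant depth would first need a scale and shift, and the intrinsics would have to be known or estimated.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with the given depth to a 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def warp_pixel(u, v, depth, translation, fx, fy, cx, cy):
    """Where pixel (u, v) lands after the camera moves by `translation`."""
    x, y, z = unproject(u, v, depth, fx, fy, cx, cy)
    tx, ty, tz = translation
    moved = (x - tx, y - ty, z - tz)    # the point expressed in the new camera frame
    return project(moved, fx, fy, cx, cy)

fx = fy = 500.0
cx, cy = 320.0, 240.0
shift_right = (0.1, 0.0, 0.0)           # move the camera 0.1 units to the right
near = warp_pixel(400.0, 240.0, 1.0, shift_right, fx, fy, cx, cy)
far = warp_pixel(400.0, 240.0, 10.0, shift_right, fx, fy, cx, cy)
```

The near point moves about 50 pixels while the far point moves about 5, which is the parallax a novel view relies on; depth errors at thin structures show up directly as misplaced pixels.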

A unified geometric foundation: All three applications benefit from the same two outputs: a high-quality depth map and a geometrically consistent normal map. GeoWizard acts as a geometry backbone that any downstream pipeline can build on.

How does GeoWizard resolve the unknown scale and shift of its affine-invariant depth prediction?

It directly predicts metric depth in meters
It uses the camera focal length to convert relative depth
It optimizes scale and shift parameters so that the normals derived from the depth match the directly predicted normals, using geometric consistency to recover scale

Chapter 9: Connections

GeoWizard sits at the intersection of diffusion models and 3D geometry estimation. Let's map where it fits.

Relation to Marigold

Marigold is the closest precursor — it also fine-tunes Stable Diffusion for monocular depth estimation. But Marigold has two limitations GeoWizard addresses: (1) it only predicts depth, not normals, so no geometric consistency guarantee; (2) it trains on mixed scene types without decoupling, leading to layout ambiguities where foreground objects get flattened into backgrounds.

Relation to DepthAnything

DepthAnything takes the opposite approach: massive discriminative training on 63.5M images. It achieves the best quantitative numbers on standard benchmarks but fails on out-of-distribution images. GeoWizard shows you can achieve comparable in-distribution quality with 227× less data by leveraging generative priors, plus dramatically better out-of-distribution generalization.

Relation to Stable Diffusion

GeoWizard demonstrates that Stable Diffusion's latent representations encode rich 3D knowledge — a finding with broad implications. If diffusion models understand geometry well enough to produce depth and normals, what else do they know? This opens the door to extracting other physical properties (material, lighting, albedo) from the same pre-trained model.

Relation to Metric3D / Metric3D v2

Metric3D takes yet another approach: using camera intrinsics to produce metric (absolute-scale) depth. GeoWizard produces affine-invariant depth that needs scale recovery. They're complementary — GeoWizard has better detail and generalization, Metric3D has absolute scale.

Cheat Sheet

Aspect	GeoWizard
Base model	Stable Diffusion V2
Input	Single RGB image + scene type prompt
Output	Affine-invariant depth + surface normal map
Key mechanism	Joint generation via geometry switcher + cross-domain attention
Scene handling	Distribution decoupler (indoor/outdoor/object)
Training data	280K synthetic images (Hypersim, Replica, Objaverse, etc.)
Training cost	20K steps, 2 days on 8×A100
Inference	DDIM, 10–50 steps, optional ensemble
Key result	SOTA zero-shot depth + normal, robust OOD generalization

The broader lesson: Large generative models pre-trained on web-scale data contain surprisingly rich representations of 3D geometry. Rather than building specialized geometry models from scratch, we can unlock this knowledge through lightweight fine-tuning — getting better results with orders of magnitude less data and compute.

What are GeoWizard's two key innovations beyond simply fine-tuning Stable Diffusion for depth prediction?

Joint depth-normal generation via cross-domain geometric self-attention, and the scene distribution decoupler that separates indoor/outdoor/object distributions for cleaner learning
Larger U-Net and more training data
Metric depth prediction and multi-view fusion

GeoWizard: Unleashing Diffusion Priors for 3D Geometry