Tan, Sun, Qin, Fu, Zhang, Adai, Zhou, Xu, Zhu, Xue, Shen — 2026

Masked Depth Modeling for Spatial Perception

Sensor failures aren't noise to discard — they're a learning signal. Treat missing depth as natural masks, and a single model unifies monocular depth estimation and depth completion. LingBot-Depth outperforms top-tier RGB-D cameras.

Prerequisites: Vision Transformers (ViT) + Masked Autoencoders (MAE) + Depth sensing basics

Chapter 0: The Problem

You're building a robot that needs to grasp a glass cup on a kitchen counter. Your RGB-D camera — a RealSense, an Orbbec, a Kinect — gives you both a color image and a depth map. In theory, that's everything you need: the color tells you what is there, the depth tells you where it is in 3D.

But look at the depth map. The glass cup? Gone. A black hole where the cup should be. The polished countertop? Half missing. The mirror on the wall? Completely absent. Your depth sensor has failed on exactly the surfaces that matter most.

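This isn't a rare edge case. It happens constantly in real-world deployments. Most RGB-D cameras work by projecting infrared patterns (structured light) or matching stereo pairs. Both methods fail catastrophically on specular surfaces such as mirrors and polished countertops, on transparent objects such as glass, and on textureless regions.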

The scale of the problem: Even state-of-the-art commercial sensors can have 30-70% of pixels missing on challenging indoor scenes. That's not a depth map — it's a depth suggestion. And yet robots, AR systems, and autonomous vehicles all depend on dense, pixel-aligned depth for safe operation.

There are three paradigms for getting 3D geometry, and each has a fatal flaw:

Multi-view stereo
Accurate but requires multiple viewpoints, expensive post-processing, and fails on textureless regions.
Monocular depth estimation
Works from a single image but produces only relative depth — no metric scale. A 1m wall and a 10m wall look identical.
RGB-D cameras
Real-time, metric scale, pixel-aligned — but riddled with holes on the surfaces that matter most.

RGB-D cameras are the only option that gives you metric scale, dense geometry, and real-time speed simultaneously. If we could fix their holes, we'd have the ideal depth sensor. That's exactly what LingBot-Depth does.

RGB-D Sensor Failure

Interactive simulation: a simulated depth map with sensor failures, where black regions are missing depth. A slider sets the scene's "difficulty" (more specular and textureless surfaces). Holes cluster on glass, mirrors, and smooth surfaces rather than appearing at random.

Why do RGB-D depth sensors produce holes in the depth map on specular and transparent surfaces?

Chapter 1: The Key Insight

Here's the conventional approach to depth map holes: discard the missing pixels, interpolate from neighbors, or apply a learned inpainting model. All of these treat the holes as unfortunate noise — something to clean up and move past.

LingBot-Depth flips this entirely: the holes are the training signal.

Think about it. Where do sensor failures happen? Not randomly. They happen on specular surfaces, textureless regions, transparent objects — places where the geometry is ambiguous to the sensor. These are the hardest depth estimation challenges. And your sensor is telling you, for free, exactly which pixels are hard.

The paradigm shift: Missing depth from sensors isn't noise to discard — it's a natural mask. Just like MAE randomly masks image patches and trains a model to reconstruct them, LingBot-Depth uses sensor failures as masks and trains a model to predict the missing depth from RGB context.

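Why are natural masks better than random masks? Consider two masking strategies. Random masks scatter hidden patches across the image, so valid depth usually survives close to every hole. Natural masks cluster on the specular, transparent, and textureless surfaces where the sensor failed, so entire objects can go missing at once.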

Natural masks create a much harder reconstruction task. And harder tasks force the model to learn deeper representations. The model can't just interpolate from neighbors — it must understand the full RGB context to infer what a mirror's depth should be, or that a glass cup has the same shape as its visible rim.

Free curriculum: Sensor data comes with a natural difficulty distribution. Easy scenes (textured walls, matte objects) have few missing pixels — low mask ratio. Hard scenes (glass lobbies, aquariums) have many missing pixels — high mask ratio. The training data provides a free difficulty curriculum without any manual engineering.
Why are "natural masks" from sensor failures harder than random masks?

Chapter 2: Masked Depth Modeling

Let's formalize the idea. Masked Depth Modeling (MDM) follows the general paradigm of Masked Image Modeling (like MAE), but shifts the reconstruction target from appearance to geometry.

How MAE Works (Quick Recap)

In standard MAE, you split an image into patches, randomly mask ~75% of them, feed the visible patches through a ViT encoder, then use a decoder to reconstruct the masked patches' pixel values. The model learns rich visual representations because it must understand scene structure to fill in the gaps.

How MDM Differs

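MDM takes two inputs: an RGB image and a (corrupted) depth map. The key differences from MAE: the RGB image stays fully visible and only depth patches are masked; the masks come primarily from real sensor failures rather than random sampling; and the reconstruction target is depth rather than pixel values.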

Patch-Level Masking Decision

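Since LingBot-Depth uses a ViT with 14x14-pixel patches, each token covers a block of pixels rather than a single one, so the masking decision is made at the patch level. A patch with no valid depth has nothing to contribute and is masked. A mixed patch, containing both valid and missing pixels, may be either masked or kept.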

Why keep some mixed patches? A patch that's half-valid, half-missing still carries useful geometric information. By keeping some of these unmasked, the model can use the partial depth signal alongside RGB context. This is strictly more informative than discarding them.
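The patch-level decision could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the probability `mixed_mask_prob`, the `target_ratio` top-up step, and the default values are all assumptions made for the example.

```python
import math
import random

PATCH = 14  # patch side in pixels, matching a ViT-Large/14 backbone


def patch_valid_fraction(valid, row, col, patch=PATCH):
    """Fraction of pixels with valid depth inside one patch."""
    good = 0
    for y in range(row * patch, (row + 1) * patch):
        for x in range(col * patch, (col + 1) * patch):
            good += 1 if valid[y][x] else 0
    return good / (patch * patch)


def choose_depth_mask(valid, mixed_mask_prob=0.75, target_ratio=0.6,
                      rng=None, patch=PATCH):
    """Return a grid of booleans where True means the depth patch is masked.

    mixed_mask_prob and target_ratio are illustrative values only.
    """
    rng = rng or random.Random(0)
    rows, cols = len(valid) // patch, len(valid[0]) // patch
    mask = [[False] * cols for _ in range(rows)]
    fully_valid = []
    for r in range(rows):
        for c in range(cols):
            frac = patch_valid_fraction(valid, r, c, patch)
            if frac == 0.0:
                mask[r][c] = True  # no depth signal to feed the encoder
            elif frac < 1.0:
                # Mixed patch: mask it only some of the time.
                mask[r][c] = rng.random() < mixed_mask_prob
            else:
                fully_valid.append((r, c))
    # Assumed top-up: mask random fully valid patches to reach the target.
    rng.shuffle(fully_valid)
    masked = sum(cell for row in mask for cell in row)
    needed = math.ceil(target_ratio * rows * cols)
    for r, c in fully_valid:
        if masked >= needed:
            break
        mask[r][c] = True
        masked += 1
    return mask
```

Setting `mixed_mask_prob` below 1.0 is what lets some half-valid patches reach the encoder with their partial depth signal intact.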

The Decoder

Standard MAE uses a shallow transformer decoder. LingBot-Depth replaces this with a ConvStack decoder — a hierarchical convolutional architecture that progressively upsamples from patch-resolution to pixel-resolution. This is better suited for dense geometric prediction because convolutions naturally preserve spatial locality and smooth gradients.

Key design choice: After the encoder, the latent depth tokens are discarded. Only the latent contextual (RGB) tokens are kept and fed to the ConvStack decoder. The decoder reconstructs the full depth map solely from enriched RGB features. This forces the encoder to transfer all geometric information into the RGB token representations via cross-modal attention.
In MDM, what happens to the depth tokens after the ViT encoder?

Chapter 3: The Unified Framework

Here's the most elegant consequence of masked depth modeling: the same architecture, with the same weights, naturally handles two seemingly different tasks — just by changing the mask ratio.

Depth Completion (Partial Masking)

When you have an RGB-D camera that produced a depth map with holes, you mask only the invalid (sensor-corrupted) tokens. The valid depth tokens flow through the encoder alongside all RGB tokens. The encoder fuses the sparse valid depth with rich visual context, and the decoder fills in the gaps.

This is standard depth completion: given RGB + sparse/incomplete depth, produce dense depth.

Monocular Depth Estimation (Full Masking)

Now imagine masking all depth tokens. Zero depth information enters the encoder. The model has only RGB tokens to work with. Yet it still predicts a complete depth map.

This is monocular depth estimation: given only RGB, infer depth from visual cues alone.

The unification: Monocular depth estimation and depth completion aren't two different tasks — they're two points on a continuum. The mask ratio is a "slider" from pure monocular (100% masked) to pure completion (only invalid pixels masked). And every point in between works: 80% masked, 50% masked, 20% masked. The model gracefully degrades as depth information decreases.

This means a single model can handle RGB-only input, RGB plus sensor depth with holes, and RGB plus highly sparse depth points.

No architecture changes. No retraining. Just different masking.
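To make the "just different masking" point concrete, here is a small sketch of how the two modes could differ only in the mask handed to the encoder. The function names and the string mode flags are hypothetical, and tokens are stand-in strings rather than real embeddings.

```python
def depth_token_mask(patch_has_valid_depth, mode):
    """Return one boolean per depth token; True means the token is masked."""
    if mode == "monocular":
        # Hide every depth token: the model sees RGB only.
        return [True] * len(patch_has_valid_depth)
    if mode == "completion":
        # Hide only the tokens where the sensor produced no valid depth.
        return [not ok for ok in patch_has_valid_depth]
    raise ValueError(f"unknown mode: {mode!r}")


def encoder_inputs(rgb_tokens, depth_tokens, mask):
    """All RGB tokens plus whichever depth tokens are left unmasked."""
    visible_depth = [tok for tok, hidden in zip(depth_tokens, mask) if not hidden]
    return rgb_tokens + visible_depth


rgb = ["r0", "r1", "r2", "r3"]
depth = ["d0", "d1", "d2", "d3"]
sensor_ok = [True, False, True, True]  # patch 1 is a sensor hole

for mode in ("monocular", "completion"):
    mask = depth_token_mask(sensor_ok, mode)
    print(mode, encoder_inputs(rgb, depth, mask))
```

The encoder and decoder are untouched in both calls; only the list of depth tokens that reaches the encoder changes.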

Unified Framework

Interactive simulation: a toggle switches between monocular mode (all depth masked) and depth completion mode (only invalid pixels masked), showing how the same architecture handles both tasks by changing what gets masked.

How does LingBot-Depth unify monocular depth estimation and depth completion?

Chapter 4: The Architecture

LingBot-Depth uses a standard Vision Transformer (ViT-Large) as its encoder, but with a carefully designed input pipeline and decoder that enable masked depth modeling.

Separated Patch Embeddings

The two input modalities, 3-channel RGB and 1-channel depth, are processed by separate patch embedding layers. Each modality is independently projected into a sequence of patch tokens on the same spatial grid of 14x14-pixel patches. This separation lets the self-attention layers learn how to integrate appearance and geometry, rather than forcing them into the same embedding space from the start.

Positional Embeddings

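Each token receives two types of positional information: a spatial embedding that encodes where its patch sits on the grid, and a modality embedding that marks it as an RGB or a depth token.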

The final positional encoding is the sum of both. This lets the model know both where a token is and what type it is.
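The summation can be illustrated with a toy example. The tiny fixed tables below are made up for readability; whether the real embeddings are learned or fixed is not stated here, and the function and table names are hypothetical.

```python
def positional_encoding(patch_index, modality, spatial_table, modality_table):
    """Sum the 'where' vector and the 'what type' vector for one token."""
    spatial = spatial_table[patch_index]
    kind = modality_table[modality]
    return [s + k for s, k in zip(spatial, kind)]


# Toy tables: two grid positions, embedding dimension 2.
spatial_table = [[0.0, 1.0], [2.0, 3.0]]
modality_table = {"rgb": [10.0, 20.0], "depth": [100.0, 200.0]}

rgb_pos = positional_encoding(1, "rgb", spatial_table, modality_table)
depth_pos = positional_encoding(1, "depth", spatial_table, modality_table)
print(rgb_pos, depth_pos)
```

An RGB token and a depth token at the same grid location share the spatial part and differ only by the modality part, which is what lets attention pair them up.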

Encoder: Joint Embedding

After masking, the full set of RGB tokens and the unmasked depth tokens are concatenated and fed into a 24-layer ViT-Large encoder (initialized from DINOv2). The self-attention mechanism allows every RGB token to attend to every depth token and vice versa. This is where the magic happens: depth tokens at a given location attend to RGB tokens at the same and nearby locations, learning to correlate appearance with geometry.

ConvStack Decoder

After the encoder, the depth tokens are discarded. Only the enriched RGB tokens (which now carry geometric information absorbed from depth tokens via attention) are fed to the decoder. A [CLS] token capturing global scene context is broadcast-added to all tokens. The ConvStack decoder then progressively upsamples through residual blocks and transposed convolutions, doubling resolution at each stage until reaching 16x the patch resolution. UV positional encodings at each scale preserve spatial layout. Final bilinear upsampling matches the original input resolution.
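The resolution bookkeeping in the decoder can be checked with a few lines of arithmetic. This sketch assumes four doubling stages (since 2^4 = 16) and an input size divisible by 14; the function name is hypothetical.

```python
def decoder_resolutions(height, width, patch=14, stages=4):
    """Feature-map sizes from the patch grid through each 2x upsampling stage."""
    sizes = [(height // patch, width // patch)]
    for _ in range(stages):
        h, w = sizes[-1]
        sizes.append((h * 2, w * 2))
    return sizes


sizes = decoder_resolutions(224, 224)
print(sizes)  # patch grid first, then one entry per doubling stage
```

Because patches are 14 pixels wide but the decoder upsamples by 16, the last feature map (256x256 for a 224x224 input) does not match the input size, which is why a final bilinear resize is needed.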

Why discard depth tokens? If the decoder received depth tokens directly, it could "cheat" by simply passing through valid depth values without learning to reason from RGB. By forcing all output to flow through enriched RGB tokens only, the model must learn deep cross-modal correspondences during encoding.
Interactive Masked Depth Modeling

Interactive simulation: a slider sets the mask ratio, controlling how much depth is hidden. The left panel shows RGB (always full), the middle shows the masked depth tokens, and the right shows the "predicted" output. At 100% masking, the model must rely entirely on RGB.

Why does LingBot-Depth use separate patch embedding layers for RGB and depth instead of concatenating them into a 4-channel input?

Chapter 5: The Data Pipeline

Training a model to fill in depth sensor failures requires data with both the failures (for masking) and the ground truth (for supervision). This is a chicken-and-egg problem: if you had perfect depth, you wouldn't need the model. LingBot-Depth solves this with a dual-stream pipeline producing 3M training pairs.

Synthetic Pipeline (1M samples)

The synthetic branch doesn't just render perfect depth maps — that would miss the point. Instead, it simulates the full imaging process of a real RGB-D camera:

  1. Render RGB images, perfect depth, and grayscale stereo pairs with speckle patterns in Blender
  2. Process the stereo pairs through Semi-Global Matching (SGM), a stereo-matching algorithm widely used in real cameras
  3. The SGM output has realistic artifacts: missing values on textureless and specular surfaces, edge noise, depth quantization
Why simulate the camera pipeline? Prior synthetic datasets render "ideal" depth and then add random noise. But real sensor failures aren't random — they correlate with material properties and lighting. By running the actual stereo-matching algorithm on simulated IR images with speckle patterns, the synthetic data inherits the same failure modes as real hardware.

Key numbers: 442 indoor scenes, resolution 960x1280. Each sample contains RGB, perfect depth, stereo pair, ground-truth disparity, and simulated sensor depth. The stereo baseline is randomly sampled between 0.05-0.2m and focal length between 16-28mm for diversity.
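The baseline and focal length matter because stereo depth follows the standard pinhole relation Z = f * B / d, with f in pixels, B the baseline, and d the disparity. The sketch below uses that relation; the 36 mm sensor width and the reading of 960x1280 as height x width are assumptions made only to convert millimetres to pixels.

```python
SENSOR_WIDTH_MM = 36.0  # assumption, not stated in the text
IMAGE_WIDTH_PX = 1280   # assumption: 960x1280 read as height x width


def focal_length_px(focal_mm, image_width_px, sensor_width_mm):
    """Convert a focal length in millimetres to pixels."""
    return focal_mm * image_width_px / sensor_width_mm


def disparity_from_depth(depth_m, focal_px, baseline_m):
    """Disparity in pixels for a point at the given metric depth."""
    return focal_px * baseline_m / depth_m


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Metric depth from disparity; None marks a failed match (a hole)."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


f_px = focal_length_px(16.0, IMAGE_WIDTH_PX, SENSOR_WIDTH_MM)
for baseline in (0.05, 0.2):
    d = disparity_from_depth(2.0, f_px, baseline)
    print(f"baseline {baseline} m -> disparity {d:.2f} px at 2 m")
```

A wider baseline or longer focal length gives larger disparities for the same depth, so sampling both varies how severely matching errors and disparity quantization show up in the simulated sensor depth.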

Real-World Pipeline (2M samples)

The real-world branch uses a custom, 3D-printed capture rig that mounts multiple commercial RGB-D cameras (Intel RealSense, Orbbec Gemini, ZED) alongside a portable PC with a touchscreen. The modular design lets operators swap cameras easily.

Since real captures don't have perfect ground-truth depth, the team computes pseudo-depth labels from the left-right IR stereo pairs using a FoundationStereo-based network trained on the synthetic data. Left-right consistency checks filter out unreliable pixels.
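A left-right consistency check is a standard stereo technique, and a one-scanline version can be sketched as below. The threshold `max_diff` and the function name are illustrative choices, not values from the pipeline described here.

```python
def left_right_consistent(disp_left, disp_right, max_diff=1.0):
    """Per-pixel reliability flags for one scanline of the left disparity map.

    A left pixel at column x with disparity d should land on right column
    x - d, and the right map should report roughly the same disparity there.
    """
    width = len(disp_left)
    keep = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))
        if d <= 0 or xr < 0 or xr >= width:
            keep.append(False)  # no match, or the match falls off the image
            continue
        keep.append(abs(d - disp_right[xr]) <= max_diff)
    return keep


disp_left = [0, 0, 2, 2, 2, 5]
disp_right = [2, 2, 2, 0, 0, 0]
print(left_right_consistent(disp_left, disp_right))
# prints [False, False, True, True, True, False]
```

Pixels flagged False would be dropped from the pseudo-label, so the loss never supervises on depth the stereo network itself could not confirm from both views.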

Scene diversity spans 30+ categories: homes, offices, hospitals, airports, museums, gyms, parking garages, and outdoor environments.

Public Datasets (7M additional)

Seven open-source RGB-D datasets supplement the pipeline: Taskonomy (4.6M), ScanNet++ (0.8M), TartanAir (0.6M), ARKitScenes (0.5M), and others. For synthetic datasets with no missing depth, random patch masking simulates sensor failures. Total training: ~10M samples.

Source | Samples | Type | Natural masks?
LingBot-Depth-S (synthetic) | 1.0M | Simulated SGM failures | Yes (SGM)
LingBot-Depth-R (real) | 2.1M | Real sensor failures | Yes (hardware)
Taskonomy | 4.6M | Supplementary | Random
ScanNet++ | 0.8M | Supplementary | Partial
Others (5 datasets) | ~1.5M | Supplementary | Mixed
Why does the synthetic pipeline run SGM on simulated stereo pairs instead of just rendering perfect depth and adding random noise?

Chapter 6: Training

Training a ViT-Large on 10M RGB-D samples with masked depth modeling involves several careful design decisions.

Masking Strategy

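The masking ratio during training ranges from 60% to 90%. The distribution of masks varies by data source: the self-curated synthetic and real captures supply natural masks from SGM and hardware failures, while datasets whose depth has no missing values receive random patch masks.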

Encoder Initialization

The ViT-Large encoder is initialized from DINOv2 pretrained weights. This gives the model strong visual features from the start — it already understands edges, textures, objects. The MDM pretraining then teaches it to also reason about geometry.

Differential Learning Rates

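The pretrained encoder and randomly initialized decoder have very different optimization needs: the encoder trains with a learning rate 10x lower than the decoder's, which protects its pretrained features while the decoder learns from scratch.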

Training Schedule

Parameter | Value
Total iterations | 250,000
Global batch size | 1,024 (128 GPUs x 8)
Optimizer | AdamW (beta1=0.9, beta2=0.999, wd=0.05)
Warmup | 2,000 iterations (encoder), none (decoder)
LR decay | Step decay: 0.5x every 25K iterations
Gradient clipping | Max norm 1.0
Precision | Mixed (BF16)
Wall time | ~7.5 days on 128 GPUs
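The schedule in the table can be sketched as a pure function of the step count. The base learning rate of 1e-4, the linear warmup shape, and the function names are assumptions for illustration; the table gives only the warmup length, the decay rule, and (via this chapter's closing question) the 10x encoder/decoder ratio.

```python
DECODER_LR = 1e-4              # hypothetical base value
ENCODER_LR = DECODER_LR / 10   # 10x lower for the pretrained encoder


def learning_rate(step, base_lr, warmup_steps=0,
                  decay_every=25_000, decay_factor=0.5):
    """Step decay every `decay_every` iterations, with optional linear warmup."""
    scale = decay_factor ** (step // decay_every)
    if warmup_steps and step < warmup_steps:
        scale *= (step + 1) / warmup_steps  # assumed linear ramp
    return base_lr * scale


def lrs_at(step):
    """Learning rates for the two parameter groups at a given iteration."""
    return {
        "encoder": learning_rate(step, ENCODER_LR, warmup_steps=2_000),
        "decoder": learning_rate(step, DECODER_LR),
    }


for step in (0, 1_999, 25_000, 249_999):
    print(step, lrs_at(step))
```

With 250,000 iterations and a halving every 25K, both rates are halved nine times by the final iteration.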

Loss and Augmentation

The loss is simple: L1 loss on predicted depth, computed only at pixels with valid ground-truth values. Data augmentation includes random resized cropping, horizontal flipping, color jittering, JPEG compression artifacts, motion blur, and shot noise — all applied to the RGB image to improve robustness.
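A masked L1 loss of this kind can be written in a few lines. The sketch below works on flattened pixel lists and averages over valid pixels; the mean reduction and the function name are assumptions.

```python
def masked_l1(pred, target, valid):
    """Mean absolute error over pixels whose ground truth is valid."""
    total = 0.0
    count = 0
    for p, t, ok in zip(pred, target, valid):
        if ok:
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0


pred = [1.0, 2.0, 5.0]
target = [1.5, 2.0, 0.0]      # last value is a placeholder for missing ground truth
valid = [True, True, False]
print(masked_l1(pred, target, valid))  # prints 0.25
```

The third pixel has a large raw error, but it contributes nothing because its ground truth is marked invalid.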

No depth augmentation: The depth modality is not artificially corrupted beyond the natural/random masking. The model sees the raw (possibly noisy) depth values as-is. The masking is the augmentation for depth.
Why does LingBot-Depth use a 10x lower learning rate for the encoder than the decoder?

Chapter 7: Results

LingBot-Depth is evaluated on three core tasks: depth completion, monocular depth estimation, and stereo matching prior initialization. The results are striking.

Depth Completion

Two protocols test depth completion under increasing difficulty:

Protocol 1: Block-wise masking. Ground-truth depth is corrupted with random block masks and Gaussian+shot noise at four severity levels (easy/medium/hard/extreme). Evaluated on iBims, NYUv2, DIODE.

Method | Easy RMSE | Hard RMSE | Extreme RMSE
OMNI-DC | 0.476 | 2.053 | 2.214
PromptDA | 0.298 | 0.607 | 2.587
PriorDA | 0.409 | 0.845 | 2.734
LingBot-Depth | 0.175 | 0.345 | 2.011

Protocol 2: Sparse SfM inputs. Only highly sparse SfM point clouds serve as depth input (ETH3D). This is the harder test.

Method | Indoor RMSE | Outdoor RMSE
OMNI-DC | 0.605 | 1.069
PriorDA | 0.360 | 1.238
LingBot-Depth | 0.192 | 0.664
Indoors, that is a 47% RMSE reduction compared to the best baseline (PriorDA). Outdoors, it is a 38% reduction compared to the best baseline there (OMNI-DC). On the extreme block-masking setting, LingBot-Depth is the only method that keeps RMSE below 2.1; all others exceed 2.2.
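The percentages follow directly from the Protocol 2 table. As a quick check, using the lowest baseline RMSE in each column (PriorDA indoors, OMNI-DC outdoors); the helper name is just for this example:

```python
def relative_reduction(baseline, ours):
    """Fractional drop in error relative to the baseline."""
    return (baseline - ours) / baseline


indoor = relative_reduction(0.360, 0.192)    # best indoor baseline: PriorDA
outdoor = relative_reduction(1.069, 0.664)   # best outdoor baseline: OMNI-DC
print(f"indoor: {indoor:.1%}, outdoor: {outdoor:.1%}")
# prints indoor: 46.7%, outdoor: 37.9%
```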

Monocular Depth Estimation

When used as a backbone initializer for MoGe (replacing DINOv2), MDM pretraining consistently improves depth and point map accuracy across 10 benchmarks. Relative error drops from 0.056 to 0.044 (affine-invariant), confirming that geometric knowledge learned during MDM transfers to monocular settings.

Stereo Matching Prior

As a depth prior for FoundationStereo, the MDM-pretrained encoder converges faster and achieves better final performance than both DINOv2 and DepthAnythingV2 initializations. At epoch 5, LingBot-Depth already outperforms the vanilla baseline at epoch 15 on most benchmarks.

Results: Depth Completion RMSE

RMSE comparison across methods on iBims benchmark (Protocol 1). Lower is better. LingBot-Depth (warm bar) dominates at every difficulty level.

Under the sparse SfM protocol (Protocol 2), how much does LingBot-Depth reduce indoor RMSE compared to the best baseline?

Chapter 8: Downstream Applications

A depth model is only as good as what you can build with it. LingBot-Depth enables three compelling downstream applications — all without any task-specific fine-tuning.

Video Depth Completion

Despite being trained on static images only, LingBot-Depth produces temporally consistent depth on video input. Testing on 30 FPS captures from Orbbec Gemini-335 in challenging scenes (glass lobbies, aquarium tunnels, gyms with mirrors), the model fills in massive sensor gaps and maintains smooth depth across frames.

In the aquarium tunnel test, the ZED stereo camera almost entirely fails due to refractive glass surfaces. LingBot-Depth produces geometrically plausible depth throughout the sequence. No temporal modeling, no fine-tuning — pure zero-shot generalization from MDM pretraining.

3D Point Tracking

By plugging LingBot-Depth's refined depth into SpatialTrackerV2 (replacing its default VGGT depth frontend), both camera tracking and dynamic object tracking improve significantly. In indoor scenes with extensive glass surfaces where raw depth fails, the refined depth produces smoother, more accurate camera trajectories. For dynamic objects (scooters, rowing machines), tracked 3D point trajectories show coherent motion patterns.

Drop-in replacement: LingBot-Depth doesn't require any modification to SpatialTrackerV2. It's a drop-in depth estimator that produces better input, which cascades into better tracking. This is the power of foundation models: improve one component, improve everything downstream.

Dexterous Grasping

The most striking application: a robotic dexterous grasping pipeline using a Rokae XMate-SR5 arm with an X Hand-1 dexterous hand. The grasping policy (a diffusion policy conditioned on DINOv2 RGB features + Point Transformer point cloud features) is trained on HOI4D human hand-object interactions retargeted to the robot hand.

The key: the point cloud comes from LingBot-Depth's predictions, not the raw sensor. This enables grasping of objects that defeat conventional sensors, such as transparent glass cups.
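Turning a depth map into a point cloud is standard pinhole back-projection. The sketch below assumes known camera intrinsics (`fx`, `fy`, `cx`, `cy`) and treats non-positive or `None` depth as missing; the function name is hypothetical and this is not the grasping pipeline's actual code.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (rows of metric depths) into camera-frame 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue  # a hole: no 3D point can be produced here
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


depth = [[1.0, 0.0],
         [2.0, 2.0]]
print(backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5))
```

With raw sensor depth, every hole on a glass cup removes points from exactly the object the policy needs to grasp; with a completed depth map those pixels yield points like any others.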

Aligned Latent Representations

An additional benefit: the attention visualization shows that depth tokens consistently attend to spatially corresponding RGB regions. Different depth queries in the same scene attend to distinct, position-aware regions. The encoder learns fine-grained geometric-appearance correspondences, producing latent representations where RGB and depth are naturally aligned.

The bigger picture: LingBot-Depth isn't just a depth completion model — it's a spatial perception foundation. The MDM-pretrained encoder produces latent features where visual appearance and 3D geometry are aligned. This makes the features immediately useful for any downstream task that needs to reason about 3D space.
Why can LingBot-Depth enable grasping of transparent glass cups that defeat conventional depth sensors?

Chapter 9: Connections

LingBot-Depth sits at the intersection of self-supervised learning, depth estimation, and sensor fusion. Let's map where it fits in the landscape.

Relation to MAE

MAE (Masked Autoencoders) masks and reconstructs image patches. MDM masks and reconstructs depth patches while conditioning on the full RGB image. The key innovation is replacing random masks with sensor-driven natural masks that create a harder, more informative pretraining signal.

Relation to Depth Anything / Metric3D

Depth Anything and Metric3D are monocular depth estimators — they take only RGB as input. LingBot-Depth subsumes this capability (mask all depth tokens = monocular mode) but also supports depth completion when sensor data is available. The MDM-pretrained encoder even outperforms DINOv2 as an initialization for monocular depth models.

Relation to Depth Completion Methods

OMNI-DC, PromptDA, and PriorDA are dedicated depth completion models. They treat the task in isolation. LingBot-Depth's MDM pretraining learns depth reasoning as a byproduct of self-supervised masked modeling, which provides stronger generalization across mask patterns and sparsity levels.

Relation to FoundationStereo

FoundationStereo is a stereo matching model that uses a monocular depth prior. LingBot-Depth serves as a stronger prior than DepthAnythingV2 for this purpose, demonstrating that MDM pretraining distills 3D geometric knowledge more effectively than standard visual pretraining.

Cheat Sheet

Aspect | LingBot-Depth
Input | RGB image + (optionally corrupted) depth map
Output | Dense, metric-scale depth map
Backbone | ViT-Large/14 (DINOv2 init)
Decoder | ConvStack (hierarchical conv pyramid)
Pretraining | Masked Depth Modeling (60-90% mask ratio)
Training data | ~10M samples (3M self-curated + 7M public)
Loss | L1 on valid ground-truth depth pixels
Key insight | Sensor failures = natural masks = harder than random
Unification | Monocular (100% mask) ↔ completion (natural mask)
Key result | 47% RMSE reduction on sparse SfM depth completion
The broader lesson: When your data has natural corruptions — sensor failures, missing labels, noisy measurements — don't discard them. They are telling you exactly which inputs are hard. Use them as the training signal. The corruption pattern itself encodes domain knowledge about the problem's difficulty landscape.
What is the key difference between MAE's masking and LingBot-Depth's masking?