Sensor failures aren't noise to discard — they're a learning signal. Treat missing depth as natural masks, and a single model unifies monocular depth estimation and depth completion. LingBot-Depth outperforms top-tier RGB-D cameras.
You're building a robot that needs to grasp a glass cup on a kitchen counter. Your RGB-D camera — a RealSense, an Orbbec, a Kinect — gives you both a color image and a depth map. In theory, that's everything you need: the color tells you what is there, the depth tells you where it is in 3D.
But look at the depth map. The glass cup? Gone. A black hole where the cup should be. The polished countertop? Half missing. The mirror on the wall? Completely absent. Your depth sensor has failed on exactly the surfaces that matter most.
This isn't a rare edge case. It happens constantly in real-world deployments. RGB-D cameras work by projecting infrared patterns (structured light) or matching stereo pairs. Both methods fail catastrophically on specular surfaces, textureless regions, and transparent objects.
There are three paradigms for getting 3D geometry, and each has a fatal flaw:
RGB-D cameras are the only option that gives you metric scale, dense geometry, and real-time speed simultaneously. If we could fix their holes, we'd have the ideal depth sensor. That's exactly what LingBot-Depth does.
Figure (interactive in the original): a simulated depth map with sensor failures, where black regions are missing depth. As the scene contains more specular or textureless surfaces, the holes cluster on glass, mirrors, and smooth surfaces rather than appearing at random.
Here's the conventional approach to depth map holes: discard the missing pixels, interpolate from neighbors, or apply a learned inpainting model. All of these treat the holes as unfortunate noise — something to clean up and move past.
LingBot-Depth flips this entirely: the holes are the training signal.
Think about it. Where do sensor failures happen? Not randomly. They happen on specular surfaces, textureless regions, transparent objects — places where the geometry is ambiguous to the sensor. These are the hardest depth estimation challenges. And your sensor is telling you, for free, exactly which pixels are hard.
Why are natural masks better than random masks? Consider two masking strategies:
Natural masks create a much harder reconstruction task. And harder tasks force the model to learn deeper representations. The model can't just interpolate from neighbors — it must understand the full RGB context to infer what a mirror's depth should be, or that a glass cup has the same shape as its visible rim.
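To make the "can't just interpolate from neighbors" point concrete, here is a minimal sketch that compares a clustered (natural-style) hole with the same number of randomly scattered missing pixels. It assumes invalid depth is stored as `0.0` or NaN, which is a common convention but not something this article specifies. The helper names and the neighbor-based metric are illustrative, not part of LingBot-Depth.

```python
import random

def validity_mask(depth, invalid_value=0.0):
    """True where depth is usable. Assumes holes are stored as invalid_value or NaN."""
    return [[(d == d) and d != invalid_value for d in row] for row in depth]

def random_mask(h, w, n_missing, seed=0):
    """Validity mask with n_missing pixels dropped uniformly at random."""
    rng = random.Random(seed)
    missing = set(rng.sample(range(h * w), n_missing))
    return [[(y * w + x) not in missing for x in range(w)] for y in range(h)]

def missing_neighbor_fraction(valid):
    """For missing pixels, the fraction of their 4-neighbors that are also missing.

    A high value means a missing pixel has few valid neighbors to interpolate from.
    """
    h, w = len(valid), len(valid[0])
    total = missing = 0
    for y in range(h):
        for x in range(w):
            if valid[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    total += 1
                    missing += not valid[ny][nx]
    return missing / total if total else 0.0

# An 8x8 depth map with a 4x4 hole, e.g. where a glass object sits.
depth = [[0.0 if 2 <= y <= 5 and 2 <= x <= 5 else 1.5 for x in range(8)]
         for y in range(8)]
natural = validity_mask(depth)
scattered = random_mask(8, 8, n_missing=16, seed=0)

natural_frac = missing_neighbor_fraction(natural)
scattered_frac = missing_neighbor_fraction(scattered)
print(natural_frac, scattered_frac)
```

Both masks hide 16 of 64 pixels, but in the clustered case most missing pixels are surrounded by other missing pixels, so local interpolation has little to work with.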
Let's formalize the idea. Masked Depth Modeling (MDM) follows the general paradigm of Masked Image Modeling (like MAE), but shifts the reconstruction target from appearance to geometry.
In standard MAE, you split an image into patches, randomly mask ~75% of them, feed the visible patches through a ViT encoder, then use a decoder to reconstruct the masked patches' pixel values. The model learns rich visual representations because it must understand scene structure to fill in the gaps.
MDM takes two inputs: an RGB image and a (corrupted) depth map. The key differences from MAE:
Since LingBot-Depth uses a ViT with a patch size of 14, each depth token covers a 14x14-pixel region. The masking decision is therefore made at the patch level:
Standard MAE uses a shallow transformer decoder. LingBot-Depth replaces this with a ConvStack decoder — a hierarchical convolutional architecture that progressively upsamples from patch-resolution to pixel-resolution. This is better suited for dense geometric prediction because convolutions naturally preserve spatial locality and smooth gradients.
Here's the most elegant consequence of masked depth modeling: the same architecture, with the same weights, naturally handles two seemingly different tasks — just by changing the mask ratio.
When you have an RGB-D camera that produced a depth map with holes, you mask only the invalid (sensor-corrupted) tokens. The valid depth tokens flow through the encoder alongside all RGB tokens. The encoder fuses the sparse valid depth with rich visual context, and the decoder fills in the gaps.
This is standard depth completion: given RGB + sparse/incomplete depth, produce dense depth.
Now imagine masking all depth tokens. Zero depth information enters the encoder. The model has only RGB tokens to work with. Yet it still predicts a complete depth map.
This is monocular depth estimation: given only RGB, infer depth from visual cues alone.
This means a single model can handle:
No architecture changes. No retraining. Just different masking.
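The two modes can be sketched as a single mask-selection function. This is an illustration under assumptions: the rule that turns pixel validity into a per-patch flag (`max_invalid_frac`) is hypothetical, since the exact rule is not given here, and the function names are invented for this example.

```python
def patch_invalid_flags(valid, patch=14, max_invalid_frac=0.0):
    """Flag each patch whose share of invalid pixels exceeds max_invalid_frac.

    valid: 2D list of bools whose height and width are multiples of `patch`.
    The threshold is a hypothetical parameter, not the published rule.
    Returns a flat, row-major list with one bool per patch.
    """
    h, w = len(valid), len(valid[0])
    flags = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            n_invalid = sum(not valid[y][x]
                            for y in range(py, py + patch)
                            for x in range(px, px + patch))
            flags.append(n_invalid / (patch * patch) > max_invalid_frac)
    return flags

def depth_token_mask(patch_invalid, mode):
    """True = depth token is hidden from the encoder."""
    if mode == "monocular":
        return [True] * len(patch_invalid)   # no depth enters the encoder
    if mode == "completion":
        return list(patch_invalid)           # hide only sensor-corrupted patches
    raise ValueError(f"unknown mode: {mode}")

# 28x28 validity map (2x2 patches) with one bad pixel in the top-left patch.
valid = [[True] * 28 for _ in range(28)]
valid[3][5] = False
flags = patch_invalid_flags(valid)

completion_mask = depth_token_mask(flags, "completion")
monocular_mask = depth_token_mask(flags, "monocular")
print(completion_mask, monocular_mask)
```

The model and its weights are untouched in both calls. Only the list of hidden depth tokens changes.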
Figure (interactive in the original): monocular mode (all depth masked) versus depth completion mode (only invalid pixels masked). The same architecture handles both tasks by changing what gets masked.
LingBot-Depth uses a standard Vision Transformer (ViT-Large) as its encoder, but with a carefully designed input pipeline and decoder that enable masked depth modeling.
The two input modalities (3-channel RGB and 1-channel depth) are processed by separate patch embedding layers. Each modality is independently projected into a sequence of patch tokens on the same spatial grid, with one token per 14x14-pixel patch. This separation lets the self-attention layers learn how to integrate appearance and geometry, rather than forcing them into the same embedding space from the start.
Each token receives two types of positional information:
The final positional encoding is the sum of both. This lets the model know both where a token is and what type it is.
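As a toy illustration of summing a "where" encoding and a "what type" encoding, the sketch below uses tiny hand-written lookup tables. The table contents, the 4-dimensional size, and the function name are all made up for the example. In a real model these would be learned or computed embeddings of the encoder's width.

```python
# Hypothetical lookup tables: one vector per grid location, one per modality.
SPATIAL = {(0, 0): [1, 0, 0, 0], (0, 1): [0, 1, 0, 0]}
MODALITY = {"rgb": [0, 0, 1, 0], "depth": [0, 0, 0, 1]}

def positional_encoding(location, modality):
    """Element-wise sum of the spatial and the modality encoding."""
    return [s + m for s, m in zip(SPATIAL[location], MODALITY[modality])]

rgb_pos = positional_encoding((0, 1), "rgb")
depth_pos = positional_encoding((0, 1), "depth")
print(rgb_pos, depth_pos)
```

Two tokens at the same grid location share the spatial part and differ only in the modality part, which is what lets attention tell an RGB token from the depth token covering the same patch.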
After masking, the full set of RGB tokens and the unmasked depth tokens are concatenated and fed into a 24-layer ViT-Large encoder (initialized from DINOv2). The self-attention mechanism allows every RGB token to attend to every depth token and vice versa. This is where the magic happens: depth tokens at a given location attend to RGB tokens at the same and nearby locations, learning to correlate appearance with geometry.
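A quick way to see what masking does to the encoder input is to count tokens. The sketch below assumes the image height and width are multiples of the patch size and leaves out any extra tokens such as [CLS]. The function name is illustrative.

```python
def encoder_sequence_length(height, width, depth_mask_ratio, patch=14):
    """Number of RGB tokens plus unmasked depth tokens fed to the encoder."""
    n_patches = (height // patch) * (width // patch)
    n_depth_masked = round(n_patches * depth_mask_ratio)
    return n_patches + (n_patches - n_depth_masked)

# A 224x224 input gives a 16x16 grid, i.e. 256 tokens per modality.
for ratio in (0.0, 0.75, 1.0):
    print(ratio, encoder_sequence_length(224, 224, ratio))
```

At a 100% mask ratio the sequence contains RGB tokens only, which corresponds to the monocular setting.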
After the encoder, the depth tokens are discarded. Only the enriched RGB tokens (which now carry geometric information absorbed from the depth tokens via attention) are fed to the decoder. A [CLS] token capturing global scene context is broadcast and added to every token. The ConvStack decoder then progressively upsamples through residual blocks and transposed convolutions, doubling the resolution at each stage until it reaches 16x the patch-grid resolution. UV positional encodings at each scale preserve spatial layout. A final bilinear upsampling step matches the original input resolution.
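The upsampling arithmetic can be checked with a few lines. This sketch only tracks feature-map sizes. It assumes that "doubling until 16x" means four 2x stages and says nothing about channel counts or layer details.

```python
import math

def decoder_resolutions(grid_h, grid_w, total_upsample=16):
    """Feature-map sizes after each 2x stage, starting from the patch grid."""
    n_stages = int(math.log2(total_upsample))   # 16x -> four doublings
    sizes = [(grid_h, grid_w)]
    for _ in range(n_stages):
        h, w = sizes[-1]
        sizes.append((h * 2, w * 2))
    return sizes

# A 224x224 input with 14-pixel patches starts from a 16x16 grid.
sizes = decoder_resolutions(16, 16)
print(sizes)
```

For a 224x224 input the last stage is 256x256, so the final bilinear step resizes the prediction back to 224x224.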
Figure (interactive in the original): three panels showing the RGB input (always complete), the masked depth tokens, and the predicted depth as the mask ratio varies. At 100% masking, the model must rely entirely on RGB.
Training a model to fill in depth sensor failures requires data with both the failures (for masking) and the ground truth (for supervision). This is a chicken-and-egg problem: if you had perfect depth, you wouldn't need the model. LingBot-Depth solves this with a dual-stream pipeline producing roughly 3M training pairs (1.0M synthetic and 2.1M real).
The synthetic branch doesn't just render perfect depth maps — that would miss the point. Instead, it simulates the full imaging process of a real RGB-D camera:
Key numbers: 442 indoor scenes, resolution 960x1280. Each sample contains RGB, perfect depth, stereo pair, ground-truth disparity, and simulated sensor depth. The stereo baseline is randomly sampled between 0.05-0.2m and focal length between 16-28mm for diversity.
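The stereo baseline and focal length matter because they set the disparity for a given depth through the standard rectified-stereo relation, disparity = focal length (in pixels) x baseline / depth. The sketch below is a generic illustration of that relation, not the dataset's rendering code. Converting a focal length in millimeters to pixels needs a sensor width, which is not stated here, so the 36 mm value and the choice of 1280 as the image width are assumptions.

```python
def focal_length_px(focal_mm, image_width_px, sensor_width_mm):
    """Convert a focal length in mm to pixels for a given sensor width."""
    return focal_mm * image_width_px / sensor_width_mm

def disparity_from_depth(depth_m, focal_px, baseline_m):
    """Rectified stereo: disparity in pixels for a point at depth_m."""
    return focal_px * baseline_m / depth_m

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Inverse of disparity_from_depth."""
    return focal_px * baseline_m / disparity_px

# Assumed values: 36 mm sensor width, 1280-pixel image width.
f_px = focal_length_px(16.0, image_width_px=1280, sensor_width_mm=36.0)
for baseline in (0.05, 0.2):
    print(baseline, disparity_from_depth(2.0, f_px, baseline))
```

A wider baseline or longer focal length gives larger disparities at the same depth, so sampling both ranges varies how hard the stereo matching problem is.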
The real-world branch uses a custom, 3D-printed capture rig that mounts multiple commercial RGB-D cameras (Intel RealSense, Orbbec Gemini, ZED) alongside a portable PC with a touchscreen. The modular design lets operators swap cameras easily.
Since real captures don't have perfect ground-truth depth, the team computes pseudo-depth labels from the left-right IR stereo pairs using a FoundationStereo-based network trained on the synthetic data. Left-right consistency checks filter out unreliable pixels.
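A left-right consistency check can be sketched on a single image row. This is the textbook form of the check, not necessarily the exact variant used for LingBot-Depth. The one-pixel tolerance and the function name are assumptions.

```python
def lr_consistency_row(disp_left, disp_right, max_diff=1.0):
    """Keep a left-image pixel only if the right image agrees on its disparity.

    A left pixel at column x with disparity d matches column x - d in the
    right image. The pixel is kept when that column exists and its disparity
    differs from d by at most max_diff (a hypothetical tolerance).
    """
    width = len(disp_left)
    keep = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))
        keep.append(0 <= xr < width and abs(d - disp_right[xr]) <= max_diff)
    return keep

disp_left = [2, 2, 2, 2, 2, 2]
disp_right = [2, 2, 2, 9, 2, 2]   # column 3 disagrees, e.g. a bad match
reliable = lr_consistency_row(disp_left, disp_right)
print(reliable)
```

Pixels that fail the check would be excluded from supervision rather than trusted as pseudo ground truth.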
Scene diversity spans 30+ categories: homes, offices, hospitals, airports, museums, gyms, parking garages, and outdoor environments.
Seven open-source RGB-D datasets supplement the pipeline: Taskonomy (4.6M), ScanNet++ (0.8M), TartanAir (0.6M), ARKitScenes (0.5M), and others. For synthetic datasets with no missing depth, random patch masking simulates sensor failures. Total training: ~10M samples.
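Random patch masking for data with complete depth can be sketched as below. The uniform sampling of patch indices and the function name are assumptions. The source only states that random patch masks stand in for sensor failures.

```python
import random

def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid of bools, True = depth patch is masked.

    Exactly round(grid_h * grid_w * mask_ratio) patches are masked,
    chosen uniformly at random (an assumed sampling scheme).
    """
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    n_masked = round(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [[(r * grid_w + c) in masked for c in range(grid_w)]
            for r in range(grid_h)]

mask = random_patch_mask(16, 16, mask_ratio=0.75, seed=0)
n_masked = sum(cell for row in mask for cell in row)
print(n_masked, "of", 16 * 16, "depth patches masked")
```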
| Source | Samples | Type | Natural masks? |
|---|---|---|---|
| LingBot-Depth-S (synthetic) | 1.0M | Simulated SGM failures | Yes (SGM) |
| LingBot-Depth-R (real) | 2.1M | Real sensor failures | Yes (hardware) |
| Taskonomy | 4.6M | Supplementary | Random |
| ScanNet++ | 0.8M | Supplementary | Partial |
| Others (5 datasets) | ~1.5M | Supplementary | Mixed |
Training a ViT-Large on 10M RGB-D samples with masked depth modeling involves several careful design decisions.
The masking ratio during training ranges from 60% to 90%. The distribution of masks varies by data source:
The ViT-Large encoder is initialized from DINOv2 pretrained weights. This gives the model strong visual features from the start — it already understands edges, textures, objects. The MDM pretraining then teaches it to also reason about geometry.
The pretrained encoder and randomly initialized decoder have very different optimization needs:
| Parameter | Value |
|---|---|
| Total iterations | 250,000 |
| Global batch size | 1,024 (128 GPUs x 8 samples per GPU) |
| Optimizer | AdamW (beta1=0.9, beta2=0.999, wd=0.05) |
| Warmup | 2,000 iterations (encoder), none (decoder) |
| LR decay | Step decay: 0.5x every 25K iterations |
| Gradient clipping | Max norm 1.0 |
| Precision | Mixed (BF16) |
| Wall time | ~7.5 days on 128 GPUs |
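The warmup and step-decay rows of the table can be read as a learning-rate multiplier over time. The sketch below assumes linear warmup, which the table does not specify, and takes the base learning rate as a parameter because its value is not given here.

```python
def lr_at(iteration, base_lr, warmup_iters=2000,
          decay_every=25_000, decay_factor=0.5):
    """Learning rate at a given iteration: warmup followed by step decay.

    Linear warmup is an assumption. Pass warmup_iters=0 for parameters
    that use no warmup (the decoder, per the table).
    """
    decay = decay_factor ** (iteration // decay_every)
    if warmup_iters > 0:
        warm = min(1.0, (iteration + 1) / warmup_iters)
    else:
        warm = 1.0
    return base_lr * warm * decay

# base_lr=1.0 shows the multiplier only; real base rates are not stated here.
for it in (0, 1_999, 25_000, 249_999):
    print(it, lr_at(it, base_lr=1.0), lr_at(it, base_lr=1.0, warmup_iters=0))
```

Over 250,000 iterations the rate is halved nine times before the final step, so training ends at 1/512 of the base rate.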
The loss is simple: L1 loss on predicted depth, computed only at pixels with valid ground-truth values. Data augmentation includes random resized cropping, horizontal flipping, color jittering, JPEG compression artifacts, motion blur, and shot noise — all applied to the RGB image to improve robustness.
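A masked L1 loss of this kind fits in a few lines. The sketch uses plain Python lists for clarity and averages over valid pixels. The choice of a mean over a sum is an assumption.

```python
def masked_l1(pred, target, valid):
    """Mean absolute error over pixels where ground truth is valid."""
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, valid):
        for p, t, v in zip(p_row, t_row, v_row):
            if v:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0

pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.5, 0.0], [3.0, 5.0]]        # 0.0 marks missing ground truth
valid = [[True, False], [True, True]]
loss = masked_l1(pred, target, valid)
print(loss)
```

The pixel with missing ground truth contributes nothing, so the model is never penalized for predictions it cannot be checked against.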
LingBot-Depth is evaluated on three core tasks: depth completion, monocular depth estimation, and stereo matching prior initialization. The results are striking.
Two protocols test depth completion under increasing difficulty:
Protocol 1: Block-wise masking. Ground-truth depth is corrupted with random block masks and Gaussian+shot noise at four severity levels (easy/medium/hard/extreme). Evaluated on iBims, NYUv2, DIODE.
| Method | Easy RMSE | Hard RMSE | Extreme RMSE |
|---|---|---|---|
| OMNI-DC | 0.476 | 2.053 | 2.214 |
| PromptDA | 0.298 | 0.607 | 2.587 |
| PriorDA | 0.409 | 0.845 | 2.734 |
| LingBot-Depth | 0.175 | 0.345 | 2.011 |
Protocol 2: Sparse SfM inputs. Only highly sparse SfM point clouds serve as depth input (ETH3D). This is the harder test.
| Method | Indoor RMSE | Outdoor RMSE |
|---|---|---|
| OMNI-DC | 0.605 | 1.069 |
| PriorDA | 0.360 | 1.238 |
| LingBot-Depth | 0.192 | 0.664 |
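For readers who want to turn the table into relative improvements, the sketch below defines RMSE and computes the reduction against the best baseline in each column. The numbers are copied from the table above. The helper names are illustrative.

```python
import math

def rmse(pred, target):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers the baseline error."""
    return (baseline - ours) / baseline

# ETH3D sparse-SfM results from the table (RMSE, lower is better).
indoor = relative_reduction(baseline=0.360, ours=0.192)    # vs. PriorDA
outdoor = relative_reduction(baseline=1.069, ours=0.664)   # vs. OMNI-DC
print(f"indoor: {indoor:.1%}, outdoor: {outdoor:.1%}")
```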
When used as a backbone initializer for MoGe (replacing DINOv2), MDM pretraining consistently improves depth and point map accuracy across 10 benchmarks. Relative error drops from 0.056 to 0.044 (affine-invariant), confirming that geometric knowledge learned during MDM transfers to monocular settings.
As a depth prior for FoundationStereo, the MDM-pretrained encoder converges faster and achieves better final performance than both DINOv2 and DepthAnythingV2 initializations. At epoch 5, LingBot-Depth already outperforms the vanilla baseline at epoch 15 on most benchmarks.
Figure: RMSE comparison across methods on the iBims benchmark (Protocol 1). Lower is better. LingBot-Depth has the lowest RMSE at every difficulty level.
A depth model is only as good as what you can build with it. LingBot-Depth enables three compelling downstream applications — all without any task-specific fine-tuning.
Despite being trained on static images only, LingBot-Depth produces temporally consistent depth on video input. Testing on 30 FPS captures from Orbbec Gemini-335 in challenging scenes (glass lobbies, aquarium tunnels, gyms with mirrors), the model fills in massive sensor gaps and maintains smooth depth across frames.
In the aquarium tunnel test, the ZED stereo camera almost entirely fails due to refractive glass surfaces. LingBot-Depth produces geometrically plausible depth throughout the sequence. No temporal modeling, no fine-tuning — pure zero-shot generalization from MDM pretraining.
By plugging LingBot-Depth's refined depth into SpatialTrackerV2 (replacing its default VGGT depth frontend), both camera tracking and dynamic object tracking improve significantly. In indoor scenes with extensive glass surfaces where raw depth fails, the refined depth produces smoother, more accurate camera trajectories. For dynamic objects (scooters, rowing machines), tracked 3D point trajectories show coherent motion patterns.
The most striking application: a robotic dexterous grasping pipeline using a Rokae XMate-SR5 arm with an X Hand-1 dexterous hand. The grasping policy (a diffusion policy conditioned on DINOv2 RGB features + Point Transformer point cloud features) is trained on HOI4D human hand-object interactions retargeted to the robot hand.
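Point cloud features require lifting a depth map into 3D first. The sketch below shows the standard pinhole back-projection, assuming known camera intrinsics (`fx`, `fy`, `cx`, `cy`) and treating non-positive depth as missing. It is a generic illustration, not the pipeline's actual code.

```python
def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (meters) into a list of (X, Y, Z) camera-frame points.

    Pixels with depth <= 0 are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# Toy 2x2 depth map with two holes and unit intrinsics for easy checking.
depth = [[0.0, 2.0],
         [1.0, 0.0]]
cloud = backproject(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(cloud)
```

Every hole in the depth map is a missing point in the cloud, which is why a completed depth map gives a grasping policy more geometry to work with on glass or reflective objects.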
The key: the point cloud comes from LingBot-Depth's predictions, not the raw sensor. This enables grasping of objects that defeat conventional sensors:
An additional benefit: the attention visualization shows that depth tokens consistently attend to spatially corresponding RGB regions. Different depth queries in the same scene attend to distinct, position-aware regions. The encoder learns fine-grained geometric-appearance correspondences, producing latent representations where RGB and depth are naturally aligned.
LingBot-Depth sits at the intersection of self-supervised learning, depth estimation, and sensor fusion. Let's map where it fits in the landscape.
MAE (Masked Autoencoders) masks and reconstructs image patches. MDM masks and reconstructs depth patches while conditioning on the full RGB image. The key innovation is replacing random masks with sensor-driven natural masks that create a harder, more informative pretraining signal.
Depth Anything and Metric3D are monocular depth estimators — they take only RGB as input. LingBot-Depth subsumes this capability (mask all depth tokens = monocular mode) but also supports depth completion when sensor data is available. The MDM-pretrained encoder even outperforms DINOv2 as an initialization for monocular depth models.
OMNI-DC, PromptDA, and PriorDA are dedicated depth completion models. They treat the task in isolation. LingBot-Depth's MDM pretraining learns depth reasoning as a byproduct of self-supervised masked modeling, which provides stronger generalization across mask patterns and sparsity levels.
FoundationStereo is a stereo matching model that uses a monocular depth prior. LingBot-Depth serves as a stronger prior than DepthAnythingV2 for this purpose, demonstrating that MDM pretraining distills 3D geometric knowledge more effectively than standard visual pretraining.
| Aspect | LingBot-Depth |
|---|---|
| Input | RGB image + (optionally corrupted) depth map |
| Output | Dense, metric-scale depth map |
| Backbone | ViT-Large/14 (DINOv2 init) |
| Decoder | ConvStack (hierarchical conv pyramid) |
| Pretraining | Masked Depth Modeling (60-90% mask ratio) |
| Training data | ~10M samples (3M self-curated + 7M public) |
| Loss | L1 on valid ground-truth depth pixels |
| Key insight | Sensor failures = natural masks = harder than random |
| Unification | Monocular (100% mask) ↔ completion (natural mask) |
| Key result | 47% lower indoor RMSE than the best baseline on sparse SfM depth completion (ETH3D: 0.192 vs. 0.360) |