LingBot-Depth — Veanors

Chapter 0: The Problem

You're building a robot that needs to grasp a glass cup on a kitchen counter. Your RGB-D camera — a RealSense, an Orbbec, a Kinect — gives you both a color image and a depth map. In theory, that's everything you need: the color tells you what is there, the depth tells you where it is in 3D.

But look at the depth map. The glass cup? Gone. A black hole where the cup should be. The polished countertop? Half missing. The mirror on the wall? Completely absent. Your depth sensor has failed on exactly the surfaces that matter most.

This isn't a rare edge case. It happens constantly in real-world deployments. RGB-D cameras work by projecting infrared patterns (structured light) or matching stereo pairs. Both methods fail catastrophically on:

Specular surfaces — mirrors, glass, polished metal reflect IR patterns away from the sensor
Textureless surfaces — white walls, smooth tables give stereo matching nothing to latch onto
Transparent objects — glass, water let IR pass straight through
Complex lighting — strong sunlight washes out the IR projector's pattern

The scale of the problem: Even state-of-the-art commercial sensors can have 30-70% of pixels missing on challenging indoor scenes. That's not a depth map — it's a depth suggestion. And yet robots, AR systems, and autonomous vehicles all depend on dense, pixel-aligned depth for safe operation.
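To put a number on this for your own captures, one rough approach is to count the invalid pixels in a depth frame. The sketch below is only an illustration and is not part of LingBot-Depth: the function name `missing_ratio` is hypothetical, and it assumes failed pixels are stored as 0.0, a common convention that should be checked against your camera's SDK.

```python
from typing import List

def missing_ratio(depth: List[List[float]], invalid_value: float = 0.0) -> float:
    """Fraction of pixels whose depth equals the invalid marker."""
    total = sum(len(row) for row in depth)
    if total == 0:
        raise ValueError("empty depth map")
    missing = sum(1 for row in depth for d in row if d == invalid_value)
    return missing / total

# Toy 3x4 depth map in metres; 0.0 marks pixels the sensor could not measure
# (assumed convention).
depth_map = [
    [1.2, 1.2, 0.0, 0.0],
    [1.3, 0.0, 0.0, 0.0],
    [1.3, 1.4, 1.4, 0.0],
]
print(f"missing: {missing_ratio(depth_map):.0%}")  # 6 of 12 pixels -> 50%
```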

There are three paradigms for getting 3D geometry, and each has a fatal flaw:

Multi-view stereo: accurate, but requires multiple viewpoints and expensive post-processing, and fails on textureless regions.
Monocular depth estimation: works from a single image but produces only relative depth, with no metric scale. A 1m wall and a 10m wall look identical.
RGB-D cameras: real-time, metric scale, pixel-aligned, but riddled with holes on the surfaces that matter most.

RGB-D cameras are the only option that gives you metric scale, dense geometry, and real-time speed simultaneously. If we could fix their holes, we'd have the ideal depth sensor. That's exactly what LingBot-Depth does.

Interactive figure: RGB-D Sensor Failure. A simulated depth map with sensor failures, in which black regions are missing depth. As the scene contains more specular and textureless surfaces, holes cluster on glass, mirrors, and smooth surfaces rather than appearing at random.

Why do RGB-D depth sensors produce holes in the depth map on specular and transparent surfaces?

Because IR patterns are reflected away (specular) or pass through (transparent), so the sensor's stereo matching or structured-light decoding fails on those pixels
Because the sensor's resolution is too low to capture fine details
Because glass objects are too thin for the depth range of the sensor

Chapter 1: The Key Insight

Here's the conventional approach to depth map holes: discard the missing pixels, interpolate from neighbors, or apply a learned inpainting model. All of these treat the holes as unfortunate noise — something to clean up and move past.

LingBot-Depth flips this entirely: the holes are the training signal.

Think about it. Where do sensor failures happen? Not randomly. They happen on specular surfaces, textureless regions, transparent objects — places where the geometry is ambiguous to the sensor. These are the hardest depth estimation challenges. And your sensor is telling you, for free, exactly which pixels are hard.

The paradigm shift: Missing depth from sensors isn't noise to discard — it's a natural mask. Just like MAE randomly masks image patches and trains a model to reconstruct them, LingBot-Depth uses sensor failures as masks and trains a model to predict the missing depth from RGB context.

Why are natural masks better than random masks? Consider two masking strategies:

Random masking (standard MAE): mask 75% of patches uniformly at random. Some masked patches are easy (a flat wall segment), others are hard (an object boundary). The difficulty is random.
Natural masking (LingBot-Depth): mask the patches where the sensor failed. Every naturally masked patch is hard, because the sensor failed there for a geometric or material reason. The model must learn to reason about specular reflections, transparency, and texturelessness.

Natural masks create a much harder reconstruction task. And harder tasks force the model to learn deeper representations. The model can't just interpolate from neighbors — it must understand the full RGB context to infer what a mirror's depth should be, or that a glass cup has the same shape as its visible rim.

Free curriculum: Sensor data comes with a natural difficulty distribution. Easy scenes (textured walls, matte objects) have few missing pixels — low mask ratio. Hard scenes (glass lobbies, aquariums) have many missing pixels — high mask ratio. The training data provides a free difficulty curriculum without any manual engineering.

Why are "natural masks" from sensor failures harder than random masks?

Because natural masks always target geometrically ambiguous regions (specular, textureless, transparent) where depth inference requires deep understanding of RGB context, whereas random masks include many trivially interpolatable regions
Because natural masks cover more pixels than random masks
Because the sensor introduces additional noise around mask boundaries

Chapter 2: Masked Depth Modeling

Let's formalize the idea. Masked Depth Modeling (MDM) follows the general paradigm of Masked Image Modeling (like MAE), but shifts the reconstruction target from appearance to geometry.

How MAE Works (Quick Recap)

In standard MAE, you split an image into patches, randomly mask ~75% of them, feed the visible patches through a ViT encoder, then use a decoder to reconstruct the masked patches' pixel values. The model learns rich visual representations because it must understand scene structure to fill in the gaps.

How MDM Differs

MDM takes two inputs: an RGB image and a (corrupted) depth map. The key differences from MAE:

RGB is never masked. The full RGB image is always available as context. Only depth tokens are masked.
Masks come from the sensor. Instead of random masking, patches where the depth sensor failed are masked. Additional random masking fills up to the target ratio (60-90%).
The prediction target is depth, not appearance. The model predicts missing depth values, supervised by ground-truth depth maps.

Patch-Level Masking Decision

Since LingBot-Depth uses a ViT with a 14x14-pixel patch size, each patch token covers a 14x14 block of pixels. The masking decision is made at the patch level:

If a patch has all pixels missing → always masked
If a patch has a mix of valid and invalid pixels → masked with probability 0.75
If masking from the above is insufficient → randomly sample fully valid patches to reach 60-90% mask ratio

Why keep some mixed patches? A patch that's half-valid, half-missing still carries useful geometric information. By keeping some of these unmasked, the model can use the partial depth signal alongside RGB context. This is strictly more informative than discarding them.
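The three masking rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `choose_masked_patches`, the per-patch `valid_fraction` input, and drawing the target ratio uniformly from 60-90% are all hypothetical choices made for the example.

```python
import random
from typing import List, Optional

def choose_masked_patches(
    valid_fraction: List[float],
    target_ratio: float,
    mixed_mask_prob: float = 0.75,
    rng: Optional[random.Random] = None,
) -> List[bool]:
    """Return one flag per patch; True means the depth token is masked.

    valid_fraction[i] is the share of pixels in patch i with valid sensor depth.
    """
    rng = rng or random.Random()
    masked = []
    for frac in valid_fraction:
        if frac == 0.0:        # every pixel missing: always masked
            masked.append(True)
        elif frac < 1.0:       # mixed patch: masked with probability 0.75
            masked.append(rng.random() < mixed_mask_prob)
        else:                  # fully valid: decided in the top-up step
            masked.append(False)

    # Top up with randomly chosen fully valid patches until the target is met
    # (or until no fully valid patches remain).
    needed = round(target_ratio * len(valid_fraction)) - sum(masked)
    fully_valid = [i for i, f in enumerate(valid_fraction) if f == 1.0]
    rng.shuffle(fully_valid)
    for i in fully_valid[:max(needed, 0)]:
        masked[i] = True
    return masked

rng = random.Random(0)
fractions = [0.0] * 4 + [0.5] * 4 + [1.0] * 12   # 20 toy patches
target = rng.uniform(0.6, 0.9)                    # assumed: uniform in 60-90%
mask = choose_masked_patches(fractions, target, rng=rng)
print(f"target {target:.2f}, masked {sum(mask)} of {len(mask)} patches")
```

Note that when natural masks already exceed the target, the sketch leaves them in place and simply skips the top-up step.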

The Decoder

Standard MAE uses a shallow transformer decoder. LingBot-Depth replaces this with a ConvStack decoder — a hierarchical convolutional architecture that progressively upsamples from patch-resolution to pixel-resolution. This is better suited for dense geometric prediction because convolutions naturally preserve spatial locality and smooth gradients.

Key design choice: After the encoder, the latent depth tokens are discarded. Only the latent contextual (RGB) tokens are kept and fed to the ConvStack decoder. The decoder reconstructs the full depth map solely from enriched RGB features. This forces the encoder to transfer all geometric information into the RGB token representations via cross-modal attention.

In MDM, what happens to the depth tokens after the ViT encoder?

They are averaged with the RGB tokens and passed to the decoder
They are discarded: only the latent contextual (RGB) tokens are retained, forcing all depth information to be encoded into RGB representations via attention
They are unmasked and used as skip connections in the decoder

Chapter 3: The Unified Framework

Here's the most elegant consequence of masked depth modeling: the same architecture, with the same weights, naturally handles two seemingly different tasks — just by changing the mask ratio.

Depth Completion (Partial Masking)

When you have an RGB-D camera that produced a depth map with holes, you mask only the invalid (sensor-corrupted) tokens. The valid depth tokens flow through the encoder alongside all RGB tokens. The encoder fuses the sparse valid depth with rich visual context, and the decoder fills in the gaps.

This is standard depth completion: given RGB + sparse/incomplete depth, produce dense depth.

Monocular Depth Estimation (Full Masking)

Now imagine masking all depth tokens. Zero depth information enters the encoder. The model has only RGB tokens to work with. Yet it still predicts a complete depth map.

This is monocular depth estimation: given only RGB, infer depth from visual cues alone.

The unification: Monocular depth estimation and depth completion aren't two different tasks — they're two points on a continuum. The mask ratio is a "slider" from pure monocular (100% masked) to pure completion (only invalid pixels masked). And every point in between works: 80% masked, 50% masked, 20% masked. The model gracefully degrades as depth information decreases.

This means a single model can handle:

A high-quality sensor with 5% missing pixels (mask only those 5%)
A cheap sensor with 60% missing pixels (mask those 60%)
Sparse SfM/SLAM points (mask everything except the few valid patches)
No sensor at all (mask everything, pure monocular)

No architecture changes. No retraining. Just different masking.
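As a rough sketch of "just different masking", the snippet below builds the depth-token mask for the two modes from a per-patch validity list. The names `depth_mask_for_mode` and `mask_ratio` are hypothetical, and the model itself is left out; only the mask construction is shown.

```python
from typing import List

def depth_mask_for_mode(patch_has_depth: List[bool], mode: str) -> List[bool]:
    """Return one flag per patch; True hides that depth token from the encoder."""
    if mode == "monocular":
        return [True] * len(patch_has_depth)         # no depth enters the model
    if mode == "completion":
        return [not ok for ok in patch_has_depth]    # hide only what is missing
    raise ValueError(f"unknown mode: {mode!r}")

def mask_ratio(mask: List[bool]) -> float:
    return sum(mask) / len(mask)

good_sensor = [True] * 19 + [False]       # 5% of patches missing
sparse_sfm = [False] * 18 + [True] * 2    # only a few patches carry depth

for name, valid in [("good sensor", good_sensor), ("sparse SfM", sparse_sfm)]:
    ratio = mask_ratio(depth_mask_for_mode(valid, "completion"))
    print(f"{name}: {ratio:.0%} of depth tokens masked")
print(f"no sensor: {mask_ratio(depth_mask_for_mode(good_sensor, 'monocular')):.0%} masked")
```

The same function serves every case in the list above; only the validity input and the mode string change.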

Interactive figure: Unified Framework. The same architecture handles monocular mode (all depth masked) and depth completion mode (only invalid pixels masked) by changing what gets masked.

How does LingBot-Depth unify monocular depth estimation and depth completion?

By using two separate decoder heads, one for each task
By training on both tasks with different loss functions
By varying the masking ratio: 100% masked = monocular, only-invalid masked = completion, with the same architecture, same weights, and different masks

Chapter 4: The Architecture

LingBot-Depth uses a standard Vision Transformer (ViT-Large) as its encoder, but with a carefully designed input pipeline and decoder that enable masked depth modeling.

Separated Patch Embeddings

The two input modalities, 3-channel RGB and 1-channel depth, are processed by separate patch embedding layers. Each modality is independently projected into a sequence of patch tokens on the same spatial grid (one token per 14x14-pixel patch). This separation lets the self-attention layers learn how to integrate appearance and geometry, rather than forcing them into the same embedding space from the start.

Positional Embeddings

Each token receives two types of positional information:

2D spatial position: a shared learnable embedding telling the model where this patch is in the image grid
Modality embedding: a learned token distinguishing RGB (=1) from depth (=2) at the same spatial location

The final positional encoding is the sum of both. This lets the model know both where a token is and what type it is.
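The sum of the two embeddings can be illustrated with plain lists. This is a toy sketch: the tables are filled with random numbers as a stand-in for learned parameters, and the grid size, embedding width, and function names are hypothetical values chosen for readability, not the model's.

```python
import random
from typing import List

def make_table(rows: int, dim: int, seed: int) -> List[List[float]]:
    """Stand-in for a learnable embedding table (random, not trained)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

GRID, DIM = 4, 8                                        # toy sizes, not the model's
spatial_table = make_table(GRID * GRID, DIM, seed=0)    # shared by RGB and depth
modality_table = make_table(2, DIM, seed=1)             # one row per modality

def positional_encoding(row: int, col: int, modality_id: int) -> List[float]:
    """modality_id follows the text: 1 = RGB, 2 = depth."""
    spatial = spatial_table[row * GRID + col]
    modality = modality_table[modality_id - 1]
    return [s + m for s, m in zip(spatial, modality)]

rgb_pe = positional_encoding(2, 3, modality_id=1)
depth_pe = positional_encoding(2, 3, modality_id=2)
# Same location, different modality: the encodings differ only by the
# difference between the two modality embeddings.
offset = [a - b for a, b in zip(rgb_pe, depth_pe)]
```

Because the spatial table is shared, an RGB token and a depth token at the same grid cell get encodings that differ only by the modality term.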

Encoder: Joint Embedding

After masking, the full set of RGB tokens and the unmasked depth tokens are concatenated and fed into a 24-layer ViT-Large encoder (initialized from DINOv2). The self-attention mechanism allows every RGB token to attend to every depth token and vice versa. This is where the magic happens: depth tokens at a given location attend to RGB tokens at the same and nearby locations, learning to correlate appearance with geometry.

ConvStack Decoder

After the encoder, the depth tokens are discarded. Only the enriched RGB tokens (which now carry geometric information absorbed from depth tokens via attention) are fed to the decoder. A [CLS] token capturing global scene context is broadcast-added to all tokens. The ConvStack decoder then progressively upsamples through residual blocks and transposed convolutions, doubling resolution at each stage until reaching 16x the patch resolution. UV positional encodings at each scale preserve spatial layout. Final bilinear upsampling matches the original input resolution.

Why discard depth tokens? If the decoder received depth tokens directly, it could "cheat" by simply passing through valid depth values without learning to reason from RGB. By forcing all output to flow through enriched RGB tokens only, the model must learn deep cross-modal correspondences during encoding.
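The token flow described above (all RGB tokens plus unmasked depth tokens into the encoder, RGB tokens only into the decoder) can be sketched with labelled placeholders. In this hypothetical example a token is just a `(modality, patch index)` pair, the encoder is omitted, and the function names are illustrative.

```python
from typing import List, Tuple

Token = Tuple[str, int]   # (modality, patch index); real tokens are feature vectors

def encoder_input(num_patches: int, depth_masked: List[bool]) -> List[Token]:
    """All RGB tokens plus only the unmasked depth tokens."""
    rgb = [("rgb", i) for i in range(num_patches)]
    depth = [("depth", i) for i in range(num_patches) if not depth_masked[i]]
    return rgb + depth

def decoder_input(encoder_output: List[Token]) -> List[Token]:
    """Drop depth tokens so the decoder sees RGB-position tokens only."""
    return [tok for tok in encoder_output if tok[0] == "rgb"]

masked = [True, False, True, True]      # 3 of 4 depth tokens hidden
seq = encoder_input(4, masked)
print(len(seq), "tokens enter the encoder")             # 4 RGB + 1 depth = 5
print(len(decoder_input(seq)), "tokens reach the decoder")  # 4
```

The decoder always receives one token per patch regardless of the mask ratio, which is what lets the same decoder serve both monocular and completion inputs.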

Interactive figure: Masked Depth Modeling. The left panel shows RGB (always full), the middle shows the masked depth tokens, and the right shows the "predicted" output. At 100% masking, the model must rely entirely on RGB.

Why does LingBot-Depth use separate patch embedding layers for RGB and depth instead of concatenating them into a 4-channel input?

Separate embeddings let the self-attention layers learn how to integrate appearance and geometry, and enable independent masking of depth tokens while keeping all RGB tokens visible
Because 4-channel convolutions are not supported by standard ViT implementations
To reduce the total number of parameters in the model

Chapter 5: The Data Pipeline

Training a model to fill in depth sensor failures requires data with both the failures (for masking) and the ground truth (for supervision). This is a chicken-and-egg problem: if you had perfect depth, you wouldn't need the model. LingBot-Depth solves this with a dual-stream pipeline producing 3M training pairs.

Synthetic Pipeline (1M samples)

The synthetic branch doesn't just render perfect depth maps — that would miss the point. Instead, it simulates the full imaging process of a real RGB-D camera:

Render RGB images, perfect depth, and grayscale stereo pairs with speckle patterns in Blender
Process the stereo pairs through Semi-Global Matching (SGM) — the same algorithm used in real cameras
The SGM output has realistic artifacts: missing values on textureless and specular surfaces, edge noise, depth quantization

Why simulate the camera pipeline? Prior synthetic datasets render "ideal" depth and then add random noise. But real sensor failures aren't random — they correlate with material properties and lighting. By running the actual stereo-matching algorithm on simulated IR images with speckle patterns, the synthetic data inherits the same failure modes as real hardware.

Key numbers: 442 indoor scenes, resolution 960x1280. Each sample contains RGB, perfect depth, stereo pair, ground-truth disparity, and simulated sensor depth. The stereo baseline is randomly sampled between 0.05-0.2m and focal length between 16-28mm for diversity.
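Stereo matching produces disparity, which is converted to depth with the standard rectified-stereo relation depth = focal length (in pixels) x baseline / disparity. The sketch below applies that relation; the 1000-pixel focal length is an assumed value, since the text gives focal length in millimetres and converting it needs a sensor size that is not stated. The function name is hypothetical.

```python
from typing import Optional

def disparity_to_depth(disparity_px: float, focal_px: float,
                       baseline_m: float) -> Optional[float]:
    """Depth in metres for a rectified stereo pair; None where matching failed."""
    if disparity_px <= 0.0:      # no reliable match: this pixel becomes a hole
        return None
    return focal_px * baseline_m / disparity_px

FOCAL_PX = 1000.0    # assumed focal length in pixels
BASELINE_M = 0.1     # inside the 0.05-0.2 m range quoted above

for d in (50.0, 49.0, 10.0, 9.0, 0.0):
    print(d, disparity_to_depth(d, FOCAL_PX, BASELINE_M))
```

With these assumed values, a one-pixel disparity step changes depth by about 4 cm at 2 m but by more than 1 m at 10 m, which illustrates the depth quantization artifact mentioned above.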

Real-World Pipeline (~2M samples)

The real-world branch uses a custom, 3D-printed capture rig that mounts multiple commercial RGB-D cameras (Intel RealSense, Orbbec Gemini, ZED) alongside a portable PC with a touchscreen. The modular design lets operators swap cameras easily.

Since real captures don't have perfect ground-truth depth, the team computes pseudo-depth labels from the left-right IR stereo pairs using a FoundationStereo-based network trained on the synthetic data. Left-right consistency checks filter out unreliable pixels.
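A left-right consistency check can be sketched on a single scanline. The idea is that a left-image pixel at column x with disparity d should land on right-image column x - d, and the right image's disparity there should agree. The text does not give the team's exact criterion, so the integer disparities, the one-pixel tolerance, and the function name below are assumptions for illustration.

```python
from typing import List

def left_right_consistent(disp_left: List[int], disp_right: List[int],
                          tol: int = 1) -> List[bool]:
    """Per-pixel flag for one scanline; True means the two disparities agree."""
    width = len(disp_left)
    keep = []
    for x, d in enumerate(disp_left):
        xr = x - d                               # matching column in the right image
        ok = 0 <= xr < width and abs(d - disp_right[xr]) <= tol
        keep.append(ok)
    return keep

disp_left = [2, 2, 2, 2, 2, 4]     # last pixel has an outlier disparity
disp_right = [2, 2, 2, 2, 2, 2]
print(left_right_consistent(disp_left, disp_right))
# [False, False, True, True, True, False]
```

The first two pixels are rejected because their match falls outside the right image, and the last is rejected because the two views disagree by more than the tolerance.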

Scene diversity spans 30+ categories: homes, offices, hospitals, airports, museums, gyms, parking garages, and outdoor environments.

Public Datasets (7M additional)

Seven open-source RGB-D datasets supplement the pipeline: Taskonomy (4.6M), ScanNet++ (0.8M), TartanAir (0.6M), ARKitScenes (0.5M), and others. For synthetic datasets with no missing depth, random patch masking simulates sensor failures. Total training: ~10M samples.

Source	Samples	Type	Natural masks?
LingBot-Depth-S (synthetic)	1.0M	Simulated SGM failures	Yes (SGM)
LingBot-Depth-R (real)	2.1M	Real sensor failures	Yes (hardware)
Taskonomy	4.6M	Supplementary	Random
ScanNet++	0.8M	Supplementary	Partial
Others (5 datasets)	~1.5M	Supplementary	Mixed

Why does the synthetic pipeline run SGM on simulated stereo pairs instead of just rendering perfect depth and adding random noise?

Because SGM is faster than rendering depth maps
Because random noise doesn't need ground truth for supervision
Because real sensor failures correlate with material and lighting conditions, and running the actual stereo-matching algorithm on simulated IR images reproduces those same failure patterns, which random noise cannot

Chapter 6: Training

Training a ViT-Large on 10M RGB-D samples with masked depth modeling involves several careful design decisions.

Masking Strategy

The masking ratio during training ranges from 60% to 90%. The distribution of masks varies by data source:

Self-curated data (LingBot-Depth-S and -R): natural masks dominate. Synthetic data from SGM has higher mask ratios than real captures because SGM on simulated images is more aggressive.
Open-source synthetic (no natural failures): fully random patch masking to reach the target ratio.
Open-source real (relatively complete depth): natural masks contribute some, random masking fills the rest.

Encoder Initialization

The ViT-Large encoder is initialized from DINOv2 pretrained weights. This gives the model strong visual features from the start — it already understands edges, textures, objects. The MDM pretraining then teaches it to also reason about geometry.

Differential Learning Rates

The pretrained encoder and randomly initialized decoder have very different optimization needs:

Encoder: 1e-5 base learning rate (gentle — preserve DINOv2 features)
Decoder: 1e-4 learning rate (10x higher — must learn from scratch)

Training Schedule

Parameter	Value
Total iterations	250,000
Global batch size	1,024 (128 GPUs x 8)
Optimizer	AdamW (beta1=0.9, beta2=0.999, wd=0.05)
Warmup	2,000 iterations (encoder), none (decoder)
LR decay	Step decay: 0.5x every 25K iterations
Gradient clipping	Max norm 1.0
Precision	Mixed (BF16)
Wall time	~7.5 days on 128 GPUs
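The learning-rate settings in the table can be sketched as a small schedule function. Two details are assumptions made for the example: the warmup is taken to be linear, and the 0.5x step decay is applied to both the encoder and the decoder. The function name is hypothetical.

```python
def learning_rate(step: int, base_lr: float, warmup_steps: int = 0,
                  decay_every: int = 25_000, decay_factor: float = 0.5) -> float:
    """Step decay, with an optional linear warmup at the start."""
    lr = base_lr * decay_factor ** (step // decay_every)
    if step < warmup_steps:
        lr *= (step + 1) / warmup_steps      # linear ramp (assumed shape)
    return lr

ENCODER_LR, DECODER_LR = 1e-5, 1e-4

for step in (0, 1_999, 25_000, 249_999):
    enc = learning_rate(step, ENCODER_LR, warmup_steps=2_000)
    dec = learning_rate(step, DECODER_LR)
    print(f"step {step:>7}: encoder {enc:.2e}, decoder {dec:.2e}")
```

Under these assumptions the 10x gap between decoder and encoder holds throughout training once the encoder warmup ends, and nine halvings over 250,000 iterations bring both rates down by a factor of 512.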

Loss and Augmentation

The loss is simple: L1 loss on predicted depth, computed only at pixels with valid ground-truth values. Data augmentation includes random resized cropping, horizontal flipping, color jittering, JPEG compression artifacts, motion blur, and shot noise — all applied to the RGB image to improve robustness.

No depth augmentation: The depth modality is not artificially corrupted beyond the natural/random masking. The model sees the raw (possibly noisy) depth values as-is. The masking is the augmentation for depth.
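The L1 loss over valid ground-truth pixels described above can be sketched as follows. This is a minimal pure-Python illustration on a flattened list of pixels; representing missing ground truth as `None` and averaging over the valid pixels are assumptions of the example, and the function name is hypothetical.

```python
from typing import List, Optional

def masked_l1(pred: List[float], target: List[Optional[float]]) -> float:
    """Mean absolute error over pixels whose ground truth is not None."""
    errors = [abs(p - t) for p, t in zip(pred, target) if t is not None]
    if not errors:
        return 0.0          # no supervised pixels in this sample
    return sum(errors) / len(errors)

pred = [1.0, 2.0, 3.0, 4.0]
target = [1.5, None, 3.0, 3.0]      # second pixel has no ground truth
print(masked_l1(pred, target))      # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

The prediction at the second pixel contributes nothing to the loss, so the model is never penalized where no reference depth exists.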

Why does LingBot-Depth use a 10x lower learning rate for the encoder than the decoder?

The encoder is initialized from DINOv2 with strong visual features that should be preserved, while the decoder is randomly initialized and must learn from scratch, so different stages of learning need different rates
Because the encoder has fewer parameters than the decoder
To prevent the encoder from overfitting to the training set

Chapter 7: Results

LingBot-Depth is evaluated on three core tasks: depth completion, monocular depth estimation, and stereo matching prior initialization. The results are striking.

Depth Completion

Two protocols test depth completion under increasing difficulty:

Protocol 1: Block-wise masking. Ground-truth depth is corrupted with random block masks and Gaussian+shot noise at four severity levels (easy/medium/hard/extreme); the table below shows three of them. Evaluated on iBims, NYUv2, DIODE.

Method	Easy RMSE	Hard RMSE	Extreme RMSE
OMNI-DC	0.476	2.053	2.214
PromptDA	0.298	0.607	2.587
PriorDA	0.409	0.845	2.734
LingBot-Depth	0.175	0.345	2.011

Protocol 2: Sparse SfM inputs. Only highly sparse SfM point clouds serve as depth input (ETH3D). This is the harder test.

Method	Indoor RMSE	Outdoor RMSE
OMNI-DC	0.605	1.069
PriorDA	0.360	1.238
LingBot-Depth	0.192	0.664

Against the best baseline in each setting, that is a 47% RMSE reduction indoors (PriorDA, 0.360 to 0.192) and 38% outdoors (OMNI-DC, 1.069 to 0.664). On the extreme block-masking setting, LingBot-Depth is the only method that keeps RMSE below 2.1; all others exceed 2.2.
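The quoted percentages follow directly from the Protocol 2 table. As a quick check, the snippet below recomputes them; note that the strongest baseline differs by setting (PriorDA indoors, OMNI-DC outdoors). The helper name is hypothetical.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional error reduction relative to a baseline."""
    return (baseline - ours) / baseline

indoor = relative_reduction(0.360, 0.192)     # best indoor baseline: PriorDA
outdoor = relative_reduction(1.069, 0.664)    # best outdoor baseline: OMNI-DC
print(f"indoor {indoor:.1%}, outdoor {outdoor:.1%}")
# indoor 46.7%, outdoor 37.9%
```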

Monocular Depth Estimation

When used as a backbone initializer for MoGe (replacing DINOv2), MDM pretraining consistently improves depth and point map accuracy across 10 benchmarks. Relative error drops from 0.056 to 0.044 (affine-invariant), confirming that geometric knowledge learned during MDM transfers to monocular settings.

Stereo Matching Prior

As a depth prior for FoundationStereo, the MDM-pretrained encoder converges faster and achieves better final performance than both DINOv2 and DepthAnythingV2 initializations. At epoch 5, LingBot-Depth already outperforms the vanilla baseline at epoch 15 on most benchmarks.

Figure: Depth Completion RMSE. RMSE comparison across methods on the iBims benchmark (Protocol 1); lower is better. LingBot-Depth has the lowest RMSE at every difficulty level.

Under the sparse SfM protocol (Protocol 2), how much does LingBot-Depth reduce indoor RMSE compared to the best baseline?

About 47%, from 0.360 (PriorDA) to 0.192, demonstrating strong performance even with extremely sparse depth input
About 20%, a modest improvement
About 5%, within the margin of error

Chapter 8: Downstream Applications

A depth model is only as good as what you can build with it. LingBot-Depth enables three compelling downstream applications — all without any task-specific fine-tuning.

Video Depth Completion

Despite being trained on static images only, LingBot-Depth produces temporally consistent depth on video input. Testing on 30 FPS captures from Orbbec Gemini-335 in challenging scenes (glass lobbies, aquarium tunnels, gyms with mirrors), the model fills in massive sensor gaps and maintains smooth depth across frames.

In the aquarium tunnel test, the ZED stereo camera almost entirely fails due to refractive glass surfaces. LingBot-Depth produces geometrically plausible depth throughout the sequence. No temporal modeling, no fine-tuning — pure zero-shot generalization from MDM pretraining.

3D Point Tracking

By plugging LingBot-Depth's refined depth into SpatialTrackerV2 (replacing its default VGGT depth frontend), both camera tracking and dynamic object tracking improve significantly. In indoor scenes with extensive glass surfaces where raw depth fails, the refined depth produces smoother, more accurate camera trajectories. For dynamic objects (scooters, rowing machines), tracked 3D point trajectories show coherent motion patterns.

Drop-in replacement: LingBot-Depth doesn't require any modification to SpatialTrackerV2. It's a drop-in depth estimator that produces better input, which cascades into better tracking. This is the power of foundation models: improve one component, improve everything downstream.

Dexterous Grasping

The most striking application: a robotic dexterous grasping pipeline using a Rokae XMate-SR5 arm with an X Hand-1 dexterous hand. The grasping policy (a diffusion policy conditioned on DINOv2 RGB features + Point Transformer point cloud features) is trained on HOI4D human hand-object interactions retargeted to the robot hand.

The key: the point cloud comes from LingBot-Depth's predictions, not the raw sensor. This enables grasping of objects that defeat conventional sensors:

Transparent glassware — invisible to IR-based depth sensors
Highly reflective bowls — specular reflection destroys structured light patterns
Thin objects — too narrow for reliable stereo matching
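Turning a predicted depth map into the point cloud that feeds the grasping policy is typically done with pinhole back-projection. The sketch below shows that standard step; the intrinsics (`fx`, `fy`, `cx`, `cy`), the treatment of non-positive depth as missing, and the function name are assumptions of the example, not details from the grasping pipeline itself.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]

def backproject(depth: List[List[float]], fx: float, fy: float,
                cx: float, cy: float) -> List[Point]:
    """Convert a metric depth map to camera-frame 3D points (pinhole model)."""
    points = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column
            if z <= 0.0:                     # skip pixels without depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

depth = [[0.0, 2.0],
         [1.0, 0.0]]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)   # [(0.5, -0.5, 2.0), (-0.25, 0.25, 1.0)]
```

With raw sensor depth, the skipped pixels are exactly the glass and reflective surfaces; with completed depth, those pixels contribute points and the object appears in the cloud.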

Aligned Latent Representations

An additional benefit: the attention visualization shows that depth tokens consistently attend to spatially corresponding RGB regions. Different depth queries in the same scene attend to distinct, position-aware regions. The encoder learns fine-grained geometric-appearance correspondences, producing latent representations where RGB and depth are naturally aligned.

The bigger picture: LingBot-Depth isn't just a depth completion model — it's a spatial perception foundation. The MDM-pretrained encoder produces latent features where visual appearance and 3D geometry are aligned. This makes the features immediately useful for any downstream task that needs to reason about 3D space.

Why can LingBot-Depth enable grasping of transparent glass cups that defeat conventional depth sensors?

Because it uses a special IR sensor that works on glass
Because it infers the glass cup's depth from RGB context (visual appearance, edges, refraction patterns) even when the depth sensor returns no data for those pixels
Because it averages depth values from neighboring opaque objects

Chapter 9: Connections

LingBot-Depth sits at the intersection of self-supervised learning, depth estimation, and sensor fusion. Let's map where it fits in the landscape.

Relation to MAE

MAE (Masked Autoencoders) masks and reconstructs image patches. MDM masks and reconstructs depth patches while conditioning on the full RGB image. The key innovation is replacing random masks with sensor-driven natural masks that create a harder, more informative pretraining signal.

Relation to Depth Anything / Metric3D

Depth Anything and Metric3D are monocular depth estimators — they take only RGB as input. LingBot-Depth subsumes this capability (mask all depth tokens = monocular mode) but also supports depth completion when sensor data is available. The MDM-pretrained encoder even outperforms DINOv2 as an initialization for monocular depth models.

Relation to Depth Completion Methods

OMNI-DC, PromptDA, and PriorDA are dedicated depth completion models. They treat the task in isolation. LingBot-Depth instead learns depth reasoning through masked depth modeling, which provides stronger generalization across mask patterns and sparsity levels.

Relation to FoundationStereo

FoundationStereo is a stereo matching model that uses a monocular depth prior. LingBot-Depth serves as a stronger prior than DepthAnythingV2 for this purpose, demonstrating that MDM pretraining distills 3D geometric knowledge more effectively than standard visual pretraining.

Cheat Sheet

Aspect	LingBot-Depth
Input	RGB image + (optionally corrupted) depth map
Output	Dense, metric-scale depth map
Backbone	ViT-Large/14 (DINOv2 init)
Decoder	ConvStack (hierarchical conv pyramid)
Pretraining	Masked Depth Modeling (60-90% mask ratio)
Training data	~10M samples (3M self-curated + 7M public)
Loss	L1 on valid ground-truth depth pixels
Key insight	Sensor failures = natural masks = harder than random
Unification	Monocular (100% mask) ↔ completion (natural mask)
Key result	47% RMSE reduction on sparse SfM depth completion

The broader lesson: When your data has natural corruptions — sensor failures, missing labels, noisy measurements — don't discard them. They are telling you exactly which inputs are hard. Use them as the training signal. The corruption pattern itself encodes domain knowledge about the problem's difficulty landscape.

What is the key difference between MAE's masking and LingBot-Depth's masking?

MAE masks more patches than LingBot-Depth
MAE uses random masks on image patches; LingBot-Depth uses sensor-failure-driven natural masks on depth patches (which always target geometrically hard regions), conditioned on full RGB
MAE reconstructs images while LingBot-Depth reconstructs text tokens

Masked Depth Modeling for Spatial Perception