Introduction

Every VLA model has a vision encoder at its core. This encoder takes raw pixels — 224×224 or 336×336 RGB images from one or more cameras — and compresses them into a sequence of feature vectors that downstream components can reason about. The quality of these features determines the upper bound of what the entire system can achieve.

Computer vision has made extraordinary progress over the past decade. ImageNet classification is essentially solved. Object detection and segmentation are mature. Image generation is photorealistic. But robot vision is a different problem, and many of the assumptions baked into standard vision models are wrong for robotics.

ImageNet teaches a model that "this is a cup." A robot needs to know where the cup is in 3D space (to reach for it), how it's oriented (to plan a grasp), what it's made of (to determine grip force), and whether it's obstructed by other objects (to plan a collision-free path). Classification-oriented features may capture some of this implicitly, but they were optimized for a fundamentally different objective.

This article surveys the vision encoders used in modern VLA research: from classical CNN features through contrastively-trained ViTs (CLIP, SigLIP), self-supervised representations (DINOv2), depth sensing, point cloud processing, and the design decisions that determine whether pretrained features transfer effectively to robot control tasks.

ℹ What this article covers

What robots need from vision that ImageNet doesn't provide, CNN-based encoders and spatial softmax, Vision Transformers for robotics (CLIP, SigLIP, DINOv2), depth sensing and 3D representations (point clouds, PointNet), multi-view processing, and the critical frozen-vs-fine-tuned decision. With code examples for feature extraction.

What Robots Need to See

Beyond classification

Standard vision benchmarks evaluate features on classification (ImageNet), detection (COCO), or retrieval (Flickr30k). These tasks share a property: they require identifying what is in an image but not precisely where. A correct classification of "cup" doesn't require knowing the cup's 3D position to millimeter accuracy.

Robot manipulation requires a fundamentally different kind of visual understanding:

  • Metric spatial precision. To grasp an object, the robot needs its position in 3D space with sub-centimeter accuracy. A feature representation that preserves only semantic category ("cup") but loses metric position ("12.3cm from camera, 5.1cm to the left") is useless for manipulation.
  • Fine-grained geometry. Grasping a handle requires understanding the handle's shape, orientation, and viable grasp points. This is below the resolution of most classification features, which pool over spatial dimensions early in the network.
  • Physical properties. Is the object rigid or deformable? Heavy or light? Fragile or robust? These properties affect how the robot should interact with the object. They are partially inferable from visual appearance (a glass looks fragile, a pillow looks soft) but are not explicitly represented in standard vision features.
  • Affordances. Where can this object be grasped? Where should it be placed? What actions does it enable? Affordance understanding goes beyond recognition to predict how objects can be acted upon.

The spatial precision problem

Classification networks deliberately discard spatial information. Global average pooling reduces a spatial feature map (e.g., 7×7×2048 in ResNet-50) to a single vector (2048-d). This is useful for classification (invariance to spatial transformations is a feature, not a bug) but catastrophic for manipulation (spatial information is exactly what we need).

VLA architectures address this differently depending on the encoder:

  • CNN with spatial softmax: Extract feature maps before global pooling and compute expected spatial coordinates for each channel, producing a compact spatial representation.
  • ViT patch tokens: Use the full sequence of patch tokens (not just the [CLS] token), preserving spatial layout. A ViT-L/14 with 224px input produces 16×16 = 256 patch tokens, each encoding a 14×14 pixel region.
  • Multi-scale features: Extract features from multiple layers of the encoder to combine coarse semantic information (deeper layers) with fine spatial detail (earlier layers).

CNN-Based Encoders

Before Vision Transformers dominated, convolutional neural networks were the standard visual backbone for robot learning. Several CNN architectures remain relevant, and understanding their design choices illuminates what makes a good visual encoder for robotics.

ResNet features for manipulation

ResNet (He et al., 2016) remains a reasonable baseline encoder. For robotics, the key architectural decision is where to extract features. ResNet-50 has four residual groups producing feature maps at decreasing spatial resolution:

| Layer        | Output Size | Channels | Semantics           | Use in Robotics                       |
|--------------|-------------|----------|---------------------|---------------------------------------|
| conv1 + pool | 56×56       | 64       | Edges, textures     | Too low-level; high compute           |
| layer1       | 56×56       | 256      | Simple patterns     | Rarely used alone                     |
| layer2       | 28×28       | 512      | Object parts        | Good spatial detail                   |
| layer3       | 14×14       | 1024     | Objects, context    | Common choice for manipulation        |
| layer4       | 7×7         | 2048     | Semantic categories | Good for classification, poor spatial |

RT-1 (Brohan et al., 2022) used an EfficientNet-B3 CNN pretrained on ImageNet. The architecture extracts spatial feature maps before global pooling, conditions them on the language instruction through FiLM (Feature-wise Linear Modulation) layers, and feeds the resulting tokens to a transformer that produces action tokens. The CNN encoder was kept frozen during training — a decision that reduced training compute significantly while maintaining visual feature quality.
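FiLM conditioning can be sketched as a small module that predicts a per-channel scale (gamma) and shift (beta) from a conditioning vector such as a language embedding. This is a minimal illustration, not RT-1's exact implementation; the dimensions are made up.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: a conditioning vector predicts
    per-channel scale (gamma) and shift (beta) for a CNN feature map."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), cond: (B, cond_dim)
        gamma, beta = self.proj(cond).chunk(2, dim=-1)  # each (B, C)
        return gamma[:, :, None, None] * feat + beta[:, :, None, None]

film = FiLM(cond_dim=512, num_channels=64)
feat = torch.randn(2, 64, 14, 14)   # CNN feature map
cond = torch.randn(2, 512)          # e.g. a sentence embedding
out = film(feat, cond)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The modulation leaves the spatial layout untouched; only the channel statistics are rescaled per example, which is what lets one backbone serve many instructions.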

Spatial softmax: extracting keypoints from CNNs

Spatial softmax (Levine et al., 2016) is a technique specifically designed for extracting manipulation-relevant features from CNNs. Instead of global average pooling (which averages out spatial information), spatial softmax applies a softmax over the spatial dimensions of each feature channel, then computes the expected (x, y) coordinates:

For the feature map f_c(i, j) of channel c:

w_c(i, j) = exp(f_c(i, j)) / Σ_{i',j'} exp(f_c(i', j'))

x_c = Σ_{i,j} j · w_c(i, j),    y_c = Σ_{i,j} i · w_c(i, j)

This produces a vector of 2K values (x, y for each of K channels) that represent the spatial locations where each feature channel is most active — essentially a set of learned visual keypoints. ResNet-50's layer3 output (1024 channels) yields 2048 spatial softmax features, encoding 1024 keypoint locations.

Spatial softmax was a key innovation in early deep learning for robotics. It provides a compact, spatially-aware representation that works well for policies predicting end-effector position targets. However, it has been largely superseded by ViT patch tokens, which provide spatial information without requiring a separate extraction mechanism.

Vision Transformer Encoders

The shift from CNNs to Vision Transformers (Dosovitskiy et al., 2020) has been transformative for robot learning. ViTs provide two properties that make them especially well-suited:

  1. Patch tokens preserve spatial layout. Unlike CNNs (which pool aggressively), ViTs maintain a sequence of tokens, each corresponding to a fixed image region. This is a natural representation for downstream transformer-based policies.
  2. Pre-training objectives align better with robotics. Contrastive (CLIP) and self-supervised (DINO) objectives learn richer, more transferable features than supervised ImageNet classification.

CLIP: contrastive language-image features

CLIP (Radford et al., 2021) trains a vision encoder and a text encoder jointly on 400 million image-text pairs from the internet, using a contrastive objective that aligns matching image-text pairs in a shared embedding space. The resulting vision encoder produces features that encode both visual content and semantic meaning.

For VLAs, CLIP's key advantage is language grounding. CLIP features already live in a space where visual and textual concepts are aligned. When a VLA receives the instruction "pick up the red cup," CLIP features for the red cup are already close to the embedding of the text "red cup." This alignment dramatically simplifies the task of grounding language instructions in visual observations.

CLIP ViT-L/14 (the most commonly used variant) produces 256 patch tokens of dimension 1024 from a 224×224 input, or 576 patch tokens from a 336×336 input. Most early VLA architectures (RT-2, LLaVA, the VLMs that inspired VLAs) use CLIP as their visual backbone.
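The grounding idea can be illustrated with cosine similarity between patch features and a text embedding. The tensors below are random stand-ins; in the actual CLIP model only the pooled image embedding is projected into the shared image-text space, so scoring raw patch tokens this way is a simplification.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for encoder outputs (illustrative shapes for ViT-L/14)
patch_tokens = torch.randn(256, 1024)  # 16x16 grid of patch features
text_emb = torch.randn(1024)           # embedding of e.g. "red cup"

# Cosine similarity between the instruction and every patch gives a
# coarse spatial relevance map over the 16x16 patch grid.
sim = F.cosine_similarity(patch_tokens, text_emb.unsqueeze(0), dim=-1)
relevance_map = sim.reshape(16, 16)
print(relevance_map.shape)  # torch.Size([16, 16])
```

With real CLIP features, maps like this peak near image regions matching the text, which is the property language-conditioned policies exploit.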

💡 CLIP's limitations for robotics

CLIP was trained for image-text retrieval, not manipulation. Its features are optimized for semantic similarity (is this image about "dogs" or "cats"?) rather than spatial precision (exactly where is the dog's nose?). Several studies have shown that CLIP features lose fine-grained spatial information, especially in deeper layers. For tasks requiring precise spatial reasoning (inserting a key into a lock, threading a needle), CLIP features may not provide sufficient spatial resolution. This has motivated the use of DINOv2 and other self-supervised encoders that preserve spatial structure better.

SigLIP: sigmoid loss for language-image pre-training

SigLIP (Zhai et al., 2023) replaces CLIP's softmax-based contrastive loss with a simpler sigmoid loss that operates on independent image-text pairs rather than requiring the full batch matrix of similarities. This enables training with larger batch sizes (which improves performance) and simplifies the training pipeline.
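The sigmoid loss can be sketched in a few lines. Each image-text pair in the batch is an independent binary classification (match on the diagonal, non-match off it), so no batch-wide normalization is required. In the paper the temperature t and bias b are learned; they are fixed here for simplicity.

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """Sigmoid loss: every image-text pair is scored independently,
    unlike softmax contrastive loss which normalizes over the batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b         # (B, B) pairwise scores
    labels = 2 * torch.eye(len(logits)) - 1      # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Because each term depends only on one pair, the loss decomposes cleanly across devices, which is what enables the larger batch sizes the text mentions.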

In practice, SigLIP produces features of similar quality to CLIP but scales more efficiently. SigLIP-SO400M (400M parameter ViT with Shape-Optimized training) has become the preferred vision encoder for several recent VLA architectures, including OpenVLA. It provides the same language-grounding benefits as CLIP with improved training efficiency and slightly better performance on vision-language benchmarks.

DINOv2: self-supervised features without language

DINOv2 (Oquab et al., 2024) takes a fundamentally different approach: it trains a vision encoder using self-supervised learning (specifically, self-distillation with a momentum teacher) without any text supervision. The resulting features encode visual structure — edges, textures, shapes, parts, objects — without being biased toward any particular downstream task.

For robotics, DINOv2 has a crucial advantage: spatial feature quality. Because DINOv2 is trained to produce consistent features across augmented views of the same image (including crops, color jitter, and geometric transformations), its features maintain detailed spatial structure at every layer. Multiple studies have shown that DINOv2 patch features provide better spatial correspondence than CLIP features — if you mark a point on one view of an object, DINOv2 features can locate the same point in another view more accurately.

The trade-off: DINOv2 features lack language grounding. A DINOv2 feature for "red cup" is not naturally aligned with the text "red cup." VLAs that use DINOv2 must learn this alignment from scratch during training, which requires more data and a more expressive projection layer. Some architectures (like Theia, Shang et al., 2024) distill multiple vision encoders into a single backbone, combining DINOv2's spatial precision with SigLIP's language grounding.
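Multi-teacher feature distillation in the spirit of Theia can be sketched as a student backbone regressing each teacher's patch features through a per-teacher projection head. This is a hypothetical sketch; the dimensions, teacher set, and plain MSE objective are assumptions, not Theia's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in tensors: student backbone output and two teachers' patch tokens
student_tokens = torch.randn(2, 256, 768)
teacher_tokens = {
    "dino": torch.randn(2, 256, 1024),    # DINOv2 ViT-L features
    "siglip": torch.randn(2, 256, 1152),  # SigLIP-SO400M features
}

# One linear head per teacher maps student features into that teacher's space
heads = nn.ModuleDict({
    name: nn.Linear(768, t.shape[-1]) for name, t in teacher_tokens.items()
})

# Distillation loss: match every teacher simultaneously
loss = sum(F.mse_loss(heads[n](student_tokens), t)
           for n, t in teacher_tokens.items())
print(loss.item())
```

The single student backbone is then what the policy consumes, inheriting spatial structure from DINOv2 and language alignment from SigLIP.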

CLIP / SigLIP

Language-Grounded Features

Pre-aligned with text embeddings. Natural fit for language-conditioned policies. Strong semantic features but weaker spatial precision. Used by RT-2, OpenVLA, and most early VLAs.

DINOv2

Spatial-Structure Features

Self-supervised, no text bias. Superior spatial correspondence and fine-grained geometry. Requires explicit language alignment. Distilled into hybrid backbones such as Theia; favored for spatially demanding tasks.

[Interactive figure: what different vision encoders (CLIP, SigLIP, DINOv2) extract from the same scene; CLIP, for instance, shows strong object-level semantics but weaker spatial detail.]

Depth and 3D Representations

RGB images provide rich appearance information but lack explicit depth. For manipulation tasks that require precise 3D reasoning — grasping objects at specific heights, avoiding collisions, stacking blocks — depth information is essential.

Depth sensors

Three main technologies provide depth measurement:

  • Structured light (e.g., Intel RealSense D400 series): Projects a known infrared pattern onto the scene and measures its deformation to compute depth. Works well indoors at short range (0.2–10m). Fails on reflective or transparent surfaces. Cost: ~$200–$400.
  • Time-of-flight (e.g., Microsoft Azure Kinect): Measures the time for emitted light pulses to return. Better for larger scenes and outdoor use. Higher noise at close range. Cost: ~$400–$1000.
  • Stereo vision (e.g., ZED cameras): Uses two RGB cameras to compute depth from parallax, similar to human binocular vision. No active illumination needed. Depth accuracy depends on texture richness and baseline distance. Cost: ~$300–$500.

Depth maps are typically represented as single-channel images where each pixel value encodes distance from the camera (in meters). They can be concatenated with RGB images to form RGBD inputs (4-channel images) or converted to 3D point clouds.

Point clouds

A point cloud is an unordered set of 3D points {(x_i, y_i, z_i)}_{i=1}^{N}, optionally augmented with color or normal vectors. Point clouds are the natural output of depth sensors when combined with known camera intrinsics (the pinhole camera model converts each pixel + depth value into a 3D point).

Point clouds have several advantages over 2D images for manipulation:

  • Viewpoint invariance. The same object produces different images from different camera angles, but the same point cloud (up to occlusion). Policies operating on point clouds can generalize across camera viewpoints more easily.
  • Metric 3D. Point coordinates are in real-world units (meters). The robot can directly compute distances, plan collision-free paths, and reason about spatial relationships without learning an implicit 3D model from 2D projections.
  • Natural for grasping. Grasp planning requires understanding the 3D geometry of objects and the space around them. Point clouds provide this directly.

PointNet and PointNet++

Processing point clouds with neural networks requires architectures that handle unordered sets of varying size. Standard CNNs (which expect grid-structured inputs) cannot be directly applied.

PointNet (Qi et al., 2017) introduced a simple but effective architecture: apply a shared MLP to each point independently, then aggregate across all points using a symmetric function (max pooling) to produce a global feature. The key insight is that max pooling is invariant to input ordering, satisfying the permutation-invariance requirement of point sets.

f(x_1, ..., x_n) = g(h(x_1), h(x_2), ..., h(x_n))

where h: ℝ³ → ℝ^K is a shared per-point MLP and g is channel-wise max pooling.
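The factorization above can be written as a minimal PointNet trunk (a sketch of the core idea, not the full architecture with input/feature transforms):

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Minimal PointNet trunk: shared per-point MLP h, then max-pool g."""

    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.h = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) — the order of the N points must not matter
        per_point = self.h(points)             # (B, N, out_dim)
        global_feat, _ = per_point.max(dim=1)  # symmetric aggregation
        return global_feat                     # (B, out_dim)

net = MiniPointNet()
pts = torch.randn(2, 1024, 3)
feat = net(pts)
print(feat.shape)  # torch.Size([2, 256])

# Permutation invariance: shuffling the points leaves the feature unchanged
perm = torch.randperm(1024)
assert torch.allclose(feat, net(pts[:, perm]))
```

Max pooling is the symmetric function g: any permutation of the input points produces exactly the same global feature, as the assertion verifies.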

PointNet++ (Qi et al., 2017) extends PointNet with hierarchical feature learning. It groups points into local neighborhoods using farthest point sampling and ball queries, applies PointNet within each local group, and repeats at increasing scales. This captures local geometric structure (which vanilla PointNet misses due to global pooling).

In modern VLA research, point cloud processing is less common than RGB-based approaches because most large-scale demonstration datasets (Open X-Embodiment, Bridge V2) collect only RGB observations. However, point clouds remain important for precise manipulation tasks and are used in several specialized systems (e.g., contact-rich assembly, bin picking, deformable object manipulation).

[Interactive figure: the same scene as a depth map, where each pixel encodes distance, and as a rotatable 3D point cloud.]

Multi-View Representations

A single camera provides a 2D projection of a 3D scene. Objects behind the robot are invisible. Objects at similar depths are hard to distinguish. Self-occlusion hides parts of objects. Multiple cameras address these limitations at the cost of increased observation dimensionality and the need for feature fusion.

Common multi-view setups in VLA research:

  • Third-person + wrist camera: The most common setup. A fixed overhead or angled camera provides scene context (where objects are, what the robot is doing). A camera mounted on the robot's wrist provides close-up detail for fine manipulation (seeing small features, precise alignment). RT-1 used a single third-person view; many later systems add a wrist camera for improved manipulation precision.
  • Stereo views: Two cameras at a known baseline provide depth through stereo matching, enabling 3D reconstruction without an active depth sensor.
  • Multiple third-person views: Some setups use 3–4 cameras from different angles to reduce occlusion. This is common in simulation-based training but impractical for many real-world deployments.

Feature fusion for multi-view inputs can be done at several levels:

  • Early fusion: Concatenate images along the channel dimension before encoding. Simple but loses the spatial relationship between views.
  • Late fusion: Encode each view independently, then concatenate or cross-attend the resulting feature sequences. More flexible and the dominant approach. Octo processes multiple camera views by encoding each independently and concatenating the token sequences.
  • 3D fusion: Project features into a shared 3D coordinate frame using known camera extrinsics. Provides geometric consistency but requires calibrated cameras.
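Late fusion is simple to express in code: encode each view independently and concatenate the token sequences. The encoder below is a toy patchify-and-embed stand-in, not a real ViT; the camera setup and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Encode each camera view independently, then concatenate the
    resulting token sequences along the sequence dimension."""

    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # any module mapping (B,3,H,W) -> (B,T,D)

    def forward(self, views: list) -> torch.Tensor:
        tokens = [self.encoder(v) for v in views]  # one (B,T,D) per view
        return torch.cat(tokens, dim=1)            # (B, sum(T), D)

class ToyPatchEncoder(nn.Module):
    """16x16 patchify + linear embed — a stand-in for a ViT backbone."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)
    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, T, D)

fusion = LateFusion(ToyPatchEncoder())
third_person = torch.randn(2, 3, 224, 224)
wrist = torch.randn(2, 3, 224, 224)
out = fusion([third_person, wrist])
print(out.shape)  # torch.Size([2, 392, 128]) — 196 tokens per view
```

A downstream transformer policy can then attend across both views' tokens, which is how systems like Octo relate wrist close-ups to scene context.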

Feature Extraction Strategies

Given a pretrained vision encoder, there are important decisions about how to use its features for robot policy learning.

Frozen vs. fine-tuned encoders

The simplest approach is to freeze the vision encoder: use it as a fixed feature extractor and only train downstream components (projection layers, policy networks). This is fast, memory-efficient, and prevents catastrophic forgetting of pretrained features. RT-1 used a frozen EfficientNet. RT-2 keeps the vision encoder frozen during its initial training stages.

The alternative is to fine-tune the encoder end-to-end alongside the policy. This allows the encoder to adapt its features for manipulation-specific requirements (e.g., attending more to object edges and grasp points, less to background textures). The risk is catastrophic forgetting: with limited robot data, fine-tuning can degrade the general visual features that made pretraining valuable.

The prevailing wisdom has evolved:

  • Early VLAs (2022–2023): Freeze the encoder. Pretrained features are too valuable to risk corrupting with small robot datasets.
  • Current best practice (2024+): Use parameter-efficient fine-tuning (LoRA, adapters) to adapt the encoder while preserving most pretrained weights. OpenVLA uses LoRA to fine-tune the SigLIP encoder, adding only ~1% additional parameters.
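The core of LoRA can be sketched as a wrapper that freezes a linear layer and adds a trainable low-rank update. This is a minimal illustration of the mechanism, not OpenVLA's actual setup (which applies it across the model via a fine-tuning library).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B receive gradients."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

Because B starts at zero, the wrapped layer is exactly the pretrained layer at initialization; the low-rank update grows away from it only as far as the robot data demands, which is why catastrophic forgetting is limited.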

Which layer to extract from

Vision encoders produce features at multiple levels of abstraction. For ViTs, the standard approach is to use the output of the final transformer block (all patch tokens). But this isn't always optimal:

  • Final layer: Most abstract, best for semantic understanding (object identity, scene category). Standard choice for classification and retrieval.
  • Intermediate layers: Better spatial detail, less abstract. Some robotics papers report better manipulation performance with features from 2/3 depth rather than the final layer.
  • Multi-layer fusion: Concatenate features from multiple layers to get both spatial precision and semantic abstraction. Increases representation dimensionality but can improve performance for tasks requiring both fine spatial control and semantic understanding.
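Multi-layer fusion reduces to indexing into the encoder's per-layer outputs and concatenating along the feature dimension. The tensors below are random stand-ins for what `output_hidden_states=True` would return from a transformers ViT; the choice of layer 16 as "2/3 depth" is illustrative.

```python
import torch

# Stand-ins for a 24-layer ViT-L's per-layer patch tokens
# (25 entries: the embedding output plus one per transformer block)
hidden_states = [torch.randn(1, 256, 1024) for _ in range(25)]

mid = hidden_states[16]   # ~2/3 depth: more spatial detail
last = hidden_states[-1]  # final layer: most semantic
fused = torch.cat([mid, last], dim=-1)  # (1, 256, 2048)
print(fused.shape)
```

The doubled feature dimension is the cost the list above mentions; a projection layer downstream typically maps it back to the policy's working width.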
💡 The surprising importance of resolution

Increasing the input resolution of the vision encoder consistently improves manipulation performance across multiple studies. LLaVA-1.5's jump from 224px to 336px significantly improved VQA scores. For robotics, higher resolution means finer spatial features: a ViT-L/14 at 224px has 16×16 = 256 patches; at 336px it has 24×24 = 576 patches. Each patch still covers 14×14 input pixels, but each now corresponds to a smaller region of the scene, so spatial detail is finer. This matters especially for tasks requiring precise alignment or small-object manipulation.

[Interactive figure: features across encoder layers; early layers capture edges and textures, deeper layers capture objects and semantics at the cost of spatial precision.]

Encoder Comparison

Here is a comprehensive comparison of the major vision encoders used in VLA research, with their key properties and typical use cases:

| Encoder         | Type       | Pre-training                  | Params | Patch Tokens       | Used By                 |
|-----------------|------------|-------------------------------|--------|--------------------|-------------------------|
| CLIP ViT-L/14   | ViT-L      | Contrastive (400M pairs)      | 304M   | 256 (224px)        | RT-2, LLaVA, early VLMs |
| SigLIP-SO400M   | ViT-SO400M | Sigmoid contrastive           | 400M   | 729 (384px)        | OpenVLA, PaliGemma      |
| DINOv2 ViT-L    | ViT-L      | Self-supervised (142M images) | 304M   | 256 (224px)        | Theia, various research |
| EfficientNet-B3 | CNN        | ImageNet supervised           | 12M    | N/A (spatial maps) | RT-1                    |
| VC-1            | ViT-L      | MAE on Ego4D video            | 304M   | 256 (224px)        | Various research        |
| R3M             | ResNet-50  | Video-language contrastive    | 23M    | N/A (spatial maps) | Various research        |

The VLA architecture determines how encoder features flow into the policy:

Image (224×224) → Vision Encoder → 256 × 1024-d tokens → Projection → 256 × 4096-d tokens → LLM Backbone

Code Examples

Extracting features from different encoders

python
import torch
from PIL import Image
from torchvision import transforms

# ── CLIP feature extraction ──────────────────────────────
from transformers import CLIPVisionModel, CLIPProcessor

clip_model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("scene.jpg")
inputs = clip_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = clip_model(**inputs, output_hidden_states=True)

# Patch tokens (spatial features) — NOT the [CLS] token
patch_tokens = outputs.last_hidden_state[:, 1:, :]  # (1, 256, 1024)
print(f"CLIP patch tokens: {patch_tokens.shape}")

# ── SigLIP feature extraction ──────────────────────────
from transformers import SiglipVisionModel, AutoProcessor

siglip = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
proc = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
inputs = proc(images=image, return_tensors="pt")

with torch.no_grad():
    out = siglip(**inputs)
siglip_tokens = out.last_hidden_state  # (1, 729, 1152) at 384px
print(f"SigLIP tokens: {siglip_tokens.shape}")

# ── DINOv2 feature extraction ───────────────────────────
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
dinov2.eval()

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    dino_features = dinov2.forward_features(img_tensor)
    patch_tokens_dino = dino_features["x_norm_patchtokens"]  # (1, 256, 1024)
print(f"DINOv2 patch tokens: {patch_tokens_dino.shape}")

Spatial softmax for CNN features

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Extract expected 2D keypoint positions from CNN feature maps."""

    def __init__(self, height: int, width: int, num_channels: int,
                 temperature: float = 1.0):
        super().__init__()
        self.height = height
        self.width = width
        self.temperature = nn.Parameter(
            torch.tensor(temperature), requires_grad=True
        )

        # Create fixed coordinate grids
        pos_x, pos_y = torch.meshgrid(
            torch.linspace(-1, 1, width),
            torch.linspace(-1, 1, height),
            indexing='xy'
        )
        self.register_buffer('pos_x', pos_x.reshape(-1))  # (H*W,)
        self.register_buffer('pos_y', pos_y.reshape(-1))

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        """
        Args: feature_map (B, C, H, W)
        Returns: keypoints (B, C*2) — expected (x,y) for each channel
        """
        B, C, H, W = feature_map.shape
        flat = feature_map.reshape(B, C, -1)  # (B, C, H*W)

        # Spatial softmax
        weights = F.softmax(flat / self.temperature, dim=-1)  # (B, C, H*W)

        # Expected coordinates
        expected_x = (weights * self.pos_x).sum(dim=-1)  # (B, C)
        expected_y = (weights * self.pos_y).sum(dim=-1)

        return torch.cat([expected_x, expected_y], dim=-1)  # (B, 2C)

# Usage with ResNet
import torchvision.models as models
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)  # 'pretrained=True' is deprecated
# Extract features from layer3 (14x14 spatial, 1024 channels)
encoder = nn.Sequential(*list(resnet.children())[:7])  # up to layer3

spatial_softmax = SpatialSoftmax(height=14, width=14, num_channels=1024)

with torch.no_grad():
    features = encoder(img_tensor)       # (1, 1024, 14, 14)
    keypoints = spatial_softmax(features) # (1, 2048) — 1024 (x,y) pairs
print(f"Spatial softmax keypoints: {keypoints.shape}")

Point cloud from depth image

python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, intrinsics: dict,
                        rgb: np.ndarray = None) -> np.ndarray:
    """Convert depth image to 3D point cloud using camera intrinsics.

    Args:
        depth: (H, W) depth image in meters
        intrinsics: dict with 'fx', 'fy', 'cx', 'cy' (focal lengths and principal point)
        rgb: optional (H, W, 3) color image

    Returns:
        points: (N, 3) or (N, 6) if rgb provided, where N = H*W valid points
    """
    H, W = depth.shape
    fx, fy = intrinsics['fx'], intrinsics['fy']
    cx, cy = intrinsics['cx'], intrinsics['cy']

    # Create pixel coordinate grids
    u, v = np.meshgrid(np.arange(W), np.arange(H))

    # Back-project to 3D using pinhole camera model
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    # Stack into point cloud
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Filter invalid depth values
    valid = (depth.reshape(-1) > 0.1) & (depth.reshape(-1) < 10.0)
    points = points[valid]

    if rgb is not None:
        colors = rgb.reshape(-1, 3)[valid] / 255.0
        points = np.concatenate([points, colors], axis=-1)

    return points

# Example: Intel RealSense D435 intrinsics
intrinsics = {'fx': 615.0, 'fy': 615.0, 'cx': 320.0, 'cy': 240.0}
depth = np.load("depth_frame.npy")  # (480, 640) in meters
rgb = np.array(Image.open("color_frame.png"))

pcd = depth_to_pointcloud(depth, intrinsics, rgb)
print(f"Point cloud: {pcd.shape[0]} points, {pcd.shape[1]} dims")

References

Seminal papers and key works referenced in this article.

  1. Radford et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML, 2021. arXiv
  2. Oquab et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR, 2024. arXiv
  3. Qi et al. "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation." CVPR, 2017. arXiv
  4. Nair et al. "R3M: A Universal Visual Representation for Robot Manipulation." CoRL, 2022. arXiv
  5. Majumdar et al. "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?" NeurIPS, 2023. arXiv