CoTracker — Veanors

Chapter 0: The Problem

You are watching a video of a car driving through a city. You want to track 100 points on the car's surface as it moves, turns, and occasionally disappears behind a lamppost. Each point has a 2D position in each frame, and you need to predict where it goes next.

Existing methods — TAP-Net, TAPIR, PIPs — track each point independently. They take one point, match it across frames, and produce a trajectory. Then they repeat for the next point. Each track knows nothing about the others.

This is deeply wasteful. If 50 of those points sit on the car's hood, they all move together when the car turns. If one point gets occluded behind a lamppost, the other 49 on the same surface tell you exactly where it went. Independent tracking throws away all of this information.

The failure mode: When a point is occluded, an independent tracker has no idea where it went. It either freezes, drifts randomly, or snaps to a nearby texture. But a joint tracker can look at neighboring points that are still visible and infer the occluded point's position from their motion. Points on the same rigid surface form a constellation — if you know where the constellation is, you know where every star in it should be.

The simulation below shows this failure. On the left, points are tracked independently — when one is occluded, it drifts away. On the right, points share information — the occluded point stays locked to its neighbors.

[Interactive demo: Independent vs Joint Tracking. When the occluder (gray bar) passes over, independently tracked points drift while jointly tracked points recover.]

Why does independent point tracking fail when points are occluded?

Each point tracker has no access to information from other tracks, so when a point is occluded, there is nothing to constrain where it should be
Independent trackers use lower resolution features
Independent trackers cannot process more than 10 frames

Chapter 1: The Key Insight

CoTracker's core idea is deceptively simple: track points together, not apart.

If you track N points across T frames, you have an N×T grid of positions. Independent trackers fill in each row of this grid separately. CoTracker fills in the entire grid at once, using a transformer that attends across both the time dimension (how does this point move over time?) and the track dimension (how do different points relate to each other?).

Independent Tracking

Process each of N tracks separately. No information sharing. Complexity: N independent forward passes.

↓ vs ↓

Joint Tracking (CoTracker)

Process all N tracks simultaneously. Transformer attention lets tracks exchange information. Single forward pass for all tracks.

Why does joint tracking help? Three reasons:

Motion correlation: Nearby points on the same object share the same rigid motion. If point A moves 10 pixels right, point B (5 pixels away on the same surface) probably moves ~10 pixels right too. Joint attention captures this automatically.
Occlusion recovery: When point A is hidden, points B, C, D on the same object are still visible. Their motion constrains where A must be. This is impossible with independent tracking.
Support points: CoTracker can even add extra points that the user didn't request. These "support points" form a grid across the image and provide additional context. More points = more correlations = better tracking for everyone.
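
The occlusion-recovery argument can be made concrete with a toy sketch. It assumes a rigid patch that only translates and simply averages the displacement of the visible neighbours. CoTracker learns this kind of reasoning through attention rather than applying a fixed rule, and the function name `infer_occluded` is hypothetical.

```python
def infer_occluded(prev_positions, curr_positions, occluded_idx):
    """Estimate an occluded point's position from its visible neighbours.

    Toy assumption: all points lie on one rigid patch that only translates,
    so the hidden point moves by the mean displacement of the visible ones.
    """
    visible = [i for i in range(len(prev_positions)) if i != occluded_idx]
    mean_dx = sum(curr_positions[i][0] - prev_positions[i][0] for i in visible) / len(visible)
    mean_dy = sum(curr_positions[i][1] - prev_positions[i][1] for i in visible) / len(visible)
    px, py = prev_positions[occluded_idx]
    return (px + mean_dx, py + mean_dy)


# Points A..D in the previous frame; A (index 0) is hidden in the current frame.
prev = [(100.0, 50.0), (105.0, 50.0), (100.0, 55.0), (105.0, 55.0)]
curr = [None, (115.0, 51.0), (110.0, 56.0), (115.0, 56.0)]
estimate = infer_occluded(prev, curr, occluded_idx=0)
print(estimate)  # every visible neighbour moved by (10, 1), so A is placed at (110.0, 51.0)
```

An independent tracker has no equivalent of `curr[1:]` to consult, which is why it has nothing to fall back on during the occlusion.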

The ablation proof: In the ablations reported in Chapter 7, removing joint tracking costs about 4 points of Average Jaccard on TAP-Vid-DAVIS (60.4 → 56.3), and removing support points costs about 2.6 points (60.4 → 57.8). The correlations between tracks are a real, measurable signal.

What are the two dimensions along which CoTracker's transformer applies attention?

Spatial X and spatial Y coordinates of each point
Time (how each point moves across frames) and tracks (how different points relate to each other)
RGB channels and depth channels

Chapter 2: Point Tracking Basics

Before diving into CoTracker's architecture, let's define the problem precisely. The TAP (Tracking Any Point) benchmark introduced a clean formulation:

What is a "track"?

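A track is a sequence of 2D positions (x_t, y_t) for a single physical point across video frames t = t_i, ..., T, where T here denotes the last frame of the video (Chapter 3 reuses T for the window length and writes T' for the video length). The track starts at time t_i when the point first appears (or is queried), and the tracker must predict its position in all subsequent frames.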

Crucially, each track also has a visibility flag v_t ∈ {0, 1}. The point is visible (v=1) when it can be seen in the image, and occluded (v=0) when something blocks it — or when it leaves the camera's field of view entirely. The tracker must predict both position and visibility.

TAP-Vid Metrics

Performance is measured by three metrics:

Metric	What it measures
δ^vis_avg	Fraction of visible points tracked within 1, 2, 4, 8, 16 pixels, averaged over these thresholds. Pure position accuracy.
OA (Occlusion Accuracy)	Binary accuracy of visibility prediction. Can the tracker tell when a point is occluded?
AJ (Average Jaccard)	Joint metric combining position accuracy and occlusion prediction. The hardest metric — you need both right.
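
The three metrics in the table can be sketched in a few lines. This is a simplified illustration, not the official TAP-Vid evaluation code: it works on a flat list of (point, frame) samples, assumes coordinates are already in the evaluation resolution, assumes at least one visible ground-truth sample, and uses the counting convention that a visible prediction outside the threshold is both a false positive and a false negative. The name `tap_metrics` is hypothetical.

```python
THRESHOLDS = (1, 2, 4, 8, 16)  # pixel thresholds from the table above


def tap_metrics(pred_xy, pred_vis, gt_xy, gt_vis):
    """Return (delta_avg, occlusion_accuracy, average_jaccard) for flat sample lists."""
    dist = [((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
            for (px, py), (gx, gy) in zip(pred_xy, gt_xy)]
    # OA: how often the predicted visibility flag matches the ground truth.
    oa = sum(p == g for p, g in zip(pred_vis, gt_vis)) / len(gt_vis)
    n_visible = sum(gt_vis)
    deltas, jaccards = [], []
    for thr in THRESHOLDS:
        within = [d < thr for d in dist]
        samples = list(zip(within, pred_vis, gt_vis))
        # Position accuracy over ground-truth-visible samples only.
        deltas.append(sum(1 for w, _, g in samples if g and w) / n_visible)
        tp = sum(1 for w, p, g in samples if g and p and w)
        fp = sum(1 for w, p, g in samples if p and not (g and w))
        fn = sum(1 for w, p, g in samples if g and not (p and w))
        jaccards.append(tp / (tp + fp + fn))
    return sum(deltas) / len(deltas), oa, sum(jaccards) / len(jaccards)


gt_xy = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
gt_vis = [1, 1, 1, 0]
pred_xy = [(10.0, 10.5), (23.0, 20.0), (30.0, 30.0), (0.0, 0.0)]
pred_vis = [1, 1, 0, 0]
delta_avg, oa, aj = tap_metrics(pred_xy, pred_vis, gt_xy, gt_vis)
print(round(delta_avg, 4), oa, round(aj, 4))
```

In this toy case the third sample is positioned perfectly but predicted as occluded, so it helps the position metric and hurts AJ, which is the sense in which AJ needs both outputs to be right.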

Evaluation Protocols

TAP-Vid uses two protocols:

First: Each point is queried at the first frame where it is visible. The tracker runs forward in time (causal).
Strided: Points are queried every 5 frames. Tracking is bidirectional — causal trackers run the video forwards and backwards and combine results.

Key distinction from optical flow: Optical flow predicts dense motion (every pixel) between adjacent frames. Point tracking predicts sparse motion (selected points) across many frames. Flow is joint but short-range. Prior trackers are long-range but independent. CoTracker is both joint and long-range.

What does the Average Jaccard (AJ) metric measure that δ^vis_avg alone does not?

AJ jointly evaluates both position accuracy and occlusion prediction: you must correctly predict visibility in addition to position
AJ uses higher resolution images for evaluation
AJ measures tracking speed in addition to accuracy

Chapter 3: The Sliding Window

Videos can be thousands of frames long. Processing all frames at once with a transformer is impractical, because the memory cost of attention scales quadratically with sequence length. CoTracker's solution: process the video in overlapping sliding windows.

How It Works

Say the video has T' frames and the window size is T = 8. CoTracker splits the video into J = ⌈2T'/T − 1⌉ windows, each of length T, with an overlap of T/2 frames.

Window 1

Frames 1–8. Initialize tracks with query positions. Apply transformer M times to refine.

↓ overlap frames 5–8

Window 2

Frames 5–12. Initialize with refined predictions from Window 1 for frames 5–8. Broadcast last position for frames 9–12. Refine M times.

↓ overlap frames 9–12

Window 3

Frames 9–16. Same pattern. Previous window's output becomes next window's initialization.

This is essentially a recurrent network. Each window's output initializes the next window's input. Information propagates forward through the video one window at a time, allowing tracks to persist for arbitrarily long sequences.
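
The window layout follows directly from the formula J = ⌈2T'/T − 1⌉. The sketch below enumerates the frame ranges. It assumes 1-indexed frames, a video at least one window long, and that a final partial window is clipped to the video length (the clipping is an assumption of this sketch, not something the text specifies). The name `window_ranges` is hypothetical.

```python
import math


def window_ranges(num_frames, window=8):
    """Return 1-indexed inclusive (start, end) frame ranges for each window."""
    stride = window // 2                                  # consecutive windows overlap by T/2
    num_windows = math.ceil(2 * num_frames / window - 1)  # J = ceil(2T'/T - 1)
    ranges = []
    for j in range(num_windows):
        start = j * stride + 1
        end = min(start + window - 1, num_frames)         # clip the last window (assumption)
        ranges.append((start, end))
    return ranges


print(window_ranges(16))  # [(1, 8), (5, 12), (9, 16)]
print(window_ranges(24))  # [(1, 8), (5, 12), (9, 16), (13, 20), (17, 24)]
```

The 16-frame case reproduces the three windows drawn above, and the 24-frame case corresponds to the training sequence length used later, which yields five windows.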

The Overlap is Critical

The T/2 frame overlap ensures continuity. Without overlap, the tracks would "reset" at each window boundary — there would be no way to transfer information from window j to window j+1. The overlap provides the bridge: the refined positions from the end of window j become the initialization for the start of window j+1.
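
The hand-off between windows can be sketched for a single track as follows. This is a minimal illustration of the rule described above (copy the overlapping half, then repeat the last known position), with the hypothetical name `init_next_window`. It handles positions only and leaves out everything else the model carries.

```python
def init_next_window(prev_window_positions):
    """Build the initial position estimates for window j+1 from window j.

    prev_window_positions: list of T refined (x, y) positions for one track.
    The second half of window j overlaps the first half of window j+1, so
    those estimates are copied; the remaining frames repeat the last one.
    """
    window = len(prev_window_positions)
    half = window // 2
    overlap = list(prev_window_positions[half:])
    last = prev_window_positions[-1]
    return overlap + [last] * (window - half)


# Refined positions for frames 1-8 of one track (toy values).
refined = [(float(t), 2.0 * t) for t in range(8)]
init = init_next_window(refined)
print(init[:4])  # copied from the overlap: [(4.0, 8.0), (5.0, 10.0), (6.0, 12.0), (7.0, 14.0)]
print(init[4:])  # last position repeated four times: (7.0, 14.0)
```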

Tracking through long occlusions: If a point is occluded for 20 frames but visible before and after, independent single-window tracking would lose it. With sliding windows and joint tracking, neighboring visible points carry the occluded point's position forward through multiple windows until it reappears. The recurrent nature of the windows acts as memory.

Track Features

A subtle detail: while position estimates P carry forward between windows, the track feature vectors Q are re-initialized from the original query features at each window. This prevents feature drift — the appearance template stays anchored to what the point originally looked like.

Why do consecutive sliding windows overlap by T/2 frames instead of having no overlap?

To double the number of training samples per sequence
To reduce GPU memory usage
The overlap transfers refined track predictions from one window to the next, providing continuity; without it, tracks would lose all state at window boundaries

Chapter 4: The Joint Tracking Architecture

This is the core of CoTracker. The architecture is a transformer that operates on a 2D grid of tokens — one dimension is time, the other is tracks — and iteratively refines track estimates.

Step 1: Image Features

A CNN extracts dense feature maps φ(I_t) from each video frame I_t. These are computed once and shared across all tracks. The features are computed at multiple scales (S = 4 scales) for matching at different resolutions.

Step 2: Token Construction

For each track i and time t, CoTracker constructs a token Gⁱ_t by concatenating:

Estimated position P̂ⁱ_t relative to the starting position (displacement so far)
Visibility estimate v̂ⁱ_t
Track features Qⁱ_t (appearance template, initialized from image features at the query point)
Correlation features Cⁱ_t (similarity between track features and image features around the current estimated position — like a RAFT correlation volume, but per-track)
Positional encodings of the displacement, starting location, and time
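
To make the token layout more tangible, here is a heavily simplified sketch of the correlation features and the concatenation. The assumptions are significant: a single feature scale, integer positions with border clamping instead of sub-pixel sampling, a two-channel toy feature map, and no positional encodings. The names `correlation_patch` and `build_token` are hypothetical.

```python
def correlation_patch(track_feat, feat_map, cx, cy, radius=3):
    """Dot products between one track feature and the feature map in a
    (2*radius+1)^2 neighbourhood around the integer position (cx, cy)."""
    h, w = len(feat_map), len(feat_map[0])
    scores = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y = min(max(cy + dy, 0), h - 1)   # clamp to the image border
            x = min(max(cx + dx, 0), w - 1)
            scores.append(sum(a * b for a, b in zip(track_feat, feat_map[y][x])))
    return scores


def build_token(displacement, visibility, track_feat, corr):
    """Concatenate the per-track, per-frame quantities into one flat token."""
    return list(displacement) + [visibility] + list(track_feat) + list(corr)


# 9x9 feature map with 2 channels; only pixel (x=5, y=4) matches the template.
feat_map = [[[0.0, 0.0] for _ in range(9)] for _ in range(9)]
feat_map[4][5] = [1.0, 0.0]
template = [1.0, 0.0]
corr = correlation_patch(template, feat_map, cx=4, cy=4, radius=3)
token = build_token((0.0, 0.0), 1.0, template, corr)
print(len(corr), corr.index(max(corr)), len(token))  # 49 25 54
```

The peak at index 25 sits one pixel to the right of the patch centre (index 24), which is the kind of signal the transformer can use to nudge the position estimate.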

Step 3: Factored Attention

Naively attending over all N×T tokens is O(N²T²) — far too expensive. CoTracker factorizes the attention into two alternating operations:

Attention Type	What it does	Cost
Time attention	Each track attends to itself across all T frames. "How has this point moved over time?"	O(T²) per track
Track attention	At each time step, all N tracks attend to each other. "How do different points relate to each other right now?"	O(N²) per frame

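These two operations interleave: time attention, then track attention, then time attention, and so on. Summed over all N tracks and T frames, one time layer plus one track layer costs O(N·T² + T·N²); dropping the linear factors, this is written O(N² + T²), compared with O(N²T²) for full attention.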
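
A quick count of query-key pairs shows the size of the saving. The sketch multiplies the per-track and per-frame costs from the table by the number of tracks and frames. The values N = 100 and T = 8 are illustrative, taken from the running example and the window size used in this document.

```python
def full_attention_pairs(n_tracks, n_frames):
    """Every token attends to every token in the N x T grid."""
    tokens = n_tracks * n_frames
    return tokens * tokens


def factored_attention_pairs(n_tracks, n_frames):
    """One time-attention layer plus one track-attention layer."""
    time_pairs = n_tracks * n_frames ** 2    # each track attends over its T frames
    track_pairs = n_frames * n_tracks ** 2   # each frame attends over its N tracks
    return time_pairs + track_pairs


n, t = 100, 8
full = full_attention_pairs(n, t)
factored = factored_attention_pairs(n, t)
print(full, factored)  # 640000 86400
```

For these values the factored form needs roughly 7 times fewer pairs, and the gap widens as N or T grows.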

Step 4: Iterative Refinement

The transformer is applied M times. Each application produces small updates ΔP̂ (position correction) and ΔQ (feature update). The estimates are refined additively:

P̂^(m+1) = P̂^(m) + ΔP̂, Q^(m+1) = Q^(m) + ΔQ

Visibility is predicted only once, after the final iteration, as v̂ = σ(W Q^(M)). The intuition: you need an accurate position estimate before you can reliably decide if a point is visible.

Why iterative updates? This is the same idea as RAFT. A single forward pass must make a large prediction — "the point moved 50 pixels right." Iterative refinement makes many small corrections — "the point moved 48... no, 49.5... no, 50.2 pixels." Each iteration re-computes the correlation features C around the current estimate, so the network can "zoom in" on the right answer.
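
The additive update rule can be illustrated with a one-dimensional toy. The update used here, a fixed fraction of the remaining error, is only a stand-in for the correction the transformer predicts from correlation features; the real model never sees the target. The step size, the iteration count, and the function names are all assumptions for illustration.

```python
import math


def refine(position, target, num_iters=4, step=0.5):
    """Apply num_iters additive updates P <- P + dP and return all estimates.

    step * (target - position) stands in for the network's predicted
    correction; it is not how the real model computes its update.
    """
    history = [position]
    for _ in range(num_iters):
        delta = step * (target - position)
        position = position + delta
        history.append(position)
    return history


def visibility(weights, features):
    """Sigmoid of a linear readout, mirroring v = sigma(W Q) after the last iteration."""
    logit = sum(w * q for w, q in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


history = refine(position=0.0, target=50.0)
print(history)  # [0.0, 25.0, 37.5, 43.75, 46.875]
print(visibility([0.5, -0.5], [1.0, 1.0]))  # 0.5
```

Each pass closes part of the remaining gap, which mirrors the "48... 49.5... 50.2" intuition: no single step has to get the full displacement right.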

[Interactive demo: Joint Tracking Architecture. Points on a surface are tracked jointly, with a toggle between independent and joint mode; when a point is occluded, joint tracking uses neighbor information to maintain it.]

Why does CoTracker factorize attention into separate time and track operations instead of full N×T self-attention?

Full N×T attention is O(N²T²), which is prohibitively expensive; factorization reduces this to O(N² + T²) while still capturing both temporal and cross-track dependencies
Full attention produces worse gradients during training
Factorized attention uses fewer parameters

Chapter 5: Token Proxies

Factorized attention reduces the cost from O(N²T²) to O(N² + T²). But there's still that N² term. If you want to track N = 70,000 points (quasi-dense tracking), N² is about 4.9 billion pairwise attention scores per track-attention layer, which is impractical to store on a single GPU once multiplied across frames, heads, and layers.

The solution: proxy tokens. Instead of every track attending to every other track (O(N²)), introduce K proxy tokens where K ≪ N, and have tracks attend to proxies instead of each other.

How Proxies Work

Proxy tokens are K learned, fixed tokens (like "virtual tracks") that are concatenated to the list of real tracks at the transformer input.

Time attention: Proxies are processed identically to regular tracks. Each proxy attends to itself across all T frames.
Track attention: This is where the magic happens. Regular tracks cross-attend to proxies, not to each other. Proxies self-attend among themselves. After the attention layer, proxy outputs are discarded — they only serve as information bottlenecks.

Cost = O(NK + K² + T²)

Since K is fixed (and small), this is linear in N. You can now scale to 70K points on a single GPU.
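
Counting connections makes the linear scaling visible. The sketch below compares full track attention with the proxy scheme for one frame, using the NK + K² terms from the cost expression above. The specific N and K values are illustrative.

```python
def full_track_connections(n_tracks):
    """Every track attends to every track: N^2 pairs."""
    return n_tracks * n_tracks


def proxy_track_connections(n_tracks, n_proxies):
    """Tracks attend to proxies (N*K) and proxies attend to each other (K^2)."""
    return n_tracks * n_proxies + n_proxies * n_proxies


for n, k in [(12, 4), (70_000, 4)]:
    print(n, k, full_track_connections(n), proxy_track_connections(n, k))
# 12 4 144 64
# 70000 4 4900000000 280016
```

Doubling N doubles the proxy count of connections but quadruples the full-attention count, which is the practical meaning of "linear in N".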

What do proxies learn? Think of proxies as "information hubs." Each proxy learns to represent a cluster of related tracks — maybe all points on the same object, or all points with similar motion. Tracks deposit their information into proxies, proxies aggregate and redistribute. It's like a message-passing network with a bottleneck layer. Similar in spirit to register tokens in Vision Transformers (Darcet et al., 2023).

[Interactive figure: Token Proxy Visualization. Left: full track attention, where every track attends to every other (O(N²) connections). Right: proxy attention, where tracks attend to K proxies (O(NK) connections). Defaults: N = 12 tracks, K = 4 proxies.]

How do proxy tokens reduce the memory complexity of track attention from O(N²) to O(NK)?

By reducing the number of attention heads
By using sparse attention masks that skip every other track
Regular tracks cross-attend to K proxy tokens instead of to each other: each track attends to K proxies rather than N tracks, making the cost linear in N

Chapter 6: Training

CoTracker is trained on TAP-Vid-Kubric, a synthetic dataset generated by the Kubric engine. It consists of sequences showing 3D rigid objects falling under gravity and bouncing, with ground-truth point tracks.

Why Synthetic Data?

Real-world point tracking annotations are extremely expensive. You need to label the same physical point across hundreds of frames, even through occlusions. Synthetic data gives you:

Perfect ground truth: The 3D positions of every surface point are known exactly, so projecting them to 2D gives pixel-accurate tracks.
Free occlusion labels: The renderer knows exactly which points are visible in each frame.
Unlimited data: You can generate as many sequences as you need.

Unrolled Training

Because CoTracker processes videos with sliding windows (like a recurrent network), it is trained with unrolled optimization — backpropagating through multiple windows to teach the model to maintain tracks across window boundaries.

The training loss sums over all transformer iterations m and all windows j:

L₁(P̂, P) = ∑_{j=1}^{J} ∑_{m=1}^{M} γ^{M−m} ‖P̂^(m,j) − P^(j)‖

The discount factor γ = 0.8 weights later iterations more heavily — early refinement steps are exploratory, late ones should be accurate. A second loss L₂ is cross-entropy on the visibility predictions.
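
The weighting γ^(M−m) is easy to check numerically. The sketch below computes the per-iteration weights and a toy version of the position loss. It assumes the norm is a sum of per-point Euclidean distances, which is one reasonable reading of the formula rather than a confirmed detail, and the function names are hypothetical.

```python
import math


def iteration_weights(num_iters, gamma=0.8):
    """Weights gamma^(M-m) for m = 1..M; the final iteration gets weight 1."""
    return [gamma ** (num_iters - m) for m in range(1, num_iters + 1)]


def position_loss(pred, target, gamma=0.8):
    """Discounted position loss summed over windows j and iterations m.

    pred[j][m] is a list of (x, y) predictions; target[j] is the matching
    list of ground-truth positions for window j.
    """
    total = 0.0
    for window_preds, window_gt in zip(pred, target):
        weights = iteration_weights(len(window_preds), gamma)
        for weight, iter_pred in zip(weights, window_preds):
            err = sum(math.hypot(px - gx, py - gy)
                      for (px, py), (gx, gy) in zip(iter_pred, window_gt))
            total += weight * err
    return total


weights = iteration_weights(4)
print([round(w, 3) for w in weights])  # [0.512, 0.64, 0.8, 1.0]

# One window, one point, two iterations: errors of 5 px then 1 px.
loss = position_loss(pred=[[[(3.0, 4.0)], [(0.0, 1.0)]]], target=[[(0.0, 0.0)]])
print(round(loss, 6))  # 0.8 * 5 + 1.0 * 1 = 5.0
```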

Training Details

Parameter	Value
Training set	TAP-Vid-Kubric (6,000 sequences)
Sequence length	T' = 24 frames
Window size	T = 8 frames
Tracks per batch	N = 768
Training iterations	50,000
Hardware	32 NVIDIA A100 80GB GPUs
Training time	~40 hours
Feature scales	S = 4
Correlation radius	Δ = 3

Generalization from synthetic to real: Despite being trained only on synthetic Kubric data, CoTracker generalizes well to real-world videos (TAP-Vid-DAVIS). A plausible explanation is that the core computation, matching features and reasoning about motion correlations, transfers across domains, so the model learns general motion reasoning rather than Kubric-specific patterns.

Why does the training loss use a discount factor γ = 0.8 that weights early transformer iterations less than later ones?

To reduce overfitting to the training data
Early iterations are coarse refinements, so demanding high accuracy from them would create conflicting gradients. Later iterations should be accurate, so they get higher weight.
To make the loss function differentiable

Chapter 7: Results

CoTracker was evaluated against TAP-Net, PIPs, PIPs++, MFT, OmniMotion, and TAPIR across multiple benchmarks. The tables below report the TAP-Vid-DAVIS results under both protocols.

TAP-Vid-DAVIS (Real Videos, First Protocol)

Method	AJ ↑	δ^vis_avg ↑	OA ↑
TAP-Net	33.0	48.6	78.8
PIPs	42.2	64.8	77.7
OmniMotion	52.8	66.9	87.1
TAPIR	56.2	70.0	86.5
CoTracker	62.2	75.7	89.3

CoTracker achieves 62.2 AJ, beating TAPIR by 6 points and OmniMotion by nearly 10 points. The gap is largest on AJ, which requires both accurate positions and correct occlusion prediction — exactly where joint tracking shines.

TAP-Vid-DAVIS (Strided Protocol)

Method	AJ ↑	δ^vis_avg ↑	OA ↑
TAP-Net	38.4	53.1	82.3
OmniMotion	51.7	67.5	85.3
TAPIR	61.3	73.6	88.8
CoTracker	65.9	79.4	89.9

Key Ablations

What matters most? Ablations on TAP-Vid-DAVIS reveal:

Configuration	AJ	Δ
Full CoTracker (joint + unrolled)	60.4	—
No joint tracking (independent tracks)	56.3	−4.1
No unrolled training (single window)	56.0	−4.4
No support points	57.8	−2.6

Both innovations matter: Removing either joint tracking or unrolled training costs about 4 points of AJ (4.1 and 4.4 respectively). The two are complementary: joint tracking exploits spatial correlations between tracks, while unrolled training teaches the model to maintain tracks across window boundaries.

[Chart: Results Comparison. Average Jaccard (AJ) on TAP-Vid-DAVIS (First protocol). Higher is better.]

According to ablations, which two design choices contribute the most to CoTracker's performance?

Joint tracking (sharing information between tracks) and unrolled training (backpropagating through multiple windows); each adds ~4 AJ points
Multi-scale features and larger CNN backbone
Support points and correlation radius

Chapter 8: Dense Tracking

So far, we've discussed tracking sparse sets of points — maybe a few hundred selected by the user. But CoTracker's architecture, especially with proxy tokens, scales to much larger point sets. This enables quasi-dense tracking.

Regular Grid Tracking

Instead of tracking user-selected points, lay down a dense regular grid across the first frame — say, one point every 4 pixels. On a 256×256 image, that's 64×64 = 4,096 points. With proxy tokens (K=4), CoTracker can track all of them jointly.
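
Generating such a grid of query points takes only a few lines. The sketch below places one point at the centre of every stride × stride cell; the cell-centre placement and the name `grid_queries` are assumptions for illustration, since the text only specifies the spacing.

```python
def grid_queries(height, width, stride):
    """Return (x, y) query points on a regular grid, one per stride x stride cell."""
    offset = stride / 2.0  # place each point at its cell centre (assumption)
    return [(x * stride + offset, y * stride + offset)
            for y in range(height // stride)
            for x in range(width // stride)]


points = grid_queries(256, 256, 4)
print(len(points))            # 4096
print(points[0], points[-1])  # (2.0, 2.0) (254.0, 254.0)
```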

At inference, CoTracker has been demonstrated tracking up to 70,000 points simultaneously on a single GPU. This approaches the density of optical flow while maintaining the long-range tracking capability of point trackers.

Applications of Dense Tracking

Scene flow: Dense 2D tracks from multiple views can be lifted to 3D motion fields.
Video editing: Dense tracks provide a motion atlas for propagating edits across frames.
Dynamic 3D reconstruction: Dense correspondences across frames constrain the 3D structure of deformable objects.
Motion segmentation: Points with coherent motion naturally cluster into objects.

From sparse to dense: The proxy token mechanism is what makes dense tracking possible. Without proxies, tracking N=70K points would require O(N²) = ~5 billion attention operations. With K=4 proxies, it's O(NK) = ~280K operations — a reduction of over 10,000×.

Dense grid tracking also reveals motion structure that sparse tracking misses. On the DAVIS benchmark, visualizing all tracked points as a displacement field shows the flow of every surface — not just the selected ones. This is reminiscent of optical flow but extends across the full video.

What makes it possible for CoTracker to scale from sparse tracking (~100 points) to quasi-dense tracking (~70K points) on a single GPU?

Using a more efficient CNN backbone
Reducing the number of transformer iterations from M=6 to M=2
Proxy tokens replace O(N²) track self-attention with O(NK) cross-attention, making the cost linear in the number of tracks

Chapter 9: Connections

CoTracker sits at the intersection of optical flow, point tracking, and transformer architectures. Let's map where it fits.

Relation to RAFT

CoTracker borrows heavily from RAFT: the iterative refinement paradigm, the correlation volumes computed at multiple scales, and the idea of making small corrections rather than large predictions. The key difference is that RAFT does dense, two-frame flow, while CoTracker does sparse, multi-frame tracking with cross-track attention.

Relation to TAPIR

TAPIR is CoTracker's closest competitor. It uses a two-stage approach: first, a global matching stage that finds coarse correspondences, then a PIPs-like refinement stage. TAPIR tracks points independently — no information sharing between tracks. CoTracker shows that adding cross-track attention on top of a similar architecture yields significant improvements, especially for occluded points.

Relation to CoTracker3 (2024)

The follow-up, CoTracker3, extends the ideas here in a key way: it is trained on real-world data using pseudo-labels from teacher models, rather than only on synthetic Kubric data. This dramatically improves generalization. The architecture remains similar, but the training recipe is what changes.

Relation to VGGSfM

VGGSfM (Visual Geometry Grounded Structure from Motion) uses CoTracker's dense point tracks as input to a differentiable SfM pipeline. This demonstrates a key application: dense, accurate point tracks across video frames provide the correspondences that 3D reconstruction algorithms need.

Cheat Sheet

Aspect	CoTracker
Input	Video frames + query point locations + start times
Output	2D tracks (x,y per frame) + visibility flags
Feature backbone	CNN (trained end-to-end), S=4 scales
Core mechanism	Factored time + track attention, iterated M times
Scaling trick	K proxy tokens → O(NK) instead of O(N²)
Long-video handling	Overlapping sliding windows (recurrent)
Training data	TAP-Vid-Kubric (synthetic, 24-frame sequences)
Training strategy	Unrolled optimization across windows
Key result	62.2 AJ on TAP-Vid-DAVIS (vs 56.2 TAPIR)
Scale	70K points jointly on single GPU

The broader lesson: When data has structure, exploit it. Points in a video are not statistically independent — they live on objects, share motion, and constrain each other. CoTracker's contribution is showing that a simple change — adding cross-track attention — unlocks this structure and yields large, measurable improvements.

What is the fundamental difference between CoTracker and TAPIR?

CoTracker uses a CNN backbone while TAPIR uses a ViT
TAPIR tracks each point independently, while CoTracker tracks all points jointly using cross-track attention, allowing tracks to share information and improve each other
CoTracker trains on real data while TAPIR trains on synthetic data

It is Better to Track Together