Karaev, Rocco, Graham, Neverova, Vedaldi, Rupprecht — Meta AI + Oxford VGG, 2023

It is Better to Track Together

Track thousands of 2D points jointly across long video sequences. Points move together — exploit their correlations to handle occlusions, track through disappearances, and scale to 70K points on a single GPU.

Prerequisites: Transformers (attention) + Optical flow basics + Correlation volumes
10 chapters · 4+ simulations

Chapter 0: The Problem

You are watching a video of a car driving through a city. You want to track 100 points on the car's surface as it moves, turns, and occasionally disappears behind a lamppost. Each point has a 2D position in each frame, and you need to predict where it goes next.

Existing methods — TAP-Net, TAPIR, PIPs — track each point independently. They take one point, match it across frames, and produce a trajectory. Then they repeat for the next point. Each track knows nothing about the others.

This is deeply wasteful. If 50 of those points sit on the car's hood, they all move together when the car turns. If one point gets occluded behind a lamppost, the other 49 on the same surface tell you exactly where it went. Independent tracking throws away all of this information.

The failure mode: When a point is occluded, an independent tracker has no idea where it went. It either freezes, drifts randomly, or snaps to a nearby texture. But a joint tracker can look at neighboring points that are still visible and infer the occluded point's position from their motion. Points on the same rigid surface form a constellation — if you know where the constellation is, you know where every star in it should be.

The simulation below shows this failure. On the left, points are tracked independently — when one is occluded, it drifts away. On the right, points share information — the occluded point stays locked to its neighbors.

Independent vs Joint Tracking

Watch the points move across the surface. When the occluder (gray bar) passes over, independently tracked points drift while jointly tracked points recover. Press Play to start.

Why does independent point tracking fail when points are occluded?

Chapter 1: The Key Insight

CoTracker's core idea is deceptively simple: track points together, not apart.

If you track N points across T frames, you have an N×T grid of positions. Independent trackers fill in each row of this grid separately. CoTracker fills in the entire grid at once, using a transformer that attends across both the time dimension (how does this point move over time?) and the track dimension (how do different points relate to each other?).

Independent Tracking
Process each of N tracks separately. No information sharing. Complexity: N independent forward passes.
↓ vs ↓
Joint Tracking (CoTracker)
Process all N tracks simultaneously. Transformer attention lets tracks exchange information. Single forward pass for all tracks.

Why does joint tracking help? Three reasons:

The ablation proof: In CoTracker's experiments, switching from independent to joint tracking improves Average Jaccard by 4–6 points on TAP-Vid-DAVIS. Adding support points on top of that improves it by another 2–3 points. The correlations between tracks are a real, measurable signal.
What are the two dimensions along which CoTracker's transformer applies attention?

Chapter 2: Point Tracking Basics

Before diving into CoTracker's architecture, let's define the problem precisely. The TAP (Tracking Any Point) benchmark introduced a clean formulation:

What is a "track"?

A track is a sequence of 2D positions (x_t, y_t) for a single physical point across video frames t = t_i, ..., T. The track starts at time t_i when the point first appears (or is queried), and the tracker must predict its position in all subsequent frames.

Crucially, each track also has a visibility flag v_t ∈ {0, 1}. The point is visible (v_t = 1) when it can be seen in the image, and occluded (v_t = 0) when something blocks it or when it leaves the camera's field of view entirely. The tracker must predict both position and visibility.

TAP-Vid Metrics

Performance is measured by three metrics:

| Metric | What it measures |
|---|---|
| δ_avg^vis | Fraction of visible points tracked within 1, 2, 4, 8, 16 pixels, averaged over these thresholds. Pure position accuracy. |
| OA (Occlusion Accuracy) | Binary accuracy of visibility prediction. Can the tracker tell when a point is occluded? |
| AJ (Average Jaccard) | Joint metric combining position accuracy and occlusion prediction. The hardest metric: you need both right. |
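The first two metrics in the table can be sketched in a few lines of NumPy. This is an illustrative sketch, not the official TAP-Vid evaluation code: the function names, the array layout (N tracks by T frames), and the strict `<` comparison are assumptions made for the example. AJ is omitted because it needs additional bookkeeping of true and false positives.

```python
import numpy as np

THRESHOLDS = (1, 2, 4, 8, 16)  # pixel thresholds listed in the table

def delta_avg_vis(pred_xy, gt_xy, gt_visible):
    """Position accuracy on visible points, averaged over the thresholds.

    pred_xy, gt_xy: (N, T, 2) arrays; gt_visible: (N, T) array of 0/1 flags.
    Assumes a strict '<' comparison (an assumption of this sketch).
    """
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)  # (N, T) pixel error
    vis = gt_visible.astype(bool)
    fractions = [(err[vis] < thr).mean() for thr in THRESHOLDS]
    return float(np.mean(fractions))

def occlusion_accuracy(pred_visible, gt_visible):
    """Fraction of (track, frame) pairs whose visibility flag is predicted correctly."""
    return float((pred_visible.astype(bool) == gt_visible.astype(bool)).mean())

# Toy example: 2 tracks, 2 frames, ground truth at the origin.
gt_xy = np.zeros((2, 2, 2))
pred_xy = np.array([[[0.5, 0.0], [3.0, 0.0]],
                    [[10.0, 0.0], [100.0, 0.0]]])
gt_vis = np.array([[1, 1], [1, 0]])
pred_vis = np.array([[1, 1], [1, 1]])

print(delta_avg_vis(pred_xy, gt_xy, gt_vis))   # visible errors 0.5, 3, 10 -> 0.6
print(occlusion_accuracy(pred_vis, gt_vis))    # 3 of 4 flags correct -> 0.75
```

The occluded point's 100-pixel error does not affect the position score, which is why a separate occlusion metric is needed.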

Evaluation Protocols

TAP-Vid uses two protocols:

Key distinction from optical flow: Optical flow predicts dense motion (every pixel) between adjacent frames. Point tracking predicts sparse motion (selected points) across many frames. Flow is joint but short-range. Prior trackers are long-range but independent. CoTracker is both joint and long-range.
What does the Average Jaccard (AJ) metric measure that δ_avg^vis alone does not?

Chapter 3: The Sliding Window

Videos can be thousands of frames long. Processing all frames at once with a transformer is impractical because the memory cost of attention scales quadratically with sequence length. CoTracker's solution: process the video in overlapping sliding windows.

How It Works

Say the video has T' frames and the window size is T = 8. CoTracker splits the video into J = ⌈2T'/T − 1⌉ windows, each of length T, with an overlap of T/2 frames.

Window 1
Frames 1–8. Initialize tracks with query positions. Apply transformer M times to refine.
↓ overlap frames 5–8
Window 2
Frames 5–12. Initialize with refined predictions from Window 1 for frames 5–8. Broadcast last position for frames 9–12. Refine M times.
↓ overlap frames 9–12
Window 3
Frames 9–16. Same pattern. Previous window's output becomes next window's initialization.

This is essentially a recurrent network. Each window's output initializes the next window's input. Information propagates forward through the video one window at a time, allowing tracks to persist for arbitrarily long sequences.
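The window layout described above can be reproduced with a short helper. This is a minimal sketch: the function name `sliding_windows` is hypothetical, frames are 1-indexed to match the text, and the case where the last window runs past the end of the video (which a real implementation would pad or clip) is not handled.

```python
import math

def sliding_windows(num_frames, window=8):
    """Return 1-indexed inclusive (start, end) frame ranges.

    Uses J = ceil(2 * T' / T - 1) windows of length T with a stride of T / 2.
    If num_frames is not a multiple of T / 2, the last range may extend past
    the video; padding or clipping is left out of this sketch.
    """
    stride = window // 2
    count = math.ceil(2 * num_frames / window - 1)
    return [(j * stride + 1, j * stride + window) for j in range(count)]

print(sliding_windows(16))  # [(1, 8), (5, 12), (9, 16)]
print(sliding_windows(24))  # [(1, 8), (5, 12), (9, 16), (13, 20), (17, 24)]
```

For the 24-frame training sequences this gives J = 5 windows, each sharing four frames with its predecessor.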

The Overlap is Critical

The T/2 frame overlap ensures continuity. Without overlap, the tracks would "reset" at each window boundary — there would be no way to transfer information from window j to window j+1. The overlap provides the bridge: the refined positions from the end of window j become the initialization for the start of window j+1.
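The hand-off between windows can be sketched as follows. This is an illustration under assumptions: the function name `init_next_window` and the (N, T, 2) array layout are hypothetical, and only position estimates are handled here.

```python
import numpy as np

def init_next_window(prev_positions):
    """Build initial position estimates for window j+1 from window j.

    prev_positions: (N, T, 2) refined estimates from window j.
    The first T/2 frames of the new window overlap the last T/2 frames of the
    old one, so they reuse the refined positions. The remaining T/2 frames are
    new, so they start from the last known position of each track.
    """
    n, t, _ = prev_positions.shape
    half = t // 2
    nxt = np.empty_like(prev_positions)
    nxt[:, :half] = prev_positions[:, half:]      # overlap: copy refined estimates
    nxt[:, half:] = prev_positions[:, -1:, :]     # new frames: broadcast last position
    return nxt

# Toy example: one track whose x and y both equal the frame index 0..7.
prev = np.stack([np.arange(8.0), np.arange(8.0)], axis=-1)[None]  # (1, 8, 2)
nxt = init_next_window(prev)
print(nxt[0, :, 0])  # [4. 5. 6. 7. 7. 7. 7. 7.]
```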

Tracking through long occlusions: If a point is occluded for 20 frames but visible before and after, independent single-window tracking would lose it. With sliding windows and joint tracking, neighboring visible points carry the occluded point's position forward through multiple windows until it reappears. The recurrent nature of the windows acts as memory.

Track Features

A subtle detail: while position estimates P carry forward between windows, the track feature vectors Q are re-initialized from the original query features at each window. This prevents feature drift — the appearance template stays anchored to what the point originally looked like.

Why do consecutive sliding windows overlap by T/2 frames instead of having no overlap?

Chapter 4: The Joint Tracking Architecture

This is the core of CoTracker. The architecture is a transformer that operates on a 2D grid of tokens — one dimension is time, the other is tracks — and iteratively refines track estimates.

Step 1: Image Features

A CNN extracts dense feature maps φ(I_t) from each video frame I_t. These are computed once and shared across all tracks. The features are computed at multiple scales (S = 4 scales) for matching at different resolutions.

Step 2: Token Construction

For each track i and time t, CoTracker constructs a token G_t^i by concatenating:

Step 3: Factored Attention

Naively attending over all N×T tokens is O(N²T²), which is far too expensive. CoTracker factorizes the attention into two alternating operations:

| Attention type | What it does | Cost |
|---|---|---|
| Time attention | Each track attends to itself across all T frames. "How has this point moved over time?" | O(T²) per track |
| Track attention | At each time step, all N tracks attend to each other. "How do different points relate to each other right now?" | O(N²) per frame |

These two operations interleave: time attention, then track attention, then time attention, and so on. The total cost is O(N² + T²) instead of O(N²T²).
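The factorization amounts to choosing which axis of the N×T token grid is treated as the sequence. The sketch below shows one time-attention step followed by one track-attention step. It is deliberately simplified and is not CoTracker's actual layer: it uses single-head dot-product attention with no learned projections, and the residual connections and the function names are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Plain dot-product self-attention over axis 1. x: (batch, length, dim)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (batch, length, length)
    return softmax(scores) @ x

def factored_block(tokens):
    """tokens: (N, T, D) grid of track tokens."""
    # Time attention: N independent sequences of length T (score matrices are T x T).
    tokens = tokens + self_attention(tokens)
    # Track attention: T independent sequences of length N (score matrices are N x N).
    swapped = tokens.transpose(1, 0, 2)            # (T, N, D)
    swapped = swapped + self_attention(swapped)
    return swapped.transpose(1, 0, 2)              # back to (N, T, D)

rng = np.random.default_rng(0)
grid = rng.normal(size=(5, 8, 16))                 # N=5 tracks, T=8 frames, D=16
print(factored_block(grid).shape)                  # (5, 8, 16)
```

No score matrix of size (N·T)×(N·T) is ever formed; the largest ones are T×T and N×N.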

Step 4: Iterative Refinement

The transformer is applied M times. Each application produces small updates ΔP̂ (position correction) and ΔQ (feature update). The estimates are refined additively:

$$\hat{P}^{(m+1)} = \hat{P}^{(m)} + \Delta\hat{P}, \qquad Q^{(m+1)} = Q^{(m)} + \Delta Q$$

Visibility is predicted only once, after the final iteration, as v̂ = σ(W Q^(M)). The intuition: you need an accurate position estimate before you can reliably decide if a point is visible.
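The refinement loop has a simple shape, sketched below with a stand-in for the transformer. Everything named here is hypothetical: `update_fn` plays the role of one transformer application, `toy_update` is a made-up rule that closes half of the remaining gap to a known target, and the weight vector `w` is a placeholder for the learned visibility head.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(positions, features, update_fn, num_iters):
    """Apply additive updates num_iters times, as in the update rule above."""
    for _ in range(num_iters):
        delta_p, delta_q = update_fn(positions, features)
        positions = positions + delta_p
        features = features + delta_q
    return positions, features

# Toy stand-in for the transformer: step halfway toward a known target.
target = np.array([[50.0, 0.0]])

def toy_update(positions, features):
    return 0.5 * (target - positions), np.zeros_like(features)

p, q = refine(np.zeros((1, 2)), np.zeros((1, 4)), toy_update, num_iters=4)
print(p)  # [[46.875  0.   ]]: the error shrinks 50 -> 25 -> 12.5 -> 6.25 -> 3.125

# Visibility is read out once, from the final features only.
w = np.zeros(4)            # placeholder for learned weights
visibility = sigmoid(q @ w)
print(visibility)          # [0.5]
```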

Why iterative updates? This is the same idea as RAFT. A single forward pass must make a large prediction — "the point moved 50 pixels right." Iterative refinement makes many small corrections — "the point moved 48... no, 49.5... no, 50.2 pixels." Each iteration re-computes the correlation features C around the current estimate, so the network can "zoom in" on the right answer.
Joint Tracking Architecture

Animated demonstration: points on a surface are tracked jointly. Toggle between independent and joint mode. When a point is occluded, joint tracking uses neighbor information to maintain it. Drag the slider to scrub through frames.

Why does CoTracker factorize attention into separate time and track operations instead of full N×T self-attention?

Chapter 5: Token Proxies

Factorized attention reduces the cost from O(N²T²) to O(N² + T²). But there's still that N² term. If you want to track N = 70,000 points (quasi-dense tracking), even O(N²) is ~5 billion operations per attention layer. That doesn't fit on a GPU.

The solution: proxy tokens. Instead of every track attending to every other track (O(N²)), introduce K proxy tokens where K ≪ N, and have tracks attend to proxies instead of each other.

How Proxies Work

Proxy tokens are K learned, fixed tokens (like "virtual tracks") that are concatenated to the list of real tracks at the transformer input.

Cost = O(NK + K² + T²)

Since K is fixed (and small), this is linear in N. You can now scale to 70K points on a single GPU.
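One way to realize an O(NK + K²) track-attention step is to route all cross-track communication through the proxies. The sketch below is an assumption about the wiring, not a transcription of CoTracker's layers: it uses three plain dot-product attention steps without learned projections, and the names `attend` and `proxy_track_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Dot-product attention. queries: (Lq, D), keys_values: (Lkv, D)."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])  # (Lq, Lkv)
    return softmax(scores) @ keys_values

def proxy_track_attention(tracks, proxies):
    """tracks: (N, D) tokens at one time step; proxies: (K, D) proxy tokens."""
    proxies = proxies + attend(proxies, tracks)    # K x N scores: proxies gather from tracks
    proxies = proxies + attend(proxies, proxies)   # K x K scores: proxies exchange information
    tracks = tracks + attend(tracks, proxies)      # N x K scores: tracks read back from proxies
    return tracks, proxies

rng = np.random.default_rng(0)
tracks = rng.normal(size=(1000, 8))    # N = 1000
proxies = rng.normal(size=(4, 8))      # K = 4
new_tracks, new_proxies = proxy_track_attention(tracks, proxies)
print(new_tracks.shape, new_proxies.shape)  # (1000, 8) (4, 8)
```

The largest score matrix here has N·K entries, so memory grows linearly with the number of tracks.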

What do proxies learn? Think of proxies as "information hubs." Each proxy learns to represent a cluster of related tracks — maybe all points on the same object, or all points with similar motion. Tracks deposit their information into proxies, proxies aggregate and redistribute. It's like a message-passing network with a bottleneck layer. Similar in spirit to register tokens in Vision Transformers (Darcet et al., 2023).
Token Proxy Visualization

Left: Full track attention (every track attends to every other — O(N²)). Right: Proxy attention (tracks attend to K proxies — O(NK)). Adjust N and K to see how the connection count changes.

How do proxy tokens reduce the memory complexity of track attention from O(N²) to O(NK)?

Chapter 6: Training

CoTracker is trained on TAP-Vid-Kubric, a synthetic dataset generated by the Kubric engine. It consists of sequences showing 3D rigid objects falling under gravity and bouncing, with ground-truth point tracks.

Why Synthetic Data?

Real-world point tracking annotations are extremely expensive. You need to label the same physical point across hundreds of frames, even through occlusions. Synthetic data gives you:

Unrolled Training

Because CoTracker processes videos with sliding windows (like a recurrent network), it is trained with unrolled optimization — backpropagating through multiple windows to teach the model to maintain tracks across window boundaries.

The training loss sums over all transformer iterations m and all windows j:

$$\mathcal{L}_1(\hat{P}, P) = \sum_{j=1}^{J} \sum_{m=1}^{M} \gamma^{M-m} \left\| \hat{P}^{(m,j)} - P^{(j)} \right\|$$

The discount factor γ = 0.8 weights later iterations more heavily: early refinement steps are exploratory, while late ones should be accurate. A second loss, L_2, is the cross-entropy on the visibility predictions.
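The position loss can be written out directly. This is a sketch under assumptions: the function name `track_loss` and the array layouts are hypothetical, and the norm is taken as the sum of per-point Euclidean distances, which is one possible reading of the formula.

```python
import numpy as np

def track_loss(preds, targets, gamma=0.8):
    """Discounted position loss over windows j and refinement iterations m.

    preds:   (J, M, N, T, 2) estimates for every window and iteration.
    targets: (J, N, T, 2) ground-truth positions per window.
    """
    num_windows, num_iters = preds.shape[:2]
    total = 0.0
    for j in range(num_windows):
        for m in range(num_iters):
            weight = gamma ** (num_iters - (m + 1))   # m is 0-indexed here
            err = np.linalg.norm(preds[j, m] - targets[j], axis=-1)  # (N, T)
            total += weight * err.sum()
    return total

# Iteration weights for M = 4: the last iteration has weight 1.
print([round(0.8 ** (4 - m), 3) for m in range(1, 5)])  # [0.512, 0.64, 0.8, 1.0]

# Toy check: one window, two iterations, one point, one frame.
targets = np.zeros((1, 1, 1, 2))
preds = np.zeros((1, 2, 1, 1, 2))
preds[0, 0, 0, 0] = [3.0, 4.0]        # first iteration is 5 pixels off
print(track_loss(preds, targets))     # 0.8 * 5 + 1.0 * 0 = 4.0
```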

Training Details

| Parameter | Value |
|---|---|
| Training set | TAP-Vid-Kubric (6,000 sequences) |
| Sequence length | T' = 24 frames |
| Window size | T = 8 frames |
| Tracks per batch | N = 768 |
| Training iterations | 50,000 |
| Hardware | 32 NVIDIA A100 80GB GPUs |
| Training time | ~40 hours |
| Feature scales | S = 4 |
| Correlation radius | Δ = 3 |

Generalization from synthetic to real: Despite being trained only on synthetic Kubric data, CoTracker generalizes well to real-world videos (TAP-Vid-DAVIS). This is because the core computation — matching features and reasoning about motion correlations — transfers across domains. The model learns physics-agnostic motion reasoning, not Kubric-specific patterns.
Why does the training loss use a discount factor γ = 0.8 that weights early transformer iterations less than later ones?

Chapter 7: Results

CoTracker was evaluated against TAP-Net, PIPs, PIPs++, MFT, OmniMotion, and TAPIR across multiple benchmarks. In the tables below it leads on every reported metric.

TAP-Vid-DAVIS (Real Videos, First Protocol)

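| Method | AJ ↑ | δ_avg^vis ↑ | OA ↑ |
|---|---|---|---|
| TAP-Net | 33.0 | 48.6 | 78.8 |
| PIPs | 42.2 | 64.8 | 77.7 |
| OmniMotion | 52.8 | 66.9 | 87.1 |
| TAPIR | 56.2 | 70.0 | 86.5 |
| CoTracker | 62.2 | 75.7 | 89.3 |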

CoTracker achieves 62.2 AJ, beating TAPIR by 6 points and OmniMotion by nearly 10 points. The gap is largest on AJ, which requires both accurate positions and correct occlusion prediction — exactly where joint tracking shines.

TAP-Vid-DAVIS (Strided Protocol)

| Method | AJ ↑ | δ_avg^vis ↑ | OA ↑ |
|---|---|---|---|
| TAP-Net | 38.4 | 53.1 | 82.3 |
| OmniMotion | 51.7 | 67.5 | 85.3 |
| TAPIR | 61.3 | 73.6 | 88.8 |
| CoTracker | 65.9 | 79.4 | 89.9 |

Key Ablations

What matters most? Ablations on TAP-Vid-DAVIS reveal:

| Configuration | AJ | Δ |
|---|---|---|
| Full CoTracker (joint + unrolled) | 60.4 | |
| No joint tracking (independent tracks) | 56.3 | −4.1 |
| No unrolled training (single window) | 56.0 | −4.4 |
| No support points | 57.8 | −2.6 |

Both innovations matter: Joint tracking and unrolled training each contribute ~4 points of AJ. They are complementary — joint tracking exploits spatial correlations between tracks, while unrolled training teaches the model to maintain tracks across window boundaries. Together, they are more than the sum of their parts.
Results Comparison

Average Jaccard (AJ) on TAP-Vid-DAVIS (First protocol). Higher is better.

According to ablations, which two design choices contribute the most to CoTracker's performance?

Chapter 8: Dense Tracking

So far, we've discussed tracking sparse sets of points — maybe a few hundred selected by the user. But CoTracker's architecture, especially with proxy tokens, scales to much larger point sets. This enables quasi-dense tracking.

Regular Grid Tracking

Instead of tracking user-selected points, lay down a dense regular grid across the first frame — say, one point every 4 pixels. On a 256×256 image, that's 64×64 = 4,096 points. With proxy tokens (K=4), CoTracker can track all of them jointly.
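Generating such a grid of query points takes a few lines. This is a small sketch: the function name `grid_queries` and the (x, y) column order are choices made for the example, not part of any CoTracker API.

```python
import numpy as np

def grid_queries(height, width, stride):
    """Return an (N, 2) array of (x, y) query points on a regular grid."""
    ys, xs = np.meshgrid(
        np.arange(0, height, stride),
        np.arange(0, width, stride),
        indexing="ij",
    )
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

queries = grid_queries(256, 256, 4)
print(queries.shape)  # (4096, 2): a 64 x 64 grid
```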

At inference, CoTracker has been demonstrated tracking up to 70,000 points simultaneously on a single GPU. This approaches the density of optical flow while maintaining the long-range tracking capability of point trackers.

Applications of Dense Tracking

From sparse to dense: The proxy token mechanism is what makes dense tracking possible. Without proxies, tracking N = 70K points would require O(N²) ≈ 5 billion attention operations. With K = 4 proxies, it's O(NK) ≈ 280K operations, a reduction of over 10,000×.
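The arithmetic behind these figures is easy to check. The counts below are pairwise attention scores per frame and ignore constant factors and the small K² term, so treat them as an order-of-magnitude estimate.

```python
n, k = 70_000, 4

full_pairs = n * n     # every track attends to every track
proxy_pairs = n * k    # every track attends to K proxies

print(f"{full_pairs:,}")           # 4,900,000,000
print(f"{proxy_pairs:,}")          # 280,000
print(full_pairs // proxy_pairs)   # 17500, i.e. N / K
```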

Dense grid tracking also reveals motion structure that sparse tracking misses. On the DAVIS benchmark, visualizing all tracked points as a displacement field shows the flow of every surface — not just the selected ones. This is reminiscent of optical flow but extends across the full video.

What makes it possible for CoTracker to scale from sparse tracking (~100 points) to quasi-dense tracking (~70K points) on a single GPU?

Chapter 9: Connections

CoTracker sits at the intersection of optical flow, point tracking, and transformer architectures. Let's map where it fits.

Relation to RAFT

CoTracker borrows heavily from RAFT: the iterative refinement paradigm, the correlation volumes computed at multiple scales, and the idea of making small corrections rather than large predictions. The key difference is that RAFT does dense, two-frame flow, while CoTracker does sparse, multi-frame tracking with cross-track attention.

Relation to TAPIR

TAPIR is CoTracker's closest competitor. It uses a two-stage approach: first, a global matching stage that finds coarse correspondences, then a PIPs-like refinement stage. TAPIR tracks points independently — no information sharing between tracks. CoTracker shows that adding cross-track attention on top of a similar architecture yields significant improvements, especially for occluded points.

Relation to CoTracker3 (2024)

The follow-up, CoTracker3, extends the ideas here in a key way: it is trained on real-world data using pseudo-labels from teacher models, rather than only on synthetic Kubric data. This dramatically improves generalization. The architecture remains similar, but the training recipe is what changes.

Relation to VGGSfM

VGGSfM (Visual Geometry Grounded Structure from Motion) uses CoTracker's dense point tracks as input to a differentiable SfM pipeline. This demonstrates a key application: dense, accurate point tracks across video frames provide the correspondences that 3D reconstruction algorithms need.

Cheat Sheet

AspectCoTracker
InputVideo frames + query point locations + start times
Output2D tracks (x,y per frame) + visibility flags
Feature backboneCNN (trained end-to-end), S=4 scales
Core mechanismFactored time + track attention, iterated M times
Scaling trickK proxy tokens → O(NK) instead of O(N²)
Long-video handlingOverlapping sliding windows (recurrent)
Training dataTAP-Vid-Kubric (synthetic, 24-frame sequences)
Training strategyUnrolled optimization across windows
Key result62.2 AJ on TAP-Vid-DAVIS (vs 56.2 TAPIR)
Scale70K points jointly on single GPU
The broader lesson: When data has structure, exploit it. Points in a video are not statistically independent — they live on objects, share motion, and constrain each other. CoTracker's contribution is showing that a simple change — adding cross-track attention — unlocks this structure and yields large, measurable improvements.
What is the fundamental difference between CoTracker and TAPIR?