Track thousands of 2D points jointly across long video sequences. Points move together — exploit their correlations to handle occlusions, track through disappearances, and scale to 70K points on a single GPU.
You are watching a video of a car driving through a city. You want to track 100 points on the car's surface as it moves, turns, and occasionally disappears behind a lamppost. Each point has a 2D position in each frame, and you need to predict where it goes next.
Existing methods — TAP-Net, TAPIR, PIPs — track each point independently. They take one point, match it across frames, and produce a trajectory. Then they repeat for the next point. Each track knows nothing about the others.
This is deeply wasteful. If 50 of those points sit on the car's hood, they all move together when the car turns. If one point gets occluded behind a lamppost, the other 49 on the same surface strongly constrain where it went. Independent tracking throws away all of this information.
The simulation below shows this failure. On the left, points are tracked independently — when one is occluded, it drifts away. On the right, points share information — the occluded point stays locked to its neighbors.
Simulation: as the occluder (gray bar) passes over the surface, independently tracked points drift while jointly tracked points recover.
CoTracker's core idea is deceptively simple: track points together, not apart.
If you track N points across T frames, you have an N×T grid of positions. Independent trackers fill in each row of this grid separately. CoTracker fills in the entire grid at once, using a transformer that attends across both the time dimension (how does this point move over time?) and the track dimension (how do different points relate to each other?).
Why does joint tracking help? Three reasons:
Before diving into CoTracker's architecture, let's define the problem precisely. The TAP (Tracking Any Point) benchmark introduced a clean formulation:
A track is a sequence of 2D positions (x_t, y_t) for a single physical point across video frames t = t_i, ..., T. The track starts at time t_i when the point first appears (or is queried), and the tracker must predict its position in all subsequent frames.
Crucially, each track also has a visibility flag v_t ∈ {0, 1}. The point is visible (v_t = 1) when it can be seen in the image, and occluded (v_t = 0) when something blocks it or when it leaves the camera's field of view entirely. The tracker must predict both position and visibility.
Performance is measured by three metrics:
| Metric | What it measures |
|---|---|
| δ^vis_avg | Fraction of visible points tracked within 1, 2, 4, 8, and 16 pixels, averaged over these thresholds. Pure position accuracy. |
| OA (Occlusion Accuracy) | Binary accuracy of visibility prediction. Can the tracker tell when a point is occluded? |
| AJ (Average Jaccard) | Joint metric combining position accuracy and occlusion prediction. The hardest metric — you need both right. |
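The first two metrics in the table can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's official evaluation code: the function names, the array shapes, and the strict `<` comparison are assumptions made here, and AJ is left out because it needs more bookkeeping.

```python
import numpy as np

THRESHOLDS_PX = (1, 2, 4, 8, 16)

def delta_avg_vis(pred_xy, gt_xy, gt_visible):
    """Average over thresholds of the fraction of visible points within each threshold.

    pred_xy, gt_xy: arrays of shape (..., 2); gt_visible: boolean array of shape (...).
    Assumes at least one visible point.
    """
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # per-point pixel error
    err_visible = err[gt_visible.astype(bool)]       # occluded points are ignored
    fractions = [(err_visible < t).mean() for t in THRESHOLDS_PX]
    return float(np.mean(fractions))

def occlusion_accuracy(pred_visible, gt_visible):
    """Binary accuracy of the predicted visibility flags."""
    return float((pred_visible.astype(bool) == gt_visible.astype(bool)).mean())
```

For example, three visible points with errors of 0.5, 3, and 20 pixels pass the five thresholds at rates 1/3, 1/3, 2/3, 2/3, and 2/3, so the score is 8/15.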
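TAP-Vid uses two protocols: in the "first" protocol, each point is queried at the first frame where it is visible and tracked forward; in the "strided" protocol, queries are sampled every five frames and tracked in both directions.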
Videos can be thousands of frames long. Processing all frames at once with a transformer is impractical because the memory cost of attention scales quadratically with sequence length. CoTracker's solution: process the video in overlapping sliding windows.
Say the video has T' frames and the window size is T = 8. CoTracker splits the video into J = ⌈2T'/T − 1⌉ windows, each of length T, with an overlap of T/2 frames.
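The window layout can be checked with a short sketch. The function name `window_schedule` is hypothetical, and clipping the last window to the end of the video is an assumption made here for simplicity (an implementation could pad instead).

```python
import math

def window_schedule(num_frames, window=8):
    """Return (start, end) frame ranges, end exclusive, with stride window // 2."""
    if window % 2 != 0 or num_frames < window:
        raise ValueError("sketch assumes an even window no longer than the video")
    stride = window // 2
    num_windows = math.ceil(2 * num_frames / window - 1)   # J = ceil(2T'/T - 1)
    return [
        (j * stride, min(j * stride + window, num_frames))  # clip the final window
        for j in range(num_windows)
    ]

# window_schedule(24) -> [(0, 8), (4, 12), (8, 16), (12, 20), (16, 24)]
```

With T' = 24 and T = 8 this gives J = 5 windows, and each window shares its first four frames with the previous one.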
This sliding-window scheme is essentially a recurrent network. Each window's output initializes the next window's input. Information propagates forward through the video one window at a time, allowing tracks to persist for arbitrarily long sequences.
The T/2 frame overlap ensures continuity. Without overlap, the tracks would "reset" at each window boundary — there would be no way to transfer information from window j to window j+1. The overlap provides the bridge: the refined positions from the end of window j become the initialization for the start of window j+1.
A subtle detail: while position estimates P carry forward between windows, the track feature vectors Q are re-initialized from the original query features at each window. This prevents feature drift — the appearance template stays anchored to what the point originally looked like.
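The hand-off between windows described above might look like the following sketch. The function name, the array shapes, and the choice to fill the not-yet-seen frames by repeating the last refined position are assumptions for illustration; only the carry-over of positions and the reset of features come from the text.

```python
import numpy as np

def init_next_window(prev_positions, query_features):
    """Build the initial state of window j+1 from the output of window j.

    prev_positions: (T, N, 2) refined positions from window j, with T even.
    query_features: (N, D) features sampled at the original query points.
    Returns positions (T, N, 2) and features (T, N, D) for window j+1.
    """
    T = prev_positions.shape[0]
    half = T // 2
    positions = np.empty_like(prev_positions)
    positions[:half] = prev_positions[half:]   # overlapping frames: reuse refined estimates
    positions[half:] = prev_positions[-1]      # new frames: repeat last estimate (assumption)
    # Features are reset from the original query instead of being carried over.
    features = np.broadcast_to(query_features, (T,) + query_features.shape).copy()
    return positions, features
```

Resetting the features while keeping the positions is what separates "where the point is now" from "what the point looked like when it was queried".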
This is the core of CoTracker. The architecture is a transformer that operates on a 2D grid of tokens — one dimension is time, the other is tracks — and iteratively refines track estimates.
A CNN extracts dense feature maps φ(I_t) from each video frame I_t. These are computed once and shared across all tracks. The features are computed at S = 4 scales so that matching can happen at different resolutions.
For each track i and time t, CoTracker constructs a token G^i_t by concatenating:
Naively attending over all N×T tokens costs O(N²T²), which is far too expensive. CoTracker factorizes the attention into two alternating operations:
| Attention Type | What it does | Cost |
|---|---|---|
| Time attention | Each track attends to itself across all T frames. "How has this point moved over time?" | O(T²) per track |
| Track attention | At each time step, all N tracks attend to each other. "How do different points relate to each other right now?" | O(N²) per frame |
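These two operations interleave: time attention, then track attention, then time attention, and so on. Each attention call then costs O(T²) or O(N²), so the per-slice cost is O(N² + T²) instead of O(N²T²) for full attention. Summed over the whole grid, that is O(NT² + N²T).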
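A minimal NumPy sketch of one time-then-track attention pair follows. It is a simplification under stated assumptions: single-head attention with identity projections (no learned weights), residual connections, and hypothetical function names. The helper at the end counts attention scores over the whole grid for both schemes.

```python
import numpy as np

def self_attention(x):
    """x: (batch, length, dim). Single-head attention with identity projections."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (batch, length, length)
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_block(tokens):
    """tokens: (N, T, D) grid of track tokens. Returns the same shape."""
    tokens = tokens + self_attention(tokens)             # time attention: batch over N tracks
    per_frame = tokens.transpose(1, 0, 2)                # (T, N, D)
    per_frame = per_frame + self_attention(per_frame)    # track attention: batch over T frames
    return per_frame.transpose(1, 0, 2)

def attention_score_counts(n_tracks, n_frames):
    """Scores computed over the whole grid: (factorized, full)."""
    factorized = n_tracks * n_frames**2 + n_frames * n_tracks**2
    full = (n_tracks * n_frames) ** 2
    return factorized, full
```

For N = 768 tracks and T = 8 frames, `attention_score_counts(768, 8)` compares roughly 4.8 million scores for the factorized pair against about 37.7 million for full attention.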
The transformer is applied M times. Each application produces small updates ΔP̂ (position correction) and ΔQ (feature update). The estimates are refined additively: P̂^(m+1) = P̂^(m) + ΔP̂ and Q^(m+1) = Q^(m) + ΔQ.
Visibility is predicted only once, after the final iteration, as v̂ = σ(W Q^(M)). The intuition: you need an accurate position estimate before you can reliably decide if a point is visible.
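The refinement loop can be sketched as below. Here `update_fn` is a hypothetical stand-in for the transformer, `vis_weights` plays the role of W, and the default of four iterations is an arbitrary choice for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def refine_tracks(positions, features, update_fn, vis_weights, num_iters=4):
    """Iteratively refine tracks, then predict visibility once.

    positions: (T, N, 2); features: (T, N, D); vis_weights: (D,).
    update_fn maps (positions, features) to (d_positions, d_features).
    """
    for _ in range(num_iters):
        d_pos, d_feat = update_fn(positions, features)
        positions = positions + d_pos   # additive position correction
        features = features + d_feat    # additive feature update
    visibility = sigmoid(features @ vis_weights)   # computed after the last iteration only
    return positions, features, visibility
```

With a toy `update_fn` that moves each estimate halfway toward its target, the remaining error shrinks by half per iteration, which shows why several small corrections can replace one large prediction.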
Animated demonstration: points on a surface tracked in independent and joint modes. When a point is occluded, joint tracking uses neighbor information to maintain it.
Factorized attention reduces the per-slice cost from O(N²T²) to O(N² + T²). But there's still that N² term. If you want to track N = 70,000 points (quasi-dense tracking), O(N²) means about 4.9 billion pairwise attention scores per frame in every track-attention layer. That doesn't fit on a GPU.
The solution: proxy tokens. Instead of every track attending to every other track (O(N²)), introduce K proxy tokens where K ≪ N, and have tracks attend to proxies instead of each other.
Proxy tokens are K learned, fixed tokens (like "virtual tracks") that are concatenated to the list of real tracks at the transformer input.
Since K is fixed (and small), the cost of track attention becomes O(NK), which is linear in N. You can now scale to 70K points on a single GPU.
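One way to route track-to-track communication through K proxies is two cross-attention steps: proxies read from the tracks, then tracks read from the proxies. The sketch below assumes that ordering, uses identity projections, and uses hypothetical function names; it is meant to show where the O(NK) cost comes from, not to reproduce the exact layer.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(queries, keys_values):
    """queries: (A, D), keys_values: (B, D) -> (A, D). Builds an A x B score matrix."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

def proxy_track_attention(tracks, proxies):
    """tracks: (N, D) tokens at one frame; proxies: (K, D) learned tokens."""
    proxies = proxies + cross_attention(proxies, tracks)   # K x N scores: summarize the tracks
    tracks = tracks + cross_attention(tracks, proxies)     # N x K scores: read the summary back
    return tracks, proxies

def track_attention_cost(n_tracks, n_proxies):
    """Attention scores per frame: (full track attention, proxy attention)."""
    return n_tracks * n_tracks, 2 * n_tracks * n_proxies
```

For N = 70,000 and an illustrative K = 64, `track_attention_cost` returns 4.9 billion scores for full track attention and under 9 million for the proxy version.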
Left: full track attention (every track attends to every other, O(N²)). Right: proxy attention (tracks attend to K proxies, O(NK)).
CoTracker is trained on TAP-Vid-Kubric, a synthetic dataset generated by the Kubric engine. It consists of sequences showing 3D rigid objects falling under gravity and bouncing, with ground-truth point tracks.
Real-world point tracking annotations are extremely expensive. You need to label the same physical point across hundreds of frames, even through occlusions. Synthetic data gives you:
Because CoTracker processes videos with sliding windows (like a recurrent network), it is trained with unrolled optimization — backpropagating through multiple windows to teach the model to maintain tracks across window boundaries.
The training loss sums over all transformer iterations m and all windows j:
The discount factor γ = 0.8 weights later iterations more heavily: early refinement steps are exploratory, while late ones should be accurate. A second loss, ℒ₂, is the cross-entropy on the visibility predictions.
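The two losses can be sketched as follows. The weight γ^(M−m) for iteration m = 1..M is an assumption chosen so that later iterations count more, as the text describes; the Euclidean norm, the averaging over frames and tracks, and the function names are likewise assumptions.

```python
import numpy as np

def track_loss(pred_positions, gt_positions, gamma=0.8):
    """Discounted position loss summed over windows and iterations.

    pred_positions: (J, M, T, N, 2) estimates for J windows and M iterations.
    gt_positions:   (J, T, N, 2) ground-truth positions per window.
    """
    M = pred_positions.shape[1]
    weights = gamma ** (M - np.arange(1, M + 1))   # final iteration has weight 1
    err = np.linalg.norm(pred_positions - gt_positions[:, None], axis=-1)   # (J, M, T, N)
    per_iter = err.mean(axis=(2, 3))               # (J, M), averaged over frames and tracks
    return float((per_iter * weights).sum())

def visibility_loss(pred_prob, gt_visible, eps=1e-7):
    """Binary cross-entropy between predicted visibility and ground truth."""
    p = np.clip(pred_prob, eps, 1 - eps)
    return float(-(gt_visible * np.log(p) + (1 - gt_visible) * np.log(1 - p)).mean())
```

With M = 2 and errors of 5 and 1 pixels at the first and second iteration, the position loss is 0.8 · 5 + 1 · 1 = 5.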
| Parameter | Value |
|---|---|
| Training set | TAP-Vid-Kubric (6,000 sequences) |
| Sequence length | T' = 24 frames |
| Window size | T = 8 frames |
| Tracks per batch | N = 768 |
| Training iterations | 50,000 |
| Hardware | 32 NVIDIA A100 80GB GPUs |
| Training time | ~40 hours |
| Feature scales | S = 4 |
| Correlation radius | Δ = 3 |
CoTracker was evaluated against TAP-Net, PIPs, PIPs++, MFT, OmniMotion, and TAPIR across multiple benchmarks. In the two comparison tables below, it scores highest on every metric.
| Method | AJ ↑ | δ^vis_avg ↑ | OA ↑ |
|---|---|---|---|
| TAP-Net | 33.0 | 48.6 | 78.8 |
| PIPs | 42.2 | 64.8 | 77.7 |
| OmniMotion | 52.8 | 66.9 | 87.1 |
| TAPIR | 56.2 | 70.0 | 86.5 |
| CoTracker | 62.2 | 75.7 | 89.3 |
CoTracker achieves 62.2 AJ, beating TAPIR by 6.0 points and OmniMotion by 9.4 points. The gap is largest on AJ, which requires both accurate positions and correct occlusion prediction, exactly where joint tracking shines.
| Method | AJ ↑ | δ^vis_avg ↑ | OA ↑ |
|---|---|---|---|
| TAP-Net | 38.4 | 53.1 | 82.3 |
| OmniMotion | 51.7 | 67.5 | 85.3 |
| TAPIR | 61.3 | 73.6 | 88.8 |
| CoTracker | 65.9 | 79.4 | 89.9 |
What matters most? Ablations on TAP-Vid-DAVIS reveal:
| Configuration | AJ | Δ AJ vs. full |
|---|---|---|
| Full CoTracker (joint + unrolled) | 60.4 | — |
| No joint tracking (independent tracks) | 56.3 | −4.1 |
| No unrolled training (single window) | 56.0 | −4.4 |
| No support points | 57.8 | −2.6 |
Average Jaccard (AJ) on TAP-Vid-DAVIS (First protocol). Higher is better.
So far, we've discussed tracking sparse sets of points — maybe a few hundred selected by the user. But CoTracker's architecture, especially with proxy tokens, scales to much larger point sets. This enables quasi-dense tracking.
Instead of tracking user-selected points, lay down a dense regular grid across the first frame — say, one point every 4 pixels. On a 256×256 image, that's 64×64 = 4,096 points. With proxy tokens (K=4), CoTracker can track all of them jointly.
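Generating such a grid of query points takes a few lines. The function name and the choice to start the grid at pixel (0, 0) are assumptions for illustration.

```python
import numpy as np

def grid_queries(height, width, stride=4):
    """Return a (num_points, 2) array of (x, y) query positions on a regular grid."""
    ys = np.arange(0, height, stride)
    xs = np.arange(0, width, stride)
    grid_x, grid_y = np.meshgrid(xs, ys)   # each of shape (len(ys), len(xs))
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=-1)

queries = grid_queries(256, 256)
print(queries.shape)   # (4096, 2)
```

Halving the stride to 2 quadruples the point count to 16,384, which is where the linear cost of proxy attention starts to matter.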
At inference, CoTracker has been demonstrated tracking up to 70,000 points simultaneously on a single GPU. This approaches the density of optical flow while maintaining the long-range tracking capability of point trackers.
Dense grid tracking also reveals motion structure that sparse tracking misses. On the DAVIS benchmark, visualizing all tracked points as a displacement field shows the flow of every surface — not just the selected ones. This is reminiscent of optical flow but extends across the full video.
CoTracker sits at the intersection of optical flow, point tracking, and transformer architectures. Let's map where it fits.
CoTracker borrows heavily from RAFT: the iterative refinement paradigm, the correlation volumes computed at multiple scales, and the idea of making small corrections rather than large predictions. The key difference is that RAFT does dense, two-frame flow, while CoTracker does sparse, multi-frame tracking with cross-track attention.
TAPIR is CoTracker's closest competitor. It uses a two-stage approach: first, a global matching stage that finds coarse correspondences, then a PIPs-like refinement stage. TAPIR tracks points independently — no information sharing between tracks. CoTracker shows that adding cross-track attention on top of a similar architecture yields significant improvements, especially for occluded points.
The follow-up, CoTracker3, extends the ideas here in a key way: it is trained on real-world data using pseudo-labels from teacher models, rather than only on synthetic Kubric data. This dramatically improves generalization. The architecture remains similar, but the training recipe is what changes.
VGGSfM (Visual Geometry Grounded Structure from Motion) uses CoTracker's dense point tracks as input to a differentiable SfM pipeline. This demonstrates a key application: dense, accurate point tracks across video frames provide the correspondences that 3D reconstruction algorithms need.
| Aspect | CoTracker |
|---|---|
| Input | Video frames + query point locations + start times |
| Output | 2D tracks (x,y per frame) + visibility flags |
| Feature backbone | CNN (trained end-to-end), S=4 scales |
| Core mechanism | Factorized time + track attention, iterated M times |
| Scaling trick | K proxy tokens → O(NK) instead of O(N²) |
| Long-video handling | Overlapping sliding windows (recurrent) |
| Training data | TAP-Vid-Kubric (synthetic, 24-frame sequences) |
| Training strategy | Unrolled optimization across windows |
| Key result | 62.2 AJ on TAP-Vid-DAVIS (vs 56.2 TAPIR) |
| Scale | 70K points jointly on single GPU |