The sim-to-real gap cripples point trackers trained on synthetic data. CoTracker3 closes it: use imperfect trackers to pseudo-label real videos, filter with cycle consistency, and train a student that outperforms every teacher — using 1,000× less data than BootsTAPIR.
You want to track a point — say, a fingertip, a ball, a corner of a box — across every frame of a video. Not bounding-box tracking, not optical flow between adjacent frames. Point tracking: given a single pixel in one frame, find exactly where it lands in every subsequent frame, even through occlusions and fast motion.
Modern point trackers (TAPIR, CoTracker, LocoTrack) are powerful, but they share a dirty secret: they are almost entirely trained on synthetic data. Specifically, on Kubric — a dataset of rendered 3D objects bouncing around in simulated scenes with perfect ground-truth tracks.
Why synthetic? Because annotating real videos with pixel-accurate point tracks is essentially impossible at scale. You would need a human to mark the exact pixel location of a point in every single frame. For a 250-frame video tracking 100 points, that is 25,000 annotations — per video.
The simulation below shows this concretely. On the left, a synthetic scene with clean geometric shapes: the tracker nails every point. On the right, a noisy real-world-like scene: the same tracker drifts and loses points.
Watch points being tracked across frames. Left: clean synthetic scene (Kubric-like). Right: noisy real-world scene. The tracker trained on synthetic data struggles with real-world noise, texture, and occlusion.
BootsTAPIR tried to fix this by training on 15 million unlabelled real videos with a complex self-training protocol involving augmentations, EMA updates, and loss masks. It helped, but the recipe was expensive and over-engineered.
CoTracker3 asks: can we do this much more simply?
CoTracker3's core idea is beautifully simple: use imperfect trackers to label real videos, then train a student on those noisy labels.
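Wait: if the teacher trackers are imperfect, won't the labels be wrong? Won't the student learn the teachers' mistakes? Yes, some labels will be wrong. But several forces conspire to make the student better than any individual teacher: the teachers make different mistakes, so sampling among them varies the noise; cycle consistency filters out the worst labels; and the student trains on real video, which none of the Kubric-only teachers has seen.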
The teachers are CoTracker3 online, CoTracker3 offline, CoTracker, and TAPIR — all trained only on Kubric. During training, a random teacher is sampled for each batch. Over multiple epochs, the same video gets pseudo-labeled by different teachers, providing diverse training signal.
How do you know if a pseudo-label is good or bad? You don't have ground truth — that's the whole point. But you have a powerful geometric check: cycle consistency.
The idea is simple. Track a point forward through the video, then track it backward from where it ended up. If you land back near where you started, the track was probably correct. If you end up far away, something went wrong.
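Formally, given a query point Q = (t_q, x_q, y_q), the test compares (x_q, y_q) with the position obtained by tracking Q forward and then backward to frame t_q. The track is kept only if that distance falls below a threshold.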
A point is tracked forward N frames (teal path), then backward from its endpoint (warm path). The gap between start and return determines if the track is accepted. Adjust the threshold to filter.
In practice, cycle consistency is applied per-point, per-frame. A track that is cycle-consistent for most of its duration but drifts at frame 50 can still contribute clean labels for frames 1–49. The filter is granular, not all-or-nothing.
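The per-point, per-frame filter can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the 3-pixel threshold, and the convention that both tracks are indexed by the same frame numbers are all assumptions made for the example.

```python
import math

def cycle_consistency_mask(forward_track, backward_track, threshold_px=3.0):
    """Return one keep/discard flag per frame for a single point.

    forward_track[t]  : (x, y) predicted when tracking forward from the query.
    backward_track[t] : (x, y) predicted when re-tracking backward from the
                        forward endpoint, indexed by the same frame t.
    threshold_px      : hypothetical acceptance radius in pixels.
    """
    mask = []
    for (fx, fy), (bx, by) in zip(forward_track, backward_track):
        gap = math.hypot(fx - bx, fy - by)  # forward-backward disagreement
        mask.append(gap <= threshold_px)
    return mask

# Toy track: the two passes disagree by 5 px at frame 2 only.
forward = [(10.0, 10.0), (12.0, 11.0), (14.0, 12.0), (16.0, 13.0)]
backward = [(10.4, 10.3), (12.0, 11.0), (19.0, 12.0), (16.0, 13.0)]
mask = cycle_consistency_mask(forward, backward)
print(mask)  # [True, True, False, True]
```

Only the frame where the two passes disagree is dropped; the other frames of the same track still supply labels, which is the granular behavior described above.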
Now let's see the full pipeline in action. This is the heart of CoTracker3's training recipe — and it is remarkably simple compared to BootsTAPIR's complex protocol.
Gather a dataset of unlabelled real videos. CoTracker3 collects around 100,000 internet-like videos (30 seconds each) featuring diverse scenes with humans, animals, and dynamic objects; the headline results use 15,000 of them. No annotations needed.
For each video, run SIFT on randomly selected frames to find good-to-track keypoints. SIFT detects distinctive corners and edges — points that are likely to produce reliable tracks. If SIFT can't find enough points on a frame, skip that video entirely.
Randomly sample one of four teacher models (CoTracker3 online, CoTracker3 offline, CoTracker, TAPIR) and run it on the video. The teacher produces tracks for all query points. Over multiple epochs, different teachers label the same video, providing diverse pseudo-labels.
Run the forward-backward test on each pseudo-track. Discard tracks with large cycle gaps. This removes the worst teacher mistakes and keeps the reliable tracks.
Train the student CoTracker3 on the surviving pseudo-labels using the same Huber loss as for synthetic data. Freeze the confidence and visibility heads (since pseudo-labels don't have reliable visibility annotations) and add a separate head for tracks only.
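The labeling steps above can be sketched as one function. Everything here is a stand-in: `pseudo_label_batch`, `detect_keypoints`, `passes_cycle_check`, `min_points`, and the toy teacher are hypothetical names chosen for illustration, and the real pipeline runs SIFT on sampled frames and neural trackers rather than these stubs.

```python
import random

def pseudo_label_batch(videos, teachers, detect_keypoints, passes_cycle_check,
                       min_points=2, rng=random):
    """Label one batch of videos with a single randomly chosen teacher."""
    teacher_name = rng.choice(sorted(teachers))  # one random teacher per batch
    teacher = teachers[teacher_name]
    labeled = []
    for video in videos:
        queries = detect_keypoints(video)
        if len(queries) < min_points:
            continue  # too few good-to-track points: skip the whole video
        tracks = teacher(video, queries)
        kept = [t for t in tracks if passes_cycle_check(t)]
        if kept:
            labeled.append((video, kept))
    return teacher_name, labeled

# Toy stand-ins so the sketch runs end to end.
keypoints = {"clip_a": [(0, 5, 5), (0, 9, 9), (0, 20, 3)], "clip_b": [(0, 1, 1)]}

def toy_teacher(video, queries):
    # Pretend the i-th query comes back with a cycle gap of i pixels.
    return [{"query": q, "cycle_gap": float(i)} for i, q in enumerate(queries)]

teachers = {name: toy_teacher for name in
            ("cotracker3_online", "cotracker3_offline", "cotracker", "tapir")}

teacher_name, labeled = pseudo_label_batch(
    videos=["clip_a", "clip_b"],
    teachers=teachers,
    detect_keypoints=lambda video: keypoints[video],
    passes_cycle_check=lambda track: track["cycle_gap"] <= 1.0,
    rng=random.Random(0),
)
```

In the toy run, `clip_b` is skipped for having a single keypoint, and `clip_a` keeps two of its three tracks. Calling the function again on a later epoch may pick a different teacher for the same videos.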
Watch the full pipeline animate: teacher trackers produce pseudo-labels on real videos, cycle consistency filters bad tracks, and the surviving labels train a student that surpasses every teacher.
CoTracker3 isn't just a new training recipe — it's also a leaner architecture. The model strips away components that previous trackers treated as essential, and replaces complex modules with simpler alternatives.
For each frame, extract convolutional features at 4 scales. Compute 4D correlations between query features and track-estimate features. Feed correlations, displacement embeddings, confidence, and visibility into a transformer with factorized time attention and cross-track group attention. The transformer outputs incremental updates. Repeat M times, resampling features at each iteration.
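To make the "4D correlation" step concrete, here is a minimal sketch under one reading of it: take a small grid of feature vectors around the query point and another around the current track estimate, then compute the dot product for every pair of grid cells. The patch size, the nested-list layout, and the function name `correlation_4d` are assumptions for illustration; the actual model works on batched tensors at several scales.

```python
def correlation_4d(query_patch, estimate_patch):
    """corr[i][j][k][l] = <query_patch[i][j], estimate_patch[k][l]>.

    Each patch is a nested list [row][col] -> feature vector (list of floats).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return [[[[dot(q, e) for e in est_row]
              for est_row in estimate_patch]
             for q in q_row]
            for q_row in query_patch]

# Two 2x2 patches of 2-dimensional features.
query_patch = [[[1, 0], [0, 1]],
               [[1, 1], [2, 0]]]
estimate_patch = [[[1, 0], [0, 2]],
                  [[3, 1], [0, 0]]]

corr = correlation_4d(query_patch, estimate_patch)

# Flattened, this gives 2*2*2*2 = 16 values that a small MLP could consume.
flat = [v for a in corr for b in a for c in b for v in c]
```

The flattened vector is what a simple MLP would consume, in place of a hand-designed correlation module.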
| Component | CoTracker | LocoTrack | CoTracker3 |
|---|---|---|---|
| Global matching | No | Yes | No |
| Correlation processing | N/A | Ad-hoc module | Simple MLP |
| Cross-track attention | Yes | No | Yes |
| Visibility prediction | Separate net | Separate net | Joint iterative |
| Parameters | 45M | 25M | 25M |
| Speed (μs/frame/point) | 472 | 290 | 209 |
CoTracker3's training happens in two stages. The first uses synthetic data with perfect labels. The second adds real data with pseudo-labels. Let's walk through both.
The model starts by training on Kubric, the same synthetic dataset used by all modern point trackers. Kubric provides perfect ground-truth tracks, perfect visibility labels, and perfect confidence targets. This gives the model a solid foundation in the mechanics of tracking: understanding motion, correlation, and iterative refinement.
The pre-trained model is then fine-tuned on real videos with pseudo-labels from the teacher pipeline. But there's a subtle trick: the confidence and visibility heads are frozen during this stage.
This split-head strategy avoids catastrophic forgetting of visibility and confidence predictions while still improving track accuracy on real data. The ablation confirms it: freezing the heads improves AJ by 0.8 and OA by 3.9 on TAP-Vid.
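The online and offline variants share the same architecture but differ in how they see videos during training: the online model processes video as a stream, while the offline model sees the full video at once.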
Let's pin down the specifics of CoTracker3's training recipe.
| Source | Type | Size | Purpose |
|---|---|---|---|
| Kubric | Synthetic | Standard | Stage 1 pre-training with GT labels |
| Internet videos | Real, unlabelled | ~100k videos, 30s each | Stage 2 pseudo-label fine-tuning |
The real videos feature diverse scenes — primarily humans and animals, the kinds of dynamic content where tracking matters most.
CoTracker3 uses SIFT keypoints to select which points to track. The intuition: SIFT detects corners and edges that are distinctive and "good to track" — exactly the kind of points where pseudo-labels are most reliable. If SIFT fails to find enough keypoints on any selected frame, the entire video is skipped, maintaining training data quality.
The ablation shows SIFT is slightly better than SuperPoint, DISK, or uniform random sampling, though all perform comparably. The robustness to this choice is reassuring.
Three losses, all applied at each of the M iterative updates, with weights that grow exponentially toward the final update (γ = 0.8):
The Huber loss (threshold = 6) on track positions, with occluded points weighted at 1/5 of visible points. Later iterations matter more because earlier ones are discounted by powers of γ.
The confidence loss: BCE between predicted confidence and an indicator of whether the prediction is within 12 pixels of ground truth.
The visibility loss: BCE between predicted and ground-truth visibility.
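The three losses can be sketched for a single point in a single frame. Several details are assumptions made for the example: the weight of iteration m is taken as γ^(M−m), the Huber loss is applied to the Euclidean error, the same weighting is used for all three terms, and the function and argument names are hypothetical.

```python
import math

def huber(r, delta=6.0):
    """Huber loss on a non-negative error r: quadratic up to delta, then linear."""
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def bce(p, target, eps=1e-7):
    """Binary cross-entropy for one probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def point_losses(preds, confs, vis_probs, gt_xy, gt_visible, gamma=0.8):
    """preds, confs, vis_probs hold one entry per iterative update (M in total)."""
    M = len(preds)
    track = conf = vis = 0.0
    for m, ((x, y), c, v) in enumerate(zip(preds, confs, vis_probs), start=1):
        w = gamma ** (M - m)                 # later updates weigh more
        err = math.hypot(x - gt_xy[0], y - gt_xy[1])
        occ_w = 1.0 if gt_visible else 0.2   # occluded points count 1/5
        track += w * occ_w * huber(err)
        conf += w * bce(c, 1.0 if err < 12.0 else 0.0)  # within 12 px?
        vis += w * bce(v, 1.0 if gt_visible else 0.0)
    return track, conf, vis

# Three updates converging on the ground truth at (0, 0).
track, conf, vis = point_losses(
    preds=[(20.0, 0.0), (3.0, 4.0), (0.0, 0.0)],
    confs=[0.2, 0.8, 0.9],
    vis_probs=[0.6, 0.8, 0.9],
    gt_xy=(0.0, 0.0),
    gt_visible=True,
)
```

With errors of 20, 5, and 0 pixels, the track term works out to 0.64 × 102 + 0.8 × 12.5 + 1 × 0 = 75.28. The first update's large error is discounted most.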
All four teachers are trained only on Kubric synthetic data. They are frozen during student training: no EMA, no joint optimization.
The ablation in Table 5 shows that removing any teacher hurts. Even weaker teachers like CoTracker contribute complementary knowledge that the student can extract.
CoTracker3 is evaluated on the TAP-Vid benchmark suite (Kinetics, DAVIS, RGB-Stacking) and Dynamic Replica. The results are striking.
| Method | Training Data | AJ ↑ | δ_avg^vis ↑ | OA ↑ |
|---|---|---|---|---|
| TAPIR | Kubric | 56.2 | 70.0 | 86.5 |
| CoTracker | Kubric | 61.8 | 76.1 | 88.3 |
| LocoTrack | Kubric | 62.9 | 75.3 | 87.2 |
| BootsTAPIR | Kubric + 15M real | 61.4 | 73.6 | 88.7 |
| CoTracker3 online | Kubric + 15k real | 63.8 | 76.3 | 90.2 |
| CoTracker3 offline | Kubric + 15k real | 64.4 | 76.9 | 91.2 |
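For readers unfamiliar with the position-accuracy column, here is a minimal sketch of how such a score is commonly computed on TAP-Vid: the fraction of visible points within a pixel threshold, averaged over thresholds of 1, 2, 4, 8, and 16 pixels. The function name, the strict inequality, and the flat list layout are assumptions for illustration; the official evaluation code should be treated as the reference.

```python
import math

def delta_avg_visible(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average over thresholds of the share of visible points within each one.

    pred, gt : lists of (x, y) positions, one entry per point-frame pair.
    visible  : list of booleans from the ground truth; occluded entries are ignored.
    """
    errors = [math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy), v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible points to evaluate")
    fractions = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return 100.0 * sum(fractions) / len(fractions)

pred = [(0.5, 0.0), (3.0, 0.0), (10.0, 0.0), (100.0, 0.0)]
gt = [(0.0, 0.0)] * 4
visible = [True, True, True, False]  # the 100 px miss is occluded, so ignored
score = delta_avg_visible(pred, gt, visible)
```

Here the three visible errors (0.5, 3, and 10 pixels) pass 1, 1, 2, 2, and 3 of the five thresholds respectively, giving a score of 60.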
This is where cross-track attention really shines. Dynamic Replica provides ground truth for occluded points:
| Method | δ_avg^vis ↑ | δ_avg^occ ↑ |
|---|---|---|
| LocoTrack | 71.4 | 29.8 |
| BootsTAPIR | 69.0 | 28.0 |
| CoTracker3 online | 72.9 | 41.0 |
| CoTracker3 offline | 69.8 | 41.8 |
CoTracker3 offline tracks occluded points with an accuracy of 41.8, which is 13.8 points above BootsTAPIR (28.0) and 12.0 above LocoTrack (29.8). The cross-track attention lets the model infer occluded positions from the motion of surrounding visible points.
Average Jaccard (AJ) on TAP-Vid benchmarks (mean of Kinetics, DAVIS, RGB-Stacking). Higher is better. Data amount shown in parentheses.
Perhaps the most remarkable result in CoTracker3 is how little data it needs. The scaling curve tells a clear story:
| Real Videos Used | CoTracker3 Online AJ | vs BootsTAPIR (15M videos) |
|---|---|---|
| 0 (Kubric only) | 62.2 | −0.6 |
| 100 | 63.0 | +0.2 |
| 1,000 | 63.5 | +0.7 |
| 5,000 | 63.8 | +1.0 |
| 15,000 | 64.0 | +1.2 |
| 100,000 | 64.0 | +1.2 (plateau) |
Why does AJ plateau beyond roughly 15,000 videos? The student eventually becomes better than every teacher on every type of scene in the dataset. At that point, the pseudo-labels add more noise than signal: the student is being taught by inferiors. This is a known phenomenon in knowledge distillation, where the student can outgrow the teacher.
The fix? Use the improved student as a new teacher and re-generate pseudo-labels. The paper shows that self-training (using CoTracker3's own predictions as annotations) provides an additional +1.2 AJ improvement. This bootstrapping could, in principle, be iterated.
The recipe also transfers to other trackers, because the pseudo-labeling pipeline is architecture-agnostic. When applied to LocoTrack and the original CoTracker, both improve significantly.
CoTracker3 sits at the intersection of several important research threads. Let's map the landscape.
CoTracker (Karaev et al., 2024) introduced joint point tracking with cross-track attention and virtual tracks. CoTracker3 inherits cross-track attention (crucial for occlusion handling) but strips away the sliding window architecture, removes the separate visibility network, and nearly halves the parameter count (45M to 25M). Think of CoTracker3 as CoTracker distilled to its essential components.
VGGSfM (Wang et al., 2024) uses point tracking as a component for Structure-from-Motion, validating tracks through 3D reconstruction. CoTracker3 can serve as the tracking backbone for VGGSfM and similar 3D reconstruction pipelines, providing more robust tracks on real-world video.
CoTracker3's pseudo-labeling pipeline is a form of semi-supervised learning closely related to the Noisy Student framework in image classification. The pattern is the same: train a teacher on labeled data, generate pseudo-labels on unlabeled data, train a student on the combined dataset. The student exceeds the teacher because it learns from more diverse (real) data.
The same principle drives progress in NLP. Large language models generate synthetic training data for smaller models (distillation), and self-play / self-improvement loops use a model's own outputs as training targets. The key enabler in all cases is a cheap quality filter — cycle consistency for tracking, perplexity/consistency checks for text.
BootsTAPIR (Doersch et al., 2024) pioneered using real videos for point tracker training, but with a heavy protocol. CoTracker3 shows that the complex machinery (EMA, augmentations, multiple loss masks) is unnecessary when you have (a) multiple diverse teachers, (b) cycle consistency filtering, and (c) a cleaner architecture.
| Aspect | CoTracker3 |
|---|---|
| Input | Video + query point (frame, x, y) |
| Output | Per-frame (x, y) track + visibility + confidence |
| Backbone | Convolutional feature extractor (4 scales) |
| Core mechanism | 4D correlation + transformer iterative refinement |
| Training data | Kubric (synthetic) + 15k real videos (pseudo-labeled) |
| Pseudo-label filter | Cycle consistency (forward-backward test) |
| Parameters | 25M (vs. 45M for CoTracker) |
| Key result | SOTA on TAP-Vid with 1,000× less data than BootsTAPIR |
| Variants | Online (streaming) + Offline (full video) |