Karaev, Makarov, Wang, Neverova, Vedaldi, Rupprecht — Meta AI + Oxford VGG, 2024

Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

The sim-to-real gap cripples point trackers trained on synthetic data. CoTracker3 closes it: use imperfect trackers to pseudo-label real videos, filter with cycle consistency, and train a student that outperforms every teacher — using 1,000× less data than BootsTAPIR.

Prerequisites: Point tracking basics + Transformers (attention) + Self-supervised / semi-supervised learning

Chapter 0: The Problem

You want to track a point — say, a fingertip, a ball, a corner of a box — across every frame of a video. Not bounding-box tracking, not optical flow between adjacent frames. Point tracking: given a single pixel in one frame, find exactly where it lands in every subsequent frame, even through occlusions and fast motion.

Modern point trackers (TAPIR, CoTracker, LocoTrack) are powerful, but they share a dirty secret: they are almost entirely trained on synthetic data. Specifically, on Kubric — a dataset of rendered 3D objects bouncing around in simulated scenes with perfect ground-truth tracks.

Why synthetic? Because annotating real videos with pixel-accurate point tracks is essentially impossible at scale. You would need a human to mark the exact pixel location of a point in every single frame. For a 250-frame video tracking 100 points, that is 25,000 annotations — per video.

The sim-to-real gap: Kubric videos have clean textures, simple lighting, rigid objects, and no motion blur. Real videos have reflections, deformable objects, chaotic backgrounds, and camera shake. A tracker trained only on Kubric learns a biased prior — it works well on synthetic benchmarks but degrades on real-world footage.

The simulation below shows this concretely. On the left, a synthetic scene with clean geometric shapes: the tracker nails every point. On the right, a noisy real-world-like scene: the same tracker drifts and loses points.

Synthetic vs Real Tracking

Watch points being tracked across frames. Left: clean synthetic scene (Kubric-like). Right: noisy real-world scene. The tracker trained on synthetic data struggles with real-world noise, texture, and occlusion.

BootsTAPIR tried to fix this by training on 15 million unlabelled real videos with a complex self-training protocol involving augmentations, EMA updates, and loss masks. It helped, but the recipe was expensive and over-engineered.

CoTracker3 asks: can we do this much more simply?

Why are modern point trackers primarily trained on synthetic data like Kubric?

Chapter 1: The Key Insight

CoTracker3's core idea is beautifully simple: use imperfect trackers to label real videos, then train a student on those noisy labels.

Wait — if the teacher trackers are imperfect, won't the labels be wrong? Won't the student learn the teachers' mistakes? Yes, some labels will be wrong. But several forces conspire to make the student better than any individual teacher:

Force 1: Real data distribution
Even noisy labels on real videos help the student learn the true distribution of real-world motion, textures, and lighting — reducing the sim-to-real gap.
Force 2: Teacher ensembling
Multiple teachers with complementary strengths pseudo-label the same videos. Online trackers stick to query points well. Offline trackers handle occlusions better. The student sees diverse perspectives.
Force 3: Noise averaging
Different teachers make different mistakes. Over many training samples, the errors cancel out while the correct signal accumulates. The student learns the consensus.
Force 4: Cycle consistency filtering
Bad tracks are detected and removed before training. Only high-quality pseudo-labels survive. This dramatically raises the signal-to-noise ratio.
The distillation insight: The student doesn't just imitate one teacher — it learns from a committee of teachers on real data, with bad labels filtered out. This is why pseudo-labeling + filtering can produce a student that surpasses every teacher individually. The same principle appears in knowledge distillation, boosting, and ensemble learning.

The teachers are CoTracker3 online, CoTracker3 offline, CoTracker, and TAPIR — all trained only on Kubric. During training, a random teacher is sampled for each batch. Over multiple epochs, the same video gets pseudo-labeled by different teachers, providing diverse training signal.
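The noise-averaging argument (Force 3) can be illustrated with a toy simulation. This is a sketch under strong assumptions, not the paper's setup: each teacher is modelled as the true position plus independent zero-mean Gaussian noise, the per-teacher noise levels in `TEACHER_SIGMA` are invented for illustration, and the "student" is simply the mean of the labels it sees when one random teacher is sampled per epoch.

```python
import random
import statistics

random.seed(0)

TRUE_X = 100.0  # true x-coordinate of a point (toy value)
# Hypothetical per-teacher noise levels in pixels (assumed, not from the paper).
TEACHER_SIGMA = {"cotracker3_online": 2.0, "cotracker3_offline": 2.0,
                 "cotracker": 3.0, "tapir": 4.0}

def pseudo_label(teacher: str) -> float:
    """One noisy label: truth plus independent zero-mean Gaussian error."""
    return random.gauss(TRUE_X, TEACHER_SIGMA[teacher])

def consensus(num_epochs: int) -> float:
    """Mean label after sampling one random teacher per epoch."""
    teachers = list(TEACHER_SIGMA)
    labels = [pseudo_label(random.choice(teachers)) for _ in range(num_epochs)]
    return statistics.fmean(labels)

def rms_error(num_epochs: int, trials: int = 2000) -> float:
    """Root-mean-square error of the consensus over many repeated trials."""
    squared = [(consensus(num_epochs) - TRUE_X) ** 2 for _ in range(trials)]
    return statistics.fmean(squared) ** 0.5

single = rms_error(1)
averaged = rms_error(16)
print(f"RMS error with 1 label: {single:.2f} px, with 16 labels: {averaged:.2f} px")
```

In this idealised setting the error shrinks roughly with the square root of the number of labels. Real teachers all trained on Kubric probably share some correlated errors, so the actual gain is likely smaller than this toy suggests.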

Why can a student tracker become better than its teacher trackers, even though it trains on the teachers' imperfect predictions?

Chapter 2: Cycle Consistency

How do you know if a pseudo-label is good or bad? You don't have ground truth — that's the whole point. But you have a powerful geometric check: cycle consistency.

The idea is simple. Track a point forward through the video, then track it backward from where it ended up. If you land back near where you started, the track was probably correct. If you end up far away, something went wrong.

The Forward-Backward Test

Given a query point $Q = (t_q, x_q, y_q)$ at frame $t_q$:

  1. Forward pass: Track $Q$ from frame $t_q$ to frame $T$, producing the track $P_{t_q}, P_{t_q+1}, \dots, P_T$.
  2. Backward pass: Take the endpoint $P_T = (T, x_T, y_T)$ and track it backward from frame $T$ to frame $t_q$, producing a return track $P'_T, P'_{T-1}, \dots, P'_{t_q}$.
  3. Check: Compare $P'_{t_q} = (t_q, x'_q, y'_q)$ with the original query point $Q$. If $\lVert P'_{t_q} - Q \rVert$ is small, the track is cycle-consistent. If the gap is large, reject the track.

$$\text{gap} = \lVert P'_{t_q} - Q \rVert_2 = \lVert (x'_q - x_q,\; y'_q - y_q) \rVert_2$$
Why does this work? If a tracker drifts to the wrong location (say, it locks onto a similar-looking but different point), the backward pass starts from that wrong location and will almost certainly not return to the original query. Two independent errors rarely cancel exactly, so a track that survives the round trip is much more likely to be correct.
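The forward-backward test can be sketched in a few lines. The `Tracker` interface below is hypothetical (a callable that takes a start point, a start frame and an end frame, and returns one position per visited frame); the two toy trackers exist only to exercise the check and are not real models.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
# Hypothetical tracker interface: (start_point, start_frame, end_frame) -> one
# (x, y) per frame from start_frame to end_frame inclusive, in visiting order.
Tracker = Callable[[Point, int, int], List[Point]]

def cycle_gap(tracker: Tracker, query: Point, t_q: int, t_end: int) -> float:
    """Track forward t_q -> t_end, then backward t_end -> t_q; return the gap."""
    forward = tracker(query, t_q, t_end)
    backward = tracker(forward[-1], t_end, t_q)
    x_ret, y_ret = backward[-1]
    return math.hypot(x_ret - query[0], y_ret - query[1])

def is_cycle_consistent(tracker: Tracker, query: Point, t_q: int,
                        t_end: int, threshold_px: float) -> bool:
    """Accept the track if the round trip lands within threshold_px of the query."""
    return cycle_gap(tracker, query, t_q, t_end) <= threshold_px

def _frames(t0: int, t1: int) -> range:
    step = 1 if t1 >= t0 else -1
    return range(t0, t1 + step, step)

# Toy scene: everything moves 1 px per frame to the right.
def exact_tracker(p: Point, t0: int, t1: int) -> List[Point]:
    return [(p[0] + (t - t0), p[1]) for t in _frames(t0, t1)]

# Toy failure mode: same motion, plus 0.1 px of vertical drift per frame tracked.
def drifting_tracker(p: Point, t0: int, t1: int) -> List[Point]:
    return [(p[0] + (t - t0), p[1] + 0.1 * abs(t - t0)) for t in _frames(t0, t1)]

good_gap = cycle_gap(exact_tracker, (5.0, 5.0), 0, 100)
bad_gap = cycle_gap(drifting_tracker, (5.0, 5.0), 0, 100)
print(good_gap, round(bad_gap, 3))
```

The drifting tracker accumulates error in both directions, so its round trip ends far from the query and it is rejected at any reasonable threshold.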

Try it below. Click to place a query point, then watch as it tracks forward and backward. Drag the threshold slider to see how strict filtering affects which tracks survive.

Cycle Consistency Check

A point is tracked forward N frames (teal path), then backward from its endpoint (warm path). The gap between start and return determines if the track is accepted. Adjust the threshold to filter.

In practice, cycle consistency is applied per-point, per-frame. A track that is cycle-consistent for most of its duration but drifts at frame 50 can still contribute clean labels for frames 1–49. The filter is granular, not all-or-nothing.
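A granular filter of this kind might look like the sketch below. It assumes a precomputed array `gaps[n, t]` holding the cycle gap for point `n` when the round trip turns around at frame `t`; how those per-frame gaps are obtained is not specified here, and the cumulative rule (keep a frame only if every earlier frame also passed) is one reading of the "frames 1–49" example rather than a confirmed detail.

```python
import numpy as np

def keep_mask(gaps: np.ndarray, threshold_px: float) -> np.ndarray:
    """gaps: (num_points, num_frames) -> boolean mask of labels to keep.

    A frame is kept only if it and all earlier frames pass the threshold.
    """
    passes = gaps <= threshold_px
    return np.logical_and.accumulate(passes, axis=1)

gaps = np.full((2, 60), 1.0)   # two points, 60 frames, small gaps everywhere
gaps[1, 49:] = 25.0            # index 49 is frame 50 when counting from 1
mask = keep_mask(gaps, threshold_px=5.0)
kept_per_point = mask.sum(axis=1).tolist()
print(kept_per_point)
```

Dropping the `accumulate` call and returning `passes` directly gives a strictly per-frame filter, which would also keep frames where a track recovers after a temporary failure.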

Cycle consistency is a necessary but not sufficient condition. A tracker that stays perfectly still on a static texture is cycle-consistent but wrong if the true point moved. However, combined with SIFT-based query selection (which biases toward distinctive, trackable points), the false-positive rate is low enough for effective training.
A tracker tracks point P forward 100 frames, then backward. The return point is 2 pixels from the original. With a threshold of 5 pixels, is this track accepted or rejected?

Chapter 3: The Pseudo-Labeling Pipeline

Now let's see the full pipeline in action. This is the heart of CoTracker3's training recipe — and it is remarkably simple compared to BootsTAPIR's complex protocol.

Step 1: Collect Real Videos

Gather a dataset of unlabelled real videos. CoTracker3 uses around 100,000 internet-like videos (30 seconds each) featuring diverse scenes with humans, animals, and dynamic objects. No annotations needed.

Step 2: Sample Query Points

For each video, run SIFT on randomly selected frames to find good-to-track keypoints. SIFT detects distinctive corners and edges — points that are likely to produce reliable tracks. If SIFT can't find enough points on a frame, skip that video entirely.
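The sampling-and-skip logic can be sketched independently of the detector. In the code below, `detect` is a hypothetical callable that maps a frame to a list of (x, y) keypoints; with OpenCV, the keypoints returned by `cv2.SIFT_create().detect(gray, None)` could fill that role via each keypoint's `.pt`. The parameter names `num_frames` and `min_points` are illustrative, and their values in the paper are not given here.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Query = Tuple[int, float, float]                          # (frame index, x, y)
Detector = Callable[[object], List[Tuple[float, float]]]  # frame -> keypoints

def sample_queries(frames: Sequence[object], detect: Detector,
                   num_frames: int, min_points: int,
                   rng: random.Random) -> Optional[List[Query]]:
    """Return query points, or None if the video should be skipped."""
    chosen = rng.sample(range(len(frames)), k=min(num_frames, len(frames)))
    queries: List[Query] = []
    for t in sorted(chosen):
        keypoints = detect(frames[t])
        if len(keypoints) < min_points:
            return None  # too few trackable points on this frame: skip the video
        queries += [(t, x, y) for x, y in keypoints]
    return queries

# Toy data: each "frame" is just its own keypoint list, so detect is the identity.
detect = lambda frame: frame
rich = [[(float(i), 1.0) for i in range(10)] for _ in range(5)]
poor = rich[:4] + [[(0.0, 0.0)]]   # last frame has a single keypoint

ok = sample_queries(rich, detect, num_frames=3, min_points=4, rng=random.Random(0))
skipped = sample_queries(poor, detect, num_frames=5, min_points=4, rng=random.Random(0))
print(len(ok), skipped)
```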

Step 3: Run Teacher Trackers

Randomly sample one of four teacher models (CoTracker3 online, CoTracker3 offline, CoTracker, TAPIR) and run it on the video. The teacher produces tracks for all query points. Over multiple epochs, different teachers label the same video, providing diverse pseudo-labels.

Step 4: Filter with Cycle Consistency

Run the forward-backward test on each pseudo-track. Discard tracks with large cycle gaps. This removes the worst teacher mistakes and keeps the reliable tracks.

Step 5: Train the Student

Train the student CoTracker3 on the surviving pseudo-labels using the same Huber loss as for synthetic data. Freeze the confidence and visibility heads (since pseudo-labels don't have reliable visibility annotations) and add a separate head for tracks only.
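Steps 2 to 4 can be tied together in one batch-building function. This is a structural sketch only: `sample_queries`, `is_reliable` and the teacher callables are hypothetical stand-ins passed in as arguments, and the toy lambdas at the bottom exist purely to make the example runnable.

```python
import random
from typing import Callable, Dict, List

def make_training_batch(videos: List[str],
                        teachers: Dict[str, Callable],
                        sample_queries: Callable,
                        is_reliable: Callable,
                        rng: random.Random) -> List[dict]:
    """Steps 2-4 for one batch: query sampling, one random teacher, filtering."""
    name = rng.choice(sorted(teachers))        # Step 3: one teacher per batch
    teacher = teachers[name]
    batch = []
    for video in videos:
        queries = sample_queries(video)        # Step 2 (None means skip the video)
        if queries is None:
            continue
        tracks = [teacher(video, q) for q in queries]
        kept = [(q, tr) for q, tr in zip(queries, tracks)
                if is_reliable(video, q, tr)]  # Step 4: cycle-consistency filter
        if kept:
            batch.append({"video": video, "teacher": name, "labels": kept})
    return batch  # Step 5 would minimise the track loss on these labels

# Toy stand-ins: queries are integers, a "track" is a number derived from the query.
teachers = {"teacher_a": lambda video, q: q * 10,
            "teacher_b": lambda video, q: q * 10 + 1}
batch = make_training_batch(
    videos=["a", "b", "c"],
    teachers=teachers,
    sample_queries=lambda v: None if v == "b" else [0, 1, 2],
    is_reliable=lambda v, q, tr: q != 1,
    rng=random.Random(0),
)
print([item["video"] for item in batch])
```

Because the teacher is drawn again for every batch, a video revisited in a later epoch will often be labelled by a different teacher.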

What BootsTAPIR needed that CoTracker3 doesn't: BootsTAPIR required augmentations applied to student predictions, three different loss masks, exponential moving average (EMA) of model weights, and 15 million videos. CoTracker3 uses none of these. No augmentations, no masks, no EMA, and only 15k videos for state-of-the-art results.
Pseudo-Labeling Pipeline

Watch the full pipeline animate: teacher trackers produce pseudo-labels on real videos, cycle consistency filters bad tracks, and the surviving labels train a student that surpasses every teacher.

Why does CoTracker3 randomly sample a different teacher for each training batch?

Chapter 4: Architectural Simplification

CoTracker3 isn't just a new training recipe — it's also a leaner architecture. The model strips away components that previous trackers treated as essential, and replaces complex modules with simpler alternatives.

What's Removed

What's Simplified

The Architecture Flow

For each frame, extract convolutional features at 4 scales. Compute 4D correlations between query features and track-estimate features. Feed correlations, displacement embeddings, confidence, and visibility into a transformer with factorized time attention and cross-track group attention. The transformer outputs incremental updates. Repeat M times, resampling features at each iteration.
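To make "4D correlation" concrete, here is a minimal NumPy sketch for a single point at a single scale. It assumes both the query and the current track estimate are represented by an S×S grid of D-dimensional features sampled around them; the values of `S` and `D` are illustrative, not the model's actual settings.

```python
import numpy as np

def correlation_4d(query_patch: np.ndarray, target_patch: np.ndarray) -> np.ndarray:
    """query_patch, target_patch: (S, S, D) feature grids.

    Returns an (S, S, S, S) array holding the dot product of every query
    cell with every target cell.
    """
    return np.einsum("ijd,kld->ijkl", query_patch, target_patch)

rng = np.random.default_rng(0)
S, D = 7, 16  # patch side and feature dimension (illustrative values)
q = rng.standard_normal((S, S, D))
t = rng.standard_normal((S, S, D))

corr = correlation_4d(q, t)
flat = corr.reshape(-1)  # flattened vector that a small MLP could then embed
print(corr.shape, flat.shape)
```

The four indices are what make the correlation "4D": two for the position inside the query patch and two for the position inside the target patch. Flattening it and passing it through an MLP corresponds to the "Simple MLP" entry for correlation processing in the comparison table.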

Speed and size: CoTracker3 has 25M parameters (a little over half of CoTracker's 45M) and runs 27% faster than LocoTrack despite having cross-track attention. The simplifications compound: fewer components mean fewer parameters and faster inference.
| Component | CoTracker | LocoTrack | CoTracker3 |
|---|---|---|---|
| Global matching | No | Yes | No |
| Correlation processing | N/A | Ad-hoc module | Simple MLP |
| Cross-track attention | Yes | No | Yes |
| Visibility prediction | Separate net | Separate net | Joint iterative |
| Parameters | 45M | 25M | 25M |
| Speed (μs/frame/point) | 472 | 290 | 209 |
What previously "essential" component does CoTracker3 remove, and why doesn't this hurt performance?

Chapter 5: Semi-Supervised Training

CoTracker3's training happens in two stages. The first uses synthetic data with perfect labels. The second adds real data with pseudo-labels. Let's walk through both.

Stage 1: Pre-training on Kubric

The model starts by training on Kubric, the same synthetic dataset used by all modern point trackers. Kubric provides perfect ground-truth tracks, perfect visibility labels, and perfect confidence targets. This gives the model a solid foundation in the mechanics of tracking: understanding motion, correlation, and iterative refinement.

Stage 2: Fine-tuning on Real + Pseudo-labels

The pre-trained model is then fine-tuned on real videos with pseudo-labels from the teacher pipeline. But there's a subtle trick: the confidence and visibility heads are frozen during this stage.

Why freeze confidence and visibility? Pseudo-labels don't have reliable visibility annotations — the teacher can't tell you with certainty whether a point is occluded, only where it thinks the point is. If you train the visibility head on unreliable pseudo-labels, it forgets what it learned from Kubric (where visibility labels are perfect). Solution: freeze it. A separate linear head handles track predictions, while the visibility head retains its Kubric-learned knowledge.

This split-head strategy avoids catastrophic forgetting of visibility and confidence predictions while still improving track accuracy on real data. The ablation confirms: freezing the head improves AJ by +0.8 and OA by +3.9 on TAP-Vid.
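The effect of freezing can be shown with a toy update step. The parameter group names below are hypothetical and the arrays are placeholders; in a PyTorch model the same effect is usually obtained by calling `requires_grad_(False)` on the frozen modules' parameters. The point of the sketch is only that frozen groups come out of the update bit-for-bit unchanged.

```python
import numpy as np

def sgd_step(params: dict, grads: dict, frozen: set, lr: float) -> dict:
    """Return updated parameters; groups named in `frozen` are left untouched."""
    return {name: (value if name in frozen else value - lr * grads[name])
            for name, value in params.items()}

# Hypothetical parameter groups (names and shapes are illustrative only).
params = {"backbone": np.ones(3), "track_head": np.ones(2),
          "visibility_head": np.ones(2), "confidence_head": np.ones(2)}
grads = {name: np.full_like(value, 0.5) for name, value in params.items()}
FROZEN = {"visibility_head", "confidence_head"}

updated = sgd_step(params, grads, FROZEN, lr=0.1)
print({name: value.tolist() for name, value in updated.items()})
```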

Online vs Offline Training

Both variants share the same architecture but differ in how they see videos during training:

Offline advantage for occlusions: By seeing all frames at once, the offline model can interpolate trajectories behind occlusions — if a point disappears at frame 30 and reappears at frame 50, the model uses both the before and after context. The online model only sees forward, so it must guess.
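The information advantage can be illustrated with a toy one-dimensional example. This is not how the transformer computes anything; it only compares what is available to each variant. The motion model `true_x` is invented, the point is assumed to be last seen at frame 30 and seen again at frame 50, and the "online" estimate is a simple constant-velocity extrapolation.

```python
import numpy as np

def true_x(t):
    """Toy ground-truth motion: a smoothly decelerating point."""
    return 100.0 * np.sin(np.asarray(t, dtype=float) / 50.0)

hidden = np.arange(31, 50)  # occluded frames 31..49

# Offline view: observations on both sides of the gap (frames 30 and 50).
offline = np.interp(hidden, [30, 50], [true_x(30), true_x(50)])

# Online view: only the past is known, so extrapolate the last observed velocity.
velocity = true_x(30) - true_x(29)
online = true_x(30) + velocity * (hidden - 30)

offline_err = float(np.abs(offline - true_x(hidden)).mean())
online_err = float(np.abs(online - true_x(hidden)).mean())
print(f"mean error while occluded: offline {offline_err:.2f} px, online {online_err:.2f} px")
```

Interpolation is pinned at both ends of the gap, so its error stays bounded, while the extrapolation error keeps growing the longer the point stays hidden.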
Why are the confidence and visibility heads frozen during fine-tuning on pseudo-labeled real data?

Chapter 6: Training Details

Let's pin down the specifics of CoTracker3's training recipe.

Data Sources

| Source | Type | Size | Purpose |
|---|---|---|---|
| Kubric | Synthetic | Standard | Stage 1 pre-training with GT labels |
| Internet videos | Real, unlabelled | ~100k videos, 30s each | Stage 2 pseudo-label fine-tuning |

The real videos feature diverse scenes — primarily humans and animals, the kinds of dynamic content where tracking matters most.

Query Point Sampling

CoTracker3 uses SIFT keypoints to select which points to track. The intuition: SIFT detects corners and edges that are distinctive and "good to track" — exactly the kind of points where pseudo-labels are most reliable. If SIFT fails to find enough keypoints on any selected frame, the entire video is skipped, maintaining training data quality.

The ablation shows SIFT is slightly better than SuperPoint, DISK, or uniform random sampling, though all perform comparably. The robustness to this choice is reassuring.

Loss Functions

Three losses, all applied at each of M iterative updates with exponentially increasing weights (γ = 0.8):

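$$L_{\text{track}} = \sum_{m=1}^{M} \gamma^{M-m} \left( \tfrac{1}{5}\,\mathbb{1}_{\text{occ}} + \mathbb{1}_{\text{vis}} \right) \cdot \mathrm{Huber}\!\left(P^{(m)}, P^{*}\right)$$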

The Huber loss (threshold = 6) on track positions, with occluded points weighted at 1/5 of visible points. Later iterations matter more (γ decay).

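$$L_{\text{conf}} = \sum_{m=1}^{M} \gamma^{M-m}\, \mathrm{BCE}\!\left( \sigma\!\left(C^{(m)}\right),\; \mathbb{1}\!\left[ \lVert P^{(m)} - P^{*} \rVert_2 < 12 \right] \right)$$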

The confidence loss: BCE between predicted confidence and an indicator of whether the prediction is within 12 pixels of ground truth.

$$L_{\text{occl}} = \sum_{m=1}^{M} \gamma^{M-m}\, \mathrm{BCE}\!\left( \sigma\!\left(V^{(m)}\right),\; V^{*} \right)$$

The visibility loss: BCE between predicted and ground-truth visibility.

During pseudo-label training: only Ltrack is used. The confidence and visibility heads are frozen, so Lconf and Loccl are not computed. This simplifies the training loop and avoids corrupting learned visibility knowledge.
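The three losses can be written out in NumPy as a sketch. Several details are assumptions rather than facts from the text: the Huber loss is applied to the Euclidean distance between predicted and target points (it could equally be per coordinate), losses are averaged over points, and the array shapes are chosen for convenience.

```python
import numpy as np

GAMMA = 0.8

def huber(pred, target, delta=6.0):
    """Huber loss on the Euclidean distance between 2-D points (assumed form)."""
    dist = np.linalg.norm(pred - target, axis=-1)
    return np.where(dist <= delta, 0.5 * dist**2, delta * (dist - 0.5 * delta))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(prob, label, eps=1e-7):
    """Elementwise binary cross-entropy between probabilities and 0/1 labels."""
    prob = np.clip(prob, eps, 1 - eps)
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob))

def iteration_weights(M):
    """gamma^(M-m) for m = 1..M, so the final iteration gets weight 1."""
    return GAMMA ** (M - np.arange(1, M + 1))

def track_loss(preds, target, visible):
    """preds: (M, N, 2); target: (N, 2); visible: (N,) bool."""
    point_w = np.where(visible, 1.0, 0.2)  # occluded points count 1/5
    per_iter = (point_w * huber(preds, target)).mean(axis=-1)  # shape (M,)
    return float((iteration_weights(len(preds)) * per_iter).sum())

def confidence_loss(conf_logits, preds, target, radius=12.0):
    """conf_logits: (M, N). Target is 1 where the prediction is within radius."""
    within = (np.linalg.norm(preds - target, axis=-1) < radius).astype(float)
    per_iter = bce(sigmoid(conf_logits), within).mean(axis=-1)
    return float((iteration_weights(len(preds)) * per_iter).sum())

def visibility_loss(vis_logits, visible):
    """vis_logits: (M, N); visible: (N,) bool ground-truth visibility."""
    per_iter = bce(sigmoid(vis_logits), visible.astype(float)).mean(axis=-1)
    return float((iteration_weights(len(vis_logits)) * per_iter).sum())

# Tiny example: M = 2 iterations, N = 2 points, every prediction 5 px off.
target = np.zeros((2, 2))
preds = np.tile(np.array([3.0, 4.0]), (2, 2, 1))
visible = np.array([True, False])
l_track = track_loss(preds, target, visible)
l_conf = confidence_loss(np.zeros((2, 2)), preds, target)
print(iteration_weights(4), l_track, l_conf)
```

During pseudo-label fine-tuning only `track_loss` would be evaluated, with the pseudo-labels standing in for `target`.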

Teacher Models

All four teachers are trained only on Kubric synthetic data. They are frozen during student training — no EMA, no joint optimization:

The ablation in Table 5 shows that removing any teacher hurts. Even weaker teachers like CoTracker contribute complementary knowledge that the student can extract.

Why does the Huber loss weight occluded points at only 1/5 of visible points?

Chapter 7: Results

CoTracker3 is evaluated on the TAP-Vid benchmark suite (Kinetics, DAVIS, RGB-Stacking) and Dynamic Replica. The results are striking.

TAP-Vid Benchmarks (Mean Across All Three)

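| Method | Training Data | AJ ↑ | δ_avg^vis | OA ↑ |
|---|---|---|---|---|
| TAPIR | Kubric | 56.2 | 70.0 | 86.5 |
| CoTracker | Kubric | 61.8 | 76.1 | 88.3 |
| LocoTrack | Kubric | 62.9 | 75.3 | 87.2 |
| BootsTAPIR | Kubric + 15M real | 61.4 | 73.6 | 88.7 |
| CoTracker3 online | Kubric + 15k real | 63.8 | 76.3 | 90.2 |
| CoTracker3 offline | Kubric + 15k real | 64.4 | 76.9 | 91.2 |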
The headline number: CoTracker3 offline beats BootsTAPIR across the board, with higher AJ (+3.0), higher δ_avg^vis (+3.3), and higher OA (+2.5), while using 1,000× less real training data (15k vs 15M videos) and a far simpler training protocol.

Occlusion Tracking (Dynamic Replica)

This is where cross-track attention really shines. Dynamic Replica provides ground truth for occluded points:

| Method | δ_avg^vis | δ_avg^occ |
|---|---|---|
| LocoTrack | 71.4 | 29.8 |
| BootsTAPIR | 69.0 | 28.0 |
| CoTracker3 online | 72.9 | 41.0 |
| CoTracker3 offline | 69.8 | 41.8 |

CoTracker3 offline tracks occluded points with δ_avg^occ of 41.8, a gain of 13.8 points over BootsTAPIR (28.0) and 12.0 points over LocoTrack (29.8). The cross-track attention lets the model infer occluded positions from the motion of surrounding visible points.
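The gaps quoted in this chapter follow directly from the two tables. A quick check, using only numbers copied from those tables (the dictionary layout and names are just for illustration):

```python
# (AJ, delta_vis_avg, OA) on TAP-Vid, copied from the benchmark table above.
tap_vid = {
    "BootsTAPIR": (61.4, 73.6, 88.7),
    "CoTracker3 offline": (64.4, 76.9, 91.2),
}
# delta_occ_avg on Dynamic Replica, copied from the occlusion table above.
dynamic_replica_occ = {
    "LocoTrack": 29.8,
    "BootsTAPIR": 28.0,
    "CoTracker3 online": 41.0,
    "CoTracker3 offline": 41.8,
}

def gap(ours: float, theirs: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(ours - theirs, 1)

headline = tuple(gap(o, b) for o, b in
                 zip(tap_vid["CoTracker3 offline"], tap_vid["BootsTAPIR"]))
best_occ = dynamic_replica_occ["CoTracker3 offline"]
occ_vs_boots = gap(best_occ, dynamic_replica_occ["BootsTAPIR"])
occ_vs_loco = gap(best_occ, dynamic_replica_occ["LocoTrack"])
data_ratio = 15_000_000 / 15_000

print(headline, occ_vs_boots, occ_vs_loco, data_ratio)
```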

Results Comparison

Average Jaccard (AJ) on TAP-Vid benchmarks (mean of Kinetics, DAVIS, RGB-Stacking). Higher is better. Data amount shown in parentheses.

On Dynamic Replica, CoTracker3 offline achieves δ_avg^occ of 41.8 vs BootsTAPIR's 28.0. What architectural feature enables this large gap in occluded point tracking?

Chapter 8: Data Efficiency

Perhaps the most remarkable result in CoTracker3 is how little data it needs. The scaling curve tells a clear story:

The Scaling Curve

| Real Videos Used | CoTracker3 Online AJ | vs BootsTAPIR (15M videos) |
|---|---|---|
| 0 (Kubric only) | 62.2 | −0.6 below BootsTAPIR |
| 100 | 63.0 | +0.2 above BootsTAPIR |
| 1,000 | 63.5 | +0.7 above |
| 5,000 | 63.8 | +1.0 above |
| 15,000 | 64.0 | +1.2 above |
| 100,000 | 64.0 | +1.2 above (plateau) |
100 videos is enough to beat BootsTAPIR. Just 100 real videos, under 0.001% of BootsTAPIR's 15M, pushes CoTracker3 above the previous state of the art. Performance plateaus around 30k videos. Beyond that, the student has likely surpassed all its teachers and can't extract more from their pseudo-labels.
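The diminishing returns are easy to read off the scaling table. The short check below uses only the table's own numbers and computes the AJ gained at each step, plus the share of BootsTAPIR's data that 100 videos represent.

```python
# (real videos used, CoTracker3 online AJ), copied from the scaling table.
scaling = [(0, 62.2), (100, 63.0), (1_000, 63.5),
           (5_000, 63.8), (15_000, 64.0), (100_000, 64.0)]

# AJ gained when moving from each row to the next.
step_gains = [round(aj_next - aj_prev, 1)
              for (_, aj_prev), (_, aj_next) in zip(scaling, scaling[1:])]

# 100 videos as a percentage of BootsTAPIR's 15M.
share_percent = 100 / 15_000_000 * 100

print(step_gains, f"{share_percent:.5f}%")
```

The first 100 videos account for the largest single step, and the last step (15,000 to 100,000 videos) adds nothing measurable at this precision.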

Why the Plateau?

A likely explanation: once the student is better than every teacher on the kinds of scenes in the dataset, the pseudo-labels add more noise than signal, because the student is being taught by weaker models. This is a known phenomenon in knowledge distillation: the student can outgrow the teacher.

The fix? Use the improved student as a new teacher and re-generate pseudo-labels. The paper shows that self-training (using CoTracker3's own predictions as annotations) provides an additional +1.2 AJ improvement. This bootstrapping could, in principle, be iterated.

Does This Work for Other Architectures?

Yes. The pseudo-labeling pipeline is architecture-agnostic. When applied to LocoTrack and the original CoTracker, both improve significantly:

Data efficiency vs protocol complexity: BootsTAPIR's 15M videos + EMA + augmentations + loss masks achieve less than CoTracker3's 15k videos + random teacher sampling + cycle consistency. The lesson: a clean protocol on a small, filtered dataset beats a complex protocol on a massive, noisy one.
Why does CoTracker3's performance plateau after ~30k real training videos?

Chapter 9: Connections

CoTracker3 sits at the intersection of several important research threads. Let's map the landscape.

Relation to CoTracker

CoTracker (Karaev et al., 2024) introduced joint point tracking with cross-track attention and virtual tracks. CoTracker3 inherits cross-track attention (crucial for occlusion handling) but strips away the sliding window architecture, removes the separate visibility network, and halves the parameter count. Think of CoTracker3 as CoTracker distilled to its essential components.

Relation to VGGSfM

VGGSfM (Wang et al., 2024) uses point tracking as a component for Structure-from-Motion, validating tracks through 3D reconstruction. CoTracker3 can serve as the tracking backbone for VGGSfM and similar 3D reconstruction pipelines, providing more robust tracks on real-world video.

Relation to Self-Supervised Learning

CoTracker3's pseudo-labeling pipeline is a form of semi-supervised learning closely related to the Noisy Student framework in image classification. The pattern is the same: train a teacher on labeled data, generate pseudo-labels on unlabeled data, train a student on the combined dataset. The student exceeds the teacher because it learns from more diverse (real) data.

Relation to Pseudo-Labeling in NLP

The same principle drives progress in NLP. Large language models generate synthetic training data for smaller models (distillation), and self-play / self-improvement loops use a model's own outputs as training targets. The key enabler in all cases is a cheap quality filter — cycle consistency for tracking, perplexity/consistency checks for text.

Relation to BootsTAPIR

BootsTAPIR (Doersch et al., 2024) pioneered using real videos for point tracker training, but with a heavy protocol. CoTracker3 shows that the complex machinery (EMA, augmentations, multiple loss masks) is unnecessary when you have (a) multiple diverse teachers, (b) cycle consistency filtering, and (c) a cleaner architecture.

Cheat Sheet

| Aspect | CoTracker3 |
|---|---|
| Input | Video + query point (frame, x, y) |
| Output | Per-frame (x, y) track + visibility + confidence |
| Backbone | Convolutional feature extractor (4 scales) |
| Core mechanism | 4D correlation + transformer iterative refinement |
| Training data | Kubric (synthetic) + 15k real videos (pseudo-labeled) |
| Pseudo-label filter | Cycle consistency (forward-backward test) |
| Parameters | 25M (2× smaller than CoTracker) |
| Key result | SOTA on TAP-Vid with 1000× less data than BootsTAPIR |
| Variants | Online (streaming) + Offline (full video) |
The broader lesson: When training data is the bottleneck, don't collect more data — get better labels from the data you have. Multiple imperfect teachers + cheap consistency checks can outperform massive brute-force data collection. Simplicity in architecture and training protocol scales better than complexity.
What general machine learning principle does CoTracker3's pseudo-labeling approach exemplify?