CoTracker3 — Veanors

Chapter 0: The Problem

You want to track a point — say, a fingertip, a ball, a corner of a box — across every frame of a video. Not bounding-box tracking, not optical flow between adjacent frames. Point tracking: given a single pixel in one frame, find exactly where it lands in every subsequent frame, even through occlusions and fast motion.

Modern point trackers (TAPIR, CoTracker, LocoTrack) are powerful, but they share a dirty secret: they are almost entirely trained on synthetic data. Specifically, on Kubric — a dataset of rendered 3D objects bouncing around in simulated scenes with perfect ground-truth tracks.

Why synthetic? Because annotating real videos with pixel-accurate point tracks is essentially impossible at scale. You would need a human to mark the exact pixel location of a point in every single frame. For a 250-frame video tracking 100 points, that is 25,000 annotations — per video.

The sim-to-real gap: Kubric videos have clean textures, simple lighting, rigid objects, and no motion blur. Real videos have reflections, deformable objects, chaotic backgrounds, and camera shake. A tracker trained only on Kubric learns a biased prior — it works well on synthetic benchmarks but degrades on real-world footage.

The simulation below shows this concretely. On the left, a synthetic scene with clean geometric shapes: the tracker nails every point. On the right, a noisy real-world-like scene: the same tracker drifts and loses points.

Synthetic vs Real Tracking

Watch points being tracked across frames. Left: clean synthetic scene (Kubric-like). Right: noisy real-world scene. The tracker trained on synthetic data struggles with real-world noise, texture, and occlusion.

BootsTAPIR tried to fix this by training on 15 million unlabelled real videos with a complex self-training protocol involving augmentations, EMA updates, and loss masks. It helped, but the recipe was expensive and over-engineered.

CoTracker3 asks: can we do this much more simply?

Why are modern point trackers primarily trained on synthetic data like Kubric?

Because annotating real videos with pixel-accurate point tracks at scale is essentially impossible: it requires marking exact pixel locations in every frame for every point
Because synthetic data produces better trackers than real data
Because real videos are too large to fit in GPU memory

Chapter 1: The Key Insight

CoTracker3's core idea is beautifully simple: use imperfect trackers to label real videos, then train a student on those noisy labels.

Wait — if the teacher trackers are imperfect, won't the labels be wrong? Won't the student learn the teachers' mistakes? Yes, some labels will be wrong. But several forces conspire to make the student better than any individual teacher:

Force 1: Real data distribution

Even noisy labels on real videos help the student learn the true distribution of real-world motion, textures, and lighting — reducing the sim-to-real gap.

↓

Force 2: Teacher ensembling

Multiple teachers with complementary strengths pseudo-label the same videos. Online trackers stick to query points well. Offline trackers handle occlusions better. The student sees diverse perspectives.

↓

Force 3: Noise averaging

Different teachers make different mistakes. Over many training samples, the errors cancel out while the correct signal accumulates. The student learns the consensus.

↓

Force 4: Cycle consistency filtering

Bad tracks are detected and removed before training. Only high-quality pseudo-labels survive. This dramatically raises the signal-to-noise ratio.

The distillation insight: The student doesn't just imitate one teacher — it learns from a committee of teachers on real data, with bad labels filtered out. This is why pseudo-labeling + filtering can produce a student that surpasses every teacher individually. The same principle appears in knowledge distillation, boosting, and ensemble learning.
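Force 3 can be checked with a toy simulation. The sketch below is an idealization, not the paper's procedure: it assumes each teacher's error is independent, zero-mean Gaussian noise around the true coordinate, and the function name `mean_abs_error` and the 4-pixel noise level are made up for illustration. Real teacher errors are correlated, so the actual benefit is likely smaller than this suggests.

```python
import random
import statistics

def mean_abs_error(num_teachers, num_points=2000, noise_px=4.0, seed=0):
    """Average |error| of the consensus (mean) of several noisy labels.

    Assumption: every teacher label is the true coordinate plus independent
    Gaussian noise with standard deviation `noise_px`.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(num_points):
        true_x = rng.uniform(0.0, 100.0)
        labels = [true_x + rng.gauss(0.0, noise_px) for _ in range(num_teachers)]
        consensus = statistics.fmean(labels)
        errors.append(abs(consensus - true_x))
    return statistics.fmean(errors)

single = mean_abs_error(num_teachers=1)
committee = mean_abs_error(num_teachers=4)
print(f"one teacher: {single:.2f} px, four teachers: {committee:.2f} px")
```

Under these assumptions the consensus error shrinks roughly with the square root of the number of teachers, which is the sense in which independent mistakes "cancel out".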

The teachers are CoTracker3 online, CoTracker3 offline, CoTracker, and TAPIR — all trained only on Kubric. During training, a random teacher is sampled for each batch. Over multiple epochs, the same video gets pseudo-labeled by different teachers, providing diverse training signal.

Why can a student tracker become better than its teacher trackers, even though it trains on the teachers' imperfect predictions?

Because the student has more parameters
Because Kubric labels are always correct
Because multiple teachers provide diverse and complementary signals on real data, bad labels are filtered out via cycle consistency, and the real data distribution reduces the sim-to-real gap

Chapter 2: Cycle Consistency

How do you know if a pseudo-label is good or bad? You don't have ground truth — that's the whole point. But you have a powerful geometric check: cycle consistency.

The idea is simple. Track a point forward through the video, then track it backward from where it ended up. If you land back near where you started, the track was probably correct. If you end up far away, something went wrong.

The Forward-Backward Test

Given a query point Q = (t_q, x_q, y_q) at frame t_q:

Forward pass: Track Q from frame t_q to frame T, producing track P_{t_q}, P_{t_q+1}, ..., P_T.
Backward pass: Take the endpoint P_T = (T, x_T, y_T) and track it backward from frame T to frame t_q, producing a return track P'_T, P'_{T-1}, ..., P'_{t_q}.
Check: Compare P'_{t_q} with the original query point Q. If ||P'_{t_q} - Q|| is small, the track is cycle-consistent. If the gap is large, reject the track.

gap = ||P'_{t_q} − Q|| = ||(x'_q − x_q, y'_q − y_q)||₂

Why does this work? If a tracker drifts to the wrong location — say, it locks onto a similar-looking but different point — the backward pass will track from that wrong location and almost certainly not return to the original. The probability of two independent errors canceling perfectly is very low. Only genuinely correct tracks survive the round trip.
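The check itself is a single distance comparison. A minimal sketch, assuming the forward and backward tracks have already been computed by some tracker; `cycle_gap` and `is_cycle_consistent` are hypothetical helper names, and treating the threshold as inclusive is a choice made here, not something the text specifies.

```python
import math

def cycle_gap(query_xy, returned_xy):
    """Euclidean distance between the query point and the point the
    backward pass returns to at the query frame."""
    return math.hypot(returned_xy[0] - query_xy[0], returned_xy[1] - query_xy[1])

def is_cycle_consistent(query_xy, returned_xy, threshold_px):
    """Accept the track when the round trip lands within the threshold."""
    return cycle_gap(query_xy, returned_xy) <= threshold_px

query = (120.0, 80.0)
returned = (123.0, 84.0)  # backward pass ends 3 px right and 4 px below the query
gap = cycle_gap(query, returned)
print(gap, is_cycle_consistent(query, returned, threshold_px=10.0))
```

Lowering `threshold_px` trades coverage for purity: fewer tracks survive, but the survivors are more likely to be correct.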

Try it below. Click to place a query point, then watch as it tracks forward and backward. Drag the threshold slider to see how strict filtering affects which tracks survive.

Cycle Consistency Check

A point is tracked forward N frames (teal path), then backward from its endpoint (warm path). The gap between start and return determines if the track is accepted. Adjust the threshold to filter.

In practice, cycle consistency is applied per-point, per-frame. A track that is cycle-consistent for most of its duration but drifts at frame 50 can still contribute clean labels for frames 1–49. The filter is granular, not all-or-nothing.
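Granular filtering can be sketched as a per-frame mask. The text does not say how the per-frame gaps are obtained, so the code below simply takes them as input (one round-trip gap per frame for a single point); `frame_mask` and `keep_labels` are hypothetical names.

```python
def frame_mask(gaps_px, threshold_px):
    """One boolean per frame: True where the round-trip gap is acceptable."""
    return [gap <= threshold_px for gap in gaps_px]

def keep_labels(track, mask):
    """Return (frame_index, (x, y)) pairs for frames that passed the check."""
    return [(t, xy) for t, (xy, ok) in enumerate(zip(track, mask)) if ok]

# Toy track for one point: it is tracked well for three frames, then drifts.
track = [(10.0, 10.0), (12.0, 10.0), (14.0, 10.0), (40.0, 25.0), (41.0, 26.0)]
gaps_px = [0.0, 0.8, 1.5, 9.0, 12.0]  # assumed per-frame round-trip gaps

mask = frame_mask(gaps_px, threshold_px=5.0)
labels = keep_labels(track, mask)
print(mask)
print(labels)
```

Only the masked-in frames would contribute to the training loss for this point; the drifting tail is ignored rather than discarding the whole track.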

Cycle consistency is a necessary but not sufficient condition. A tracker that stays perfectly still on a static texture is cycle-consistent but wrong if the true point moved. However, combined with SIFT-based query selection (which biases toward distinctive, trackable points), the false-positive rate is low enough for effective training.

A tracker tracks point P forward 100 frames, then backward. The return point is 2 pixels from the original. With a threshold of 5 pixels, is this track accepted or rejected?

Accepted: the 2-pixel gap is below the 5-pixel threshold, so the track is considered cycle-consistent
Rejected: any non-zero gap means the track drifted
Cannot determine without knowing the frame rate

Chapter 3: The Pseudo-Labeling Pipeline

Now let's see the full pipeline in action. This is the heart of CoTracker3's training recipe — and it is remarkably simple compared to BootsTAPIR's complex protocol.

Step 1: Collect Real Videos

Gather a dataset of unlabelled real videos. CoTracker3 uses around 100,000 internet-like videos (30 seconds each) featuring diverse scenes with humans, animals, and dynamic objects. No annotations needed.

Step 2: Sample Query Points

For each video, run SIFT on randomly selected frames to find good-to-track keypoints. SIFT detects distinctive corners and edges — points that are likely to produce reliable tracks. If SIFT can't find enough points on a frame, skip that video entirely.

Step 3: Run Teacher Trackers

Randomly sample one of four teacher models (CoTracker3 online, CoTracker3 offline, CoTracker, TAPIR) and run it on the video. The teacher produces tracks for all query points. Over multiple epochs, different teachers label the same video, providing diverse pseudo-labels.

Step 4: Filter with Cycle Consistency

Run the forward-backward test on each pseudo-track. Discard tracks with large cycle gaps. This removes the worst teacher mistakes and keeps the reliable tracks.

Step 5: Train the Student

Train the student CoTracker3 on the surviving pseudo-labels using the same Huber loss as for synthetic data. Freeze the confidence and visibility heads (since pseudo-labels don't have reliable visibility annotations) and add a separate head for tracks only.
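Steps 3 and 4 above can be sketched end to end with toy stand-ins. Everything here is hypothetical scaffolding: the teacher interface `(video, start_xy, reverse) -> list of (x, y)`, the `make_toy_teacher` generator, and the simplification that every query sits on frame 0. Real teachers would be the Kubric-trained trackers named in Step 3.

```python
import math
import random

def make_toy_teacher(drift_per_frame):
    """Stand-in for a tracker: true motion is 1 px/frame to the right, plus a
    vertical drift that accumulates regardless of tracking direction."""
    def track(video, start_xy, reverse):
        step = -1.0 if reverse else 1.0
        x, y = start_xy
        path = [(x, y)]
        for _ in range(video["num_frames"] - 1):
            x += step
            y += drift_per_frame
            path.append((x, y))
        return path
    return track

def pseudo_label(video, queries, teachers, threshold_px, rng):
    """Sample one teacher, track every query, keep cycle-consistent tracks."""
    name = rng.choice(sorted(teachers))                       # Step 3
    track = teachers[name]
    kept = []
    for query in queries:                                     # queries from Step 2
        forward = track(video, query, reverse=False)
        backward = track(video, forward[-1], reverse=True)
        if math.dist(backward[-1], query) <= threshold_px:    # Step 4
            kept.append((query, forward))
    return name, kept                                         # Step 5 trains on `kept`

video = {"num_frames": 10}
queries = [(20.0, 30.0), (50.0, 60.0)]
rng = random.Random(0)
_, clean = pseudo_label(video, queries, {"steady": make_toy_teacher(0.0)}, 5.0, rng)
_, noisy = pseudo_label(video, queries, {"drifty": make_toy_teacher(1.0)}, 5.0, rng)
print(len(clean), len(noisy))
```

The drifting teacher accumulates error on both legs of the round trip, so none of its tracks pass the filter, while the steady teacher's tracks all survive.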

What BootsTAPIR needed that CoTracker3 doesn't: BootsTAPIR required augmentations applied to student predictions, three different loss masks, exponential moving average (EMA) of model weights, and 15 million videos. CoTracker3 uses none of these. No augmentations, no masks, no EMA, and only 15k videos for state-of-the-art results.

Pseudo-Labeling Pipeline

Watch the full pipeline animate: teacher trackers produce pseudo-labels on real videos, cycle consistency filters bad tracks, and the surviving labels train a student that surpasses every teacher.

Why does CoTracker3 randomly sample a different teacher for each training batch?

To reduce computational cost
So that over multiple epochs the same video receives pseudo-labels from teachers with complementary strengths, preventing overfitting to any single teacher's biases
Because only one teacher can run on the GPU at a time

Chapter 4: Architectural Simplification

CoTracker3 isn't just a new training recipe — it's also a leaner architecture. The model strips away components that previous trackers treated as essential, and replaces complex modules with simpler alternatives.

What's Removed

No global matching stage. TAPIR, BootsTAPIR, and LocoTrack all use a global matching module that compares the query point to every location in every frame. CoTracker3 drops this entirely and relies solely on local correlation features. This is simpler and, surprisingly, doesn't hurt performance — the iterative refinement handles it.
No sliding window (offline mode). CoTracker processed videos in overlapping windows of T' frames, sliding forward by T'/2. CoTracker3 offline processes the entire video as a single window, enabling bidirectional tracking. The online variant still uses windows for real-time streaming.

What's Simplified

4D correlation via MLP. LocoTrack introduced 4D correlation volumes — comparing every feature around the query to every feature around the current track estimate. LocoTrack used a custom ad-hoc architecture to process these. CoTracker3 replaces it with a simple MLP projection. Same information, much less code.
Unified iterative updates. In CoTracker, visibility was predicted by a separate network. In CoTracker3, tracks, confidence, and visibility are all updated together at each iteration: (P, C, V) ← (P, C, V) + Δ(P, C, V).
Simplified token grid. The transformer input is just: correlation features + Fourier-encoded displacements + current confidence + current visibility. No extra embeddings or complex token construction.
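The unified update rule (P, C, V) ← (P, C, V) + Δ(P, C, V) can be sketched as a plain loop. In the real model the deltas come from the transformer; here `toy_delta` is a made-up stand-in that moves each point halfway toward a fixed target, and `refine` is a hypothetical name.

```python
def refine(points, confidence, visibility, predict_delta, num_iters):
    """Apply (P, C, V) <- (P, C, V) + delta for `num_iters` rounds."""
    for _ in range(num_iters):
        d_points, d_conf, d_vis = predict_delta(points, confidence, visibility)
        points = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, d_points)]
        confidence = [c + dc for c, dc in zip(confidence, d_conf)]
        visibility = [v + dv for v, dv in zip(visibility, d_vis)]
    return points, confidence, visibility

TARGET = (8.0, 4.0)  # pretend this is where the point really is

def toy_delta(points, confidence, visibility):
    """Stand-in for the transformer: step halfway to TARGET, raise confidence."""
    d_points = [((TARGET[0] - x) / 2, (TARGET[1] - y) / 2) for x, y in points]
    return d_points, [0.5] * len(points), [0.0] * len(points)

points, conf, vis = refine([(0.0, 0.0)], [0.0], [0.0], toy_delta, num_iters=4)
print(points, conf, vis)
```

The point of the design is that position, confidence, and visibility share one state and one update step, instead of visibility being produced by a separate network after tracking.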

The Architecture Flow

For each frame, extract convolutional features at 4 scales. Compute 4D correlations between query features and track-estimate features. Feed correlations, displacement embeddings, confidence, and visibility into a transformer with factorized time attention and cross-track group attention. The transformer outputs incremental updates. Repeat M times, resampling features at each iteration.

Speed and size: CoTracker3 has 25M parameters (about 1.8× fewer than CoTracker's 45M) and runs 27% faster than LocoTrack despite having cross-track attention. The simplifications compound: fewer components, fewer parameters, faster inference.

Component	CoTracker	LocoTrack	CoTracker3
Global matching	No	Yes	No
Correlation processing	N/A	Ad-hoc module	Simple MLP
Cross-track attention	Yes	No	Yes
Visibility prediction	Separate net	Separate net	Joint iterative
Parameters	45M	25M	25M
Speed (μs/frame/point)	472	290	209
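The size and speed claims can be recomputed from the table above. This is only arithmetic on the listed numbers; reading "27% faster" as "about 27% less time per frame per point" is an interpretation, and the throughput figure is shown for comparison.

```python
params_m = {"CoTracker": 45, "LocoTrack": 25, "CoTracker3": 25}
us_per_frame_point = {"CoTracker": 472, "LocoTrack": 290, "CoTracker3": 209}

# How many times fewer parameters than the original CoTracker.
param_ratio = params_m["CoTracker"] / params_m["CoTracker3"]

# Fraction of LocoTrack's per-point time that CoTracker3 saves.
time_saved = 1 - us_per_frame_point["CoTracker3"] / us_per_frame_point["LocoTrack"]

# Equivalent gain in points processed per unit time.
throughput_gain = us_per_frame_point["LocoTrack"] / us_per_frame_point["CoTracker3"] - 1

print(f"{param_ratio:.1f}x fewer parameters than CoTracker")
print(f"{time_saved:.1%} less time per frame per point than LocoTrack")
print(f"{throughput_gain:.1%} more points per unit time than LocoTrack")
```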

What previously "essential" component does CoTracker3 remove, and why doesn't this hurt performance?

The global matching stage: the iterative refinement with 4D local correlations is sufficient to locate points without exhaustive global search
The convolutional backbone: transformers can process raw pixels
The Huber loss: L2 loss works better

Chapter 5: Semi-Supervised Training

CoTracker3's training happens in two stages. The first uses synthetic data with perfect labels. The second adds real data with pseudo-labels. Let's walk through both.

Stage 1: Pre-training on Kubric

The model starts by training on Kubric, the same synthetic dataset used by all modern point trackers. Kubric provides perfect ground-truth tracks, perfect visibility labels, and perfect confidence targets. This gives the model a solid foundation in the mechanics of tracking: understanding motion, correlation, and iterative refinement.

Stage 2: Fine-tuning on Real + Pseudo-labels

The pre-trained model is then fine-tuned on real videos with pseudo-labels from the teacher pipeline. But there's a subtle trick: the confidence and visibility heads are frozen during this stage.

Why freeze confidence and visibility? Pseudo-labels don't have reliable visibility annotations — the teacher can't tell you with certainty whether a point is occluded, only where it thinks the point is. If you train the visibility head on unreliable pseudo-labels, it forgets what it learned from Kubric (where visibility labels are perfect). Solution: freeze it. A separate linear head handles track predictions, while the visibility head retains its Kubric-learned knowledge.

This split-head strategy avoids catastrophic forgetting of visibility and confidence predictions while still improving track accuracy on real data. The ablation confirms: freezing the head improves AJ by +0.8 and OA by +3.9 on TAP-Vid.

Online vs Offline Training

Both variants share the same architecture but differ in how they see videos during training:

Online: Processes videos in sliding windows of T' frames, advancing by T'/2. Only tracks forward from the query frame. The overlapped predictions from the previous window initialize the next.
Offline: Sees the entire video as one window. Tracks both forward and backward from the query frame. Randomly trims videos between T/2 and T frames during training to avoid overfitting to a specific length.
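The online windowing schedule is easy to write down. A minimal sketch, assuming an even window length T' and assuming the last window is clipped at the end of the video (the text does not say how the tail is handled); `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_frames, window):
    """Half-overlapping [start, end) frame spans: length T', stride T'/2."""
    if window < 2 or window % 2:
        raise ValueError("window must be an even number >= 2")
    stride = window // 2
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)  # clip the final window
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans

print(sliding_windows(num_frames=32, window=16))
print(sliding_windows(num_frames=10, window=16))
```

Each window after the first shares its first T'/2 frames with the previous one, which is what lets the previous window's predictions initialize the next. The offline variant corresponds to the degenerate case of a single span covering the whole video.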

Offline advantage for occlusions: By seeing all frames at once, the offline model can interpolate trajectories behind occlusions — if a point disappears at frame 30 and reappears at frame 50, the model uses both the before and after context. The online model only sees forward, so it must guess.

Why are the confidence and visibility heads frozen during fine-tuning on pseudo-labeled real data?

To reduce computational cost
Because pseudo-labels lack reliable visibility annotations, and training on them would cause the head to forget the accurate visibility knowledge learned from Kubric's perfect labels
Because the visibility head has already converged

Chapter 6: Training Details

Let's pin down the specifics of CoTracker3's training recipe.

Data Sources

Source	Type	Size	Purpose
Kubric	Synthetic	Standard	Stage 1 pre-training with GT labels
Internet videos	Real, unlabelled	~100k videos, 30s each	Stage 2 pseudo-label fine-tuning

The real videos feature diverse scenes — primarily humans and animals, the kinds of dynamic content where tracking matters most.

Query Point Sampling

CoTracker3 uses SIFT keypoints to select which points to track. The intuition: SIFT detects corners and edges that are distinctive and "good to track" — exactly the kind of points where pseudo-labels are most reliable. If SIFT fails to find enough keypoints on any selected frame, the entire video is skipped, maintaining training data quality.
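The skip rule can be sketched independently of the detector. The code below assumes keypoints have already been extracted per frame (by SIFT in the text); `select_queries`, `min_points`, and `max_points` are hypothetical, since the text gives no concrete thresholds.

```python
def select_queries(keypoints_per_frame, min_points, max_points):
    """Build (frame, x, y) queries, or return None to skip the video.

    keypoints_per_frame maps a frame index to a list of (x, y) keypoints.
    The video is rejected if any selected frame has too few keypoints.
    """
    if any(len(kps) < min_points for kps in keypoints_per_frame.values()):
        return None
    return [
        (frame, x, y)
        for frame, kps in sorted(keypoints_per_frame.items())
        for (x, y) in kps[:max_points]
    ]

textured = {0: [(5.0, 5.0), (9.0, 2.0), (3.0, 7.0)], 12: [(4.0, 4.0), (8.0, 1.0)]}
blank = {0: [(5.0, 5.0), (9.0, 2.0)], 12: []}

print(select_queries(textured, min_points=2, max_points=2))
print(select_queries(blank, min_points=2, max_points=2))
```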

The ablation shows SIFT is slightly better than SuperPoint, DISK, or uniform random sampling, though all perform comparably. The robustness to this choice is reassuring.

Loss Functions

Three losses, all applied at each of M iterative updates with exponentially increasing weights (γ = 0.8):

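L_{track} = \sum_{m=1}^{M} \gamma^{M-m} (\mathbb{1}_{occ}/5 + \mathbb{1}_{vis}) \cdot Huber(P^{(m)}, P^{*})

The Huber loss (threshold = 6) on track positions, with occluded points weighted at 1/5 of visible points. The weight γ^(M−m) equals 1 for the final iteration (m = M) and shrinks by a factor of 0.8 for each earlier one, so later iterations matter more.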
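A small sketch of the track loss, under two assumptions the text does not settle: the Huber function is applied to the Euclidean distance between prediction and target, and the per-point terms are summed rather than averaged. `huber` and `track_loss` are hypothetical names.

```python
import math

def huber(residual, delta=6.0):
    """Quadratic below `delta`, linear above it."""
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)

def track_loss(preds_per_iter, targets, visible, gamma=0.8):
    """preds_per_iter[m-1][i] is the (x, y) estimate of point i after update m."""
    num_iters = len(preds_per_iter)
    total = 0.0
    for m, preds in enumerate(preds_per_iter, start=1):
        iter_weight = gamma ** (num_iters - m)   # 1.0 for the last update
        for pred, target, vis in zip(preds, targets, visible):
            point_weight = 1.0 if vis else 0.2   # occluded points count 1/5
            total += iter_weight * point_weight * huber(math.dist(pred, target))
    return total

targets = [(0.0, 0.0)]
preds_per_iter = [[(10.0, 0.0)], [(3.0, 0.0)]]  # 10 px off, then 3 px off
loss_visible = track_loss(preds_per_iter, targets, visible=[True])
loss_occluded = track_loss(preds_per_iter, targets, visible=[False])
print(loss_visible, loss_occluded)
```

In this example the first update falls in the linear regime (10 > 6) and is down-weighted by 0.8, while the second is in the quadratic regime with full weight.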

L_{conf} = \sum_{m=1}^{M} \gamma^{M-m} BCE(\sigma(C^{(m)}), \mathbb{1}[\|P^{(m)} - P^{*}\| < 12])

The confidence loss: BCE between predicted confidence and an indicator of whether the prediction is within 12 pixels of ground truth.

L_{occl} = \sum_{m=1}^{M} \gamma^{M-m} BCE(\sigma(V^{(m)}), V^{*})

The visibility loss: BCE between predicted and ground-truth visibility.
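Both BCE losses share the same iteration weighting and differ only in their targets. A minimal single-point sketch with hypothetical names (`bce_with_logit`, `confidence_target`, `weighted_bce`); it is not numerically hardened for extreme logits the way a framework implementation would be.

```python
import math

def bce_with_logit(logit, target):
    """Binary cross-entropy between sigmoid(logit) and a 0/1 target."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def confidence_target(pred_xy, gt_xy, radius_px=12.0):
    """1.0 if the prediction is within `radius_px` of ground truth, else 0.0."""
    return 1.0 if math.dist(pred_xy, gt_xy) < radius_px else 0.0

def weighted_bce(logits_per_iter, targets_per_iter, gamma=0.8):
    """Sum of per-iteration BCE terms weighted by gamma^(M - m)."""
    num_iters = len(logits_per_iter)
    pairs = zip(logits_per_iter, targets_per_iter)
    return sum(
        gamma ** (num_iters - m) * bce_with_logit(logit, target)
        for m, (logit, target) in enumerate(pairs, start=1)
    )

gt = (0.0, 0.0)
preds = [(20.0, 0.0), (5.0, 0.0)]                       # one estimate per update
conf_targets = [confidence_target(p, gt) for p in preds]
l_conf = weighted_bce([0.0, 2.0], conf_targets)         # confidence logits C^(m)
l_occl = weighted_bce([0.5, 3.0], [1.0, 1.0])           # visibility logits V^(m), V* = 1
print(conf_targets, l_conf, l_occl)
```

Note that the confidence target is recomputed at every update from the current position estimate, whereas the visibility target V* is fixed by the ground truth.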

During pseudo-label training: only L_track is used. The confidence and visibility heads are frozen, so L_conf and L_occl are not computed. This simplifies the training loop and avoids corrupting learned visibility knowledge.

Teacher Models

All four teachers are trained only on Kubric synthetic data. They are frozen during student training — no EMA, no joint optimization:

CoTracker3 online — good at staying near query points
CoTracker3 offline — good at handling occlusions
CoTracker — joint tracking with cross-track attention
TAPIR — strong global matching baseline

The ablation in Table 5 shows that removing any teacher hurts. Even weaker teachers like CoTracker contribute complementary knowledge that the student can extract.

Why does the Huber loss weight occluded points at only 1/5 of visible points?

Because tracking visible points accurately is the primary objective: occluded points are inherently uncertain and should contribute less to the gradient to prioritize learning on reliable visible data
Because occluded points are never evaluated at test time
Because there are 5 times more occluded points than visible points

Chapter 7: Results

CoTracker3 is evaluated on the TAP-Vid benchmark suite (Kinetics, DAVIS, RGB-Stacking) and Dynamic Replica. The results are striking.

TAP-Vid Benchmarks (Mean Across All Three)

Method	Training Data	AJ ↑	δ^vis_avg ↑	OA ↑
TAPIR	Kubric	56.2	70.0	86.5
CoTracker	Kubric	61.8	76.1	88.3
LocoTrack	Kubric	62.9	75.3	87.2
BootsTAPIR	Kubric + 15M real	61.4	73.6	88.7
CoTracker3 online	Kubric + 15k real	63.8	76.3	90.2
CoTracker3 offline	Kubric + 15k real	64.4	76.9	91.2

The headline number: CoTracker3 offline beats BootsTAPIR across the board — higher AJ (+3.0), higher δ^vis_avg (+3.3), higher OA (+2.5) — while using 1,000× less real training data (15k vs 15M videos) and a far simpler training protocol.
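The margins quoted above follow directly from the table rows. A quick check, using only the numbers listed there:

```python
tap_vid = {
    "BootsTAPIR":         {"AJ": 61.4, "delta_vis": 73.6, "OA": 88.7},
    "CoTracker3 offline": {"AJ": 64.4, "delta_vis": 76.9, "OA": 91.2},
}

# Per-metric margin of CoTracker3 offline over BootsTAPIR, rounded to one decimal.
margins = {
    metric: round(tap_vid["CoTracker3 offline"][metric] - tap_vid["BootsTAPIR"][metric], 1)
    for metric in ("AJ", "delta_vis", "OA")
}

# Ratio of real training videos: 15M for BootsTAPIR vs 15k for CoTracker3.
data_ratio = 15_000_000 // 15_000

print(margins, data_ratio)
```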

Occlusion Tracking (Dynamic Replica)

This is where cross-track attention really shines. Dynamic Replica provides ground truth for occluded points:

Method	δ^vis_avg ↑	δ^occ_avg ↑
LocoTrack	71.4	29.8
BootsTAPIR	69.0	28.0
CoTracker3 online	72.9	41.0
CoTracker3 offline	69.8	41.8

CoTracker3 offline tracks occluded points with δ^occ_avg of 41.8, a gain of 13.8 points over BootsTAPIR (28.0) and 12.0 points over LocoTrack (29.8). The cross-track attention lets the model infer occluded positions from the motion of surrounding visible points.

Results Comparison

Average Jaccard (AJ) on TAP-Vid benchmarks (mean of Kinetics, DAVIS, RGB-Stacking). Higher is better. Data amount shown in parentheses.

On Dynamic Replica, CoTracker3 offline achieves δ^occ_avg of 41.8 vs BootsTAPIR's 28.0. What architectural feature enables this large gap in occluded point tracking?

Larger convolutional backbone
Cross-track attention: it allows the model to infer positions of occluded points based on the motion of nearby visible points, plus bidirectional access to all frames in offline mode
More training data

Chapter 8: Data Efficiency

Perhaps the most remarkable result in CoTracker3 is how little data it needs. The scaling curve tells a clear story:

The Scaling Curve

Real Videos Used	CoTracker3 Online AJ	vs BootsTAPIR (15M videos)
0 (Kubric only)	62.2	−0.6 below BootsTAPIR
100	63.0	+0.2 above BootsTAPIR
1,000	63.5	+0.7 above
5,000	63.8	+1.0 above
15,000	64.0	+1.2 above
100,000	64.0	+1.2 above (plateau)

100 videos is enough to beat BootsTAPIR. Just 100 real videos, under 0.001% of BootsTAPIR's 15M, push CoTracker3 above state-of-the-art. Performance plateaus around 30k videos. Beyond that, the student has likely surpassed all its teachers and can't extract more from their pseudo-labels.

Why the Plateau?

The student eventually becomes better than every teacher on every type of scene in the dataset. At that point, the pseudo-labels are more noisy than helpful — the student is being taught by inferiors. This is a known phenomenon in knowledge distillation: the student can outgrow the teacher.

The fix? Use the improved student as a new teacher and re-generate pseudo-labels. The paper shows that self-training (using CoTracker3's own predictions as annotations) provides an additional +1.2 AJ improvement. This bootstrapping could, in principle, be iterated.

Does This Work for Other Architectures?

Yes. The pseudo-labeling pipeline is architecture-agnostic. When applied to LocoTrack and the original CoTracker, both improve significantly:

LocoTrack: Benefits similarly to CoTracker3 for visible points, but still struggles with occlusions (no cross-track attention).
CoTracker: Starts weaker but keeps improving even at 100k videos, likely because it hasn't yet surpassed its teachers.

Data efficiency vs protocol complexity: BootsTAPIR's 15M videos + EMA + augmentations + loss masks achieve less than CoTracker3's 15k videos + random teacher sampling + cycle consistency. The lesson: a clean protocol on a small, filtered dataset beats a complex protocol on a massive, noisy one.

Why does CoTracker3's performance plateau after ~30k real training videos?

Because the student has surpassed all its teachers, and the pseudo-labels become more noisy than helpful: the student is being taught by models it has already exceeded
Because the model has reached the maximum number of parameters
Because 30k is the maximum batch size

Chapter 9: Connections

CoTracker3 sits at the intersection of several important research threads. Let's map the landscape.

Relation to CoTracker

CoTracker (Karaev et al., 2024) introduced joint point tracking with cross-track attention and virtual tracks. CoTracker3 inherits cross-track attention (crucial for occlusion handling) but drops the sliding window in its offline variant, removes the separate visibility network, and cuts the parameter count from 45M to 25M. Think of CoTracker3 as CoTracker distilled to its essential components.

Relation to VGGSfM

VGGSfM (Wang et al., 2024) uses point tracking as a component for Structure-from-Motion, validating tracks through 3D reconstruction. CoTracker3 can serve as the tracking backbone for VGGSfM and similar 3D reconstruction pipelines, providing more robust tracks on real-world video.

Relation to Semi-Supervised Learning

CoTracker3's pseudo-labeling pipeline is a form of semi-supervised learning closely related to the Noisy Student framework in image classification. The pattern is the same: train a teacher on labeled data, generate pseudo-labels on unlabeled data, train a student on the combined dataset. The student exceeds the teacher because it learns from more diverse (real) data.

Relation to Pseudo-Labeling in NLP

The same principle drives progress in NLP. Large language models generate synthetic training data for smaller models (distillation), and self-play / self-improvement loops use a model's own outputs as training targets. The key enabler in all cases is a cheap quality filter — cycle consistency for tracking, perplexity/consistency checks for text.

Relation to BootsTAPIR

BootsTAPIR (Doersch et al., 2024) pioneered using real videos for point tracker training, but with a heavy protocol. CoTracker3 shows that the complex machinery (EMA, augmentations, multiple loss masks) is unnecessary when you have (a) multiple diverse teachers, (b) cycle consistency filtering, and (c) a cleaner architecture.

Cheat Sheet

Aspect	CoTracker3
Input	Video + query point (frame, x, y)
Output	Per-frame (x, y) track + visibility + confidence
Backbone	Convolutional feature extractor (4 scales)
Core mechanism	4D correlation + transformer iterative refinement
Training data	Kubric (synthetic) + 15k real videos (pseudo-labeled)
Pseudo-label filter	Cycle consistency (forward-backward test)
Parameters	25M (about 1.8× fewer than CoTracker's 45M)
Key result	SOTA on TAP-Vid with 1000× less data than BootsTAPIR
Variants	Online (streaming) + Offline (full video)

The broader lesson: When training data is the bottleneck, don't collect more data — get better labels from the data you have. Multiple imperfect teachers + cheap consistency checks can outperform massive brute-force data collection. Simplicity in architecture and training protocol scales better than complexity.

What general machine learning principle does CoTracker3's pseudo-labeling approach exemplify?

Reinforcement learning from human feedback
Semi-supervised / Noisy Student learning: train teachers on labeled data, generate filtered pseudo-labels on unlabeled data, train a student that surpasses the teachers through data diversity and noise averaging
Generative adversarial training

Simpler and Better Point Tracking by Pseudo-Labelling Real Videos