Track-On2: Enhancing Online Point Tracking with Memory

To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

TL;DR We present Track-On2, a streamlined memory-augmented transformer for online long-term point tracking. It tracks points frame-by-frame with no future frames, eliminating window/full-video processing, and achieves high FPS with a low GPU-memory footprint.






Method Overview

Method overview figure

We introduce Track-On2, a simple transformer-based method for online, frame-by-frame point tracking. The pipeline has three parts:
(i) Visual Encoder (top-left), which extracts multi-scale features from each frame with a DINOv3-based ViT-Adapter and fuses them via an FPN;
(ii) Query Decoder, which decodes interest-point queries by attending to current-frame features and the memory propagated from the previous frame;
(iii) Point Prediction (right), which estimates correspondences in a coarse-to-fine manner: first by patch classification from feature similarity, then by offset regression from the top patch candidates. Before the top patches are selected, we re-rank the candidates by enriching each query with local information from the top-k patches. After re-ranking, the refined queries are written to memory for the next frame.
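
The coarse-to-fine prediction in step (iii) can be illustrated with a short, dependency-free sketch. It is not the authors' implementation: it keeps only the single best patch and omits the top-k re-ranking, and the names `coarse_to_fine`, `offset_fn`, `grid_w`, and `patch_size` are hypothetical, introduced here for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine(query, patch_feats, grid_w, patch_size, offset_fn):
    """Estimate a point location for one query (illustrative only).

    query       : feature vector of the tracked point (list of floats)
    patch_feats : per-patch feature vectors in row-major order
    grid_w      : number of patches per row (assumed layout)
    patch_size  : patch side length in pixels (assumed)
    offset_fn   : hypothetical stand-in for the learned offset regressor;
                  returns (dx, dy) in pixels relative to the patch centre
    """
    # Coarse step: classify patches by dot-product similarity to the query.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in patch_feats]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)

    # Centre of the selected patch in pixel coordinates.
    row, col = divmod(best, grid_w)
    cx = (col + 0.5) * patch_size
    cy = (row + 0.5) * patch_size

    # Fine step: refine the patch centre with a regressed offset.
    dx, dy = offset_fn(query, patch_feats[best])
    return (cx + dx, cy + dy), probs[best]
```

In this sketch the classification confidence `probs[best]` is returned alongside the point; a fuller version would regress offsets from several top candidates after re-ranking, as the text describes.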






Quantitative Results

We report δavg for BootsTAPNext-B, CoTracker3 (Video), and Track-On2 (higher is better). Track-On2 achieves the best δavg on four of five datasets (DAVIS, RoboTAP, Dynamic Replica, and PointOdyssey) and is competitive on Kinetics. These results indicate robustness across domains (internet videos, robotics, synthetic scenes) and time scales, from short clips to very long sequences, while operating fully online without future frames.

Method        DAVIS  Kinetics  RoboTAP  Dynamic Replica  PointOdyssey
BootsTAPNext  78.5   70.6      75.0     46.2             9.9
CoTracker3    76.9   67.8      78.0     72.3             44.5
Track-On2     79.9   69.3      80.5     74.5             45.1
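
For readers unfamiliar with the metric: under the TAP-Vid benchmark protocol, δavg is the fraction of visible points predicted within 1, 2, 4, 8, and 16 pixels of the ground truth, averaged over the five thresholds. The sketch below assumes that definition. The function name `delta_avg` is hypothetical, and benchmark-specific details such as the evaluation resolution are left out.

```python
import math

# Pixel thresholds assumed from the TAP-Vid protocol.
THRESHOLDS_PX = (1, 2, 4, 8, 16)


def delta_avg(pred, gt, visible, thresholds=THRESHOLDS_PX):
    """Average position accuracy in percent over several pixel thresholds.

    pred, gt : sequences of (x, y) positions, one entry per point and frame
    visible  : sequence of booleans; only visible ground-truth points count
    """
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy), vis in zip(pred, gt, visible)
        if vis
    ]
    if not errors:
        raise ValueError("no visible points to evaluate")

    # Fraction of visible points strictly within each threshold.
    per_threshold = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return 100.0 * sum(per_threshold) / len(thresholds)
```

Occluded points are excluded from the average, so the score reflects position accuracy only on frames where the ground-truth point is visible.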





Efficiency

Efficiency plot figure

Inference efficiency vs. memory length (Li) when tracking N points. With the default Li = 72, Track-On2 tracks 256 points at more than 30 FPS using 0.52 GB of GPU memory, which makes it real-time capable.
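
One way to see why a fixed memory length keeps per-frame cost bounded is a per-point buffer that drops its oldest entry once it is full. The sketch below assumes such first-in, first-out eviction, which the text above does not specify. The class `QueryMemory` and its methods are hypothetical names, not part of the released code.

```python
from collections import deque


class QueryMemory:
    """Fixed-length per-point memory (illustrative, FIFO eviction assumed)."""

    def __init__(self, num_points, memory_len=72):
        # One bounded buffer per tracked point; deque(maxlen=...) discards
        # the oldest entry automatically when a new one is appended.
        self.buffers = [deque(maxlen=memory_len) for _ in range(num_points)]

    def write(self, refined_queries):
        """Store this frame's refined query for every tracked point."""
        for buf, query in zip(self.buffers, refined_queries):
            buf.append(query)

    def read(self, point_idx):
        """Return the stored queries for one point, oldest first."""
        return list(self.buffers[point_idx])
```

Because each buffer never holds more than `memory_len` entries, the amount of memory attended to per frame stays constant regardless of video length.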






Qualitative Results

Here, we provide qualitative results on three datasets: DAVIS, Kinetics, and RoboTAP.
DAVIS (6 examples)
Kinetics (3 examples)
RoboTAP (3 examples)

Paper

Track-On2: Enhancing Online Point Tracking with Memory

Görkay Aydemir, Weidi Xie and Fatma Güney


@article{aydemir2025trackon2,
  title   = {Track-On2: Enhancing Online Point Tracking with Memory},
  author  = {Aydemir, G\"orkay and Xie, Weidi and G\"uney, Fatma},
  journal = {arXiv preprint arXiv:2509.19115},
  year    = {2025}
}