Track-On2:
Enhancing Online Point Tracking with Memory


Gorkay Aydemir1
Weidi Xie3
Fatma Guney1, 2


1 Department of Computer Engineering, Koc University
2 KUIS AI Center
3 School of Artificial Intelligence, Shanghai Jiao Tong University









TL;DR: We present Track-On2, a streamlined memory-augmented transformer for online long-term point tracking. It tracks points frame by frame without access to future frames, so no windowed or full-video processing is needed, and it runs at high FPS with a low GPU-memory footprint.






Method Overview

Method overview figure

We introduce Track-On2, a simple transformer-based method for online, frame-by-frame point tracking. The pipeline has three parts:
(i) Visual Encoder (bottom-left), which extracts multi-scale features from each frame with a DINOv3-based ViT-Adapter and fuses them via an FPN;
(ii) Query Decoder, which decodes interest-point queries by attending to current-frame features and the memory propagated from the previous frame;
(iii) Point Prediction (right), which estimates correspondences in a coarse-to-fine manner: first by patch classification from feature similarity, then by offset regression from the top patch candidates. Before the top patches are selected, candidates are re-ranked by enriching each query with local information from the top-k patches. After re-ranking, the refined queries are written to memory for the next frame.
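The coarse-to-fine step in (iii) can be illustrated with a minimal sketch. This is not the Track-On2 implementation: the names `coarse_to_fine`, `offset_fn`, `grid_w`, and `stride` are hypothetical, cosine similarity stands in for the learned patch classifier, the offset regressor is passed in as a plain function, and the re-ranking step is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def coarse_to_fine(query, patch_feats, grid_w, stride, offset_fn, k=1):
    """Locate a point: coarse patch selection, then a sub-patch offset.

    patch_feats is a row-major list of per-patch feature vectors.
    offset_fn is a stand-in (assumption) for a learned offset regressor.
    Returns the predicted (x, y) in pixels and the top-k patch indices.
    """
    # Coarse stage: score every patch against the query by similarity.
    sims = [cosine(query, feat) for feat in patch_feats]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    best = top[0]

    # Convert the best patch index to the pixel coordinates of its center.
    row, col = divmod(best, grid_w)
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride

    # Fine stage: add an offset predicted relative to the patch center.
    dx, dy = offset_fn(query, patch_feats[best])
    return (cx + dx, cy + dy), top


# Toy usage: a 2x2 patch grid with 4-pixel patches and a fixed offset.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
point, candidates = coarse_to_fine(
    [0.0, 1.0], patches, grid_w=2, stride=4,
    offset_fn=lambda q, f: (0.5, -1.0), k=2,
)
```

In this toy case the query matches the second patch best, so the prediction is that patch's center shifted by the offset. Returning the top-k indices leaves room for a re-ranking stage before the final choice.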






Quantitative Results

We report δavg (higher is better) for BootsTAPNext-B, CoTracker3 (Video), and Track-On2. Track-On2 achieves the best δavg on four of five datasets (DAVIS, RoboTAP, Dynamic Replica, and PointOdyssey) and is second on Kinetics (69.3 vs. 70.6 for BootsTAPNext-B). These results indicate robustness across domains (internet videos, robotics, synthetic scenes) and time scales, from short clips to very long sequences, while operating fully online without future frames.

Method         DAVIS   Kinetics   RoboTAP   Dynamic Replica   PointOdyssey
BootsTAPNext   78.5    70.6       75.0      46.2              9.9
CoTracker3     76.9    67.8       78.0      72.3              44.5
Track-On2      79.9    69.3       80.5      74.5              45.1
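For readers unfamiliar with the metric, δavg as commonly defined in the TAP-Vid benchmark is the fraction of visible points predicted within a pixel threshold of the ground truth, averaged over thresholds of 1, 2, 4, 8, and 16 pixels. The sketch below illustrates that definition for a flat list of points; the function name `delta_avg` and the input layout are assumptions, and the exact evaluation protocol used for the numbers above (resolution, query mode) is not specified here.

```python
import math

# Pixel thresholds commonly used for delta_avg in TAP-Vid-style evaluation.
THRESHOLDS = (1, 2, 4, 8, 16)


def delta_avg(pred, gt, visible):
    """Average position accuracy (in percent) over the pixel thresholds.

    pred, gt: lists of (x, y) tuples; visible: list of bools.
    Only points marked visible in the ground truth are scored.
    """
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible points to evaluate")
    # Fraction of visible points strictly within each threshold.
    fractions = [sum(e < t for e in errors) / len(errors) for t in THRESHOLDS]
    return 100 * sum(fractions) / len(fractions)


# Toy usage: one exact hit, one 3-pixel miss, one occluded point (ignored).
score = delta_avg(
    pred=[(0, 0), (3, 0), (100, 100)],
    gt=[(0, 0), (0, 0), (0, 0)],
    visible=[True, True, False],
)
```

Here the exact hit counts at every threshold, while the 3-pixel miss counts only at 4, 8, and 16 pixels, so the per-threshold fractions are 0.5, 0.5, 1, 1, 1.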





Efficiency

Efficiency plot figure

Inference efficiency vs. memory length (Li) when tracking N points. With our default Li=72, Track-On2 tracks 256 points at more than 30 FPS using 0.79 GB of GPU memory, which makes it real-time capable.
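One reason an online tracker's footprint can stay flat regardless of video length is that its memory is bounded. The sketch below illustrates that idea with a fixed-length buffer of 72 entries. It is a simplified illustration, not the Track-On2 memory module: the class `PointMemory`, the function `track_online`, the `step_fn` callback, and the first-in-first-out eviction policy are all assumptions.

```python
from collections import deque


class PointMemory:
    """Fixed-length memory for one tracked point (FIFO eviction assumed)."""

    def __init__(self, max_len=72):
        # deque with maxlen drops the oldest entry once the buffer is full.
        self.slots = deque(maxlen=max_len)

    def write(self, entry):
        self.slots.append(entry)

    def read(self):
        return list(self.slots)


def track_online(frames, init_query, step_fn, max_len=72):
    """Process frames one at a time; no future frame is ever accessed.

    step_fn(frame, query, memory_entries) is a hypothetical stand-in for
    the per-frame model call; it returns (prediction, refined_query).
    """
    memory = PointMemory(max_len)
    predictions = []
    for frame in frames:
        prediction, refined = step_fn(frame, init_query, memory.read())
        memory.write(refined)  # refined query is stored for later frames
        predictions.append(prediction)
    return predictions, memory


# Toy usage: 100 dummy frames; the memory never grows past 72 entries.
preds, mem = track_online(range(100), init_query=None,
                          step_fn=lambda f, q, m: (f, f))
```

Because the buffer length is fixed, the per-frame cost of reading memory depends on the memory length and the number of tracked points, not on how many frames have already been processed.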






Qualitative Results

Here, we provide qualitative results on three datasets: DAVIS, Kinetics, and RoboTAP.
DAVIS (6 examples)
Kinetics (3 examples)
RoboTAP (3 examples)

Paper

Track-On2: Enhancing Online Point Tracking with Memory

Gorkay Aydemir, Weidi Xie and Fatma Guney


@article{Aydemir2025TrackOn2,
  title={{Track-On2}: Enhancing Online Point Tracking with Memory},
  author={Aydemir, G\"orkay and Xie, Weidi and G\"uney, Fatma},
  journal={arXiv preprint arXiv:2509.19115},
  year={2025}
}