Zero-shot point tracking on TAP-Vid DAVIS. Predictions are indicated by red stars, representing the most similar locations, while ground truth correspondences are shown as blue stars.
Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities.
While their proficiency in two-view correspondence has been explored, their effectiveness in long-term point correspondence within complex environments remains largely unexplored.
To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking:
(i) in zero-shot settings, without any training;
(ii) by probing with low-capacity layers;
(iii) by fine-tuning with Low Rank Adaptation (LoRA).
Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings.
Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.
We leverage correlation maps, specifically cosine similarity between the query and target view features, to establish correspondences between two views:
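The correlation map described above can be sketched as follows: a minimal numpy example that computes the cosine similarity between a single query feature and every spatial location of the target-view feature map (function and variable names are illustrative, not from the paper's code).

```python
import numpy as np

def correlation_map(query_feat, target_feats, eps=1e-8):
    """Cosine similarity between one query feature of shape (D,)
    and target-view features of shape (H, W, D); returns an (H, W) map."""
    q = query_feat / (np.linalg.norm(query_feat) + eps)
    t = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + eps)
    return t @ q  # (H, W) map; warmer (higher) values mean higher similarity
```

In the zero-shot setting, the predicted correspondence is simply the argmax of this map, e.g. `np.unravel_index(np.argmax(c), c.shape)`.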
(i) Zero-Shot: We assess the geometric awareness of the encoders by selecting the most similar location from the correlation map.
(ii) Probing: We perform probing by training low-capacity layers on the correlation map, computed from a frozen backbone, akin to linear probing.
(iii) Adaptation: We fine-tune the model using Low Rank Adaptation (LoRA) to evaluate its effectiveness as an initialization method.
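The adaptation setting can be illustrated with a minimal sketch of the LoRA idea: a frozen pretrained weight plus a trainable low-rank update scaled by alpha / r. This is a hypothetical numpy illustration of the general technique, not the fine-tuning code used in the paper.

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer (illustrative names).

    The pretrained weight W stays frozen; only the low-rank factors
    A (r x d_in) and B (d_out x r) would be trained.
    """
    def __init__(self, W, r=4, alpha=4, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                       # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, r))                    # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Zero-initialized B means the layer starts identical to the
        # pretrained one; training moves only the rank-r update B @ A.
        return x @ (self.W + self.scale * (self.B @ self.A)).T
```

With rank r much smaller than the layer dimensions, the number of learnable parameters is a small fraction of the full weight, which is why the adaptation setups in the table remain lightweight.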
Zero-Shot Evaluation. This table reports zero-shot evaluation results on the TAP-Vid datasets. On average, Stable Diffusion exhibits superior geometric correspondence abilities in zero-shot settings.
Probing and Adapting DINOv2. This table shows different setups for DINOv2 on TAP-Vid DAVIS, including the number of learnable parameters. The setups cover zero-shot, probing, and adaptation with various LoRA ranks. DINOv2 achieves better performance than supervised models under a lighter training setup.
Qualitative results on TAP-Vid DAVIS. These GIFs show the performance of zero-shot point tracking on TAP-Vid DAVIS. Query points from four videos are used to generate correlation maps. The models evaluated are Stable Diffusion, DINOv2, and SAM. Predictions are indicated by red stars, representing the most similar locations, while ground truth correspondences are shown as blue stars. Red lines connect them to illustrate the spatial difference, or error, between the ground truth and the prediction. Warmer colors in the correlation maps indicate higher similarity.
Görkay Aydemir, Weidi Xie, and Fatma Güney
@article{aydemir2024can,
title={Can Visual Foundation Models Achieve Long-term Point Tracking?},
author={Aydemir, G{\"o}rkay and Xie, Weidi and G{\"u}ney, Fatma},
journal={arXiv preprint arXiv:2408.13575},
year={2024}
}