Zero-shot point tracking on TAP-Vid DAVIS. Predictions are indicated by red stars, representing the most similar locations, while ground truth correspondences are shown as blue stars.
Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities.
While their proficiency in two-view correspondence has been explored, their effectiveness in long-term point correspondence within complex environments remains largely unexplored.
To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking:
(i) in zero-shot settings, without any training;
(ii) by probing with low-capacity layers;
(iii) by fine-tuning with Low Rank Adaptation (LoRA).
Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings.
Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.
We leverage correlation maps, specifically cosine similarity between the query and target view features, to establish correspondences between two views:
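The correlation map described above can be sketched as follows: a minimal numpy example that computes the cosine similarity between a single query feature and every spatial location of the target-view feature map (function and variable names are illustrative, not from the paper's code).

```python
import numpy as np

def correlation_map(query_feat, target_feats, eps=1e-8):
    """Cosine similarity between one query feature of shape (D,)
    and target-view features of shape (H, W, D); returns an (H, W) map."""
    q = query_feat / (np.linalg.norm(query_feat) + eps)
    t = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + eps)
    return t @ q  # (H, W) map; warmer (higher) values mean higher similarity
```

In the zero-shot setting, the predicted correspondence is simply the argmax of this map, e.g. `np.unravel_index(np.argmax(c), c.shape)`.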
(i) Zero-Shot: We assess the geometric awareness of the encoders by selecting the most similar location from the correlation map.
(ii) Probing: We perform probing by training low-capacity layers on the correlation map, computed from a frozen backbone, akin to linear probing.
(iii) Adaptation: We fine-tune the model using Low Rank Adaptation (LoRA) to evaluate its effectiveness as an initialization method.
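The adaptation setting can be illustrated with a minimal sketch of the LoRA idea: a frozen pretrained weight plus a trainable low-rank update scaled by alpha / r. This is a hypothetical numpy illustration of the general technique, not the fine-tuning code used in the paper.

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer (illustrative names).

    The pretrained weight W stays frozen; only the low-rank factors
    A (r x d_in) and B (d_out x r) would be trained.
    """
    def __init__(self, W, r=4, alpha=4, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                       # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, r))                    # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # Zero-initialized B means the layer starts identical to the
        # pretrained one; training moves only the rank-r update B @ A.
        return x @ (self.W + self.scale * (self.B @ self.A)).T
```

With rank r much smaller than the layer dimensions, the number of learnable parameters is a small fraction of the full weight, which is why the adaptation setups in the table remain lightweight.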
Zero-Shot Evaluation. This table reports zero-shot evaluation results on the TAP-Vid datasets. On average, Stable Diffusion exhibits superior geometric correspondence abilities in zero-shot settings.
Probing and Adapting DINOv2. This table shows different setups for DINOv2 on TAP-Vid DAVIS, including the number of learnable parameters. The setups cover zero-shot, probing, and adaptation with various LoRA ranks. DINOv2 achieves better performance than supervised models under a lighter training setup.
Qualitative results on TAP-Vid DAVIS. These GIFs show the performance of zero-shot point tracking on TAP-Vid DAVIS. Query points from four videos are used to generate correlation maps. The models evaluated are Stable Diffusion, DINOv2, and SAM. Predictions are indicated by red stars, representing the most similar locations, while ground truth correspondences are shown as blue stars. Red lines connect them to illustrate the spatial difference, or error, between the ground truth and the prediction. Warmer colors in the correlation maps indicate higher similarity.
Görkay Aydemir, Weidi Xie, and Fatma Güney
@article{aydemir2024can,
title={Can Visual Foundation Models Achieve Long-term Point Tracking?},
author={Aydemir, G{\"o}rkay and Xie, Weidi and G{\"u}ney, Fatma},
journal={arXiv preprint arXiv:2408.13575},
year={2024}
}