Publications

You can also find my articles on my Google Scholar profile.

Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

Published in ICML, 2024

Recent research shows that using a pre-trained vision-language model like CLIP to align a query image with finer text descriptions from a large language model enhances zero-shot performance. We find that these descriptions align better with local image areas than with the whole image. To leverage this, we introduce weighted visual-text cross alignment (WCA), which uses localized visual prompting to identify these areas and align them with the descriptions, yielding a similarity matrix. Our score function, based on weighted similarities, significantly improves zero-shot performance, achieving results comparable to few-shot learning methods.
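The core idea in the abstract, a similarity matrix between localized image crops and text descriptions, aggregated by a weighted score, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `wca_score`, the uniform-weight example, and the specific weighting inputs are assumptions; the paper defines its own weighting scheme.

```python
import numpy as np

def wca_score(crop_embs, desc_embs, crop_weights, desc_weights):
    """Hypothetical sketch of a weighted visual-text cross-alignment score.

    crop_embs:    (M, D) L2-normalized embeddings of localized image crops
    desc_embs:    (N, D) L2-normalized embeddings of class text descriptions
    crop_weights: (M,) importance weights over crops (summing to 1)
    desc_weights: (N,) importance weights over descriptions (summing to 1)
    """
    sim = crop_embs @ desc_embs.T            # (M, N) cross-alignment similarity matrix
    return crop_weights @ sim @ desc_weights  # scalar: weighted aggregate of all pairs

# Toy example: two crops, one description, uniform crop weights.
crops = np.array([[1.0, 0.0],    # crop aligned with the description
                  [0.0, 1.0]])   # crop orthogonal to it
descs = np.array([[1.0, 0.0]])
score = wca_score(crops, descs, np.array([0.5, 0.5]), np.array([1.0]))
```

In zero-shot classification, such a score would be computed per class (each class having its own descriptions), and the image assigned to the class with the highest score.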

Recommended citation: Jinhao Li, Haopeng Li, Sarah Erfani, Lei Feng, James Bailey, and Feng Liu. (2024). "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models." Proceedings of the International Conference on Machine Learning (ICML).
Download Paper