Portfolio item number 1
Short description of portfolio item number 1
Short description of portfolio item number 2
Published in ICML, 2024
Recent research shows that using a pre-trained vision-language model like CLIP to align a query image with finer-grained text descriptions generated by a large language model enhances zero-shot performance. We find that these descriptions align better with local image areas than with the whole image. To leverage this, we introduce weighted visual-text cross alignment (WCA), which uses localized visual prompting to identify these areas and align them with the descriptions, producing a similarity matrix. Our score function, based on weighted similarities, significantly improves zero-shot performance, achieving results comparable to few-shot learning methods.
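The cross-alignment idea above can be sketched in a few lines. This is a minimal illustration, not the paper's exact method: it assumes region and description embeddings have already been produced by a CLIP-style encoder, and the softmax-over-max weighting used here is an illustrative stand-in for the paper's weighted score function.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def wca_score(region_embs, text_embs):
    """Toy weighted visual-text cross-alignment score.

    region_embs: (num_regions, dim) embeddings of local image crops.
    text_embs:   (num_descriptions, dim) embeddings of LLM descriptions.
    The weighting scheme below is an assumption for illustration,
    not the exact formulation from the paper.
    """
    # Normalize so dot products are cosine similarities (CLIP-style).
    V = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    S = V @ T.T  # similarity matrix: regions x descriptions
    # Weight each region / description by how strongly it aligns
    # with its best counterpart on the other side.
    w_regions = softmax(S.max(axis=1))
    w_texts = softmax(S.max(axis=0))
    # Weighted aggregate of the similarity matrix.
    return float(w_regions @ S @ w_texts)
```

In a zero-shot setup, one would compute this score between the query image's regions and each candidate class's descriptions, then predict the class with the highest score.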
Recommended citation: Jinhao Li, Haopeng Li, Sarah Erfani, Lei Feng, James Bailey, and Feng Liu. (2024). "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models." Proceedings of the International Conference on Machine Learning (ICML).
Download Paper
Published:
This is a description of your talk, which is a markdown file that can be markdown-ified like any other post. Yay markdown!
Published:
This is a description of your conference proceedings talk; note the different value in the type field. You can put anything in this field.
Master course, The University of Melbourne, Computing and Information Systems, 2023
Master course, The University of Melbourne, Computing and Information Systems, 2024