
Article Abstract

Benefiting from advances in large-scale pre-training, foundation models have demonstrated remarkable capability in natural language processing, computer vision, and other fields. However, to achieve expert-level performance in specific applications, such models often need to be fine-tuned with domain-specific knowledge. In this paper, we focus on enabling vision-language models to unlock more of their potential for visual understanding tasks under few-shot tuning. Specifically, we propose a novel adapter for tuning the CLIP model, dubbed ClusterAdapter, which is based on a trainable multiple-prototype clustering algorithm. It not only alleviates the risk of catastrophic forgetting in the foundation model by introducing anchors that inherit common knowledge, but also makes more efficient use of the few annotated samples by bringing in clustering and domain priors, thereby improving few-shot tuning performance. We have conducted extensive experiments on 11 common classification benchmarks. The results show that our method significantly surpasses the original CLIP and achieves state-of-the-art (SOTA) performance on all benchmarks and settings. For example, under the 16-shot setting, our method improves over the original CLIP by a remarkable 19.6%, and also surpasses TIP-Adapter and GraphAdapter by 2.7% and 2.2%, respectively, in terms of average accuracy across the 11 benchmarks.
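
The abstract describes the mechanism only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the core idea it names: frozen CLIP class embeddings act as anchors that preserve common knowledge, while several trainable prototypes per class adapt to the few-shot domain. The class name `MultiPrototypeAdapter`, the max-over-prototypes scoring, and the `alpha` blending weight are assumptions for illustration, not the paper's actual ClusterAdapter implementation.

```python
# Hypothetical sketch of a multiple-prototype adapter over frozen CLIP
# features. Names and design details are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPrototypeAdapter(nn.Module):
    def __init__(self, clip_text_embeds: torch.Tensor, num_prototypes: int = 4,
                 alpha: float = 0.5):
        # clip_text_embeds: (num_classes, dim) frozen zero-shot class
        # embeddings, kept as anchors so CLIP's common knowledge is preserved.
        super().__init__()
        self.register_buffer("anchors", F.normalize(clip_text_embeds, dim=-1))
        # Several trainable prototypes per class, initialized near the anchor
        # so training starts from the zero-shot solution.
        init = clip_text_embeds[:, None, :].repeat(1, num_prototypes, 1)
        self.prototypes = nn.Parameter(init + 0.01 * torch.randn_like(init))
        self.alpha = alpha  # blend between adapter and zero-shot logits

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, dim) features from the frozen CLIP image encoder.
        x = F.normalize(image_feats, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)   # (C, K, D)
        sim = torch.einsum("bd,ckd->bck", x, protos)    # (B, C, K)
        adapter_logits = sim.max(dim=-1).values         # best prototype per class
        zeroshot_logits = x @ self.anchors.t()          # anchor similarity
        return self.alpha * adapter_logits + (1 - self.alpha) * zeroshot_logits
```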

Source
http://dx.doi.org/10.1109/TPAMI.2024.3460180 (DOI listing)

Publication Analysis

Top Keywords

vision-language models: 8
multiple prototypes: 8
prototypes clustering: 8
foundation models: 8
few-shot tuning: 8
original clip: 8
models: 5
tuning: 4
tuning vision-language: 4
models multiple: 4

Similar Publications

Introduction: Vision-language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image-classification models for diagnosing optic disc swelling, since they can take clinical context into account. In this study, we compare the performance of non-specialty-trained VLMs, using different prompts, in classifying optic disc swelling on fundus photographs.
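
As a rough sketch of what such a prompt comparison might look like in code (the snippet does not describe the authors' setup; `query_vlm` and the prompt texts below are placeholders):

```python
# Toy harness for comparing prompt variants on a VLM. `query_vlm` is a
# stand-in for whatever vision-language API the study used; the prompt
# texts are invented for illustration.
from typing import Callable

PROMPTS = {
    "plain": "Is the optic disc in this fundus photograph swollen? Answer yes or no.",
    "with_context": "Clinical context: {context}. Given this fundus photograph, "
                    "is the optic disc swollen? Answer yes or no.",
}

def classify(image_path: str, context: str,
             query_vlm: Callable[[str, str], str]) -> dict[str, bool]:
    """Run every prompt variant through the VLM and collect yes/no answers."""
    answers = {}
    for name, template in PROMPTS.items():
        prompt = template.format(context=context) if "{context}" in template else template
        answers[name] = "yes" in query_vlm(image_path, prompt).lower()
    return answers
```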

Temporal modeling plays an important role in effectively adapting powerful pretrained text-image foundation models to text-video retrieval. However, existing methods often rely on additional heavy trainable modules, such as transformers or BiLSTMs, which are inefficient. In contrast, we avoid introducing such heavy components by leveraging frozen foundation models.
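
A minimal sketch of the general recipe the snippet hints at, with parameter-free mean pooling over frozen per-frame features standing in for the temporal model; the pooling choice and function names are assumptions, not the paper's method:

```python
# Minimal sketch: aggregate per-frame features from a frozen image-text
# backbone with no trainable temporal module. Mean pooling is an assumed
# stand-in for the paper's actual mechanism.
import torch
import torch.nn.functional as F

def video_embedding(frame_feats: torch.Tensor) -> torch.Tensor:
    # frame_feats: (num_frames, dim) from a frozen CLIP-style image encoder.
    return F.normalize(frame_feats.mean(dim=0), dim=-1)

def retrieve(text_feat: torch.Tensor, videos: list[torch.Tensor]) -> int:
    # Rank candidate videos by cosine similarity to the (frozen) text embedding.
    t = F.normalize(text_feat, dim=-1)
    sims = torch.stack([t @ video_embedding(v) for v in videos])
    return int(sims.argmax())
```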

Significant progress has been made in applying deep learning to the automatic diagnosis of skin lesions. However, most models remain unexplainable, which severely hinders their application in clinical settings. Concept-based ante-hoc interpretable models can clarify the diagnostic decision-making process by learning high-level, human-understandable concepts, but they typically provide only numerical values for each concept's contribution.
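
For readers unfamiliar with concept-based ante-hoc models, here is a generic concept-bottleneck sketch showing where such numerical concept contributions come from; the architecture and names are illustrative, not this paper's model:

```python
# Generic concept-bottleneck sketch (illustrative, not this paper's model):
# an image encoder predicts human-readable concept scores, and a linear head
# maps concepts to a diagnosis, so each concept's numerical contribution to a
# class logit is simply weight * activation.
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 num_concepts: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                       # any image encoder
        self.to_concepts = nn.Linear(feat_dim, num_concepts)
        self.head = nn.Linear(num_concepts, num_classes)

    def forward(self, x: torch.Tensor):
        concepts = torch.sigmoid(self.to_concepts(self.backbone(x)))  # (B, K)
        logits = self.head(concepts)                                  # (B, C)
        # Per-concept contribution to each class logit: (B, C, K).
        contributions = self.head.weight.unsqueeze(0) * concepts.unsqueeze(1)
        return logits, concepts, contributions
```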

The Segment Anything Model (SAM) has attracted considerable attention for its impressive performance and shows potential in medical image segmentation. Compared to SAM's native point and bounding-box prompts, text prompts offer a simpler and more efficient alternative in the medical field, yet this approach remains relatively underexplored. In this paper, we propose a SAM-based framework that integrates a pre-trained vision-language model to generate referring prompts, with SAM handling the segmentation task.
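
A hedged sketch of such a two-stage pipeline: a grounding step turns the text prompt into a region, which SAM then consumes as a prompt. `text_to_box` is a placeholder for the referring/grounding model; the SAM calls follow the public `segment_anything` API, though the paper's framework may differ:

```python
# Hedged sketch of a two-stage text-prompted pipeline: a grounding model
# turns the text into a box, which SAM consumes as a prompt.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def text_to_box(image: np.ndarray, text: str) -> np.ndarray:
    """Placeholder: a grounding VLM would return an XYXY box (shape (4,))
    for the structure referred to by `text`."""
    raise NotImplementedError

def segment_by_text(image: np.ndarray, text: str, checkpoint: str) -> np.ndarray:
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)                          # RGB uint8, HxWx3
    box = text_to_box(image, text)                      # referring prompt
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]                                     # binary mask, HxW
```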

Foundation models, including large language models and vision-language models (VLMs), have revolutionized artificial intelligence by enabling efficient, scalable, and multimodal learning across diverse applications. By leveraging advancements in self-supervised and semi-supervised learning, these models integrate computer vision and natural language processing to address complex tasks, such as disease classification, segmentation, cross-modal retrieval, and automated report generation. Their ability to pretrain on vast, uncurated datasets minimizes reliance on annotated data while improving generalization and adaptability for a wide range of downstream tasks.
