Adapting vision-language AI models to cardiology tasks.

Nat Med

Department of Medicine, Division of Cardiology, Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.

Published: May 2024


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Source: http://dx.doi.org/10.1038/s41591-024-02956-1

Publication Analysis

Top Keywords

adapting vision-language: 4
vision-language models: 4
models cardiology: 4
cardiology tasks: 4
adapting: 1
models: 1
cardiology: 1
tasks: 1

Similar Publications

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized.
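For context, the sketch below shows how such a prompt-driven detect-then-segment pipeline is typically wired together: an open-vocabulary detector turns the text prompt into boxes, and a promptable segmenter turns each box into a mask. The `detector` and `segmenter` objects stand in for Grounding DINO and SAM2 wrappers, and their `detect`/`predict_mask` methods are illustrative assumptions, not the real model APIs.

```python
# Illustrative sketch of a prompt-driven detect-then-segment pipeline.
# `detector` and `segmenter` are hypothetical wrappers for Grounding DINO
# and SAM2; their detect/predict_mask methods are assumptions, not real APIs.
from typing import List, Tuple

import numpy as np


def segment_with_text_prompt(
    image: np.ndarray,
    text_prompt: str,
    detector,                     # open-vocabulary detector (e.g. Grounding DINO)
    segmenter,                    # promptable segmenter (e.g. SAM2)
    box_threshold: float = 0.35,
) -> List[np.ndarray]:
    """Detect regions matching a text prompt, then segment each detected box."""
    # 1. Text prompt -> candidate bounding boxes (xyxy) above a score threshold.
    boxes: List[Tuple[float, float, float, float]] = detector.detect(
        image, text_prompt, box_threshold=box_threshold
    )
    # 2. Each box seeds a mask prediction from the promptable segmenter.
    return [segmenter.predict_mask(image, box=box) for box in boxes]
```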

Prompt tuning, a recently emerging paradigm, adapts vision-language pre-trained models to new tasks efficiently by learning "soft prompts" for frozen models. However, in few-shot scenarios, its effectiveness is limited by sensitivity to the initialization and the time-consuming search for optimal initialization, hindering rapid adaptation. Additionally, prompt tuning risks reducing the models' generalizability due to overfitting on scarce training samples.
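A minimal sketch of what "learning soft prompts for a frozen model" looks like in practice is shown below (PyTorch). The module, its shapes, and the assumption that the frozen encoder accepts a [batch, seq_len, dim] embedding tensor are illustrative, not the paper's implementation.

```python
# Minimal soft-prompt-tuning sketch (PyTorch): only the prompt vectors train.
# The frozen encoder is assumed to accept [batch, seq_len, dim] embeddings.
import torch
import torch.nn as nn


class SoftPromptWrapper(nn.Module):
    def __init__(self, frozen_encoder: nn.Module, prompt_len: int, dim: int):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():       # freeze the backbone
            p.requires_grad = False
        # Learnable continuous "soft prompt" of prompt_len embeddings.
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # Prepend the shared prompt to every sequence in the batch.
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompt, token_embeds], dim=1))
```

Because the backbone stays frozen, only prompt_len × dim parameters are updated, which is why the few-shot sensitivity to prompt initialization noted above matters so much.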

LLaVA-Pose: Keypoint-Integrated Instruction Tuning for Human Pose and Action Understanding.

Sensors (Basel)

August 2025

Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan.

Current vision-language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes.
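The sketch below shows one way such keypoint-integrated instruction samples could be assembled; the field names and the (instruction, response) schema are assumptions for illustration, not the paper's actual data format.

```python
# Illustrative assembly of a keypoint-augmented instruction-tuning sample.
# Field names and the (instruction, response) schema are assumptions.
from typing import Dict, List


def build_pose_instruction(
    caption: str,
    boxes: List[Dict],       # e.g. {"label": "person", "xyxy": [x1, y1, x2, y2]}
    keypoints: List[Dict],   # e.g. {"name": "left_elbow", "xy": [x, y]}
    question: str,
) -> Dict[str, str]:
    """Serialize caption, boxes, and keypoints into one textual context."""
    box_txt = "; ".join(f"{b['label']} at {b['xyxy']}" for b in boxes)
    kp_txt = "; ".join(f"{k['name']} at {k['xy']}" for k in keypoints)
    context = f"Caption: {caption}\nBoxes: {box_txt}\nKeypoints: {kp_txt}"
    # The context plus question forms the instruction; the answer would be
    # drafted separately (e.g. by a stronger LLM) to complete the pair.
    return {"instruction": f"{context}\n\n{question}", "response": ""}
```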

Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind the vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gnConv) that performs high-order spatial interactions with gated convolutions and recursive designs.
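A simplified sketch of the gating-plus-recursion idea is given below (PyTorch); it illustrates the mechanism only and deliberately omits the channel-splitting and scaling details of the published gnConv.

```python
# Simplified gated-convolution block in the spirit of gnConv (illustration only).
import torch
import torch.nn as nn


class SimpleGatedConv(nn.Module):
    def __init__(self, dim: int, order: int = 3, kernel_size: int = 7):
        super().__init__()
        self.proj_in = nn.Conv2d(dim, 2 * dim, kernel_size=1)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        # One pointwise mixer per additional interaction order.
        self.mixers = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size=1) for _ in range(order - 1)]
        )
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, feat = self.proj_in(x).chunk(2, dim=1)   # split into gate / value
        out = gate * self.dwconv(feat)                 # first-order interaction
        for mixer in self.mixers:                      # recursive higher orders
            out = gate * mixer(out)
        return self.proj_out(out)
```

Each pass through the loop multiplies the features by the gate again, so the output accumulates progressively higher-order spatial interactions without any attention computation.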

AbVLM-Q: intelligent quality assessment for abdominal ultrasound standard planes via vision-language modeling.

BMC Med Imaging

August 2025

Department of Ultrasound Medicine, The First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province, China.

Background: Abdominal ultrasound is non-invasive and efficient, yet acquiring standard planes remains challenging due to operator dependency and procedural complexity. We propose AbVLM-Q, a vision-language framework for automated quality assessment of abdominal ultrasound standard planes.

Methods: In this study, we assembled a multi-center dataset comprising 7,766 abdominal ultrasound scans, which were randomly divided into training (70%), validation (15%), and testing (15%) subsets.
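As a small aside, the random 70/15/15 partition described above can be reproduced in a few lines; the seed and helper name below are illustrative, not taken from the paper.

```python
# Minimal 70/15/15 random split sketch; seed and helper name are illustrative.
import random


def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=42):
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],                     # training (~70%)
            shuffled[n_train:n_train + n_val],      # validation (~15%)
            shuffled[n_train + n_val:])             # testing (remaining ~15%)
```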
