Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.

J Biomed Inform

Monash Biomedical Imaging, Monash University, Melbourne, 3800, Victoria, Australia.

Published: December 2024


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Medical Visual Question Answering (VQA) aims to answer questions about medical images, drawing on both visual and textual information in the reasoning process. The absence of large-scale annotated medical VQA datasets presents a formidable obstacle to training a medical VQA model from scratch in an end-to-end manner. Existing works use image captioning datasets in the pre-training stage and then fine-tune on downstream VQA tasks. Following the same paradigm, we use a collection of public medical image captioning datasets to pre-train multimodal models in a self-supervised setup and fine-tune them on downstream medical VQA tasks. In this work, we propose a method featuring Cross-Modal pre-training with Multiple Objectives (CMMO), which includes masked image modeling, masked language modeling, image-text matching, and image-text contrastive learning. The proposed method is designed to associate the visual features of medical images with the corresponding medical concepts in captions, learning aligned vision and language feature representations as well as multi-modal interactions. The experimental results show that our proposed CMMO method outperforms state-of-the-art methods on three public medical VQA datasets, with absolute improvements of 2.6%, 0.9%, and 4.0% on the VQA-RAD, PathVQA, and SLAKE datasets, respectively. We also conduct comprehensive ablation studies to validate our method and visualize attention maps, which demonstrate strong interpretability. The code and pre-trained weights will be released at https://github.com/pengfeiliHEU/CMMO.
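As a point of reference, the sketch below shows how one of the four objectives named in the abstract, image-text contrastive learning, is commonly implemented as a symmetric InfoNCE loss. The function name, tensor shapes, and temperature value are illustrative assumptions rather than the released CMMO code.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Minimal sketch of an InfoNCE-style image-text contrastive objective.

    img_emb, txt_emb: (B, D) embeddings of B matched image-caption pairs.
    Matched pairs (the diagonal of the similarity matrix) are treated as
    positives; every other pair in the batch serves as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In a multi-objective pre-training setup like the one described, a term of this form would typically be summed, possibly with a weighting coefficient, with the masked image modeling, masked language modeling, and image-text matching losses before back-propagation.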


Source
http://dx.doi.org/10.1016/j.jbi.2024.104748

Publication Analysis

Top Keywords

medical vqa (16)
medical (10)
vision language (8)
pre-training multiple (8)
multiple objectives (8)
medical visual (8)
visual question (8)
question answering (8)
medical images (8)
vqa datasets (8)

Similar Publications

Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluated our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images.
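For context on the second innovation mentioned above, the snippet below sketches a SigLIP-style pairwise sigmoid image-text loss. The temperature and bias defaults and the tensor names are assumptions for illustration, not Med3DVLM's actual implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          t: float = 10.0,
                          b: float = -10.0) -> torch.Tensor:
    """Sketch of a SigLIP-style pairwise sigmoid image-text loss.

    Each of the B x B image-text pairs is scored independently with a
    binary label (+1 on the diagonal for matched pairs, -1 elsewhere),
    so alignment does not rely on softmax normalisation over a large
    batch of negatives. Temperature t and bias b are illustrative values.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b             # (B, B) pairwise scores
    labels = 2.0 * torch.eye(img_emb.size(0), device=img_emb.device) - 1.0
    # Mean negative log-sigmoid of label * score over all pairs.
    return -F.logsigmoid(labels * logits).mean()
```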


Objective: Introduce the task of wound care multimodal multilingual visual question answering, provide baseline performances, and identify areas of future study.

Methods: A dataset of wound care multimodal multilingual visual question answering (VQA) was created using consumer health questions asked online. Practicing US medical doctors were tasked with providing metadata and expert response labels.


Ophthalmic practice involves the integration of diverse clinical data and interactive decision-making, posing challenges for traditional artificial intelligence (AI) systems. Visual question answering (VQA) addresses this by combining computer vision and natural language processing to interpret medical images through user-driven queries. Evolving from VQA, multimodal AI agents enable continuous dialogue, tool use and context-aware clinical decision support.


The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the LLMs' capability for multitask learning or lacking clinical accuracy. This article presents M4CXR, a multimodal LLM designed to enhance CXR interpretation.


Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to privacy concerns around patient data, training a VQA model on previously used data is restricted, making it necessary to use an exemplar-free continual learning (CL) approach. Previous CL studies in the surgical field have neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures.
