Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering.

J Biomed Inform

Monash Biomedical Imaging, Monash University, Melbourne, 3800, Victoria, Australia.

Published: December 2024


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Medical Visual Question Answering (VQA) aims to answer questions about medical images, drawing on both visual and textual information in the reasoning process. The absence of large-scale annotated medical VQA datasets presents a formidable obstacle to training a medical VQA model from scratch in an end-to-end manner. Existing works use image captioning datasets in the pre-training stage and then fine-tune on downstream VQA tasks. Following the same paradigm, we use a collection of public medical image captioning datasets to pre-train multimodal models in a self-supervised setup and fine-tune them on downstream medical VQA tasks. In this work, we propose a method featuring Cross-Modal pre-training with Multiple Objectives (CMMO), which includes masked image modeling, masked language modeling, image-text matching, and image-text contrastive learning. The proposed method is designed to associate the visual features of medical images with the corresponding medical concepts in captions, learning aligned vision and language feature representations as well as multi-modal interactions. The experimental results show that our proposed CMMO method outperforms state-of-the-art methods on three public medical VQA datasets, with absolute improvements of 2.6%, 0.9%, and 4.0% on the VQA-RAD, PathVQA, and SLAKE datasets, respectively. We also conduct comprehensive ablation studies to validate our method and visualize attention maps, which demonstrate strong interpretability. The code and pre-trained weights will be released at https://github.com/pengfeiliHEU/CMMO.
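As a point of reference, the sketch below shows how one of the four objectives named in the abstract, image-text contrastive learning, is commonly implemented as a symmetric InfoNCE loss. The function name, tensor shapes, and temperature value are illustrative assumptions rather than the released CMMO code.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Minimal sketch of an InfoNCE-style image-text contrastive objective.

    img_emb, txt_emb: (B, D) embeddings of B matched image-caption pairs.
    Matched pairs (the diagonal of the similarity matrix) are treated as
    positives; every other pair in the batch serves as a negative.
    The temperature value is an assumed default, not taken from the paper.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```

In a multi-objective pre-training setup like the one described, a term of this form would typically be summed, possibly with a weighting coefficient, with the masked image modeling, masked language modeling, and image-text matching losses before back-propagation.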


Source
http://dx.doi.org/10.1016/j.jbi.2024.104748

Publication Analysis

Top Keywords

medical vqa (16)
medical (10)
vision language (8)
pre-training multiple (8)
multiple objectives (8)
medical visual (8)
visual question (8)
question answering (8)
medical images (8)
vqa datasets (8)

Similar Publications

Vision-language models (VLMs) have shown promise in 2D medical image analysis, but extending them to 3D remains challenging due to the high computational demands of volumetric data and the difficulty of aligning 3D spatial features with clinical text. We present Med3DVLM, a 3D VLM designed to address these challenges through three key innovations: (1) DCFormer, an efficient encoder that uses decomposed 3D convolutions to capture fine-grained spatial features at scale; (2) SigLIP, a contrastive learning strategy with pairwise sigmoid loss that improves image-text alignment without relying on large negative batches; and (3) a dual-stream MLP-Mixer projector that fuses low- and high-level image features with text embeddings for richer multi-modal representations. We evaluated our model on the M3D dataset, which includes radiology reports and VQA data for 120,084 3D medical images.
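For context on the second innovation mentioned above, the snippet below sketches a SigLIP-style pairwise sigmoid image-text loss. The temperature and bias defaults and the tensor names are assumptions for illustration, not Med3DVLM's actual implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          t: float = 10.0,
                          b: float = -10.0) -> torch.Tensor:
    """Sketch of a SigLIP-style pairwise sigmoid image-text loss.

    Each of the B x B image-text pairs is scored independently with a
    binary label (+1 on the diagonal for matched pairs, -1 elsewhere),
    so alignment does not rely on softmax normalisation over a large
    batch of negatives. Temperature t and bias b are illustrative values.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * t + b             # (B, B) pairwise scores
    labels = 2.0 * torch.eye(img_emb.size(0), device=img_emb.device) - 1.0
    # Mean negative log-sigmoid of label * score over all pairs.
    return -F.logsigmoid(labels * logits).mean()
```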


Objective: Introduce the task of wound care multimodal multilingual visual question answering, provide baseline performances, and identify areas of future study.

Methods: A dataset of wound care multimodal multilingual visual question answering (VQA) was created using consumer health questions asked online. Practicing US medical doctors were tasked with providing metadata and expert response labels.


Ophthalmic practice involves the integration of diverse clinical data and interactive decision-making, posing challenges for traditional artificial intelligence (AI) systems. Visual question answering (VQA) addresses this by combining computer vision and natural language processing to interpret medical images through user-driven queries. Evolving from VQA, multimodal AI agents enable continuous dialogue, tool use and context-aware clinical decision support.


The rapid evolution of artificial intelligence, especially in large language models (LLMs), has significantly impacted various domains, including healthcare. In chest X-ray (CXR) analysis, previous studies have employed LLMs, but with limitations: either underutilizing the LLMs' capability for multitask learning or lacking clinical accuracy. This article presents M4CXR, a multimodal LLM designed to enhance CXR interpretation.


Visual question answering (VQA) plays a vital role in advancing surgical education. However, due to privacy concerns around patient data, training a VQA model on previously used data is restricted, making it necessary to use an exemplar-free continual learning (CL) approach. Previous CL studies in the surgical field have neglected two critical issues: i) significant domain shifts caused by the wide range of surgical procedures collected from various sources, and ii) the data imbalance problem caused by the unequal occurrence of medical instruments or surgical procedures.
