Understanding transfer learning for chest radiograph clinical report generation with modified transformer architectures.

Edward Vendrow , Ethan Schonfeld

Heliyon

School of Medicine, Stanford University, 291 Campus Drive, Stanford, CA, United States of America.

Published: July 2023

Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

The image captioning task is increasingly prevalent in artificial intelligence applications for medicine. One important application is clinical report generation from chest radiographs. The clinical writing of unstructured reports is time consuming and error-prone. An automated system would improve standardization, error reduction, time consumption, and medical accessibility. In this paper we demonstrate the importance of domain specific pre-training and propose a modified transformer architecture for the medical image captioning task. To accomplish this, we train a series of modified transformers to generate clinical reports from chest radiograph image input. These modified transformers include: a meshed-memory augmented transformer architecture with visual extractor using ImageNet pre-trained weights, a meshed-memory augmented transformer architecture with visual extractor using CheXpert pre-trained weights, and a meshed-memory augmented transformer whose encoder is passed the concatenated embeddings using both ImageNet pre-trained weights and CheXpert pre-trained weights. We use BLEU(1-4), ROUGE-L, CIDEr, and the clinical CheXbert F1 scores to validate our models and demonstrate competitive scores with state of the art models. We provide evidence that ImageNet pre-training is ill-suited for the medical image captioning task, especially for less frequent conditions (e.g.: enlarged cardiomediastinum, lung lesion, pneumothorax). Furthermore, we demonstrate that the double feature model improves performance for specific medical conditions (edema, consolidation, pneumothorax, support devices) and overall CheXbert F1 score, and should be further developed in future work. Such a double feature model, including both ImageNet pre-training as well as domain specific pre-training, could be used in a wide range of image captioning models in medicine.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10372225	PMC
http://dx.doi.org/10.1016/j.heliyon.2023.e17968	DOI Listing

Publication Analysis

Top Keywords

image captioning

pre-trained weights

captioning task

transformer architecture

meshed-memory augmented

augmented transformer

chest radiograph

clinical report

report generation

modified transformer

Similar Publications

Ensemble deep learning with image captioning for visual pollution detection, classification, and reporting.

Sci Rep

August 2025

Computer Science Department, College of Computing and Informatics, Saudi Electronic University, 11673, Riyadh, Saudi Arabia.

Haya Almalki , Nahlah Algethami

With the rapid urban development and initiatives such as Saudi Vision 2030, efforts have been directed toward improving services and quality of life in Saudi cities. As a result, multiple environmental challenges have emerged, including visual pollution (VP), which significantly impacts the quality of life. Current approaches to these challenges rely on reporting through an online application managed by the Ministry of Municipalities and Housing, which is prone to errors due to manual data entry.

View Article and Find Full Text PDF

Similar Publications

Captioner: Improving change captioning by leveraging momentum cross-view and cross-modality contrastive learning.

Neural Netw

August 2025

National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, 610064, PR China; College of Computer Science, Sichuan University, Chengdu, 610065, PR China. Electronic address:

Lin Deng , Borui Kang , Yuzhong Zhong , Maoning Wang , Jianwei Zhang

The primary goal of change captioning is to identify subtle visual differences between two similar images and express them in natural language. Existing research has been significantly influenced by the task of vision change detection and has mainly concentrated on the identification and description of visual changes. However, we contend that an effective change captioner should go beyond mere detection and description of what has changed.

View Article and Find Full Text PDF

Similar Publications

High-level visual representations in the human brain are aligned with large language models.

Nat Mach Intell

August 2025

cerebrUM, Département de Psychologie, Université de Montréal, Montreal, Quebec Canada.

Adrien Doerig , Tim C Kietzmann , Emily Allen , Yihan Wu , Thomas Naselaris

The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes.

View Article and Find Full Text PDF

Similar Publications

Improving Video Summarization by Exploring the Coherence Between Corresponding Captions.

IEEE Trans Image Process

January 2025

Cheng Ye , Weidong Chen , Bo Hu , Lei Zhang , Yongdong Zhang

Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary, which lacks holistic considerations. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing.

View Article and Find Full Text PDF

Similar Publications

MinD-3D++: Advancing fMRI-Based 3D Reconstruction With High-Quality Textured Mesh Generation and a Comprehensive Dataset.

IEEE Trans Pattern Anal Mach Intell

August 2025

Jianxiong Gao , Yanwei Fu , Yuqian Fu , Yun Wang , Xuelin Qian

Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4,768 3D objects. The dataset consists of two components: fMRI-Shape, previously introduced and available at https://huggingface.

View Article and Find Full Text PDF

Similar Publications