Image captioning is increasingly prevalent in artificial intelligence applications for medicine. One important application is clinical report generation from chest radiographs. Writing unstructured clinical reports by hand is time-consuming and error-prone. An automated system would improve standardization and medical accessibility while reducing errors and reporting time. In this paper we demonstrate the importance of domain-specific pre-training and propose a modified transformer architecture for the medical image captioning task. To accomplish this, we train a series of modified transformers to generate clinical reports from chest radiograph image input. These modified transformers are: a meshed-memory augmented transformer with a visual extractor using ImageNet pre-trained weights, a meshed-memory augmented transformer with a visual extractor using CheXpert pre-trained weights, and a meshed-memory augmented transformer whose encoder is passed the concatenated embeddings from both the ImageNet pre-trained and the CheXpert pre-trained weights. We use BLEU (1-4), ROUGE-L, CIDEr, and the clinical CheXbert F1 scores to validate our models and demonstrate scores competitive with state-of-the-art models. We provide evidence that ImageNet pre-training is ill-suited for the medical image captioning task, especially for less frequent conditions (e.g., enlarged cardiomediastinum, lung lesion, pneumothorax). Furthermore, we demonstrate that the double feature model improves performance for specific medical conditions (edema, consolidation, pneumothorax, support devices) and the overall CheXbert F1 score, and should be further developed in future work. Such a double feature model, combining ImageNet pre-training with domain-specific pre-training, could be used in a wide range of image captioning models in medicine.
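The "double feature" idea described above can be illustrated with a minimal sketch. It assumes that each visual extractor yields one embedding per image region and that the two embeddings are joined along the feature axis before reaching the encoder; the abstract does not state the axis, the dimensions, or the function name, so `concat_region_features` and the toy values below are hypothetical.

```python
from typing import List

Vector = List[float]


def concat_region_features(
    imagenet_feats: List[Vector], chexpert_feats: List[Vector]
) -> List[Vector]:
    """Join per-region embeddings from two extractors along the feature axis.

    Assumption (not confirmed by the abstract): both extractors describe the
    same regions in the same order, so region i of one matches region i of
    the other.
    """
    if len(imagenet_feats) != len(chexpert_feats):
        raise ValueError("both extractors must return the same number of regions")
    # List concatenation appends the CheXpert features after the ImageNet ones.
    return [a + b for a, b in zip(imagenet_feats, chexpert_feats)]


# Toy example: 2 regions, 2-dim ImageNet features and 3-dim CheXpert features.
imagenet = [[0.1, 0.2], [0.3, 0.4]]
chexpert = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = concat_region_features(imagenet, chexpert)
```

Under this assumption, each fused region vector has a width equal to the sum of the two extractor widths, so the encoder's input projection would need to accept that combined width.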
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10372225 | PMC |
| http://dx.doi.org/10.1016/j.heliyon.2023.e17968 | DOI Listing |
Sci Rep
August 2025
Computer Science Department, College of Computing and Informatics, Saudi Electronic University, 11673, Riyadh, Saudi Arabia.
With the rapid urban development and initiatives such as Saudi Vision 2030, efforts have been directed toward improving services and quality of life in Saudi cities. As a result, multiple environmental challenges have emerged, including visual pollution (VP), which significantly impacts the quality of life. Current approaches to these challenges rely on reporting through an online application managed by the Ministry of Municipalities and Housing, which is prone to errors due to manual data entry.
Neural Netw
August 2025
National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, 610064, PR China; College of Computer Science, Sichuan University, Chengdu, 610065, PR China.
The primary goal of change captioning is to identify subtle visual differences between two similar images and express them in natural language. Existing research has been significantly influenced by the task of vision change detection and has mainly concentrated on the identification and description of visual changes. However, we contend that an effective change captioner should go beyond mere detection and description of what has changed.
Nat Mach Intell
August 2025
cerebrUM, Département de Psychologie, Université de Montréal, Montreal, Quebec Canada.
The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes.
IEEE Trans Image Process
January 2025
Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary, which lacks holistic considerations. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing.
IEEE Trans Pattern Anal Mach Intell
August 2025
Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4,768 3D objects. The dataset consists of two components: fMRI-Shape, previously introduced and available at https://huggingface.