Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Purpose: To evaluate the diagnostic accuracy of various large language models (LLMs) when answering radiology-specific questions with and without access to additional online, up-to-date information via retrieval-augmented generation (RAG).

Materials and Methods: The authors developed radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real time. RAG supplements the initial prompt with information retrieved from external sources, grounding the model's response in relevant information. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo [OpenAI], GPT-4, Mistral 7B, Mixtral 8×7B [Mistral], and Llama3-8B and -70B [Meta]) were prompted with and without RadioRAG in a zero-shot inference scenario (temperature ≤ 0.1, top-p = 1). RadioRAG retrieved context-specific information from these online sources in real time. The accuracy of each LLM with and without RadioRAG in answering questions from each dataset was assessed. Statistical analyses were performed using bootstrapping while preserving pairing. Additional assessments compared model performance with human performance and compared the time required for conventional versus RadioRAG-powered question answering.

Results: RadioRAG improved accuracy for some LLMs, including GPT-3.5-turbo (74% [59 of 80] vs 66% [53 of 80]; false discovery rate [FDR] = 0.03) and Mixtral 8×7B (76% [61 of 80] vs 65% [52 of 80]; FDR = 0.02) on the RSNA radiology question answering (RSNA-RadioQA) dataset, with similar trends in the ExtendedQA dataset. For these LLMs, accuracy exceeded that of a human expert (63% [50 of 80]; FDR ≤ 0.007), although it did not for Mistral 7B-instruct-v0.2, Llama3-8B, and Llama3-70B (FDR ≥ 0.21). RadioRAG reduced hallucinations for all LLMs (hallucination rates, 6%-25%). RadioRAG increased estimated response time approximately fourfold.
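The Results above come from bootstrap comparisons that preserve pairing, i.e., each resample draws whole questions so that a model's with-RAG and without-RAG answers to the same question stay together. A minimal sketch of such a paired bootstrap, using toy data and illustrative function names (not the study's code or its actual results):

```python
import random

def paired_bootstrap_pvalue(correct_a, correct_b, n_boot=2000, seed=0):
    """One-sided bootstrap p-value for 'model A is not more accurate
    than model B'. Pairing is preserved: each resample draws question
    indices, keeping both models' answers to the same question together."""
    assert len(correct_a) == len(correct_b)
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        if acc_a > acc_b:
            wins += 1
    return 1.0 - wins / n_boot

# Toy per-question correctness (1 = correct, 0 = incorrect); illustrative only.
with_rag    = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
without_rag = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
p = paired_bootstrap_pvalue(with_rag, without_rag)
```

A multiple-comparison correction (the abstract reports FDR-adjusted values) would then be applied across all model comparisons; that step is omitted here for brevity.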
Conclusion: RadioRAG shows potential to improve LLM accuracy and factuality in radiology question answering by integrating real-time, domain-specific data.

Keywords: Retrieval-augmented Generation, Informatics, Computer-aided Diagnosis, Large Language Models

© RSNA, 2025.
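The core RAG step the abstract describes, retrieving external context and prepending it to the prompt so the model's answer is grounded, can be illustrated with a minimal sketch. The token-overlap scoring, corpus, and names below are hypothetical simplifications, not RadioRAG's implementation (which retrieves from authoritative online radiology sources in real time):

```python
def tokenize(text):
    """Naive whitespace tokenizer; real systems use embeddings or BM25."""
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    """Rank documents by token overlap with the query; return the top k."""
    scored = sorted(
        corpus,
        key=lambda doc: len(tokenize(doc) & tokenize(query)),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, corpus, k=2):
    """Prepend retrieved context to the question, grounding the answer."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Illustrative mini-corpus of radiology facts.
corpus = [
    "Ground-glass opacities on chest CT can indicate early infection.",
    "MRI is the preferred modality for soft-tissue characterization.",
    "Ultrasound is first-line for evaluating gallbladder disease.",
]
prompt = build_prompt("Which modality is first-line for gallbladder disease?",
                      corpus, k=1)
```

The assembled prompt would then be sent to the LLM in place of the bare question; the retrieval quality, not the prompt template, is usually the deciding factor in whether RAG helps.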


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12326075
DOI: http://dx.doi.org/10.1148/ryai.240476

Publication Analysis

Top Keywords

retrieval-augmented generation (12)
question answering (12)
radiorag (9)
radiology question (8)
large language (8)
language models (8)
mixtral 8×7b (8)
accuracy llms (8)
llms (6)

Similar Publications

GPT-4o, a general-purpose large language model, has a retrieval-augmented variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains scarce. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized evaluation benchmark.


Background: Older adults are more vulnerable to severe consequences of seasonal influenza. Although seasonal influenza vaccination (SIV) is effective and free vaccines are available, the SIV uptake rate remains inadequate among people aged 65 years or older in Hong Kong, China. Few studies have evaluated ChatGPT as a tool for promoting vaccination uptake among older adults.


Purpose: The recent advancements of retrieval-augmented generation (RAG) and large language models (LLMs) have revolutionized the extraction of real-world evidence from unstructured electronic health records (EHRs) in oncology. This study aims to enhance RAG's effectiveness by implementing a retriever encoder specifically designed for oncology EHRs, with the goal of improving the precision and relevance of retrieved clinical notes for oncology-related queries.

Methods: Our model was pretrained with more than six million oncology notes from 209,135 patients at City of Hope.


Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data).


Evaluating large language models in neuro-oncology: A comparative study of accuracy, completeness, and clinical usefulness.

J Clin Neurosci

September 2025

Department of Neurosurgery, Nordwest-Krankenhaus Sanderbusch, Friesland Kliniken gGmbH, Sande, Germany.

Background: Large language models (LLMs), with their remarkable ability to retrieve and analyse information within seconds, are generating significant interest in healthcare. This study aims to assess and compare the accuracy, completeness, and usefulness of the responses of Gemini Advanced, ChatGPT-3.5, and ChatGPT-4 in neuro-oncology cases.
