Validating the accuracy of deep learning for the diagnosis of pneumonia on chest x-ray against a robust multimodal reference diagnosis: a post hoc analysis of two prospective studies.

Jeremy Hofmeister, Nicolas Garin, Xavier Montet, Max Scheffler, Alexandra Platon, Pierre-Alexandre Poletti, Jérôme Stirnemann, Marie-Pierre Debray, Yann-Erick Claessens, Xavier Duval, Virginie Prendki

Eur Radiol Exp

Department of Rehabilitation and Geriatrics, Geneva University Hospitals, Geneva, Switzerland.

Published: February 2024

Background: Artificial intelligence (AI) seems promising for diagnosing pneumonia on chest x-rays (CXR), but deep learning (DL) algorithms have primarily been compared with radiologists, whose diagnoses are not always accurate. Therefore, we evaluated the accuracy of DL in diagnosing pneumonia on CXR using a more robust reference diagnosis.

Methods: We trained a DL convolutional neural network model to diagnose pneumonia and evaluated its accuracy in two prospective pneumonia cohorts including 430 patients, for whom the reference diagnosis was determined a posteriori by a multidisciplinary expert panel using multimodal data. The performance of the DL model was compared with that of senior radiologists and emergency physicians reviewing CXRs and that of radiologists reviewing computed tomography (CT) performed concomitantly.

Results: Radiologists and DL showed similar accuracy on CXR in both cohorts (p ≥ 0.269): cohort 1, radiologist 1 75.5% (95% confidence interval 69.1-80.9), radiologist 2 71.0% (64.4-76.8), DL 71.0% (64.4-76.8); cohort 2, radiologist 70.9% (64.7-76.4), DL 72.6% (66.5-78.0). The accuracy of radiologists and DL was significantly higher (p ≤ 0.022) than that of emergency physicians (cohort 1 64.0% [57.1-70.3], cohort 2 63.0% [55.6-69.0]). Accuracy was significantly higher for CT (cohort 1 79.0% [72.8-84.1], cohort 2 89.6% [84.9-92.9]) than for all CXR readers, including radiologists, clinicians, and DL (all p-values < 0.001).
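The abstract does not say how the 95% confidence intervals were computed. As a minimal sketch, the snippet below assumes a Wilson score interval and uses hypothetical counts (142 of 200 correct for cohort 1 and 167 of 230 for cohort 2). These counts are not given in the abstract; they were chosen only because they sum to the 430 patients mentioned in the Methods and match the reported DL accuracies of 71.0% and 72.6%.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION: per-cohort counts are hypothetical, not taken from the abstract.
assumed_counts = {
    "cohort 1, DL": (142, 200),
    "cohort 2, DL": (167, 230),
}

for label, (correct, total) in assumed_counts.items():
    low, high = wilson_interval(correct, total)
    print(f"{label}: {100 * correct / total:.1f}% ({100 * low:.1f}-{100 * high:.1f})")
```

Under these assumed counts the sketch yields 64.4-76.8 and 66.5-78.0, which agrees with the intervals reported for DL, but that agreement does not confirm the authors' actual method or sample sizes.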

Conclusions: When compared with a robust reference diagnosis, the performance of AI models in identifying pneumonia on CXRs was lower than previously reported but similar to that of radiologists and better than that of emergency physicians.

Relevance Statement: The clinical relevance of AI models for pneumonia diagnosis may have been overestimated. AI models should be benchmarked against a robust multimodal reference diagnosis to avoid overestimating their performance.

Trial Registration: NCT02467192 and NCT01574066.

Key Points:
• We evaluated an open-access convolutional neural network (CNN) model to diagnose pneumonia on CXRs.
• The CNN was validated against a strong multimodal reference diagnosis.
• In our study, the CNN performance (area under the receiver operating characteristic curve 0.74) was lower than that previously reported when validated against radiologists' diagnosis (0.99 in a recent meta-analysis).
• The CNN performance was significantly higher than that of emergency physicians (p ≤ 0.022) and comparable to that of board-certified radiologists (p ≥ 0.269).
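For readers less familiar with the area under the receiver operating characteristic curve (AUC) cited above, the sketch below computes it as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, counting ties as one half. The function name, labels, and scores are hypothetical toy values for illustration and are not data from the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 reference diagnoses (1 = pneumonia).
    scores: iterable of model outputs, higher meaning more likely pneumonia.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy, hypothetical data: two negatives followed by two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_from_scores(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the gap between 0.74 and 0.99 matters: the reference diagnosis supplies the labels, so changing the reference changes the measured AUC.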

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10834924
DOI: http://dx.doi.org/10.1186/s41747-023-00416-y

Similar Publications

Current Trends and Future Directions of Statistical Methods in Medical Research: A Scientometric Analysis.

J Eval Clin Pract

September 2025

Department of Orthopedics and Traumatology, Medical Faculty, University of Health Sciences, Antalya, Turkey.

Fatma Yardibi, Chaomei Chen, Cagdas Hakan Aladag, Ozkan Kose

Aims and Objectives: The field of medical statistics has advanced significantly through the integration of innovative statistical methodologies. This study aims to conduct a comprehensive analysis to explore current trends, influential research areas, and future directions in medical statistics.

Methods: This paper maps the evolution of statistical methods used in medical research based on 4,919 relevant publications retrieved from the Web of Science.

Artificial Intelligence in Contact Dermatitis: Current and Future Perspectives.

Dermatitis

September 2025

From the Department of Dermatology, Venereology and Leprology, All India Institute of Medical Sciences (AIIMS), Bhopal, India.

Akriti Agrawal

Contact dermatitis (CD), which includes both allergic CD and irritant CD, is a common inflammatory condition that can pose significant diagnostic challenges. Although patch testing is the gold standard for identifying causative allergens for allergic contact dermatitis (ACD), it is time-consuming, subjective, and requires expert interpretation. Recent advancements in artificial intelligence (AI), particularly in machine learning (ML) and deep learning, have shown promise in improving the accuracy, efficiency, and accessibility of CD diagnosis and management.

Optimized node-level capsule graph neural network for subject-independent emotion recognition from EEG signals.

Electromagn Biol Med

September 2025

Computer Science and Business Systems, Sri Krishna College of Engineering and Technology, Coimbatore, India.

G Kiruthiga, Ashwinth Janarthanan, P D Mahendhiran

Subject-independent emotion detection from electroencephalography (EEG) signals using Vibrational Mode Decomposition and deep learning is hindered by the scarcity of labelled EEG datasets encompassing a variety of emotions. Collecting labelled EEG data over a wide range of emotional states from a broad and varied population is challenging and resource-intensive. As a result, models trained on small or biased datasets may fail to generalize well to unseen individuals or emotional states, resulting in lower accuracy and robustness in real-world applications.

[A myocardial infarction detection and localization model based on multi-scale field residual blocks fusion with modified channel attention].

Nan Fang Yi Ke Da Xue Xue Bao

August 2025

School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China.

Qiucen Wu, Xueqi Lu, Yaoqi Wen, Yong Hong, Yuliang Wu

Objectives: We propose a myocardial infarction (MI) detection and localization model to improve the diagnostic accuracy for MI and to assist clinical decision-making.

Methods: The proposed model was constructed based on multi-scale field residual blocks fusion with modified channel attention (MSF-RB-MCA). The model uses lead II electrocardiogram (ECG) signals to detect and localize MI, and extracts different levels of feature information through the multi-scale field residual block.

Large language models in nephrology: applications and challenges in chronic kidney disease management.

Ren Fail

December 2025

Department of Nephrology, The Affiliated Hospital of Qingdao University, Qingdao, China.

Yongzheng Hu, Jianping Liu, Wei Jiang

Large language models (LLMs) represent a transformative advance in artificial intelligence, with growing potential to impact chronic kidney disease (CKD) management. CKD is a complex, highly prevalent condition requiring multifaceted care and substantial patient engagement. Recent developments in LLMs (including conversational AI, multimodal integration, and autonomous agents) offer novel opportunities to enhance patient education, streamline clinical documentation, and support decision-making across nephrology practice.
