ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports.

Ria Singh, Mohamed Hamouda, Jordan H Chamberlin, Adrienn Tóth, James Munford, Matthew Silbergleit, Dhiraj Baruah, Jeremy R Burt, Ismail M Kabakus

Clin Imaging

Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA.

Published: May 2025

Objective: To evaluate the accuracy of large language models (LLMs) in generating Lung-RADS scores from low-dose computed tomography (LDCT) radiology reports for lung cancer screening.

Materials and Methods: A retrospective cross-sectional analysis was performed on 242 consecutive LDCT radiology reports generated by cardiothoracic fellowship-trained radiologists at a tertiary center. The LLMs evaluated were ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced. Each LLM was used to assign Lung-RADS scores based on the findings section of each report. No domain-specific fine-tuning was applied. Accuracy was determined by comparing the LLM-assigned scores to radiologist-assigned scores. Efficiency was assessed by measuring response times for each LLM.

Results: ChatGPT-4o achieved the highest accuracy in assigning Lung-RADS scores (83.6%), compared with 70.1% for ChatGPT-3.5. Gemini and Gemini Advanced had similar accuracy (70.9% and 65.1%, respectively). ChatGPT-3.5 had the fastest response time (median 4 s), while ChatGPT-4o was slower (median 10 s). Higher Lung-RADS categories were associated with marginally longer completion times. ChatGPT-4o demonstrated the greatest agreement with radiologists (κ = 0.836), although this remained below previously reported human interobserver agreement.
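The two headline metrics, accuracy against radiologist-assigned scores and the agreement statistic κ, can be illustrated with a minimal sketch. The abstract does not state which κ variant was used, so unweighted Cohen's kappa is assumed here. The function names and the ten example labels are hypothetical and are not taken from the study data.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of reports where the model's score equals the reference score."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement (same as accuracy here); p_e is the
    agreement expected by chance from each rater's category frequencies.
    """
    n = len(reference)
    p_o = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    p_e = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical Lung-RADS labels for ten reports (illustration only).
radiologist_scores = ["1", "2", "2", "3", "4A", "2", "1", "4B", "3", "2"]
llm_scores = ["1", "2", "3", "3", "4A", "2", "2", "4B", "3", "2"]

print(f"accuracy = {accuracy(radiologist_scores, llm_scores):.3f}")
print(f"kappa    = {cohen_kappa(radiologist_scores, llm_scores):.3f}")
```

Because Lung-RADS categories are ordinal, a weighted kappa that penalizes near-misses less than distant disagreements would be an alternative design choice; the unweighted form is shown only because it is the simplest.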

Conclusion: ChatGPT-4o outperformed ChatGPT-3.5, Gemini, and Gemini Advanced in Lung-RADS score assignment accuracy but did not reach the level of human experts. Despite promising results, further work is needed to integrate domain-specific training and ensure LLM reliability for clinical decision-making in lung cancer screening.

DOI: http://dx.doi.org/10.1016/j.clinimag.2025.110455
