Performance of GPT-4 Turbo and GPT-4o in Korean Society of Radiology In-Training Examinations.

Korean J Radiol

Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.

Published: June 2025



Article Abstract

Objective: Despite the potential of large language models for radiology training, their ability to handle image-based radiological questions remains poorly understood. This study aimed to evaluate the performance of GPT-4 Turbo and GPT-4o on radiology resident examinations, to analyze differences across question types, and to compare their results with those of residents at different training levels.

Materials And Methods: A total of 776 multiple-choice questions from the Korean Society of Radiology In-Training Examinations were used, forming two question sets: one originally written in Korean and the other translated into English. We evaluated the performance of GPT-4 Turbo (gpt-4-turbo-2024-04-09) and GPT-4o (gpt-4o-2024-11-20) on these questions with the temperature set to zero, determining accuracy by majority vote across five independent trials. Results were analyzed by question type (text-only vs. image-based) and benchmarked against nationwide radiology residents' performance. The impact of the input language (Korean or English) on model performance was also examined.
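The five-trial majority-vote scoring described above can be sketched as follows; the function names and the example answer lists are illustrative, not taken from the study's code:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer across repeated trials.

    `answers` holds the answer choices (e.g. "A".."E") from independent
    model runs; ties resolve to the earliest-seen choice, following
    Counter.most_common's insertion-order tie-breaking.
    """
    return Counter(answers).most_common(1)[0][0]

def accuracy(predictions, answer_key):
    """Fraction of questions whose majority-vote answer is correct."""
    correct = sum(
        majority_vote(trials) == key
        for trials, key in zip(predictions, answer_key)
    )
    return correct / len(answer_key)

# Five trials per question, mirroring the study's protocol.
trials = [["B", "B", "C", "B", "B"], ["A", "D", "D", "A", "D"]]
key = ["B", "A"]
score = accuracy(trials, key)  # Q1 correct, Q2 wrong -> 0.5
```

Even with temperature set to zero, repeated runs can differ slightly, which is why aggregating five trials gives a more stable accuracy estimate than a single run.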

Results: GPT-4o outperformed GPT-4 Turbo for both image-based (48.2% vs. 41.8%, P = 0.002) and text-only questions (77.9% vs. 69.0%, P = 0.031). On image-based questions, GPT-4 Turbo and GPT-4o showed performance comparable to that of 1st-year residents (41.8% and 48.2%, respectively, vs. 43.3%, P = 0.608 and 0.079, respectively) but lower than that of 2nd- to 4th-year residents (vs. 56.0%-63.9%, all P ≤ 0.005). For text-only questions, GPT-4 Turbo and GPT-4o performed better than residents across all years (69.0% and 77.9%, respectively, vs. 44.7%-57.5%, all P ≤ 0.039). Performance on the English- and Korean-version questions showed no significant differences for either model (all P ≥ 0.275).
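The abstract does not state which statistical test produced these p-values or how many questions fell into each type. As a minimal sketch of how two accuracies could be compared, assuming a pooled two-proportion z-test and an illustrative count of 330 image-based questions per model:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    x1, x2 = p1 * n1, p2 * n2          # successes in each group
    p_pool = (x1 + x2) / (n1 + n2)     # pooled success rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # P(|Z| > z) for a standard normal equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical group sizes; the study's per-type counts are not given.
p_value = two_proportion_z(0.482, 330, 0.418, 330)
```

With identical proportions the statistic is zero and the p-value is exactly 1; larger gaps or larger samples drive the p-value toward zero.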

Conclusion: GPT-4o outperformed GPT-4 Turbo on all question types. On image-based questions, both models matched the performance of 1st-year residents but fell below that of higher-year residents. On text-only questions, both models outperformed residents of all years. The models performed consistently across English and Korean inputs.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12123083
DOI: http://dx.doi.org/10.3348/kjr.2024.1096

Publication Analysis

Top Keywords

gpt-4 turbo: 28
turbo gpt-4o: 16
performance gpt-4: 12
text-only questions: 12
performance: 10
questions: 9
korean society: 8
society radiology: 8
radiology in-training: 8
in-training examinations: 8

Similar Publications

Background: On average, 1 in 10 patients die because of a diagnostic error, and medical errors represent the third largest cause of death in the United States. While large language models (LLMs) have been proposed to aid doctors in diagnoses, no research results have been published comparing the diagnostic abilities of many popular LLMs on a large, openly accessible real-patient cohort.

Objective: In this study, we set out to compare the diagnostic ability of 18 LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic, using 3 prompts, 2 temperature settings, and 1000 randomly selected Medical Information Mart for Intensive Care-IV (MIMIC-IV) hospital admissions.


Fine-Tuned Large Language Models for High-Accuracy Prediction of Band Gap and Stability in Transition Metal Sulfides.

Materials (Basel)

August 2025

Demonstrative Software School, College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China.

This study presents a fine-tuned Large Language Model approach for predicting band gap and stability of transition metal sulfides. Our method processes textual descriptions of crystal structures directly, eliminating the need for complex feature engineering required by traditional ML and GNN approaches. Using a strategically selected dataset of 554 compounds from the Materials Project database, we fine-tuned GPT-3.


Large language models (LLMs) have advanced significantly in recent years, greatly enhancing the capabilities of retrieval-augmented generation (RAG) systems. However, challenges such as semantic similarity, bias/sentiment, and hallucinations persist, especially in domain-specific applications. This paper introduces MultiLLM-Chatbot, a scalable RAG-based benchmarking framework designed to evaluate five popular LLMs: GPT-4-Turbo, CLAUDE-3.


Introduction: Dental age estimation plays a key role in forensic identification, clinical diagnosis, treatment planning, and prognosis in fields such as pediatric dentistry and orthodontics. Large language models (LLMs) are increasingly being recognized for their potential applications in dentistry. This study aimed to compare the performance of currently available generative artificial intelligence LLM technologies in estimating dental age using Demirjian's scores.
