Performance of GPT-4 Turbo and GPT-4o in Korean Society of Radiology In-Training Examinations.

Korean J Radiol

Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.

Published: June 2025



Article Abstract

Objective: Despite the potential of large language models for radiology training, their ability to handle image-based radiological questions remains poorly understood. This study aimed to evaluate the performance of GPT-4 Turbo and GPT-4o on radiology resident examinations, to analyze differences across question types, and to compare their results with those of residents at different training levels.

Materials And Methods: A total of 776 multiple-choice questions from the Korean Society of Radiology In-Training Examinations were used, forming two question sets: one originally written in Korean and the other translated into English. We evaluated the performance of GPT-4 Turbo (gpt-4-turbo-2024-04-09) and GPT-4o (gpt-4o-2024-11-20) on these questions with the temperature set to zero, determining accuracy by majority vote across five independent trials. Results were analyzed by question type (text-only vs. image-based) and benchmarked against nationwide radiology residents' performance. The impact of the input language (Korean or English) on model performance was also examined.
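The five-trial majority-vote scoring described above can be sketched as follows; the function names and the example answer lists are illustrative, not taken from the study's code:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer across repeated trials.

    `answers` holds the answer choices (e.g. "A".."E") from independent
    model runs; ties resolve to the earliest-seen choice, following
    Counter.most_common's insertion-order tie-breaking.
    """
    return Counter(answers).most_common(1)[0][0]

def accuracy(predictions, answer_key):
    """Fraction of questions whose majority-vote answer is correct."""
    correct = sum(
        majority_vote(trials) == key
        for trials, key in zip(predictions, answer_key)
    )
    return correct / len(answer_key)

# Five trials per question, mirroring the study's protocol.
trials = [["B", "B", "C", "B", "B"], ["A", "D", "D", "A", "D"]]
key = ["B", "A"]
score = accuracy(trials, key)  # Q1 correct, Q2 wrong -> 0.5
```

Even with temperature set to zero, repeated runs can differ slightly, which is why aggregating five trials gives a more stable accuracy estimate than a single run.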

Results: GPT-4o outperformed GPT-4 Turbo for both image-based (48.2% vs. 41.8%, P = 0.002) and text-only questions (77.9% vs. 69.0%, P = 0.031). On image-based questions, GPT-4 Turbo and GPT-4o showed performance comparable to that of 1st-year residents (41.8% and 48.2%, respectively, vs. 43.3%, P = 0.608 and 0.079, respectively) but lower than that of 2nd- to 4th-year residents (vs. 56.0%-63.9%, all P ≤ 0.005). For text-only questions, GPT-4 Turbo and GPT-4o performed better than residents across all years (69.0% and 77.9%, respectively, vs. 44.7%-57.5%, all P ≤ 0.039). Performance on the English- and Korean-version questions showed no significant differences for either model (all P ≥ 0.275).
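The abstract does not state which statistical test produced these p-values or how many questions fell into each type. As a minimal sketch of how two accuracies could be compared, assuming a pooled two-proportion z-test and an illustrative count of 330 image-based questions per model:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    x1, x2 = p1 * n1, p2 * n2          # successes in each group
    p_pool = (x1 + x2) / (n1 + n2)     # pooled success rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # P(|Z| > z) for a standard normal equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical group sizes; the study's per-type counts are not given.
p_value = two_proportion_z(0.482, 330, 0.418, 330)
```

With identical proportions the statistic is zero and the p-value is exactly 1; larger gaps or larger samples drive the p-value toward zero.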

Conclusion: GPT-4o outperformed GPT-4 Turbo on all question types. On image-based questions, both models matched the performance of 1st-year residents but fell below that of higher-year residents. On text-only questions, both models outperformed residents of all years. The models performed consistently across English and Korean inputs.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12123083
DOI: http://dx.doi.org/10.3348/kjr.2024.1096

Publication Analysis

Top Keywords

gpt-4 turbo: 28
turbo gpt-4o: 16
performance gpt-4: 12
text-only questions: 12
performance: 10
questions: 9
korean society: 8
society radiology: 8
radiology in-training: 8
in-training examinations: 8

Similar Publications

Background: On average, 1 in 10 patients die because of a diagnostic error, and medical errors represent the third largest cause of death in the United States. While large language models (LLMs) have been proposed to aid doctors in diagnoses, no research results have been published comparing the diagnostic abilities of many popular LLMs on a large, openly accessible real-patient cohort.

Objective: In this study, we set out to compare the diagnostic ability of 18 LLMs from Google, OpenAI, Meta, Mistral, Cohere, and Anthropic, using 3 prompts, 2 temperature settings, and 1000 randomly selected Medical Information Mart for Intensive Care-IV (MIMIC-IV) hospital admissions.


Fine-Tuned Large Language Models for High-Accuracy Prediction of Band Gap and Stability in Transition Metal Sulfides.

Materials (Basel)

August 2025

Demonstrative Software School, College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China.

This study presents a fine-tuned Large Language Model approach for predicting band gap and stability of transition metal sulfides. Our method processes textual descriptions of crystal structures directly, eliminating the need for complex feature engineering required by traditional ML and GNN approaches. Using a strategically selected dataset of 554 compounds from the Materials Project database, we fine-tuned GPT-3.


Large language models (LLMs) have advanced significantly in recent years, greatly enhancing the capabilities of retrieval-augmented generation (RAG) systems. However, challenges such as semantic similarity, bias/sentiment, and hallucinations persist, especially in domain-specific applications. This paper introduces MultiLLM-Chatbot, a scalable RAG-based benchmarking framework designed to evaluate five popular LLMs: GPT-4-Turbo, CLAUDE-3.


Introduction: Dental age estimation plays a key role in forensic identification, clinical diagnosis, treatment planning, and prognosis in fields such as pediatric dentistry and orthodontics. Large language models (LLMs) are increasingly being recognized for their potential applications in dentistry. This study aimed to compare the performance of currently available generative artificial intelligence LLM technologies in estimating dental age using Demirjian's scores.
