Objectives: The main objective of this study is to evaluate the ability of the large language model Chat Generative Pre-Trained Transformer (ChatGPT) to accurately answer United States Medical Licensing Examination (USMLE) board-style medical ethics questions compared to medical knowledge-based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to that of GPT-4 and assessing the variability of the responses given by each version.
Methods: Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials in which each question was posed to GPT-3.5 and GPT-4 and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.
Results: Both versions of ChatGPT demonstrated worse performance on medical ethics questions compared to medical knowledge questions. GPT-4 performed 18 percentage points (p < 0.05) worse on medical ethics questions than on medical knowledge questions, and GPT-3.5 performed 7 percentage points (p = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22 percentage points (p < 0.001) on medical ethics and 33 percentage points (p < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55, respectively), indicating lower response variability.
Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.
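To illustrate the response-variability measure described in the Methods, the sketch below computes Shannon entropy over the answer choices a model returns across repeated trials of a single question. It is a minimal illustration only: the answer distributions and function names are hypothetical, not the study's actual data or code.

```python
import math
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (in bits) of the answer choices a model gave
    across repeated trials of one question, e.g. ["A", "A", "C", ...]."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical answer distributions over 30 trials of one question
gpt35_answers = ["A"] * 21 + ["C"] * 6 + ["D"] * 3
gpt4_answers = ["A"] * 29 + ["B"]

print(round(shannon_entropy(gpt35_answers), 2))  # answers more spread out -> higher entropy
print(round(shannon_entropy(gpt4_answers), 2))   # near-unanimous answers -> lower entropy
```

An entropy of 0 would mean the model gave the same answer on every trial; larger values indicate the answers were split across more choices.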
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11617073
DOI: http://dx.doi.org/10.1055/a-2405-0138
Spine Deform
September 2025
Spine Unit, Department of Orthopedic Surgery, Rigshospitalet, Inge Lehmanns Vej 6, 2100, Copenhagen, Denmark.
Study Design: This is a retrospective single-center study.
Purpose: The purpose is to investigate the incidence of distal junctional kyphosis (DJK) when fused proximal to the stable sagittal vertebra (SSV) in adolescent idiopathic scoliosis (AIS) patients undergoing selective thoracic fusion.
Methods: We retrospectively reviewed a consecutive cohort of surgically treated AIS patients with Lenke 1-2 A/B curves between 2011 and 2022 with a minimum of 2 years of follow-up.
J Behav Med
September 2025
Department of Psychology, University of Wisconsin-La Crosse, La Crosse, WI, USA.
Latent profile analysis (LPA) belongs to the finite mixture modeling family and identifies subgroups based on participants' responses to continuous variables (i.e., indicators); each participant's probable membership in a subgroup is based on the similarity between the subgroup's prototypical responses and that person's unique responses.
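As a rough illustration of the idea, a Gaussian mixture model over continuous indicators approximates what LPA does (dedicated tools such as Mplus or R packages are typically used in practice). The data, indicator count, and profile structure below are invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: 300 participants x 4 continuous indicators,
# simulated from two distinct latent profiles
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[1, 1, 4, 4], scale=0.5, size=(150, 4)),
    rng.normal(loc=[4, 4, 1, 1], scale=0.5, size=(150, 4)),
])

# Fit mixtures with 1-4 profiles and pick the profile count with the lowest BIC
models = {k: GaussianMixture(n_components=k, covariance_type="diag",
                             random_state=0).fit(X) for k in range(1, 5)}
best_k = min(models, key=lambda k: models[k].bic(X))
best = models[best_k]

posterior = best.predict_proba(X)  # each participant's probable membership in each subgroup
profiles = best.means_             # prototypical (mean) indicator responses per profile
print(best_k, profiles.round(2))
```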
Acta Parasitol
September 2025
Région du Centre, Université Joseph Ki-Zerbo, Rue Thomas Sankara, O3 BP 7021, Ouagadougou, Burkina Faso.
Introduction: The objective of the World Health Organization is to achieve the interruption of human African trypanosomiasis (HAT) transmission by 2030.
Methods: This review aims to update knowledge on HAT through a synthesis of its epidemiology, diagnostic tools, and drugs.
Results: From 1960 to 2024, approximately 132,063 cases of HAT were reported across Africa.
Acta Neurochir (Wien)
September 2025
Department of Neurosurgery, Istinye University, Istanbul, Turkey.
Background: Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs on Turkish and English neurosurgery multiple-choice questions and to assess their accuracy and consistency in a specialized medical context.
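A generic way to run this kind of multiple-choice evaluation is sketched below: each item is posed to a model several times, accuracy is scored against the answer key, and consistency is summarized as the share of trials returning the modal answer. The `ask_model` callable and the demo items are placeholders, not any real API or question bank used in the study.

```python
from collections import Counter

def evaluate(questions, ask_model, n_trials=5):
    """Score a model on multiple-choice items, returning overall accuracy
    and mean per-question consistency (share of trials giving the modal answer).

    `ask_model(question_text)` stands in for whatever client call returns
    a single answer letter; it is a hypothetical placeholder.
    """
    correct, consistency = 0, []
    for q in questions:
        answers = [ask_model(q["text"]) for _ in range(n_trials)]
        modal_answer, modal_count = Counter(answers).most_common(1)[0]
        consistency.append(modal_count / n_trials)
        correct += int(modal_answer == q["answer"])
    return correct / len(questions), sum(consistency) / len(consistency)

# Hypothetical usage with a stub model that always answers "A"
demo_questions = [{"text": "Q1 ...", "answer": "A"}, {"text": "Q2 ...", "answer": "B"}]
accuracy, mean_consistency = evaluate(demo_questions, lambda text: "A")
print(accuracy, mean_consistency)  # 0.5 1.0
```

The same loop can be repeated per language (e.g., Turkish and English versions of each item) to compare accuracy and consistency across translations.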
Eur Radiol
September 2025
Department of Ultrasound, Affiliated Hospital of Nanjing University of Chinese Medicine, Jiangsu Province Hospital of Chinese Medicine, Nanjing, China.
Objectives: To evaluate the predictive role of carotid stiffening, quantified using ultrafast pulse wave velocity (ufPWV), for assessing cardiovascular risk in young populations with no or elevated cardiovascular risk factors (CVRFs).
Materials And Methods: This study enrolled 180 young, apparently healthy individuals who underwent ufPWV measurements. They were classified into three groups: the CVRF-free group (n = 60), comprising current non-smokers with untreated blood pressure < 140/90 mmHg, fasting blood glucose (FBG) < 7.