Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge supplied to the models, derived from medical texts, improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanations to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. Performance of both GPT-3.5 and GPT-4.0 decreased by 3.2-5.3% when the LLM was accessed through its API instead of its online chatbot, and increased by 4.5-7.5% when the API inputs were additionally augmented. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting with domain-specific information improved performance, making Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses on medical examinations.
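The input augmentation described above can be illustrated with a minimal sketch of the general Retrieval Augmented Generation pattern: retrieve the reference passage most relevant to a question, then prepend it to the prompt. This is not the study's pipeline. The passages, the question, the word-overlap scoring, and the function names (`retrieve`, `build_prompt`) are all hypothetical stand-ins chosen for illustration, and a real system would more likely use embedding-based search over the reference text.

```python
import re
from collections import Counter

# Hypothetical stand-ins for passages from a reference text.
PASSAGES = [
    "Gout is caused by uric acid crystals and presents with joint pain.",
    "Pneumonia presents with cough, fever, and a lung infiltrate on imaging.",
]

# Hypothetical board-style question stem.
QUESTION = (
    "A patient has joint pain and uric acid crystals. "
    "What is the most likely diagnosis?"
)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count the word tokens shared by query and passage (multiset overlap)."""
    shared = Counter(tokenize(query)) & Counter(tokenize(passage))
    return sum(shared.values())


def retrieve(query, passages, k=1):
    """Return the k passages with the highest overlap score for the query."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=1):
    """Prepend the retrieved passages to the question to form the model input."""
    context = "\n".join(retrieve(question, passages, k))
    return (
        f"Reference text:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer with a single option letter."
    )


if __name__ == "__main__":
    print(build_prompt(QUESTION, PASSAGES))
```

The resulting string would be sent to the model in place of the bare question. Comparing accuracy with and without the retrieved context is the kind of comparison the abstract reports for the API-accessed models.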
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11407633 | PMC |
| http://dx.doi.org/10.1371/journal.pdig.0000604 | DOI Listing |