Article Abstract

Background: Large language models (LLMs) are increasingly used in the medical field for diverse applications, including differential diagnostic support. The training data used to create LLMs such as the Generative Pretrained Transformer (GPT) are estimated to consist predominantly of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models in a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases is lacking.

Methods: We created 4967 clinical vignettes from structured data captured as Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span a total of 378 distinct genetic diseases with 2618 associated phenotypic features. We used translations of the HPO together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, to the task of delivering a ranked differential diagnosis using a zero-shot prompt. To automate evaluation of LLM responses, we used an ontology-based approach with the Mondo disease ontology to map synonyms and to map disease subtypes to clinical diagnoses.
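The template-based prompt generation described in the Methods can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the template wording, the German phrasing, the two-term label excerpt, and the function name `build_prompt` are hypothetical and are not taken from the study's materials.

```python
# Hypothetical per-language templates; the wording is illustrative only
# and is not the template text used in the study.
TEMPLATES = {
    "en": ("The patient presented with the following findings: {features}. "
           "Provide a ranked differential diagnosis."),
    "de": ("Der Patient zeigte die folgenden Befunde: {features}. "
           "Erstellen Sie eine nach Rang geordnete Differentialdiagnose."),
}

# Hypothetical excerpt of HPO labels keyed by term ID. The German labels
# are illustrative renderings, not verified official HPO translations.
HPO_LABELS = {
    "en": {"HP:0001250": "Seizure", "HP:0001252": "Hypotonia"},
    "de": {"HP:0001250": "Krampfanfall", "HP:0001252": "Hypotonie"},
}


def build_prompt(hpo_ids, lang):
    """Fill the template for `lang` with the translated labels of `hpo_ids`."""
    labels = [HPO_LABELS[lang][hpo_id] for hpo_id in hpo_ids]
    return TEMPLATES[lang].format(features=", ".join(labels))


if __name__ == "__main__":
    case = ["HP:0001250", "HP:0001252"]
    for lang in ("en", "de"):
        print(build_prompt(case, lang))
```

Because the vignette is stored as language-neutral term IDs, the same case can be rendered in any language for which a template and a label table exist, which is what makes a like-for-like comparison across languages possible.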

Findings: For English, GPT-4o placed the correct diagnosis at rank 1 in 19·8% of cases and within the top 3 ranks in 27·0% of cases. For the eight non-English languages tested, the correct diagnosis was placed at rank 1 in 16·9% to 20·5% of cases and within the top 3 ranks in 25·3% to 27·7% of cases.
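The rank-1 and top-3 figures are rank-based hit rates, which can be sketched as follows. This is a minimal illustration, not the study's evaluation code: the function names, the `equivalents` lookup (standing in for the ontology-based mapping of synonyms and subtypes), and the placeholder disease identifiers are all hypothetical.

```python
def rank_of_correct(ranked_ids, correct_id, equivalents=None):
    """Return the 1-based rank of the first acceptable diagnosis, or None.

    `equivalents` is a hypothetical dict mapping a disease ID to IDs that
    should also count as correct (for example, subtypes of that disease).
    """
    acceptable = {correct_id} | set((equivalents or {}).get(correct_id, ()))
    for rank, candidate in enumerate(ranked_ids, start=1):
        if candidate in acceptable:
            return rank
    return None


def top_k_rate(ranks, k):
    """Fraction of cases whose correct diagnosis appears within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


if __name__ == "__main__":
    # Placeholder identifiers, not real Mondo IDs.
    equivalents = {"DISEASE:A": ["DISEASE:A1"]}
    cases = [
        (["DISEASE:A1", "DISEASE:B"], "DISEASE:A"),               # subtype at rank 1
        (["DISEASE:C", "DISEASE:D", "DISEASE:B"], "DISEASE:B"),   # rank 3
        (["DISEASE:C", "DISEASE:D"], "DISEASE:E"),                # not found
        (["DISEASE:F"], "DISEASE:F"),                             # rank 1
    ]
    ranks = [rank_of_correct(r, c, equivalents) for r, c in cases]
    print(ranks, top_k_rate(ranks, 1), top_k_rate(ranks, 3))
```

Cases in which the correct diagnosis never appears count as misses at every k, so the top-3 rate is always at least as large as the rank-1 rate.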

Interpretation: The differential diagnostic performance of GPT-4o across a comprehensive corpus of rare-disease cases was consistent across the nine languages tested. This suggests that LLMs such as GPT-4o may have utility in non-English clinical settings.

Funding: NHGRI 5U24HG011449 and 5RM1HG010860. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPRECISC-III, Fondos FEDER).

Source

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11888497
DOI: http://dx.doi.org/10.1101/2025.02.26.25322769

Similar Publications

Purpose: Large language models (LLMs) can assist patients who seek medical knowledge online to guide their own glaucoma care. Understanding the differences in LLM performance on glaucoma-related questions can inform patients about the best resources to obtain relevant information.

Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries.

The objective was to compare the accuracy of two large language models (GPT-4o and o3-Mini) against medical-student performance on otolaryngology-focused, USMLE-style multiple-choice questions. With permission from AMBOSS, we extracted 146 Step 2 CK questions tagged "Otolaryngology" and stratified them by AMBOSS difficulty (levels 1-5). Each item was presented verbatim to GPT-4o and o3-Mini through their official APIs; outputs were scored correct/incorrect.

Introduction: ChatGPT, a generative artificial intelligence, has potential applications in numerous fields, including medical education. This potential can be assessed through its performance on medical exams. Medical residency exams, critical for entering medical specialties, serve as a valuable benchmark.

Purpose: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Despite the importance of early DR detection, only 60% of patients with diabetes receive recommended annual screenings due to limited eye care provider capacity. FDA-approved AI systems were developed to meet the growing demand for DR screening; however, high costs and specialized equipment limit accessibility.

Purpose: This study aimed to evaluate the performance of ChatGPT (GPT-4o) in interpreting free-text breast magnetic resonance imaging (MRI) reports by assigning BI-RADS categories and recommending appropriate clinical management steps in the absence of explicitly stated BI-RADS classifications.

Methods: This retrospective, single-center study included 352 full-text breast MRI reports, dated between January 2024 and June 2025, that each documented at least one identifiable breast lesion with descriptive imaging findings. Reports that were incomplete due to technical limitations, reports describing only normal findings, and MRI examinations performed at external institutions were excluded.