Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools.

Justin T Reese, Leonardo Chimirri, Yasemin Bridges, Daniel Danis, J Harry Caufield, Michael A Gargano, Carlo Kroll, Andrew Schmeder, Fengchen Liu, Kyran Wissink, Julie A McMurry, Adam SL Graefe, Enock Niyonkuru, Daniel R Korn, Elena Casiraghi, Giorgio Valentini, Julius OB Jacobsen, Melissa Haendel, Damian Smedley, Christopher J Mungall

medRxiv

Monarch Initiative.

Published: May 2025

Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate because their responses are unstructured, and their accuracy relative to existing diagnostic tools is not well characterized. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology, and the Mondo disease ontology. Prompts generated from each phenopacket were sent to seven LLMs, including four generalist models and three LLMs specialized for medical applications. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
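
The headline figures above are top-1 accuracies: the fraction of cases in which the correct diagnosis was ranked first. As a minimal sketch of how such a metric can be computed, and not the authors' actual pipeline, the example below assumes each tool's output has already been mapped to a ranked list of disease identifiers per case. The function names, the case IDs, and the identifiers `D1`, `D2`, ... are hypothetical placeholders, not real Mondo terms or study data.

```python
from typing import Dict, List, Optional


def rank_of_correct(ranked: List[str], correct: str) -> Optional[int]:
    """Return the 1-based rank of the correct diagnosis, or None if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return position
    return None


def top_k_accuracy(
    predictions: Dict[str, List[str]], truth: Dict[str, str], k: int = 1
) -> float:
    """Fraction of cases whose correct diagnosis appears within the top k ranks."""
    if not truth:
        raise ValueError("no cases to score")
    hits = 0
    for case_id, correct in truth.items():
        # A case with no prediction at all counts as a miss.
        rank = rank_of_correct(predictions.get(case_id, []), correct)
        if rank is not None and rank <= k:
            hits += 1
    return hits / len(truth)


# Toy data only: placeholder identifiers, not results from the study.
truth = {"case1": "D1", "case2": "D2", "case3": "D3", "case4": "D4"}
predictions = {
    "case1": ["D1", "D7"],  # correct at rank 1
    "case2": ["D5", "D2"],  # correct at rank 2
    "case3": ["D9"],        # correct diagnosis missing
    "case4": ["D4"],        # correct at rank 1
}

top1 = top_k_accuracy(predictions, truth, k=1)
top3 = top_k_accuracy(predictions, truth, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Scoring every tool against the same case set with the same function keeps the comparison like for like. The harder step, which this sketch deliberately leaves out, is turning free-text LLM answers into ontology identifiers before scoring.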

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302616
DOI: http://dx.doi.org/10.1101/2024.07.22.24310816
