Article Abstract

Background: Recognizing patient symptoms is fundamental to medicine, research, and public health. However, symptoms are often underreported in coded formats even though they are routinely documented in physician notes. Large language models (LLMs), noted for their generalizability, could help bridge this gap by mimicking the role of human expert chart reviewers for symptom identification.

Objective: The primary objective of this multisite study was to measure the accurate identification of infectious respiratory disease symptoms using LLMs instructed to follow chart review guidelines. The secondary objective was to evaluate LLM generalizability in multisite settings without the need for site-specific training, fine-tuning, or customization.

Methods: Four LLMs were evaluated: GPT-4, GPT-3.5, Llama2 70B, and Mixtral 8×7B. LLM prompts were instructed to take on the role of chart reviewers and follow symptom annotation guidelines when assessing physician notes. Ground truth labels for each note were annotated by subject matter experts. Optimal LLM prompting strategies were selected using a development corpus of 103 notes from the emergency department at Boston Children's Hospital. The performance of each LLM was measured using a test corpus with 202 notes from Boston Children's Hospital. The performance of an International Classification of Diseases, Tenth Revision (ICD-10)-based method was also measured as a baseline. Generalizability of the most performant LLM was then measured in a validation corpus of 308 notes from 21 emergency departments in the Indiana Health Information Exchange.

Results: Symptom identification accuracy was superior for every LLM tested for each infectious disease symptom compared to an ICD-10-based method (F1-score=45.1%). GPT-4 was the highest scoring (F1-score=91.4%; P<.001) and was significantly better than the ICD-10-based method, followed by GPT-3.5 (F1-score=90.0%; P<.001), Llama2 (F1-score=81.7%; P<.001), and Mixtral (F1-score=83.5%; P<.001). For the validation corpus, performance of the ICD-10-based method decreased (F1-score=26.9%), while GPT-4 increased (F1-score=94.0%), demonstrating better generalizability using GPT-4 (P<.001).

Conclusions: LLMs significantly outperformed an ICD-10-based method for respiratory symptom identification in emergency department electronic health records. GPT-4 demonstrated the highest accuracy and generalizability, suggesting that LLMs may augment or replace traditional approaches. LLMs can be instructed to mimic human chart reviewers with high accuracy. Future work should assess broader symptom types and health care settings.
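The F1-scores reported above compare the symptoms each method identified in a note against the expert annotations. The following is a minimal sketch of that calculation, not the study's evaluation code. It assumes micro-averaging over per-note symptom sets, which the abstract does not specify, and the function name `micro_f1` and the symptom labels are hypothetical.

```python
from typing import Dict, List, Set


def micro_f1(truth: List[Set[str]], predicted: List[Set[str]]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 over per-note symptom sets.

    Assumption: each note is represented as the set of symptoms judged
    present. The abstract does not state how scores were averaged.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must cover the same notes")

    tp = fp = fn = 0
    for expert, model in zip(truth, predicted):
        tp += len(expert & model)  # symptoms found by both
        fp += len(model - expert)  # symptoms only the model reported
        fn += len(expert - model)  # symptoms the model missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for three notes (illustrative, not study data).
expert_labels = [{"cough", "fever"}, {"dyspnea"}, set()]
model_labels = [{"cough"}, {"dyspnea", "fever"}, set()]

scores = micro_f1(expert_labels, model_labels)
print(f"F1 = {scores['f1']:.3f}")
```

The same function could score an ICD-10-based baseline by passing the symptom sets derived from diagnosis codes as `predicted`, which would put both methods on the same scale.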

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12313083
DOI: http://dx.doi.org/10.2196/72984

Similar Publications

Patient-reported outcomes after lobectomy vs. segmentectomy for early-stage non-small cell lung cancer.

Surg Endosc

September 2025

Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China.

Background: Surgical resection is the cornerstone for early-stage non-small cell lung cancer (NSCLC), with lobectomy historically standard. Evolving techniques have spurred debate comparing lobectomy and segmentectomy. This study analyzed early postoperative patient-reported symptoms and functional status in patients with early NSCLC undergoing either procedure.

Purpose: The study aims to compare the treatment recommendations generated by four leading large language models (LLMs) with those from 21 sarcoma centers' multidisciplinary tumor boards (MTBs) of the sarcoma ring trial in managing complex soft tissue sarcoma (STS) cases.

Methods: We simulated STS-MTBs using four LLMs: Llama 3.2-vision: 90b, Claude 3.

Background: Clinical communication is central to the delivery of effective, timely, and safe patient care. The use of text-based tools for clinician-to-clinician communication, commonly referred to as secure messaging, has increased exponentially over the past decade. The use of secure messaging has a potential impact on clinician work behaviors, workload, and cognitive burden.

Artificial Intelligence in allergy and immunology: recent developments, implementation challenges, and the road towards clinical impact.

J Allergy Clin Immunol

September 2025

University of Groningen, University Medical Center Groningen, Beatrix Children's Hospital, Department of Pediatric Pulmonology and Pediatric Allergology, Groningen, the Netherlands; University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC)

Artificial intelligence (AI) is increasingly recognized for its capacity to transform medicine. While publications applying AI in allergy and immunology have increased, clinical implementation substantially lags behind other specialties. By mid-2024, over 1,000 FDA-approved AI-enabled medical devices existed, but none specifically addressed allergy and immunology.

[Artificial Intelligence Methods - a Perspective for Cardiovascular Telemedicine?].

Dtsch Med Wochenschr

September 2025

Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité Universitätsmedizin Berlin, Berlin, Deutschland.

Since 2022, an estimated 150,000 to 200,000 patients with heart failure (HF) in Germany have met the inclusion criteria for HF telemonitoring in accordance with the Federal Joint Committee's (G-BA) decision. Currently, only a few artificial intelligence (AI) applications are used in standard cardiovascular telemedicine care. However, AI applications could improve the predictive accuracy of existing telemedical sensor technology by recognising patterns across multiple data sources.
