Large language models (LLMs) have shown promise in analyzing complex textual data, including radiological reports. These models can assist clinicians, particularly those with limited experience, by integrating and presenting diagnostic criteria within radiological classifications. Before clinical adoption, however, LLMs must be rigorously validated by medical professionals to ensure accuracy, especially for advanced radiological classification systems. This study evaluates the performance of four LLMs (ChatGPT-4o, AmbossGPT, Claude 3.5 Sonnet, and Gemini 2.0 Flash) in classifying fractures according to the AO classification system using CT reports. A dataset of 292 fictitious physician-generated CT reports, representing 310 fractures, was used to retrospectively assess the accuracy of each LLM in AO fracture classification. Performance was evaluated by comparing the models' classifications to ground truth labels, with accuracy rates analyzed across different fracture types and subtypes. ChatGPT-4o and AmbossGPT achieved the highest overall accuracy (74.6% and 74.3%, respectively), outperforming Claude 3.5 Sonnet (69.5%) and Gemini 2.0 Flash (62.7%). Statistically significant differences were observed in fracture type classification, particularly between ChatGPT-4o and Gemini 2.0 Flash (Δ12%, p < 0.001). While all models demonstrated strong bone recognition rates (90-99%), their accuracy in fracture subtype classification remained lower (71-77%), indicating limitations in nuanced diagnostic categorization. LLMs show potential in assisting radiologists with initial fracture classification, particularly in high-volume or resource-limited settings. However, their performance remains inconsistent for detailed subtype classification, highlighting the need for further refinement and validation before clinical integration into advanced diagnostic workflows.
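The comparison against ground truth labels described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the simplified `location-TypeSubgroup` code format (for example `23-A2`), the helper names `split_ao` and `level_accuracy`, the rule that a deeper level only counts when the levels above it also match, and the toy labels, which are not data from the study. The study's actual scoring procedure is not given in the abstract.

```python
from collections import Counter

def split_ao(code):
    """Split a simplified AO code such as '23-A2' into (location, type, subtype).

    Assumed format: digits for bone and segment, a hyphen, one type letter,
    then the subgroup digits.
    """
    location, _, morphology = code.partition("-")
    return location, morphology[:1], morphology[1:]

def level_accuracy(predicted, truth):
    """Return the fraction of reports that are correct at each level.

    A level counts as correct only if every level above it is also correct
    (an assumption of this sketch, not a rule taken from the study).
    """
    hits = Counter()
    for pred_code, true_code in zip(predicted, truth):
        p_loc, p_type, p_sub = split_ao(pred_code)
        t_loc, t_type, t_sub = split_ao(true_code)
        if p_loc == t_loc:
            hits["location"] += 1
            if p_type == t_type:
                hits["type"] += 1
                if p_sub == t_sub:
                    hits["subtype"] += 1
    total = len(truth)
    return {level: hits[level] / total for level in ("location", "type", "subtype")}

# Hypothetical toy labels, purely illustrative.
toy_truth = ["23-A2", "31-B1", "42-C3", "44-B2"]
toy_pred = ["23-A2", "31-B2", "42-A1", "43-B2"]

print(level_accuracy(toy_pred, toy_truth))
```

With the toy labels, accuracy falls from location to type to subtype, which is the same qualitative pattern the abstract reports for bone recognition versus subtype classification.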
DOI: http://dx.doi.org/10.1007/s10278-025-01603-6
Surg Endosc
September 2025
Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China.
Background: Surgical resection is the cornerstone for early-stage non-small cell lung cancer (NSCLC), with lobectomy historically standard. Evolving techniques have spurred debate comparing lobectomy and segmentectomy. This study analyzed early postoperative patient-reported symptoms and functional status in patients with early NSCLC undergoing either procedure.
J Cancer Res Clin Oncol
September 2025
Department of Surgery, Mannheim School of Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany.
Purpose: The study aims to compare the treatment recommendations generated by four leading large language models (LLMs) with those from 21 sarcoma centers' multidisciplinary tumor boards (MTBs) of the sarcoma ring trial in managing complex soft tissue sarcoma (STS) cases.
Methods: We simulated STS-MTBs using four LLMs: Llama 3.2-vision:90b, Claude 3.
J Med Internet Res
September 2025
Washington University in St. Louis, 660 South Euclid Avenue, Campus Box 8054, St Louis, MO, United States, 1 3142737801.
Background: Clinical communication is central to the delivery of effective, timely, and safe patient care. The use of text-based tools for clinician-to-clinician communication, commonly referred to as secure messaging, has increased exponentially over the past decade. The use of secure messaging has a potential impact on clinician work behaviors, workload, and cognitive burden.
J Allergy Clin Immunol
September 2025
University of Groningen, University Medical Center Groningen, Beatrix Children's Hospital, Department of Pediatric Pulmonology and Pediatric Allergology, Groningen, the Netherlands; University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC)
Artificial intelligence (AI) is increasingly recognized for its capacity to transform medicine. While publications applying AI in allergy and immunology have increased, clinical implementation substantially lags behind other specialties. By mid-2024, over 1,000 FDA-approved AI-enabled medical devices existed, but none specifically addressed allergy and immunology.
Dtsch Med Wochenschr
September 2025
Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité Universitätsmedizin Berlin, Berlin, Germany.
Since 2022, an estimated 150,000 to 200,000 patients with heart failure (HF) in Germany have met the inclusion criteria for HF telemonitoring in accordance with the Federal Joint Committee's (G-BA) decision. Currently, only a few artificial intelligence (AI) applications are used in standard cardiovascular telemedicine care. However, AI applications could improve the predictive accuracy of existing telemedical sensor technology by recognising patterns across multiple data sources.