Enhancing Diagnostic Precision: Utilising a Large Language Model to Extract U Scores from Thyroid Sonography Reports.

Emma Watts, Omid Pournik, Rose Allington, Xuefei Ding, Kristien Boelaert, Neil Sharma, Leila Ghalichi, Theodoros N Arvanitis

Stud Health Technol Inform

Department of Electronic, Electrical and Systems Engineering, University of Birmingham, UK.

Published: June 2025

This study evaluates the performance of ChatGPT-4, a Large Language Model (LLM), in automatically extracting U scores from free-text thyroid ultrasound reports collected from University Hospitals Birmingham (UHB), UK, between 2014 and 2024. The LLM was provided with guidelines on the U classification system and extracted U scores independently from 14,248 de-identified reports, without access to human-assigned scores. The LLM-extracted scores were compared to initial clinician-assigned and refined U scores provided by expert reviewers. The LLM achieved 97.7% agreement with refined human U scores, successfully identifying the highest U score in 98.1% of reports with multiple nodules. Most discrepancies (2.5%) were linked to ambiguous descriptions, multi-nodule reports, and cases with human-documented uncertainty. While the results demonstrate the potential for LLMs to improve reporting consistency and reduce manual workload, ethical and governance challenges such as transparency, privacy, and bias must be addressed before routine clinical deployment. Embedding LLMs into reporting workflows, such as Online Analytical Processing (OLAP) tools, could further enhance reporting quality and consistency.
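The two headline measures in the abstract, agreement with reference scores and selection of the highest U score in a multi-nodule report, can be illustrated with a minimal Python sketch. This is not the study's pipeline: the study used an LLM for extraction, whereas the regular expression below is only a toy stand-in, and the names `U_SCORE_PATTERN`, `highest_u_score`, and `percent_agreement` are hypothetical. The sketch assumes scores appear in the text as U1 to U5.

```python
import re

# Hypothetical pattern (an assumption, not from the study): matches "U1".."U5",
# allowing one optional whitespace character, e.g. "U 3".
U_SCORE_PATTERN = re.compile(r"\bU\s?([1-5])\b", re.IGNORECASE)


def highest_u_score(report_text):
    """Return the highest U score (1-5) mentioned in a report, or None if absent."""
    scores = [int(digit) for digit in U_SCORE_PATTERN.findall(report_text)]
    return max(scores) if scores else None


def percent_agreement(extracted, reference):
    """Percentage of reports where the extracted score equals the reference score."""
    if len(extracted) != len(reference):
        raise ValueError("score lists must have the same length")
    if not extracted:
        raise ValueError("score lists must not be empty")
    matches = sum(e == r for e, r in zip(extracted, reference))
    return 100.0 * matches / len(extracted)
```

For a report describing several nodules, `highest_u_score` keeps only the most concerning grade, which mirrors the multi-nodule comparison described above. `percent_agreement` is plain exact-match agreement and does not correct for chance.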

DOI: http://dx.doi.org/10.3233/SHTI250672
