Rationale and Objectives: To evaluate the performance, stability, and decision-making behavior of large language models (LLMs) for title and abstract screening in radiology systematic reviews, with attention to prompt framing, confidence calibration, and model robustness under disagreement.
Materials and Methods: We compared five LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Llama 3.3 70B) on two imaging-focused systematic reviews (n = 5438 and n = 267 abstracts) using binary and ternary classification tasks, confidence scoring, and reclassification of true and synthetic disagreements. Disagreements were framed as either "LLM vs human" or "human vs human." We also piloted autonomous PubMed retrieval using OpenAI and Gemini Deep Research tools.
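The paper's prompts and code are not reproduced here; purely as a hedged sketch of what a binary screening call with a self-reported confidence score might look like (the model choice, prompt wording, criteria string, and reply format are all assumptions, not the authors' pipeline), using the OpenAI Python SDK:

```python
# Hedged sketch of binary title/abstract screening with a confidence score.
# Assumptions (not from the paper): OpenAI Python SDK, model choice, prompt
# wording, and an "INCLUDE|EXCLUDE, <0-100>" reply format.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = "Placeholder inclusion criteria for the review."  # hypothetical

def screen_abstract(title: str, abstract: str) -> tuple[str, int]:
    prompt = (
        f"Inclusion criteria: {CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Reply with INCLUDE or EXCLUDE, a comma, then a confidence "
        "score from 0 to 100."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label, conf = resp.choices[0].message.content.split(",", 1)
    return label.strip().upper(), int("".join(c for c in conf if c.isdigit()))
```

A ternary variant would simply add an UNSURE option to the reply format, corresponding to the abstention arm described above.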
Results: LLMs achieved high specificity and variable sensitivity across reviews and tasks, with F1 scores ranging from 0.389 to 0.854. Ternary classification showed low abstention rates (<5%) and modest sensitivity gains. Confidence scores were significantly higher for correct predictions. In disagreement tasks, models more often selected the human label when disagreements were framed as "LLM vs human," consistent with authority bias. GPT-4o showed greater resistance to this effect, while others were more prone to defer to perceived human input. In the autonomous search task, OpenAI achieved moderate recall and high precision; Gemini's recall was poor but precision remained high.
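For reference, the sensitivity, specificity, and F1 values reported above all derive from confusion-matrix counts; a small illustrative helper (with made-up counts, not the study's data) makes the relationships explicit:

```python
# Sensitivity, specificity, precision, and F1 from confusion-matrix counts.
# Illustrative only; the example counts below are hypothetical.
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    sensitivity = tp / (tp + fn)   # recall: relevant abstracts found
    specificity = tn / (tn + fp)   # irrelevant abstracts correctly rejected
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Hypothetical counts showing the typical pattern of high specificity
# with more moderate sensitivity:
print(screening_metrics(tp=180, fp=40, tn=5100, fn=118))
```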
Conclusion: LLMs hold promise for systematic review screening tasks but require careful prompt design and circumspect human-in-the-loop oversight to ensure robust performance.
DOI: http://dx.doi.org/10.1016/j.acra.2025.08.014
PLoS One
September 2025
Centre for Experimental Pathogen Host Research, School of Medicine, University College Dublin, Dublin, Ireland.
Background: Acute viral respiratory infections (AVRIs) rank among the most common causes of hospitalisation worldwide, imposing significant healthcare burdens and driving the development of pharmacological treatments. However, inconsistent outcome reporting across clinical trials limits evidence synthesis and its translation into clinical practice. A core outcome set (COS) for pharmacological treatments in hospitalised adults with AVRIs is essential to standardise trial outcomes and improve research comparability.
IEEE Comput Graph Appl
September 2025
Autonomous agents powered by large language models are transforming AI, creating an imperative for the visualization field. However, our field's focus on a human in the sensemaking loop raises critical questions about autonomy, delegation, and coordination: how can such agentic visualization preserve human agency while amplifying analytical capabilities? This paper addresses these questions by reinterpreting existing visualization systems with semi-automated or fully automatic AI components through an agentic lens.
Drug Saf
September 2025
The MITRE Corporation, 202 Burlington Rd, Bedford, MA, 01730, USA.
Acta Neurochir (Wien)
September 2025
Department of Neurosurgery, Istinye University, Istanbul, Turkey.
Background: Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students and residents preparing for examinations. These studies, especially those using multiple-choice questions, find that LLMs' knowledge level and response consistency are generally acceptable, although further optimization is needed in areas such as case discussion, interpretation, and language proficiency. This study therefore aimed to evaluate the performance of six distinct LLMs on Turkish and English neurosurgery multiple-choice questions and to assess their accuracy and consistency in a specialized medical context.
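The abstract does not specify how consistency was measured; as a hedged illustration only, one common approach is to pose the same question repeatedly and report the modal answer's agreement rate. In the sketch below, ask_llm is a hypothetical callable standing in for any chat-completion wrapper that returns a single answer letter:

```python
# Minimal sketch of a repeated-query consistency check for multiple-choice
# questions. ask_llm is a hypothetical stand-in for any LLM call returning
# one answer letter; nothing here comes from the study itself.
from collections import Counter
from typing import Callable

def answer_consistency(ask_llm: Callable[[str], str],
                       question: str, runs: int = 5) -> tuple[str, float]:
    answers = [ask_llm(question).strip().upper() for _ in range(runs)]
    modal, count = Counter(answers).most_common(1)[0]
    return modal, count / runs  # modal answer and its agreement rate

# Usage with a trivial stub that always answers "B":
print(answer_consistency(lambda q: "B", "Sample question?"))  # ('B', 1.0)
```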