Benchmarking large language models for biomedical natural language processing applications and recommendations.

Qingyu Chen , Yan Hu , Xueqing Peng , Qianqian Xie , Qiao Jin , Aidan Gilson , Maxwell B Singer , Xuguang Ai , Po-Ting Lai , Zhizheng Wang , Vipina K Keloth , Kalpana Raja , Jimin Huang , Huan He , Fongci Lin , Jingcheng Du , Rui Zhang , W Jim Zheng , Ron A Adelman , Zhiyong Lu

Nat Commun

Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Published: April 2025

Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs-GPT and LLaMA representatives-on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, hallucinations, and perform cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11972378	PMC
http://dx.doi.org/10.1038/s41467-025-56989-2	DOI Listing

Publication Analysis

Top Keywords

large language

language models

biomedical natural

natural language

language processing

traditional fine-tuning

missing hallucinations

llms

benchmarking large

language

Similar Publications

Protocol for a core outcome set for pharmacological treatments in hospitalised patients with acute viral respiratory infections (COSAVRI).

PLoS One

September 2025

Centre for Experimental Pathogen Host Research, School of Medicine, University College Dublin, Dublin, Ireland.

Declan Devane , Matthias Briel , Sanjay Bhagani , Nadine Boesten , Stephanie Buchholz

Background: Acute viral respiratory infections (AVRIs) rank among the most common causes of hospitalisation worldwide, imposing significant healthcare burdens and driving the development of pharmacological treatments. However, inconsistent outcome reporting across clinical trials limits evidence synthesis and its translation into clinical practice. A core outcome set (COS) for pharmacological treatments in hospitalised adults with AVRIs is essential to standardise trial outcomes and improve research comparability.

View Article and Find Full Text PDF

Similar Publications

Agentic Visualization: Extracting Agent-based Design Patterns from Visualization Systems.

IEEE Comput Graph Appl

September 2025

Vaishali Dhanoa , Anton Wolter , Gabriela Molina Leon , Hans-Jorg Schulz , Niklas Elmqvist

Autonomous agents powered by Large Language Models are transforming AI, creating an imperative for the visualization area. However, our field's focus on a human in the sensemaking loop raises critical questions about autonomy, delegation, and coordination for such agentic visualization that preserve human agency while amplifying analytical capabilities. This paper addresses these questions by reinterpreting existing visualization systems with semi-automated or fully automatic AI components through an agentic lens.

View Article and Find Full Text PDF

Similar Publications

The Promise and Challenge of Large Language Models for Pharmacovigilance.

Drug Saf

September 2025

The MITRE Corporation, 202 Burlington Rd, Bedford, MA, 01730, USA.

Lynette Hirschman

View Article and Find Full Text PDF

Similar Publications

Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies.

Acta Neurochir (Wien)

September 2025

Department of Neurosurgery, Istinye University, Istanbul, Turkey.

Mahmut Çamlar , Umut Tan Sevgi , Gökberk Erol , Furkan Karakaş , Yücel Doğruel

Background: Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students or residents when preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of the LLMs are generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs for Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.

View Article and Find Full Text PDF

Similar Publications