GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model.

Cem Simsek, Mete Ucdal, Enrique de-Madaria, Alanna Ebigbo, Petr Vanek, Omar Elshaarawy, Theodor Alexandru Voiosu, Giulio Antonelli, Román Turró, Javier P Gisbert, Olga P Nyssen, Cesare Hassan, Helmut Messmann, Rajiv Jalan

Endosc Int Open

University College Hospital London Medical School, London, United Kingdom of Great Britain and Northern Ireland.

Published: August 2025

Background and Study Aims: Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarization roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task, clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.

Methods: In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and overall performance across 10 simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardized prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.

Results: A total of 2,240 expert ratings were obtained. GastroGPT achieved significantly higher mean overall scores (8.1 ± 1.8) compared with GPT-4 (5.2 ± 3.0), Bard (5.7 ± 3.3), and Claude (7.0 ± 2.7) (all P < 0.001). It outperformed comparators in six of seven tasks (P < 0.05), the exception being follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus general models (97.4–260.35) (P < 0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (P < 0.001). Multivariate analysis revealed that model type significantly predicted performance (P < 0.001).
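As a minimal sketch of how summary statistics of the kind reported above (mean ± SD and variance per model) can be computed from 10-point ratings, the Python below uses only the standard library. The model labels and ratings are invented placeholders, not study data, and the abstract does not specify how the authors aggregated ratings or computed variance, so this should not be read as their method.

```python
from statistics import mean, stdev, variance

# Hypothetical 10-point ratings keyed by placeholder model labels.
# These numbers are invented for illustration; they are NOT study data.
toy_ratings = {
    "model_x": [8, 9, 7, 8],
    "model_y": [5, 2, 9, 4],
}

def summarize(ratings):
    """Return (mean, sample SD, sample variance) for a list of ratings."""
    return mean(ratings), stdev(ratings), variance(ratings)

# One summary tuple per model label.
summary = {name: summarize(vals) for name, vals in toy_ratings.items()}

for name, (m, sd, var) in summary.items():
    print(f"{name}: {m:.1f} ± {sd:.1f} (variance {var:.2f})")
```

In this toy input both models have similar rating counts but different spread, which is the property the variance comparison in the Results is describing.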

Conclusions: This study pioneered the development of a specialty-specific, clinically oriented AI model and its comparison with general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential for tailored, task-focused AI models in medicine.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371664
DOI: http://dx.doi.org/10.1055/a-2637-2163

Similar Publications

Comparative analysis of accuracy and completeness in standardized database generation for complex multilingual lung cancer pathological reports: large language model-based assisted diagnosis system vs. DeepSeek, GPT-3.5, and healthcare professionals with varied professional titles, with task load variation assessment among medical staff.

Front Med (Lausanne)

August 2025

Department of Oncology, Shanghai Lung Cancer Center, Shanghai Chest Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.

Hao Hang, Liankai Yang, Zhongjie Wang, Zhebing Lin, Pengchong Li

Background: This study evaluates how AI enhances EHR efficiency by comparing a lung cancer-specific LLM with general-purpose models (DeepSeek, GPT-3.5) and clinicians across expertise levels, assessing accuracy and completeness in complex lung cancer pathology documentation and task load changes pre-/post-AI implementation.

Methods: This study analyzed 300 lung cancer cases (Shanghai Chest Hospital) and 60 TCGA cases, split into training/validation/test sets.

Out-of-the-box bioinformatics capabilities of large language models (LLMs).

bioRxiv

August 2025

Varsha Rajesh, Geoffrey H Siwo

Large Language Models (LLMs), AI agents and co-scientists promise to accelerate scientific discovery across fields ranging from chemistry to biology. Bioinformatics, the analysis of DNA, RNA and protein sequences, plays a crucial role in biological research and is especially amenable to AI-driven automation given its computational nature. Here, we assess the bioinformatics capabilities of three popular general-purpose LLMs on a set of tasks covering basic analytical questions that include code writing and multi-step reasoning in the domain.

Agentic LLM-based robotic systems for real-world applications: a review on their agenticness and ethics.

Front Robot AI

August 2025

Information Technologies Institute, The Centre for Research and Technology Hellas, Thessaloniki, Greece.

Emmanuel K Raptis, Athanasios Ch Kapoutsis, Elias B Kosmatopoulos

Agentic AI refers to autonomous systems that can perceive their environment, make decisions, and take actions to achieve goals with minimal or no human intervention. Recent advances in Large Language Models (LLMs) have opened new pathways to imbue robots with such "agentic" behaviors by leveraging the LLMs' vast knowledge and reasoning capabilities for planning and control. This survey provides the first comprehensive exploration of LLM-based robotic systems integration into agentic behaviors that have been validated in real-world applications.

A comprehensive evaluation of large language models for information extraction from unstructured electronic health records in residential aged care.

Comput Biol Med

August 2025

School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, Australia.

Dinithi Vithanage, Ping Yu, Qianqian Xie, Hua Xu, Lei Wang

Despite rapid healthcare digitization, extracting information from unstructured electronic health records (EHRs), such as nursing notes, remains challenging due to inconsistencies and ambiguities in clinical documentation. Generative large language models (LLMs) have emerged as promising tools for automating information extraction (IE); however, their application in real-world clinical settings, such as residential aged care (RAC), is limited by critical gaps. Prior studies have often focused on structured EHR data and conventional evaluation metrics such as accuracy and F1 score, overlooking critical aspects like robustness, fairness, bias, and contextual relevance, particularly in unstructured clinical narratives.

Precision Oncology Through Dialogue: AI-HOPE-RTK-RAS Integrates Clinical and Genomic Insights into RTK-RAS Alterations in Colorectal Cancer.

Biomedicines

July 2025

Department of Integrative Translational Sciences, Beckman Research Institute of City of Hope, Duarte, CA 91010, USA.

Ei-Wen Yang, Brigette Waldrup, Enrique Velazquez-Villarreal

The RTK-RAS signaling cascade is a central axis in colorectal cancer (CRC) pathogenesis, governing cellular proliferation, survival, and therapeutic resistance. Somatic alterations in key pathway genes, including KRAS, NRAS, BRAF, and EGFR, are pivotal to clinical decision-making in precision oncology. However, the integration of these genomic events with clinical and demographic data remains hindered by fragmented resources and a lack of accessible analytical frameworks.
