Accuracy and Reliability of Chatbot Responses to Physician Questions.

Rachel S Goodman , J Randall Patrinely , Cosby A Stone , Eli Zimmerman , Rebecca R Donald , Sam S Chang , Sean T Berkowitz , Avni P Finn , Eiman Jahangir , Elizabeth A Scoville , Tyler S Reese , Debra L Friedman , Julie A Bastarache , Yuri F van der Heijden , Jordan J Wright , Fei Ye , Nicholas Carter , Matthew R Alexander , Jennifer H Choe , Cody A Chastain

JAMA Netw Open

Department of Hematology/Oncology, Vanderbilt University Medical Center, Nashville, Tennessee.

Published: October 2023

Importance: Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency.

Objective: To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence-generated medical information.

Design, Setting, and Participants: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023.
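
The analysis described above (descriptive summaries plus a rank-based two-group comparison) can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not state which software was used, the function names `summarize`, `average_ranks`, and `mann_whitney_u` are hypothetical, and the example scores are invented for demonstration. The P value uses the normal approximation with a tie correction and no continuity correction, and the IQR uses one of several common quartile definitions, so results may differ slightly from a statistics package.

```python
from math import erfc, sqrt
from statistics import mean, median, quantiles, stdev


def summarize(scores):
    """Return median, IQR bounds, mean, and sample SD for Likert scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {"median": median(scores), "iqr": (q1, q3),
            "mean": mean(scores), "sd": stdev(scores)}


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(a), len(b)
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2  # U statistic for group a
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every score identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    return u1, erfc(abs(z) / sqrt(2))  # two-sided P value


# Invented accuracy scores (1-6 scale) for two hypothetical question types.
binary_scores = [6, 6, 5, 4, 6, 3, 6, 5]
descriptive_scores = [5, 4, 6, 2, 5, 3, 4, 5]

print(summarize(binary_scores))
print(summarize(descriptive_scores))
print(mann_whitney_u(binary_scores, descriptive_scores))
```

A rank-based test suits these data because Likert scores are ordinal and heavily tied, which is also why the tie correction matters here. Comparing three groups (easy, medium, and hard) would call for the Kruskal-Wallis test instead, which this sketch does not implement.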

Main Outcomes and Measures: Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses.

Results: Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5), were regenerated and rescored using version 4 with improvement (mean accuracy [SD] score, 5.2 [1.5] vs 5.7 [0.8]; median score, 6.0 [IQR, 5.0-6.0] for original and 6.0 [IQR, 6.0-6.0] for rescored; P = .002).

Conclusions and Relevance: In this cross-sectional study, the chatbot generated largely accurate information in response to diverse medical queries, as judged by academic physician specialists, with improvement over time, although it had important limitations. Further research and model development are needed to correct inaccuracies and for validation.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10546234
DOI: http://dx.doi.org/10.1001/jamanetworkopen.2023.36483
