Token Probabilities to Mitigate Large Language Models Overconfidence in Answering Medical Questions: Quantitative Study

Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Background: Chatbots have demonstrated promising capabilities in medicine, scoring passing grades for board examinations across various specialties. However, their tendency to express high levels of confidence in their responses, even when incorrect, poses a limitation to their utility in clinical settings.

Objective: The aim of the study is to examine whether token probabilities outperform chatbots' expressed confidence levels in predicting the accuracy of their responses to medical questions.

Methods: In total, 9 large language models, comprising both commercial models (GPT-3.5, GPT-4, and GPT-4o) and open-source models (Llama 3.1-8b, Llama 3.1-70b, Phi-3-Mini, Phi-3-Medium, Gemma 2-9b, and Gemma 2-27b), were prompted to respond to a set of 2522 questions from the United States Medical Licensing Examination (MedQA database). Additionally, the models rated their confidence from 0 to 100, and the token probability of each response was extracted. The models' success rates were measured, and the predictive performance of both expressed confidence and response token probability in predicting response accuracy was evaluated using area under the receiver operating characteristic curves (AUROCs), adapted calibration error, and Brier score. Sensitivity analyses were conducted using additional questions sourced from other databases in English (MedMCQA: n=2797), Chinese (MedQA Mainland China: n=3413 and Taiwan: n=2808), and French (FrMedMCQA: n=1079), as well as different prompting strategies and temperature settings.
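The abstract does not include the authors' analysis code, but the comparison it describes is straightforward to sketch. The Python example below uses synthetic toy numbers, not the study's data, and scores both signals with scikit-learn's AUROC and Brier score; the paper's "adapted calibration error" is approximated here by a standard binned expected calibration error, which is a related but distinct estimator.

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Hypothetical toy data, not the study's results:
# correct    -> 1 if the model answered the question correctly, else 0
# token_prob -> probability the model assigned to its chosen answer token
# stated     -> the model's self-rated confidence (0-100), rescaled to [0, 1]
correct = np.array([1, 0, 1, 1, 0, 1])
token_prob = np.array([0.92, 0.55, 0.88, 0.97, 0.61, 0.80])
stated = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])  # "100" on every question

for name, score in [("token probability", token_prob), ("expressed confidence", stated)]:
    print(f"{name}: AUROC={roc_auc_score(correct, score):.2f}, "
          f"Brier={brier_score_loss(correct, score):.3f}")

def binned_calibration_error(y, p, bins=10):
    """Mean |accuracy - confidence| over equal-width probability bins."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & ((p < hi) | (hi == 1.0))
        if mask.any():
            err += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return err

print("calibration error (token probability):",
      round(binned_calibration_error(correct, token_prob), 3))

With a constant expressed confidence, the AUROC collapses to 0.5: a score that never varies cannot rank correct responses above incorrect ones, which mirrors the paper's finding that self-rated certainty carries little discriminative signal.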

Results: Overall, mean accuracy ranged from 56.5% (95% CI 54.6-58.5) for Phi-3-Mini to 89% (95% CI 87.7-90.2) for GPT-4o. Across the United States Medical Licensing Examination questions, all chatbots consistently expressed high levels of confidence in their responses (ranging from 90, 95% CI 90-90 for Llama 3.1-70b to 100, 95% CI 100-100 for GPT-3.5). However, expressed confidence failed to predict response accuracy (AUROC ranging from 0.52, 95% CI 0.50-0.53 for Phi-3-Mini to 0.68, 95% CI 0.65-0.71 for GPT-4o). In contrast, the response token probability consistently outperformed expressed confidence for predicting response accuracy (AUROCs ranging from 0.71, 95% CI 0.69-0.73 for Phi-3-Mini to 0.87, 95% CI 0.85-0.89 for GPT-4o; all P<.001). Furthermore, all models demonstrated imperfect calibration, with a general trend toward overconfidence. These findings were consistent in sensitivity analyses.

Conclusions: Due to the limited capacity of chatbots to accurately evaluate their confidence when responding to medical queries, clinicians and patients should abstain from relying on their self-rated certainty. Instead, token probabilities emerge as a promising and easily accessible alternative for gauging the inner doubts of these models.
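The conclusions call token probabilities "easily accessible"; for open-weights models they can be read directly from the next-token distribution. Below is a minimal sketch using Hugging Face transformers, in which the model name and the prompt are placeholder assumptions, not the study's exact setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any open-weights causal LM from the study's list would work.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical multiple-choice prompt ending right before the answer letter.
prompt = (
    "Question: <USMLE-style stem>\n"
    "A) ... B) ... C) ... D) ...\n"
    "Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
probs = torch.softmax(next_token_logits, dim=-1)

# Probability mass on each option letter; the chosen letter's probability
# is the "response token probability" used as a confidence signal.
for letter in "ABCD":
    token_id = tok.encode(" " + letter, add_special_tokens=False)[0]
    print(letter, f"{probs[token_id].item():.3f}")

For API-only models, the same quantity is typically exposed as per-token log probabilities in the completion response, where available.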


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12396779
DOI: http://dx.doi.org/10.2196/64348

Publication Analysis

Top Keywords

expressed confidence: 16
token probability: 12
response accuracy: 12
token probabilities: 8
large language: 8
language models: 8
high levels: 8
levels confidence: 8
confidence responses: 8
Llama 3.1-70b: 8
