Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine.

Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M Cheung, Robert Chen, Ronald M Summers, Justin F Rousseau, Peiyun Ni, Marc J Landsman, Sally L Baxter, Subhi J Al'Aref, Yijia Li, Alexander Chen, Josef A Brejt, Michael F Chiang, Yifan Peng, Zhiyong Lu

ArXiv

National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Published: August 2024

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations focused primarily on multi-choice accuracy alone. Our study extends the current scope with a comprehensive analysis of GPT-4V's rationales across image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well on cases that physicians answer incorrectly, with over 78% accuracy. However, we found that GPT-4V frequently presents flawed rationales even in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multi-choice questions, our findings emphasize the need for further in-depth evaluation of its rationales before such multimodal AI models are integrated into clinical workflows.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10896362
