Background: The peer review process faces challenges of reviewer fatigue and bias. Artificial intelligence (AI) may help address these issues, but its application in the oral and maxillofacial surgery peer review process remains unexplored.
Purpose: The purpose of the study was to measure and compare manuscript review performance among 4 large language models and human reviewers. Large language models are AI systems trained on vast text datasets that can generate human-like responses.
Study Design/Setting/Sample: In this cross-sectional study, we evaluated original research articles submitted to the Journal of Oral and Maxillofacial Surgery between January and December 2023. Manuscripts were randomly selected from all submissions that received at least one external peer review.
Predictor Variable: The predictor variable was source of review: human reviewers or AI models. We tested 4 AI models: Generative Pretrained Transformer-4o and Generative Pretrained Transformer-o1 (OpenAI, San Francisco, CA), Claude (version 3.5; Anthropic, San Francisco, CA), and Gemini (version 1.5; Google, Mountain View, CA). These models will be referred to by their architectural design characteristics, ie, dense transformers, sparse-expert, multimodal, and base transformer, to highlight their technical differences rather than their commercial identities.
Outcome Variables: Primary outcomes included reviewer recommendations (accept = 3 to reject = 0) and responses to 6 Journal of Oral and Maxillofacial Surgery editor questions. Secondary outcomes comprised temporal stability (consistency of AI evaluations over time) analysis, domain-specific assessments (methodology, statistical analysis, clinical relevance, originality, and presentation clarity; 1 to 5 scale), and model clustering patterns.
Analyses: Agreement between AI and human recommendations was assessed using weighted Cohen's kappa. Intermodel reliability and temporal stability (24-hour interval) were evaluated using intraclass correlation coefficients. Domain scoring patterns were analyzed using multivariate analysis of variance with post hoc comparisons and hierarchical clustering.
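For readers who want to reproduce this style of agreement analysis, the sketch below shows how weighted Cohen's kappa (AI versus human recommendations) and an intraclass correlation coefficient (temporal stability across two scoring passes) might be computed in Python with scikit-learn and pingouin. All manuscript scores are invented for illustration, and linear kappa weights are an assumption; the abstract does not specify the weighting scheme or software the authors used.

```python
# Minimal sketch, not the authors' code: all scores below are hypothetical;
# only the statistics mirror the Analyses section of the abstract.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Hypothetical recommendations for 8 manuscripts (accept = 3 ... reject = 0)
human = [0, 0, 1, 0, 2, 1, 0, 3]
ai    = [1, 0, 2, 1, 2, 1, 1, 3]

# Weighted kappa penalizes larger disagreements more heavily, which suits
# the ordinal accept/reject recommendation scale. Linear weights are an
# assumption; the abstract does not state the weighting scheme.
kappa = cohen_kappa_score(human, ai, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.2f}")

# Temporal stability: the same model re-scores each manuscript after a
# 24-hour interval; an ICC quantifies the consistency of the two passes.
scores = pd.DataFrame({
    "manuscript": list(range(8)) * 2,
    "timepoint":  ["t0"] * 8 + ["t24"] * 8,
    "rating":     ai + [1, 0, 2, 1, 3, 1, 1, 3],  # hypothetical re-scores
})
icc = pg.intraclass_corr(data=scores, targets="manuscript",
                         raters="timepoint", ratings="rating")
print(icc[["Type", "ICC"]])
```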
Results: From 22 manuscripts, human reviewers rejected 15 (68.2%), while AI rejection rates were statistically significantly lower (0% to 9.1%, P < .001). AI models demonstrated high consistency in their evaluations over time (intraclass correlation coefficient = 0.88, P < .001) and showed moderate agreement with human decisions (κ = 0.38 to 0.46).
Conclusions: While AI models showed reliable internal consistency, they were less likely to recommend rejection than human reviewers. This suggests their optimal use is as screening tools complementing expert human review rather than as replacements.
DOI: http://dx.doi.org/10.1016/j.joms.2025.03.015