Background: The peer review process faces challenges of reviewer fatigue and bias. Artificial intelligence (AI) may help address these issues, but its application in the oral and maxillofacial surgery peer review process remains unexplored.
Purpose: The purpose of the study was to measure and compare manuscript review performance among 4 large language models and human reviewers. Large language models are AI systems trained on vast text datasets that can generate human-like responses.
Study Design/Setting/Sample: In this cross-sectional study, we evaluated original research articles submitted to the Journal of Oral and Maxillofacial Surgery between January and December 2023. Manuscripts were randomly selected from all submissions that received at least one external peer review.
Predictor Variable: The predictor variable was source of review: human reviewers or AI models. We tested 4 AI models: Generative Pretrained Transformer-4o and Generative Pretrained Transformer-o1 (OpenAI, San Francisco, CA), Claude (version 3.5; Anthropic, San Francisco, CA), and Gemini (version 1.5; Google, Mountain View, CA). These models will be referred to by their architectural design characteristics, ie, dense transformers, sparse-expert, multimodal, and base transformer, to highlight their technical differences rather than their commercial identities.
Outcome Variables: Primary outcomes included reviewer recommendations (accept = 3 to reject = 0) and responses to 6 Journal of Oral and Maxillofacial Surgery editor questions. Secondary outcomes comprised temporal stability (consistency of AI evaluations over time) analysis, domain-specific assessments (methodology, statistical analysis, clinical relevance, originality, and presentation clarity; 1 to 5 scale), and model clustering patterns.
Analyses: Agreement between AI and human recommendations was assessed using weighted Cohen's kappa. Intermodel reliability and temporal stability (24-hour interval) were evaluated using intraclass correlation coefficients. Domain scoring patterns were analyzed using multivariate analysis of variance with post hoc comparisons and hierarchical clustering.
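For readers who want to reproduce this style of agreement analysis, the sketch below shows how weighted Cohen's kappa (AI versus human recommendations) and an intraclass correlation coefficient (temporal stability across two scoring passes) might be computed in Python with scikit-learn and pingouin. All manuscript scores are invented for illustration, and linear kappa weights are an assumption; the abstract does not specify the weighting scheme or software the authors used.

```python
# Minimal sketch, not the authors' code: all scores below are hypothetical;
# only the statistics mirror the Analyses section of the abstract.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Hypothetical recommendations for 8 manuscripts (accept = 3 ... reject = 0)
human = [0, 0, 1, 0, 2, 1, 0, 3]
ai    = [1, 0, 2, 1, 2, 1, 1, 3]

# Weighted kappa penalizes larger disagreements more heavily, which suits
# the ordinal accept/reject recommendation scale. Linear weights are an
# assumption; the abstract does not state the weighting scheme.
kappa = cohen_kappa_score(human, ai, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.2f}")

# Temporal stability: the same model re-scores each manuscript after a
# 24-hour interval; an ICC quantifies the consistency of the two passes.
scores = pd.DataFrame({
    "manuscript": list(range(8)) * 2,
    "timepoint":  ["t0"] * 8 + ["t24"] * 8,
    "rating":     ai + [1, 0, 2, 1, 3, 1, 1, 3],  # hypothetical re-scores
})
icc = pg.intraclass_corr(data=scores, targets="manuscript",
                         raters="timepoint", ratings="rating")
print(icc[["Type", "ICC"]])
```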
Results: From 22 manuscripts, human reviewers rejected 15 (68.2%), while AI rejection rates were statistically significantly lower (0% to 9.1%, P < .001). AI models demonstrated high consistency in their evaluations over time (intraclass correlation coefficient = 0.88, P < .001) and showed moderate agreement with human decisions (κ = 0.38 to 0.46).
Conclusions: While AI models showed reliable internal consistency, they were less likely to recommend rejection than human reviewers. This suggests their optimal use is as screening tools complementing expert human review rather than as replacements.
DOI: http://dx.doi.org/10.1016/j.joms.2025.03.015