Purpose: This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.
Materials and Methods: Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (N = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using the Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the human-human pairing with the human-AI pairings.
Results: A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing "good" to "excellent" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into "poor" or negative ranges (e.g., ICC = -.106 for "Logical structure"), indicating that its evaluation logic differs fundamentally from expert reasoning.
Discussion: The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a "cognitive assistant" to support expert workflows. The assessment of child testimony credibility remains a task that deeply requires professional judgment and appears far beyond the current capabilities of such generative AI models.
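The two agreement statistics named in the Methods can be illustrated with a minimal sketch. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here. The scores below are invented for illustration on an assumed 0-2 scale for a single criterion and are not data from the study.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters giving categorical scores to the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(a)
    # Observed agreement: share of items given the same score by both raters.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per transcript and one column per rater.
    The choice of this ICC form is an assumption, not taken from the study.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: transcripts, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical scores for one criterion across eight transcripts.
human_a = [2, 1, 0, 2, 1, 0, 2, 1]
human_b = [2, 1, 0, 2, 1, 1, 2, 1]
model = [0, 2, 1, 1, 0, 2, 0, 2]

for label, other in (("human-human", human_b), ("human-model", model)):
    pairs = [list(p) for p in zip(human_a, other)]
    print(label, round(cohen_kappa(human_a, other), 3), round(icc_2_1(pairs), 3))
```

With these invented scores the human-human pair yields high values, while the human-model pair yields negative ones. A negative coefficient arises when the raters' scores vary more within a transcript than between transcripts, which is the pattern the Results describe for some criteria.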
DOI: http://dx.doi.org/10.1080/26408066.2025.2547211
High Alt Med Biol
September 2025
International Commission for Mountain Emergency Medicine (ICAR MEDCOM), Zurich, Switzerland.
McLaughlin, Kyle, Charley Shimanski, Ken Zafren, Ian Jackson, Gerold Biner, Maurizio Folini, Andreas Hermansky, Eric Ridington, Peter Hicks, Giacomo Strapazzon, Marika Falla, Alastair Hopper, Dave Weber, Ryan Jackson, and Hermann Brugger. Helicopter rescue at very high altitude: Recommendations of the International Commission for Mountain Emergency Medicine (ICAR MedCom) 2025. 00:00-00, 2025.
Endocrinol Diabetes Metab
September 2025
Department of Otolaryngology-Head and Neck Surgery, University of Texas Southwestern Medical Center, Dallas, Texas, USA.
Objective(s): To evaluate the quality, reliability and accuracy of hyperthyroidism-related content on TikTok using validated assessment tools.
Methods: We systematically searched TikTok for 'hyperthyroid' and 'high thyroid', analysing 115 videos after exclusions. Two independent researchers assessed videos using the Global Quality Scale (GQS, range 0-5) for overall content quality, the modified DISCERN (mDISCERN, range 0-5) for reliability and the Accuracy in Digital Information (ANDI, range 0-4) tool for factual correctness.
J Cancer Res Clin Oncol
September 2025
Division of Gastroenterology, Department of Medicine, Asahikawa Medical University, Asahikawa, Japan.
Purpose: Next-generation sequencing (NGS) has revolutionized cancer treatment by enabling comprehensive cancer genomic profiling (CGP) to guide genotype-directed therapies. While several prospective trials have demonstrated varying outcomes with CGP in patients with advanced solid tumors, its clinical utility in colorectal cancer (CRC) remains to be evaluated.
Methods: We conducted a prospective observational study of CGP in our hospital between September 2019 and March 2024.
Mamm Genome
September 2025
Department of Animal Health and Anatomy, Center for Animal Biotechnology and Gene Therapy, Universitat Autònoma de Barcelona, Travessera Dels Turons, 08193, Cerdanyola del Vallès, Barcelona, Spain.
The mouse remains the principal animal model for investigating human diseases due, among other reasons, to its anatomical similarities to humans. Despite its widespread use, the assumption that mouse anatomy is a fully established field with standardized and universally accepted terminology is misleading. Many phenotypic anatomical annotations do not refer to the authority or origin of the terminology used, while others inappropriately adopt outdated or human-centric nomenclature.
Nat Commun
September 2025
Institute of Computational Biology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany.
Atherosclerosis, a major cause of cardiovascular diseases, is characterized by the buildup of lipids and chronic inflammation in the arteries, leading to plaque formation and potential rupture. Despite recent advances in single-cell transcriptomics (scRNA-seq), the underlying immune mechanisms and transformations in structural cells driving plaque progression remain incompletely defined. Existing datasets often lack comprehensive coverage and consistent annotations, limiting the utility of downstream analyses.