Can a Large Language Model Judge a Child's Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment.

Zeki Karataş

J Evid Based Soc Work (2019)

Department of Social Work, Faculty of Health Sciences, Recep Tayyip Erdoğan University, Rize, Turkey.

Published: August 2025

Purpose: This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.

Materials and Methods: Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (N = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using the Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments of the human-human pairing with those of the human-AI pairings.
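As an illustration of the two named statistics, the following is a minimal sketch of how agreement on a single CBCA criterion could be computed for one pair of raters. It is not the study's analysis code. The abstract does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is assumed here, and the function names `cohen_kappa` and `icc_2_1` are hypothetical.

```python
from collections import Counter


def cohen_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters' categorical codes on the same statements."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same non-empty set of statements")
    n = len(codes_a)
    # Observed proportion of statements on which the two raters agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


def icc_2_1(ratings):
    """ICC(2,1) (assumed form): one row per statement, one column per rater."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: statements, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-statement mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

In this sketch, each criterion would be scored once per rater pairing (psychologist and social worker, psychologist and model, social worker and model), which gives the human-human and human-AI coefficients the abstract compares. A negative ICC arises when the residual mean square exceeds the between-statement mean square.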

Results: A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing "good" to "excellent" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into "poor" or negative ranges (e.g., ICC = -.106 for "Logical structure"), indicating that its evaluation logic differs fundamentally from expert reasoning.
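The verbal labels in the results can be made concrete with a small helper. This is a sketch only: the abstract confirms that ICC > .75 counts as "good" to "excellent" and that negative values are "poor", but the remaining cut-offs (.50 and .90) are assumed from commonly cited guidelines and may not match the ones the authors applied. The names `icc_band` and `count_good_or_better` are hypothetical.

```python
def icc_band(icc):
    """Map an ICC value to a verbal label (assumed cut-offs: .50, .75, .90)."""
    if icc > 0.90:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"  # includes negative coefficients


def count_good_or_better(icc_by_criterion):
    """Count criteria whose ICC exceeds .75, the threshold quoted in the abstract."""
    return sum(1 for value in icc_by_criterion.values() if value > 0.75)
```

Under these assumed bands, the reported coefficient of -.106 for "Logical structure" falls in the "poor" band, and applying `count_good_or_better` to the 19 human-human coefficients would yield the figure the abstract summarizes as 15 of 19.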

Discussion: The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a "cognitive assistant" to support expert workflows. The assessment of child testimony credibility remains a task that deeply requires professional judgment and appears far beyond the current capabilities of such generative AI models.

DOI: http://dx.doi.org/10.1080/26408066.2025.2547211
