Can a Large Language Model Judge a Child's Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment.

J Evid Based Soc Work (2019)

Department of Social Work, Faculty of Health Sciences, Recep Tayyip Erdoğan University, Rize, Turkey.

Published: August 2025



Article Abstract

Purpose: This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.

Materials and Methods: Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (N = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using the intraclass correlation coefficient (ICC), Cohen's kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.
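As a rough illustration of how such pairwise agreement statistics can be computed (a minimal sketch, not the authors' actual analysis pipeline), the Python snippet below scores one hypothetical CBCA criterion across three raters; the toy ratings, column names, and the choice of the pingouin and scikit-learn libraries are all assumptions for demonstration.

```python
# Minimal sketch: inter-rater agreement for one CBCA criterion.
# All data below are hypothetical; this is not the study's dataset.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

# Long-format ratings: 10 hypothetical statements x 3 raters.
ratings = pd.DataFrame({
    "statement": list(range(1, 11)) * 3,
    "rater": ["psychologist"] * 10 + ["social_worker"] * 10 + ["llm"] * 10,
    "score": [2, 1, 0, 2, 1, 1, 0, 2, 1, 0,   # forensic psychologist
              2, 1, 0, 2, 1, 1, 1, 2, 1, 0,   # social worker
              1, 2, 1, 0, 2, 0, 2, 1, 0, 2],  # LLM
})

# Intraclass correlation across raters; pingouin reports the standard
# Shrout-Fleiss forms (ICC1, ICC2, ICC3 and their k-rater averages).
icc = pg.intraclass_corr(data=ratings, targets="statement",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Pairwise Cohen's kappa, e.g. for the human-human pairing.
wide = ratings.pivot(index="statement", columns="rater", values="score")
print("human-human kappa:",
      round(cohen_kappa_score(wide["psychologist"], wide["social_worker"]), 3))
```

The same pivoted table supports the human-AI pairings (e.g., wide["psychologist"] against wide["llm"]), mirroring the study's comparison of human-human and human-AI agreement.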

Results: A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing "good" to "excellent" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into "poor" or negative ranges (e.g., ICC = -.106 for "Logical structure"), indicating that its evaluation logic fundamentally differs from expert reasoning.
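The abstract does not state which ICC variant was used; for orientation, a common choice for a fully crossed design like this one is the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1) in Shrout-Fleiss notation:

```latex
% ICC(2,1) for n subjects (statements) rated by k raters:
% MS_R = between-statement mean square, MS_C = between-rater mean square,
% MS_E = residual mean square.
\[
\mathrm{ICC}(2,1) = \frac{MS_R - MS_E}
{MS_R + (k-1)\,MS_E + \dfrac{k\,(MS_C - MS_E)}{n}}
\]
```

Because the numerator is MS_R - MS_E, the coefficient goes negative whenever rater disagreement (error variance) exceeds the variance between statements, so a value such as ICC = -.106 indicates essentially no agreement rather than merely weak agreement.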

Discussion: The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a "cognitive assistant" to support expert workflows. The assessment of child testimony credibility remains a task that requires deep professional judgment and appears far beyond the current capabilities of such generative AI models.


Source
http://dx.doi.org/10.1080/26408066.2025.2547211


Similar Publications

McLaughlin, Kyle, Charley Shimanski, Ken Zafren, Ian Jackson, Gerold Biner, Maurizio Folini, Andreas Hermansky, Eric Ridington, Peter Hicks, Giacomo Strapazzon, Marika Falla, Alastair Hopper, Dave Weber, Ryan Jackson, and Hermann Brugger. Helicopter rescue at very high altitude: Recommendations of the International Commission for Mountain Emergency Medicine (ICAR MedCom). 2025.


Objective(s): To evaluate the quality, reliability and accuracy of hyperthyroidism-related content on TikTok using validated assessment tools.

Methods: We systematically searched TikTok for 'hyperthyroid' and 'high thyroid', analysing 115 videos after exclusions. Two independent researchers assessed videos using the Global Quality Scale (GQS, range 0-5) for overall content quality, the modified DISCERN (mDISCERN, range 0-5) for reliability and the Accuracy in Digital Information (ANDI, range 0-4) tool for factual correctness.


Purpose: Next-generation sequencing (NGS) has revolutionized cancer treatment by enabling comprehensive cancer genomic profiling (CGP) to guide genotype-directed therapies. While several prospective trials have demonstrated varying outcomes with CGP in patients with advanced solid tumors, its clinical utility in colorectal cancer (CRC) remains to be evaluated.

Methods: We conducted a prospective observational study of CGP in our hospital between September 2019 and March 2024.


Harmonizing mouse anatomy terminology: a common language?

Mamm Genome

September 2025

Department of Animal Health and Anatomy, Center for Animal Biotechnology and Gene Therapy, Universitat Autònoma de Barcelona, Travessera Dels Turons, 08193, Cerdanyola del Vallès, Barcelona, Spain.

The mouse remains the principal animal model for investigating human diseases due, among other reasons, to its anatomical similarities to humans. Despite its widespread use, the assumption that mouse anatomy is a fully established field with standardized and universally accepted terminology is misleading. Many phenotypic anatomical annotations do not refer to the authority or origin of the terminology used, while others inappropriately adopt outdated or human-centric nomenclature.


Atherosclerosis, a major cause of cardiovascular diseases, is characterized by the buildup of lipids and chronic inflammation in the arteries, leading to plaque formation and potential rupture. Despite recent advances in single-cell transcriptomics (scRNA-seq), the underlying immune mechanisms and transformations in structural cells driving plaque progression remain incompletely defined. Existing datasets often lack comprehensive coverage and consistent annotations, limiting the utility of downstream analyses.
