ChatGPT as an item calibration tool: Psychometric insights in a high-stakes examination.

Daniela S M Pereira, Francisco Mourão, João Carlos Ribeiro, Patrício Costa, Serafim Guimarães, José Miguel Pêgo

Med Teach

Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho, Largo do Paço, Braga, Portugal.

Published: April 2025

Introduction: ChatGPT has attracted worldwide interest for its versatility across a range of natural language tasks, including in the education and evaluation industry. It can automate time- and labor-intensive tasks, with clear economic and efficiency gains.

Methods: This study evaluated the potential of ChatGPT to automate the psychometric analysis of test questions from the 2020 Portuguese National Residency Selection Exam (PNA). ChatGPT was queried 100 times with the exam's 150 multiple-choice questions (MCQs). From ChatGPT's responses, a difficulty index was calculated for each question as the proportion of correct answers. The predicted difficulty levels were then compared with the actual difficulty levels of the 2020 exam MCQs using methods from classical test theory.
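
The difficulty index described above is the classical test theory proportion of correct answers. A minimal sketch of that calculation follows; the function name, the answer format (single option letters), and the sample data are hypothetical illustrations, not taken from the study.

```python
from typing import Sequence


def difficulty_index(responses: Sequence[str], answer_key: str) -> float:
    """Return the proportion of responses that match the keyed answer.

    Assumption: each response is a single option letter such as "A" to "E".
    """
    if not responses:
        raise ValueError("at least one response is required")
    key = answer_key.strip().upper()
    correct = sum(1 for r in responses if r.strip().upper() == key)
    return correct / len(responses)


# Hypothetical data: 100 repeated responses to one item keyed "C".
example_responses = ["C"] * 62 + ["A"] * 20 + ["D"] * 18
example_p = difficulty_index(example_responses, "C")
```

A higher index means more responses were correct, so the item is easier; applying the function to each of the 150 items would give one predicted index per item.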

Results: ChatGPT's predicted item difficulty indices positively correlated with the actual item difficulties (r(148) = -0.372, p < .001), suggesting a general consistency between the real and the predicted values. There was also a moderate, significant negative correlation between the difficulty index predicted by ChatGPT and the number of challenges (r(148) = -0.302, p < .001), highlighting ChatGPT's potential for identifying less problematic questions.
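
The notation r(148) gives the Pearson correlation with its degrees of freedom, which for 150 items is 150 - 2 = 148. The sketch below shows one standard way to compute the coefficient and the associated t statistic; it assumes a plain Pearson correlation and is not the authors' analysis code. The function names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two samples of equal length, n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))


# Reported values from the abstract: r = -0.372 with n = 150 items (df = 148).
t_reported = t_statistic(-0.372, 150)
```

With the reported values, the t statistic is roughly -4.9 on 148 degrees of freedom, which is in line with the reported p < .001.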

Conclusion: These findings unveiled ChatGPT's potential as a tool for assessment development, proving its capability to predict the psychometric characteristics of high-stakes test items in automated item calibration without pre-testing in real-life scenarios.

DOI: http://dx.doi.org/10.1080/0142159X.2024.2376205

Publication Analysis

Top Keywords

item calibration

difficulty indices

difficulty levels

chatgpt's potential

chatgpt

difficulty

chatgpt item

calibration tool

tool psychometric

psychometric insights

Similar Publications

Reducing Calibration Bias for Person Fit Assessment by Mixture Model Expansion.

Educ Psychol Meas

September 2025

Maastricht University, Maastricht, the Netherlands.

Johan Braeken, Saskia van Laar

Measurement appropriateness concerns the question of whether the test or survey scale under consideration can provide a valid measure for a specific individual. An aberrant item response pattern would provide internal counterevidence against using the test/scale for this person, whereas a more typical item response pattern would imply a fit of the measure to the person. Traditional approaches, including the popular Lz person fit statistic, are hampered by their two-stage estimation procedure and the fact that the fit for the person is determined based on the model calibrated on data that include the misfitting persons.

Cross-cultural adaptation of the West and Central African version of the ABILHAND-Kids questionnaire for children with cerebral palsy.

Disabil Rehabil

September 2025

School of Rehabilitation Sciences, Faculty of Medicine, Université Laval, Quebec City, Canada.

Emmanuel Segnon Sogbossi, Ange Loutou, Darnelle Audrey Noukimi, Sourou Melkiade Ahouandjinou, Aurore Houssou

Purpose: To adapt a West and Central African version of the widely used ABILHAND-Kids questionnaire for measuring manual ability in children with cerebral palsy (CP).

Materials and Methods: This cross-sectional study included 136 children with CP from Benin (n = 67) and Cameroon (n = 69). Data were collected from parents using an experimental version with 64 items.

Construction of risk prediction model for dysphagia in hospitalized elderly patients with frailty.

Front Med (Lausanne)

August 2025

Mianyang Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Mianyang, China.

Yuzhu Lin, Chen Xu, Xue Song, Qin Tan, Xiaolei Lian

Background: Dysphagia is a common complication in elderly patients with frailty, affecting their prognosis and quality of life. Constructing a risk prediction model can help with early screening and intervention.

Objective: To investigate the current status of dysphagia in hospitalized elderly patients with frailty, analyze its influencing factors, and construct a risk prediction model for dysphagia in hospitalized elderly patients with frailty.

Joint analysis of dispersed count-time data using a bivariate latent factor model.

Br J Math Stat Psychol

August 2025

University of South Florida, Tampa, Florida, USA.

Cornelis J Potgieter, Akihito Kamata, Yusuf Kara, Xin Qiao

In this study, we explore parameter estimation for a joint count-time data model with a two-factor latent trait structure, representing accuracy and speed. Each count-time variable pair corresponds to a specific item on a measurement instrument, where each item consists of a fixed number of tasks. The count variable represents the number of successfully completed tasks and is modeled using a Beta-binomial distribution to account for potential over-dispersion.

Automatic- and Transformer-Based Automatic Item Generation: A Critical Review.

J Intell

August 2025

Department of Psychology, University of Graz, Universitätsplatz 2, 8010 Graz, Austria.

Markus Sommer, Martin Arendasy

This article provides a critical review of conceptually different approaches to automatic and transformer-based automatic item generation. Based on a discussion of the current challenges that have arisen due to changes in the use of psychometric tests in recent decades, we outline the requirements that these approaches should ideally fulfill. Subsequently, each approach is examined individually to determine the extent to which it can contribute to meeting the challenges.
