GATmath and GATLc: Comprehensive benchmarks for evaluating Arabic large language models.

PLoS One

Research Chair of Online Dialogue and Cultural Communication, Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

Published: September 2025


Article Abstract

The evolution of Large Language Models (LLMs) has significantly advanced artificial intelligence, driving innovation across various applications. Their continued development relies on a deep understanding of their capabilities and limitations, which is achieved primarily through rigorous evaluation on diverse datasets. However, assessing state-of-the-art models in Arabic remains a formidable challenge because comprehensive benchmarks are scarce. The absence of robust evaluation tools hinders the progress and refinement of Arabic LLMs and limits their potential applications and effectiveness in real-world scenarios. In response, we introduce GATmath (7k questions) and GATLc (9k questions), two large-scale, multitask Arabic benchmarks for reasoning and language understanding. Derived from the General Aptitude Test (GAT) examination, each dataset covers multiple categories, demanding skills in reasoning, semantic analysis, language comprehension, and mathematical problem-solving. To the best of our knowledge, these are the first comprehensive, large-scale reasoning datasets specifically tailored to the Arabic language. We conducted a comprehensive evaluation and analysis of seven prominent LLMs on our datasets. Even the highest-performing model attained accuracies of only 66.9% and 64.3%, underscoring the considerable challenge posed by our datasets. This outcome illustrates the intricate nature of the tasks within our datasets and highlights the substantial room for improvement in Arabic language model development.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404542
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0329129
