A complete, multi-layered Quranic treebank dataset with hybrid syntactic annotations for Classical Arabic processing.

Wadee A Nashir, Abdulqader M Mohsen, Asma A Al-Shargabi, Mohamed K Nour, Badriyya B Al-Onazi

Data Brief

Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia.

Published: October 2025

This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), developed to overcome the documented limitations of the original Quranic Treebank. Building on foundational data from established Quranic digital resources, EQTB provides three annotation layers: (1) systematically expanded orthographic representations, generated through algorithmic processing and validation; (2) refined morphological annotations, based on expanded expert-informed schemas, automated re-annotation, and manual curation; and (3) a novel, complete syntactic layer, constructed through algorithmic conversion of prior graphical data, deep learning-based parsing that achieves full coverage under a hybrid constituency-dependency framework, and expert validation. The dataset covers the entire Quran (approximately 132,736 tokens) and is structured in an adapted CoNLL-X format with 43 columns, which detail multiple orthographies, fine-grained morphology (45 tags), and complete hybrid syntax (140 tags/labels). It is complemented by auxiliary lexicons and schemas. EQTB offers significant reuse potential: it provides training and evaluation data for diverse CA NLP tasks (parsing, morphology, diacritization), supports linguistic research, and enables the development of advanced pedagogical tools and language technologies.
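
As an illustration of how a 43-column, CoNLL-X-style file might be consumed, the following is a minimal Python sketch. It rests on assumptions the abstract does not confirm: that fields are tab-separated and that a blank line ends each sentence, as in standard CoNLL-X. The function name `read_sentences` is hypothetical, and the sketch assigns no meaning to individual columns, since the abstract states only their count.

```python
from typing import Iterable, Iterator, List

# Column count stated in the abstract; the meaning of each column is not
# given there, so rows are kept as plain lists of strings.
EXPECTED_COLUMNS = 43


def read_sentences(
    lines: Iterable[str], n_columns: int = EXPECTED_COLUMNS
) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of token rows (hypothetical helper).

    Assumes tab-separated fields and blank-line sentence boundaries,
    as in standard CoNLL-X; the actual EQTB files may differ.
    """
    sentence: List[List[str]] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: assumed sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        sentence.append(fields)
    if sentence:
        # Emit the last sentence when the input has no trailing blank line.
        yield sentence
```

Validating the column count on every row surfaces a mismatch between these assumptions and the real file layout immediately, rather than letting misaligned fields propagate into later processing.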

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12361616
DOI: http://dx.doi.org/10.1016/j.dib.2025.111940
