This article describes the Extended Quranic Treebank (EQTB), a comprehensive, multi-layered, and computationally accessible linguistic resource for Classical Arabic (CA), meticulously developed to overcome the documented limitations of the original Quranic Treebank. Leveraging foundational data from established Quranic digital resources, EQTB features systematically expanded orthographic representations generated via algorithmic processing and validation; rigorously refined morphological annotations based on expanded expert-informed schemas, automated re-annotation, and manual curation; and critically, a novel, complete syntactic layer constructed through algorithmic conversion of prior graphical data, Deep Learning-based parsing achieving full coverage under a hybrid constituency-dependency framework, and expert validation. Encompassing the entire Quran (∼132,736 tokens), the dataset is structured in an adapted CoNLL-X format across 43 columns, detailing multiple orthographies, fine-grained morphology (45 tags), and complete hybrid syntax (140 tags/labels), complemented by auxiliary lexicons and schemas. EQTB offers significant reuse potential, providing crucial training/evaluation data for diverse CA NLP tasks (parsing, morphology, diacritization), supporting linguistic research, and enabling the development of advanced pedagogical tools and language technologies.
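As an illustration of the layout described above, the following is a minimal sketch of reading a CoNLL-X-style file with 43 columns. It is not taken from the EQTB distribution: the tab delimiter, the blank-line sentence separator, and the `_` placeholder are conventions of standard CoNLL-X and are only assumed to carry over to the adapted format, and the names `read_sentences` and `make_row` are hypothetical. The actual column names and order should be taken from the schemas shipped with the dataset.

```python
import io
from typing import Iterator, List, TextIO

# Column count stated in the abstract; everything else here is an assumption.
EXPECTED_COLUMNS = 43


def read_sentences(
    handle: TextIO, n_columns: int = EXPECTED_COLUMNS
) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of token rows; each row is a list of fields.

    Assumes tab-separated fields and blank lines between sentences,
    as in standard CoNLL-X (hypothetical for EQTB).
    """
    sentence: List[List[str]] = []
    for line_no, raw in enumerate(handle, start=1):
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_no}: expected {n_columns} columns, got {len(fields)}"
            )
        sentence.append(fields)
    # Emit a trailing sentence that is not followed by a blank line.
    if sentence:
        yield sentence


def make_row(token_id: int) -> str:
    """Build a synthetic row: a token index followed by '_' placeholders."""
    return "\t".join([str(token_id)] + ["_"] * (EXPECTED_COLUMNS - 1))


# Synthetic input: two sentences, with two tokens and one token respectively.
sample = "\n".join([make_row(1), make_row(2), "", make_row(1), ""]) + "\n"
sentences = list(read_sentences(io.StringIO(sample)))
print(len(sentences), [len(s) for s in sentences])
```

Checking the column count on every row is a deliberate choice here: with this many columns, a single stray tab would otherwise shift the morphology and syntax fields silently.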
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12361616 | PMC |
http://dx.doi.org/10.1016/j.dib.2025.111940 | DOI Listing |
Data Brief
October 2025
Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia.