Bochun: Automatically annotated stance detection dataset for Sorani Kurdish language.

Data Brief

Information Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq.

Published: August 2025


Article Abstract

This research presents the first high-quality, automatically annotated stance detection dataset for the Sorani dialect of Kurdish, addressing the lack of annotated resources for Kurdish, a low-resource language in Natural Language Processing (NLP). The dataset consists of 2,174 Kurdish news articles (1,410 economic and 764 political) published in 2024 and 2025, so the content is recent and topically relevant. The texts were selected from well-known Kurdish news agencies to preserve content validity and linguistic purity, and the necessary preprocessing steps were applied. Annotation was carried out in two steps. First, a pattern-recognition method using 2,456 phrases and keywords determined whether the subject of each text fell into the economics or politics category. Next, the stance of each article was annotated using an extended lexicon of 4,243 adjectives and verbs, categorized as support, oppose, or neutral. Where direct matches were not possible, semantic similarity and zero-shot classification were used as fallback measures. To verify the automatic annotation, a team of domain experts manually assessed a representative sample of the annotated texts, and a high inter-annotator agreement score confirmed the validity of the approach. The dataset is available in XLSX (Excel) format, which makes it easy to use across a variety of NLP research tasks. As an annotated and organized corpus, it is a solid starting point for researchers building Kurdish language processing models. The dataset is released publicly so that other researchers can build on it and improve NLP system performance on low-resource languages.
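
The two-step annotation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicons below are tiny English stand-ins (the actual resources are 2,456 Sorani topic phrases and 4,243 stance adjectives and verbs), the function names `classify_topic`, `annotate_stance`, and `fallback_stance` are hypothetical, and the semantic-similarity and zero-shot fallback is reduced to a stub that returns "neutral".

```python
# Hypothetical toy lexicons (English stand-ins for illustration only).
TOPIC_PATTERNS = {
    "economic": {"budget", "market", "oil price", "salary"},
    "political": {"parliament", "election", "minister", "party"},
}
STANCE_LEXICON = {
    "support": {"welcomed", "praised", "approved", "beneficial"},
    "oppose": {"rejected", "criticized", "condemned", "harmful"},
    "neutral": {"said", "announced", "reported"},
}


def classify_topic(text):
    """Step 1: return the topic whose phrases occur most often, or None."""
    lowered = text.lower()
    counts = {
        topic: sum(lowered.count(phrase) for phrase in phrases)
        for topic, phrases in TOPIC_PATTERNS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def fallback_stance(text):
    """Stub for the fallback stage (assumption: a real system would use
    semantic similarity or a zero-shot classifier here)."""
    return "neutral"


def annotate_stance(text, fallback=fallback_stance):
    """Step 2: count lexicon hits per stance label; defer to the fallback
    when nothing matches, and treat ties as neutral."""
    tokens = [t.strip(".,;:!?\"'") for t in text.lower().split()]
    hits = {
        label: sum(token in words for token in tokens)
        for label, words in STANCE_LEXICON.items()
    }
    if not any(hits.values()):
        return fallback(text)
    top = max(hits.values())
    winners = [label for label, n in hits.items() if n == top]
    return winners[0] if len(winners) == 1 else "neutral"


if __name__ == "__main__":
    article = "The party rejected the election results and condemned the minister."
    print(classify_topic(article), annotate_stance(article))
```

Passing the fallback as a parameter keeps the lexicon lookup separate from whatever model handles unmatched texts, so either stage could be swapped out independently.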

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12266528
DOI: http://dx.doi.org/10.1016/j.dib.2025.111839
