Bochun: Automatically annotated stance detection dataset for Sorani Kurdish language.

Data Brief

Information Technology Department, Technical College of Informatics, Sulaimani Polytechnic University, Sulaimani, Kurdistan Region, Iraq.

Published: August 2025


This research presents the first high-quality, automatically annotated stance detection dataset for the Sorani dialect of Kurdish, addressing the lack of annotated resources for Kurdish, a low-resource language in Natural Language Processing (NLP). The dataset consists of 2,174 Kurdish news articles (1,410 economic and 764 political) originally published in 2024 and 2025, so the content is recent and topically relevant. The texts were selected from well-known Kurdish news agencies to preserve content validity and linguistic purity, and the necessary preprocessing techniques were applied. Annotation was carried out in two steps. First, a pattern-recognition method using 2,456 phrases and keywords determined whether the subject of each text fell into the economics or politics category. Next, the stance of each article was annotated using an extended lexicon of 4,243 adjectives and verbs, categorized as support, oppose, or neutral. Where direct matches were not possible, semantic similarity and zero-shot classification were used as fallback measures. To verify the automatic annotation, a team of domain experts manually assessed a representative sample of the annotated texts, and a high inter-annotator agreement score confirmed the validity of the approach. The dataset is provided in XLSX (Excel) format for ease of use across a variety of NLP research tasks. With its annotated and organized corpus, it offers a solid starting point for researchers building Kurdish language processing models. The dataset is released publicly so that other researchers can build on it and improve NLP system performance on low-resource languages.
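The two-step annotation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the lexicon entries below are hypothetical English placeholders standing in for the Sorani phrase and stance lexicons, the function names (`count_matches`, `best_label`, `annotate`) are invented for illustration, and the `fallback` argument is only a hook where a semantic-similarity or zero-shot classifier could be plugged in.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Set

# Hypothetical toy lexicons (English placeholders, not from the dataset).
# The abstract reports 2,456 topic phrases and 4,243 stance adjectives/verbs.
TOPIC_LEXICON: Dict[str, Set[str]] = {
    "economic": {"budget", "oil price", "salary", "market"},
    "political": {"parliament", "election", "minister", "party"},
}
STANCE_LEXICON: Dict[str, Set[str]] = {
    "support": {"welcomed", "praised", "approved"},
    "oppose": {"rejected", "criticized", "condemned"},
    "neutral": {"announced", "stated", "reported"},
}


def count_matches(text: str, lexicon: Dict[str, Set[str]]) -> Counter:
    """Count lexicon hits per label using plain substring matching."""
    lowered = text.lower()
    counts: Counter = Counter()
    for label, phrases in lexicon.items():
        counts[label] = sum(lowered.count(phrase) for phrase in phrases)
    return counts


def best_label(counts: Counter) -> Optional[str]:
    """Return the top-scoring label, or None when there is no hit or a tie."""
    ranked = counts.most_common(2)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def annotate(
    text: str,
    fallback: Optional[Callable[[str], str]] = None,
) -> Dict[str, Optional[str]]:
    """Step 1: assign a topic. Step 2: assign a stance, with optional fallback."""
    topic = best_label(count_matches(text, TOPIC_LEXICON))
    stance = best_label(count_matches(text, STANCE_LEXICON))
    if stance is None and fallback is not None:
        # Assumption: the fallback stands in for semantic similarity
        # or zero-shot classification when no direct match decides.
        stance = fallback(text)
    return {"topic": topic, "stance": stance}


result = annotate("Parliament rejected the election law and the minister criticized it.")
fallback_result = annotate(
    "The budget and salary figures were published.",
    fallback=lambda _text: "neutral",
)
```

Plain substring counting is the simplest possible matcher and will over-match inside longer words; a real pipeline for Sorani would need tokenization and normalization suited to the script, which the abstract summarizes only as "necessary preprocessing".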

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12266528
DOI: http://dx.doi.org/10.1016/j.dib.2025.111839

