A curated dataset for hate speech detection on social media text.

Data Brief

Department of Computer Science, Faculty of Science and Environmental Studies, Lakehead University, Ontario, Canada.

Published: February 2023


Article Abstract

Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. Detecting hate speech in line with current social media trends requires an extensive dataset containing emoticons, emojis, hashtags, slang, and contractions. Our dataset is therefore curated from various sources, including Kaggle, GitHub, and other websites. It contains English sentences divided into two classes, one representing hateful content and the other representing non-hateful content. It has 451,709 sentences in total: 371,452 are hate speech and 80,250 are non-hate speech. An augmented, balanced dataset with 726,120 samples is also generated to create a custom vocabulary of 145,046 words. The dataset considers 6,403 contractions and 377 bad words commonly used in hateful content. The text of each sentence in the final dataset, which is used for training and cross-validation, is limited to 180 words. The generated contractions dataset can be used for data preprocessing in any NLP project. The augmented dataset can help reduce the number of out-of-vocabulary words, and the hate speech dataset can be used to train classifiers that detect hateful and non-hateful content on social media platforms.
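The preprocessing that the abstract describes (expanding contractions and capping each sentence at 180 words) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the four-entry `CONTRACTIONS` mapping and the function names `expand_contractions` and `preprocess` are hypothetical, and the published contractions list would take the place of the sample mapping.

```python
import re

# Hypothetical sample mapping for illustration only; the published
# contractions dataset is described as having 6,403 entries.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
}

MAX_WORDS = 180  # per-sentence word limit stated in the abstract

# Build one case-insensitive pattern; longer keys are tried first so that
# a longer contraction is never shadowed by a shorter one.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded (lowercase) form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def preprocess(text: str, max_words: int = MAX_WORDS) -> str:
    """Expand contractions, then keep at most max_words whitespace-separated words."""
    words = expand_contractions(text).split()
    return " ".join(words[:max_words])


if __name__ == "__main__":
    print(preprocess("I can't believe it's happening"))
```

Truncation is applied after expansion in this sketch, because expanding a contraction can increase the word count; whether the dataset authors ordered the steps this way is an assumption.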

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9807815
DOI: http://dx.doi.org/10.1016/j.dib.2022.108832

Publication Analysis

Top Keywords

hate speech: 24
social media: 16
dataset: 9
hate: 8
dataset hate: 8
media platforms: 8
detect hate: 8
hateful content: 8
total number: 8
speech: 7

Similar Publications

Journalists face intricate decisions regarding what to publish, especially when problematic content may impact public opinion in a way that could fuel hate and/or undermine democratic attitudes. While scholarship has recognized the importance of this issue, most studies focus on published content, how citizens engage with it, and the implications of published news. In this article, we provide a fresh perspective on the crucial dilemma faced by journalists concerning their perceived impact on public opinion, by leveraging data based on 36 semistructured in-depth interviews with journalists covering Brazil's political landscape.

Post-traumatic stress disorder (PTSD) after traumatic events is prevalent and can lead to negative consequences. While social media use has been associated with PTSD, little is known about the specific association between online hate speech on social media networks and PTSD, or whether this association is stronger among those with difficulties in emotion regulation, who may have a harder time coping with hate speech. In a general population sample of Jewish adults (aged 18-70) in Israel (N = 3,998), assessed about two months after the wide-scale terror attacks of October 7, 2023, regression analysis was used to explore the association between online hate speech and self-reported PTSD symptomology.

This paper examines the case of Iruda, an AI chatbot launched in December 2020 by the South Korean startup Scatter Lab. Iruda quickly became the center of a controversy because of inappropriate remarks and sexual exchanges. As conversations between Iruda and users spread through online communities, the controversy expanded to other issues, including hate speech against minorities and privacy violations.

Coopting the Rainbow: Analyzing Malicious Survey Responses.

J Homosex

July 2025

Health & Wellbeing, Whitireia Community Polytechnic, Porirua, New Zealand.

In late 2022, a group of polytechnic researchers designed a collaborative research study to explore how safe and inclusive the various campuses of New Zealand's polytechnic sector were for rainbow students. Two online surveys were distributed to students and staff in 14 of the nation's 16 polytechnics. One survey was designed to be completed by rainbow students and the other by cisgender (cis) heterosexual students and all staff.
