Hate Speech Classification of Codeswitched Data
$ 64.5
Description
Identifying short text messages that contain hate speech within the vast volume of user-generated content on social media is a challenging classification task. Social media data poses new challenges for conventional natural language processing techniques, which must extract high-quality features from noisy, high-dimensional, codeswitched, large, and unstructured data. In addition, a systematic review of previous studies revealed a lack of publicly available annotated datasets for comparative studies, little evidence of theoretical underpinning for the annotation schemes used, and hardly any studies on codeswitched data. To address these gaps, this book explores a data-driven approach to identifying high-quality, discriminative features in hate text messages from social media. The goal was to use these features to train a better-performing classification model that effectively captures subtle hate speech in social media text messages. Approximately 400k messages were crawled from social media over a one-year period spanning the 2017 general election in Kenya, using a combination of problematic hashtags, ethnic slurs, hate patterns, and messages from pro-hate user accounts. A random sample of ~50k messages was manually labeled into three classes (Hate Speech, Offensive, or Neither) by a team of 27 human annotators. This dataset was then further reduced by extracting a psychosocial feature subset (PDC), informed by the conceptual framework, using a hierarchical probability modeling technique.
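To make the three-class labeling concrete, the following is a minimal sketch of a baseline classifier over the Hate Speech, Offensive, and Neither labels. It is not the book's method: it swaps in a plain multinomial Naive Bayes over character trigrams, chosen here only because character n-grams need no language-specific tokenizer and so tolerate codeswitched text. The names `char_ngrams`, `NaiveBayes`, and `TRAIN`, the placeholder token `GROUPX`, and every training message are invented for illustration and do not come from the dataset described above.

```python
import math
from collections import Counter, defaultdict

LABELS = ("Hate Speech", "Offensive", "Neither")

# Invented placeholder messages (assumption): NOT taken from the book's dataset.
TRAIN = [
    ("GROUPX wote wafukuzwe, chase them all away", "Hate Speech"),
    ("GROUPX should not be allowed to vote, wafukuzwe", "Hate Speech"),
    ("wewe ni mjinga sana, so stupid", "Offensive"),
    ("shut up, wewe mjinga kabisa", "Offensive"),
    ("habari ya asubuhi, good morning everyone", "Neither"),
    ("tukutane kesho at the polling station", "Neither"),
]


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded message."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over character n-grams."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # messages per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


texts, labels = zip(*TRAIN)
model = NaiveBayes().fit(texts, labels)
print(model.predict("good morning, habari yako"))
```

A six-message toy set says nothing about real performance. On a corpus of the size described above, a held-out evaluation with per-class precision and recall would be the natural next step, since the Hate Speech and Offensive classes are easy to confuse.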