Minority target class detection for short text classification

Student thesis: Doctoral Thesis

Abstract

The rise of social media has produced a large volume of socially relevant content, along with unhealthy behaviours such as cyberbullying, suicidal ideation and hate speech. These behaviours have been shown to have offline consequences, and lawmakers and social media platforms have put measures in place to detect them. However, these measures are manual and do not scale, which makes them ineffective for the evolving web. Numerous studies in both computational linguistics and machine learning have sought effective and robust automatic detection and identification of such content (target classes). This content makes up only a small percentage of overall social media posts and must be distinguished from other discourse that discusses such behaviours without displaying them (non-target classes), which makes the task challenging. In this thesis, we employ short text classification to improve the detection of the target classes from Twitter. We reviewed the literature related to short text classification of unhealthy social media behaviours, highlighting the impact of text ambiguity on classification performance when distinguishing target classes from non-target classes. In addition, we identified relevant machine learning techniques and methods, and empirically investigated the performance of the most popular machine learning algorithms for short text classification of unhealthy social media behaviours on Twitter data. We also introduce two methods that aim to improve the detection of the target class in a binary classification problem by minimizing common or ambiguous terms. We refer to the minimization process as “term disambiguation”. The first method, Short Text Term Disambiguation (STTD), increases the target and non-target class terms by identifying and minimizing terms that are common to the two classes.
The second method, Partition Based - Short Text Term Disambiguation (PB-STTD), aims to further improve the detection of the target class by explicitly addressing class imbalance as part of the term disambiguation process. Finally, we validated and evaluated the proposed term disambiguation methods on three data sets containing unhealthy social media behaviours, using different machine learning algorithms. The results showed that both proposed term disambiguation methods led to improved detection of the target class (i.e. unhealthy behaviours in Twitter data).
Date of Award: Jan 2021
Original language: English
Supervisor: Mohamed Bader-El-Den (Supervisor) & Ella Haig (Supervisor)
