Comparative performance of multi-level pre-trained embeddings on CNN, LSTM and CNN-LSTM for hate speech and offensive language detection

Noor Azeera Abdul Aziz*, Anazida Zainal, Bander Ali Saleh Al-rimy, Fuad Abdulgaleel Abdoh Ghaleb

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With growing concern over hate speech, social media platforms have introduced policies for monitoring hateful content. Platforms such as Twitter and Facebook currently rely on both human and machine content moderators. On the machine side, many studies have proposed hate speech detection using machine learning approaches. This study investigated which pre-trained text embedding (Word2Vec, GloVe, FastText, ELMo, or BERT) performs best at each tokenization level (word, subword, and character), and which neural network architecture (CNN, LSTM, or CNN-LSTM) serves best as the encoder for hate speech and offensive language detection. Character-level GloVe with CNN-LSTM performed best among all tested combinations, scoring 93% F1-score and 92% accuracy. At the word level, BERT embeddings with CNN-LSTM achieved the best classification scores: 90% F1-score and 91% accuracy. At the subword level, CNN-LSTM and CNN fared best with BERT embeddings, at 86% for both accuracy and F1-score. These findings show that pre-trained embeddings at different tokenization levels capture diverse information. Moreover, with an average F1-score of 85% and accuracy of 86%, CNN-LSTM outperformed CNN and LSTM for almost all text embeddings regardless of tokenization level. These results indicate that the CNN and LSTM components complement each other, capturing both local and sequential patterns in the input text.
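To make the CNN-LSTM encoder described above concrete, the following is a minimal sketch of such a classifier built on a frozen pre-trained embedding matrix (e.g. GloVe vectors indexed by token id). The vocabulary size, embedding dimension, sequence length, number of classes, and layer sizes are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal CNN-LSTM sketch over frozen pre-trained embeddings (assumed setup,
# not the paper's exact hyperparameters).
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 20_000   # assumption: tokenizer vocabulary size
EMBED_DIM = 300       # assumption: dimensionality of the pre-trained vectors
MAX_LEN = 100         # assumption: padded sequence length
NUM_CLASSES = 3       # assumption: e.g. hate / offensive / neither

# Placeholder for a pre-trained embedding matrix (rows indexed by token id);
# in practice this would be loaded from GloVe, FastText, etc.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMBED_DIM)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    # Frozen pre-trained embeddings: token ids (word, subword, or character
    # level) are mapped to dense vectors that are not updated during training.
    tf.keras.layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ),
    # CNN part: 1-D convolutions capture local, n-gram-like patterns.
    tf.keras.layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    # LSTM part: models sequential dependencies over the convolved features.
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The ordering reflects the complementary roles noted in the abstract: the convolution and pooling layers extract local patterns from the embedded tokens, and the LSTM then models how those patterns unfold across the sequence.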

Original language: English
Title of host publication: Recent Advances on Soft Computing and Data Mining - Proceedings of the 6th International Conference on Soft Computing and Data Mining, SCDM 2024
Editors: Rozaida Ghazali, Nazri Mohd Nawi, Nureize Arbaiy, Mustafa Mat Deris, Jemal H. Abawajy
Publisher: Springer
Pages: 186-195
Number of pages: 10
ISBN (Electronic): 9783031669651
ISBN (Print): 9783031669644
DOIs
Publication status: Published - 30 Jul 2024
Event: 6th International Conference on Soft Computing and Data Mining, SCDM 2024 - Virtual, Online
Duration: 21 Aug 2024 - 22 Aug 2024

Publication series

Name: Lecture Notes in Networks and Systems
Volume: 1078 LNNS
ISSN (Print): 2367-3370
ISSN (Electronic): 2367-3389

Conference

Conference: 6th International Conference on Soft Computing and Data Mining, SCDM 2024
City: Virtual, Online
Period: 21/08/24 - 22/08/24

Keywords

  • Hate speech detection
  • neural network
  • text classification
  • text embedding

