TY - GEN
T1 - Comparative performance of multi-level pre-trained embeddings on CNN, LSTM and CNN-LSTM for hate speech and offensive language detection
AU - Aziz, Noor Azeera Abdul
AU - Zainal, Anazida
AU - Al-rimy, Bander Ali Saleh
AU - Ghaleb, Fuad Abdulgaleel Abdoh
PY - 2024/7/30
Y1 - 2024/7/30
AB - With growing concerns over hate speech, social media platforms provide policies for monitoring hateful content. Platforms such as Twitter and Facebook now rely on both human and machine content moderators. For machine moderation, many studies have proposed hate speech detection using machine learning approaches. This study investigated which pre-trained text embedding (Word2Vec, GloVe, FastText, ELMo, or BERT) performs best at each tokenization level (word, subword, and character) and which neural network architecture (CNN, LSTM, or CNN-LSTM) serves best as an encoding method for hate speech and offensive language detection. Character-level GloVe with CNN-LSTM performed best among all tested methods, scoring a 93% F1-score and 92% accuracy. At the word level, BERT word embeddings with CNN-LSTM achieved the best classification scores, with a 90% F1-score and 91% accuracy. At the subword level, CNN-LSTM and CNN performed best with BERT embeddings, reaching 86% for both accuracy and F1-score. These findings show that pre-trained embeddings at different tokenization levels capture diverse information. Moreover, averaging an 85% F1-score and 86% accuracy, CNN-LSTM yielded the best scores for almost all text embeddings regardless of tokenization level, compared with CNN and LSTM. These results indicate that the CNN and LSTM components of CNN-LSTM complement each other in capturing local and sequential patterns in the input text.
KW - Hate speech detection
KW - neural network
KW - text classification
KW - text embedding
UR - http://www.scopus.com/inward/record.url?scp=85200997691&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-66965-1_19
DO - 10.1007/978-3-031-66965-1_19
M3 - Conference contribution
AN - SCOPUS:85200997691
SN - 9783031669644
T3 - Lecture Notes in Networks and Systems
SP - 186
EP - 195
BT - Recent Advances on Soft Computing and Data Mining - Proceedings of the 6th International Conference on Soft Computing and Data Mining, SCDM 2024
A2 - Ghazali, Rozaida
A2 - Nawi, Nazri Mohd
A2 - Arbaiy, Nureize
A2 - Deris, Mustafa Mat
A2 - Abawajy, Jemal H.
PB - Springer
T2 - 6th International Conference on Soft Computing and Data Mining, SCDM 2024
Y2 - 21 August 2024 through 22 August 2024
ER -