
Identification and classification of misogynous tweets using multi-classifier fusion

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

For this study, we used the Doc2Vec embedding approach for feature extraction, with a context window size of 2, a minimum word frequency of 2, a sampling rate of 0.001, a learning rate of 0.025, a minimum learning rate of 1.0E-4, 200 layers, a batch size of 10000 and 40 epochs. Distributed Memory (DM) was used as the embedding learning algorithm, with a negative sampling rate of 5.0. Before feature extraction, all tweets were pre-processed by converting characters to lower case; removing stop words, numbers, punctuation and words that contain no more than 3 characters; and stemming the remaining words with the Snowball Stemmer. Three classifiers were then trained: an SVM with a linear kernel, random forests (RF) and gradient boosted trees (GBT). In the testing stage, the same text pre-processing and feature extraction were applied to the test instances separately, and each pair of the three trained classifiers (SVM+RF, SVM+GBT and RF+GBT) was fused by averaging the probabilities that the two classifiers assign to each class.
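The pairwise fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the class labels and the probability values are hypothetical, chosen only to show how averaging the class probabilities of two classifiers yields a fused prediction for each of the three pairs.

```python
from itertools import combinations


def fuse_by_averaging(probs_a, probs_b):
    """Average two classifiers' class-probability vectors for one instance."""
    if len(probs_a) != len(probs_b):
        raise ValueError("probability vectors must cover the same classes")
    return [(a + b) / 2.0 for a, b in zip(probs_a, probs_b)]


def predict_label(fused_probs, labels):
    """Return the label with the highest fused probability (first one on ties)."""
    best = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return labels[best]


# Hypothetical labels and per-classifier probabilities for a single tweet.
labels = ["non-misogynous", "misogynous"]
classifier_probs = {
    "SVM": [0.40, 0.60],
    "RF": [0.70, 0.30],
    "GBT": [0.45, 0.55],
}

# Fuse every pair of classifiers: SVM+RF, SVM+GBT and RF+GBT.
fused_predictions = {}
for name_a, name_b in combinations(classifier_probs, 2):
    fused = fuse_by_averaging(classifier_probs[name_a], classifier_probs[name_b])
    fused_predictions[f"{name_a}+{name_b}"] = (fused, predict_label(fused, labels))

for pair, (fused, label) in fused_predictions.items():
    print(pair, fused, label)
```

With these made-up values the pairs can disagree: a confident classifier can outweigh a less confident partner, which is why the fused label depends on which two classifiers are combined.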
Original language: English
Title of host publication: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages
Editors: Paolo Rosso, Julio Gonzalo, Raquel Martínez, Soto Montalvo, Jorge Carrillo-de-Albornoz
Publisher: CEUR Workshop Proceedings
Number of pages: 6
Publication status: Published - 27 Jul 2018
Event: Evaluation of Human Language Technologies for Iberian Languages: IberEval 2018 - Seville, Spain
Duration: 19 Sep 2018 to 21 Sep 2018

Publication series

Name: CEUR Workshop Proceedings
ISSN (Print): 1613-0073


Workshop: Evaluation of Human Language Technologies for Iberian Languages


