For this study, we used the Doc2Vec approach for feature extraction, with a context window size of 2, a minimum word frequency of 2, a sampling rate of 0.001, a learning rate of 0.025, a minimum learning rate of 1.0E-4, a layer size of 200 (i.e., 200-dimensional embeddings), a batch size of 10000 and 40 epochs. Distributed Memory (DM) was used as the embedding learning algorithm, with a negative sampling rate of 5.0. Before feature extraction, all tweets were pre-processed by converting characters to lower case; removing stop words, numbers, punctuation and words of no more than 3 characters; and stemming the remaining words with the Snowball stemmer. Three classifiers were then trained: an SVM with a linear kernel, random forests (RF) and gradient boosted trees (GBT). In the testing stage, the same pre-processing and feature extraction were applied to each test instance, and each pair of the three trained classifiers (SVM+RF, SVM+GBT and RF+GBT) was fused by averaging their per-class probabilities.
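A minimal sketch of the pre-processing and late-fusion steps described above, in Python. The stop-word list and the example probabilities are illustrative assumptions (the study's exact lists and toolkit are not given here), and stemming is omitted for brevity; a commented gensim `Doc2Vec` call shows how the reported hyperparameters would roughly map onto that library's parameters.

```python
import re
import numpy as np

# Illustrative stop-word subset; the study's actual list is not specified here.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def preprocess(tweet):
    """Lowercase, strip numbers and punctuation, drop stop words and words
    of no more than 3 characters (Snowball stemming omitted for brevity)."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 3]

# With gensim, the reported Doc2Vec configuration would roughly correspond to:
# Doc2Vec(dm=1, vector_size=200, window=2, min_count=2, sample=1e-3,
#         alpha=0.025, min_alpha=1e-4, negative=5, epochs=40)

def fuse_pair(proba_a, proba_b):
    """Late fusion of two classifiers: average their per-class probabilities."""
    return (np.asarray(proba_a) + np.asarray(proba_b)) / 2.0

# Hypothetical per-class probabilities for one test instance (3 classes):
p_svm = [0.6, 0.3, 0.1]
p_rf = [0.4, 0.4, 0.2]
fused = fuse_pair(p_svm, p_rf)  # SVM+RF pairing; SVM+GBT and RF+GBT are analogous
```

The predicted label for the fused pair would then be the class with the highest averaged probability (`np.argmax(fused)`).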
- Name: CEUR Workshop Proceedings
- Workshop: Evaluation of Human Language Technologies for Iberian Languages
- Period: 19/09/18 → 21/09/18
- Multi-classifier Fusion
- Social Media