Abstract
In this paper we investigate the use of Distributed Neural Networks for the imputation of missing values in a Big Data context. The presented data imputation framework is implemented in Spark, so imputation can be added as a step in the data pre-processing pipeline. The Distributed Neural Networks model uses Mini-batch Stochastic Gradient Descent, which scales well with the cluster size and minimizes communication among the workers. The model is tested on a real-world Recommender Systems dataset, where missing data is generally a problem for new items, as the system's ranking is usually biased towards popular items. The model is compared with univariate (Mean and Median Imputation) and multivariate (K-Nearest Neighbours and Linear Regression) imputation techniques, and its performance is validated in terms of prediction accuracy and speed. Furthermore, we evaluate the speedup compared to a sequential implementation of Neural Networks with Stochastic Gradient Descent.
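The idea of data-parallel mini-batch training for imputation can be illustrated with a minimal sketch. It is not the paper's Spark implementation. It assumes one common scheme, in which each worker computes a gradient on a mini-batch from its own partition and a driver averages the gradients; the abstract does not state which synchronization scheme is used. A single linear unit stands in for the neural network, the workers are simulated in one process, and all names (`train`, `batch_gradient`, `partitions`) and the synthetic data are hypothetical.

```python
import random


def predict(w, b, x):
    """Output of a single linear unit (a stand-in for the network)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def batch_gradient(w, b, batch):
    """Mean squared-error gradient over one mini-batch of (features, target) pairs."""
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in batch:
        err = predict(w, b, x) - y
        for j, xj in enumerate(x):
            gw[j] += 2.0 * err * xj
        gb += 2.0 * err
    n = len(batch)
    return [g / n for g in gw], gb / n


def mse(w, b, rows):
    """Mean squared error over a list of (features, target) pairs."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in rows) / len(rows)


def train(partitions, n_features, steps=500, batch_size=8, lr=0.1, seed=0):
    """Simulated data-parallel mini-batch SGD with gradient averaging."""
    rng = random.Random(seed)
    w, b = [0.0] * n_features, 0.0
    for _ in range(steps):
        # Each simulated worker draws a mini-batch from its own partition
        # and returns only a gradient, not the data itself.
        grads = [
            batch_gradient(w, b, rng.sample(part, min(batch_size, len(part))))
            for part in partitions
        ]
        # The driver averages the worker gradients and updates the shared model.
        k = len(grads)
        avg_gw = [sum(g[0][j] for g in grads) / k for j in range(n_features)]
        avg_gb = sum(g[1] for g in grads) / k
        w = [wi - lr * gi for wi, gi in zip(w, avg_gw)]
        b -= lr * avg_gb
    return w, b


# Synthetic complete rows (assumption): the column to impute depends
# linearly on the two observed columns.
data_rng = random.Random(42)
rows = []
for _ in range(200):
    x = [data_rng.random(), data_rng.random()]
    rows.append((x, 2.0 * x[0] - 1.0 * x[1] + 0.5))

# Split the complete rows across four simulated workers.
partitions = [rows[i::4] for i in range(4)]

initial_loss = mse([0.0, 0.0], 0.0, rows)
w, b = train(partitions, n_features=2)
final_loss = mse(w, b, rows)

# Impute the missing column of an incomplete row from its observed columns.
incomplete_row = [0.3, 0.7]
imputed_value = predict(w, b, incomplete_row)
```

In this sketch only gradients travel between the workers and the driver, which is one way to keep communication small relative to the data size.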
Original language | English |
---|---|
Title of host publication | 2018 International Joint Conference on Neural Networks (IJCNN) |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 131-138 |
Number of pages | 9 |
ISBN (Electronic) | 978-1-5090-6014-6 |
ISBN (Print) | 978-1-5090-6015-3 |
Publication status | Published - 15 Oct 2018 |
Event | IEEE WCCI 2018, World Congress on Computational Intelligence - Rio de Janeiro, Brazil Duration: 8 Jul 2018 → 13 Jul 2018 http://www.ecomp.poli.br/~wcci2018/ |
Publication series
Name | IEEE IJCNN Proceedings Series |
---|---|
Publisher | IEEE |
ISSN (Electronic) | 2161-4407 |
Conference
Conference | IEEE WCCI 2018, World Congress on Computational Intelligence |
---|---|
Abbreviated title | IJCNN |
Country/Territory | Brazil |
Period | 8/07/18 → 13/07/18 |
Keywords
- Distributed Computation
- Big Data
- Missing Data Imputation
- Neural Networks