This paper investigates data imputation techniques for pre-processing of dataset with missing values. The current literature is mainly focused on the overall accuracy, evaluated estimating the missing values on the dataset at hand, however the predictions can be suboptimal when considering the model performance for each feature. To address this problem, a Column-wise Guided Data Imputation method (cGDI) is proposed. Its main novelty resides in the selection of the most suitable model from a multitude of imputation techniques for each individual feature, through a learning process on the known data. To assess the performance of the proposed technique, empirical experiments have been conducted on 13 publicly available datasets. The results show that cGDI outperforms two baselines and has always comparable or greater estimation accuracy over four state-of-the-art methods, widely applied to solve the problem at hand. Furthermore, cGDI has a straightforward implementation and any other known imputation technique can be easily added.
|Name||Procedia Computer Science|
- missing data
- data imputation
- multitude of imputation models