Abstract
In machine learning tasks, it is essential for a data set to be partitioned into a training set and a test set in a specific ratio. In this context, the training set is used for learning a model for making predictions on new instances, whereas the test set is used for evaluating the prediction accuracy of a model on new instances. In the context of human learning, a training set can be viewed as learning material that covers knowledge, whereas a test set can be viewed as an exam paper that provides questions for students to answer. In practice, data partitioning has typically been done by randomly selecting 70% of the instances for training and the rest for testing. In this paper, we argue that random data partitioning is likely to result in the sample representativeness issue, i.e., training and test instances show very dissimilar characteristics, leading to a case similar to testing students on material that was not taught. To address the above issue, we propose a subclass-based semi-random data partitioning approach. The experimental results show that the proposed data partitioning approach leads to significant advances in learning performance due to the improvement of sample representativeness.
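The sketch below illustrates the partitioning setting the abstract describes: a purely random 70/30 split, contrasted with a class-stratified split as one simple way to keep the two partitions representative of each other. This is not the paper's subclass-based semi-random method; it is a minimal illustration using scikit-learn's `train_test_split` on an assumed example data set (Iris).

```python
# Minimal sketch (assumption: scikit-learn and the Iris data set as stand-ins;
# this is NOT the subclass-based semi-random method proposed in the paper).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Conventional random 70/30 split: class proportions in the training and
# test partitions may drift apart, which is the representativeness issue
# the paper targets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Class-stratified 70/30 split: each class appears in the same proportion
# in both partitions, reducing the mismatch between training and test samples.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
```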
| Original language | English |
| --- | --- |
| Pages (from-to) | 208-221 |
| Number of pages | 14 |
| Journal | Information Sciences |
| Volume | 478 |
| Early online date | 5 Nov 2018 |
| DOIs | |
| Publication status | Published - 1 Apr 2019 |
Keywords
- Classification
- Data mining
- Decision tree learning
- If-then rules
- Machine learning
- Rule learning