Subclass-based semi-random data partitioning for improving sample representativeness

Han Liu, Shyi-Ming Chen, Mihaela Cocea

Research output: Contribution to journal › Article › peer-review



In machine learning tasks, it is essential for a data set to be partitioned into a training set and a test set in a specific ratio. In this context, the training set is used for learning a model for making predictions on new instances, whereas the test set is used for evaluating the prediction accuracy of a model on new instances. In the context of human learning, a training set can be viewed as learning material that covers knowledge, whereas a test set can be viewed as an exam paper that provides questions for students to answer. In practice, data partitioning has typically been done by randomly selecting 70% of the instances for training and the rest for testing. In this paper, we argue that random data partitioning is likely to result in the sample representativeness issue, i.e., training and test instances show very dissimilar characteristics, leading to a case similar to testing students on material that was not taught. To address the above issue, we propose a subclass-based semi-random data partitioning approach. The experimental results show that the proposed data partitioning approach leads to significant advances in learning performance due to the improvement of sample representativeness.
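The idea described in the abstract can be illustrated with a short sketch: instances are first grouped by (sub)class and the 70/30 split is then drawn within each group, so that both partitions cover every subclass. This is a minimal illustration assuming a simple grouping key and a fixed 70% ratio; the function name `semi_random_split` and the details of the sampling are assumptions for illustration, not the authors' exact procedure.

```python
import random
from collections import defaultdict

def semi_random_split(instances, key, train_ratio=0.7, seed=42):
    """Split instances ~70/30 while preserving each subclass's share.

    `instances` is a list of records; `key(record)` returns the
    (sub)class label used to group them before sampling. (Hypothetical
    helper for illustration only.)
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for inst in instances:
        groups[key(inst)].append(inst)

    train, test = [], []
    for label, members in groups.items():
        rng.shuffle(members)                      # random *within* each subclass
        cut = round(len(members) * train_ratio)   # semi-random: ratio is fixed per group
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# A fully random split can leave a small subclass entirely in one
# partition; the grouped split above guarantees both partitions see it.
data = [("a", i) for i in range(70)] + [("b", i) for i in range(30)]
train, test = semi_random_split(data, key=lambda record: record[0])
```

In contrast, a purely random split over `data` would occasionally place most or all of the minority label "b" in one partition, which is the sample representativeness issue the paper targets.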
Original language: English
Pages (from-to): 208-221
Number of pages: 14
Journal: Information Sciences
Early online date: 5 Nov 2018
Publication status: Published - 1 Apr 2019


Keywords:
  • Classification
  • Data mining
  • Decision tree learning
  • If-then rules
  • Machine learning
  • Rule learning


