TY - JOUR
T1 - Self-optimised cost-sensitive classifiers for early field failure prediction in storage systems
AU - Bader-El-Den, Mohamed
AU - Perry, Todd
PY - 2023/12/1
Y1 - 2023/12/1
N2 - Data storage systems such as disk arrays (DAs) go through rigorous testing in the production phase; however, a few of these DAs fail in the field and are returned to the manufacturer. Although failure occurs in a relatively small percentage of the manufactured DAs, it results in a significant loss of data, time and money. This paper is motivated by the hypothesis that many of these failures could be predicted at the testing stage through data mining and machine learning. Field failure is modelled as a classification problem; however, as in many real-world problems, it suffers from significant class imbalance. Several approaches have been proposed that attempt to improve the performance of imbalanced classification by either modifying the dataset (resampling) or assigning classification costs to the classes via a cost matrix. These methods have been shown to improve performance, but they come with many parameters that need to be set, which usually requires a lengthy exhaustive search, especially on problems with several classes. This paper presents a new scalable genetic algorithm approach for automating the design of the cost matrix (CM) along with the algorithm parameters. The proposed algorithms are tested on a real-world manufacturing dataset from Seagate disk arrays; the target is to predict, from the devices’ testing data, those that are likely to fail in the field. To demonstrate its performance, the proposed approach is evaluated on a number of standard datasets and compared with other state-of-the-art methods.
AB - Data storage systems such as disk arrays (DAs) go through rigorous testing in the production phase; however, a few of these DAs fail in the field and are returned to the manufacturer. Although failure occurs in a relatively small percentage of the manufactured DAs, it results in a significant loss of data, time and money. This paper is motivated by the hypothesis that many of these failures could be predicted at the testing stage through data mining and machine learning. Field failure is modelled as a classification problem; however, as in many real-world problems, it suffers from significant class imbalance. Several approaches have been proposed that attempt to improve the performance of imbalanced classification by either modifying the dataset (resampling) or assigning classification costs to the classes via a cost matrix. These methods have been shown to improve performance, but they come with many parameters that need to be set, which usually requires a lengthy exhaustive search, especially on problems with several classes. This paper presents a new scalable genetic algorithm approach for automating the design of the cost matrix (CM) along with the algorithm parameters. The proposed algorithms are tested on a real-world manufacturing dataset from Seagate disk arrays; the target is to predict, from the devices’ testing data, those that are likely to fail in the field. To demonstrate its performance, the proposed approach is evaluated on a number of standard datasets and compared with other state-of-the-art methods.
KW - machine learning
KW - failure analysis
KW - classification algorithm
KW - data storage systems
KW - evolutionary algorithms
KW - genetic algorithm
KW - random forest
UR - https://www.sciencedirect.com/science/article/abs/pii/S221065022300161X?CMX_ID=&SIS_ID=&dgcid=STMJ_AUTH_SERV_PUBLISHED&utm_acid=76325585&utm_campaign=STMJ_AUTH_SERV_PUBLISHED&utm_in=DM405096&utm_medium=email&utm_source=AC_
U2 - 10.1016/j.swevo.2023.101388
DO - 10.1016/j.swevo.2023.101388
M3 - Article
SN - 2210-6502
VL - 83
JO - Swarm and Evolutionary Computation
JF - Swarm and Evolutionary Computation
M1 - 101388
ER -