Author Affiliation: USIC&T, Guru Gobind Singh Indraprastha University, Sector 16, Dwarka, New Delhi, India
Publication: Procedia Computer Science
Year/Volume: 2024, Vol. 235
Pages: 250-263
Keywords: class imbalance problem; oversampling; undersampling; SMOTE; metaheuristic algorithms; whale optimization algorithm
Abstract: The problem of class imbalance has become a predominant area of research recently. The Synthetic Minority Oversampling Technique (SMOTE) is a popular and widely adopted oversampling technique that effectively addresses the challenge of class imbalance. However, its performance depends on a critical parameter, the number of nearest neighbors k_neighbors, which is often chosen arbitrarily by users and may therefore not yield optimal results. Furthermore, the varying imbalance ratios across datasets add complexity to the task of parameter selection in SMOTE. To address this issue, this paper proposes a hybrid rebalancing technique called Whale Optimization Algorithm based SMOTE (WOA-SMOTE), which combines a metaheuristic technique, WOA, with SMOTE. The algorithm exploits the strengths of WOA in finding the optimal value of k_neighbors for SMOTE, which is crucial for generating synthetic samples that represent the underlying sample distribution more faithfully, thereby improving the performance of classifiers on imbalanced datasets. The study evaluates WOA-SMOTE alongside 6 benchmark sampling techniques on 10 real-world imbalanced datasets from the KEEL repository. These datasets span different domains, with imbalance ratios (IR) ranging from 1.25 to 15.46. Four different classifiers are used, and the evaluation is based on three performance measures: AUC, g-mean, and F1 score. The experimental results show WOA-SMOTE's superior performance over SMOTE on the majority of the datasets. Notably, WOA-SMOTE outperforms existing techniques in terms of F1 score with the SVM and XGBoost classifiers on 8 out of 10 datasets, and its performance remains strong on 6 of these datasets with the random forest and logistic regression classifiers.
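To illustrate the tuning problem the abstract describes, the following is a minimal Python sketch: it searches for the k_neighbors value of SMOTE that maximizes a classifier's cross-validated F1 score. It assumes scikit-learn and imbalanced-learn, and it replaces the paper's WOA search with a plain exhaustive loop for brevity; this is a simplified stand-in for illustration, not the authors' implementation.

    # Sketch: pick the SMOTE k_neighbors that maximizes cross-validated F1.
    # The paper searches this space with the Whale Optimization Algorithm;
    # the exhaustive loop below is a simplifying assumption for illustration.
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic imbalanced dataset (~9:1 ratio), standing in for a KEEL dataset.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    best_k, best_f1 = None, -np.inf
    for k in range(1, 11):  # candidate k_neighbors values
        # The imblearn Pipeline applies SMOTE only to the training folds,
        # so synthetic samples never leak into the validation folds.
        pipe = Pipeline([
            ("smote", SMOTE(k_neighbors=k, random_state=0)),
            ("clf", SVC()),
        ])
        f1 = cross_val_score(pipe, X, y, scoring="f1", cv=5).mean()
        if f1 > best_f1:
            best_k, best_f1 = k, f1

    print(f"best k_neighbors = {best_k}, CV F1 = {best_f1:.3f}")

In WOA-SMOTE the exhaustive loop would be replaced by the whale optimization algorithm, presumably using a classifier score of this kind as the fitness function while exploring the k_neighbors space more economically.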