Most existing clustering-based oversampling algorithms do not consider the spatial distribution of the majority class, so they tend to produce class overlap and to ignore important informative points when synthesizing new samples. To address this problem, this paper analyzes the influence of spatial distribution on the oversampling process and proposes an oversampling algorithm based on adaptive density-difference peak clustering and spatial distribution entropy. First, the spatial distribution of the two classes is introduced into the clustering process, and the minority class is clustered by density peaks computed from the local density difference, so that sub-cluster centers are selected in a principled way and class overlap is reduced. At the same time, the practice of setting the truncation distance from prior experience is replaced: the spatial distribution of the two classes is characterized by a spatial distribution entropy, on the basis of which the truncation distance is selected and optimized automatically. Then, boundary points and sparse points are screened according to the absolute value of the local density difference, and a sampling probability is assigned to each minority-class sample so that these informative points receive more attention. Finally, the spatial distribution entropy is used to evaluate the synthetic sample set and ensure that it balances the distribution of the two classes in the dataset. To test the effectiveness of the algorithm, comparative experiments against five oversampling algorithms are performed on four classifiers and 16 common datasets. The results show that, compared with SMOTE, K-means-SMOTE, BS-SMOTE, ADASYN, and DPC-SMOTE, the proposed algorithm improves significantly on all evaluation indexes.
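To make the density-difference idea concrete, the following is a minimal sketch, not the authors' code: the Gaussian kernel, the entropy defined over d_c-balls, the weighting rule, and all function names are our assumptions. It computes a local density difference for each minority sample, a simple two-class spatial distribution entropy, and sampling probabilities that favor points whose minority and majority densities nearly cancel (boundary or sparse points).

import numpy as np

def local_density(points, refs, d_c):
    # Gaussian-kernel local density of each point w.r.t. a reference set.
    d = np.linalg.norm(points[:, None, :] - refs[None, :, :], axis=2)
    return np.exp(-(d / d_c) ** 2).sum(axis=1)

def density_difference(X_min, X_maj, d_c):
    # rho_min - rho_maj for every minority sample; values near zero flag
    # boundary/sparse points that the paper gives a higher sampling probability.
    rho_min = local_density(X_min, X_min, d_c) - 1.0   # exclude the self-count
    rho_maj = local_density(X_min, X_maj, d_c)
    return rho_min - rho_maj

def spatial_distribution_entropy(X_min, X_maj, d_c):
    # Shannon entropy of the minority/majority mix inside each d_c-ball,
    # averaged over minority samples (one plausible reading of the paper).
    ent = []
    for x in X_min:
        n_min = np.sum(np.linalg.norm(X_min - x, axis=1) < d_c)
        n_maj = np.sum(np.linalg.norm(X_maj - x, axis=1) < d_c)
        p = np.array([n_min, n_maj], dtype=float)
        p = p[p > 0] / p.sum()
        ent.append(-(p * np.log2(p)).sum())
    return float(np.mean(ent))

rng = np.random.default_rng(0)
X_min = rng.normal(0.0, 1.0, (30, 2))
X_maj = rng.normal(2.0, 1.0, (300, 2))
diff = density_difference(X_min, X_maj, d_c=0.8)
prob = np.abs(diff).max() - np.abs(diff)     # small |diff| -> larger weight
prob = prob / prob.sum()                     # per-sample sampling probabilities
print(spatial_distribution_entropy(X_min, X_maj, d_c=0.8), prob[:5])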
The imbalanced-data problem is a major challenge for judicial data analysis, since it often leads to low classification accuracy. Synthesizing new samples by oversampling is a useful way to handle this problem. However, most oversampling algorithms ignore noise samples and do not fully take the data distribution into consideration. For this purpose, an improved cluster-based synthetic oversampling algorithm, the distributed fuzzy-based adaptive synthetic oversampling (DFBASO) algorithm, is proposed; it simultaneously considers the inter-class distribution, the intra-cluster distribution, and the characteristics of noise samples. The proposed DFBASO algorithm is equipped with: 1) fuzzy c-means (FCM) clustering applied to the samples of the minority and majority classes; 2) a weighted distribution based on two factors, the inter-class distance and the cluster capacity; and 3) a mixed synthesis method for the different intra-cluster distribution cases. Finally, the judicial dataset and eight public datasets are used to show the effectiveness and general applicability of the proposed DFBASO algorithm for imbalanced data classification. (c) 2021 Elsevier Inc. All rights reserved. With the arrival of the big-data era and the rapid improvement of data acquisition systems, judicial data analysis has become a hot research topic and gained much attention in both academic and applied fields. At the judicial research frontier, big-data analysis is usually combined with artificial-intelligence algorithms to help organizations uncover blind spots, improve trial efficiency and judicial justice, and accelerate the establishment of intelligent trial systems. So far, a great effort has been made on the classification of judicial data and some remarkable results have been reported in the literature; see [29,19,34,2] and references therein.
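The cluster-weighting idea in points 1) and 2) can be sketched roughly as follows. This is not the authors' DFBASO code: the tiny fuzzy c-means implementation, the choice of the majority-class centroid as the inter-class reference, and the product form of the weight are all our assumptions.

import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    # Tiny fuzzy c-means; returns cluster centers and the membership matrix U (n x c).
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))          # random fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U = 1.0 / (d ** (2 / (m - 1)))
        U = U / U.sum(axis=1, keepdims=True)
    return centers, U

def cluster_weights(centers, U, X_maj):
    # Per-cluster sampling weights from inter-class distance and cluster capacity.
    maj_centroid = X_maj.mean(axis=0)
    dist = np.linalg.norm(centers - maj_centroid, axis=1)   # inter-class distance
    capacity = U.sum(axis=0)                                 # soft cluster sizes
    w = dist * capacity
    return w / w.sum()

rng = np.random.default_rng(1)
X_min = rng.normal(0, 1, (40, 2))
X_maj = rng.normal(3, 1, (400, 2))
centers, U = fuzzy_cmeans(X_min, c=3)
print(cluster_weights(centers, U, X_maj))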
Learning from class-imbalanced data is a challenging problem, as standard classification algorithms are designed for balanced class distributions. Scholars address this problem either by modifying classifiers or by generating artificial data through oversampling. The former designs classifiers adapted to imbalanced data, while the latter relies on sampling algorithms, which are data preprocessing steps independent of the classifier. In this paper, we propose a novel synergistic oversampling algorithm that combines oversampling and classification into one procedure without training the classifier repeatedly: it generates new, pertinent samples according to the classification performance of the classifier, without repeated training or deep knowledge of the classifier, so the generated samples can guarantee an improvement in classifier performance. Moreover, the proposed framework wraps the oversampling method without the parameters traditionally required by oversampling methods. Experimental results on several real-life imbalanced datasets demonstrate the effectiveness and efficiency of the proposed algorithm for binary classification problems.
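One possible reading of "generating samples according to the classification performance" is sketched below; this is our loose illustration, not the paper's synergistic framework, and the classifier, data, and interpolation rule are placeholders. A fitted classifier's mistakes on the minority class decide where new points are synthesized, instead of sampling uniformly as plain SMOTE does.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_maj = rng.normal(0, 1, (300, 2)); X_min = rng.normal(1.5, 1, (30, 2))
X = np.vstack([X_maj, X_min]); y = np.r_[np.zeros(300), np.ones(30)]

clf = LogisticRegression().fit(X, y)
miss = X_min[clf.predict(X_min) == 0]      # minority points the model gets wrong

# SMOTE-style interpolation anchored at the misclassified minority points.
nn = NearestNeighbors(n_neighbors=min(5, len(X_min))).fit(X_min)
new = []
for x in miss:
    idx = nn.kneighbors([x], return_distance=False)[0][1:]
    mate = X_min[rng.choice(idx)]
    new.append(x + rng.random() * (mate - x))
X_new = np.array(new) if new else np.empty((0, 2))
print(f"synthesized {len(X_new)} samples near the decision boundary")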
Oversampling algorithms are methods employed in machine learning to address constraints on data quantity. This study aimed to explore how reliability varies as the data volume is progressively increased through oversampling. For this purpose, the synthetic minority oversampling technique (SMOTE) and the borderline synthetic minority oversampling technique (BSMOTE) were chosen. The data inputs, which included air temperature, humidity, and wind speed, are the parameters used in the Fosberg Fire-Weather Index (FFWI). Starting from a base of 52 entries, new datasets were generated by incrementally increasing the data volume in 10% steps up to a total increase of 100%. The augmented data were then used to predict the FFWI with a deep neural network, and the coefficient of determination (R²) was calculated for predictions made with both the original and the augmented datasets. The results suggest that increasing the data volume by more than 50% of the original dataset yields more reliable outcomes. This study introduces a methodology to alleviate the challenge of establishing a standard for data augmentation when employing oversampling algorithms, as well as a means to assess reliability.
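The evaluation protocol can be sketched as follows, under explicit assumptions: random placeholder weather data and a linear placeholder target stand in for the real FFWI table, a small scikit-learn MLPRegressor stands in for the deep neural network, and a hand-rolled SMOTE-style interpolation is used instead of the exact SMOTE/BSMOTE variants. The dataset is grown in 10% steps and R² is tracked on a held-out split.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((52, 3)) * [40, 100, 30]             # temperature, humidity, wind speed
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 2.0 * X[:, 2]   # placeholder target, not the real FFWI

def smote_like(X, y, n_new, k=5, seed=0):
    # Interpolate n_new synthetic rows between a random row and one of its neighbours.
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    rows, targets = [], []
    for _ in range(n_new):
        i = rng.integers(len(X))
        j = rng.choice(nn.kneighbors(X[i:i + 1], return_distance=False)[0][1:])
        lam = rng.random()
        rows.append(X[i] + lam * (X[j] - X[i]))
        targets.append(y[i] + lam * (y[j] - y[i]))
    if not rows:
        return X.copy(), y.copy()
    return np.vstack([X, np.array(rows)]), np.concatenate([y, targets])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for pct in range(0, 101, 10):                        # +0% ... +100% of the original size
    Xa, ya = smote_like(X_tr, y_tr, n_new=int(len(X_tr) * pct / 100))
    model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(Xa, ya)
    print(pct, round(r2_score(y_te, model.predict(X_te)), 3))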
This paper presents a novel, sequentially executed, supervised machine-learning-based electricity theft detection framework using a Jaya-optimized combined Kernel and Tree Boosting (KTBoost) classifier. It uses the XGBoost algorithm to estimate missing values in the acquired dataset during the data pre-processing phase. An oversampling algorithm based on the Robust-SMOTE technique is used to avoid the imbalanced class distribution issue. Afterward, with the aid of a few highly significant statistical, temporal, and spectral features extracted from the acquired kWh dataset, the complex underlying data patterns are captured to enhance the accuracy and detection rate of the classifier. To classify consumers into "Honest" and "Fraudster", the ensemble machine-learning classifier KTBoost, with hyperparameters optimized by the Jaya algorithm, is used. Finally, the developed model is re-trained using a reduced set of highly important features to minimize computational resources without compromising performance. The results show that the proposed theft detection method achieves the highest accuracy (93.38%), precision (95%), and recall (93.18%) among all the studied methods, signifying its value in this area of research.
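A simplified stand-in for this pipeline is sketched below; it is not the paper's code. Plain SMOTE replaces Robust-SMOTE, scikit-learn's GradientBoostingClassifier replaces the Jaya-tuned KTBoost model, the XGBoost-based imputation step is omitted, and the data are synthetic placeholders for the kWh-derived features.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # placeholder kWh-derived features
y = (rng.random(1000) < 0.1).astype(int)             # ~10% "Fraudster" class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # balance the classes

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)

# Re-train on the top-k most important features to cut computation.
top = np.argsort(clf.feature_importances_)[::-1][:8]
clf_small = GradientBoostingClassifier(random_state=0).fit(X_bal[:, top], y_bal)
pred = clf_small.predict(X_te[:, top])
print(accuracy_score(y_te, pred),
      precision_score(y_te, pred, zero_division=0),
      recall_score(y_te, pred, zero_division=0))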
ISBN (print): 9781509063529
Machine-learning techniques play a crucial part in intrusion detection and have greatly changed the original intrusion detection methods, so how to use them to achieve better detection results is an important question. However, owing to defects in the machine-learning algorithms and the data imbalance between attack behaviors and normal behaviors in the network, the detection rate of low-frequency attack behaviors cannot be effectively improved. To solve this issue at the data level, a novel Region Adaptive Synthetic Minority Oversampling Technique (RA-SMOTE) is proposed. Three different types of classifiers, support vector machines (SVM), BP neural networks (BPNN), and random forests (RF), are used to test the effectiveness of the algorithm. Empirical results on the NSL-KDD dataset show that the proposed algorithm can effectively solve the class imbalance problem and improve the detection rate of low-frequency attacks.
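The evaluation setup can be illustrated with the small harness below. Plain SMOTE stands in for RA-SMOTE and a synthetic two-class set stands in for NSL-KDD; it simply compares the detection rate (recall) of the rare attack class for the three classifier families with and without oversampling.

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (2000, 10)), rng.normal(2, 1, (60, 10))])
y = np.r_[np.zeros(2000), np.ones(60)].astype(int)    # 1 = rare attack class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

for name, clf in [("SVM", SVC()), ("BPNN", MLPClassifier(max_iter=1000)),
                  ("RF", RandomForestClassifier(random_state=0))]:
    plain = clf.fit(X_tr, y_tr).predict(X_te)
    over = clf.fit(X_bal, y_bal).predict(X_te)
    print(name, "attack recall w/o oversampling:", round(recall_score(y_te, plain), 2),
          "with:", round(recall_score(y_te, over), 2))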
ISBN (print): 9781665464512
The integration of industrialization and informatization has exposed industrial control systems (ICSs) to increasingly serious security challenges. Currently, the mainstream method of protecting ICSs is the deep-learning-based intrusion detection system (IDS). However, such methods depend on a massive amount of high-quality data. Owing to protocol limitations and the characteristics of ICSs, their data usually suffer from low quality and class imbalance, which significantly affects IDS accuracy. In this study, an IDS for ICSs that combines a data expansion algorithm with a CNN is proposed. A novel normalized neighborhood weighted convex combined random sample (NNW-CCRS) oversampling algorithm is designed, which automatically attenuates the effect of noise and expands the imbalanced data to produce balanced ICS datasets. By reducing the impact of imbalanced ICS data on the IDS, the system effectively protects ICS security. The Secure Water Treatment (SWaT) dataset was used for experimental validation, and the results confirm that the accuracy of the proposed system improved by approximately 20% compared to an IDS without data expansion.
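The core operation named by NNW-CCRS might look roughly like the sketch below; this is our reading of the name, not the authors' algorithm, and the neighbourhood size, weighting, and data are placeholders. Each synthetic point is a convex combination of a minority sample's k nearest minority neighbours, with random weights normalized to sum to one.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def convex_combined_oversample(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k).fit(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        idx = nn.kneighbors(X_min[i:i + 1], return_distance=False)[0]
        w = rng.random(k)
        w /= w.sum()                                      # normalized neighbourhood weights
        out.append((w[:, None] * X_min[idx]).sum(axis=0)) # convex combination
    return np.array(out)

X_min = np.random.default_rng(1).normal(size=(25, 4))
print(convex_combined_oversample(X_min, n_new=10).shape)  # (10, 4)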
Class imbalance is a common phenomenon in rockburst data, and the prediction of rockburst intensity with intelligent methods requires a balanced dataset. This presents challenges for standard classification algorithms, which are designed for well-balanced class distributions. This paper develops a modified synthetic minority oversampling technique based on K-means clustering (KM-SMOTE) to reduce the imbalance in the rockburst dataset. First, the study collects 226 rockburst cases worldwide as the original supporting dataset and selects four indexes to predict rockburst intensity: the maximum tangential stress of the surrounding rock σθ, the uniaxial compressive strength of the rock σc, the tensile strength of the rock σt, and the elastic energy index Wet. Second, KM-SMOTE uses K-means to cluster the minority-class samples and then performs SMOTE oversampling on each cluster, yielding 388 data points. To establish a nonlinear relationship between rockburst intensity and its predictors, six machine-learning classifiers are used. The dataset is randomly divided into training and test sets, with 80% of the data used for training. In the training and testing phases, the original dataset, the SMOTE-processed dataset, and the KM-SMOTE-processed dataset were fed into the machine-learning models for predicting rockburst intensity, where the KM-SMOTE-processed dataset was 3.3% and 10.5% more accurate than the SMOTE-processed dataset, respectively. In the Jiangbian Hydropower Station engineering application, the KM-SMOTE algorithm achieves up to a 25% improvement in accuracy compared with the data processed by SMOTE. Overall, the proposed modified oversampling algorithm effectively overcomes class imbalance in the rockburst dataset and significantly contributes to the intelligent prediction of rockburst by machine learning in engineering.
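The two-step procedure (K-means on the minority class, then SMOTE-style interpolation inside each cluster) can be sketched as below. This is our stand-in, not the paper's implementation: the number of clusters, the per-cluster quota, and the synthetic four-column data replacing σθ, σc, σt, and Wet are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def km_smote(X_min, n_new, n_clusters=3, k=3, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_min)
    synth = []
    for c in range(n_clusters):
        Xc = X_min[labels == c]
        if len(Xc) < 2:
            continue                                      # cannot interpolate in a singleton cluster
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(Xc))).fit(Xc)
        for _ in range(n_new // n_clusters):
            i = rng.integers(len(Xc))
            j = rng.choice(nn.kneighbors(Xc[i:i + 1], return_distance=False)[0][1:])
            synth.append(Xc[i] + rng.random() * (Xc[j] - Xc[i]))
    return np.array(synth)

X_min = np.random.default_rng(2).normal(size=(40, 4))     # e.g. four rockburst indexes
print(km_smote(X_min, n_new=60).shape)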
The increase in mining depth necessitates higher strength requirements for hard-rock pillars, making pillar stability analysis crucial for pillar design and safe underground operations. To improve the accuracy of predicting the stability state of mine pillars, a prediction model is proposed in which the subtraction-average-based optimizer (SABO) tunes the hyperparameters of a least-squares support vector machine (LSSVM). First, by analyzing feature redundancy in the mine pillar dataset and performing feature selection, five parameter combinations were constructed to examine their effects on the performance of different models. Second, the SABO-LSSVM model was compared vertically with classic models and horizontally with other optimized models to ensure a comprehensive and objective evaluation. Finally, two data sampling methods and a combined sampling method were used to correct the model's bias toward different categories of mine pillars. The results demonstrate that the SABO-LSSVM model achieves good accuracy and overall performance, providing valuable insights for mine pillar stability prediction.
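The model being tuned can be sketched compactly; the block below solves the standard LS-SVM linear system with an RBF kernel and searches over the two usual hyperparameters (regularization gamma and kernel width sigma). A plain random search is used here as a stand-in for SABO, which we do not reproduce, and the data, ranges, and names are placeholders.

import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_fit(X, y, gamma, sigma):
    # Solve the LS-SVM system  [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y].
    n = len(X)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0; A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.r_[0.0, y])
    return sol[0], sol[1:]                      # bias b, dual coefficients a

def lssvm_predict(X_tr, a, b, sigma, X_te):
    return np.sign(rbf(X_te, X_tr, sigma) @ a + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5)); y = np.sign(X[:, 0] + 0.5 * X[:, 1])
X_tr, y_tr, X_te, y_te = X[:90], y[:90], X[90:], y[90:]

best_acc, best_params = -1.0, None
for _ in range(30):                             # random search as the SABO stand-in
    gamma, sigma = 10 ** rng.uniform(-1, 3), 10 ** rng.uniform(-1, 1)
    b, a = lssvm_fit(X_tr, y_tr, gamma, sigma)
    acc = (lssvm_predict(X_tr, a, b, sigma, X_te) == y_te).mean()
    if acc > best_acc:
        best_acc, best_params = acc, (gamma, sigma)
print(best_acc, best_params)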
This paper aims to build an employee attrition classification model based on the Stacking algorithm. An oversampling algorithm is applied to address the issue of data imbalance, and the Random Forest feature importance ranking method is used to resolve the overfitting problem after data cleaning and preprocessing. Different algorithms are first used to establish classification models as control experiments, with R-squared indicators used for evaluation, and the Stacking algorithm is then used to establish the final classification model. The model has practical and significant implications for both human resource management and employee attrition analysis.
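A brief sketch of this modelling setup follows, using scikit-learn's StackingClassifier. The synthetic imbalanced data, the choice of SMOTE as the oversampler, and the particular base learners are our assumptions, not details from the paper.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1500, n_features=20, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # rebalance the classes

# Control experiments would fit each base learner on its own; here we go straight to stacking.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svc", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_bal, y_bal)
print(classification_report(y_te, stack.predict(X_te)))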