In supervised learning, the efficacy of classifier algorithms is heavily dependent on the quality of data. Imbalanced datasets, where the class distribution is not uniform, pose a significant challenge, often leading to suboptimal classifier performance. Traditional approaches to rectifying this imbalance have relied on duplicating minority class instances or generating synthetic data, which can introduce bias or outliers. Our novel Statistic Deviation Mode Balancer (SDMB) algorithm addresses these issues by generating new instances that closely mirror the original data structure. Utilizing standard deviation and mode analysis, SDMB strategically synthesizes minority class data while avoiding the pitfalls of outlier generation. The result is a balanced dataset that facilitates more accurate learning by classifier algorithms. We have rigorously tested SDMB across various datasets and compared its performance against existing balancing methods. Our findings indicate that SDMB not only outperforms its counterparts but also significantly enhances the practical application of classifier algorithms in real-world datasets.
Software bug prediction (SBP) involves identifying or categorizing software modules likely to contain defects, using underlying system properties such as software metrics. SBP plays a crucial role in enhancing software project quality and mitigating maintenance risks. Numerous machine learning (ML) algorithms have been developed to predict software bugs. Class imbalance poses a significant challenge for these algorithms, impeding their effectiveness and resulting in imbalanced false-positive and false-negative outcomes. However, limited research has specifically tackled class imbalance in the context of SBP. This study investigates the prediction performance of homogeneous ensemble methods (bagging, boosting, and voting classifiers (VC)) combined with under-sampling methods to address the class imbalance problem and improve the accuracy of SBP. Two ensembles are classified as bagging ensembles: decision tree (DT) and random forest (RF); two are classified as boosting ensembles: AdaBoost (AB) and gradient boosting (GB); and DT, RF, K-Nearest Neighbours (K-NN), and support vector machine (SVM) are considered for the VC. To establish the effectiveness of the proposed models, experiments were conducted on available benchmark datasets, comprising five public datasets based on both class-level and file-level metrics. We compared and evaluated the performance of the proposed models using several performance measures: accuracy, precision, recall, F-measure, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUROC). The experimental findings demonstrated that the proposed models predict software bugs more effectively on balanced datasets than on the original datasets, with an improvement of up to 11% in accuracy for the class-level metrics and 10% for the file-level metrics. The results indicate that the use of data sampli
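The abstract above lists MCC alongside accuracy as an evaluation measure. The following is a minimal sketch, not taken from the study, of why the two can disagree on imbalanced data. The confusion-matrix counts are invented for illustration, and the convention of returning 0.0 when the denominator is zero is an assumption of this sketch.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient computed from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption of this sketch: report 0.0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0

# Hypothetical counts: 100 modules, 5 of them buggy.
# A model that predicts "clean" for every module misses all 5 bugs.
always_clean = dict(tp=0, fp=0, fn=5, tn=95)
# A model that finds 4 of the 5 bugs at the cost of 5 false alarms.
finds_bugs = dict(tp=4, fp=5, fn=1, tn=90)

acc_clean, mcc_clean = accuracy(**always_clean), mcc(**always_clean)
acc_bugs, mcc_bugs = accuracy(**finds_bugs), mcc(**finds_bugs)
print(acc_clean, mcc_clean)
print(acc_bugs, mcc_bugs)
```

In this invented example the trivial model has the higher accuracy while its MCC is zero, which is one reason imbalance-aware measures are reported next to accuracy.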
In the pay-per-click online advertisement model, fraudulent publishers are rarer than genuine publishers. This high class imbalance between fraudulent and genuine publishers poses a challenge for the accurate classification of fraudsters because automated learning models are biased towards the majority class. In this work, an empirical evaluation of popular data sampling methods is carried out using nine state-of-the-art learning models for classifying fraudsters in online advertisement. The main objective of this work is to understand the effect of oversampling, under-sampling, and hybrid sampling methods on the performance of various classifiers in click fraud detection. Extensive experiments are performed on the benchmark FDMA-2012 user-click dataset. The performance of each combination of data sampling method and classifier is validated using average precision, recall, F1-score, and AUC. The results are also compared with existing state-of-the-art models. The results suggest that adaptive synthetic sampling (ADASYN) oversampling with a gradient tree boosting (GTB) model performs best, with an average precision score of 64.32%.
ISBN: (print) 9783031422928; 9783031422935
Many recent visual analytics tools use an exploratory model analysis workflow to let users explore a set of potential machine/deep learning models. As part of the workflow, these tools provide a summary view of the underlying dataset to help users better understand trends in their data. Due to the iterative nature of such workflows, users may need to return to the data exploration phase multiple times. To save time and resources during data pre-processing and visualization, we propose using sampled data rather than the complete dataset to show trends in data summary views. As a proof of concept, we built a visualization tool, called SamS-Vis, that uses five sampling techniques to collect sampled data and then shows the summary views using histogram line charts. It also lets users see the whole-data summary view of the selected field(s) as a histogram bar chart on demand.
Sentiment analysis (also called opinion mining) is one of the most widely studied fields of natural language processing. E-commerce service providers use this technique to analyze the sentiment towards a product or a service in texts, posts, and comments. In particular, service providers and users want to understand the sentiment on product aspect categories rather than the overall sentiment of a product. These aspect categories suffer from the class imbalance problem. Therefore, a BERT (Bidirectional Encoder Representations from Transformers) based fine-tuning model is presented to deal with the imbalanced aspect categorization task. Specifically, this paper studies various data sampling techniques, such as stratified random sampling (SRS), random undersampling (RUS), and random oversampling (ROS), for reducing the class imbalance problem. Empirically, the results show that the proposed BERT fine-tuning model with the SRS technique achieves better results. In particular, the model achieves 96.21% on validation and 96.47% on testing using the news aggregator data. Similarly, on the SMS spam collection data, it achieves 99.20% on validation and 99.10% on testing.
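The three sampling techniques named above (SRS, RUS, ROS) can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the function names, the toy label list, and the choice to balance every class to the smallest or largest class size are assumptions made here.

```python
import random
from collections import Counter, defaultdict

def group_by_label(labels):
    """Map each class label to the list of row indices carrying it."""
    groups = defaultdict(list)
    for i, y in enumerate(labels):
        groups[y].append(i)
    return groups

def random_undersample(labels, rng):
    """RUS: keep only as many rows per class as the smallest class has."""
    groups = group_by_label(labels)
    n = min(len(idx) for idx in groups.values())
    return sorted(i for idx in groups.values() for i in rng.sample(idx, n))

def random_oversample(labels, rng):
    """ROS: duplicate rows (drawn with replacement) until every class matches the largest."""
    groups = group_by_label(labels)
    n = max(len(idx) for idx in groups.values())
    picked = []
    for idx in groups.values():
        picked.extend(idx)
        picked.extend(rng.choices(idx, k=n - len(idx)))
    return sorted(picked)

def stratified_sample(labels, fraction, rng):
    """SRS: draw the same fraction from every class, preserving class proportions."""
    picked = []
    for idx in group_by_label(labels).values():
        k = max(1, round(len(idx) * fraction))
        picked.extend(rng.sample(idx, k))
    return sorted(picked)

# Hypothetical toy data: 90 rows of one aspect category, 10 of another.
labels = ["neg"] * 90 + ["pos"] * 10
rng = random.Random(0)
rus = Counter(labels[i] for i in random_undersample(labels, rng))
ros = Counter(labels[i] for i in random_oversample(labels, rng))
srs = Counter(labels[i] for i in stratified_sample(labels, 0.5, rng))
print(rus, ros, srs)
```

Each function returns row indices rather than rows, so the same selection can be applied to the texts and to their labels. Note that SRS as sketched keeps the original class proportions, whereas RUS and ROS change them.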
ISBN: (print) 9781728187105
A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including data-level, classifier-level, and hybrid techniques, for SATD detection in Within-Project and Cross-Project setups. Our results show that the data-level balancing technique SMOTE or the classifier-level ensemble approaches Random Forest or XGBoost are reasonable choices, depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (a cost-sensitive Convolutional Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-Project F1 score by 10% but fell short in the Cross-Project setup by 9%. This supports the higher generalization capability of deep learning in Cross-Project SATD detection, whereas within individual projects, classical machine learning algorithms can deliver better performance. We also evaluate and quantify the impact of duplicate source code comments on SATD detection performance. Finally, we employ SHAP and discuss the interpreted SATD features. We have included the replication package and shared a web-based SATD prediction tool with the balancing techniques in this study.
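SMOTE, the data-level technique mentioned above, creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that idea under stated assumptions: the function name `smote_like`, the 2-D toy feature vectors, and the parameter defaults are hypothetical, and the code is not the implementation used in the study.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical minority-class feature vectors (for example, two numeric features per comment).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
synthetic = smote_like(minority, n_new=5)
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, it stays inside the range already spanned by the minority class, unlike plain duplication, which adds no new feature values.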
ISBN: (print) 9781467365161
In biological studies, Saccharomyces cerevisiae yeast cells are used to study the behaviour of proteins. This is a time-consuming and not completely objective process. Hence, image analysis platforms are developed to address these problems and to offer per-cell analysis as well. The segmentation algorithms implemented in such platforms segment the healthy cells along with artefacts, such as debris and dead cells, that exist in the cultured medium. The novel idea in this work is to apply a machine learning approach to train the segmentation system to distinguish healthy cell objects from the other objects. This approach is based on the analysis of a set of relevant individual cell features extracted from microscope images of yeast cells. These features include texture measurements and wavelet-based texture measurements, as well as moment invariant features. They were introduced to describe the intensity and morphology characteristics in a more sophisticated way. A set of classification systems, data sampling techniques, data normalization schemes, and feature selection algorithms were tested and evaluated to build a classification model for use within the segmentation module. The study selects the simple logistic classification model as the best approach for classifying our dataset of 1380 cells. This system increases the performance level of our image and data analysis modules and improves the segmentation and, consequently, the analysis of the measurement results. This leads to a better pattern recognition system as well.
ISBN: (print) 9781479958795; 9781479958801
Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle the problem of class imbalance, and the problem of excessive features in the dataset may be alleviated through feature selection. In this work, we examine various approaches for applying these techniques simultaneously to tackle both of these challenges and build effective classification models. In particular, we ask whether the order of these techniques and the use of unsampled or sampled datasets for building classification models makes a difference. We conducted an empirical study on a series of seven high-dimensional and severely imbalanced biological datasets using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three different data-sampling approaches: data sampling followed by feature selection using the unsampled data (DS-FS-UnSam) and selected features; data sampling followed by feature selection using the sampled data (DS-FS-Sam) and selected features; and feature selection followed by data sampling (FS-DS) using sampled data and selected features. We used Random Undersampling (RUS) to achieve minority:majority class ratios of 35:65 and 50:50. The experimental results show that there are statistically significant differences among the three data-sampling approaches only when using the class ratio of 50:50, with a multiple comparison test showing that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend the use of the DS-FS-UnSam approach for this class ratio. On the other h
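The abstract above uses RUS to reach minority:majority ratios of 35:65 and 50:50. As a rough sketch of the arithmetic, with n_min minority rows and a target minority share of p percent, about n_min × (100 − p) / p majority rows are kept. The function names and the toy dataset sizes below are assumptions for illustration and do not come from the study.

```python
import random

def majority_keep_count(n_minority, minority_pct):
    """Majority rows to keep so that the minority makes up minority_pct percent of the result."""
    return round(n_minority * (100 - minority_pct) / minority_pct)

def undersample_to_ratio(labels, minority_label, minority_pct, seed=0):
    """Keep all minority rows and a random subset of majority rows; returns row indices."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Never ask for more majority rows than exist.
    keep = min(len(majority), majority_keep_count(len(minority), minority_pct))
    return sorted(minority + rng.sample(majority, keep))

# Hypothetical dataset: 35 minority samples and 965 majority samples.
labels = ["pos"] * 35 + ["neg"] * 965
kept_35_65 = undersample_to_ratio(labels, "pos", 35)
kept_50_50 = undersample_to_ratio(labels, "pos", 50)
print(len(kept_35_65), len(kept_50_50))
```

In this toy case the 35:65 target keeps 65 majority rows (100 rows in total) and the 50:50 target keeps 35 (70 rows in total), which shows how much majority data the stricter ratio discards.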