Search results - Inner Mongolia University Library

Statistic deviation mode balancer (SDMB): A novel sampling algorithm for imbalanced data

NEUROCOMPUTING, 2025, vol. 624

Authors: Alimoradi, Mahmoud; Sadeghi, Reza; Daliri, Arman; Zabihimayvan, Mahdieh. Affiliations: Shafagh Inst Higher Educ, Dept Comp Engn, Tonekabon, Iran; Islamic Azad Univ, Dept Comp Engn, Lahijan Branch, Lahijan, Iran; Marist Coll, Dept Comp Sci, Poughkeepsie, NY, USA; Islamic Azad Univ, Dept Comp Engn, Karaj Branch, Karaj, Iran; Cent Connecticut State Univ, Dept Comp Sci, New Britain, CT, USA

In supervised learning, the efficacy of classifier algorithms is heavily dependent on the quality of data. Imbalanced datasets, where the class distribution is not uniform, pose a significant challenge, often leading to suboptimal classifier performance. Traditional approaches to rectifying this imbalance have relied on duplicating minority class instances or generating synthetic data, which can introduce bias or outliers. Our novel Statistic Deviation Mode Balancer (SDMB) algorithm addresses these issues by generating new instances that closely mirror the original data structure. Utilizing standard deviation and mode analysis, SDMB strategically synthesizes minority class data while avoiding the pitfalls of outlier generation. The result is a balanced dataset that facilitates more accurate learning by classifier algorithms. We have rigorously tested SDMB across various datasets and compared its performance against existing balancing methods. Our findings indicate that SDMB not only outperforms its counterparts but also significantly enhances the practical application of classifier algorithms in real-world datasets.

Keywords: Imbalanced datasets; Classifier performance; Data sampling techniques; Algorithmic classification; Diagnostic analytics; Data balancing

Ensemble-Based Machine Learning Algorithms Combined with Near Miss Method for Software Bug Prediction

INTERNATIONAL JOURNAL OF NETWORKED AND DISTRIBUTED COMPUTING, 2025, vol. 13, no. 1, pp. 1-17

Authors: Khleel, Nasraldeen Alnor Adam; Nehez, Karoly; Fadulalla, Montaser; Hisaen, Ahmed. Affiliations: Univ Miskolc, Dept Informat Engn, H-3515 Miskolc, Hungary; Univ Kassala, Dept Informat Technol, Kassala, Sudan; Univ Holy Quran Taseel Sci, Dept Informat Technol, Aljazeera, Sudan

Software bug prediction (SBP) involves identifying or categorizing software modules likely to contain defects, utilizing underlying system properties such as software metrics. SBP plays a crucial role in enhancing software project quality and mitigating maintenance risks. Numerous machine learning (ML) algorithms have been developed to predict software bugs. Class imbalance poses a significant challenge for these algorithms, impeding their effectiveness and resulting in imbalanced false-positive and false-negative outcomes. However, limited research has been conducted to specifically tackle the issue of class imbalance in the context of SBP. This study investigates the prediction performance of homogeneous ensembles: bagging, boosting, and voting classifier (VC) methods combined with under-sampling methods to address the class imbalance problem and improve the accuracy of SBP. Two ensembles are classified as bagging ensembles: decision tree (DT) and random forest (RF); two ensembles are classified as boosting ensembles: AdaBoost (AB) and gradient boosting (GB), while the DT, RF, K-Nearest Neighbours (K-NN), and support vector machine (SVM) are considered as VC. To establish the effectiveness of the proposed models, the experiments were conducted on the available benchmark datasets, which comprise five public datasets based on both class and file-level metrics. We compared and evaluated the performance of the proposed models according to several performance measures, namely accuracy, precision, recall, f-measure, Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUROC). The experimental findings demonstrated that the proposed models exhibit superior efficiency in predicting software bugs on balanced datasets compared to the original datasets, with an improvement of up to 11% accuracy for the class-level metrics and 10% for the file-level metrics. The results indicate that the use of data sampli

Keywords: Software metrics; Software bug prediction; Machine learning; Ensemble methods; Class imbalance; Data sampling techniques
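
The abstract above lists the Matthews correlation coefficient (MCC) among its measures. As a rough illustration of why MCC is often reported next to accuracy on imbalanced data, the sketch below computes it from binary confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper; returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from binary confusion-matrix counts; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for an empty row or column
    return (tp * tn - fp * fn) / denom

# Hypothetical counts: 90 of 100 modules are clean, 10 are buggy.
# A model that predicts "clean" for every module is 90% accurate,
# yet its MCC is 0 because it never finds a bug.
accuracy = (0 + 90) / 100
mcc = matthews_corrcoef(tp=0, tn=90, fp=0, fn=10)
print(accuracy, mcc)
```

Under these made-up counts, accuracy alone would suggest a strong model, while MCC shows it carries no information about the minority class.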

Data Sampling Strategies for Click Fraud Detection Using Imbalanced User Click Data of Online Advertising: An Empirical Review

IETE TECHNICAL REVIEW, 2022, vol. 39, no. 4, pp. 789-798

Authors: Sisodia, Deepti; Sisodia, Dilip Singh. Affiliation: Natl Inst Technol, Dept Comp Sci & Engn, Raipur, Madhya Pradesh, India

In the pay-per-click online advertisement model, fraudulent publishers are rarer than genuine publishers. This high class imbalance between fraudulent and genuine publishers poses a challenge for the accurate classification of fraudsters due to the bias of automated learning models towards the outnumbered class. In this work, an empirical evaluation of popular data sampling methods is carried out using nine state-of-the-art learning models for classifying fraudsters in online advertisement. The main objective of this work is to understand the effect of oversampling, under-sampling, and hybrid sampling methods on the performance of various classifiers in click fraud detection. Extensive experiments are performed on the benchmark FDMA-2012 user-click dataset. The performance of each combination of data sampling method and classifier is validated using average precision, recall, f1-score, and AUC. The results are also compared with the existing state-of-the-art models. The results suggest that adaptive synthetic sampling (ADASYN) oversampling with a gradient tree boosting (GTB) model performs best with an average precision score of 64.32%.

Keywords: Class imbalance; Click fraud; Data sampling techniques; Ensemble learning; Online advertising; Pay per click

SamS-Vis: A Tool to Visualize Summary View Using Sampled Data

19th International-Federation-for-Information-Processing-Technical-Committee-13 (IFIP TC13) International Conference on Human-Computer Interaction (INTERACT)

Authors: Humayoun, Shah Rukh; Zaidi, Salman; AlTarawneh, Ragaad. Affiliations: San Francisco State Univ, Dept Comp Sci, San Francisco, CA 94132, USA; Univ Kaiserslautern, Kaiserslautern, Germany; Intel Corp, Intel Labs, Santa Clara, CA, USA

ISBN: (print) 9783031422928; 9783031422935

Many recent visual analytics tools use an exploratory model analysis workflow to let users explore a set of potential machine/deep learning models. As part of the workflow, these tools provide a summary view of the underlying dataset so that users can better understand trends in their data. Due to the iterative nature of such workflows, users may need to go back to the data exploration phase multiple times. In order to save time and resources at data pre-processing and visualization time, we propose to use sampled data rather than the complete dataset for showing trends in data summary views. As a proof-of-concept, we built a visualization tool, called SamS-Vis, that uses five sampling techniques to collect sampled data and then shows the summary views using histogram line-charts. It also lets users see the whole-data summary view of the selected field(s) as a histogram bar-chart on demand.

Keywords: Data summary visualization; Data sampling techniques

Imbalanced aspect categorization using bidirectional encoder representation from transformers

Procedia Computer Science, 2023, vol. 218, pp. 757-765

Authors: Ashok Kumar Jayaraman; Abirami Murugappan; Tina Esther Trueman; Gayathri Ananthakrishnan; Ashish Ghosh

Sentiment analysis (also called opinion mining) is one of the widely used research fields of natural language processing. E-commerce service providers use this technique to analyze the sentiment of a product or a service in texts, posts, and comments. In particular, the service providers and users want to understand the sentiment on product aspect categories rather than the overall sentiment of a product. These aspect categories encounter the class imbalance problem. Therefore, the BERT (Bidirectional Encoder Representation from Transformers) based fine-tuning model is presented to deal with the imbalanced aspect categorization task. Specifically, this paper studies various data sampling techniques such as stratified random sampling (SRS), random undersampling (RUS), and random oversampling (ROS) for reducing the class imbalance problem. Empirically, the results show that the proposed BERT fine-tuning model with the SRS technique achieves better results. In particular, the model achieves 96.21% for the validation and 96.47% for testing using the news aggregator data. Similarly, the SMS spam collection data achieves 99.20% for the validation and 99.10% for testing.

Keywords: Deep learning; Aspect category detection; BERT; Class imbalance; Transformers; Data sampling techniques
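
The abstract above names stratified random sampling (SRS) without describing how it was carried out. A minimal sketch of the general idea, assuming SRS here means drawing the same fraction from every class so the sample keeps the original class proportions, might look like this. The function name, the labels, and the 90:10 split are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict

def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices of a sample that keeps each class's share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Draw the same fraction from every class, keeping at least one item.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

labels = ["ham"] * 90 + ["spam"] * 10  # hypothetical 90:10 label list
sample = stratified_sample(labels, fraction=0.2)
counts = {c: sum(labels[i] == c for i in sample) for c in set(labels)}
print(counts)
```

With these assumed inputs the 20% sample contains 18 "ham" and 2 "spam" items, so the 90:10 ratio of the full list is preserved.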

Data Balancing Improves Self-Admitted Technical Debt Detection

29th IEEE/ACM International Conference on Program Comprehension (ICPC) / 18th IEEE/ACM International Conference on Mining Software Repositories (MSR)

Authors: Sridharan, Murali; Mantyla, Mika; Rantala, Leevi; Claes, Maelick. Affiliation: Univ Oulu, M3S, ITEE, Oulu, Finland

ISBN: (print) 9781728187105

A high imbalance exists between technical debt and non-technical debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including Data level, Classifier level, and Hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the Data level balancing technique SMOTE or the Classifier level Ensemble approaches Random Forest or XGBoost are reasonable choices depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (cost-sensitive Convolution Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-Project F1 score by 10% but fell short in the Cross-Project setup by 9%. This supports the higher generalization capability of deep learning in Cross-Project SATD detection, yet while working within individual projects, classical machine learning algorithms can deliver better performance. We also evaluate and quantify the impact of duplicate source code comments on SATD detection performance. Finally, we employ SHAP and discuss the interpreted SATD features. We have included the replication package and shared a web-based SATD prediction tool with the balancing techniques in this study.

Keywords: Self-Admitted Technical Debt; Data imbalance; Classification; Data sampling techniques; Cost-sensitive technique; Ensemble techniques

Machine Learning Approach to Segment Saccharomyces cerevisiae Yeast Cells

International Conference on Advances in Biomedical Engineering (ICABME)

Authors: Tleis, Mohamed; Verbeek, Fons. Affiliation: Leiden Univ, LIACS, Sect Imaging & BioInformat, Leiden, Netherlands

ISBN: (print) 9781467365161

In biological studies, Saccharomyces cerevisiae yeast cells are used to study the behaviour of proteins. This is a time-consuming and not completely objective process. Hence, image analysis platforms are developed to address these problems and to offer analysis per cell as well. The segmentation algorithms implemented in such platforms can segment the healthy cells, along with artefacts such as debris and dead cells that exist in the cultured medium. The novel idea in this work is to apply a machine learning approach to train the segmentation system to distinguish the healthy cell objects from the other objects. Such an approach is based on the analysis of a set of relevant individual cell features extracted from the microscope images of yeast cells. These features include texture measurements and wavelet-based texture measurements, as well as moment invariant features. Those features were introduced to describe the intensity and morphology characteristics in a more sophisticated way. A set of classification systems, data sampling techniques, data normalization schemes and feature selection algorithms were tested and evaluated to build a classification model to be used within the segmentation module. The study picks the simple logistic classification model as the best approach to classify our dataset of 1380 cells. This system increases the performance level in our image and data analysis modules, and improves the segmentation and consequently the analysis of the measurement results. This leads to a better pattern recognition system as well.

Keywords: biological techniques; biology computing; cellular biophysics; feature extraction; image classification; image segmentation; image texture; learning (artificial intelligence); microorganisms; optical microscopy; wavelet transforms; S. cerevisiae cell segmentation; Saccharomyces cerevisiae; data analysis module; data normalization schemes; data sampling techniques; feature selection algorithms; image analysis module; logistic classification model; machine learning approach; moment invariant features; pattern recognition system; segmentation system training; wavelet based texture measurements; yeast cell microscope images; yeast cells; Accuracy; Biomedical measurement; Logistics; Machine learning algorithms; Vegetation; selection algorithm

Classification Performance of Three Approaches for Combining Data Sampling and Gene Selection on Bioinformatics Data

15th IEEE International Conference on Information Reuse and Integration (IEEE IRI) / IRI-HI / FMI / DIM / EM-RITE / WICSOC / SocialSec / IICPC / NatSec

Authors: Khoshgoftaar, Taghi M.; Fazelpour, Alireza; Dittman, David J.; Napolitano, Amri. Affiliation: Florida Atlantic Univ, Boca Raton, FL 33431, USA

ISBN: (print) 9781479958795; 9781479958801

Bioinformatics datasets pose two major challenges to researchers and data-mining practitioners: class imbalance and high dimensionality. Class imbalance occurs when instances of one class vastly outnumber instances of the other class(es), and high dimensionality occurs when a dataset has many independent features (genes). Data sampling is often used to tackle the problem of class imbalance, and the problem of excessive features in the dataset may be alleviated through feature selection. In this work, we examine various approaches for applying these techniques simultaneously to tackle both of these challenges and build effective classification models. In particular, we ask whether the order of these techniques and the use of unsampled or sampled datasets for building classification models make a difference. We conducted an empirical study on a series of seven high-dimensional and severely imbalanced biological datasets using six commonly used learners and four feature selection rankers from three different families of feature selection techniques. We compared three different data-sampling approaches: data sampling followed by feature selection using the unsampled data (DS-FS-UnSam) and selected features; data sampling followed by feature selection using the sampled data (DS-FS-Sam) and selected features; and feature selection followed by data sampling (FS-DS) using sampled data and selected features. We used Random Undersampling (RUS) to achieve the minority:majority class ratios of 35:65 and 50:50. The experimental results show that there are statistically significant differences among the three data-sampling approaches only when using the class ratio of 50:50, with a multiple comparison test showing that DS-FS-UnSam outperforms the other approaches. Thus, although specific combinations of learner and ranker may favor other approaches, across all choices of learner and ranker we would recommend the use of the DS-FS-UnSam approach for this class ratio. On the other h

Keywords: Data sampling techniques; Data sampling order; Class imbalance; Feature selection
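
The abstract above reports using Random Undersampling (RUS) to reach minority:majority ratios of 35:65 and 50:50. The sketch below shows one plausible way to undersample to a target ratio: if the minority class has m instances and should make up a share s of the result, the number of majority instances to keep is m(1 - s)/s. The function name and the toy label list are hypothetical; the authors' actual procedure is not given in the abstract.

```python
import random

def random_undersample(labels, minority, minority_share, seed=0):
    """Drop random majority items until the minority makes up `minority_share`."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    # minority / (minority + kept) = share  =>  kept = minority * (1 - share) / share
    keep = round(len(minority_idx) * (1 - minority_share) / minority_share)
    keep = min(keep, len(majority_idx))  # cannot keep more than exist
    return sorted(minority_idx + rng.sample(majority_idx, keep))

labels = [1] * 35 + [0] * 465  # hypothetical: 35 positive, 465 negative
kept_35_65 = random_undersample(labels, minority=1, minority_share=0.35)
kept_50_50 = random_undersample(labels, minority=1, minority_share=0.50)
print(len(kept_35_65), len(kept_50_50))
```

In this made-up case all 35 minority instances are kept, together with 65 majority instances for the 35:65 ratio and 35 for the 50:50 ratio; the remaining majority instances are discarded.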
