Search results - Inner Mongolia University Library

A weighted hybrid ensemble method for classifying imbalanced data

KNOWLEDGE-BASED SYSTEMS, 2020, vol. 203

Authors: Zhao, Jiakun; Jin, Ju; Chen, Si; Zhang, Ruifeng; Yu, Bilin; Liu, Qingfang

Affiliations: Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China; Univ Sci & Technol China, Sch Management, Hefei 230026, Peoples R China; Xi An Jiao Tong Univ, Sch Math & Stat, Xian 710049, Peoples R China

Most real-world datasets are imbalanced. Data imbalance can be defined as the situation in which the number of instances in some classes greatly exceeds the number of instances in other classes. In both data mining and machine learning, data imbalance can have adverse effects. Current methods for addressing data imbalance can be divided into data-level methods, algorithm-level methods and hybrid methods. In this paper, we propose a weighted hybrid ensemble method for classifying imbalanced data in binary classification tasks, called WHMBoost. Within the framework of the boosting algorithm, the presented method combines two data sampling methods and two base classifiers, and each sampling method and each base classifier is assigned a corresponding weight, so that their advantages complement one another. The performance of WHMBoost has been evaluated on 40 benchmark imbalanced datasets against state-of-the-art ensemble methods such as AdaBoost, RUSBoost and SMOTEBoost, using AUC, F-Measure and Geometric Mean as the performance evaluation criteria. Experimental results show significant improvement over the other methods, and it can be concluded that WHMBoost is a promising and effective algorithm for dealing with imbalanced datasets. (C) 2020 Elsevier B.V. All rights reserved.

Keywords: data imbalance; Binary classification; Boosting algorithm; data sampling methods; Base classifiers
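The abstract above names AUC, F-Measure and Geometric Mean as evaluation criteria for binary imbalanced classification. As a rough illustration only, these three measures can be computed as in the sketch below. The function names are hypothetical, the definitions are the standard textbook ones (F1 as F-Measure, the square root of sensitivity times specificity as Geometric Mean, and the pairwise-ranking form of AUC), and nothing here is taken from the paper's own code.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def f_measure(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall (assumed variant)."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def geometric_mean(y_true, y_pred, positive=1):
    """Square root of sensitivity times specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of
    instances, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On a dataset where the minority class is rare, plain accuracy can look high even when every minority instance is missed, which is one reason measures such as these are preferred for imbalanced data.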

Metabolic Syndrome and Development of Diabetes Mellitus: Predictive Modeling Based on Machine Learning Techniques

IEEE ACCESS, 2019, vol. 7, pp. 1365-1375

Authors: Perveen, Sajida; Shahbaz, Muhammad; Keshavjee, Karim; Guergachi, Aziz

Affiliations: Univ Engn & Technol, Dept Comp Sci & Engn, Lahore 54890, Pakistan; Ryerson Univ, Res Lab Adv Syst Modelling, Toronto, ON M5B 2K3, Canada; Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON M5S, Canada; Ryerson Univ, Ted Rogers Sch Informat Technol Management, Toronto, ON M5B 2K3, Canada; York Univ, Dept Math & Stat, Toronto, ON M3J 1P3, Canada

The objective of this inductive research was to investigate: 1) the relationship between diabetes mellitus and individual risk factors of metabolic syndrome (MetS), in a non-conservative setting; 2) the prediction of future onset of diabetes using relevant risk factors of MetS; and 3) the relative performance of machine learning methods when data sampling techniques are used to generate balanced training sets. The dataset used in this research contains 667,907 records for a period ranging from 2003 to 2013. To quantify the contribution of individual risk factors of MetS to the development of diabetes in a non-conservative setting, logistic regression analysis was performed. Our analyses contradict the view that diabetes is commonly associated with low levels of high-density lipoprotein (HDL). Instead, our results demonstrate that increased levels of HDL are positively correlated with diabetes onset, particularly in women. We also proposed J48 decision tree and Naive Bayes methods for prediction of future onset of diabetes using relevant risk factors obtained from logistic regression analysis, over balanced and unbalanced datasets. The results demonstrated the superiority of Naive Bayes with the K-medoids under-sampling technique as compared to random under-sampling, oversampling, and no sampling. It achieved, on average, 79% receiver operating characteristic performance with an increased true positive rate. The results of this paper suggest further research to clarify the pathophysiological significance of HDL and pathways in the development of diabetes.

Keywords: Metabolic syndrome; decision tree; National Heart, Lung, and Blood Institute (NHLBI) and American Heart Association (AHA); diagnostic prediction; data sampling methods; K-medoids; random under sampling; over sampling
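Random under-sampling, one of the baselines the abstract above compares against, can be illustrated with a minimal sketch. This is not the K-medoids under-sampling the paper favours, and the function name, the fixed seed and the choice to shrink every class to the size of the smallest one are assumptions made for illustration.

```python
import random
from collections import defaultdict

def random_under_sample(records, labels, seed=0):
    """Randomly drop instances until every class has as many
    instances as the smallest class (hypothetical helper)."""
    if not records:
        return [], []
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable

    # Group the records by their class label.
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)

    # Every class is reduced to the size of the smallest class.
    target = min(len(group) for group in by_class.values())

    sampled_records, sampled_labels = [], []
    for label, group in by_class.items():
        kept = group if len(group) == target else rng.sample(group, target)
        sampled_records.extend(kept)
        sampled_labels.extend([label] * target)
    return sampled_records, sampled_labels
```

For example, calling it on ten records of which two are positive returns two positive and two negative records. Because the discarded majority instances are chosen blindly, informative ones can be lost, which is the kind of weakness that cluster-based selection such as K-medoids is meant to reduce.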

An Under-sampling Method with Support Vectors in Multi-class Imbalanced Data Classification

13th International Conference on Software, Knowledge, Information Management and Applications (SKIMA) / International Workshop on Applied Artificial Intelligence (AI Maldives)

Authors: Arafat, Md. Yasir; Hoque, Sabera; Xu, Shuxiang; Farid, Dewan Md.

Affiliations: United Int Univ, Dept Comp Sci & Engn, Madani Ave, Dhaka 1212, Bangladesh; Univ Tasmania, Fac Sci Engn & Technol, Sch Engn & ICT, Hobart, Tas, Australia

ISBN: (print) 9781728127415

Multi-class imbalanced data classification in supervised learning is one of the most challenging research issues in machine learning for data mining applications. Although several data sampling methods have been introduced by computational intelligence researchers in the past decades for handling imbalanced data, learning from imbalanced data remains a challenging task and a significant research focus. Traditional machine learning algorithms are usually biased toward the majority class instances and ignore the minority class instances. As a result, the prediction accuracy of classifiers may suffer. Generally, under-sampling and over-sampling methods are used with single-model classifiers or ensemble learning to deal with imbalanced data. In this paper, we introduce an under-sampling method with support vectors for classifying imbalanced data. The proposed approach selects the most informative majority class instances based on the support vectors that help to form the decision boundary. We have tested the performance of the proposed method with single classifiers (C4.5 Decision Tree classifier and naive Bayes classifier) and ensemble classifiers (Random Forest and AdaBoost) on 13 benchmark imbalanced datasets. The experimental results show that the proposed method produces high accuracy when classifying both the minority and majority class instances compared to other existing methods.

Keywords: data sampling methods; Ensemble Learning; Imbalanced data; Over-sampling; Under-sampling

MAHAKIL: Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2018, vol. 44, no. 6, pp. 534-550

Authors: Bennin, Kwabena Ebo; Keung, Jacky; Phannachitta, Passakorn; Monden, Akito; Mensah, Solomon

Affiliations: City Univ Hong Kong, Dept Comp Sci, Kowloon Tong, Hong Kong, Peoples R China; Chiang Mai Univ, Coll Arts Media & Technol, Chiang Mai 50200, Thailand; Okayama Univ, Grad Sch Nat Sci & Technol, Okayama 7000082, Japan

Highly imbalanced data typically make accurate predictions difficult. Unfortunately, software defect datasets tend to have fewer defective modules than non-defective modules. Synthetic oversampling approaches address this concern by creating new minority defective modules to balance the class distribution before a model is trained. Notwithstanding the successes achieved by these approaches, they mostly result in over-generalization (high rates of false alarms) and generate near-duplicated data instances (less diverse data). In this study, we introduce MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with SMOTE, Borderline-SMOTE, ADASYN, Random Oversampling and the No sampling approach using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.

Keywords: Software defect prediction; class imbalance learning; synthetic sample generation; data sampling methods; classification problems

MAHAKIL: Diversity based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction (Extended Abstract)

40th ACM/IEEE International Conference on Software Engineering (ICSE)

Authors: Bennin, Kwabena E.; Keung, Jacky; Phannachitta, Passakorn; Monden, Akito; Mensah, Solomon

Affiliations: City Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China; Chiang Mai Univ, Coll Arts Media & Technol, Chiang Mai, Thailand; Okayama Univ, Grad Sch Nat Sci & Technol, Okayama, Japan

ISBN: (print) 9781450356381

This study presents MAHAKIL, a novel and efficient synthetic oversampling approach for software defect datasets that is based on the chromosomal theory of inheritance. Exploiting this theory, MAHAKIL interprets two distinct sub-classes as parents and generates a new instance that inherits different traits from each parent and contributes to the diversity within the data distribution. We extensively compare MAHAKIL with five other sampling approaches using 20 releases of defect datasets from the PROMISE repository and five prediction models. Our experiments indicate that MAHAKIL improves the prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on robust statistical tests.

Keywords: Software defect prediction; Class imbalance learning; Synthetic sample generation; data sampling methods; Classification problems
