In the last decade, ensemble learning has become a prolific discipline in pattern recognition, based on the assumption that combining the outputs of several models obtains better results than the output of any individual model. On the basis that the same principle can be applied to feature selection, we describe two approaches: (i) homogeneous, i.e., using the same feature selection method with different training data and distributing the dataset over several nodes; and (ii) heterogeneous, i.e., using different feature selection methods with the same training data. Both approaches are based on combining rankings of features that contain all the ordered features. The results of the base selectors are combined using different combination methods, also called aggregators, and a practical subset is selected according to several different threshold values (traditional values based on fixed percentages, and more novel automatic methods based on data complexity measures). In testing using a Support Vector Machine as a classifier, ensemble results for seven datasets demonstrate performance that is at least comparable to, and often better than, the performance of individual feature selection methods. (C) 2016 Elsevier B.V. All rights reserved.
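The rank-combination step described in this abstract can be sketched with a Borda count, one common aggregator (the paper evaluates several; the base rankings below are hypothetical):

```python
import numpy as np

def borda_aggregate(rankings):
    """Combine several feature rankings into one by Borda count.

    rankings: list of 1-D arrays, each a permutation of feature indices
    ordered from most to least relevant. Returns the aggregated ranking.
    """
    n_features = len(rankings[0])
    scores = np.zeros(n_features)
    for rank in rankings:
        for position, feature in enumerate(rank):
            # A feature earns more points the earlier it appears.
            scores[feature] += n_features - position
    return np.argsort(-scores)  # highest total score first

# Three hypothetical base selectors ranking 5 features:
rankings = [np.array([0, 2, 1, 3, 4]),
            np.array([2, 0, 3, 1, 4]),
            np.array([0, 2, 3, 4, 1])]
aggregated = borda_aggregate(rankings)
```

A thresholding step (e.g., keeping the top fixed percentage of `aggregated`) would then yield the practical subset the abstract mentions.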
Classification problems with more than two classes can be handled in different ways. The most widely used approach transforms the original multi-class problem into a series of binary subproblems which are solved individually. In this approach, should the same base classifier be used on all binary subproblems? Or should these subproblems be tuned independently? Trying to answer this question, in this paper we propose a method to select a different base classifier for each subproblem, following the one-versus-one strategy, making use of data complexity measures. The experimental results on 17 real-world datasets corroborate the adequacy of the method.
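As an illustration of the idea, the sketch below picks a base classifier independently for each one-versus-one subproblem. For simplicity it selects by cross-validated accuracy rather than by the paper's data complexity measures, and the candidate classifiers and dataset are arbitrary choices:

```python
from itertools import combinations
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_ovo_per_pair(X, y, candidates):
    """One-versus-one where each pairwise subproblem gets its own
    independently selected base classifier (here: best CV accuracy)."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        Xp, yp = X[mask], y[mask]
        # Pick the candidate factory whose model scores best on this pair.
        best = max(candidates,
                   key=lambda make: cross_val_score(make(), Xp, yp, cv=3).mean())
        models[(a, b)] = best().fit(Xp, yp)
    return models

def predict_ovo(models, X):
    votes = np.array([m.predict(X) for m in models.values()])
    # Majority vote across all pairwise classifiers.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

X, y = load_iris(return_X_y=True)
candidates = [lambda: LogisticRegression(max_iter=500),
              lambda: DecisionTreeClassifier(max_depth=3)]
models = fit_ovo_per_pair(X, y, candidates)
acc = (predict_ovo(models, X) == y).mean()
```

Replacing the CV-accuracy criterion with a data complexity measure computed on each binary subset would bring the sketch closer to the method the abstract describes.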
ISBN (Print): 9783642408465
Real data are often corrupted by noise, which can originate from errors in data collection, storage and processing. The presence of noise hampers the induction of Machine Learning models from data, which can have their predictive or descriptive performance impaired, while also making the training time longer. Moreover, these models can become overly complex in order to accommodate such errors. Thus, the identification and reduction of noise in a data set may benefit the learning process. In this paper, we therefore investigate the use of data complexity measures to identify the presence of noise in a data set. This identification can support the decision on whether noise reduction techniques need to be applied.
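One complexity measure well suited to detecting label noise is the leave-one-out 1-NN error (akin to the N3 measure). The sketch below is a minimal illustration on synthetic two-class data; the function name and data are assumptions, not the paper's exact measure set:

```python
import numpy as np

def nn_disagreement(X, y):
    """Leave-one-out 1-NN error rate (akin to the N3 complexity measure).

    A high value suggests heavy class overlap or label noise, since many
    points sit closer to an instance of another class than to their own."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)      # exclude each point itself
    nearest = D.argmin(axis=1)
    return float((y[nearest] != y).mean())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.repeat([0, 1], 50)
clean = nn_disagreement(X, y)        # well-separated classes -> low value
y_noisy = y.copy()
flip = rng.choice(100, 20, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]    # inject 20% label noise
noisy = nn_disagreement(X, y_noisy)  # injected noise raises the measure
```

The jump in the measure after flipping labels is the kind of signal the paper proposes for deciding whether noise reduction is needed.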
ISBN (Print): 9783030474362; 9783030474355
Although over 90 oversampling approaches have been developed in the imbalanced learning domain, most empirical studies and application work are still based on the "classical" resampling techniques. In this paper, several experiments on 19 benchmark datasets are set up to study the efficiency of six powerful oversampling approaches, including both "classical" and newer ones. According to our experimental results, oversampling techniques that consider the minority class distribution (the newer ones) perform better in most cases, and RACOG gives the best performance among the six reviewed approaches. We further validate our conclusions on our real-world inspired vehicle datasets and also find that applying oversampling techniques can improve performance by around 10%. In addition, seven data complexity measures are considered with the initial purpose of investigating the relationship between data complexity measures and the choice of resampling technique. Although no obvious relationship could be extracted in our experiments, we find that the F1v value, a measure of class overlap that most researchers ignore, has a strong negative correlation with the potential AUC value (after resampling).
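F1v itself requires computing a directional Fisher discriminant; as a simpler illustration of the same family of overlap-based complexity measures, the sketch below implements an F2-style range-overlap measure for a binary problem (the function name and synthetic data are assumptions):

```python
import numpy as np

def overlap_volume(X, y):
    """F2-style class-overlap measure for a binary problem: the width of
    the region where both classes' feature ranges overlap, normalised by
    the full range and multiplied across features."""
    X0, X1 = X[y == 0], X[y == 1]
    lo = np.maximum(X0.min(axis=0), X1.min(axis=0))
    hi = np.minimum(X0.max(axis=0), X1.max(axis=0))
    width = np.clip(hi - lo, 0, None)          # per-feature overlap width
    full = X.max(axis=0) - X.min(axis=0)
    return float(np.prod(width / full))

rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X_sep = np.vstack([rng.uniform(0, 1, (30, 2)), rng.uniform(2, 3, (30, 2))])
X_mix = np.vstack([rng.uniform(0, 1, (30, 2)), rng.uniform(0.5, 1.5, (30, 2))])
f2_sep = overlap_volume(X_sep, y)   # disjoint ranges -> 0
f2_mix = overlap_volume(X_mix, y)   # overlapping ranges -> positive
```

A measure like this, computed before resampling, is the kind of quantity the paper correlates with the AUC attainable after oversampling.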
Hydrologic and geomorphic classifications have gained traction in response to the increasing need for basin-wide water resources management. Regardless of the selected classification scheme, an open scientific challenge is how to extend information from limited field sites to classify tens of thousands to millions of channel reaches across a basin. To address this spatial scaling challenge, this study leverages machine learning to predict reach-scale geomorphic channel types using publicly available geospatial data. A bottom-up machine learning approach selects the most accurate and stable model among approximately 20,000 combinations of 287 coarse geospatial predictors, preprocessing methods, and algorithms in a three-tiered framework to (i) define a tractable problem and reduce predictor noise, (ii) assess model performance in statistical learning, and (iii) assess model performance in prediction. This study also addresses key issues related to the design, interpretation, and diagnosis of machine learning models in hydrologic sciences. In an application to the Sacramento River basin (California, USA), the developed framework selects a Random Forest model to predict 10 channel types, previously determined from 290 field surveys, over 108,943 two-hundred-meter reaches. Performance in statistical learning is reasonable with a 61% median cross-validation accuracy, a sixfold increase over the 10% accuracy of the baseline random model, and the predictions coherently capture the large-scale geomorphic organization of the landscape. Interestingly, in the study area, the persistent roughness of the topography partially controls channel types, and the variation in the entropy-based predictive performance is explained by imperfect training information and scale mismatch between labels and predictors.
Background: Prediction of software vulnerabilities is a major concern in the field of software security. Many researchers have worked to construct various software vulnerability prediction (SVP) models. The emerging machine learning domain aids in building effective SVP models. The employment of data balancing/resampling techniques and optimal hyperparameters can upgrade their performance. Previous research studies have shown the impact of hyperparameter optimization (HPO) on machine learning algorithms and data balancing techniques. Objective: The current study aims to analyze the impact of dual hyperparameter optimization on metrics-based SVP models. Method: This paper proposes a methodology using the Python framework Optuna that optimizes the hyperparameters of both the machine learners and the data balancing techniques. For the experimentation, we compared six combinations of five machine learners and five resampling techniques considering both default and optimized hyperparameters. Results: The Wilcoxon signed-rank test with the Bonferroni correction was applied, and it was observed that dual HPO performs better than HPO on the learners alone or on the data balancers alone. Furthermore, the paper assesses the impact of data complexity measures and concludes that HPO does not improve the performance on datasets that exhibit high complexity. Conclusion: The experimental analysis reveals that dual HPO is 64% effective in enhancing the productivity of SVP models.
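The dual-optimization idea (tuning the learner and the data balancer jointly) can be sketched without Optuna, using plain random search as a stand-in; the oversampler, parameter ranges, and dataset below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def random_oversample(X, y, ratio, rng):
    """Duplicate random minority samples until minority/majority ~= ratio."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    n_needed = int(counts.max() * ratio) - counts.min()
    if n_needed <= 0:
        return X, y
    idx = rng.choice(np.where(y == minority)[0], n_needed, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

best = (-np.inf, None)
for _ in range(20):  # random search over BOTH kinds of hyperparameters
    params = {"ratio": rng.uniform(0.5, 1.0),        # balancer knob
              "max_depth": int(rng.integers(2, 10))}  # learner knob
    # NOTE: for brevity the oversampling happens before the CV split;
    # in practice it must happen inside each fold to avoid leakage.
    Xr, yr = random_oversample(X, y, params["ratio"], rng)
    clf = DecisionTreeClassifier(max_depth=params["max_depth"], random_state=0)
    score = cross_val_score(clf, Xr, yr, cv=3, scoring="roc_auc").mean()
    if score > best[0]:
        best = (score, params)
```

Optuna replaces the random draws with a `trial.suggest_*` sampler and pruning, but the joint search space over learner and balancer parameters is the same idea.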