The study of data complexity metrics is an emerging area in the field of data mining focused on analyzing several data set characteristics to extract knowledge from them. This information can be used to support the selection of the proper classification algorithm. This paper addresses the analysis of the relationship between data complexity measures and classifier behavior. Each metric is evaluated across its range of values by studying the classifiers' accuracy over those values. The results offer information about the usefulness of these measures, showing which of them allow us to analyze the nature of the input data set and help us decide which classification method is likely to be the most promising. (C) 2013 Elsevier Ltd. All rights reserved.
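The kind of per-measure analysis described above can be illustrated with a minimal Python sketch: it computes one classical complexity measure (the maximum Fisher's discriminant ratio, often called F1) for a dataset and pairs it with a classifier's cross-validated accuracy; repeating this over many datasets would let the measure's range be binned against accuracy. The synthetic data and the decision-tree classifier are illustrative assumptions, not the paper's own experimental setup.

```python
# Minimal sketch: one complexity measure (F1) paired with classifier accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def fisher_discriminant_ratio(X, y):
    """Maximum per-feature Fisher ratio (mu0 - mu1)^2 / (s0^2 + s1^2)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return float(np.max(num / den))

X, y = make_classification(n_samples=500, n_features=10, class_sep=1.0,
                           random_state=0)
f1 = fisher_discriminant_ratio(X, y)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print(f"F1 = {f1:.3f}, decision-tree accuracy = {acc:.3f}")
```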
Data preprocessing is an important step in designing a classification model. Normalization is one of the preprocessing techniques used to handle out-of-bounds attributes. This work develops 14 classification models using different learning algorithms for dynamic selection of the normalization technique. It extracts 12 data complexity measures for 48 datasets drawn from the KEEL dataset repository. Each of these datasets is normalized using the min-max and z-score normalization techniques. The G-mean index is estimated for these normalized datasets using a Gaussian Kernel Extreme Learning Machine (KELM) in order to determine the best-suited normalization technique. The data complexity measures, along with the best-suited normalization technique, are used as input for developing the aforementioned dynamic models. These models predict the most suitable normalization technique based on the estimated data complexity measures of the dataset. The results show that the models developed using Gaussian Kernel ELM (KELM) and Support Vector Machine (SVM) give promising results for most of the evaluated classification problems. (C) 2018 Elsevier Ltd. All rights reserved.
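A hedged sketch of the selection pipeline this abstract describes is given below. A kernel SVM stands in for the Gaussian KELM, the complexity-measure extraction is left as a placeholder, and the G-mean is computed as the geometric mean of per-class recalls; only the overall flow of labeling each dataset with its better normalization follows the abstract.

```python
# Sketch: label each dataset with the normalization that yields the higher G-mean,
# then use (complexity measures -> label) pairs to train a meta-model.
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

def g_mean(y_true, y_pred):
    # Geometric mean of per-class recalls.
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def best_normalization(X, y, seed=0):
    """Return 'min-max' or 'z-score', whichever yields the higher G-mean."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    scores = {}
    for name, scaler in [("min-max", MinMaxScaler()), ("z-score", StandardScaler())]:
        clf = SVC(kernel="rbf").fit(scaler.fit_transform(X_tr), y_tr)
        scores[name] = g_mean(y_te, clf.predict(scaler.transform(X_te)))
    return max(scores, key=scores.get)

# meta_X would hold the 12 complexity measures per dataset and meta_y the label
# returned by best_normalization(); the meta-model is then trained as, e.g.:
#   meta_model = SVC(kernel="rbf").fit(meta_X, meta_y)
```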
A causative attack, which manipulates training samples to mislead learning, is a common attack scenario. Current countermeasures reduce the influence of the attack on a classifier at the cost of generalization ability. Therefore, the collected samples should be analyzed carefully. Most current countermeasures against causative attacks focus on data sanitization and robust classifier design. To the best of our knowledge, there is no work that determines whether a given dataset is contaminated by a causative attack. In this study, we formulate causative attack detection as a 2-class classification problem in which a sample represents a dataset quantified by data complexity measures, which describe the geometrical characteristics of data. As the geometrical nature of a dataset is changed by a causative attack, we believe data complexity measures provide useful information for causative attack detection. Furthermore, a two-step secure classification model is proposed to demonstrate how the proposed causative attack detection improves the robustness of learning: either a robust or a traditional learning method is used according to whether a causative attack is present. Experimental results illustrate that data complexity measures clearly separate untainted datasets from attacked ones, and confirm the promising performance of the proposed methods in terms of accuracy and robustness. The results consistently suggest that data complexity measures provide crucial information for detecting causative attacks and are useful for increasing the robustness of learning.
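A minimal sketch of the two-step idea, under stated assumptions: featurize() is a hypothetical helper that maps an entire (binary, 0/1-labeled) dataset to a small vector of complexity-style statistics, and ordinary scikit-learn learners stand in for the robust and traditional methods used in the paper.

```python
# Sketch: detect whether a training set is attacked, then pick the learner accordingly.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(X, y):
    """Placeholder complexity-measure vector for a 0/1-labeled dataset (illustrative)."""
    X0, X1 = X[y == 0], X[y == 1]
    f1 = np.max((X0.mean(0) - X1.mean(0)) ** 2 /
                (X0.var(0) + X1.var(0) + 1e-12))          # Fisher ratio
    overlap = np.mean(np.abs(X0.mean(0) - X1.mean(0)))    # crude overlap proxy
    return np.array([f1, overlap])

def fit_detector(datasets, labels):
    """labels[i] = 1 if datasets[i] was produced under a causative attack."""
    feats = np.vstack([featurize(X, y) for X, y in datasets])
    return LogisticRegression().fit(feats, labels)

def two_step_train(detector, X, y, robust_learner, standard_learner):
    """Step 1: detect the attack; step 2: train the appropriate learner."""
    attacked = detector.predict(featurize(X, y).reshape(1, -1))[0] == 1
    learner = robust_learner if attacked else standard_learner
    return learner.fit(X, y)
```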
In the classification framework there are problems in which the number of examples per class is not equitably distributed, commonly known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, as learning algorithms are not usually adapted to such characteristics. A usual approach to dealing with imbalanced data sets is the use of a preprocessing step. In this paper we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, oversampling techniques and an evolutionary undersampling technique have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then, we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
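The interval-based characterization can be sketched roughly as follows: for each dataset, a complexity value is paired with the test AUC of a decision tree (a stand-in for C4.5) trained after simple random oversampling; collecting such pairs over many datasets lets one read off intervals of the measure where the preprocessing behaves well or badly. The helpers below are illustrative only, not the paper's evolutionary undersampling or its rule-extraction procedure.

```python
# Sketch: AUC of a tree classifier after simple random oversampling, to be paired
# with a complexity measure of the same dataset.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def random_oversample(X, y, seed=0):
    """Duplicate minority examples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def auc_after_oversampling(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    X_tr, y_tr = random_oversample(X_tr, y_tr, seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
```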
The empirical behavior of a classifier depends strongly on the characteristics of the underlying imbalanced dataset; therefore, an analysis of intrinsic data complexity would appear to be vital in order to choose classifiers suitable for particular problems. Data complexity metrics (CMs), a fairly recent proposal, identify dataset features that imply some difficulty for the classification task and identify relationships with classifier accuracy. In this paper, we introduce two CMs for imbalanced datasets, which help in explaining the factors responsible for the deterioration in classifier performance. These metrics are based on the weighted k-nearest neighbors approach. The experiments are performed in MATLAB using 48 simulated datasets and 22 real-world datasets for different choices of the neighborhood size k (3, 5, 7, 9, and 11). The results help to illustrate the usefulness of the proposed metrics.
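An illustrative weighted k-NN complexity score in the spirit of the metrics described above is sketched below; the paper's exact weighting scheme may differ. For each point it computes the inverse-distance-weighted fraction of its k nearest neighbors that belong to a different class, and averages the score per class.

```python
# Illustrative weighted-kNN complexity score (per-class average of per-instance scores).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_complexity(X, y, k=5):
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                  # drop the point itself
    w = 1.0 / (dist + 1e-12)                             # inverse-distance weights
    disagree = (y[idx] != y[:, None]).astype(float)      # neighbor from another class
    score = (w * disagree).sum(axis=1) / w.sum(axis=1)   # per-instance score in [0, 1]
    return {c: float(score[y == c].mean()) for c in np.unique(y)}
```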
We study the data complexity of model checking for logics with team semantics. We focus on dependence, inclusion, and independence logic formulas under both strict and lax team semantics. Our results delineate a clear tractability/intractability frontier in the data complexity of both quantifier-free and quantified formulas for each of the logics. For inclusion logic under the lax semantics, we reduce the model-checking problem to the satisfiability problem of so-called dual-Horn Boolean formulas. Via this reduction, we give an alternative proof for the known result that the data complexity of inclusion logic is in PTIME.
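For reference, the standard notion of a dual-Horn formula that the reduction targets (this recalls only the textbook definition, not the paper's reduction itself):

```latex
% A dual-Horn CNF formula: every clause contains at most one negative literal.
\[
  \varphi \;=\; \bigwedge_{i=1}^{m} C_i, \qquad
  C_i \;=\; \bigl(\lnot x_i \lor y_{i,1} \lor \dots \lor y_{i,k_i}\bigr)
  \quad\text{or}\quad
  C_i \;=\; \bigl(y_{i,1} \lor \dots \lor y_{i,k_i}\bigr).
\]
% Satisfiability of dual-Horn formulas is decidable in polynomial time
% (dually to HORN-SAT), which is what yields the PTIME upper bound on the
% data complexity of inclusion logic under lax semantics.
```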
Complexity measures aim to characterize the underlying complexity of supervised data. These measures tackle factors that hinder the performance of Machine Learning (ML) classifiers, such as overlap, density, and linearity. The state of the art has mainly focused on the dataset perspective of complexity, i.e., offering an estimation of the complexity of the whole dataset. Recently, the instance perspective has also been addressed. In this paper, the hostility measure, a complexity measure offering a multi-level (instance, class, and dataset) perspective of data complexity, is proposed. The proposal is built on the novel notion of hostility: the difficulty of correctly classifying a point, a class, or a whole dataset given their corresponding neighborhoods. The proposed measure is estimated at the instance level by applying the k-means algorithm in a recursive and hierarchical way, which allows analyzing how points from different classes are naturally grouped together across partitions. The instance information is aggregated to provide complexity knowledge at the class and dataset levels. The validity of the proposal is evaluated through a variety of experiments covering the three perspectives and the corresponding comparison with state-of-the-art measures. Throughout the experiments, the hostility measure has shown promising results and has proven to be competitive, stable, and robust.
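A rough sketch of the recursive clustering idea behind the hostility estimation (the published estimator differs in its details, e.g., in how partitions at successive levels are linked): at each clustering level a point counts as hostile when its cluster is dominated by another class, and the per-instance scores are averaged over levels and then aggregated per class and for the whole dataset.

```python
# Sketch: multi-level (instance / class / dataset) hostility-style scores via
# repeated k-means clusterings with decreasing k.
import numpy as np
from sklearn.cluster import KMeans

def hostility(X, y, ks=(32, 16, 8, 4, 2), seed=0):
    # Requires len(X) >= max(ks); class labels can be any hashable values.
    y = np.asarray(y)
    per_level = []
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        hostile = np.zeros(len(y))
        for c in range(k):
            members = labels == c
            classes, counts = np.unique(y[members], return_counts=True)
            dominant = classes[np.argmax(counts)]
            hostile[members] = (y[members] != dominant).astype(float)
        per_level.append(hostile)
    inst = np.mean(per_level, axis=0)                       # instance level
    per_class = {c: float(inst[y == c].mean()) for c in np.unique(y)}
    return inst, per_class, float(inst.mean())              # + class and dataset levels
```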
Nowadays, many new classification and clustering techniques have been proposed for microarray data analysis. However, multiclass microarray data classification is still regarded as a tough task because of the small sample size problem and the class imbalance problem. In this paper, we propose a novel error-correcting output code (ECOC) algorithm for the classification of multiclass microarray data based on data complexity (DC) theory. In this algorithm, an ECOC coding matrix is generated based on a hierarchical partition of the class space with the aim of minimizing data complexity (hence the name ECOC-MDC). As the partition process can be mapped to a binary tree, a compact ensemble with high discrimination power is produced. The performance of ECOC-MDC is compared with some state-of-the-art ECOC algorithms on six multiclass microarray data sets, and it is found that the proposed algorithm obtains better results in most cases. The correlation between DC measures and the dichotomizers' performance is checked, and the observations confirm that high complexity in the data usually leads to high error rates of the corresponding dichotomizers, but the error-correcting mechanism in the ECOC framework can effectively improve the algorithm's generalization ability. In short, ECOC-MDC can produce a compact ensemble system with high error-correction capability through the application of diverse DC measures. Our Matlab code is available at: ***/MLDMXM2017/ECOC-MDC. (C) 2019 Elsevier Ltd. All rights reserved.
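A compact, illustrative sketch of the coding step: the class set is split recursively into two groups, the split with the lowest complexity proxy is kept (here the inverse Fisher ratio of the induced binary problem stands in for the DC measures used by ECOC-MDC), and each split contributes one coding-matrix column (+1 / -1, with 0 for classes outside that node). The exhaustive enumeration of splits is only viable for the small class counts typical of microarray data.

```python
# Sketch: build ECOC columns from a complexity-guided binary-tree partition of the classes.
import numpy as np
from itertools import combinations

def binary_complexity(X, y, group):
    """Lower is easier: inverse of the max per-feature Fisher ratio for
    the binary problem 'group vs. the rest of the current classes'."""
    m = np.isin(y, list(group))
    X0, X1 = X[m], X[~m]
    f1 = np.max((X0.mean(0) - X1.mean(0)) ** 2 /
                (X0.var(0) + X1.var(0) + 1e-12))
    return 1.0 / (f1 + 1e-12)

def build_columns(X, y, classes, columns):
    """Recursively split `classes` into two groups, keeping the split with
    the lowest complexity proxy; each split becomes one coding column."""
    if len(classes) < 2:
        return
    mask = np.isin(y, classes)
    Xc, yc = X[mask], y[mask]
    subsets = (frozenset(g) for r in range(1, len(classes))
               for g in combinations(classes, r))
    best = min(subsets, key=lambda g: binary_complexity(Xc, yc, g))
    # +1 for classes in `best`, -1 for the rest; classes outside this node get 0.
    columns.append({c: (1 if c in best else -1) for c in classes})
    build_columns(X, y, sorted(best), columns)
    build_columns(X, y, sorted(set(classes) - best), columns)

# Usage (hypothetical): columns = []; build_columns(X, y, sorted(set(y)), columns)
```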
Using a number of measures for characterising the complexity of classification problems, we studied the comparative advantages of two methods for constructing decision forests: bootstrapping and random subspaces. We investigated a collection of 392 two-class problems from the UCI repository and observed strong correlations between the classifier accuracies and measures of the length of class boundaries, the thickness of the class manifolds, and the nonlinearity of decision boundaries. We found characteristics of both difficult and easy cases in which combination methods are no better than single classifiers. Also, we observed that the bootstrapping method is better when the training samples are sparse, and the subspace method is better when the classes are compact and the boundaries are smooth.
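A short sketch comparing the two forest-construction strategies on a synthetic two-class problem, using scikit-learn's BaggingClassifier; the dataset and parameter choices are illustrative, not the original study's 392 UCI-derived problems or its complexity measures.

```python
# Bootstrapping (bagging) vs. random subspaces with the same base tree learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           random_state=0)

# bootstrapping: each tree sees a bootstrap sample of the training rows
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=0)

# random subspaces: each tree sees all rows but a random subset of the features
subspace = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                             bootstrap=False, max_features=0.5, random_state=0)

for name, clf in [("bagging", bagging), ("random subspace", subspace)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```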
A method is proposed for obtaining lower bounds on the data complexity of statistical attacks on block or stream ciphers. The method is based on the Fano inequality and, unlike the available methods, does not use any asymptotic relations, approximate formulas, or heuristic assumptions about the cipher under consideration. For many known types of attacks, the obtained data complexity bounds have the classical form. For other types of attacks, these bounds allow us to introduce reasonable parameters that characterize the security of symmetric cryptosystems against these attacks.
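For context, the standard Fano inequality on which such bounds rest (the paper's specific data-complexity bounds are not reproduced here): for a secret key K ranging over a set \mathcal{K}, an estimate \hat{K} computed by the attack, and error probability P_e = Pr[\hat{K} \neq K],

```latex
\[
  H(K \mid \hat{K}) \;\le\; h(P_e) + P_e \log\bigl(\lvert \mathcal{K} \rvert - 1\bigr),
  \qquad h(p) = -p\log p - (1-p)\log(1-p).
\]
% Rearranged, this lower-bounds the information the attack must extract from
% the observed data to reach a given success probability, and hence the
% amount of data it needs.
```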