The study of data complexity metrics is an emerging area in the field of data mining focused on analyzing several data set characteristics to extract knowledge from them. This information can be used to support the selection of the proper classification algorithm. This paper addresses the analysis of the relationship between data complexity measures and classifier behavior. Each metric is evaluated across its range of values by studying the classifiers' accuracy over those values. The results offer information about the usefulness of these measures, showing which of them allow us to analyze the nature of the input data set and help us decide which classification method is likely to be the most promising. (C) 2013 Elsevier Ltd. All rights reserved.
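The kind of per-measure analysis described above can be illustrated with a minimal Python sketch: it computes one classical complexity measure (the maximum Fisher's discriminant ratio, often called F1) for a dataset and pairs it with a classifier's cross-validated accuracy; repeating this over many datasets would let the measure's range be binned against accuracy. The synthetic data and the decision-tree classifier are illustrative assumptions, not the paper's own experimental setup.

```python
# Minimal sketch: one complexity measure (F1) paired with classifier accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def fisher_discriminant_ratio(X, y):
    """Maximum per-feature Fisher ratio (mu0 - mu1)^2 / (s0^2 + s1^2)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return float(np.max(num / den))

X, y = make_classification(n_samples=500, n_features=10, class_sep=1.0,
                           random_state=0)
f1 = fisher_discriminant_ratio(X, y)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print(f"F1 = {f1:.3f}, decision-tree accuracy = {acc:.3f}")
```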
Data preprocessing is an important step in designing a classification model. Normalization is one of the preprocessing techniques used to handle out-of-bounds attributes. This work develops 14 classification models using different learning algorithms for dynamic selection of the normalization technique. It extracts 12 data complexity measures for 48 datasets drawn from the KEEL dataset repository. Each of these datasets is normalized using the min-max and z-score normalization techniques. The G-mean index is estimated for these normalized datasets using a Gaussian Kernel Extreme Learning Machine (KELM) in order to determine the best-suited normalization technique. The data complexity measures, along with the best-suited normalization technique, are used as input for developing the aforementioned dynamic models. These models predict the most suitable normalization technique based on the estimated data complexity measures of the dataset. The results show that the models developed using Gaussian Kernel ELM (KELM) and Support Vector Machine (SVM) give promising results for most of the evaluated classification problems. (C) 2018 Elsevier Ltd. All rights reserved.
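A hedged sketch of the selection pipeline this abstract describes is given below. A kernel SVM stands in for the Gaussian KELM, the complexity-measure extraction is left as a placeholder, and the G-mean is computed as the geometric mean of per-class recalls; only the overall flow of labeling each dataset with its better normalization follows the abstract.

```python
# Sketch: label each dataset with the normalization that yields the higher G-mean,
# then use (complexity measures -> label) pairs to train a meta-model.
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

def g_mean(y_true, y_pred):
    # Geometric mean of per-class recalls.
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def best_normalization(X, y, seed=0):
    """Return 'min-max' or 'z-score', whichever yields the higher G-mean."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    scores = {}
    for name, scaler in [("min-max", MinMaxScaler()), ("z-score", StandardScaler())]:
        clf = SVC(kernel="rbf").fit(scaler.fit_transform(X_tr), y_tr)
        scores[name] = g_mean(y_te, clf.predict(scaler.transform(X_te)))
    return max(scores, key=scores.get)

# meta_X would hold the 12 complexity measures per dataset and meta_y the label
# returned by best_normalization(); the meta-model is then trained as, e.g.:
#   meta_model = SVC(kernel="rbf").fit(meta_X, meta_y)
```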
A causative attack, which manipulates training samples to mislead learning, is a common attack scenario. Current countermeasures reduce the influence of the attack on a classifier at the cost of generalization ability. Therefore, the collected samples should be analyzed carefully. Most current countermeasures against causative attacks focus on data sanitization and robust classifier design. To the best of our knowledge, there is no work that determines whether a given dataset is contaminated by a causative attack. In this study, we formulate causative attack detection as a 2-class classification problem in which a sample represents a dataset quantified by data complexity measures, which describe the geometrical characteristics of data. As the geometrical nature of a dataset is changed by a causative attack, we believe data complexity measures provide useful information for causative attack detection. Furthermore, a two-step secure classification model is proposed to demonstrate how the proposed causative attack detection improves the robustness of learning: either a robust or a traditional learning method is used according to whether a causative attack is present. Experimental results illustrate that data complexity measures clearly separate untainted datasets from attacked ones, and confirm the promising performance of the proposed methods in terms of accuracy and robustness. The results consistently suggest that data complexity measures provide crucial information for detecting causative attacks and are useful for increasing the robustness of learning.
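A minimal sketch of the two-step idea, under stated assumptions: featurize() is a hypothetical helper that maps an entire (binary, 0/1-labeled) dataset to a small vector of complexity-style statistics, and ordinary scikit-learn learners stand in for the robust and traditional methods used in the paper.

```python
# Sketch: detect whether a training set is attacked, then pick the learner accordingly.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(X, y):
    """Placeholder complexity-measure vector for a 0/1-labeled dataset (illustrative)."""
    X0, X1 = X[y == 0], X[y == 1]
    f1 = np.max((X0.mean(0) - X1.mean(0)) ** 2 /
                (X0.var(0) + X1.var(0) + 1e-12))          # Fisher ratio
    overlap = np.mean(np.abs(X0.mean(0) - X1.mean(0)))    # crude overlap proxy
    return np.array([f1, overlap])

def fit_detector(datasets, labels):
    """labels[i] = 1 if datasets[i] was produced under a causative attack."""
    feats = np.vstack([featurize(X, y) for X, y in datasets])
    return LogisticRegression().fit(feats, labels)

def two_step_train(detector, X, y, robust_learner, standard_learner):
    """Step 1: detect the attack; step 2: train the appropriate learner."""
    attacked = detector.predict(featurize(X, y).reshape(1, -1))[0] == 1
    learner = robust_learner if attacked else standard_learner
    return learner.fit(X, y)
```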
In the classification framework there are problems in which the number of examples per class is not equitably distributed, commonly known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, as learning algorithms are not usually adapted to such characteristics. A usual approach to dealing with imbalanced data sets is the use of a preprocessing step. In this paper we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, oversampling techniques and an evolutionary undersampling technique have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then, we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
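The interval-based characterization can be sketched roughly as follows: for each dataset, a complexity value is paired with the test AUC of a decision tree (a stand-in for C4.5) trained after simple random oversampling; collecting such pairs over many datasets lets one read off intervals of the measure where the preprocessing behaves well or badly. The helpers below are illustrative only, not the paper's evolutionary undersampling or its rule-extraction procedure.

```python
# Sketch: AUC of a tree classifier after simple random oversampling, to be paired
# with a complexity measure of the same dataset.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def random_oversample(X, y, seed=0):
    """Duplicate minority examples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def auc_after_oversampling(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=seed)
    X_tr, y_tr = random_oversample(X_tr, y_tr, seed)
    tree = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
```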
The empirical behavior of a classifier depends strongly on the characteristics of the underlying imbalanced dataset; therefore, an analysis of intrinsic data complexity would appear to be vital in order to choose classifiers suitable for particular problems. Data complexity metrics (CMs), a fairly recent proposal, identify dataset features that imply some difficulty for the classification task and identify relationships with classifier accuracy. In this paper, we introduce two CMs for imbalanced datasets, which help in explaining the factors responsible for the deterioration in classifier performance. These metrics are based on the weighted k-nearest neighbors approach. The experiments are performed in MATLAB using 48 simulated datasets and 22 real-world datasets for different choices of the neighborhood size k (3, 5, 7, 9, and 11). The results help to illustrate the usefulness of the proposed metrics.
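An illustrative weighted k-NN complexity score in the spirit of the metrics described above is sketched below; the paper's exact weighting scheme may differ. For each point it computes the inverse-distance-weighted fraction of its k nearest neighbors that belong to a different class, and averages the score per class.

```python
# Illustrative weighted-kNN complexity score (per-class average of per-instance scores).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def weighted_knn_complexity(X, y, k=5):
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]                  # drop the point itself
    w = 1.0 / (dist + 1e-12)                             # inverse-distance weights
    disagree = (y[idx] != y[:, None]).astype(float)      # neighbor from another class
    score = (w * disagree).sum(axis=1) / w.sum(axis=1)   # per-instance score in [0, 1]
    return {c: float(score[y == c].mean()) for c in np.unique(y)}
```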
We study the data complexity of model checking for logics with team semantics. We focus on dependence, inclusion, and independence logic formulas under both strict and lax team semantics. Our results delineate a clear tractability/intractability frontier in the data complexity of both quantifier-free and quantified formulas for each of the logics. For inclusion logic under the lax semantics, we reduce the model-checking problem to the satisfiability problem of so-called dual-Horn Boolean formulas. Via this reduction, we give an alternative proof for the known result that the data complexity of inclusion logic is in PTIME.
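For reference, the standard notion of a dual-Horn formula that the reduction targets (this recalls only the textbook definition, not the paper's reduction itself):

```latex
% A dual-Horn CNF formula: every clause contains at most one negative literal.
\[
  \varphi \;=\; \bigwedge_{i=1}^{m} C_i, \qquad
  C_i \;=\; \bigl(\lnot x_i \lor y_{i,1} \lor \dots \lor y_{i,k_i}\bigr)
  \quad\text{or}\quad
  C_i \;=\; \bigl(y_{i,1} \lor \dots \lor y_{i,k_i}\bigr).
\]
% Satisfiability of dual-Horn formulas is decidable in polynomial time
% (dually to HORN-SAT), which is what yields the PTIME upper bound on the
% data complexity of inclusion logic under lax semantics.
```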
Complexity measures aim to characterize the underlying complexity of supervised data. These measures tackle factors that hinder the performance of Machine Learning (ML) classifiers, such as overlap, density, and linearity. The state of the art has mainly focused on the dataset perspective of complexity, i.e., offering an estimation of the complexity of the whole dataset. Recently, the instance perspective has also been addressed. In this paper, the hostility measure, a complexity measure offering a multi-level (instance, class, and dataset) perspective of data complexity, is proposed. The proposal is built on the novel notion of hostility: the difficulty of correctly classifying a point, a class, or a whole dataset given their corresponding neighborhoods. The proposed measure is estimated at the instance level by applying the k-means algorithm in a recursive and hierarchical way, which allows analyzing how points from different classes are naturally grouped together across partitions. The instance information is aggregated to provide complexity knowledge at the class and dataset levels. The validity of the proposal is evaluated through a variety of experiments covering the three perspectives and the corresponding comparison with state-of-the-art measures. Throughout the experiments, the hostility measure has shown promising results and has proven to be competitive, stable, and robust.
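A rough sketch of the recursive clustering idea behind the hostility estimation (the published estimator differs in its details, e.g., in how partitions at successive levels are linked): at each clustering level a point counts as hostile when its cluster is dominated by another class, and the per-instance scores are averaged over levels and then aggregated per class and for the whole dataset.

```python
# Sketch: multi-level (instance / class / dataset) hostility-style scores via
# repeated k-means clusterings with decreasing k.
import numpy as np
from sklearn.cluster import KMeans

def hostility(X, y, ks=(32, 16, 8, 4, 2), seed=0):
    # Requires len(X) >= max(ks); class labels can be any hashable values.
    y = np.asarray(y)
    per_level = []
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        hostile = np.zeros(len(y))
        for c in range(k):
            members = labels == c
            classes, counts = np.unique(y[members], return_counts=True)
            dominant = classes[np.argmax(counts)]
            hostile[members] = (y[members] != dominant).astype(float)
        per_level.append(hostile)
    inst = np.mean(per_level, axis=0)                       # instance level
    per_class = {c: float(inst[y == c].mean()) for c in np.unique(y)}
    return inst, per_class, float(inst.mean())              # + class and dataset levels
```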
Nowadays, many new classification and clustering techniques have been proposed for microarray data analysis. However, multiclass microarray data classification is still regarded as a tough task because of the small sample size problem and the class imbalance problem. In this paper, we propose a novel error-correcting output code (ECOC) algorithm for the classification of multiclass microarray data based on data complexity (DC) theory. In this algorithm, an ECOC coding matrix is generated based on a hierarchical partition of the class space with the aim of minimizing data complexity (hence the name ECOC-MDC). As the partition process can be mapped to a binary tree, a compact ensemble with high discrimination power is produced. The performance of ECOC-MDC is compared with some state-of-the-art ECOC algorithms on six multiclass microarray data sets, and it is found that the proposed algorithm obtains better results in most cases. The correlation between DC measures and the dichotomizers' performance is checked, and the observations confirm that high complexity in the data usually leads to high error rates of the corresponding dichotomizers, but the error-correcting mechanism in the ECOC framework can effectively improve the algorithm's generalization ability. In short, ECOC-MDC can produce a compact ensemble system with high error-correction capability through the application of diverse DC measures. Our Matlab code is available at: ***/MLDMXM2017/ECOC-MDC. (C) 2019 Elsevier Ltd. All rights reserved.
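A compact, illustrative sketch of the coding step: the class set is split recursively into two groups, the split with the lowest complexity proxy is kept (here the inverse Fisher ratio of the induced binary problem stands in for the DC measures used by ECOC-MDC), and each split contributes one coding-matrix column (+1 / -1, with 0 for classes outside that node). The exhaustive enumeration of splits is only viable for the small class counts typical of microarray data.

```python
# Sketch: build ECOC columns from a complexity-guided binary-tree partition of the classes.
import numpy as np
from itertools import combinations

def binary_complexity(X, y, group):
    """Lower is easier: inverse of the max per-feature Fisher ratio for
    the binary problem 'group vs. the rest of the current classes'."""
    m = np.isin(y, list(group))
    X0, X1 = X[m], X[~m]
    f1 = np.max((X0.mean(0) - X1.mean(0)) ** 2 /
                (X0.var(0) + X1.var(0) + 1e-12))
    return 1.0 / (f1 + 1e-12)

def build_columns(X, y, classes, columns):
    """Recursively split `classes` into two groups, keeping the split with
    the lowest complexity proxy; each split becomes one coding column."""
    if len(classes) < 2:
        return
    mask = np.isin(y, classes)
    Xc, yc = X[mask], y[mask]
    subsets = (frozenset(g) for r in range(1, len(classes))
               for g in combinations(classes, r))
    best = min(subsets, key=lambda g: binary_complexity(Xc, yc, g))
    # +1 for classes in `best`, -1 for the rest; classes outside this node get 0.
    columns.append({c: (1 if c in best else -1) for c in classes})
    build_columns(X, y, sorted(best), columns)
    build_columns(X, y, sorted(set(classes) - best), columns)

# Usage (hypothetical): columns = []; build_columns(X, y, sorted(set(y)), columns)
```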
Using a number of measures for characterising the complexity of classification problems, we studied the comparative advantages of two methods for constructing decision forests: bootstrapping and random subspaces. We investigated a collection of 392 two-class problems from the UCI repository and observed strong correlations between the classifier accuracies and measures of the length of class boundaries, the thickness of the class manifolds, and the nonlinearity of decision boundaries. We found characteristics of both difficult and easy cases in which combination methods are no better than single classifiers. Also, we observed that the bootstrapping method is better when the training samples are sparse, and the subspace method is better when the classes are compact and the boundaries are smooth.
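A short sketch comparing the two forest-construction strategies on a synthetic two-class problem, using scikit-learn's BaggingClassifier; the dataset and parameter choices are illustrative, not the original study's 392 UCI-derived problems or its complexity measures.

```python
# Bootstrapping (bagging) vs. random subspaces with the same base tree learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           random_state=0)

# bootstrapping: each tree sees a bootstrap sample of the training rows
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=0)

# random subspaces: each tree sees all rows but a random subset of the features
subspace = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                             bootstrap=False, max_features=0.5, random_state=0)

for name, clf in [("bagging", bagging), ("random subspace", subspace)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```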
A method is proposed for obtaining lower bounds on the data complexity of statistical attacks on block or stream ciphers. The method is based on the Fano inequality and, unlike the available methods, does not use any asymptotic relations, approximate formulas, or heuristic assumptions about the cipher under consideration. For many known types of attacks, the obtained data complexity bounds have the classical form. For other types of attacks, these bounds allow us to introduce reasonable parameters that characterize the security of symmetric cryptosystems against these attacks.
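For context, the standard Fano inequality on which such bounds rest (the paper's specific data-complexity bounds are not reproduced here): for a secret key K ranging over a set \mathcal{K}, an estimate \hat{K} computed by the attack, and error probability P_e = Pr[\hat{K} \neq K],

```latex
\[
  H(K \mid \hat{K}) \;\le\; h(P_e) + P_e \log\bigl(\lvert \mathcal{K} \rvert - 1\bigr),
  \qquad h(p) = -p\log p - (1-p)\log(1-p).
\]
% Rearranged, this lower-bounds the information the attack must extract from
% the observed data to reach a given success probability, and hence the
% amount of data it needs.
```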