This article assesses the use of high-resolution Unmanned Aerial Vehicle (UAV) data from commercial field sensors for classifying small-scale agricultural patterns in four crop types (Winter Wheat, Spring Barley, Rapeseed, and Corn) acquired at ground sample distances (GSDs) of 0.027 m, 0.053 m and 0.064 m. Image harmonization challenges arising from spectral and textural variations across GSDs and sensors are addressed. The study investigates the data and sample complexity required to develop an effective machine/deep learning (ML/DL) model, using the Jeffries-Matusita distance to assess class separability, feature importance ranking for feature and layer selection, and semivariogram analysis to determine minimum sample patch sizes. The results demonstrate that spectral information can reliably differentiate sub-classes such as weed infestation, bare soil, disturbed canopy areas, and undisturbed canopy areas. However, there are limitations in detecting refined sub-classes of undisturbed canopy areas assigned to phenological groups, highlighting the need for class reduction and tailored feature and layer selection; a final set of sub-classes is proposed. The study also proposes a customized set of input layers for each crop type and identifies minimum patch sizes that improve the efficiency of detecting specific agricultural patterns. It is confirmed that, to exploit texture information for classification at smaller sample patch sizes (< 120 pixels), GSDs between 0.027 m and 0.064 m (for the RGB and CIR sensors of commercial drones, respectively) are suitable for capturing detailed patterns of Corn and Spring Barley, whereas the CIR sensor at GSDs of 0.053 m and 0.064 m performs better for Winter Wheat and Rapeseed.
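As a concrete illustration of one of the separability tools mentioned above, the following Python sketch computes the Jeffries-Matusita distance between two spectral classes under a Gaussian assumption (the standard formula JM = 2(1 - exp(-B)), with B the Bhattacharyya distance). The class names and reflectance values are invented for the example and are not taken from the study.

```python
import numpy as np

def jeffries_matusita(x1: np.ndarray, x2: np.ndarray) -> float:
    """JM distance between two classes of feature vectors (rows = samples),
    assuming class-conditional Gaussian distributions. Ranges in [0, 2];
    values close to 2 indicate nearly complete separability."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1 = np.cov(x1, rowvar=False)
    c2 = np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    d = m1 - m2
    # Bhattacharyya distance between the two Gaussians
    b = d @ np.linalg.inv(c) @ d / 8.0 + 0.5 * np.log(
        np.linalg.det(c) / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    return float(2.0 * (1.0 - np.exp(-b)))

# Toy usage: two sub-classes drawn from different spectral distributions.
rng = np.random.default_rng(0)
weeds = rng.normal([0.30, 0.55, 0.20], 0.03, size=(200, 3))
soil = rng.normal([0.45, 0.40, 0.35], 0.03, size=(200, 3))
print(f"JM(weeds, soil) = {jeffries_matusita(weeds, soil):.3f}")
```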
The security of Federated Learning (FL)/Distributed Machine Learning (DML) is gravely threatened by data poisoning attacks, which destroy the usability of the model by contaminating training samples; such attacks are therefore called causative availability indiscriminate attacks. Because existing data sanitization methods are hard to apply to real-time applications due to their tedious process and heavy computations, we propose a new supervised batch detection method for poison, which can quickly sanitize the training dataset before local model training. We design a training dataset generation method that helps to enhance accuracy and uses data complexity features to train a detection model, which is then used in an efficient batch hierarchical detection process. The model stockpiles knowledge about poison, which can be expanded by retraining to adapt to new attacks. Being neither attack-specific nor scenario-specific, our method is applicable to FL/DML as well as other online or offline scenarios.
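The article's pipeline is not reproduced here; the following Python sketch only illustrates the general pattern the abstract describes: derive a simple per-sample data-complexity feature, train a supervised poison detector on data whose clean/poisoned status is known, and use it to sanitize incoming batches before local training. The k-disagreeing-neighbours feature, the label-flipping attack, and the RandomForest detector are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def kdn_feature(X, y, k=5):
    """k-Disagreeing Neighbours: fraction of a sample's k nearest neighbours
    whose labels differ from its own -- one simple data-complexity feature."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return (y[idx[:, 1:]] != y[:, None]).mean(axis=1)

# 1) Build a labelled example: flip 30% of labels to simulate a poisoning attack.
rng = np.random.default_rng(1)
X = rng.normal(0, 1, (1000, 10))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)
poisoned = rng.random(1000) < 0.30
y_dirty = np.where(poisoned, 1 - y, y)

# 2) Train a detector on the per-sample complexity feature of the dirty set.
feat = kdn_feature(X, y_dirty).reshape(-1, 1)
detector = RandomForestClassifier(random_state=0).fit(feat, poisoned)

# 3) Before local training, sanitize an incoming batch with the trained detector.
def sanitize(X_batch, y_batch):
    f = kdn_feature(X_batch, y_batch).reshape(-1, 1)
    keep = ~detector.predict(f).astype(bool)
    return X_batch[keep], y_batch[keep]
```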
The logical foundations of the standard web ontology languages are provided by expressive Description Logics (DLs), such as SHIQ and SHOIQ. In the Semantic Web and other domains, ontologies are increasingly also seen as a mechanism to access and query data repositories. This novel context poses an original combination of challenges that has not been addressed before: (i) sufficient expressive power of the DL to capture common data modelling constructs; (ii) well-established and flexible query mechanisms such as those inspired by database technology; (iii) optimisation of inference techniques with respect to data size, which typically dominates the size of ontologies. This calls for investigating the data complexity of query answering in expressive DLs. While the complexity of DLs has been studied extensively, few tight characterisations of data complexity were available, and the problem was still open for most DLs of the SH family and for standard query languages like conjunctive queries and their extensions. We tackle this issue and prove a tight coNP upper bound for positive existential queries without transitive roles in SHOQ, SHIQ, and SHOI. We thus establish that, for a whole range of sublogics of SHOIQ that contain AL, answering such queries has coNP-complete data complexity. We obtain our result by a novel tableaux-based algorithm for checking query entailment, which uses a modified blocking condition in the style of CARIN. The algorithm is sound for SHOIQ and shown to be complete for all considered proper sublogics in the SH family.
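For readers unfamiliar with the terminology, the following LaTeX snippet sketches the kind of query covered by the result (a positive existential query without transitive roles) and the data-complexity view of query answering; the vocabulary is invented and not taken from the paper.

```latex
% A positive existential query (conjunction, disjunction, existentially
% quantified variables, no transitive roles) over an invented vocabulary:
q(x) \;=\; \exists y\,\bigl(\mathsf{Professor}(x) \wedge \mathsf{supervises}(x,y)
      \wedge (\mathsf{PhDStudent}(y) \vee \mathsf{Postdoc}(y))\bigr)

% Data complexity: for a fixed TBox \mathcal{T} and fixed query q, decide
%   \mathcal{T} \cup \mathcal{A} \models q(a)
% as a function of the size of the ABox \mathcal{A} (the data) only.
% The paper proves this problem coNP-complete for the listed sublogics of SHOIQ.
```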
Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and even less on understanding the difficulties of these tasks from the perspective of data complexity. In this article, we conduct an empirical study to estimate the hardness of over 33,000 instances, employing a set of measures to characterize the inherent difficulty of instances and the characteristics of defect datasets. Our findings indicate that: (1) instance hardness in both classes displays a right-skewed distribution, with the defective class exhibiting a more scattered distribution; (2) class overlap is the primary factor influencing instance hardness and can be characterized through feature, structural, and instance-level overlap; (3) no universal preprocessing technique is applicable to all datasets, and preprocessing may not consistently reduce data complexity; fortunately, dataset complexity measures can help identify suitable techniques for specific datasets; (4) integrating data complexity information into the learning process can enhance an algorithm's learning capacity. In summary, this empirical study highlights the crucial role of data complexity in defect prediction tasks and provides a novel perspective for advancing research in defect prediction techniques.
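As a minimal illustration of estimating instance hardness, the sketch below uses one common proxy, 1 minus the cross-validated probability assigned to an instance's true class, on a synthetic imbalanced dataset; this is not necessarily the exact measure set used in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Toy stand-in for a defect dataset: imbalanced, two classes, some label noise.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15],
                           flip_y=0.05, random_state=0)

# Instance hardness proxy: 1 minus the cross-validated probability
# that a classifier assigns to an instance's true class.
proba = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                          cv=5, method="predict_proba")
hardness = 1.0 - proba[np.arange(len(y)), y]

for c in (0, 1):
    h = hardness[y == c]
    print(f"class {c}: mean hardness = {h.mean():.2f}, std = {h.std():.2f}, "
          f"share with hardness > 0.5 = {(h > 0.5).mean():.2f}")
```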
How can we measure the complexity of a finite set of vectors embedded in a multidimensional space? This is a non-trivial question which can be approached in many different ways. Here we suggest a set of data complexity measures using universal approximators, principal cubic complexes. Principal cubic complexes generalize the notion of principal manifolds for datasets with non-trivial topologies. The type of a principal cubic complex is determined by its dimension and a grammar of elementary graph transformations; the simplest grammar produces principal trees. We introduce three natural types of data complexity: (1) geometric (deviation of the data's approximator from some "idealized" configuration, such as deviation from harmonicity); (2) structural (how many elements of a principal graph are needed to approximate the data); and (3) construction complexity (how many applications of elementary graph transformations are needed to construct the principal object starting from the simplest one). We compute these measures for several simulated and real-life data distributions and show them in "accuracy-complexity" plots, helping to optimize the accuracy/complexity ratio. We discuss various issues connected with measuring data complexity. Software for computing data complexity measures from principal cubic complexes is provided as well. (C) 2012 Elsevier Ltd. All rights reserved.
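The principal cubic complexes themselves are not reimplemented here; the toy Python sketch below only illustrates the accuracy-complexity trade-off idea using the simplest possible approximator, a set of k nodes fitted by k-means, where the number of nodes stands in for structural complexity and the approximation error for (lack of) accuracy.

```python
import numpy as np
from sklearn.cluster import KMeans

# A noisy spiral: a dataset with non-trivial shape.
rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, 500)
X = np.c_[t * np.cos(t), t * np.sin(t)] + rng.normal(0, 0.3, (500, 2))

# Structural complexity = number of nodes; accuracy = mean squared
# distance from the data to its nearest node.
for k in (2, 5, 10, 20, 40):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    mse = km.inertia_ / len(X)
    print(f"nodes = {k:2d}   approximation MSE = {mse:6.3f}")
```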
Regularized linear classifiers have been successfully applied in undersampled (i.e., small sample size/high dimensionality) biomedical classification problems. In addition, a design of data complexity measures has been proposed to assess the competence of a classifier in a particular context. Our work was motivated by the analysis of ill-posed regression problems by Elden and the interpretation of linear discriminant analysis as a mean square error classifier. Using Singular Value Decomposition (SVD) analysis, we define a discriminatory power spectrum and show that it provides a useful means of data complexity assessment for undersampled classification problems. In five real-life biomedical data sets of increasing difficulty, we demonstrate how the data complexity of a classification problem can be related to the performance of regularized linear classifiers. We show that the concentration of discriminatory power manifested in the discriminatory power spectrum is a deciding factor for the success of regularized linear classifiers in undersampled classification problems. As a practical outcome of our work, the proposed data complexity assessment may facilitate the choice of a classifier for a given undersampled problem. (c) 2006 Elsevier B.V. All rights reserved.
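The paper's exact definition of the discriminatory power spectrum is not given in the abstract; the sketch below shows one plausible SVD-based reading, assuming the spectrum is the squared projection of the normalized, centred class-indicator vector onto the left singular vectors of the centred data matrix. The toy undersampled dataset and all names are assumptions, not the authors' implementation.

```python
import numpy as np

def discriminatory_power_spectrum(X, y):
    """Assumed form: squared projection of the centred, unit-norm class-indicator
    vector onto each left singular vector of the centred data matrix."""
    Xc = X - X.mean(axis=0)
    yc = (y - y.mean()).astype(float)
    yc /= np.linalg.norm(yc)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return (U.T @ yc) ** 2          # one value per singular direction, sums to <= 1

# Undersampled toy problem: 40 samples, 2000 features.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (40, 2000))
y = np.repeat([0, 1], 20)
X[y == 1, :5] += 1.5                 # discriminative signal in a few features
spec = discriminatory_power_spectrum(X, y)
print("power captured by the first 3 singular directions:", spec[:3].round(3))
```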
Purpose: This work proposes the hypothesis that data oversampling may lead to dataset simplification according to selected data difficulty metrics, and that such simplification positively affects the quality of selected classifier learning methods. Methods: A set of computer experiments was performed on 47 benchmark datasets to verify the hypothesis. The experiments considered five oversampling methods, five classifiers, and 22 metrics for data difficulty assessment. They aim to establish: (a) whether there is a relationship between resampling and change in the difficulty of the training data, and (b) whether there is a relationship between changes in the values of training set difficulty metrics and classification quality. Results: Based on the obtained results, the research hypothesis was confirmed. It was indicated which measures correlate with which classifiers, and the experiments showed a relationship between the change in the assessed difficulty measures after oversampling and the classification quality of the selected models. Conclusion: The obtained results allow using the selected measures to predict whether a given oversampling method leads to favorable modifications of the learning set for a given type of classifier. The demonstrated relationship between difficulty measures and classification quality also allows using these measures as a learning criterion: for example, guided oversampling can treat the modification of the learning set as an optimization task in which no classification quality metrics need to be estimated during oversampling, only the difficulty of the training set. This may lead to computationally efficient methods.
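A minimal sketch of the kind of experiment described: oversample an imbalanced dataset with SMOTE (from imbalanced-learn, one of many possible oversampling methods) and compare a single difficulty measure, the maximum per-feature Fisher discriminant ratio, before and after. The study itself uses 22 metrics and 47 benchmark datasets; this toy only shows the mechanics.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

def fisher_ratio(X, y):
    """Maximum per-feature Fisher discriminant ratio (higher = easier problem)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    v0, v1 = X[y == 0].var(axis=0), X[y == 1].var(axis=0)
    return np.max((m0 - m1) ** 2 / (v0 + v1 + 1e-12))

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           flip_y=0.03, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)

print(f"Fisher ratio before oversampling: {fisher_ratio(X, y):.3f}")
print(f"Fisher ratio after  oversampling: {fisher_ratio(X_res, y_res):.3f}")
```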
ISBN (print): 9781728119854
Feature selection (FS) is a pre-processing step often mandatory in data analysis by Machine Learning techniques. Its objective is to reduce data dimensionality by identifying and retaining only the relevant features of a dataset. In this work we evaluate the use of complexity measures of classification problems in FS. These descriptors allow estimating the intrinsic difficulty of a classification problem based on characteristics of the dataset available for learning. We propose a combined univariate-multivariate FS technique which employs two complexity measures: Fisher's maximum discriminant ratio and the sum of intra-extra class distances. The results reveal that the complexity measures are indeed suitable for estimating feature importance in classification datasets. Large reductions in the number of features were obtained while preserving, in general, the predictive accuracy of two strong classification techniques: Support Vector Machines and Random Forests.
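A hedged sketch of a combined univariate-multivariate selection in the spirit described above: rank features by a per-feature Fisher discriminant ratio, then score candidate subsets with an intra/extra class nearest-neighbour distance ratio. The exact formulas and thresholds used by the authors may differ; everything below is illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def fisher_ratio_per_feature(X, y):
    """Univariate step: Fisher's discriminant ratio for every feature."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    v0, v1 = X[y == 0].var(axis=0), X[y == 1].var(axis=0)
    return (m0 - m1) ** 2 / (v0 + v1 + 1e-12)

def intra_extra_ratio(X, y):
    """Multivariate step: sum of nearest same-class distances over sum of
    nearest other-class distances (lower = better separated subset)."""
    d_intra, d_extra = [], []
    for c in np.unique(y):
        same, other = X[y == c], X[y != c]
        d_intra.append(NearestNeighbors(n_neighbors=2).fit(same)
                       .kneighbors(same)[0][:, 1])      # skip self-distance
        d_extra.append(NearestNeighbors(n_neighbors=1).fit(other)
                       .kneighbors(same)[0][:, 0])
    return np.concatenate(d_intra).sum() / np.concatenate(d_extra).sum()

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)
rank = np.argsort(fisher_ratio_per_feature(X, y))[::-1]    # best features first
for k in (5, 10, 50):
    subset = rank[:k]
    print(f"top {k:2d} features: intra/extra ratio = "
          f"{intra_extra_ratio(X[:, subset], y):.3f}")
```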
The analysis of data complexity is a proper framework to characterize the classification problem being tackled and to identify the domains of competence of classifiers. As a practical outcome of this framework, the proposed data complexity measures may facilitate the choice of a classifier for a given problem. The aim of this paper is to study the behaviour of a fuzzy rule based classification system and its relationship to data complexity. We use as a case study the fuzzy hybrid genetic based machine learning method presented in [H. Ishibuchi, T. Yamamoto, T. Nakashima, Hybridization of fuzzy GBML approaches for pattern classification problems, IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics 35 (2) (2005) 359-365]. We examine several metrics of data complexity over a wide range of data sets built from real data and try to extract behaviour patterns from the results. We obtain rules which describe both good and bad behaviours of the fuzzy rule based classification system. These rules use values of the data complexity metrics in their antecedents, so we can try to predict the behaviour of the method from the data set complexity metrics prior to its application. Therefore, we can establish the domains of competence of this fuzzy rule based classification system. (C) 2009 Elsevier B.V. All rights reserved.
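The rules extracted by the study have data complexity metric values in their antecedents; the tiny Python sketch below shows only the form such a rule takes. The metric names follow common complexity-measure notation and the thresholds are invented, not the ones derived in the paper.

```python
# Illustrative only: the article derives such rules empirically; the
# thresholds below are invented for the example.
def predict_behaviour(metrics: dict) -> str:
    """Rule with complexity-metric values in the antecedent, e.g.:
    IF F1 (max Fisher ratio) is high AND N1 (boundary fraction) is low
    THEN the fuzzy rule based classifier is expected to behave well."""
    if metrics["F1"] >= 0.6 and metrics["N1"] <= 0.25:
        return "good behaviour expected"
    if metrics["F1"] < 0.2 or metrics["N1"] > 0.5:
        return "bad behaviour expected"
    return "no prediction"

print(predict_behaviour({"F1": 0.8, "N1": 0.1}))   # -> good behaviour expected
```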
Cross-validation is a widely used model evaluation method in data mining applications. However, it usually takes a lot of effort to determine appropriate parameter values, such as the training data size and the number of experiment runs, to implement a validated evaluation. This study develops an efficient cross-validation method called Complexity-Based Efficient (CBE) cross-validation for binary classification problems. CBE cross-validation establishes a complexity index, called the CBE index, by exploring the geometric structure and noise of the data. The CBE index is used to calculate the optimal training data size and the number of experiment runs, so as to reduce model evaluation time when dealing with computationally expensive classification data sets. A simulated data set and three real data sets are employed to validate the performance of the proposed method, and the validation methods compared are repeated random sub-sampling validation and K-fold cross-validation. The results show that CBE cross-validation, repeated random sub-sampling validation and K-fold cross-validation have similar validation performance, except that the training time required for CBE cross-validation is indeed lower than that of the other two methods. (C) 2010 Elsevier B.V. All rights reserved.
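The CBE index formula is not given in the abstract, so the Python sketch below substitutes a simple stand-in (the fraction of samples whose nearest neighbour has a different label) and an invented mapping from that index to training fraction and number of runs, purely to illustrate the workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def complexity_index(X, y):
    """Stand-in for the CBE index: fraction of samples whose nearest
    neighbour carries a different label (geometric structure + noise)."""
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    return float((y[idx[:, 1]] != y).mean())

X, y = make_classification(n_samples=3000, n_features=15, flip_y=0.05,
                           random_state=0)
c = complexity_index(X, y)

# Map complexity to evaluation effort: harder data -> larger training fraction
# and more repeated runs (this mapping is illustrative, not the paper's).
train_fraction = min(0.5 + c, 0.9)
n_runs = int(np.ceil(10 + 40 * c))
print(f"complexity index = {c:.2f}  ->  train fraction = {train_fraction:.2f}, "
      f"runs = {n_runs}")
```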