ISBN:
(Print) 0769525210
Few of the existing unsupervised learning (clustering) methods consider clustering data points in a low-dimensional subspace in real time. In this paper, we present a grid-based clustering algorithm (GCA) with time complexity O(n). Unlike previous clustering algorithms, GCA pays particular attention to the running time of the algorithm. GCA achieves a low running time by (i) determining the number of clusters according to the point density of the grid cells and (ii) computing the distances between cluster centers and grid cells rather than individual data points. To make GCA more efficient, principal component analysis (PCA) is introduced to transform the data points from high to low dimension. Finally, we analyze the performance of GCA and show that it outperforms most current state-of-the-art methods in terms of efficiency. In particular, it outperforms the k-means algorithm by several orders of magnitude in running time.
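The cell-density idea in the abstract can be sketched in a few lines: bin points into grid cells, treat cells whose density exceeds a threshold as cluster seeds, and assign every point to the nearest seed-cell center. This is only a minimal 2-D illustration under invented parameters (`n_bins`, `min_density`), not the paper's GCA implementation, and the PCA preprocessing step is omitted:

```python
import numpy as np

def grid_cluster(points, n_bins=8, min_density=3):
    """Toy grid-based clustering: dense grid cells become cluster seeds,
    and points are labeled by the nearest seed-cell center."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Assign each point to a grid cell (2-D data assumed for this sketch).
    cells = np.floor((points - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    cells = np.clip(cells, 0, n_bins - 1)
    cell_id = cells[:, 0] * n_bins + cells[:, 1]
    # Cells whose point density reaches the threshold seed a cluster.
    keys, counts = np.unique(cell_id, return_counts=True)
    dense_keys = keys[counts >= min_density]
    centers = np.stack([points[cell_id == k].mean(axis=0) for k in dense_keys])
    # Label points by nearest dense-cell center (distances to centers,
    # not between all pairs of data points).
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), centers
```

The number of clusters falls out of the density threshold rather than being fixed a priori, which is the property the abstract emphasizes.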
Most partitional clustering algorithms require the number of desired clusters to be set a priori. Not only is this somewhat counter-intuitive, it is also difficult except in the simplest of situations. By contrast, hierarchical clustering may create partitions with varying numbers of clusters; the actual final partition depends on a threshold placed on the similarity measure used. Given a cluster quality metric, one can efficiently discover an appropriate threshold through a form of semi-supervised learning. This paper shows one such solution for complete-link hierarchical agglomerative clustering using the F-measure and a small subset of labeled examples. Empirical evaluation demonstrates promise.
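The threshold-search idea can be illustrated with SciPy's hierarchical clustering: build a complete-link dendrogram once, then score each candidate cut height by a pairwise F-measure computed only on the small labeled subset. This is a sketch of the general approach, not the paper's algorithm; the candidate grid and pairwise F-measure variant are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def pairwise_f_measure(labels_true, labels_pred):
    """Pairwise F-measure: a pair counts as positive if both items share a label."""
    n = len(labels_true)
    tp = fp = fn = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = labels_true[i] == labels_true[j]
            same_pred = labels_pred[i] == labels_pred[j]
            tp += int(same_true and same_pred)
            fp += int((not same_true) and same_pred)
            fn += int(same_true and not same_pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def best_threshold(X, labeled_idx, labeled_y, candidates):
    """Cut a complete-link dendrogram at the candidate height that maximizes
    the F-measure on the labeled subset."""
    Z = linkage(X, method="complete")
    return max(candidates, key=lambda t: pairwise_f_measure(
        labeled_y, fcluster(Z, t, criterion="distance")[labeled_idx]))
```

Because the dendrogram is built once and only re-cut per candidate, the labeled examples steer the choice of partition without constraining the clustering itself.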
This paper presents an application where machine learning techniques are used to mine data gathered from online poker in order to explain what signifies successful play. The study focuses on short-handed small-stakes Texas Hold'em, and the data set used contains 105 human players, each having played more than 500 hands. The techniques used are decision trees and G-REX, a rule extractor based on genetic programming. The overall result is that the induced rules are rather compact and have very high accuracy, thus providing good explanations of successful play. It is of course quite hard to assess the quality of the rules, i.e. whether they provide something novel and non-trivial. The main picture, however, is that the obtained rules are consistent with established poker theory. With this in mind, we believe that the suggested techniques will, in future studies where substantially more data is available, produce clear and accurate descriptions of what constitutes the difference between winning and losing in poker.
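The decision-tree side of this setup is easy to sketch. The feature names below (`vpip`, a looseness statistic, and an aggression factor) and the labeling rule are purely invented stand-ins for the paper's player statistics; the point is only that a shallow tree over such features yields compact, readable rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 200
# Hypothetical per-player statistics (illustrative names, synthetic data).
vpip = rng.uniform(0.1, 0.7, n)        # fraction of hands played voluntarily
aggression = rng.uniform(0.5, 4.0, n)  # bets+raises relative to calls
X = np.column_stack([vpip, aggression])
# Toy stand-in for "winning player": tight and aggressive play.
y = ((vpip < 0.3) & (aggression > 2.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["vpip", "aggression"]))
```

A depth-2 tree like this prints as three or four if-then rules, which is the kind of compact, high-accuracy description the abstract reports.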
ISBN:
(Print) 0769525210
In this study, we propose an improved semi-supervised support vector machine (SVM) based translation algorithm for brain-computer interface (BCI) systems, aiming at reducing the time-consuming training process and enhancing the adaptability of BCI systems. In this algorithm, we apply a semi-supervised SVM, which builds an SVM classifier from small amounts of labeled data and large amounts of unlabeled data, to translate the features extracted from electrical recordings of the brain into control signals. To reduce the time needed to train the semi-supervised SVM, we improve it by introducing a batch-mode incremental training method, which can also be used to enhance the adaptability of online BCI systems. The off-line data analysis results demonstrate the effectiveness of our algorithm.
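A common way to realize a semi-supervised SVM with batch-mode increments is self-training: fit on the labeled pool, pseudo-label the most confident unlabeled batch, absorb it, and refit. The sketch below shows that generic scheme, not the authors' specific algorithm; batch size, number of rounds, and the confidence criterion are assumptions:

```python
import numpy as np
from sklearn.svm import SVC

def batch_self_training_svm(X_lab, y_lab, X_unlab, batch_size=20, rounds=3):
    """Self-training sketch: each round, move the unlabeled batch the current
    SVM is most confident about into the labeled pool with pseudo-labels."""
    X_pool, y_pool = X_lab.copy(), y_lab.copy()
    remaining = X_unlab.copy()
    clf = SVC(kernel="linear")
    for _ in range(rounds):
        clf.fit(X_pool, y_pool)
        if len(remaining) == 0:
            break
        # Confidence = distance from the decision boundary.
        conf = np.abs(clf.decision_function(remaining))
        take = np.argsort(conf)[::-1][:batch_size]
        X_pool = np.vstack([X_pool, remaining[take]])
        y_pool = np.concatenate([y_pool, clf.predict(remaining[take])])
        remaining = np.delete(remaining, take, axis=0)
    return clf
```

Processing unlabeled data in batches rather than one point at a time is what keeps the retraining cost manageable for an online system.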
We introduce a novel application of support vector machines (SVMs) to the problem of identifying potential supernovae using photometric and geometric features computed from astronomical imagery. The challenges of this supervised learning application are significant: 1) noisy and corrupt imagery resulting in high levels of feature uncertainty, 2) features with heavy-tailed, peaked distributions, 3) extremely imbalanced and overlapping positive and negative data sets, and 4) the need to reach high positive classification rates, i.e. to find all potential supernovae, while reducing the burdensome workload of manually examining false positives. High accuracy is achieved via a sign-preserving, shifted log transform applied to features with peaked, heavy-tailed distributions. The imbalanced data problem is handled by oversampling positive examples, selectively sampling misclassified negative examples, and iteratively training multiple SVMs for improved supernova recognition on unseen test data. We present cross-validation results and demonstrate the impact on a large-scale supernova survey that currently uses the SVM decision value to rank-order 600,000 potential supernovae each night.
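A natural reading of "sign-preserving, shifted log transform" is sign(x)·log(1+|x|): it compresses heavy tails symmetrically while keeping the sign and leaving values near zero roughly linear. The one-liner below is that standard construction, offered as a plausible sketch rather than the paper's exact formula:

```python
import numpy as np

def signed_log(x):
    """Sign-preserving shifted log: maps 0 to 0, keeps the sign of x,
    and compresses large |x| logarithmically."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))
```

Applied feature-wise before training, such a transform pulls the peaked, heavy-tailed distributions the abstract describes toward something an SVM kernel handles more gracefully.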
The following topics are dealt with: parallel and distributed computing; software metrics and project management; communication systems and networks; data mining; data warehousing; information management systems; Internet; mobile computing; wireless computing; software engineering; information engineering; management information systems; image processing and pattern recognition; computer architecture and software testing; artificial intelligence; intelligent agent technology; and Web engineering.
This work presents a detailed comparison of three imputation techniques, Bayesian multiple imputation, regression imputation, and k-nearest-neighbor imputation, at various missingness levels. Starting with a complete real-world software measurement dataset called CCCS, missing values were injected into the dependent variable at four levels according to three different missingness mechanisms. The three imputation techniques are evaluated by comparing the imputed and actual values. Our analysis includes a three-way analysis of variance (ANOVA) model, which demonstrates that Bayesian multiple imputation obtains the best performance, followed closely by regression imputation.
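The simplest of the three techniques, k-nearest-neighbor imputation of a single dependent variable, can be sketched directly: for each incomplete case, average the dependent variable over the k complete cases nearest in feature space. This is a generic illustration of the evaluation setup (impute, then compare against the held-back actual values), not the study's protocol; k and the distance metric are assumptions:

```python
import numpy as np

def knn_impute(X, y, missing, k=3):
    """Impute missing entries of dependent variable y with the mean y of
    the k nearest complete cases in feature space (Euclidean distance)."""
    y_imp = y.astype(float).copy()
    complete = np.where(~missing)[0]
    for i in np.where(missing)[0]:
        d = np.linalg.norm(X[complete] - X[i], axis=1)
        neighbours = complete[np.argsort(d)[:k]]
        y_imp[i] = y[neighbours].mean()
    return y_imp
```

Injecting missingness into a complete dataset, as the study does, is what makes this comparison possible: the imputed values can be scored against the true ones that were deliberately hidden.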
In the present work, we deal with the significant problem of high class imbalance in binary and multi-class classification problems. We study two different linguistic applications. The former determines whether a syntactic construction (environment) that co-occurs with a verb in a natural text corpus constitutes a subcategorization frame of that verb. The latter, Named Entity Recognition (NER), concerns determining whether a noun belongs to a specific named entity class. In the subcategorization domain, each environment is encoded as a vector of heterogeneous attributes, and a very high imbalance between positive and negative examples is observed (an imbalance ratio of approximately 1:80). In the NER application, the imbalance between a named entity class and the negative class is even greater (1:120). In order to confront the plethora of negative instances, we suggest a search tactic during the training phase that employs Tomek links to remove unnecessary negative examples from the training set. Regarding the classification mechanism, we argue that Bayesian networks are well suited, and we propose a novel network structure that efficiently handles heterogeneous attributes without discretization and is more classification-oriented. Comparing the experimental results with those of other known machine learning algorithms, our methodology performs significantly better at detecting examples of the rare class.
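A Tomek link is a pair of mutually nearest neighbors with opposite labels; removing the majority-class member of each such pair prunes borderline or noisy negatives. The brute-force sketch below shows that definition on small data (O(n²) distances; the paper's search tactic is certainly more refined than this):

```python
import numpy as np

def remove_tomek_negatives(X, y, negative_label=0):
    """Drop negative examples that form Tomek links: pairs of mutually
    nearest neighbors whose labels differ."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    drop = np.zeros(n, dtype=bool)
    for i in range(n):
        j = nn[i]
        if nn[j] == i and y[i] != y[j]:
            # Tomek link found: remove only the majority-class member.
            if y[i] == negative_label:
                drop[i] = True
            if y[j] == negative_label:
                drop[j] = True
    return X[~drop], y[~drop]
```

With imbalance ratios of 1:80 and 1:120, pruning exactly these negatives thins the majority class where it crowds the decision boundary, without touching the rare positives.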
When performing predictive modeling, the key criterion is always accuracy. With this in mind, complex techniques like neural networks or ensembles are normally used, resulting in opaque models that are impossible to interpret. When models need to be comprehensible, accuracy is often sacrificed by using simpler techniques that directly produce transparent models; a tradeoff termed the accuracy vs. comprehensibility tradeoff. To reduce this tradeoff, the opaque model can be transformed into another, interpretable, model; an activity termed rule extraction. In this paper, it is argued that rule extraction algorithms should gain from using oracle data, i.e. test set instances together with the corresponding predictions from the opaque model. The experiments, using 17 publicly available data sets, clearly show that rules extracted using only oracle data were significantly more accurate than both rules extracted by the same algorithm using training data and standard decision tree algorithms. In addition, the same rules were also significantly more compact, thus providing better comprehensibility. The overall implication is that rules extracted in this fashion explain the predictions made on novel data better than rules extracted in the standard way, i.e. using training data only.
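The oracle-data idea can be sketched with off-the-shelf models: train an opaque model on training data, have it label the test instances, and fit a transparent surrogate to those predictions rather than to the training labels. The model choices below (random forest as the opaque model, shallow decision tree as the extracted rules) are illustrative stand-ins, not the paper's algorithm:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def extract_rules_with_oracle(X_train, y_train, X_test):
    """Fit an opaque model on training data, then fit a transparent
    surrogate to its predictions on the test instances (the oracle data)."""
    opaque = RandomForestClassifier(n_estimators=50, random_state=0)
    opaque.fit(X_train, y_train)
    oracle_labels = opaque.predict(X_test)  # predictions, not true labels
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X_test, oracle_labels)
    return opaque, surrogate
```

Because the surrogate is trained on the very instances it must explain, its rules describe the opaque model's behavior on novel data, which is the fidelity argument the abstract makes.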
ISBN:
(纸本)0769522785
We present CTC, a new approach to structural classification. It uses the predictive power of tree patterns correlating with the class values, combining state-of-the-art tree mining with sophisticated pruning techniques to find the k most discriminative patterns in a dataset. In contrast to existing methods, CTC uses no heuristics, and the only parameters to be chosen by the user are the maximum size of the rule set and a single, statistically well-founded cut-off value. The experiments show that CTC classifiers achieve good accuracies while the induced models are smaller than those of existing approaches, facilitating comprehensibility.