ISBN: 9781581137378 (print)
Support vector machines (SVMs) have been promising methods for classification and regression analysis because of their solid mathematical foundations, which convey several salient properties that other methods hardly provide. However, despite these prominent properties, SVMs are not as favored for large-scale data mining as for pattern recognition or machine learning, because the training complexity of SVMs is highly dependent on the size of the data set. Many real-world data mining applications involve millions or billions of data records, where even multiple scans of the entire data are too expensive to perform. This paper presents a new method, Clustering-Based SVM (CB-SVM), which is specifically designed for handling very large data sets. CB-SVM applies a hierarchical micro-clustering algorithm that scans the entire data set only once to provide the SVM with high-quality samples that carry statistical summaries of the data, such that the summaries maximize the benefit of learning the SVM. CB-SVM tries to generate the best SVM boundary for very large data sets given a limited amount of resources. Our experiments on synthetic and real data sets show that CB-SVM is highly scalable for very large data sets while also generating high classification accuracy. Copyright 2003 ACM.
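As a rough illustration of the summarize-then-train idea (not the paper's CB-SVM algorithm itself, which uses a hierarchical micro-clustering tree and selective de-clustering near the boundary), the following Python sketch builds per-class micro-cluster summaries in a single clustering pass and trains a linear SVM on the size-weighted centroids; MiniBatchKMeans and the synthetic data are stand-ins chosen here for brevity.

```python
# Minimal sketch: train the SVM on compact cluster summaries (centroids
# weighted by cluster population) instead of on every record.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200_000, n_features=20, random_state=0)

summaries, labels, weights = [], [], []
for cls in np.unique(y):
    Xc = X[y == cls]
    # one clustering pass per class: centroids approximate the micro-cluster summaries
    mbk = MiniBatchKMeans(n_clusters=200, batch_size=10_000, random_state=0).fit(Xc)
    counts = np.bincount(mbk.labels_, minlength=200)
    summaries.append(mbk.cluster_centers_)
    labels.append(np.full(200, cls))
    weights.append(counts)

X_sum = np.vstack(summaries)
y_sum = np.concatenate(labels)
w_sum = np.concatenate(weights)

# the SVM now sees only 400 weighted summary points rather than 200,000 records
svm = LinearSVC(C=1.0).fit(X_sum, y_sum, sample_weight=w_sum)
print("boundary learned from", len(X_sum), "summaries")
```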
ISBN: 0819449989 (print)
Dimensionality reduction methods for visualization map the original high-dimensional data, typically into two dimensions. The mapping preserves the important information of the data and, in order to be useful, fulfils the needs of a human observer. We have proposed a self-organizing map (SOM)-based approach for visual surface inspection. The method provides the advantages of unsupervised learning and an intuitive user interface that allows one to easily set and tune the class boundaries based on observations made on the visualization, for example, to adapt to changing conditions or material. There are, however, some problems with a SOM. It does not preserve the true distances between data points, and it has a tendency to ignore rare samples in the training set in favor of a more accurate representation of common samples. In this paper, some alternative methods to a SOM are evaluated. These methods, PCA, MDS, LLE, ISOMAP, and GTM, are used to reduce dimensionality in order to visualize the data. Their principal differences are discussed and their performances quantitatively evaluated in a few special classification cases, such as wood inspection using centile features. For the test material experimented with, SOM and GTM outperform the others when classification performance is considered. For data mining kinds of applications, ISOMAP and LLE appear to be more promising methods.
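For reference, the sketch below maps a generic dataset into two dimensions with scikit-learn implementations of four of the compared methods (PCA, MDS, LLE, ISOMAP); SOM and GTM have no scikit-learn counterpart and are omitted, and the digits data merely stands in for the wood-inspection centile features.

```python
# Compare four 2-D embeddings side by side for visual inspection.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, LocallyLinearEmbedding, Isomap

X, y = load_digits(return_X_y=True)

methods = {
    "PCA": PCA(n_components=2),
    "MDS": MDS(n_components=2, random_state=0),
    "LLE": LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=0),
    "ISOMAP": Isomap(n_components=2, n_neighbors=10),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, (name, reducer) in zip(axes, methods.items()):
    Z = reducer.fit_transform(X)            # map to two dimensions
    ax.scatter(Z[:, 0], Z[:, 1], c=y, s=5, cmap="tab10")
    ax.set_title(name)
plt.tight_layout()
plt.show()
```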
We investigate the following data mining problem from computer-aided drug design: from a large collection of compounds, find those that bind to a target molecule in as few iterations of biochemical testing as possible. In each iteration a comparatively small batch of compounds is screened for binding activity toward this target. We employed the so-called "active learning paradigm" from machine learning for selecting the successive batches. Our main selection strategy is based on the maximum-margin hyperplane generated by Support Vector Machines. This hyperplane separates the current set of active compounds from the inactive ones and has the largest possible distance from any labeled compound. We perform a thorough comparative study of various other selection strategies on data sets provided by DuPont Pharmaceuticals and show that the strategies based on the maximum-margin hyperplane clearly outperform the simpler ones.
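A minimal sketch of the batch-selection step, assuming generic descriptor vectors and a linear-kernel SVM: unlabeled compounds are ranked by their signed distance to the maximum-margin hyperplane, either preferring the most uncertain ones (closest to the hyperplane) or the most confidently active ones. The data, batch size, and function name are illustrative, not the paper's exact strategies or the DuPont Pharmaceuticals setup.

```python
import numpy as np
from sklearn.svm import SVC

def select_next_batch(X_labeled, y_labeled, X_unlabeled, batch_size=50,
                      strategy="closest"):
    svm = SVC(kernel="linear", C=1.0).fit(X_labeled, y_labeled)
    dist = svm.decision_function(X_unlabeled)   # signed distance to the hyperplane
    if strategy == "closest":
        order = np.argsort(np.abs(dist))        # most uncertain compounds first
    else:                                       # exploit the current model
        order = np.argsort(-dist)               # most confidently active first
    return order[:batch_size]

# usage with random stand-in descriptors
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(100, 64)), rng.integers(0, 2, 100)
X_pool = rng.normal(size=(5000, 64))
batch_idx = select_next_batch(X_lab, y_lab, X_pool, batch_size=50, strategy="closest")
```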
The following topics are discussed: bioinformatics; software engineering with computational intelligence; data mining; evolutionary computing; planning and scheduling; knowledge management and sharing; machine learning; agents; vision and imaging; artificial intelligence in medicine; fuzzy logic; intelligent information retrieval; knowledge representation; satisfiability; computer vision and pattern recognition.
Recent times have seen an explosive growth in the availability of various kinds of data. This has resulted in an unprecedented opportunity to develop automated data-driven techniques for extracting useful knowledge. Data...
Support vector machines (SVM) are currently one of the classification systems most used in pattern recognition and data mining because of their accuracy and generalization capability. However, when dealing with very complex classification tasks where different errors bring different penalties, one should take into account the overall classification cost produced by the classifier rather than only its accuracy. It is thus necessary to provide methods for tuning the SVM to the costs of the particular application. Depending on the characteristics of the cost matrix, this can be done during or after the learning phase of the classifier. In this paper we introduce two optimization schemes based on these two possible approaches and compare their performance on various data sets and kernels. The first experimental results show that both the proposed schemes are suitable for tuning SVMs in cost-sensitive applications.
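To make the during-versus-after distinction concrete, here is a hedged sketch using standard stand-ins rather than the paper's optimization schemes: per-class error weights applied during training, and a cost-minimizing decision-threshold shift applied after training, both driven by an illustrative 2x2 cost matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# cost[i][j] = cost of predicting class j when the true class is i
cost = np.array([[0.0, 1.0],
                 [5.0, 0.0]])   # missing a positive is five times worse

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# (1) during learning: per-class error weights taken from the cost matrix
svm_w = SVC(kernel="rbf", class_weight={0: cost[0, 1], 1: cost[1, 0]}).fit(X_tr, y_tr)

# (2) after learning: train a plain SVM, then pick the score threshold
# that minimizes the total cost on held-out data
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
scores = svm.decision_function(X_val)

def total_cost(threshold):
    pred = (scores > threshold).astype(int)
    return sum(cost[t, p] for t, p in zip(y_val, pred))

best_t = min(np.linspace(scores.min(), scores.max(), 200), key=total_cost)
print("chosen threshold:", best_t, "validation cost:", total_cost(best_t))
```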
Greiner and Zhou (1988) presented ELR, a discriminative parameter-learning algorithm that maximizes conditional likelihood (CL) for a fixed Bayesian belief network (BN) structure, and demonstrated that it often produces classifiers that are more accurate than those produced by the generative approach (OFE), which finds maximum-likelihood parameters. This is especially true when learning parameters for incorrect structures, such as naive Bayes (NB). In searching for algorithms to learn better BN classifiers, this paper uses ELR to learn the parameters of more nearly correct BN structures, e.g., of a general Bayesian network (GBN) learned from a structure-learning algorithm by Greiner and Zhou (2002). While OFE typically produces more accurate classifiers with a GBN (vs. NB), we show that ELR does not, when the training data is not sufficient for the GBN structure learner to produce a good model. Our empirical studies also suggest that the better the BN structure is, the less advantage ELR has over OFE for classification purposes. ELR learning on NB (i.e., with little structural knowledge) still performs about the same as OFE on GBN in classification accuracy, over a large number of standard benchmark datasets.
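As a rough, simplified illustration of the generative-versus-discriminative contrast (not an implementation of ELR's conditional-likelihood gradient ascent or of any BN structure learning): for a fixed naive-Bayes structure, OFE-style maximum-likelihood fitting corresponds to an ordinary naive Bayes classifier, while conditional-likelihood maximization is closely related to logistic regression, so the sketch compares those two stand-ins on a benchmark dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

generative = GaussianNB()                                          # OFE-style: ML parameters
discriminative = make_pipeline(StandardScaler(),
                               LogisticRegression(max_iter=1000))  # CL-style stand-in

for name, clf in [("generative NB", generative), ("discriminative LR", discriminative)]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: 10-fold accuracy = {acc:.3f}")
```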
ISBN: 0780374886 (print)
A novel classification algorithm, OCEC, based on evolutionary computation for data mining is proposed. It is compared to GA-based and non-GA-based algorithms on 8 datasets from the UCI machine learning repository. Results show that OCEC achieves higher prediction accuracy, a smaller number of rules, and more stable performance.
The original k-means clustering algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being directly applied to categorical data clustering in many data mining applications. The k-modes algorithm [Z. Huang, Clustering large data sets with mixed numeric and categorical values, in: Proceedings of the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, World Scientific, Singapore, 1997, pp. 21-34] extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes, versus the k-means fashion of minimizing a numerically valued cost. However, as is the case with most data clustering algorithms, the algorithm requires a pre-setting or random selection of the initial points (modes) of the clusters. Differences in the initial points often lead to considerably different clustering results. In this paper we present an experimental study on applying Bradley and Fayyad's iterative initial-point refinement algorithm to k-modes clustering to improve the accuracy and repeatability of the clustering results [cf. P. Bradley, U. Fayyad, Refining initial points for k-means clustering, in: Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1998]. Experiments show that the k-modes clustering algorithm using refined initial points produces higher-precision results much more reliably than random selection without refinement, thus making the refinement process applicable to many data mining applications with categorical data. (C) 2002 Elsevier Science B.V. All rights reserved.
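The sketch below illustrates the refinement idea in simplified form, assuming integer-coded categorical data: a minimal k-modes (matching dissimilarity, frequency-based mode updates) is run on several small subsamples, and the candidate mode set that clusters the pooled candidates at the lowest cost is kept as the refined initial modes. This is an illustrative adaptation of Bradley and Fayyad's procedure, not the exact algorithm evaluated in the paper.

```python
import numpy as np

def kmodes(X, k, init_modes=None, n_iter=20, rng=None):
    """Minimal k-modes: matching dissimilarity + frequency-based mode updates."""
    rng = np.random.default_rng(rng)
    modes = X[rng.choice(len(X), k, replace=False)] if init_modes is None else init_modes.copy()
    for _ in range(n_iter):
        # assign each record to the mode it mismatches least
        dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update each mode attribute-wise to the most frequent category
        for j in range(k):
            members = X[labels == j]
            if len(members):
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    cost = (X != modes[labels]).sum()
    return modes, labels, cost

def refined_initial_modes(X, k, n_subsamples=10, subsample_size=500, seed=0):
    """Cluster several small subsamples, then keep the candidate mode set
    that clusters the pooled candidates at the lowest cost (simplified
    adaptation of the Bradley-Fayyad refinement to k-modes)."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(n_subsamples):
        idx = rng.choice(len(X), min(subsample_size, len(X)), replace=False)
        modes, _, _ = kmodes(X[idx], k, rng=rng)
        candidates.append(modes)
    pool = np.vstack(candidates)
    return min(candidates, key=lambda m: kmodes(pool, k, init_modes=m, rng=rng)[2])

# usage on small synthetic categorical data (integer-coded categories)
rng = np.random.default_rng(0)
X_cat = rng.integers(0, 4, size=(2000, 8))
init = refined_initial_modes(X_cat, k=3)
modes, labels, cost = kmodes(X_cat, k=3, init_modes=init)
print("final cost:", cost)
```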