ISBN: 0819442828 (print)
Association rule clustering is one of the most important topics in data mining. This paper proposes a generalization of the distance-based clustering algorithm for association rules to attributes of various types. First, for complex databases containing varied data, we present a numeralization procedure that handles rules over many kinds of attributes. Second, rather than computing directly with the values of the numeralized attributes, we propose an approach for normalizing these attributes of association rules. Finally, applying both the numeralization and normalization methods, we present a generalization of the clustering algorithm based on different definitions of the distances and diameters of rules. This algorithm can handle rules whose attributes have different types and different scales, which extends the clustering method of Ref. 1. Two simple examples at the end of the paper demonstrate the improved results of the clustering algorithm.
The increasing amount of road traffic in the 1990s drew much attention in Korea because of its influence on safety. Various data analyses have been carried out to examine the relationship between the severity of road traffic accidents and driving environmental factors, based on traffic accident records. Accurate results from such analyses can provide crucial information for road accident prevention policy. In this paper, we use various algorithms to improve the accuracy of individual classifiers for two categories of road traffic accident severity. The individual classifiers used are a neural network and a decision tree. Three main approaches are applied: classifier fusion based on the Dempster-Shafer algorithm, the Bayesian procedure, and a logistic model; data ensemble fusion based on arcing and bagging; and clustering based on the k-means algorithm. Our empirical results indicate that a clustering-based classification algorithm works best for road traffic accident classification in Korea. (C) 2002 Elsevier Science Ltd. All rights reserved.
In this paper, threshold selection is considered in the continuous image rather than in the digital image. We prove that, for each given object within a 2D image, the optimal threshold is determined by the mean of the gray values of the points lying on its continuous boundary. We therefore deduce the threshold from the gray values of the boundary rather than from the gray values of the given discrete sampling points (pixels or edge pixels). This scheme overcomes some disadvantages of threshold methods based on the histogram of edge pixels. In addition, the proposed method handles well images whose histograms have very unequal peaks and a broad valley. (C) 2003 Elsevier Science B.V. All rights reserved.
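The core idea of the abstract above, a threshold taken as the mean gray value along a continuous boundary, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of bilinear interpolation to read gray values at sub-pixel positions, and the sample boundary points are all assumptions made for illustration.

```python
import math

def bilinear(img, x, y):
    """Interpolate the gray value of img (a list of rows) at real-valued (x, y).

    Bilinear interpolation is an assumed way of approximating the
    continuous image between pixel centers.
    """
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1 = min(x0 + 1, len(img[0]) - 1)
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom

def boundary_threshold(img, boundary_points):
    """Return the mean interpolated gray value over sampled boundary points."""
    values = [bilinear(img, x, y) for x, y in boundary_points]
    return sum(values) / len(values)

# Hypothetical 4x4 image whose gray value grows linearly with x.
image = [[10 * x for x in range(4)] for _ in range(4)]
# Hypothetical sub-pixel samples of an object boundary lying at x = 1.5.
boundary = [(1.5, 0.0), (1.5, 1.0), (1.5, 2.5)]
threshold = boundary_threshold(image, boundary)
```

How the boundary samples are obtained is left open here; the sketch only shows the averaging step once such samples are available.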
We present a decision-theoretic formulation of product partition models (PPMs) that allows a formal treatment of different decision problems, such as estimation or hypothesis testing, and clustering methods simultaneously. A key observation in our construction is that PPMs can be formulated in the context of model selection. The underlying partition structure in these models is closely related to that arising in connection with Dirichlet processes. This allows a straightforward adaptation of some computational strategies, originally devised for nonparametric Bayesian problems, to our framework. The resulting algorithms are more flexible than competing alternatives used for problems involving PPMs. We propose an algorithm that yields Bayes estimates of the quantities of interest and of the groups of experimental units. We explore the application of our methods to the detection of outliers in normal and Student t regression models, with a clustering structure equivalent to that induced by a Dirichlet process prior. We also discuss the sensitivity of the results to different prior distributions for the partitions.
ISBN: 0780377192 (print)
High-resolution, high-dimensional satellite images cause problems for clustering methods because their clusters differ in size, shape and density. The most common clustering methods, e.g. K-means and ISODATA, do not work well for such datasets. In this work, density estimation techniques and density-based clustering methods are exploited. Density-based clustering is well known in data mining as a way to classify a data set based on its density parameters, where high-density areas are separated by lower-density areas, although it only works for simple data sets in which cluster densities are not very different. Our contribution is to propose the k nearest neighbor (knn) density-based rule for high-dimensional datasets and to develop a new knn density-based clustering method (KNNCLUST) for such complex datasets. KNNCLUST is stable, clear, and easy to understand and implement. The number of clusters is determined automatically. These properties are illustrated by the segmentation of a multispectral image of a floodplain in the Netherlands.
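As background for the knn density-based rule mentioned above, the following Python sketch shows the textbook k nearest neighbor density estimate, k / (n * V), where V is the volume of the ball reaching the k-th nearest neighbor. This is the standard estimator only, not KNNCLUST itself, whose details the abstract does not give; the function name and the sample points are hypothetical.

```python
import math

def knn_density(points, query, k):
    """Textbook kNN density estimate at `query`.

    Uses k / (n * V), where V is the volume of the d-dimensional ball whose
    radius is the distance from `query` to its k-th nearest point. If `query`
    is itself in `points`, it counts as its own nearest neighbor (distance 0).
    """
    n, d = len(points), len(query)
    dists = sorted(math.dist(p, query) for p in points)
    r_k = dists[k - 1]
    # Volume of a d-ball of radius r_k: pi^(d/2) / Gamma(d/2 + 1) * r_k^d
    volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r_k ** d
    return k / (n * volume)

# Hypothetical 2D data: one tight group of four points plus two isolated points.
data = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (9, 9)]
dense = knn_density(data, (0.05, 0.05), k=3)   # inside the tight group
sparse = knn_density(data, (7, 7), k=3)        # between the isolated points
```

The estimate is large inside the tight group and small between the isolated points, which is the contrast a density-based clustering rule relies on.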
In designing a vector quantizer using a training sequence (TS), the training algorithm tries to find an empirically optimal quantizer that minimizes the selected distortion criterion on the sequence. To evaluate the performance of the trained quantizer, we can use the empirically minimized distortion obtained when designing the quantizer. In this correspondence, several upper bounds on the empirically minimized distortion are proposed, with numerical results. The bounds hold pointwise, i.e., for each distribution with finite second moment in a class. From the pointwise bounds, it is possible to derive a worst-case bound, which is better than the current bounds for practical values of the training ratio beta, the ratio of the TS size to the codebook size. It is shown that the empirically minimized distortion underestimates the true minimum distortion by more than a factor of (1 - 1/m), where m is the sequence size. Furthermore, through an asymptotic analysis in the codebook size, a multiplication factor [1 - (1 - e^(-beta))/beta] ≈ (1 - 1/beta) for an asymptotic bound is shown. Several asymptotic bounds in terms of the vector dimension and the type of source are also introduced.
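The approximation between the asymptotic multiplication factor and (1 - 1/beta) quoted in the abstract above can be checked numerically. The short Python sketch below evaluates both expressions; the function names are made up for illustration, and the only fact used beyond the abstract is the algebraic identity that the two expressions differ by exactly e^(-beta)/beta.

```python
import math

def asymptotic_factor(beta):
    """Multiplication factor 1 - (1 - e^(-beta)) / beta from the abstract."""
    return 1.0 - (1.0 - math.exp(-beta)) / beta

def simple_factor(beta):
    """Approximation 1 - 1/beta quoted alongside it."""
    return 1.0 - 1.0 / beta

# Compare the two expressions over a few training ratios (values chosen arbitrarily).
for beta in (1.0, 2.0, 5.0, 10.0, 50.0):
    exact = asymptotic_factor(beta)
    approx = simple_factor(beta)
    # The gap equals e^(-beta) / beta, so it vanishes quickly as beta grows.
    print(f"beta={beta:5.1f}  factor={exact:.6f}  1-1/beta={approx:.6f}  gap={exact - approx:.2e}")
```

The gap shrinks exponentially in beta, so the two expressions are close for all but very small training ratios.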
This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a large variety of data mining methods to allow them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
ISBN: 0780379020 (print)
In this paper, we propose a cluster-based, brute-correcting method for learning grammatical rules, based on some conclusions from cognitive linguistics. First, instances of a grammatical category are mapped to graphic vectors, and a distance between two vectors is defined. The set of vectors and the defined distance are proved to form a distance space. Next, this space is mapped to Euclidean space and a simple clustering algorithm is applied to obtain clusters. Then, grammatical rules are learned to describe each cluster. Finally, a brute-correcting process helps to refine the rules. After describing the method, we informally compare the brute-correcting process with Eric Brill's transformation-based learning approach [E. Brill, 1995] and present an application to Chinese named entity recognition.
ISBN: 0819449563 (print)
Deoxyribonucleic acid (DNA) sequences are difficult to analyze for similarity because of their length and complexity. The challenge lies in using digital signal processing (DSP) to solve highly relevant problems in DNA sequences. Here, we transform a one-dimensional (1D) DNA sequence into a two-dimensional (2D) pattern by using the Peano scan algorithm. Four complex values are assigned to the characters "A", "C", "T", and "G", respectively. Then, the Fourier transform is employed to obtain the far-field amplitude distribution of the 2D pattern. In this way, a 1D DNA sequence becomes a 2D image pattern. Features are extracted from the 2D image pattern with the Principal Component Analysis (PCA) method, so that a DNA sequence database can be established. Unfortunately, comparing features may take a long time when the database is large, since the features are often multi-dimensional. This problem is solved by building an indexing structure that acts as a filter, discarding non-relevant items and selecting a subset of candidate DNA sequences. Clustering algorithms can organize the multi-dimensional feature data into the indexing structure for effective retrieval. Accordingly, the query sequence need only be compared against the candidates rather than against all sequences in the database. In effect, our algorithm provides a pre-processing method that accelerates the DNA sequence search process. Finally, experimental results demonstrate the efficiency of the proposed algorithm for DNA sequence similarity retrieval.
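The first stages of the pipeline described above (complex encoding of the bases, a 1D-to-2D layout, and a Fourier amplitude pattern) can be sketched in Python as follows. Several choices here are assumptions, since the abstract does not specify them: the four complex values are invented for illustration, a simple serpentine (row-by-row, alternating direction) layout stands in for the Peano scan, and a naive 2D discrete Fourier transform is used in place of any optimized routine.

```python
import cmath

# Assumed encoding: the abstract does not state which complex values are used.
BASE_VALUES = {"A": 1 + 0j, "C": -1 + 0j, "T": 0 + 1j, "G": 0 - 1j}

def to_grid(sequence, width):
    """Lay a non-empty sequence out as a 2D grid of complex values.

    A serpentine scan (every other row reversed) is used as a simple
    stand-in for the Peano scan named in the abstract.
    """
    values = [BASE_VALUES[b] for b in sequence]
    rows = [values[i:i + width] for i in range(0, len(values), width)]
    rows[-1] = rows[-1] + [0j] * (width - len(rows[-1]))  # zero-pad the last row
    return [row[::-1] if r % 2 else row for r, row in enumerate(rows)]

def dft2_magnitude(grid):
    """Naive 2D discrete Fourier transform; returns the amplitude at each frequency."""
    h, w = len(grid), len(grid[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            total = sum(
                grid[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            row.append(abs(total))
        out.append(row)
    return out

# Hypothetical short sequence laid out on a 4-wide grid.
pattern = dft2_magnitude(to_grid("ACGTTGCAACGT", 4))
```

The resulting amplitude grid is the kind of 2D pattern from which features could then be extracted, for example by PCA as in the abstract; that step is not shown.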
ISBN: 081944958X (print)
The evolution of artificial intelligence systems, driven by the growing complexity of their application domains and by scientific progress, has diversified the methods and algorithms for representing and using knowledge in these systems. For this reason it is often very difficult to design effective methods of knowledge discovery and processing for such systems. In this work the authors propose a method for the unified representation of a system's knowledge about objects of the external world through a rank transformation of their descriptions, which may be given in different feature spaces: deterministic, probabilistic, fuzzy, and others. A proof is presented that information about the rank configuration of the object states in the feature space is sufficient for decision making. It is shown that the geometrical and combinatorial model of the set of rank configurations represents them as a group of some incidence system, which allows information about them to be stored in a convolved (compact) form. A method for describing a rank configuration by the DRP-code (distance rank preserving code) is proposed. Its completeness, information capacity, noise immunity, and privacy are examined. It is shown that the capacity of a transmission channel for this form of information is greater than unity, because the code words carry information both about the object states and about the distance ranks between them. An effective data clustering algorithm for identifying object states, based on this code, is described. Representing knowledge by rank configurations makes it possible to unify and simplify decision-making algorithms by performing logic operations on the DRP-code words. Examples are presented of the proposed clustering techniques applied to a given sample set, together with the rank configurations of the resulting clusters and their DRP-codes.