ISBN: 0769506569 (print)
A crucial step in understanding a large legacy software system is to decompose it into meaningful subsystems, which can be separately studied. This decomposition can be done either manually or automatically by a software clustering algorithm (SCA). Similar versions of a software system can be expected to have similar decompositions. We say an SCA is stable if small changes in its input (the software system) produce small changes in its output (the decomposition). This paper defines stability formally, explains why it is an essential property for an SCA, and gives experimental results from evaluating the stability of various decomposition algorithms suggested in the literature.
ISBN: 9781479972593 (print)
Semi-supervised clustering algorithms introduce partial knowledge into traditional unsupervised methods and generally improve results. Partial constrained clustering is one of the main kinds of semi-supervised clustering algorithms. Notably, constraints selected at random are likely to bring only trivial improvement. To improve both the effectiveness and the efficiency of partial constrained clustering algorithms, active selection of constraints is important. However, there are only a few studies on the active selection of constraints. To address this problem, we propose an improved approach to the active selection of constraints for partial constrained clustering algorithms. Experiments on a number of public benchmark data sets, comparing against the state-of-the-art Explore and Consolidate approach, show that (i) our approach finds more informative constraints for partial constrained clustering algorithms and brings encouraging improvement; and (ii) our approach quickly finds constraints distributed among all the classes in the investigated data sets, which shows that it can be used in more situations where only a small number of constraints is allowed.
ISBN: 9781538659892 (print)
As part of the 2018 MIT-Amazon Graph Challenge on subgraph isomorphism, we propose a novel joint hierarchical clustering and parallel counting technique, called the PHC algorithm, that can compute the exact number of triangles in large graphs. The PHC algorithm consists of pruning first, followed by hierarchical clustering based on geodesic distance, and then triangle counting in parallel. This allows a scalable software framework such as MapReduce/Hadoop to count, in parallel, the triangles inside each cluster as well as those straddling clusters. We characterize the performance of the PHC algorithm mathematically, and a performance evaluation using representative graphs, including random graphs, demonstrates its computational efficiency over other existing techniques.
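The split between triangles inside a cluster and triangles straddling clusters can be illustrated with a small sequential sketch. This is not the PHC algorithm: it skips pruning, hierarchical clustering, and parallelism, and the function name `split_triangle_counts` and the `cluster_of` mapping are hypothetical. It assumes an undirected simple graph whose node identifiers are comparable (for example, integers).

```python
def split_triangle_counts(edges, cluster_of):
    """Return (inside, straddling) triangle counts for a clustered graph.

    edges: iterable of (u, v) pairs of an undirected simple graph.
    cluster_of: dict mapping each node to its cluster label (assumed given).
    """
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    inside = straddling = 0
    for u in adj:
        for v in adj[u]:
            if v <= u:
                continue  # visit each edge once, with u < v
            for w in adj[u] & adj[v]:
                if w <= v:
                    continue  # visit each triangle once, with u < v < w
                if cluster_of[u] == cluster_of[v] == cluster_of[w]:
                    inside += 1
                else:
                    straddling += 1
    return inside, straddling
```

The two counts are independent of each other, which is what makes a per-cluster and cross-cluster division of work possible; their sum is the exact triangle count of the whole graph.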
ISBN: 0769509967, 0769509975 (print)
Clustering is one of the important techniques in data mining. The objective of clustering is to group objects into clusters such that objects within a cluster are more similar to each other than to objects in different clusters. The similarity between two objects is defined by a distance function, e.g., the Euclidean distance, which satisfies the triangle inequality. Distance calculation is computationally very expensive, and many algorithms have been proposed so far to solve this problem. This paper considers the gradual clustering problem. From practice, we noticed that users often begin clustering on a small number of attributes, e.g., two. If the result is partially satisfying, users continue clustering on a higher number of attributes, e.g., ten. We refer to this as the gradual clustering problem. In fact, gradual clustering can be considered as vertically incremental clustering. Approaches are proposed to solve this problem. The main idea is to reduce the number of distance calculations by using the triangle inequality. Our method first stores in an index the distances between a representative object and the objects in n-dimensional space. These pre-computed distances are then used to avoid distance calculations in (n+m)-dimensional space. Two experiments on real data sets demonstrate the added value of our approaches. The implemented algorithms are based on the DBSCAN algorithm with an associated M-Tree as the index tree. However, the principles of our idea can be integrated with other tree structures, such as MVP-Tree and R*-Tree, and with other clustering algorithms.
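The general pruning idea behind the triangle inequality can be sketched as follows. For any representative object r, |d(p, r) - d(q, r)| <= d(p, q), so a stored distance to r gives a lower bound on d(p, q) without computing it. The sketch below is a generic pivot-based range query, not the paper's M-Tree method and not its n to (n+m) dimensional scheme; the names `range_query`, `dist_to_pivot`, and `query_to_pivot` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def range_query(points, dist_to_pivot, query, query_to_pivot, eps):
    """Find points within eps of query, skipping provably distant ones.

    dist_to_pivot[i] is the pre-computed distance from points[i] to a
    representative object (pivot); query_to_pivot is the distance from the
    query to the same pivot. Returns (hit indices, exact distances computed).
    """
    hits, computed = [], 0
    for i, p in enumerate(points):
        # Lower bound from the triangle inequality: if it already exceeds
        # eps, the exact distance cannot be within eps.
        if abs(dist_to_pivot[i] - query_to_pivot) > eps:
            continue
        computed += 1
        if euclidean(p, query) <= eps:
            hits.append(i)
    return hits, computed
```

The pruning never discards a true neighbour, because the bound is a lower bound; it only saves work when the pivot is placed so that many bounds exceed eps.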
ISBN: 9783540681243 (print)
Prior works have elaborated on the problem of joint clustering in the optimization and geography domains. However, prior works neither clearly specify the connected constraint in the geography domain nor propose efficient algorithms. In this paper, we formulate the joint clustering problem in which a connected constraint and the number of clusters should be specified. We propose an algorithm K-means with Local Search (abbreviated as KLS) to solve the joint clustering problem with the connected constraint. Experimental results show that KLS can find correct clusters efficiently.
ISBN: 9781467376983 (print)
The paper describes a process of clustering article abstracts, taken from MEDLINE, the largest bibliographic database of life sciences and biomedical information, into categories that correspond to types of medical interventions, i.e., types of patient treatments. Experiments were carried out to evaluate the quality of clustering for the following algorithms: K-means, K-means++, hierarchical clustering, and SIB (Sequential Information Bottleneck), together with the LSA (Latent Semantic Analysis) and MI (Mutual Information) methods, which allow selecting feature vectors. The best clustering results were achieved by K-means++ together with LSA when a 210-dimensional space was chosen: Purity = 0.5719, Entropy = 1.3841, Normalized Entropy = 0.6299.
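For readers unfamiliar with the reported metrics, here is a minimal sketch of how purity and cluster entropy are commonly computed from cluster assignments and reference classes. The abstract does not give the paper's exact definitions, so the choices here are assumptions: natural logarithms, entropy weighted by cluster size, and normalization by the logarithm of the number of reference classes. The function name `purity_and_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_labels, class_labels):
    """Return (purity, entropy, normalized_entropy) for a clustering.

    cluster_labels[i] is the cluster of item i; class_labels[i] is its
    reference class. Normalizing by log(number of classes) is an assumption.
    """
    per_cluster = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        per_cluster[cluster][cls] += 1

    n = len(cluster_labels)
    # Purity: fraction of items that belong to their cluster's majority class.
    purity = sum(max(c.values()) for c in per_cluster.values()) / n

    # Entropy: size-weighted average of each cluster's class entropy.
    entropy = 0.0
    for counts in per_cluster.values():
        size = sum(counts.values())
        h = -sum((m / size) * math.log(m / size) for m in counts.values())
        entropy += (size / n) * h

    n_classes = len(set(class_labels))
    normalized = entropy / math.log(n_classes) if n_classes > 1 else 0.0
    return purity, entropy, normalized
```

Higher purity and lower entropy both indicate clusters that align better with the reference classes.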
ISBN: 9781728124551 (print)
To address the difficulty of determining the classification types in current fuzzy C-means classification, this paper proposes a power system load clustering method based on hierarchical and fuzzy theory, and introduces the silhouette coefficient from mathematics into power system load classification to measure classification quality. To address the problem of choosing the number of clusters in the original fuzzy C-means clustering algorithm, the idea of decision tree classification in the hierarchical clustering algorithm is integrated into the original algorithm to form an improved algorithm. The improved algorithm avoids the influence of prior values on the classification results and determines the optimal number of classes according to the silhouette coefficient index. Finally, the reliability and validity of the algorithm are verified using load data from the PJM market in the United States.
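The role of the silhouette coefficient in choosing the number of clusters can be sketched as below. This is only the standard silhouette computation applied to labelings that are assumed to be already available; it does not implement fuzzy C-means or the paper's hierarchical step. The function name `silhouette_score` and the convention that singleton clusters contribute 0 are assumptions of this sketch, and `math.dist` requires Python 3.8 or later.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    points: list of coordinate tuples; labels: hard cluster label per point.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Given candidate labelings for several cluster counts, for example a hypothetical dict `candidates` mapping each count k to its labels, the preferred count is the one with the highest score: `max(candidates, key=lambda k: silhouette_score(points, candidates[k]))`.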
ISBN: 9783319928715 (digital); 9783319928715, 9783319928708 (print)
Measuring graph clustering quality remains an open problem. Here, we introduce three statistical measures to address the problem. We empirically explore their behavior under a number of stress test scenarios and compare them to the commonly used modularity and conductance. Our measures are robust, immune to the resolution limit, easy to interpret intuitively, and also have a formal statistical interpretation. Our empirical stress test results confirm that our measures compare favorably to the established ones. In particular, they are shown to be more responsive to graph structure, less sensitive to sample size and to breakdowns during numerical implementation, and less sensitive to uncertainty in connectivity. These features are especially important in the context of larger data sets or when the data may contain errors in the connectivity patterns.
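The two established baselines named in the abstract, modularity and conductance, can be computed as in the sketch below. It does not implement the three new statistical measures, which the abstract does not define. The sketch assumes an unweighted, undirected simple graph with each edge listed once and no self-loops; the function names and the `community_of` mapping are hypothetical.

```python
def modularity(edges, community_of):
    """Newman modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ].

    L_c is the number of edges inside community c, d_c the total degree of
    its nodes, and m the number of edges in the graph.
    """
    m = len(edges)
    internal, degree_sum = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


def conductance(edges, subset):
    """Cut size divided by the smaller of the two side volumes."""
    subset = set(subset)
    cut = vol_in = vol_out = 0
    for u, v in edges:
        for node in (u, v):
            if node in subset:
                vol_in += 1
            else:
                vol_out += 1
        if (u in subset) != (v in subset):
            cut += 1  # edge crosses the boundary of the subset
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0
```

High modularity and low conductance both indicate well-separated communities under their respective definitions.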
ISBN: 9781509064946 (print)
Image segmentation is one of the pre-processing steps required to analyze color images. Image segmentation in RGB space, performed with a clustering algorithm, requires long computation time even for small images. Another approach that can be used for image segmentation is the histogram-based approach. However, histogram-based approaches are applicable only to single-channel or gray-scale images. Therefore, a hue-based approach is considered for segmentation of color images. However, since hue is an angular quantity, operations based on the number line cannot be used. In this study, directional clustering algorithms are used to solve this problem. The performance of the directional algorithms was measured and compared.
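A short sketch shows why number-line operations fail on hue. The arithmetic mean of 350° and 10° is 180°, the opposite side of the color wheel, whereas the circular mean is 0°. The helpers below are standard directional-statistics formulas, not the specific clustering algorithms evaluated in the paper; the function names are hypothetical.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction of angles in degrees, returned in [0, 360).

    The result is not meaningful when the angles cancel out exactly
    (for example 0 and 180), because no mean direction exists then.
    """
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


def circular_distance_deg(a, b):
    """Shortest angular distance between two angles, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)
```

A clustering algorithm for hue would use a distance like `circular_distance_deg` in place of the absolute difference and a mean like `circular_mean_deg` in place of the arithmetic mean when updating cluster centers.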
ISBN: 9783642316029 (print)
Document clustering refers to unsupervised classification (categorization) of documents into groups (clusters) in such a way that the documents in a cluster are similar, whereas dissimilar documents are assigned to different clusters. The documents may be web pages, blog posts, news articles, or other text files. A popular and computationally efficient clustering technique is flat clustering. Unlike hierarchical techniques, flat clustering algorithms aim to partition the document space into groups of similar documents. The cluster assignments, however, may be hard or soft. This paper presents our experimental work on evaluating some hard and soft flat-clustering algorithms, namely K-means, heuristic K-means, and fuzzy C-means, for categorizing text documents. We experimented with different representations (tf, tf-idf, Boolean) and feature selection schemes (with or without stop word removal and with or without stemming) on some standard datasets. The results indicate that the tf-idf representation and the use of stemming obtain better clustering. Moreover, fuzzy clustering obtains better results than K-means on almost all datasets, and is also a more stable method.
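The three document representations mentioned above can be sketched on tokenized documents as follows. The abstract does not state which tf-idf weighting the authors used, so the formula here, raw term count times log(N / df), is an assumption, and the function name `vectorize` is hypothetical. Stop word removal and stemming are assumed to have been applied beforehand.

```python
import math
from collections import Counter


def vectorize(docs):
    """Return (tf, tfidf, boolean) sparse vectors for tokenized documents.

    docs: list of token lists. Each result is a list with one dict per
    document, mapping a term to its weight.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    tf = [Counter(doc) for doc in docs]
    tfidf = [
        {term: count * math.log(n / df[term]) for term, count in counts.items()}
        for counts in tf
    ]
    boolean = [{term: 1 for term in counts} for counts in tf]
    return tf, tfidf, boolean
```

Under this weighting a term that occurs in every document gets a tf-idf weight of 0, so it no longer influences distances between documents.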