ISBN:
(Print) 9780769548807
Clustering algorithms are an important component of data mining technology and have been applied widely in many applications, including those that operate on the Internet. Recently a new line of research, namely Web Intelligence, has emerged that demands advanced analytics and machine learning algorithms for supporting knowledge discovery, mainly in the Web environment. The so-called Web Intelligence data are known to be dynamic and loosely structured, and to consist of complex attributes. To deal with this challenge, standard clustering algorithms have been improved and extended with the optimization abilities of swarm intelligence, a branch of nature-inspired computing. Some examples are PSO clustering (C-PSO) and clustering with Ant Colony Optimization. The objective of this paper is to investigate the possibilities of applying other nature-inspired optimization algorithms (such as Fireflies, Cuckoos, Bats and Wolves) to clustering over Web Intelligence data. The efficacy of each new clustering algorithm is reported in this paper; in general, they outperformed C-PSO.
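The C-PSO baseline this abstract compares against can be illustrated with a minimal sketch: each particle encodes a candidate set of cluster centers, and the swarm minimizes the total squared distance from the points to their nearest center. The 1-D data, swarm size, and inertia/attraction weights below are illustrative assumptions, not settings from the paper.

```python
import random

def sse(centroids, points):
    # fitness: total squared distance from each point to its nearest centroid
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

def pso_cluster(points, k=2, n_particles=10, iters=50, seed=0):
    rng = random.Random(seed)
    lo, hi = min(points), max(points)
    # each particle encodes k candidate centroids (1-D data for brevity)
    pos = [[rng.uniform(lo, hi) for _ in range(k)] for _ in range(n_particles)]
    vel = [[0.0] * k for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    gbest = min(pbest, key=lambda c: sse(c, points))[:]
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and attraction weights (assumed values)
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(k):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if sse(pos[i], points) < sse(pbest[i], points):
                pbest[i] = pos[i][:]
                if sse(pbest[i], points) < sse(gbest, points):
                    gbest = pbest[i][:]
    return sorted(gbest)
```

The firefly, cuckoo, bat and wolf variants studied in the paper replace the velocity update above with their own movement rules while keeping the same fitness function.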
ISBN:
(Print) 9783642030390
It is well known that the clusters produced by a clustering algorithm depend on the chosen initial centers. In this paper we present a measure for the degree to which a given clustering algorithm depends on the choice of initial centers, for a given data set. This measure is calculated for four well-known offline clustering algorithms (k-means Forgy, k-means Hartigan, k-means Lloyd and fuzzy c-means), for five benchmark data sets. The measure is also calculated for ECM, an online algorithm that does not require the number of initial centers as input, but for which the resulting clusters can depend on the order in which the input arrives. Our main finding is that this initialization dependence measure can also be used to determine the optimal number of clusters.
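The paper's exact dependence measure is not reproduced in this abstract; the sketch below shows one plausible instantiation of the idea, assuming Forgy initialization of Lloyd's k-means and measuring how often two restarts disagree about whether a pair of points belongs to the same cluster (0.0 means fully initialization-independent).

```python
import random
from itertools import combinations

def kmeans_labels(points, k, seed, iters=20):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # Forgy-style initialization
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (p - centers[j]) ** 2)
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

def init_dependence(points, k, n_runs=8):
    # fraction of point pairs whose co-assignment differs between runs,
    # averaged over all pairs of runs
    runs = [kmeans_labels(points, k, seed) for seed in range(n_runs)]
    disagreements, total = 0, 0
    for a, b in combinations(runs, 2):
        for i, j in combinations(range(len(points)), 2):
            total += 1
            if (a[i] == a[j]) != (b[i] == b[j]):
                disagreements += 1
    return disagreements / total
```

On well-separated data every restart converges to the same partition, so the measure is near zero; overlapping clusters push it upward.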
ISBN:
(Print) 9783540735960
In this paper, we analyze some widely employed clustering algorithms for identifying duplicated or cloned pages in web applications. In particular, we consider an agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, a k-means partitional clustering algorithm, and a partitional competitive clustering algorithm, namely Winner Takes All (WTA). All the clustering algorithms take as input a matrix of the distances between the structures of the web pages. The distance between two pages is computed by applying the Levenshtein edit distance to the strings that encode the sequences of HTML tags of the web pages.
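The distance computation described here can be sketched directly: extract each page's tag sequence, then run the standard Levenshtein dynamic program over the two sequences. The regular expression below is a simplification that ignores attributes and comments, and is only assumed to approximate the paper's encoding.

```python
import re

def tag_sequence(html):
    # extract the sequence of HTML tag names, ignoring text content;
    # both opening and closing tags contribute to the structure
    return re.findall(r"</?\s*([a-zA-Z0-9]+)", html)

def levenshtein(a, b):
    # classic dynamic-programming edit distance over tag sequences
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]
```

Filling the full distance matrix then means computing this value for every pair of pages.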
ISBN:
(Print) 9781424435968
Three approaches to extracting clusters sequentially, so that the number of clusters need not be specified beforehand, are introduced, and four algorithms are developed. The first is derived from possibilistic clustering, while the second is a variation of mountain clustering that uses medoids as cluster representatives. Moreover, an algorithm based on the idea of noise clustering is developed. The last idea is also applied to the sequential extraction of regression models, which yields the fourth algorithm. We compare these algorithms using numerical examples.
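A minimal sketch of the sequential-extraction idea, loosely combining a mountain-function density estimate with medoid representatives: repeatedly pick the densest remaining point, claim its neighborhood as one cluster, and stop when no dense region remains, so k never has to be fixed in advance. The 1-D data, Gaussian width, and extraction radius are illustrative assumptions.

```python
import math

def mountain_value(candidate, points, sigma=1.0):
    # density estimate at a candidate medoid (sum of Gaussian kernels)
    return sum(math.exp(-((candidate - p) ** 2) / (2 * sigma ** 2))
               for p in points)

def extract_clusters(points, radius=1.5, min_size=2):
    # sequentially extract clusters: pick the medoid with the highest
    # mountain value, claim all points within `radius`, and repeat;
    # stop when the next candidate cluster is too small (noise)
    remaining = list(points)
    clusters = []
    while remaining:
        medoid = max(remaining, key=lambda c: mountain_value(c, remaining))
        cluster = [p for p in remaining if abs(p - medoid) <= radius]
        if len(cluster) < min_size:
            break
        clusters.append(cluster)
        remaining = [p for p in remaining if abs(p - medoid) > radius]
    return clusters
```

Isolated points are left unclaimed, which mirrors the noise-clustering idea of not forcing every point into a cluster.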
ISBN:
(Print) 9780898717013
In the kernel clustering problem we are given a (large) n x n symmetric positive semidefinite matrix A = (a_ij) with sum_{i=1}^{n} sum_{j=1}^{n} a_ij = 0 and a (small) k x k symmetric positive semidefinite matrix B = (b_ij). The goal is to find a partition {S_1, ..., S_k} of {1, ..., n} which maximizes sum_{i=1}^{k} sum_{j=1}^{k} (sum_{(p,q) in S_i x S_j} a_pq) b_ij. We design a polynomial time approximation algorithm that achieves an approximation ratio of R(B)^2 / C(B), where R(B) and C(B) are geometric parameters that depend only on the matrix B, defined as follows: if b_ij = <v_i, v_j> is the Gram matrix representation of B for some v_1, ..., v_k in R^k, then R(B) is the minimum radius of a Euclidean ball containing the points {v_1, ..., v_k}. The parameter C(B) is defined as the maximum, over all measurable partitions {A_1, ..., A_k} of R^{k-1}, of the quantity sum_{i=1}^{k} sum_{j=1}^{k} b_ij <z_i, z_j>, where for i in {1, ..., k} the vector z_i in R^{k-1} is the Gaussian moment of A_i, i.e. z_i = (1 / (2 pi)^{(k-1)/2}) integral_{A_i} x e^{-||x||^2 / 2} dx. We also show that for every epsilon > 0, achieving an approximation guarantee of (1 - epsilon) R(B)^2 / C(B) is Unique Games hard.
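The objective being maximized can be made concrete with a brute-force evaluation for tiny instances; this is exponential in n and only a sanity check on the definition, not the paper's polynomial-time algorithm.

```python
from itertools import product

def kernel_clustering_value(A, B, assignment):
    # objective: sum over i, j of b_ij * (sum of a_pq over p in S_i, q in S_j),
    # where assignment[p] gives the cluster index of element p
    k, n = len(B), len(A)
    inner = [[0.0] * k for _ in range(k)]
    for p in range(n):
        for q in range(n):
            inner[assignment[p]][assignment[q]] += A[p][q]
    return sum(B[i][j] * inner[i][j] for i in range(k) for j in range(k))

def brute_force_best(A, B):
    # try every assignment of the n elements to k clusters
    n, k = len(A), len(B)
    return max(kernel_clustering_value(A, B, asg)
               for asg in product(range(k), repeat=n))
```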
ISBN:
(Print) 9781479913039
In this paper, a novel probabilistic load modeling approach is presented. The proposed approach starts by grouping the 24 data points representing the hourly loading of each day into one data segment. The resulting 365 data segments, representing the whole year's loading profile, are evaluated for similarities using principal component analysis; then segments with similar principal components are grouped together into one cluster using clustering algorithms. For each cluster a representative segment is selected and its probability of occurrence is computed. The results of the proposed algorithm can be used in different studies to model the long-term behavior of electrical loads taking into account their temporal variations. This feature is possible as the selected representative segments cover the whole year. The designated representative segments are assigned probabilistic indices that correspond to their frequency of occurrence, thus preserving the stochastic nature of electrical loads.
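The segment-then-represent pipeline can be sketched as follows; for brevity the PCA step is omitted and a plain k-means runs directly on the 24-hour segments, with the cluster mean standing in for the representative segment. Data shapes and parameters are illustrative assumptions.

```python
import random

def cluster_days(segments, k=2, iters=20):
    # crude k-means over 24-hour load segments (the paper's PCA step
    # is omitted here; we cluster the raw segments)
    rng = random.Random(1)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [list(s) for s in rng.sample(segments, k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(s, centers[j]))
                  for s in segments]
        for j in range(k):
            members = [s for s, l in zip(segments, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def representatives(segments, labels):
    # one representative segment per cluster, with probability of occurrence
    out = []
    for j in set(labels):
        members = [s for s, l in zip(segments, labels) if l == j]
        rep = [sum(col) / len(members) for col in zip(*members)]  # mean profile
        out.append((rep, len(members) / len(segments)))
    return out
```

The returned probabilities are the cluster sizes divided by the number of days, which is how the representatives preserve the load's stochastic behavior.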
ISBN:
(Print) 9781450363860
This paper proposes a novel framework for speckle noise suppression and edge preservation using clustering algorithms in ultrasound images. The algorithms considered are K-means clustering, fuzzy C-means clustering, possibilistic C-means, fuzzy possibilistic C-means, and possibilistic fuzzy C-means clustering. This work presents an exhaustive comparative analysis of the above clustering algorithms to assess their suitability for despeckling and identifies the best clustering algorithm. Two types of dataset are considered: medical ultrasound images of the thyroid, and synthetically modelled ultrasound images. The framework consists of several distinct phases: first the edges of the image are identified using the Canny edge operator, and then a clustering algorithm is applied to the high-frequency coefficients extracted using the wavelet transform. Finally, the preserved edges are added back to the speckle-suppressed image. Thus, the proposed clustering method effectively accomplishes both speckle suppression and edge preservation. This paper also presents a quantitative evaluation of results to demonstrate the effectiveness of the clustering approach.
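The cluster-then-threshold idea can be sketched in one dimension: take a Haar wavelet transform, split the detail-coefficient magnitudes into noise and edge groups with a crisp 2-means (a stand-in for the fuzzy/possibilistic variants compared in the paper), zero the noise group, and reconstruct. The Canny step and 2-D images are omitted; this is only an illustration of the coefficient-clustering phase.

```python
def haar_1d(signal):
    # one-level Haar transform: (approximation, detail) coefficient lists;
    # assumes an even-length signal
    approx = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(len(signal) // 2)]
    detail = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(len(signal) // 2)]
    return approx, detail

def inverse_haar_1d(approx, detail):
    out = []
    for a, d in zip(approx, detail):
        out += [a + d, a - d]
    return out

def despeckle(signal):
    # cluster detail-coefficient magnitudes into {noise, edge} with 2-means,
    # suppress the noise group, keep the edge group
    approx, detail = haar_1d(signal)
    mags = sorted(abs(d) for d in detail)
    c_noise, c_edge = mags[0], mags[-1]
    for _ in range(10):
        noise = [m for m in mags if abs(m - c_noise) <= abs(m - c_edge)]
        edge = [m for m in mags if abs(m - c_noise) > abs(m - c_edge)]
        if noise:
            c_noise = sum(noise) / len(noise)
        if edge:
            c_edge = sum(edge) / len(edge)
    thresh = (c_noise + c_edge) / 2
    kept = [d if abs(d) > thresh else 0.0 for d in detail]  # preserve edges
    return inverse_haar_1d(approx, kept)
```

A large detail coefficient marks an edge and survives; small speckle-like coefficients are zeroed, which smooths the flat regions.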
ISBN:
(Print) 9783642222467
In this paper we present a comparative study of three data stream clustering algorithms: STREAM, CluStream and MR-Stream. We used a total of 90 synthetic data sets generated from spatial point processes following Gaussian distributions or mixtures of Gaussians. The algorithms were executed in three main scenarios: 1) low dimensional; 2) low dimensional with concept drift; and 3) high dimensional with concept drift. In general, CluStream outperformed the other algorithms in terms of clustering quality, at a higher execution time cost. Our results are analyzed with the non-parametric Friedman test and the post-hoc Nemenyi test, both with alpha = 5%. Recommendations and future research directions are also explored.
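The Friedman statistic used in such comparisons has a closed form over the algorithms' mean ranks across data sets; a self-contained sketch follows (the chi-square significance lookup and the Nemenyi post-hoc step are omitted).

```python
def ranks(row):
    # average ranks, 1 = best (lowest value); ties share the mean rank
    order = sorted(range(len(row)), key=lambda i: row[i])
    r = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def friedman_statistic(results):
    # results[d][a] = score of algorithm a on data set d (lower = better);
    # chi2_F = 12N / (k(k+1)) * (sum_j Rbar_j^2 - k(k+1)^2 / 4)
    n, k = len(results), len(results[0])
    mean_ranks = [sum(ranks(row)[a] for row in results) / n for a in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks
```

When the ranking is identical on every data set (complete agreement), the statistic reaches its maximum for the given N and k.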
ISBN:
(Print) 9781509028092
Community structure is a feature of complex networks that can be crucial for understanding their internal organization. This is particularly true for brain networks, as brain functioning is thought to be based on a modular organization. In the last decades, many clustering algorithms have been developed with the aim of identifying communities in networks of different nature. However, there is still no agreement about which one is the most reliable, and testing and comparing these algorithms under a variety of conditions would be beneficial to potential users. In this study, we performed a comparative analysis of six different clustering algorithms, analyzing their performance on a ground truth consisting of simulated networks with properties spanning a wide range of conditions. Results show the effect of factors such as the noise level, the number of clusters, and the network dimension and density on the performance of the algorithms, and provide some guidelines on choosing the most appropriate algorithm for the conditions at hand. The best performance over a wide range of conditions was obtained by the Louvain and Leicht & Newman algorithms, while Ronhovde and Infomap proved more appropriate in very noisy conditions. Finally, as a proof of concept, we applied the algorithms under examination to brain functional connectivity networks obtained from EEG signals recorded during a sustained movement of the right hand, obtaining a clustering of scalp electrodes that agrees with the results of the simulation study.
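Several of the algorithms compared here (Louvain, Leicht & Newman) optimize Newman's modularity, which scores how much denser the within-community connections are than expected at random. Evaluating it for a candidate partition is straightforward; the adjacency-matrix representation below is just one convenient choice.

```python
def modularity(adj, communities):
    # Newman modularity Q for an undirected, unweighted graph given as an
    # adjacency matrix; communities is a list of lists of node indices
    n = len(adj)
    two_m = sum(sum(row) for row in adj)  # 2m: each edge counted twice
    degree = [sum(row) for row in adj]
    label = [0] * n
    for c, members in enumerate(communities):
        for v in members:
            label[v] = c
    q = 0.0
    for i in range(n):
        for j in range(n):
            if label[i] == label[j]:
                q += adj[i][j] - degree[i] * degree[j] / two_m
    return q / two_m
```

For two triangles joined by a single bridge edge, splitting at the bridge scores Q = 5/14, while lumping all nodes together scores 0, matching the intuition that the split is the better community structure.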
ISBN:
(Print) 9781538674741
Many datasets, including social media data and bibliographic data, can be modeled as graphs. Clustering such graphs can provide useful insights into the structure of the data. To improve the quality of clustering, node attributes can be taken into account, resulting in attributed graphs. Existing attributed graph clustering methods generally consider attribute similarity and structural similarity separately. In this paper, we represent attributed graphs as star-schema heterogeneous graphs, where attributes are modeled as different types of graph nodes. This enables the use of personalized PageRank (PPR) as a unified distance measure that captures both structural and attribute similarity. We employ DBSCAN for clustering, and we update edge weights iteratively to balance the importance of different attributes. To improve the efficiency of the clustering, we develop two incremental approaches that enable efficient PPR score computation when edge weights are updated. To boost the effectiveness of the clustering, we propose a simple yet effective entropy-based edge weight update strategy. In addition, we present a game-theory-based method that enables trading efficiency for result quality. Extensive experiments on real-life datasets offer insight into the effectiveness and efficiency of our proposals compared with existing methods.
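The unified-distance idea can be sketched on a toy star-schema graph: attribute values become nodes of their own, so PPR mass can flow between objects both through structural edges and through shared attributes. The node names, uniform weights, and damping factor below are illustrative assumptions, not the paper's setup.

```python
def personalized_pagerank(adj, source, alpha=0.85, iters=100):
    # power iteration for PPR; adj maps node -> list of (neighbor, weight),
    # and all teleport mass returns to `source`
    nodes = list(adj)
    pr = {v: (1.0 if v == source else 0.0) for v in nodes}
    for _ in range(iters):
        nxt = {v: (1 - alpha) * (1.0 if v == source else 0.0) for v in nodes}
        for u in nodes:
            total = sum(w for _, w in adj[u]) or 1.0
            for v, w in adj[u]:
                nxt[v] += alpha * pr[u] * w / total
        pr = nxt
    return pr
```

In a graph where objects p1 and p2 share an attribute node while p3 does not, PPR from p1 ranks p2 above p3 even without any direct structural edge, which is exactly the effect the star-schema representation is after; a density-based method such as DBSCAN can then cluster on distances derived from these scores.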