The advent of Big data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI,MapReduce,and *** important step for any parallel clusterin...
详细信息
The advent of Big data has led to the rapid growth in the usage of parallel clustering algorithms that work over distributed computing frameworks such as MPI,MapReduce,and *** important step for any parallel clustering algorithm is the distribution of data amongst the cluster *** step governs the methodology and performance of the entire *** typically use random,or a spatial/geometric distribution strategy like kd-tree based partitioning and grid-based partitioning,as per the requirements of the ***,these strategies are generic and are not tailor-made for any specific parallel clustering *** this paper,we give a very comprehensive literature survey of MPI-based parallel clustering algorithms with special reference to the specific data distribution strategies they *** also propose three new data distribution strategies namely Parameterized Dimensional Split for parallel density-based clustering algorithms like DBSCAN and OPTICS,Cell-Based Dimensional Split for dGridSLINK,which is a grid-based hierarchical clustering algorithm that exhibits efficiency for disjoint spatial distribution,and Projection-Based Split,which is a generic distribution *** of these preserve spatial locality,achieve disjoint partitioning,and ensure good data load *** experimental analysis shows the benefits of using the proposed data distribution strategies for algorithms they are designed for,based on which we give appropriate recommendations for their usage.
Over the last one decade or so, Machine Learning has changed the global technology landscape with applications in almost all disciplines and verticals. Mobile and Web Security is an important research area in which re...
详细信息
Federated learning (FL), as a safe distributed training mode, provides strong support for the edge intelligence of the Internet of Vehicles (IoV) to realize efficient collaborative control and safe data sharing. Howev...
详细信息
Big data has significantly increased the dependence of dataanalytics community on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized...
Big data has significantly increased the dependence of dataanalytics community on High Performance Computing (HPC) systems. However, efficiently programming an HPC system is still a tedious task requiring specialized skills in parallelization and the use of platform-specific languages as well as mechanisms. We present a framework for quickly prototyping new/existing density-based clustering algorithms while obtaining low running times and high speedups via automatic parallelization. The user is required only to specify the sequential algorithm in a Domain Specific Language (DSL) for clustering at a very high level of abstraction. The parallelizing compiler for the DSL does the rest to leverage distributed systems - in particular, typical scale-out clusters made of commodity hardware. Our approach is based on recurring, parallelizable programming patterns known as Kernels, which are identified and parallelized by the compiler. We demonstrate the ease of programming and scalable performance for DBSCAN, SNN, and RECOME algorithms. We also establish that the proposed approach can achieve performance comparable to state-of-the-art manually parallelized implementations while requiring minimal programming effort that is several orders of magnitude smaller than those required on other parallel platforms like MPI/Spark.
Hierarchical Agglomerative Clustering (HAC) algorithms are used in many applications where clusters have a hierarchical relationship between them. Their parallelization is challenging due to the dependence of every ag...
详细信息
ISBN:
(数字)9781728108582
ISBN:
(纸本)9781728108599
Hierarchical Agglomerative Clustering (HAC) algorithms are used in many applications where clusters have a hierarchical relationship between them. Their parallelization is challenging due to the dependence of every agglomeration step on all previous agglomerations. Although a few parallel algorithms have been proposed for SLINK HAC algorithm, only limited work has been done to parallelize other HAC algorithms. In this paper, we present a high-level abstraction, which provides a uniform way to specify any HAC algorithm, and a framework for automatic parallelization of the same for distributed memory systems. The abstraction is supported by constructs in a high level, domain specific language, and a compiler translates algorithms expressed in this language to efficient parallel code targeting distributed systems. Our experiments on multiple HAC algorithms proves that the runtime performance achieved is comparable with state-of-the-art manual parallel implementations on Spark and MPI while requiring only a fraction of the programming effort. At runtime, master-slave execution is used, and load is balanced among the slaves in an algorithm-agnostic way, which is a significant contrast to custom load-balancing techniques seen in the literature on parallel HAC algorithms.
The spread of COVID-19 has brought a huge disaster to the world, and the automatic segmentation of infection regions can help doctors to make diagnosis quickly and reduce workload. However, there are several challenge...
详细信息
Clustering is a popular data mining technique which discovers structure in unlabeled data by grouping objects together on the basis of a similarity criterion. Traditional similarity measures lose their meaning as the ...
详细信息
ISBN:
(纸本)9781509054121
Clustering is a popular data mining technique which discovers structure in unlabeled data by grouping objects together on the basis of a similarity criterion. Traditional similarity measures lose their meaning as the number of dimensions increases and as a consequence, distance or density based clustering algorithms become less meaningful. Shared Nearest Neighbor (SNN) is a solution to clustering high-dimensional data with the ability to find clusters of varying density. SNN assigns objects to a cluster, which share a large number of their nearest neighbors. However, SNN is compute and memory intensive for data of large size and/or dimensionality. Nearest neighbor queries are responsible for a major proportion of computations in SNN, resulting in lower efficiency for higher value of number of nearest neighbors (k). The main motivation of this work is to improve the efficiency of SNN and to parallelize it so that it can be used for clustering large high-dimensional datasets and for large values of k. Existing SNN algorithms become inefficient in these situations. In this paper, we present a new sequential SNN algorithm, R-SNN, which uses R-tree for executing neighborhood queries efficiently and exploiting spatial locality to minimize memory usage. R-SNN is benchmarked against the best available implementation of SNN and is found up to 77 times faster when tested on various real datasets. R-SNN is parallelized for distributed memory, shared memory, and hybrid systems. Significant speedup and scalability achieved can be attributed to parallelization and good load balancing strategies and also to exploitation of spatial locality. Experimental results demonstrate the same for datasets of varying dimensionality and size. The maximum speedup achieved for shared, distributed, and hybrid models are 427.19 using 48 threads, 394.24 using 32 processes, and 1380.69 on 32 nodes (with each node spawning 4 threads), respectively. Super-linear speedup for some datasets is attributed
Clustering is a popular data mining and machine learning technique which discovers interesting patterns from unlabeled data by grouping similar objects together. Clustering high-dimensional data is a challenging task ...
详细信息
ISBN:
(纸本)9781509052073
Clustering is a popular data mining and machine learning technique which discovers interesting patterns from unlabeled data by grouping similar objects together. Clustering high-dimensional data is a challenging task as points in high dimensional space are nearly equidistant from each other, rendering commonly used similarity measures ineffective. Subspace clustering has emerged as a possible solution to the problem of clustering high-dimensional data. In subspace clustering, we try to find clusters in different subspaces within a dataset. Many subspace clustering algorithms have been proposed in the last two decades to find clusters in multiple overlapping subspaces of high-dimensional data. Subspace clustering algorithms iteratively find the best subset of dimensions for a cluster from 2d-1 possible combinations in d-dimensional data. Subspace clustering is extremely compute intensive because of exhaustive search of subspaces, especially in the bottom-up subspace clustering algorithms. To address this issue, an efficient parallel framework for grid-based bottom-up subspace clustering algorithms is developed, considering popular algorithms belonging to this category. The framework is implemented for shared memory, distributed memory, and hybrid systems and is tested for three grid-based bottom-up subspace clustering algorithms: CLIQUE, MAFIA, and ENCLUS. All parallel implementations exhibit impressive speedup and scalability on real datasets.
Single linkage (SLINK) hierarchical clustering algorithm is a preferred clustering algorithm over traditional partitioning-based clustering as it does not require the number of clusters as input. But, due to its high ...
详细信息
ISBN:
(纸本)9781509042982
Single linkage (SLINK) hierarchical clustering algorithm is a preferred clustering algorithm over traditional partitioning-based clustering as it does not require the number of clusters as input. But, due to its high time complexity and inherent data dependencies, it does not scale well for large datasets. To the best of our knowledge, all existing parallel SLINK algorithms are based on the traditional SLINK algorithm and thus require large number of computing resources. In this paper, we present a novel optimization of SLINK algorithm, GridSLINK, which is an order of magnitude faster than the existing state-of-the-art implementation. The optimization in GridSLINK comes from reduction in number of distance calculations required by SLINK. This reduction is achieved by exploiting spatial locality of data points and using an adaptive gridding technique. GridSLINK is parallelized for distributed memory systems. Scalable performance is achieved for increasing number of compute nodes. The proposed parallel algorithm, dGridSLINK, is benchmarked against the best existing parallel algorithm in literature and found to outperform the latter for all the real datasets considered. dGridSLINK can cluster millions of data points in few seconds/minutes using a small number of processing elements, without compromising the quality of clustering.
parallelizing data mining algorithms has become a necessity as we try to mine ever increasing volumes of data. Spatial data mining algorithms like Dbscan, Optics, Slink, etc. have been parallelized to exploit a cluste...
详细信息
ISBN:
(纸本)9781467390064
parallelizing data mining algorithms has become a necessity as we try to mine ever increasing volumes of data. Spatial data mining algorithms like Dbscan, Optics, Slink, etc. have been parallelized to exploit a cluster infrastructure. The efficiency achieved by existing algorithms can be attributed to spatial locality preservation using spatial indexing structures like k-d-tree, quad-tree, grid files, etc. for distributing data among cluster nodes. However, these indexing structures are static in nature, i.e., they need to scan the entire dataset to determine the partitioning coordinates. This results in high data distribution cost when the data size is large. In this paper, we propose a dynamic distributed data structure, DD-Rtree, which preserves spatial locality while distributing data across compute nodes in a shared nothing environment. Moreover, DD-Rtree is dynamic, i.e., it can be constructed incrementally making it useful for handling big data. We compare the quality of data distribution achieved by DD-Rtree with one of the recent distributed indexing structure, SD-Rtree. We also compare the efficiency of queries supported by these indexing structures along with the overall efficiency of DBSCAN algorithm. Our experimental results show that DD-Rtree achieves better data distribution and thereby resulting in improved overall efficiency.
暂无评论