DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of DBSCAN is challenging as it exhibits an inherent seque...
详细信息
ISBN:
(纸本)9781467308052;9781467308045
DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of DBSCAN is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory. Using datasets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.
DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of DBSCAN is challenging as it exhibits an inherent seque...
详细信息
ISBN:
(纸本)9781467308052
DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of DBSCAN is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency. We present a new parallel DBSCAN algorithm (PDSDBSCAN) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of DBSCAN. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory. Using datasets containing up to several hundred million high-dimensional points, we show that PDSDBSCAN significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.
Consider a forest that evolves via link operations that make the root of one tree the child of a node in another tree. Intermixed with link operations are nca operations, which return the nearest common ancestor of tw...
详细信息
Consider a forest that evolves via link operations that make the root of one tree the child of a node in another tree. Intermixed with link operations are nca operations, which return the nearest common ancestor of two given nodes when such exists. This article shows that a sequence of m such nca and link operations on a forest of n nodes can be processed online in time O(m alpha(m, n) + n). This was previously known only for a restricted type of link operation. The special case where a link only extends a tree by adding a new leaf occurs in Edmonds' algorithm for finding a maximum weight matching on a general graph. Incorporating our algorithm into the implementation of Edmonds' algorithm in [9] achieves time O(n(m + n logn)) for weighted matching, an arguably optimum asymptotic bound (n and m are the number of vertices and edges, respectively). Our datastructure also provides a simple alternative implementation of the incremental-tree set merging algorithm of Gabow and Tarjan [11].
DBSCAN is a widely used isodensity-based clustering algorithm for particle data well-known for its ability to isolate arbitrarily-shaped clusters and to filter noise data. The algorithm is super-linear (O(nlogn)) and ...
详细信息
ISBN:
(纸本)9781479955008
DBSCAN is a widely used isodensity-based clustering algorithm for particle data well-known for its ability to isolate arbitrarily-shaped clusters and to filter noise data. The algorithm is super-linear (O(nlogn)) and computationally expensive for large datasets. Given the need for speed, we propose a fast heuristic algorithm for DBSCAN using density based sampling, which performs equally well in quality compared to exact algorithms, but is more than an order of magnitude faster. Our experiments on astrophysics and synthetic massive datasets (8.5 billion numbers) shows that our approximate algorithm is up to 56x faster than exact algorithms with almost identical quality (Omega-Index >= 0.99). We develop a new parallel DBSCAN algorithm, which uses dynamic partitioning to improve load balancing and locality. We demonstrate near-linear speedup on shared memory (15x using 16 cores, single node Intel (R) Xeon (R) processor) and distributed memory (3917x using 4096 cores, multinode) computers, with 2x additional performance improvement using Intel (R) Xeon Phi (TM) coprocessors. Additionally, existing exact algorithms can achieve up to 3.4 times speedup using dynamic partitioning.
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered chal...
详细信息
ISBN:
(纸本)9781450323789
OPTICS is a hierarchical density-based data clustering algorithm that discovers arbitrary-shaped clusters and eliminates noise using adjustable reachability distance thresholds. Parallelizing OPTICS is considered challenging as the algorithm exhibits a strongly sequential data access order. We present a scalable parallel OPTICS algorithm (POPTICS) designed using graph algorithmic concepts. To break the data access sequentiality, POPTICS exploits the similarities between the OPTICS algorithm and PRIM'S Minimum Spanning Tree algorithm. Additionally, we use the disjoint-set data structure to achieve a high parallelism for distributed cluster extraction. Using high dimensional datasets containing up to a billion floating point numbers, we show scalable speedups of up to 27.5 for our OpenMP implementation on a 40-core shared-memory machine, and up to 3,008 for our MPI implementation on a 4,096-core distributed-memory machine. We also show that the quality of the results given by POPTICS is comparable to those given by the classical OPTICS algorithm.
暂无评论