ISBN: 9781450383431 (Print)
This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scales that arise for both EMST and HDBSCAN*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13-55.89x, and existing parallel algorithms by at least an order of magnitude.
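To make the edge-generation step concrete, here is a minimal sequential Python sketch of the Kruskal phase described above: given a set of well-separated pairs (assumed to be precomputed elsewhere; the WSPD construction and all parallelism are omitted), it finds one bichromatic closest pair per pair by brute force and runs Kruskal's algorithm with union-find over those candidate edges. The function names `bccp` and `emst_from_wspd` are illustrative, not taken from the paper's implementation.

```python
import numpy as np

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def bccp(points, A, B):
    """Brute-force bichromatic closest pair between index sets A and B."""
    best = (np.inf, -1, -1)
    for a in A:
        d = np.linalg.norm(points[B] - points[a], axis=1)
        j = int(np.argmin(d))
        if d[j] < best[0]:
            best = (float(d[j]), a, B[j])
    return best

def emst_from_wspd(points, wspd_pairs):
    """Kruskal's algorithm over one BCCP candidate edge per well-separated
    pair; with a suitable separation constant, every EMST edge is the BCCP
    of some pair, so the MST of the candidate edges is the EMST."""
    edges = sorted(bccp(points, A, B) for A, B in wspd_pairs)
    uf, mst = UnionFind(len(points)), []
    for w, u, v in edges:
        if uf.union(u, v):
            mst.append((u, v, w))
    return mst
```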
ISBN: 9781728180144 (Print)
The Morse-Smale complex is a well-studied topological structure that represents the gradient flow behavior of a scalar function. It supports multi-scale topological analysis and visualization of large scientific data. Its computation poses significant algorithmic challenges when considering large-scale data and increased feature complexity. Several parallel algorithms have been proposed towards the fast computation of the 3D Morse-Smale complex. However, the non-trivial structure of the saddle-saddle connections is not amenable to parallel computation. This paper describes a fine-grained parallel method for computing the Morse-Smale complex that is implemented on a GPU. The saddle-saddle reachability is first determined via a transformation into a sequence of vector operations, followed by the path traversal, which is achieved via a sequence of matrix operations. Computational experiments show that the method achieves up to 7x speedup over current shared-memory implementations.
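As a rough illustration of the reachability idea (not the paper's GPU implementation), the sketch below expresses BFS-style reachability from a set of source vertices as repeated sparse matrix-vector products; each iteration is a bulk operation of the kind that maps naturally onto GPU primitives. The adjacency-matrix encoding and the function name are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def reachable_from(adj, sources):
    """Vertices reachable from `sources`, computed with bulk matrix-vector
    products instead of an explicit per-vertex graph traversal.

    adj: sparse (n, n) 0/1 matrix with adj[i, j] = 1 for an edge i -> j.
    """
    n = adj.shape[0]
    reached = np.zeros(n, dtype=bool)
    frontier = np.zeros(n, dtype=bool)
    frontier[sources] = True
    while frontier.any():
        reached |= frontier
        # One BFS level: successors of the frontier, minus vertices already reached.
        frontier = (adj.T.dot(frontier.astype(np.int64)) > 0) & ~reached
    return reached

# Tiny usage example: a path 0 -> 1 -> 2 and an isolated vertex 3.
adj = csr_matrix(([1, 1], ([0, 1], [1, 2])), shape=(4, 4))
print(reachable_from(adj, [0]))   # [ True  True  True False]
```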
ISBN: 9781450340922 (Print)
Deep neural networks (DNN) have recently achieved extraordinary results in domains like computer vision and speech recognition. An essential element of this success has been the introduction of high performance computing (HPC) techniques in the critical step of training the neural network. This paper describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grain parallelization of the DNN training algorithm. The coarse-grain parallelization is achieved by exploiting batch-level parallelism. This strategy does not depend on specialized or optimized libraries, so the optimization is immediately available for accelerating DNN training. The proposal is compatible with multi-GPU execution without altering the algorithm's convergence rate. The parallelization has been implemented in Caffe, a state-of-the-art DNN framework. The paper describes the code transformations required for the parallelization and identifies the limiting performance factors of the approach. We show competitive performance results for two state-of-the-art computer vision datasets, MNIST and CIFAR-10. In particular, on a 16-core Xeon E5-2667v2 at 3.30 GHz we observe speedups of 8x over the sequential execution, at performance levels similar to those obtained by the GPU-optimized Caffe version on an NVIDIA K40 GPU.
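To illustrate why batch-level parallelism can be convergence-invariant, here is a minimal NumPy sketch (for a linear least-squares model rather than a DNN, with the workers simulated sequentially): each worker computes the gradient sum over its slice of the batch, the sums are combined, and a single update is applied. Because the combined gradient equals the full-batch gradient exactly, the parameter trajectory matches the single-worker version. The function names are illustrative.

```python
import numpy as np

def worker_grad_sum(W, x_chunk, y_chunk):
    """Sum of per-example squared-loss gradients for a linear model on one
    chunk of the batch; stands in for one worker's forward/backward pass."""
    residual = x_chunk @ W - y_chunk
    return x_chunk.T @ residual

def data_parallel_step(W, x_batch, y_batch, n_workers, lr=0.01):
    """One coarse-grain parallel SGD step: split the batch across workers,
    combine their gradient sums, and apply a single update.  The combined
    gradient is exactly the full-batch gradient, so convergence is unchanged."""
    chunks = zip(np.array_split(x_batch, n_workers),
                 np.array_split(y_batch, n_workers))
    grad = sum(worker_grad_sum(W, xc, yc) for xc, yc in chunks) / len(x_batch)
    return W - lr * grad
```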
ISBN: 9781450312455 (Print)
Mutual Exclusion is a fundamental problem in distributed computing, and the problem of proving upper and lower bounds on the RMR complexity of this problem has been extensively studied. Here, we give matching lower and upper bounds on how RMR complexity trades off with space. Two implications of our results are that constant RMR complexity is impossible with subpolynomial space and subpolynomial RMR complexity is impossible with constant space for cache-coherent multiprocessors, regardless of how strong the hardware synchronization operations are. To prove these results we show that the complexity of mutual exclusion, which can be "messy" to analyze because of system details such as asynchrony and cache coherence, is captured precisely by a simple and purely combinatorial game that we design. We then derive lower and upper bounds for this game, thereby obtaining corresponding bounds for mutual exclusion. The lower bounds for the game are proved using potential functions.
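As background on what an RMR (remote memory reference) counts, the following is a conceptual Python sketch (not from the paper) of an array-based queue lock in the cache-coherent model: each waiting thread spins on its own slot, so a lock passage incurs only O(1) RMRs, but the lock needs one slot per thread, i.e., linear space. This is the kind of RMR/space trade-off the paper's bounds make precise. The ticket counter stands in for a hardware fetch-and-increment, and a real implementation would pad slots to separate cache lines.

```python
from itertools import count

class ArrayQueueLock:
    """Array-based queue lock sketch for up to n threads.

    Each thread spins on its own slot; on a cache-coherent machine those
    reads stay in the local cache, so acquire/release cost O(1) remote
    memory references -- at the price of O(n) space.
    """
    def __init__(self, n):
        self.n = n
        self.granted = [False] * n
        self.granted[0] = True          # slot 0 starts out granted
        self._ticket = count()          # stands in for atomic fetch-and-increment

    def acquire(self):
        slot = next(self._ticket) % self.n
        while not self.granted[slot]:   # local spinning on a thread-private slot
            pass
        self.granted[slot] = False
        return slot                     # caller passes this to release()

    def release(self, slot):
        # A single remote write hands the lock to the next waiting slot.
        self.granted[(slot + 1) % self.n] = True
```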
Matrix factorization is an efficient technique for uncovering latent features of real-world data. It finds application in areas such as text mining, image analysis, and social network analysis, and, more recently and prominently, in recommendation systems. Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD), and Coordinate Descent (CD) are among the methods commonly used to factorize large matrices. SGD-based factorization has proven to be the most successful of these methods since the Netflix and KDD Cup competitions, in which the winning algorithms relied on SGD. Parallelization of SGD has since become a hot topic and has been studied extensively in the literature in recent years. We focus on parallel SGD algorithms developed for shared-memory and distributed-memory systems. Shared-memory parallelizations include works such as HogWild, FPSGD, and MLGF-MF, and distributed-memory parallelizations include works such as DSGD, GASGD, and NOMAD. We present a survey containing an exhaustive analysis of these studies, and then focus particularly on DSGD, implementing it with the message-passing paradigm and testing its performance in terms of convergence and speedup. In contrast to existing works, our experiments use many real-world datasets that we produce from published raw data. We show that DSGD is a robust algorithm for large-scale datasets and achieves near-linear speedup with fast convergence rates.
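For reference, the core SGD update that all of the surveyed methods parallelize is shown below as a minimal sequential Python sketch of matrix factorization R ~ P Q^T (variable names and hyperparameters are illustrative, not from the survey). DSGD obtains parallelism by partitioning the rating matrix into blocks and running this same update concurrently on blocks whose user and item rows do not overlap.

```python
import random
import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=16, lr=0.01, reg=0.05, epochs=10):
    """Sequential SGD for matrix factorization: given (user, item, rating)
    triples, learn factors P (users x rank) and Q (items x rank)."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, rank))
    Q = rng.normal(scale=0.1, size=(n_items, rank))
    for _ in range(epochs):
        random.shuffle(ratings)
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            # Gradient step on the two factor rows touched by this rating.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q
```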
ISBN: 9781450339643 (Print)
We are on the cusp of the emergence of a new wave of nonvolatile memory technologies that are projected to become the dominant type of main memory in the near future. A key property of these new memory technologies is their asymmetric read-write costs: writes can incur an order of magnitude or more higher energy, higher latency, and lower (per-module) bandwidth than reads. This high cost of writes motivates a rethinking of algorithm design towards "write-efficient" algorithms and data structures that reduce their number of writes [1, 2, 3, 4, 5, 6]. Many popular techniques for sequential, distributed, and parallel algorithms are tuned to the setting where reads and writes cost the same, and hence need to be revisited. Prior work on reducing writes to contended cache lines in shared-memory algorithms can be useful here, but with the new technologies, even writes to uncontended memory are costly. Moreover, the new technologies are unlikely to replace the fastest cache memory, motivating the study of a multi-level memory hierarchy composed of smaller symmetric level(s) and a larger asymmetric level. Lower bounds, too, need to be revisited in light of asymmetric costs. This talk provides background on these emerging memory technologies, highlights the progress to date on these exciting research questions, and touches on a few of the many open problems.
Triangular meshes of superior quality are important for geometric processing in practical applications. Existing approximate CVT-based remeshing methods use planar polygonal facets to fit the original surface, which reduces computational complexity; however, they usually do not consider surface curvature. Topological errors and outliers can also occur when remeshing close sheet surfaces, resulting in incorrect meshes. In this regard, we present a novel method named PowerRTF, an extension of the restricted tangent face (RTF) in conjunction with the power diagram, to better approximate the original surface with curvature adaptation. The idea is to assign a weight to each sample point and compute the power diagram on the tangent face to produce area-controlled polygonal facets. Based on this, we impose a variable-capacity constraint and a centroid constraint on the PowerRTF, providing a trade-off between mesh quality and computational efficiency. Moreover, we apply a normal-verification-based inverse side point culling method to address the topological errors and outliers in close sheet surface remeshing. Our method computes and optimizes the PowerRTF independently per sample point, which is efficiently implemented in parallel on the GPU. Experimental results demonstrate the effectiveness, flexibility, and efficiency of our method.
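The key quantity behind the weighted construction is the power distance. As a minimal illustration (not the paper's PowerRTF pipeline, which operates on tangent faces of a surface), the sketch below assigns query points to weighted sites by minimum power distance ||x - s||^2 - w; increasing a site's weight enlarges its power cell, which is how per-point weights provide control over facet area.

```python
import numpy as np

def power_cell_labels(query_pts, sites, weights):
    """Assign each query point to the site of minimum power distance
    ||x - s||^2 - w.  query_pts: (m, d), sites: (k, d), weights: (k,)."""
    d2 = ((query_pts[:, None, :] - sites[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2 - weights[None, :], axis=1)

# Usage: the heavier site at the origin captures a larger share of the points.
pts = np.random.default_rng(0).uniform(-1, 1, size=(1000, 2))
labels = power_cell_labels(pts, np.array([[0.0, 0.0], [0.5, 0.5]]),
                           np.array([0.3, 0.0]))
```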