ISBN (print): 9781450395458
A symmetric matrix is called a Laplacian if it has nonpositive off-diagonal entries and zero row sums. Since the seminal work of Spielman and Teng (2004) on solving Laplacian linear systems in nearly linear time, several algorithms have been designed for the task. Yet, the work of Kyng and Sachdeva (2016) remains the simplest and most practical sequential solver. They presented a solver purely based on random sampling and without graph-theoretic constructions such as low-stretch trees and sparsifiers. In this work, we extend the result of Kyng and Sachdeva to a simple parallel Laplacian solver with O(m log^3 n log log n) or O((m + n log^5 n) log n log log n) work and O(log^2 n log log n) depth using the ideas of block Cholesky factorization from Kyng et al. (2016). Compared to the best known parallel Laplacian solvers that achieve polylogarithmic depth due to Lee et al. (2015), our solver achieves both better depth and, for dense graphs, better work.
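For concreteness, the defining property in the first sentence (symmetry, nonpositive off-diagonals, zero row sums) can be checked directly; the following is a minimal sketch, with an illustrative 3x3 example that is not from the paper:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Check whether a dense symmetric matrix is a (graph) Laplacian:
// nonpositive off-diagonal entries and zero row sums.
bool isLaplacian(const std::vector<std::vector<double>>& A, double tol = 1e-9) {
    size_t n = A.size();
    for (size_t i = 0; i < n; ++i) {
        double rowSum = 0.0;
        for (size_t j = 0; j < n; ++j) {
            if (std::fabs(A[i][j] - A[j][i]) > tol) return false;  // symmetry
            if (i != j && A[i][j] > tol) return false;             // nonpositive off-diagonal
            rowSum += A[i][j];
        }
        if (std::fabs(rowSum) > tol) return false;                 // zero row sum
    }
    return true;
}

int main() {
    // Laplacian of a path graph on three vertices: 1 - 2 - 3.
    std::vector<std::vector<double>> L = {{ 1, -1,  0},
                                          {-1,  2, -1},
                                          { 0, -1,  1}};
    std::printf("%s\n", isLaplacian(L) ? "Laplacian" : "not a Laplacian");
}
```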
ISBN (print): 9798400704352
Approximate nearest-neighbor search (ANNS) algorithms are a key part of the modern deep learning stack due to enabling efficient similarity search over high-dimensional vector space representations (i.e., embeddings) of data. Among various ANNS algorithms, graph-based algorithms are known to achieve the best throughput-recall tradeoffs. Despite the large scale of modern ANNS datasets, existing parallel graph-based implementations face significant challenges in scaling to large datasets due to heavy use of locks and other sequential bottlenecks, which 1) prevents them from efficiently scaling to a large number of processors, and 2) results in nondeterminism that is undesirable in certain applications. In this paper, we introduce ParlayANN, a library of deterministic and parallel graph-based approximate nearest neighbor search algorithms, along with a set of useful tools for developing such algorithms. In this library, we develop novel parallel implementations for four state-of-the-art graph-based ANNS algorithms that scale to billion-scale datasets. Our algorithms are deterministic and achieve high scalability across a diverse set of challenging datasets. In addition to the new algorithmic ideas, we also conduct a detailed experimental study of our new algorithms as well as two existing non-graph approaches. Our experimental results both validate the effectiveness of our new techniques and lead to a comprehensive comparison among ANNS algorithms on large-scale datasets, with a list of interesting findings.
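This is not ParlayANN's API; as a rough illustration of the search primitive that graph-based ANNS algorithms share, here is a minimal greedy-routing sketch on a proximity graph (the points, graph, and function names are illustrative assumptions):

```cpp
#include <cstdio>
#include <vector>

using Point = std::vector<float>;

float sqDist(const Point& a, const Point& b) {
    float d = 0;
    for (size_t i = 0; i < a.size(); ++i) { float t = a[i] - b[i]; d += t * t; }
    return d;
}

// Greedy routing on a proximity graph: starting from `start`, repeatedly move
// to the out-neighbor closest to the query until no neighbor improves.
int greedySearch(const std::vector<Point>& pts,
                 const std::vector<std::vector<int>>& graph,
                 const Point& query, int start) {
    int cur = start;
    float best = sqDist(pts[cur], query);
    bool improved = true;
    while (improved) {
        improved = false;
        for (int u : graph[cur]) {
            float d = sqDist(pts[u], query);
            if (d < best) { best = d; cur = u; improved = true; }
        }
    }
    return cur;  // approximate nearest neighbor of `query`
}

int main() {
    std::vector<Point> pts = {{0, 0}, {1, 0}, {2, 0}, {2, 2}};
    std::vector<std::vector<int>> graph = {{1}, {0, 2}, {1, 3}, {2}};
    std::printf("nearest to (1.9, 0.1): point %d\n",
                greedySearch(pts, graph, {1.9f, 0.1f}, 0));
}
```

Production graph-based ANNS implementations replace this single greedy walk with a beam search over a bounded candidate list, which is where the parallelization and locking issues discussed above arise.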
ISBN (print): 9781450395458
Clustering and diversification are two central problems with various applications in machine learning, data mining, and information retrieval. The k-center clustering and k-diversity maximization are two of the most well-studied and widely-used problems in this area. Both problems admit sequential algorithms with optimal approximation factors of 2 in any metric space. However, finding distributed algorithms matching the same optimal approximation ratios has been open for more than a decade, with the best current algorithms having factors at least twice the optimal. In this paper, we settle this open problem by presenting constant-round distributed algorithms for k-center clustering and k-diversity maximization in the massively parallel computation (MPC) model, achieving an approximation factor of 2 + ε in any metric space for any constant ε > 0, which is essentially the best possible considering the lower bound of 2 on the approximability of both these problems. Our algorithms are based on a novel technique for approximating vertex degrees and finding a so-called k-bounded maximal independent set in threshold graphs, using only a constant number of MPC rounds. Other applications of our general technique are also implied, including an almost optimal (3 + ε)-approximation algorithm for the k-supplier problem in any metric space in the MPC model.
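The optimal sequential 2-approximation for k-center referred to above is the classic farthest-point greedy; here is a minimal sketch assuming a precomputed distance matrix (the example metric is illustrative, and this is not the paper's MPC algorithm):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy (farthest-point) 2-approximation for k-center in a metric space:
// repeatedly pick the point farthest from the centers chosen so far.
std::vector<int> kCenterGreedy(const std::vector<std::vector<double>>& dist, int k) {
    int n = (int)dist.size();
    std::vector<int> centers = {0};              // arbitrary first center
    std::vector<double> toNearest = dist[0];     // distance to the closest chosen center
    while ((int)centers.size() < k) {
        int far = (int)(std::max_element(toNearest.begin(), toNearest.end())
                        - toNearest.begin());
        centers.push_back(far);
        for (int i = 0; i < n; ++i)
            toNearest[i] = std::min(toNearest[i], dist[far][i]);
    }
    return centers;
}

int main() {
    // Pairwise distances for four points on a line at coordinates 0, 1, 5, 6.
    std::vector<std::vector<double>> d = {{0, 1, 5, 6},
                                          {1, 0, 4, 5},
                                          {5, 4, 0, 1},
                                          {6, 5, 1, 0}};
    for (int c : kCenterGreedy(d, 2)) std::printf("center %d\n", c);
}
```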
ISBN (print): 9781450395458
We study the problem of finding connected components in the Adaptive Massively Parallel Computation (AMPC) model. We show that when we require the total space to be linear in the size of the input graph, the problem can be solved in O(log* n) rounds in forests (with high probability) and 2^{O(log* n)} expected rounds in general graphs. This improves upon an existing O(log log_{m/n} n) round algorithm. For the case when the desired number of rounds is constant, we show that both problems can be solved using Θ(m + n log^{(k)} n) total space in expectation (in each round), where k is an arbitrarily large constant and log^{(k)} is the k-th iterate of the log_2 function. This improves upon existing algorithms requiring Ω(m + n log n) total space.
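The paper's AMPC algorithms are more involved; purely for intuition, the hooking-and-shortcutting pattern that underlies many parallel connectivity algorithms can be simulated sequentially as follows (the example graph is made up, and this is not the algorithm from the paper):

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Label propagation with pointer jumping, simulated sequentially: each round,
// every edge hooks the larger label onto the smaller one, then label chains
// are shortcut until the labels stabilize.
std::vector<int> connectedComponents(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<int> label(n);
    for (int v = 0; v < n; ++v) label[v] = v;
    bool changed = true;
    while (changed) {
        changed = false;
        // Hooking: adopt the smaller label across each edge.
        for (auto [u, v] : edges) {
            int lu = label[u], lv = label[v];
            if (lu < lv) { label[lv] = lu; changed = true; }
            else if (lv < lu) { label[lu] = lv; changed = true; }
        }
        // Shortcutting (pointer jumping): compress label chains.
        for (int v = 0; v < n; ++v)
            while (label[v] != label[label[v]]) label[v] = label[label[v]];
    }
    return label;
}

int main() {
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 2}, {3, 4}};
    std::vector<int> cc = connectedComponents(5, edges);
    for (int v = 0; v < 5; ++v) std::printf("vertex %d -> component %d\n", v, cc[v]);
}
```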
ISBN (print): 9781450395458
This paper studies parallel algorithms for the longest increasing subsequence (LIS) problem. Let n be the input size and k be the LIS length of the input. Sequentially, LIS is a simple problem that can be solved using dynamic programming (DP) in O(n log n) work. However, parallelizing LIS is a long-standing challenge. We are unaware of any parallel LIS algorithm that has optimal O(n log n) work and non-trivial parallelism (i.e., Õ(k) or o(n) span). This paper proposes a parallel LIS algorithm that costs O(n log k) work, Õ(k) span, and O(n) space, and is much simpler than the previous parallel LIS algorithms. We also generalize the algorithm to a weighted version of LIS, which maximizes the weighted sum for all objects in an increasing subsequence. To achieve a better work bound for the weighted LIS algorithm, we designed parallel algorithms for the van Emde Boas (vEB) tree, which has the same structure as the sequential vEB tree and supports work-efficient parallel batch insertion, deletion, and range queries. We also implemented our parallel LIS algorithms. Our implementation is lightweight, efficient, and scalable. On input size 10^9, our LIS algorithm outperforms a highly-optimized sequential algorithm (with O(n log k) cost) on inputs with k ≤ 3×10^5. Our algorithm is also much faster than the best existing parallel implementation by Shen et al. (2022) on all input instances.
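For reference, the sequential O(n log k) baseline mentioned above (DP with binary search over the tails of increasing subsequences) looks roughly like this; it is a sketch of the classic technique, not the paper's parallel algorithm:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Classic sequential LIS via DP with binary search: tails[j] holds the
// smallest possible tail of a (strictly) increasing subsequence of length
// j + 1 seen so far. Since tails never exceeds length k, this is O(n log k).
int lisLength(const std::vector<int>& a) {
    std::vector<int> tails;
    for (int x : a) {
        auto it = std::lower_bound(tails.begin(), tails.end(), x);
        if (it == tails.end()) tails.push_back(x);   // extend the longest subsequence
        else *it = x;                                // improve an existing tail
    }
    return (int)tails.size();
}

int main() {
    std::vector<int> a = {3, 1, 4, 1, 5, 9, 2, 6};
    std::printf("LIS length: %d\n", lisLength(a));   // e.g. 1 4 5 9 -> length 4
}
```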
ISBN (print): 9781450395458
Reducing the execution time of the ORB-SLAM algorithm is a crucial aspect of autonomous vehicles, since it is computationally intensive for embedded boards. We propose a parallel GPU-based implementation, able to run on embedded boards, of the Tracking part of the ORB-SLAM2/3 algorithm. Our implementation is not simply a GPU port of the tracking phase. Instead, we propose a novel method to accelerate image pyramid construction on GPUs. A comparison against state-of-the-art CPU and GPU implementations, considering both computational time and trajectory errors, shows improved execution time on well-known datasets such as KITTI and EuRoC.
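As a point of reference only, the structure of image pyramid construction is repeated downsampling of the input image; the plain CPU sketch below uses a power-of-two 2x2 box filter and a float image type as illustrative assumptions (ORB extractors typically use a finer scale factor, and this is not the paper's GPU method):

```cpp
#include <cstdio>
#include <vector>

// A grayscale image stored row-major with float pixels.
struct Image {
    int w, h;
    std::vector<float> px;
    float at(int x, int y) const { return px[y * w + x]; }
};

// Build an image pyramid by repeated 2x2 box-filter downsampling.
std::vector<Image> buildPyramid(const Image& base, int levels) {
    std::vector<Image> pyr = {base};
    for (int l = 1; l < levels && pyr.back().w > 1 && pyr.back().h > 1; ++l) {
        const Image& prev = pyr.back();
        Image next{prev.w / 2, prev.h / 2, {}};
        next.px.resize((size_t)next.w * next.h);
        for (int y = 0; y < next.h; ++y)
            for (int x = 0; x < next.w; ++x)
                next.px[y * next.w + x] =
                    0.25f * (prev.at(2 * x, 2 * y)     + prev.at(2 * x + 1, 2 * y) +
                             prev.at(2 * x, 2 * y + 1) + prev.at(2 * x + 1, 2 * y + 1));
        pyr.push_back(std::move(next));
    }
    return pyr;
}

int main() {
    Image img{8, 8, std::vector<float>(64, 1.0f)};
    for (const auto& lvl : buildPyramid(img, 4)) std::printf("level: %dx%d\n", lvl.w, lvl.h);
}
```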
ISBN (print): 9781450395458
Semisort is a fundamental algorithmic primitive widely used in the design and analysis of efficient parallel algorithms. It takes as input an array of records and a function extracting a key per record, and reorders the records so that those with equal keys are contiguous. Since many applications only require collecting equal values, but not fully sorting the input, semisort is broadly applicable, e.g., in string algorithms, graph analytics, and geometry processing, among many other domains. However, despite dozens of recent papers that use semisort in their theoretical analysis and the existence of an asymptotically optimal parallel semisort algorithm, most implementations of these parallel algorithms choose to implement semisort using comparison or integer sorting in practice, due to potential performance issues in existing semisort implementations. In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. Our approach can easily be extended to two related problems, histogram and collect-reduce. Our algorithms achieve strong speedups in practice and, importantly, outperform state-of-the-art parallel sorting and semisorting methods for almost all settings we tested, with varying input sizes, distributions, and key types. On average (geometric mean), our semisort implementation is at least 1.27x faster than the best of the tested baselines. We also test two important applications with real-world data and show that our algorithms improve performance (by up to 2.13x) over existing approaches. We believe that many other parallel algorithm implementations can be accelerated using our results.
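As a reminder of the problem definition (and not the paper's parallel algorithm), a sequential semisort can simply bucket records by key and concatenate the buckets; a minimal sketch, where the key-extraction function is caller-supplied as in the abstract:

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sequential semisort sketch: reorder records so that records with equal keys
// become contiguous, without imposing any order between different keys.
template <class Record, class KeyFn>
std::vector<Record> semisort(const std::vector<Record>& in, KeyFn key) {
    std::unordered_map<decltype(key(in[0])), std::vector<Record>> groups;
    for (const Record& r : in) groups[key(r)].push_back(r);   // bucket by key
    std::vector<Record> out;
    out.reserve(in.size());
    for (auto& [k, bucket] : groups)                          // concatenate buckets
        for (Record& r : bucket) out.push_back(std::move(r));
    return out;
}

int main() {
    std::vector<std::pair<std::string, int>> recs =
        {{"b", 1}, {"a", 2}, {"b", 3}, {"c", 4}, {"a", 5}};
    auto grouped = semisort(recs, [](const auto& r) { return r.first; });
    for (auto& [k, v] : grouped) std::printf("%s %d\n", k.c_str(), v);
}
```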
ISBN (print): 9781450395458
In this paper, we focus on the parallel communication cost of multiplying a matrix with its transpose, known as a symmetric rank-k update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memory-independent parallel communication lower bounds for SYRK with smaller constants than general matrix multiplication, and we show that these constants are tight by presenting communication-optimal algorithms. The crux of the lower bound proof relies on extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme for parallel distribution of the symmetric output matrix and corresponding computation.
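For reference, the computation SYRK performs, restricted to the lower triangle of the symmetric output, is sketched below; this is the sequential kernel the paper's communication-optimal parallel algorithms distribute, not those algorithms themselves:

```cpp
#include <cstdio>
#include <vector>

// Sequential SYRK sketch: C = A * A^T for an n x k matrix A (row-major),
// computing only the lower triangle of C and thus roughly half the flops
// of a general matrix multiply; the upper triangle follows by symmetry.
void syrkLower(const std::vector<double>& A, int n, int k, std::vector<double>& C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j) {           // only j <= i: lower triangle
            double s = 0.0;
            for (int t = 0; t < k; ++t) s += A[i * k + t] * A[j * k + t];
            C[i * n + j] = s;                    // C[j][i] is implied by symmetry
        }
}

int main() {
    int n = 3, k = 2;
    std::vector<double> A = {1, 2,
                             3, 4,
                             5, 6};
    std::vector<double> C(n * n, 0.0);
    syrkLower(A, n, k, C);
    for (int i = 0; i < n; ++i, std::printf("\n"))
        for (int j = 0; j <= i; ++j) std::printf("%6.1f ", C[i * n + j]);
}
```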
ISBN (print): 9798400703836
We show how to use parallelization to speed up sampling from an arbitrary distribution μ on a product space [q]^n, given oracle access to counting queries: Pr_{X∼μ}[X_S = σ_S] for any S ⊆ [n] and σ_S ∈ [q]^S. Our algorithm takes O(n^{2/3} · polylog(n, q)) parallel time, which is, to the best of our knowledge, the first sublinear-in-n runtime for arbitrary distributions. Our results have implications for sampling in autoregressive models. Our algorithm directly works with an equivalent oracle that answers conditional marginal queries Pr_{X∼μ}[X_i = σ_i | X_S = σ_S], whose role is played by a trained neural network in autoregressive models. This suggests a roughly n^{1/3}-factor speedup is possible for sampling in any-order autoregressive models. We complement our positive result by showing a lower bound of Ω̃(n^{1/3}) for the runtime of any parallel sampling algorithm making at most poly(n) queries to the counting oracle, even for q = 2.
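For context, the sequential baseline such a parallel sampler improves on draws coordinates one at a time from the conditional-marginal oracle; a minimal sketch, with a placeholder uniform oracle standing in for a trained autoregressive model (the types and names here are illustrative assumptions):

```cpp
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

// Conditional-marginal oracle: Pr[X_i = value | coordinates in `prefix` fixed].
using Oracle = std::function<double(int i, int value, const std::vector<int>& prefix)>;

// Sequential autoregressive sampling: fix X_0, ..., X_{n-1} one coordinate at
// a time, conditioning each draw on the coordinates already fixed. This takes
// n sequential rounds of oracle calls, which is what parallelization targets.
std::vector<int> sampleSequentially(int n, int q, const Oracle& condMarginal,
                                    std::mt19937& rng) {
    std::vector<int> x;
    for (int i = 0; i < n; ++i) {
        std::vector<double> probs(q);
        for (int s = 0; s < q; ++s) probs[s] = condMarginal(i, s, x);
        std::discrete_distribution<int> d(probs.begin(), probs.end());
        x.push_back(d(rng));                  // coordinate i is now fixed
    }
    return x;
}

int main() {
    std::mt19937 rng(42);
    int n = 8, q = 2;
    Oracle uniform = [&](int, int, const std::vector<int>&) { return 1.0 / q; };
    for (int v : sampleSequentially(n, q, uniform, rng)) std::printf("%d ", v);
    std::printf("\n");
}
```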
ISBN (print): 9781450395458
Determining the degree of inherent parallelism in classical sequential algorithms and leveraging it for fast parallel execution is a key topic in parallel computing, and detailed analyses are known for a wide range of classical algorithms. In this paper, we perform the first such analysis for the fundamental Union-Find problem, in which we are given a graph as a sequence of edges, and must maintain its connectivity structure under edge additions. We prove that classic sequential algorithms for this problem are well-parallelizable under reasonable assumptions, addressing a conjecture by [Blelloch, 2017]. More precisely, we show via a new potential argument that, under uniform random edge ordering, parallel union-find operations are unlikely to interfere: T concurrent threads processing the graph in parallel will encounter memory contention O(T^2 · log|V| · log|E|) times in expectation, where |E| and |V| are the number of edges and nodes in the graph, respectively. We leverage this result to design a new parallel Union-Find algorithm that is not only internally deterministic, i.e., its results are guaranteed to match those of a sequential execution, but also work-efficient and scalable, as long as the number of threads T is O(|E|^{1/3-ε}), for an arbitrarily small constant ε > 0, which holds for most large real-world graphs. We present lower bounds which show that our analysis is close to optimal, and experimental results suggesting that the performance cost of internal determinism is limited.
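For reference, the classic sequential Union-Find algorithm whose parallelizability the paper analyzes (path compression plus union by rank, processing a sequence of edges) can be sketched as follows:

```cpp
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

// Classic sequential Union-Find maintaining connectivity under edge additions.
struct UnionFind {
    std::vector<int> parent, rank_;
    explicit UnionFind(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int v) {                       // path halving (a path-compression variant)
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }
    bool unite(int u, int v) {              // union by rank
        u = find(u); v = find(v);
        if (u == v) return false;
        if (rank_[u] < rank_[v]) std::swap(u, v);
        parent[v] = u;
        if (rank_[u] == rank_[v]) ++rank_[u];
        return true;
    }
};

int main() {
    UnionFind uf(5);
    std::vector<std::pair<int, int>> edges = {{0, 1}, {2, 3}, {1, 2}};
    for (auto [u, v] : edges) uf.unite(u, v);
    std::printf("0 and 3 connected: %s\n", uf.find(0) == uf.find(3) ? "yes" : "no");
    std::printf("0 and 4 connected: %s\n", uf.find(0) == uf.find(4) ? "yes" : "no");
}
```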