ISBN (Print): 9781450335881
The performance of parallel algorithms for sparse matrix-matrix multiplication is typically determined by the amount of interprocessor communication performed, which in turn depends on the nonzero structure of the input matrices. In this paper, we characterize the communication cost of a sparse matrix-matrix multiplication algorithm in terms of the size of a cut of an associated hypergraph that encodes the computation for a given input nonzero structure. Obtaining an optimal algorithm corresponds to solving a hypergraph partitioning problem. Our hypergraph model generalizes several existing models for sparse matrix-vector multiplication, and we can leverage hypergraph partitioners developed for that computation to improve application-specific algorithms for multiplying sparse matrices.
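To make the hypergraph formulation concrete, here is a minimal Python sketch of a fine-grained model in the spirit described above: one vertex per elementary product A[i,k]*B[k,j], and one net per nonzero of A, of B, and of C grouping the products that touch it. The construction and names are illustrative assumptions, not necessarily the paper's exact model; a hypergraph partitioner would then be applied to the result.

```python
from collections import defaultdict

def spgemm_hypergraph(A, B):
    """A, B: dicts mapping (row, col) -> value for the nonzero entries."""
    B_rows = defaultdict(list)
    for (k, j) in B:
        B_rows[k].append(j)
    vertices = []                           # one vertex per elementary product A[i,k]*B[k,j]
    nets = defaultdict(set)                 # net id -> set of vertex indices
    for (i, k) in A:
        for j in B_rows[k]:
            vid = len(vertices)
            vertices.append((i, k, j))
            nets[("A", i, k)].add(vid)      # products that read the nonzero A[i,k]
            nets[("B", k, j)].add(vid)      # products that read the nonzero B[k,j]
            nets[("C", i, j)].add(vid)      # products summed into the nonzero C[i,j]
    return vertices, nets

A = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
B = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}
verts, nets = spgemm_hypergraph(A, B)
print(len(verts), "vertices,", len(nets), "nets")
# Assigning vertices to processors and counting the nets that span more than one
# part then estimates the communication volume of that assignment.
```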
ISBN (Print): 9780897913195
The authors have simulated several numerical and nonnumerical algorithms on five distributed-memory parallel processors (DMPPs). All five DMPPs have the same topology (a torus) and the same number of nodes. The architectures differ only in the speed of communication between neighboring nodes, with the computation unit unchanged. The authors quantify the effect that interprocessor communication speed and synchronization overhead have on the performance of the DMPPs. After introducing their rationale and reviewing related work, the authors present and discuss the results of the simulations.
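As a rough illustration of why link speed and synchronization overhead matter, the following back-of-the-envelope cost model (an assumption for illustration, not the authors' simulator) charges each algorithm step with compute time plus per-message latency, bandwidth, and a synchronization term:

```python
def step_time(flops, flop_rate, n_msgs, msg_bytes, latency, bandwidth, sync):
    """Time of one algorithm step: local compute + neighbor messages + synchronization."""
    compute = flops / flop_rate
    communicate = n_msgs * (latency + msg_bytes / bandwidth)
    return compute + communicate + sync

# same compute unit, two different interconnects (all numbers are made up)
fast = step_time(1e6, 1e8, 4, 8e3, 1e-6, 1e9, 1e-6)
slow = step_time(1e6, 1e8, 4, 8e3, 1e-4, 1e7, 1e-4)
print(f"slowdown from slower links and synchronization: {slow / fast:.2f}x")
```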
ISBN (Print): 9781450369350
The randomized incremental convex hull algorithm is one of the most practical and important geometric algorithms in the literature. Due to its simplicity, and the fact that many points or facets can be added independently, it is also widely used in parallel convex hull implementations. However, to date there have been no non-trivial theoretical bounds on the parallelism available in these implementations. In this paper, we provide a strong theoretical analysis showing that the standard incremental algorithm is inherently parallel. In particular, we show that for n points in any constant dimension, the algorithm has O(log n) dependence depth with high probability. This leads to a simple work-optimal parallel algorithm with polylogarithmic span with high probability. Our key technical contribution is a new definition and analysis of the configuration dependence graph extending the traditional configuration space, which allows for asynchrony in adding configurations. To capture the "true" dependence between configurations, we define the support set of configuration c to be the set of already added configurations that it depends on. We show that for problems where the size of the support set can be bounded by a constant, the depth of the configuration dependence graph is shallow (O(log n) with high probability for input size n). In addition to convex hull, our approach also extends to several related problems, including half-space intersection and finding the intersection of a set of unit circles. We believe that the configuration dependence graph and its analysis are a general idea that could potentially be applied to more problems.
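The following Python sketch gives a toy flavor of the dependence-depth idea for the 2D upper hull: points are inserted in random order, and each point's depth is one more than the deepest earlier point it interacts with (its hull neighbours and any hull vertices it removes). This bookkeeping is a simplification of the paper's support-set definition and is meant only as an illustration.

```python
import random
from bisect import bisect_left

def cross(o, a, b):
    # > 0 iff the turn o -> a -> b is counter-clockwise
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def upper_hull_dependence_depth(points, seed=0):
    rng = random.Random(seed)
    pts = points[:]
    rng.shuffle(pts)                        # random insertion order
    hull = []                               # current upper hull, sorted by x
    depth = {}                              # point -> toy dependence depth
    for p in pts:
        i = bisect_left(hull, p)
        # p "reads" its hull neighbours at insertion time
        d = 1 + max((depth[q] for q in hull[max(0, i - 1):i + 1]), default=0)
        if 0 < i < len(hull) and cross(hull[i - 1], hull[i], p) <= 0:
            depth[p] = d                    # p lies below the hull: no structural change
            continue
        removed = []
        hull.insert(i, p)
        while i >= 2 and cross(hull[i - 2], hull[i - 1], p) >= 0:
            removed.append(hull.pop(i - 1))
            i -= 1
        while i + 2 < len(hull) and cross(p, hull[i + 1], hull[i + 2]) >= 0:
            removed.append(hull.pop(i + 1))
        depth[p] = max([d] + [depth[q] + 1 for q in removed])
    return max(depth.values())

pts = [(random.random(), random.random()) for _ in range(100_000)]
print(upper_hull_dependence_depth(pts))     # the measured depth stays small relative to n
```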
ISBN (Print): 0897915119
Given a pattern string, we describe a way to preprocess it. We design a constant-time optimal parallel algorithm for finding all occurrences of the (preprocessed) pattern in any given text.
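For contrast with the result above, here is the naive data-parallel baseline (not the paper's algorithm): every text position can be tested independently in one conceptual parallel step, giving constant depth but Theta(nm) work; the point of the pattern preprocessing is to reach constant time with optimal work.

```python
def all_occurrences(text, pattern):
    m = len(pattern)
    # conceptually, each starting position is checked by its own processor
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]

print(all_occurrences("abababa", "aba"))    # [0, 2, 4]
```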
ISBN (Print): 089871513X
There are several fundamental problems for which there are optimal randomized algorithms, but whose deterministic complexity remains unresolved. Among such problems are the minimum spanning tree problem, the set maxima problem, the problem of computing connected components and (minimum) spanning trees in parallel, and the problem of performing sensitivity analysis on shortest path trees and minimum spanning trees. For each of these problems there is an optimal randomized algorithm which uses a linear number of random bits. We propose new algorithms (or adapt old ones) for these problems which preserve optimality while saving an exponential number of random bits. In the case of computing minimum spanning trees and MST/SSSP sensitivity analysis, we reduce the dependence on randomness to log* n random bits. We also consider the problem of selection, for which we give two algorithms which make an expected 1.5n + o(n) comparisons; one uses O(log n) random bits and is uniform, the other uses O(log log n) random bits and is non-uniform.
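The selection result sits in the tradition of sampling-based selection; the sketch below is a Floyd-Rivest-style routine (an illustrative reconstruction, not the paper's reduced-randomness algorithm) that samples to find two pivots bracketing the target rank and recurses on the surviving band.

```python
import math
import random

def select(a, k, rng=random.Random(0)):
    """Return the k-th smallest (0-indexed) element of the list a."""
    n = len(a)
    if n <= 600:
        return sorted(a)[k]
    s = int(n ** (2 / 3))                    # sample size ~ n^{2/3}
    sample = sorted(rng.sample(a, s))
    idx = k * s // n                         # expected rank of the target in the sample
    g = int(math.sqrt(s))                    # safety gap around that rank
    lo, hi = sample[max(0, idx - g)], sample[min(s - 1, idx + g)]
    below = [x for x in a if x < lo]
    mid = [x for x in a if lo <= x <= hi]
    if len(mid) == n:                        # unlucky sample: fall back to sorting
        return sorted(a)[k]
    if k < len(below):
        return select(below, k, rng)
    if k < len(below) + len(mid):
        return select(mid, k - len(below), rng)
    above = [x for x in a if x > hi]
    return select(above, k - len(below) - len(mid), rng)

data = [random.random() for _ in range(100_000)]
print(select(data, len(data) // 2))          # the median
```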
ISBN (Print): 9781450367059
In this paper we study submodular maximization under a matroid constraint in the adaptive complexity model. This model was recently introduced in the context of submodular optimization to quantify the information-theoretic complexity of black-box optimization in a parallel computation model. Informally, the adaptivity of an algorithm is the number of sequential rounds it makes when each round can execute polynomially many function evaluations in parallel. Since submodular optimization is regularly applied on large datasets, we seek algorithms with low adaptivity to enable speedups via parallelization. Consequently, a recent line of work has been devoted to designing constant factor approximation algorithms for maximizing submodular functions under various constraints in the adaptive complexity model. Despite the burst in work on submodular maximization in the adaptive complexity model, the fundamental problem of maximizing a monotone submodular function under a matroid constraint has remained elusive. In particular, all known techniques fail for this problem and there are no known constant factor approximation algorithms whose adaptivity is sublinear in the rank of the matroid k or, in the worst case, sublinear in the size of the ground set n. In this paper we present an approximation algorithm for the problem of maximizing a monotone submodular function under a matroid constraint in the adaptive complexity model. The approximation guarantee of the algorithm is arbitrarily close to the optimal 1 - 1/e and it has near optimal adaptivity of O(log(n) log(k)). This result is obtained using a novel technique of adaptive sequencing which departs from previous techniques for submodular maximization in the adaptive complexity model. In addition to our main result we show how to use this technique to design other approximation algorithms with strong approximation guarantees and polylogarithmic adaptivity.
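The snippet below illustrates only what "adaptivity" counts in this model (it is not the paper's adaptive-sequencing algorithm): all function evaluations passed to one batch call could run in parallel and therefore cost a single round, and ordinary greedy on a toy coverage function already needs about k rounds.

```python
class CountingOracle:
    """Counts adaptive rounds: all queries passed to one batch() call could be
    evaluated in parallel, so they cost a single round."""
    def __init__(self, f):
        self.f, self.rounds = f, 0
    def batch(self, queries):
        self.rounds += 1
        return [self.f(S) for S in queries]

# toy monotone submodular function: coverage of elements by sets
sets = {0: {1, 2}, 1: {2, 3}, 2: {4}, 3: {1, 4, 5}}
cover = lambda S: len(set().union(*(sets[i] for i in S))) if S else 0
oracle = CountingOracle(cover)

k, chosen, remaining = 2, [], sorted(sets)
while len(chosen) < k and remaining:
    # one adaptive round: query the current value and every candidate's gain at once
    vals = oracle.batch([chosen] + [chosen + [c] for c in remaining])
    base, gains = vals[0], vals[1:]
    best = max(range(len(remaining)), key=lambda i: gains[i] - base)
    chosen.append(remaining.pop(best))

print("chosen:", chosen, "adaptive rounds:", oracle.rounds)  # greedy needs ~k rounds
```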
ISBN (Print): 9781450340342
We propose the hybrid dataflow/von Neumann vector graph instruction word (VGIW) architecture. This data-parallel architecture concurrently executes each basic block's dataflow graph (graph instruction word) for a vector of threads, and schedules the different basic blocks based on von Neumann control flow semantics. The VGIW processor dynamically coalesces all threads that need to execute a specific basic block into a thread vector and, when the block is scheduled, executes the entire thread vector concurrently. The proposed control flow coalescing model enables the VGIW architecture to overcome the control flow divergence problem, which greatly impedes the performance and power efficiency of data-parallel architectures. Furthermore, using von Neumann control flow semantics enables the VGIW architecture to overcome the limitations of the recently proposed single-graph multiple-flows (SGMF) dataflow GPGPU, which is greatly constrained in the size of the kernels it can execute. Our evaluation shows that VGIW can achieve an average speedup of 3x (up to 11x) over an NVIDIA GPGPU, while providing an average 1.75x better energy efficiency (up to 7x).
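A toy software model of the control-flow-coalescing idea (an illustrative sketch, not the VGIW microarchitecture): each thread carries the label of its next basic block, and the scheduler repeatedly picks one block and executes it for the whole vector of threads currently waiting on it.

```python
from collections import defaultdict

def block_A(state):            # each block returns the label of the next block
    state["x"] += 1
    return "B" if state["x"] % 2 else "C"

def block_B(state):
    state["y"] += state["x"]
    return "C"

def block_C(state):
    return None                # thread terminates

BLOCKS = {"A": block_A, "B": block_B, "C": block_C}

def run_coalesced(n_threads):
    states = [{"tid": t, "x": t, "y": 0} for t in range(n_threads)]
    waiting = defaultdict(list)
    for s in states:
        waiting["A"].append(s)
    steps = 0
    while waiting:
        # schedule the block with the largest waiting thread vector (greedy choice)
        label = max(waiting, key=lambda b: len(waiting[b]))
        vector = waiting.pop(label)
        steps += 1                       # one "graph instruction word" issue
        for s in vector:                 # conceptually executed in lockstep
            nxt = BLOCKS[label](s)
            if nxt is not None:
                waiting[nxt].append(s)
    return steps, states

print(run_coalesced(8))                  # 8 threads finish in a handful of vector issues
```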
ISBN (Print): 9781611974782
The Lovasz Local Lemma (LLL) is a cornerstone principle in the probabilistic method of combinatorics, and a seminal algorithm of Moser & Tardos (2010) provides an efficient randomized algorithm to implement it. This algorithm can be parallelized to give an algorithm that uses polynomially many processors and runs in O(log^3 n) time, stemming from O(log n) adaptive computations of a maximal independent set (MIS). Chung et al. (2014) developed faster local and parallel algorithms, potentially running in time O(log^2 n), but these algorithms work under significantly more stringent conditions than the LLL. We give a new parallel algorithm that works under essentially the same conditions as the original algorithm of Moser & Tardos but uses only a single MIS computation, thus running in O(log^2 n) time. This conceptually new algorithm also gives a clean combinatorial description of a satisfying assignment which might be of independent interest. Our techniques extend to the deterministic LLL algorithm given by Chandrasekaran et al. (2013), leading to an NC algorithm running in time O(log^2 n) as well. We also provide improved bounds on the runtimes of the sequential and parallel resampling-based algorithms originally developed by Moser & Tardos. Our bounds extend to any problem instance in which the tighter Shearer LLL criterion is satisfied. We also improve on the analysis of Kolipaka & Szegedy (2011) to give tighter concentration results.
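For reference, the basic sequential Moser-Tardos resampling loop is easy to state; the sketch below applies it to k-SAT (the usual illustrative setting) and is not the single-MIS parallel variant contributed by the paper.

```python
import random

def moser_tardos_sat(clauses, n_vars, rng=random.Random(0)):
    """clauses: list of tuples of literals, e.g. (1, -2, 3) means x1 or not-x2 or x3."""
    assign = [rng.random() < 0.5 for _ in range(n_vars + 1)]   # 1-indexed variables

    def violated(c):
        return not any(assign[abs(l)] == (l > 0) for l in c)

    while True:
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assign[1:]
        for l in bad[0]:                   # resample the variables of one violated clause
            assign[abs(l)] = rng.random() < 0.5

clauses = [(1, 2, 3), (-1, 2, -3), (1, -2, 3), (-1, -2, -3)]
print(moser_tardos_sat(clauses, 3))
```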
We study distributed algorithms for the string matching problem in the presence of wildcard characters. Given a string T (a text), we look for all occurrences of another string P (a pattern) as a substring of string T. Each w...
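As a small illustration of the problem statement (not the distributed algorithm studied here), the following assumes '?' as the wildcard symbol and checks every alignment naively:

```python
def wildcard_occurrences(text, pattern):
    m = len(pattern)
    # '?' in the pattern matches any single text character
    return [i for i in range(len(text) - m + 1)
            if all(pc == '?' or pc == text[i + j] for j, pc in enumerate(pattern))]

print(wildcard_occurrences("abcabd", "ab?"))   # [0, 3]
```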