A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one port memories) in order to minimize contention;h...
详细信息
ISBN:
(纸本)9781581138405
A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one port memories) in order to minimize contention;however, this minimizes memory sharing. Idealized sharing occurs by using a single shared memory for all processors but this maximizes contention. Instead, in this paper we show that perfect memory sharing of shared memory can be achieved with a collection of *two*-port memories, as long as the number of processors is less than the number of memories. We show that the problem of allocation is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2. The proof utilizes a new bin packing model, which is interesting in its own right. Further, for important special cases that arise in practice the approximation algorithm is indeed optimal. We also describe an incremental memory allocation algorithm that provides good memory utilization while allowing fast updates.
parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplicati...
详细信息
ISBN:
(纸本)9781450312134
parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms. Copyright 2012 acm.
algorithms for dynamically maintaining minimum spanning trees (MSTs) have received much attention in both the parallel and sequential settings. While previous work has given optimal algorithms for dense graphs, all ex...
详细信息
ISBN:
(纸本)9781450369350
algorithms for dynamically maintaining minimum spanning trees (MSTs) have received much attention in both the parallel and sequential settings. While previous work has given optimal algorithms for dense graphs, all existing parallel batch-dynamic algorithms perform polynomial work per update in the worst case for sparse graphs. In this paper, we present the first work-efficient parallel batch-dynamic algorithm for incremental MST, which can insert l edges in O(l lg(1 + n/l)) work in expectation and O(polylog(n)) span w.h.p. The key ingredient of our algorithm is an algorithm for constructing a compressed path tree of an edge-weighted tree, which is a smaller tree that contains all pairwise heaviest edges between a given set of marked vertices. Using our batch-incremental MST algorithm, we demonstrate a range of applications that become efficiently solvable in parallel in the sliding-window model, such as graph connectivity, approximate MSTs, testing bipartiteness, k-certificates, cycle-freeness, and maintaining sparsifiers.
We consider the problem of asynchronous execution of parallel programs. The original program is assumed to be designed for a synchronous system, while the actual system may be asynchronous. We seek an automatic execut...
详细信息
ISBN:
(纸本)9780897918091
We consider the problem of asynchronous execution of parallel programs. The original program is assumed to be designed for a synchronous system, while the actual system may be asynchronous. We seek an automatic execution scheme, which allows the asynchronous system to execute the synchronous program. Previous solutions to this problem provide a solution only for the case where the original program is deterministic. Here, we provide the first solution for the nondeterministic case (e.g. randomized programs). Our scheme is based on a novel agreement protocol for this setting. Our protocol allows n asynchronous processors to agree on n word-sized values in O(n log n log log n) total work. Total work is defined to be the summation of the number of steps performed by all processes (including busy waiting).
A family of VLSI architectures for computing an (n 1 × n 2 × • • " × nd)-point multidimensional D1VF (MDDFT) over Zm, the ring of integers modulo M, is presented. These architectures achieve VLSI a...
详细信息
The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In thi...
详细信息
ISBN:
(纸本)9781450312134
The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate will only depend on a subset of the previous iterates (i.e. knowing that any one of a vertex's previous neighbors is in the MIS, or knowing that it has no previous neighbors, is sufficient to decide its fate one way or the other). This leads to a dependence structure among the iterates. If this structure is shallow then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation mimicking the sequential algorithm. In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence length of the sequential greedy MIS algorithm is polylogarithmic (O(log2 n) with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for greedy maximal matching (MM). For both problems we describe simple linear-work parallelalgorithms based on the approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism. Copyright 2012 acm.
In this paper. we show new parallelalgorithms for a set of classical string comparison problems. computation of string alignments. longest common subsequences (LCS) or edit distances, and longest increasing subsequen...
详细信息
ISBN:
(纸本)9781450300797
In this paper. we show new parallelalgorithms for a set of classical string comparison problems. computation of string alignments. longest common subsequences (LCS) or edit distances, and longest increasing subsequence computation. These problems have a wide range of applications, in particular in computational biology and signal processing. We discuss the scalability of our new parallelalgorithms in computation time, in memory, and in commumcation Our new algorithms are based on an efficient parallel method for (min, +)-multiplication of distance matrices The core result of this paper is a scalable parallel algorithm for multiplying Implicit simple unit-Monge matrices of size n x n on p processors using lime O(n log n/p), communication O(n log p/p) and O( log p) supersteps. This algorithm allows us to implement scalable LCS computation for two strings of length n using time O(n(2)/p) and communication O(n/root P) requiring local memory of size O(n/root P) on each processor Furthermore, our algorithm can be used to obtain the first generally work-scalable algorithm for computing the longest increasing subsequence (LIS) Our algorithm for LIS computation requires computation O(n log(2) n/p), communication O(n log(2) p/p), and O(log(2) p) supersteps for computing the LIS of a sequence of length n This is within a log n factor of work-optimality for the LIS problem. which can be solved sequentially in time O(n log n) in the comparison-based model. Our LIS algorithm is also within a log p-factor of achieving perfectly scalable communication and furthermore has perfectly scalable memory size requirements of O(n/p) per processor
The proceedings contain 44 papers. The topics discussed include: deterministic distributed sparse and ultra-sparse spanners and connectivity certificates;fully polynomial-time distributed computation in low-treewidth ...
ISBN:
(纸本)9781450391467
The proceedings contain 44 papers. The topics discussed include: deterministic distributed sparse and ultra-sparse spanners and connectivity certificates;fully polynomial-time distributed computation in low-treewidth graphs;adaptive massively parallelalgorithms for cut problems;preparing for disaster: leveraging precomputation to efficiently repair graph structures upon failures;the energy complexity of Las Vegas leader election;a fully-distributed peer-to-peer protocol for byzantine-resilient distributed hash tables;brief announcement: the (limited) power of multiple identities: asynchronous byzantine reliable broadcast with improved resilience through collusion;brief announcement: composable dynamic secure emulation;and robust and optimal contention resolution without collision detection.
We use O(log2 n) parallel arithmetic steps and n2 processors to compute the least-squares solution x = A+b to a linear system of equations, Ax = b, given a g × h matrix A and a vector b, both filled with integers...
详细信息
ISBN:
(纸本)0897913701
We use O(log2 n) parallel arithmetic steps and n2 processors to compute the least-squares solution x = A+b to a linear system of equations, Ax = b, given a g × h matrix A and a vector b, both filled with integers or with rational numbers, provided that g + h ≤ 2n and that A is given with its displacement generator of length r = O(1) and thus has displacement rank O(1). For a vector b and for a general p × q matrix A (with p + q ≤ n), we compute A+ and A+b by using O(log2 n) parallel arithmetic steps and n2.851 processors, and we may also compute A+b by using O(n2.376) arithmetic operations.
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" a...
详细信息
ISBN:
(纸本)9781450307437
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++ which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.
暂无评论