This paper shows how n-node, e-edge graphs can be contracted in a manner similar to the parallel tree contraction algorithm due to Miller and Reif. We give an O((n + e)/lgn)-processor deterministic algorithm that cont...
ISBN: 9781479941162 (print)
We propose a novel computational model for GPU. Known parallel computational models such as the PRAM model are not appropriate for evaluating GPU algorithms. Our model, called AGPU, abstracts the essence of current GPU architectures such as global and shared memory, memory coalescing and bank conflicts. We can therefore evaluate asymptotic behavior of GPU algorithms more accurately than known models and we can develop algorithms that are efficient on many real architectures. As a showcase, we first analyze known comparison-based sorting algorithms using the AGPU model and show that they are not I/O optimal, that is, the number of global memory accesses is more than necessary. Then we propose a new algorithm which uses an asymptotically optimal number of global memory accesses and whose time complexity is also nearly optimal.
ISBN: 0897913701 (print)
Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parameters. The model gives rise to a large Markov chain, which is solved to obtain a formula for processor efficiency in terms of the number of threads per processor, the remote reference rate, the latency, and the cost of switching between threads. It is shown that a multithreaded processor exhibits three operating regimes: linear (efficiency is proportional to the number of threads), transition, and saturation (efficiency depends only on the remote reference rate and switch cost). Formulae for regime boundaries are derived. The model is embellished to reflect cache degradation due to multithreading, using an analytical model of cache behavior, demonstrating that returns diminish as the number of threads becomes large. Predictions from the embellished model correlate well with published empirical measurements. Prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.
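The linear and saturation regimes described in the abstract above can be illustrated with a simplified deterministic approximation. This is a sketch only, not the paper's Markov-chain model: it assumes every thread runs for a fixed `run_length` cycles between remote references (the inverse of the remote reference rate), that each remote reference takes a fixed `latency`, and that each context switch costs `switch_cost`. All function and parameter names are hypothetical, and the transition regime and cache effects are ignored.

```python
def efficiency(n_threads: int, run_length: float, latency: float,
               switch_cost: float) -> float:
    """Fraction of processor time spent on useful work (deterministic sketch)."""
    # Linear regime: too few threads to hide the latency, so the processor
    # idles and efficiency grows in proportion to the number of threads.
    linear = n_threads * run_length / (run_length + switch_cost + latency)
    # Saturation regime: latency is fully hidden; only switch overhead remains,
    # so efficiency depends on run_length and switch_cost alone.
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


def saturation_point(run_length: float, latency: float,
                     switch_cost: float) -> float:
    """Thread count at which the two regimes meet in this simplified model."""
    return 1.0 + latency / (run_length + switch_cost)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(efficiency(n, run_length=10, latency=100, switch_cost=2), 3))
```

Under these assumptions the saturation point is small whenever the latency is only a modest multiple of the run length plus switch cost, which is in line with the abstract's remark that the number of useful threads per processor is fairly small.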
ISBN: 9781450367059 (print)
We consider parallel, or low adaptivity, algorithms for submodular function maximization. This line of work was recently initiated by Balkanski and Singer and has already led to several interesting results on the cardinality constraint and explicit packing constraints. An important open problem is the classical setting of matroid constraint, which has been instrumental for developments in submodular function maximization. In this paper we develop a general strategy to parallelize the well-studied greedy algorithm and use it to obtain a randomized (1/2 - ε)-approximation in O(log^2(n)/ε^2) rounds of adaptivity. We rely on this algorithm, and an elegant amplification approach due to Badanidiyuru and Vondrak, to obtain a fractional solution that yields a near-optimal randomized (1 - 1/e - ε)-approximation in O(log^2(n)/ε^3) rounds of adaptivity. For non-negative functions we obtain a (3 - 2√2 - ε)-approximation and a fractional solution that yields a (1/e - ε)-approximation. Our approach for parallelizing greedy yields approximations for intersections of matroids and matchoids, and the approximation ratios are comparable to those known for sequential greedy.
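For orientation, the sequential greedy algorithm that the abstract above sets out to parallelize can be sketched as follows. This is the textbook sequential baseline, not the paper's low-adaptivity algorithm, and it assumes the simplest setting: a cardinality constraint (a uniform matroid) and a coverage function as the submodular objective. The function name and input format are hypothetical.

```python
def greedy_max_coverage(sets: dict[str, set[int]], k: int) -> list[str]:
    """Pick up to k sets, each time taking the largest marginal coverage gain."""
    chosen: list[str] = []
    covered: set[int] = set()
    for _ in range(k):
        best, best_gain = None, 0
        for name, items in sets.items():
            if name in chosen:
                continue
            gain = len(items - covered)  # marginal value of adding this set
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining set adds anything new
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen


if __name__ == "__main__":
    example = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
    print(greedy_max_coverage(example, k=2))  # ['c', 'a']
```

Each of the k iterations depends on the set chosen in the previous one, so this baseline needs k adaptive rounds; reducing that dependence is the issue the abstract addresses.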
ISBN: 0769524451 (print)
In recent years, reconfigurable technology has emerged as a popular choice for implementing various types of cryptographic functions. Nevertheless, insufficient effort has been devoted to fully exploiting the tremendous parallelism intrinsic to FPGAs for this class of algorithms. In this paper, we focus on block cipher architectures and explore design decisions that leverage the multi-grained parallelism inherent in many of these algorithms. We demonstrate the usefulness of this approach with a highly parallel FPGA implementation of the AES standard, and present results detailing the area/delay tradeoffs resulting from our design decisions.
ISBN: 9781581138405 (print)
A key capability of data-race detectors is to determine whether one thread executes logically in parallel with another or whether the threads must operate in series. This paper provides two algorithms, one serial and one parallel, to maintain series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. The serial SP-order algorithm runs in O(1) amortized time per operation. In contrast, the previously best algorithm requires a time per operation that is proportional to Tarjan's functional inverse of Ackermann's function. SP-order employs an order-maintenance data structure that allows us to implement a more efficient "English-Hebrew" labeling scheme than was used in earlier race detectors, which immediately yields an improved determinacy-race detector. In particular, any fork-join program running in T1 time on a single processor can be checked on the fly for determinacy races in O(T1) time. Corresponding improved bounds can also be obtained for more sophisticated data-race detectors, for example, those that use locks. By combining SP-order with Feng and Leiserson's serial SP-bags algorithm, we obtain a parallel SP-maintenance algorithm, called SP-hybrid. Suppose that a fork-join program has n threads, T1 work, and a critical-path length of T∞. When executed on P processors, we prove that SP-hybrid runs in O((T1/P + PT∞) lg n) expected time. To understand this bound, consider that the original program obtains linear speed-up over a 1-processor execution when P = O(T1/T∞). In contrast, SP-hybrid obtains linear speed-up when P = O(√(T1/T∞)), but the work is increased by a factor of O(lg n).
ISBN: 9781450369794 (print)
We present a (1 + ε)-approximate parallel algorithm for computing shortest paths in undirected graphs, achieving poly(log n) depth and m poly(log n) work for n-node, m-edge graphs. Although sequential algorithms with (nearly) optimal running time have been known for several decades, near-optimal parallel algorithms have turned out to be a much tougher challenge. For (1 + ε)-approximation, all prior algorithms with poly(log n) depth perform at least Ω(m n^c) work for some constant c > 0. Improving this long-standing upper bound obtained by Cohen (STOC'94) has been open for 25 years. We develop several new tools of independent interest. One of them is a new notion beyond hopsets, the low hop emulator: a poly(log n)-approximate emulator graph in which every shortest path has at most O(log log n) hops (edges). Direct applications of low hop emulators are parallel algorithms for poly(log n)-approximate single source shortest path (SSSP), Bourgain's embedding, metric tree embedding, and low diameter decomposition, all with poly(log n) depth and m poly(log n) work. To boost the approximation ratio to (1 + ε), we introduce compressible preconditioners and apply them inside Sherman's framework (SODA'17) to solve the more general problem of uncapacitated minimum cost flow (a.k.a. the transshipment problem). Our algorithm computes a (1 + ε)-approximate uncapacitated minimum cost flow in poly(log n) depth using m poly(log n) work. As a consequence, it also improves the state-of-the-art sequential running time from m · 2^O(√(log n)) to m poly(log n).
ISBN: 9781605583976 (print)
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
ISBN: 9781450342100 (print)
We design a family of parallel algorithms and GPU implementations for the exact string matching problem, based on Rabin-Karp (RK) randomized string matching. We describe and analyze three primary parallel approaches to binary string matching: cooperative (CRK), divide-and-conquer (DRK), and a novel hybrid of both (HRK). The CRK is most effective for large patterns (>8K characters), while the DRK approach is superior for shorter patterns. We then generalize the DRK to support any alphabet size without loss of performance. Our DRK method achieves up to a 64 GB/s processing rate on 8-character patterns from an 8-bit alphabet on an NVIDIA Tesla K40c GPU. We next demonstrate a novel parallel two-stage matching method (DRK-2S), which first skims the text for a smaller subset of the pattern and then verifies all potential matches in parallel. Our DRK-2S method is superior for pattern sizes up to 64k compared to the fastest CPU-based string matching implementations. With an 8-bit alphabet and up to 1k-character patterns, we get a geometric mean speedup of 4.81x against the best CPU methods, and can achieve a processing rate of at least 53 GB/s.
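As background for the abstract above, the sequential Rabin-Karp rolling-hash matcher that the parallel variants build on can be sketched as follows. This is the classic single-threaded algorithm, not the paper's CRK, DRK, HRK, or DRK-2S methods; the function name, the base, and the modulus are illustrative assumptions. Because hash matches are verified by direct comparison, the result is exact.

```python
def rabin_karp(text: str, pattern: str, base: int = 256,
               mod: int = 1_000_000_007) -> list[int]:
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    matches = []
    for i in range(n - m + 1):
        # Compare hashes first; confirm with a direct comparison to rule out
        # hash collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches


if __name__ == "__main__":
    print(rabin_karp("abracadabra", "abra"))  # [0, 7]
```

The rolling update makes each window's hash depend on the previous one, which is the sequential dependence a parallel formulation has to work around.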
ISBN: 9781450341325 (print)
We show that the bipartite perfect matching problem is in quasi-NC^2. That is, it has uniform circuits of quasi-polynomial size n^O(log n) and O(log^2 n) depth. Previously, only an exponential upper bound was known on the size of such circuits with poly-logarithmic depth. We obtain our result by an almost complete derandomization of the famous Isolation Lemma when applied to yield an efficient randomized parallel algorithm for the bipartite perfect matching problem.