In this paper, we study parallelalgorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, w...
详细信息
ISBN:
(纸本)9781595939739
In this paper, we study parallelalgorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallelalgorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, the importance of pursuing algorithms which minimize communication also increases. Toward this end, we seek asympto...
详细信息
ISBN:
(纸本)9781450307437
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, the importance of pursuing algorithms which minimize communication also increases. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work [2] has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present an algorithm for dense matrix multiplication which attains the lower bound.
The methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks were described. Most of the analysis were centered on the simulation of unit-delay rings on netwo...
详细信息
The methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks were described. Most of the analysis were centered on the simulation of unit-delay rings on networks of workstations (NOWs) with arbitrary delays on the links. Emulations were also derived for the wide variety of other unit-delay network architectures on a NOW with high-latency links. The lower bounds that established limits on the degree to which the high latency links were proven, can be mitigated. These bounds demonstrates that overcoming latencies in dataflow types of computations that require access to large local databases is easier.
The goal of the μDatabase project is to understand the behaviour of data structures and their algorithms, both parallel and sequential, in a memory-mapped environment. Memory mapping allows primary-memory pointer-bas...
详细信息
The goal of the μDatabase project is to understand the behaviour of data structures and their algorithms, both parallel and sequential, in a memory-mapped environment. Memory mapping allows primary-memory pointer-based data structures to be stored on secondary storage, and subsequently traversed and modified, without transforming the embedded pointers. The project incorporates a collaboration between practitioners and theoreticians and has produced a toolkit to support a parallel memory-mapped environment, extensive concurrency tools, an analytical model for the system validated by experiments, and preliminary results involving parallel database join algorithms as well as sequential B-tree and R-tree data structures. Future work includes augmenting the toolkit, developing and testing more parallelalgorithms tuned for efficiency in a memory-mapped environment, and extending the current model to cover more general algorithms and capture more low-level details of the physical machine.
We describe a very simple deterministic parallel algorithm for linear programming in fixed dimension d that takes poly(log log n) time in the common CRCW PRAM model and does optimal O(n) work. Our algorithm is based o...
详细信息
ISBN:
(纸本)9780897918091
We describe a very simple deterministic parallel algorithm for linear programming in fixed dimension d that takes poly(log log n) time in the common CRCW PRAM model and does optimal O(n) work. Our algorithm is based on multidimensional search and an effective use of approximation algorithms to speed-up the basic search in the CRCW model. Our method also yields very fast poly(log log n) algorithm for smallest enclosing sphere and approximate ham-sandwich cuts as well as an O(log n) time work-optimal algorithm for exact ham-sandwich cuts of separable point sets. For all these problems, particularly for the fixed-dimensional linear programming, o(log n) time efficient deterministic PRAM algorithms were not known until very recently.
Consider the following NP-hard problems: Given a graph G, find the minimum 2-edge connected and 2-vertex connected subgraphs spanning all vertices of G. The past few years have produced exciting sequential algorithms ...
详细信息
ISBN:
(纸本)9780897917179
Consider the following NP-hard problems: Given a graph G, find the minimum 2-edge connected and 2-vertex connected subgraphs spanning all vertices of G. The past few years have produced exciting sequential algorithms for approximating such minimum subgraphs [6, 7]. The approximation factors are improved from 2 down to 5/4 and 3/2 respectively. Yet the techniques involved are all based on augmenting depth-first-search trees and no similar progress has been carried to the parallel context. This paper presents NC algorithms to achieve approximation factors of 3/2 + Ε and 7/4 + Ε respectively without computing depth-first-search trees.
This paper provides a new approach to labeling the connected components of an n × n image on a scan line array processor (comprised of n processing elements). Variations of this approach yield an algorithm guaran...
详细信息
ISBN:
(纸本)9780897917179
This paper provides a new approach to labeling the connected components of an n × n image on a scan line array processor (comprised of n processing elements). Variations of this approach yield an algorithm guaranteed to complete in o(n lg n) time as well as algorithms likely to approach O(n) time for all or most images. The best previous solutions require using a more complicated architecture or require Ω(n lg n) time. We also show that on a restricted version of the architecture, any algorithm requires Ω(n lg n) time in the worst case.
Data-parallelarchitectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have ...
详细信息
ISBN:
(纸本)9781479969982
Data-parallelarchitectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have wide datapaths that process a single instruction across parallel threads, a mechanism is needed to track and sequence threads as they traverse potentially divergent control paths through the program. The design space for divergence management ranges from software-only approaches where divergence is explicitly managed by the compiler, to hardware solutions where divergence is managed implicitly by the microarchitecture. In this paper, we explore this space and propose a new predication-based approach for handling control-flow structures in data-parallelarchitectures. Unlike prior predication algorithms, our new compiler analyses and hardware instructions consider the commonality of predication conditions across threads to improve efficiency. We prototype our algorithms in a production compiler and evaluate the tradeoffs between software and hardware divergence management on current GPU silicon. We show that our compiler algorithms make a predication-only architecture competitive in performance to one with hardware support for tracking divergence.
We present a parallel solution to the Maximum-Flow (Max-Flow) problem, suitable for a modern many-core architecture. We show that by starting from a PRAM algorithm, following an established "programmer's work...
详细信息
ISBN:
(纸本)9781450307437
We present a parallel solution to the Maximum-Flow (Max-Flow) problem, suitable for a modern many-core architecture. We show that by starting from a PRAM algorithm, following an established "programmer's workflow" and targeting XMT, a PRAM-inspired many-core architecture, we achieve significantly higher speed-ups than previous approaches. Comparison with the fastest known serial max-flow implementation on a modern CPU demonstrates for the first time potential for orders-of-magnitude performance improvement for Max-Flow. Using XMT, the PRAM Max-Flow algorithm is also much easier to program than for other parallel platforms, contributing a powerful example toward dual validation of both PRAM algorithmics and XMT.
The idea of dynamic programming (DP), proposed by Bellman in the 1950s, is one of the most important algorithmic techniques. However, in parallel, many fundamental and sequentially simple problems become more challeng...
详细信息
ISBN:
(纸本)9798400704161
The idea of dynamic programming (DP), proposed by Bellman in the 1950s, is one of the most important algorithmic techniques. However, in parallel, many fundamental and sequentially simple problems become more challenging, and open to a (nearly) work-efficient solution (i.e., the work is off by at most a polylogarithmic factor over the best sequential solution). In fact, sequential DP algorithms employ many advanced optimizations such as decision monotonicity or special data structures, and achieve better work than straightforward solutions. Many such optimizations are inherently sequential, which creates extra challenges for a parallel algorithm to achieve the same work bound. The goal of this paper is to achieve (nearly) work-efficient parallel DP algorithms by parallelizing classic, highly-optimized and practical sequential algorithms. We show a general framework called the Cordon Algorithm for parallel DP algorithms, and use it to solve several classic problems. Our selection of problems includes Longest Increasing Subsequence (LIS), sparse Longest Common Subsequence (LCS), convex/concave generalized Least Weight Subsequence (LWS), Optimal Alphabetic Tree (OAT), and more. We show how the Cordon Algorithm can be used to achieve the same level of optimization as the sequential algorithms, and achieve good parallelism. Many of our algorithms are conceptually simple, and we show some experimental results as proofs-of-concept.
暂无评论