ISBN (print): 9781605586069
Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms that minimize communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n³)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the cost of various Cholesky decomposition implementations to this lower bound and draw the following conclusions: (1) "Naive" sequential algorithms for Cholesky attain neither the bandwidth nor the latency lower bound. (2) The sequential blocked algorithm in LAPACK (with the right block size), as well as various recursive algorithms [AP00, GJ01, ACW01, ST04] and one based on work of Toledo [Tol97], can attain the bandwidth lower bound. (3) The LAPACK algorithm can also attain the latency bound if used with blocked data structures rather than column-wise or row-wise matrix data structures, though the Toledo algorithm cannot. (4) The recursive sequential algorithm due to [AP00] attains the bandwidth and latency lower bounds at every level of a multi-level memory hierarchy, in a "cache-oblivious" way. (5) The parallel implementation of Cholesky in the ScaLAPACK library (again with the right block size) attains both the bandwidth and latency lower bounds to within a poly-logarithmic factor. Combined with prior results in [DGHL08a, DGHL08b, DGX08], this gives a complete set of communication-optimal algorithms for O(n³) implementations of three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU and QR by optimizing communication for any number of levels of memory hierarchy.
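As a concrete illustration of the blocked approach the abstract discusses, here is a minimal, pure-Python sketch of a right-looking blocked Cholesky factorization (illustrative only, not the LAPACK implementation; the block size b stands in for a cache-tuned parameter). Processing the matrix tile-by-tile is what lets blocked algorithms reuse data in fast memory and approach the bandwidth lower bound.

```python
import math

def cholesky_blocked(A, b=2):
    """Return lower-triangular L with L * L^T == A, for a dense SPD matrix
    given as a list of lists, working in b-by-b blocks (right-looking)."""
    n = len(A)
    L = [[A[i][j] for j in range(n)] for i in range(n)]
    for k in range(0, n, b):
        kb = min(k + b, n)
        # Factor the diagonal block with a small unblocked Cholesky.
        for j in range(k, kb):
            for i in range(j, kb):
                s = L[i][j] - sum(L[i][p] * L[j][p] for p in range(k, j))
                L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
        # Triangular solve: finish the panel below the diagonal block.
        for i in range(kb, n):
            for j in range(k, kb):
                s = L[i][j] - sum(L[i][p] * L[j][p] for p in range(k, j))
                L[i][j] = s / L[j][j]
        # Symmetric rank-b update of the trailing submatrix.
        for i in range(kb, n):
            for j in range(kb, i + 1):
                L[i][j] -= sum(L[i][p] * L[j][p] for p in range(k, kb))
    # Zero the strictly upper triangle, which still holds entries of A.
    for i in range(n):
        for j in range(i + 1, n):
            L[i][j] = 0.0
    return L
```

In a cache-aware setting b would be chosen so that three b-by-b tiles fit in fast memory; the cache-oblivious variant of [AP00] achieves the same effect by recursion instead of an explicit block size.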
ISBN (print): 9781605583976
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallel programming on many-core architectures. We show that the NB-FEB primitive is universal, scalable, and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable: its memory requests to the same memory location can be combined into a single memory request, which mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
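The abstract does not spell out the primitive's interface, so the following is a hypothetical lock-based rendering of the idea only: each word carries a value plus a full/empty flag, and every operation returns the old (value, flag) pair immediately instead of blocking on the flag. The operation names (`tfas`, `store_clear`) and their exact semantics are illustrative assumptions, not the paper's formal definitions.

```python
import threading

class NBFEBWord:
    """Hypothetical non-blocking full/empty-bit word (sketch, not the paper's
    spec). The lock stands in for hardware atomicity of a single instruction."""

    def __init__(self, value=0):
        self._lock = threading.Lock()
        self._value = value
        self._full = False

    def tfas(self, new_value):
        """Test-flag-and-store: store new_value and set the flag, but only if
        the flag was empty; always return the old (value, flag) pair."""
        with self._lock:
            old = (self._value, self._full)
            if not self._full:
                self._value = new_value
                self._full = True
            return old

    def store_clear(self, new_value):
        """Store new_value and clear the flag; return the old pair."""
        with self._lock:
            old = (self._value, self._full)
            self._value = new_value
            self._full = False
            return old

    def load(self):
        with self._lock:
            return (self._value, self._full)
```

Under these assumed semantics, a "first write wins" conditional store like `tfas` is enough to let any number of threads agree on a single value, which is the shape of argument behind the universality claim.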
ISBN (print): 9781605586137
We show the existence of ε-nets of size O(1/ε log log 1/ε) for planar point sets and axis-parallel rectangular ranges. The same bound holds for points in the plane with "fat" triangular ranges, and for point sets in R³ and axis-parallel boxes; these are the first known non-trivial bounds for these range spaces. Our technique also yields improved bounds on the size of ε-nets in the more general context considered by Clarkson and Varadarajan. For example, we show the existence of ε-nets of size O(1/ε log log log 1/ε) for the dual range space of "fat" regions and planar point sets (where the regions are the ground objects and the ranges are subsets stabbed by points). Plugging our bounds into the technique of Bronnimann and Goodrich, we obtain improved approximation factors (computable in randomized polynomial time) for the HITTING SET and SET COVER problems associated with the corresponding range spaces.
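To make the underlying definition concrete, here is a small brute-force checker (a sketch, not the paper's construction) for whether a candidate set N is an ε-net for a planar point set P under axis-parallel rectangular ranges. It suffices to enumerate the combinatorially distinct rectangles spanned by coordinates of points in P.

```python
from itertools import combinations_with_replacement

def in_rect(p, r):
    x0, x1, y0, y1 = r
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

def is_eps_net(N, P, eps):
    """N is an eps-net if every axis-parallel rectangle containing at least
    eps * |P| points of P also contains a point of N."""
    xs = sorted({p[0] for p in P})
    ys = sorted({p[1] for p in P})
    threshold = eps * len(P)
    for x0, x1 in combinations_with_replacement(xs, 2):
        for y0, y1 in combinations_with_replacement(ys, 2):
            r = (x0, x1, y0, y1)
            heavy = sum(in_rect(p, r) for p in P) >= threshold
            if heavy and not any(in_rect(q, r) for q in N):
                return False  # a heavy rectangle escaped the net
    return True
```

The point of the paper's bound is that nets of size O(1/ε log log 1/ε) always exist for this range space, beating the general O(1/ε log 1/ε) bound that holds for any range space of bounded VC dimension.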
ISBN (print): 9781605583969
The paper presents distributed and parallel δ-approximation algorithms for covering problems, where δ is the maximum number of variables on which any constraint depends (for example, δ = 2 for VERTEX COVER). Specific results include the following. For WEIGHTED VERTEX COVER: the first distributed 2-approximation algorithm taking O(log n) rounds, and the first parallel 2-approximation algorithm in RNC. The algorithms generalize to covering mixed integer linear programs (CMIP) with two variables per constraint (δ = 2). For any covering problem with monotone constraints and submodular cost: a distributed δ-approximation algorithm taking O(log² |C|) rounds, where |C| is the number of constraints. (Special cases include CMIP, facility location, and probabilistic (two-stage) variants of these problems.)
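The distributed and parallel algorithms are the paper's contribution; what follows is only a sequential sketch of the classic primal-dual/local-ratio scheme for weighted vertex cover that such algorithms build on: repeatedly take an uncovered edge and pay down both endpoints by the smaller residual weight, adding a vertex to the cover when its residual weight reaches zero. The resulting cover weighs at most twice the optimum.

```python
def weighted_vertex_cover(weights, edges):
    """Local-ratio 2-approximation for weighted vertex cover (sequential
    sketch). weights: dict vertex -> nonnegative weight; edges: pair list."""
    residual = dict(weights)
    cover = set()
    for u, v in edges:
        if u in cover or v in cover:
            continue                       # edge already covered
        pay = min(residual[u], residual[v])
        residual[u] -= pay                 # charge both endpoints equally
        residual[v] -= pay
        if residual[u] == 0:
            cover.add(u)
        if residual[v] == 0:
            cover.add(v)
    return cover
```

The 2-approximation follows because every unit paid is charged to an edge, any cover must pay at least half of each edge's charge, and the algorithm's cover costs exactly the total charge.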
ISBN (print): 9781605586137
On Saturday, May 30, one day before the start of the regular STOC 2009 program, a workshop was held in celebration of Leslie Valiant's 60th birthday. Talks were given by Jin-Yi Cai, Stephen Cook, Vitaly Feldman, Mark Jerrum, Michael Kearns, Mike Paterson, Michael Rabin, Rocco Servedio, Paul Valiant, Vijay Vazirani, and Avi Wigderson. The workshop was organized by Michael Kearns, Rocco Servedio, and Salil Vadhan, with support from the STOC local arrangements team and program committee. To accompany the workshop, here we briefly survey Valiant's many fundamental contributions to the theory of computing.
Suppose we have a parallel or distributed system whose nodes have limited capacities, such as processing speed, bandwidth, memory, or disk space. How does the performance of the system depend on the amount of heterogeneity of its capacity distribution? We propose a general framework to quantify the worst-case effect of increasing heterogeneity in models of parallel systems. Given a cost function g(C,W) representing the system's performance as a function of its nodes' capacities C and workload W (such as the makespan of an optimum schedule of jobs W on machines C), we say that g has price of heterogeneity α when for any workload, cost cannot increase by more than a factor α if node capacities become arbitrarily more heterogeneous. The price of heterogeneity also upper bounds the "value of parallelism": the maximum benefit obtained by increasing parallelism at the expense of decreasing processor speed. We give constant or logarithmic bounds on the price of heterogeneity of several well-known job scheduling and graph degree/diameter problems, indicating that in many cases, increasing heterogeneity can never be much of a disadvantage.
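A toy rendering (not from the paper) of the cost function g(C, W): here g is the optimum makespan of jobs W on machines with speeds C, computed by brute force on a tiny instance. Comparing a homogeneous speed vector against a more heterogeneous one with the same total capacity shows how the cost can shift as capacities concentrate.

```python
from itertools import product

def opt_makespan(speeds, jobs):
    """g(C, W): optimum makespan of jobs (work amounts) on machines with the
    given speeds, by exhaustive search over all assignments."""
    best = float('inf')
    for assignment in product(range(len(speeds)), repeat=len(jobs)):
        load = [0.0] * len(speeds)
        for job, m in zip(jobs, assignment):
            load[m] += job / speeds[m]   # processing time = work / speed
        best = min(best, max(load))
    return best
```

For two unit jobs, two unit-speed machines give makespan 1, while speeds (1.9, 0.1) of the same total force both jobs onto the fast machine for makespan 2/1.9 ≈ 1.05: heterogeneity hurts here, but only by a small factor, which is the kind of worst-case ratio the price of heterogeneity bounds.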
ISBN (print): 9781605581668
Parallel architectures are the way of the future, but are notoriously difficult to program. In addition to the low-level constructs they often present (e.g., locks, DMA, and non-sequential memory models), most parallel programming environments admit data races: the environment may make nondeterministic scheduling choices that can change the function of the program. We believe the solution is model-based design, where the programmer is presented with a constrained higher-level language that prevents certain unwanted behavior. In this paper, we describe a compiler for the SHIM scheduling-independent concurrent language that generates code for the Cell Broadband heterogeneous multicore processor. The complexity of the code our compiler generates relative to the source illustrates how difficult it is to write code for the Cell manually. We demonstrate the efficacy of our compiler on two examples. While the SHIM language is (by design) not ideal for every algorithm, it works well for certain applications and simplifies the parallel programming process, especially on the Cell architecture. Copyright 2009 ACM.
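A tiny sketch of scheduling independence in the spirit of SHIM's communication model (this is plain Python with a bounded queue, not SHIM): when the only inter-thread interaction is a blocking channel, the program computes the same result under every interleaving the scheduler may choose.

```python
import queue
import threading

def producer(chan):
    for x in [1, 2, 3]:
        chan.put(x)                  # communication is the only shared state

def consumer(chan, out):
    total = 0
    for _ in range(3):
        total += chan.get()          # blocks until the producer sends
    out.append(total)

def run():
    chan = queue.Queue(maxsize=1)    # size-1 buffer approximates rendezvous
    out = []
    threads = [threading.Thread(target=producer, args=(chan,)),
               threading.Thread(target=consumer, args=(chan, out))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[0]
```

Replacing the channel with an unsynchronized shared variable would reintroduce exactly the data races the abstract describes: the result would then depend on which thread the scheduler runs first.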
ISBN (print): 9781605583969
We present a randomized coloring algorithm for the unstructured radio network model, a model comprising autonomous nodes, asynchronous wake-up, no collision detection, and an unknown but geometric network topology. The current state-of-the-art coloring algorithm needs O(Δ · log n) time with high probability and uses O(Δ) colors, where n and Δ are the number of nodes in the network and the maximum degree, respectively; this algorithm requires knowledge of a linear bound on n and Δ. We improve this result in three ways. Firstly, we improve the time complexity: instead of the logarithmic factor we need only a polylogarithmic additive term; more specifically, our time complexity is O(Δ + log Δ · log n) given an estimate of n and Δ, and O(Δ + log² n) without knowledge of Δ. Secondly, our vertex coloring algorithm needs only Δ + 1 colors. Thirdly, our algorithm manages to do a distance-d coloring with asymptotically optimal O(Δ) colors for constant d.
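A sequential simulation (illustrative only; it ignores the radio-network complications of collisions and asynchronous wake-up that the paper actually handles) of the basic randomized round structure behind (Δ + 1)-coloring: each round, every uncolored node proposes a color unused by its already-colored neighbors, and keeps it if no neighbor proposed the same color that round.

```python
import random

def delta_plus_one_coloring(adj, seed=0):
    """adj[v]: list of neighbors of v. Returns a proper coloring with colors
    drawn from {0, ..., Delta}, simulating synchronous randomized rounds."""
    rng = random.Random(seed)
    n = len(adj)
    delta = max((len(nbrs) for nbrs in adj), default=0)
    palette = range(delta + 1)
    color = [None] * n
    while any(c is None for c in color):
        proposal = {}
        for v in range(n):
            if color[v] is None:
                taken = {color[u] for u in adj[v] if color[u] is not None}
                # A free color always exists: |taken| <= deg(v) <= delta.
                proposal[v] = rng.choice([c for c in palette if c not in taken])
        for v, c in proposal.items():
            # Keep the color only if no neighbor proposed the same one.
            if all(proposal.get(u) != c for u in adj[v]):
                color[v] = c
    return color
```

Each node's proposal list is nonempty because an uncolored node has at most Δ colored neighbors, which is the whole reason Δ + 1 colors suffice.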
ISBN (print): 9781605581668
Experimental studies have shown that electing a leader based on measurements of the underlying communication network can be beneficial. We use this approach to study the problem of electing a leader that is eventually not only correct (as captured by the Ω failure detector abstraction), but also optimal with respect to the transmission delays to its peers. We give definitions of this problem and a suitable model, which allows an analytical treatment of the problem, in contrast to previous work on this topic. Copyright 2009 ACM.
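A hedged sketch of the selection criterion only (not the paper's algorithm, which must also cope with failures and changing measurements): among the processes currently deemed alive, elect the one minimizing an aggregate of its measured delays to its peers, breaking ties by process id so that all processes with the same view agree.

```python
def elect_leader(delays, alive):
    """delays[i][j]: measured transmission delay from process i to process j;
    alive: set of process ids believed correct. Returns the elected leader."""
    def score(i):
        peers = [delays[i][j] for j in alive if j != i]
        avg = sum(peers) / len(peers) if peers else 0.0
        return (avg, i)          # tie-break deterministically by id
    return min(alive, key=score)
```

Averaging is one illustrative choice of aggregate; any metric that all processes evaluate identically on the same measurements yields the agreement needed for an eventual-leader abstraction.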