ISBN: (print) 9781450312134
The proceedings contain 40 papers. The topics discussed include: time vs. space trade-offs for rendezvous in trees; allowing each node to communicate only once in a distributed system: shared whiteboard models; optimal and competitive runtime bounds for continuous, local gathering of mobile robots; online multi-robot exploration of grid graphs with rectangular obstacles; in search of parallel dimensions; delegation and nesting in best-effort hardware transactional memory; design, verification and applications of a new read-write lock algorithm; a lock-free B+tree; brief announcement: the problem based benchmark suite; brief announcement: subgraph isomorphism on a multithreaded shared memory architecture; efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model; a scalable framework for heterogeneous GPU-based clusters; and faster and simpler width-independent parallel algorithms for positive semidefinite programming.
ISBN: (print) 9781450312134
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms. Copyright 2012 ACM.
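For readers unfamiliar with Strassen's scheme, the sketch below shows the classical sequential seven-product recursion in C++. It is only background for the abstract above, not the paper's communication-optimal parallel algorithm; the matrix representation and helper names are illustrative assumptions.

```cpp
// Sequential Strassen recursion for n-by-n matrices, n a power of two.
// Background sketch only: the paper parallelizes this recursion while
// minimizing inter-processor communication.
#include <vector>
using Matrix = std::vector<std::vector<double>>;

static Matrix add(const Matrix& A, const Matrix& B, double sign = 1.0) {
    std::size_t n = A.size();
    Matrix C(n, std::vector<double>(n));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            C[i][j] = A[i][j] + sign * B[i][j];
    return C;
}

Matrix strassen(const Matrix& A, const Matrix& B) {
    std::size_t n = A.size();
    if (n == 1) return {{A[0][0] * B[0][0]}};
    std::size_t h = n / 2;
    auto block = [h](const Matrix& M, std::size_t bi, std::size_t bj) {
        Matrix S(h, std::vector<double>(h));
        for (std::size_t i = 0; i < h; ++i)
            for (std::size_t j = 0; j < h; ++j)
                S[i][j] = M[bi * h + i][bj * h + j];
        return S;
    };
    Matrix A11 = block(A,0,0), A12 = block(A,0,1), A21 = block(A,1,0), A22 = block(A,1,1);
    Matrix B11 = block(B,0,0), B12 = block(B,0,1), B21 = block(B,1,0), B22 = block(B,1,1);

    // The seven recursive products that replace the classical eight.
    Matrix M1 = strassen(add(A11, A22), add(B11, B22));
    Matrix M2 = strassen(add(A21, A22), B11);
    Matrix M3 = strassen(A11, add(B12, B22, -1.0));
    Matrix M4 = strassen(A22, add(B21, B11, -1.0));
    Matrix M5 = strassen(add(A11, A12), B22);
    Matrix M6 = strassen(add(A21, A11, -1.0), add(B11, B12));
    Matrix M7 = strassen(add(A12, A22, -1.0), add(B21, B22));

    // Recombine: C11 = M1+M4-M5+M7, C12 = M3+M5, C21 = M2+M4, C22 = M1-M2+M3+M6.
    Matrix C11 = add(add(M1, M4), add(M7, M5, -1.0));
    Matrix C12 = add(M3, M5);
    Matrix C21 = add(M2, M4);
    Matrix C22 = add(add(M1, M2, -1.0), add(M3, M6));

    Matrix C(n, std::vector<double>(n));
    for (std::size_t i = 0; i < h; ++i)
        for (std::size_t j = 0; j < h; ++j) {
            C[i][j] = C11[i][j];        C[i][j + h] = C12[i][j];
            C[i + h][j] = C21[i][j];    C[i + h][j + h] = C22[i][j];
        }
    return C;
}
```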
ISBN: (print) 9781450312134
The greedy sequential algorithm for maximal independent set (MIS) loops over the vertices in an arbitrary order, adding a vertex to the resulting set if and only if no previous neighboring vertex has been added. In this loop, as in many sequential loops, each iterate depends on only a subset of the previous iterates (i.e., knowing that any one of a vertex's previous neighbors is in the MIS, or knowing that it has no previous neighbors, is sufficient to decide its fate one way or the other). This leads to a dependence structure among the iterates. If this structure is shallow, then running the iterates in parallel while respecting the dependencies can lead to an efficient parallel implementation that mimics the sequential algorithm. In this paper, we show that for any graph, and for a random ordering of the vertices, the dependence length of the sequential greedy MIS algorithm is polylogarithmic (O(log^2 n) with high probability). Our results extend previous results that show polylogarithmic bounds only for random graphs. We show similar results for greedy maximal matching (MM). For both problems we describe simple linear-work parallel algorithms based on this approach. The algorithms allow for a smooth tradeoff between more parallelism and reduced work, but always return the same result as the sequential greedy algorithms. We present experimental results that demonstrate efficiency and the tradeoff between work and parallelism. Copyright 2012 ACM.
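A minimal sketch of the sequential greedy MIS loop described above, in C++. The parallel, dependence-respecting version analyzed in the paper is not shown; the adjacency-list representation and identifiers here are assumptions made for illustration.

```cpp
// Sequential greedy MIS over a random vertex ordering: a vertex joins the
// set iff none of its earlier-ordered neighbors already did. The paper shows
// this loop has dependence depth O(log^2 n) w.h.p., so a parallel execution
// respecting those dependencies returns the identical set.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<bool> greedy_mis(const std::vector<std::vector<int>>& adj,
                             std::mt19937& rng) {
    int n = static_cast<int>(adj.size());
    std::vector<int> order(n), rank(n);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), rng);     // random vertex ordering
    for (int pos = 0; pos < n; ++pos) rank[order[pos]] = pos;

    std::vector<bool> in_mis(n, false);
    for (int v : order) {
        bool blocked = false;                          // any earlier neighbor chosen?
        for (int u : adj[v])
            if (rank[u] < rank[v] && in_mis[u]) { blocked = true; break; }
        if (!blocked) in_mis[v] = true;
    }
    return in_mis;
}
```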
ISBN: (print) 9781450312134
This paper studies the problem of finding a (1+ε)-approximate solution to positive semidefinite programs. These are semidefinite programs in which all matrices in the constraints and objective are positive semidefinite and all scalars are nonnegative. At FOCS '11, Jain and Yao gave an NC algorithm that requires O(1/ε^13 log^13 m log n) iterations on input n constraint matrices of dimension m-by-m, where each iteration performs at least Ω(m^ω) work since it involves computing the spectral decomposition. We present a simpler NC parallel algorithm that, on input with n constraint matrices, requires O(1/ε^4 log^4 n log(1/ε)) iterations, each of which involves only simple matrix operations and computing the trace of the product of a matrix exponential and a positive semidefinite matrix. Further, given a positive SDP in a factorized form, the total work of our algorithm is nearly linear in the number of non-zero entries in the factorization. Our algorithm can be viewed as a generalization of Young's algorithm and analysis techniques for positive linear programs (Young, FOCS '01) to the semidefinite programming setting. Copyright 2012 ACM.
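As a reference point, one common way to write a positive packing SDP with n positive semidefinite constraint matrices of dimension m-by-m is sketched below; this is a representative formulation for illustration and may differ from the paper's exact normalization.

```latex
% Representative positive (packing) SDP, illustrative normalization:
% n PSD constraint matrices A_i, each of size m-by-m.
\[
  \max_{x \in \mathbb{R}^n_{\ge 0}} \;\; \sum_{i=1}^{n} x_i
  \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i A_i \preceq I,
  \qquad A_i \succeq 0 \;\; (1 \le i \le n).
\]
% A (1+\varepsilon)-approximation returns a feasible x whose objective value
% is at least \mathrm{OPT}/(1+\varepsilon).
```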
Programs written for an architecture with n processors require a re-write when migrated to an m-processor architecture (m > n) to benefit from the additional resources. Compiler-based solutions do not match manual op...
ISBN: (print) 9780769546759
The erratic memory access patterns of graph algorithms make them hard to optimize on cache-based architectures. While multithreading hides memory latency, it is unclear how hardware threads combined with caches affect the performance of typical graph workloads. As modern architectures strike different balances between caching and multithreading, it remains an open question whether the benefit of optimizing locality behavior outweighs the cost. We study parallel graph algorithms on two different multi-threaded, multi-core platforms, IBM Power7 and Sun Niagara2. Our experiments first demonstrate their performance advantage over prior architectures. We nonetheless find that the number of hardware threads on either platform is not sufficient to fully mask memory latency. Our cache-friendly scheduling of memory accesses improves performance by up to 2.6 times on Power7 and on prior cache-based architectures, yet the same technique significantly degrades performance on Niagara2. Software prefetching and manipulating the storage layout of the input to improve spatial locality improve performance by up to 2.1 times and 1.3 times, respectively, on both platforms. Our study reveals an interesting interplay between architecture and algorithm.
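The snippet below illustrates the generic software-prefetching idea mentioned in the abstract on a CSR neighbor scan, using the GCC/Clang-specific __builtin_prefetch hint. It is a hedged sketch, not the paper's code; the CSR layout, the prefetch distance, and all identifiers (including the use of UINT32_MAX to mark unvisited vertices) are assumptions.

```cpp
// While scanning the neighbors of vertex v in a CSR graph, issue prefetches
// for the data of neighbors a few positions ahead, so their cache lines
// arrive before they are needed.
#include <cstddef>
#include <cstdint>
#include <vector>

void relax_neighbors(const std::vector<std::size_t>& row_ptr,    // CSR offsets
                     const std::vector<std::uint32_t>& col_idx,  // CSR neighbor ids
                     std::vector<std::uint32_t>& level,          // UINT32_MAX = unvisited
                     std::uint32_t v, std::uint32_t next_level) {
    const std::size_t begin = row_ptr[v], end = row_ptr[v + 1];
    constexpr std::size_t kPrefetchDist = 8;     // tuning knob (assumed value)
    for (std::size_t e = begin; e < end; ++e) {
        if (e + kPrefetchDist < end)             // prefetch a future neighbor's entry
            __builtin_prefetch(&level[col_idx[e + kPrefetchDist]], 1 /*write*/, 1);
        std::uint32_t u = col_idx[e];
        if (level[u] == UINT32_MAX)              // unvisited: assign BFS level
            level[u] = next_level;
    }
}
```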
ISBN: (print) 9781450312448
This paper presents a novel approach to designing four- and eight-parallel pipelined fast Fourier transform (FFT) architectures using the folding transformation. The approach is based on the use of decimation-in-time algorithms, which reduce the number of delay elements by 33% compared to decimation-in-frequency based designs. The number of delay elements required for an N-point FFT architecture is N - 4, which is comparable to that of delay-feedback schemes. The number of complex adders required is only 50% of that in the delay-feedback designs. The proposed approach can be extended to any radix-2^n based FFT algorithm. The proposed architectures are feed-forward designs and can be pipelined by more stages to increase the throughput. Further, a novel four-parallel 128-point FFT architecture is derived using the proposed approach. It is shown that a radix-2^4 4-parallel 128-point design requires 124 delay elements, 28 complex adders, and four full complex multipliers. Copyright 2012 ACM.
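A quick check of the resource counts quoted for the 128-point design, using the delay-element formula stated in the abstract:

```latex
% Resource counts for the radix-2^4 4-parallel 128-point design (N = 128):
\[
  \text{delay elements} = N - 4 = 128 - 4 = 124, \qquad
  \text{complex adders} = 28, \qquad
  \text{full complex multipliers} = 4.
\]
```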
ISBN: (print) 9781450310833
We propose a general formal model of isolated hierarchical parallel computations, and identify several fragments to match the concurrency constructs present in real-world programming languages such as Cilk and X10. By associating fundamental formal models (vector addition systems with recursive transitions) with each fragment, we provide a common platform for exposing the relative difficulties of algorithmic reasoning. For each case we measure the complexity of deciding state-reachability for finite-data recursive programs, and propose algorithms for the decidable cases. The complexities, which include PTIME, NP, EXPSPACE, and 2EXPTIME, contrast with undecidable state-reachability for recursive multi-threaded programs.
ISBN: (print) 9780769546766
Graph algorithms are notorious for not achieving good speedup on parallel architectures. These algorithms tend to suffer from irregular dependencies and a high synchronization cost that prevent efficient execution on distributed memory machines. Hence such algorithms are mostly parallelized on shared memory machines. However, current commodity shared memory machines do not typically offer enough parallelism to process these problems. In this paper, we present an early investigation of the scalability of such algorithms on Intel's upcoming Many Integrated Core (Intel MIC) architecture which, when released in 2012, is expected to provide more than 50 physical cores with SMT capability. The Intel MIC architecture can be programmed through many programming models; here we investigate the three most popular of them, namely OpenMP, Cilk Plus, and Intel's TBB. We present scalability results for a parallel graph coloring algorithm, three variations of a breadth-first search algorithm, and a microbenchmark for irregular computations using these three programming models. Our results on a prototype board show that the multi-threaded architecture of Intel MIC can be used effectively to hide latencies in irregular applications and achieve almost perfect speedup.
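Below is a hedged sketch of a level-synchronous parallel BFS of the general kind benchmarked in the paper, written with OpenMP (one of the three models compared). The frontier representation, the use of atomics, and all identifiers are illustrative assumptions, not the authors' code.

```cpp
// Level-synchronous BFS on a CSR graph. Each OpenMP thread expands part of
// the current frontier into a private buffer; an atomic compare-and-swap on
// the distance array ensures each vertex is claimed exactly once.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<int> parallel_bfs(const std::vector<std::size_t>& row_ptr,    // CSR offsets
                              const std::vector<std::uint32_t>& col_idx,  // CSR neighbor ids
                              std::uint32_t source) {
    const std::size_t n = row_ptr.size() - 1;
    std::vector<std::atomic<int>> dist(n);
    for (auto& d : dist) d.store(-1, std::memory_order_relaxed);   // -1 = unvisited
    std::vector<std::uint32_t> frontier{source};
    dist[source].store(0);

    for (int level = 1; !frontier.empty(); ++level) {
        std::vector<std::uint32_t> next;                            // next frontier
        #pragma omp parallel
        {
            std::vector<std::uint32_t> local;                       // per-thread buffer
            #pragma omp for nowait
            for (std::size_t i = 0; i < frontier.size(); ++i) {
                std::uint32_t v = frontier[i];
                for (std::size_t e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
                    std::uint32_t u = col_idx[e];
                    int expected = -1;
                    // Atomically claim u so exactly one thread adds it.
                    if (dist[u].compare_exchange_strong(expected, level))
                        local.push_back(u);
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier.swap(next);
    }

    std::vector<int> out(n);                                        // plain distances
    for (std::size_t i = 0; i < n; ++i) out[i] = dist[i].load();
    return out;
}
```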
ISBN: (print) 9780769546759
Adaptive bitonic sort is a well-known merge-based parallel sorting algorithm. It achieves optimal complexity using a complex tree-like data structure called a bitonic tree. Because of this, using adaptive bitonic sort together with other algorithms usually requires converting bitonic trees to arrays and vice versa. This makes adaptive bitonic sort inappropriate in the context of hybrid sorting algorithms, where frequent switches between algorithms are performed. In this article we present a novel optimal sorting algorithm based on an approach similar to adaptive bitonic sort. Our approach does not use bitonic trees but works on the input array together with some additional information. With this approach it is trivial to switch between adaptive bitonic sort and other algorithms. We present an implementation of a hybrid algorithm for GPUs based on bitonic sort and our novel algorithm. This implementation turns out to be the fastest comparison-based sorting algorithm for GPUs found in the literature.
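For context, the sketch below gives the classical (non-adaptive) bitonic sorting network on an array whose length is a power of two; adaptive bitonic sort refines this comparison pattern to achieve optimal complexity. This is background only, not the paper's array-based algorithm or its GPU hybrid, and the identifiers are illustrative.

```cpp
// Classical iterative bitonic sort on an array of power-of-two length.
// Each (block, stride) stage performs independent compare-exchange steps,
// which is what makes the network attractive for parallel/GPU execution.
#include <cstddef>
#include <utility>
#include <vector>

void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();                          // assumed power of two
    for (std::size_t block = 2; block <= n; block <<= 1) {   // length of sorted runs
        for (std::size_t stride = block >> 1; stride > 0; stride >>= 1) {
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t j = i ^ stride;                  // partner index this stage
                if (j > i) {
                    bool ascending = (i & block) == 0;       // direction of this run
                    if ((a[i] > a[j]) == ascending) std::swap(a[i], a[j]);
                }
            }
        }
    }
}
```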