A malleable task is a computational unit which may be executed on an arbitrary number of processors, with its execution time depending on the amount of resources allotted to it. Following the standard behavior of parallel applications, we assume that the malleable tasks are monotonic, i.e. that the execution time decreases with the number of processors while the computational work increases. This paper presents a new approach for scheduling a set of independent malleable tasks which leads to a worst-case guarantee of √3 for the minimization of the parallel execution time, or makespan. It improves on all other existing practical results, including the two-phase method introduced by Turek et al. The main idea is to transfer the difficulty of the two-phase method from the scheduling part to the allotment selection. We show how to formulate this last problem as a knapsack optimization problem. Then, the scheduling problem is solved by a dual approximation, which leads to a simple structure of two consecutive shelves.
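A minimal sketch of the dual-approximation idea (not the authors' exact procedure): for a makespan guess d, each task receives its canonical allotment, i.e. the fewest processors on which it completes within the shelf height, and tasks are split between a shelf of height d and a shelf of height d/2, each using at most m processors. The greedy shelf selection below stands in for the knapsack formulation used in the paper; all function names and the data layout are assumptions made for this example.

```python
# Illustrative sketch of a dual-approximation step for monotonic malleable tasks.
# A greedy selection replaces the paper's knapsack-based allotment selection.

def canonical_allotment(exec_time, bound):
    """Fewest processors p with exec_time[p] <= bound, or None if impossible.

    exec_time maps a processor count p to the task's running time on p
    processors; monotonicity means the time is non-increasing in p.
    """
    for p in sorted(exec_time):
        if exec_time[p] <= bound:
            return p
    return None


def two_shelf_feasible(tasks, m, d):
    """Greedy check that all tasks fit on two consecutive shelves of heights d and d/2."""
    used_tall, used_short = 0, 0
    for exec_time in tasks:
        p_short = canonical_allotment(exec_time, d / 2)
        p_tall = canonical_allotment(exec_time, d)
        if p_tall is None:
            return False                       # the guess d is too small
        if p_short is not None and used_short + p_short <= m:
            used_short += p_short              # place on the short shelf (height d/2)
        elif used_tall + p_tall <= m:
            used_tall += p_tall                # place on the tall shelf (height d)
        else:
            return False
    return True


def dual_approx_makespan(tasks, m, lo, hi, eps=1e-3):
    """Binary search on the makespan guess d (the dual-approximation outer loop)."""
    while hi - lo > eps:
        d = (lo + hi) / 2
        if two_shelf_feasible(tasks, m, d):
            hi = d
        else:
            lo = d
    return hi
```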
ISBN:
(Print) 9780897916714
Known polylog parallel algorithms for the solution of linear systems and related problems require computation of the characteristic polynomial or related forms, which are known to be highly unstable in practice. However, matrix factorizations of various types, bypassing computation of the characteristic polynomial, are used extensively in sequential numerical computations and are essential in many applications. This paper gives new parallel methods for various exact factorizations of several classes of matrices. We assume the input matrices are n × n with integer entries of size ≤ 2^(n^o(1)). We make no other assumption on the input. We assume the arithmetic PRAM model of parallel computation. Our main result is a reduction of the known parallel time bounds for these factorizations from O(log^3 n) to O(log^2 n). Our results are work efficient; we match the best known work bounds of parallel algorithms with polylog time bounds, and are within a log n factor of the work bounds for the best known sequential algorithms for the same problems. The exact factorizations we compute for symmetric positive definite matrices include: 1. recursive factorization sequences and trees, 2. LU factorizations, 3. QR factorizations, and 4. reduction to upper Hessenberg form. The classes of matrices for which we can efficiently compute these factorizations include: 1. dense matrices, in time O(log^2 n) with processor bound P(n) (the number of processors needed to multiply two n × n matrices in O(log n) time); 2. block diagonal matrices, in time O(log^2 b) with P(b)·n/b processors; 3. sparse matrices which are s(n)-separable (recursive factorizations only), in time O(log^2 n) with P(s(n)) processors, where s(n) is of the form n^γ for 0 < γ < 1; and 4. banded matrices, in parallel time O((log n) log b) with P(b)·n/b processors. These factorizations also provide us similarly efficient algorithms for exact computation (given arbitrary rational matrices that need not be symmetric positive definite) of the following: 1
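As a concrete illustration of the recursive factorization idea for symmetric positive definite matrices, the sketch below computes a block LDL^T factorization by recursing on the leading block and on its Schur complement. This is only a sequential numpy sketch of the recursion pattern; the paper's parallel algorithms achieve polylog depth by different means, and the function names here are assumptions.

```python
# Recursive block LDL^T factorization of a symmetric positive definite matrix
# via Schur complements (sequential illustration of the recursion pattern only).
import numpy as np


def recursive_block_ldl(A):
    """Return unit lower triangular L and block-diagonal D with A = L D L^T."""
    n = A.shape[0]
    if n == 1:
        return np.eye(1), A.copy()
    k = n // 2
    B, Ct = A[:k, :k], A[:k, k:]
    C, Dm = A[k:, :k], A[k:, k:]

    L11, D11 = recursive_block_ldl(B)
    # Off-diagonal block: L21 must satisfy L21 (D11 L11^T) = C.
    L21 = np.linalg.solve((D11 @ L11.T).T, C.T).T
    # Schur complement of B in A, factored recursively.
    S = Dm - L21 @ D11 @ L21.T
    L22, D22 = recursive_block_ldl(S)

    L = np.block([[L11, np.zeros((k, n - k))], [L21, L22]])
    D = np.block([[D11, np.zeros((k, n - k))], [np.zeros((n - k, k)), D22]])
    return L, D


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((6, 6))
    A = M @ M.T + 6 * np.eye(6)          # symmetric positive definite test matrix
    L, D = recursive_block_ldl(A)
    assert np.allclose(L @ D @ L.T, A)
```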
We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the Parallel Disk Model for implementation and analysis. Our method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the data for the 1-dimensional FFTs are contiguous. We present an I/O complexity analysis for this method as well as empirical results for a DEC 2100 server, an SGI Origin 2000, and a Beowulf cluster, each of which has a parallel disk system.
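A minimal in-core sketch of the pass-per-dimension pattern: each pass performs contiguous 1-D FFTs along the last axis and then rotates the axes so that the next dimension becomes the contiguous one, which is the role the out-of-core transpose plays on the parallel disk system. The numpy code below is an assumption-laden illustration, not the authors' implementation.

```python
# Multidimensional FFT computed as successive passes of contiguous 1-D FFTs,
# with a transpose between passes (in-core stand-in for the out-of-core step).
import numpy as np


def fft_nd_by_passes(x):
    """Compute an n-dimensional FFT one dimension at a time."""
    a = np.asarray(x, dtype=complex)
    for _ in range(a.ndim):
        # 1-D FFTs along the last axis, where rows are contiguous in memory.
        a = np.fft.fft(a, axis=-1)
        # Rotate axes so the next dimension becomes contiguous; out of core,
        # this is the expensive transpose on the parallel disk system.
        a = np.ascontiguousarray(np.moveaxis(a, -1, 0))
    return a


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16, 32))      # dimensions that are powers of 2
    assert np.allclose(fft_nd_by_passes(x), np.fft.fftn(x))
```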
ISBN:
(Print) 9781450312134
Performance matters. But how we improve it is changing. Historically, transparent hardware improvements would mean software just ran faster. That may not necessarily be true in the future. To continue the pace of innovation, the future will need to be increasingly parallel, involving parallelism across data, threads, cores, and nodes. This talk will explore some of the dimensions of parallelism, and the opportunities and challenges they pose.
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper investigates the use of recursive array layouts for improving the performance of parallel recursive matrix multiplication algorithms. We extend previous work by Frens and Wise on recursive matrix multiplication to examine several recursive array layouts and three recursive algorithms: standard matrix multiplication, and the more complex algorithms of Strassen and Winograd. We show that while recursive array layouts significantly outperform traditional layouts (reducing execution times by a factor of 1.2-2.5) for the standard algorithm, they offer little improvement for Strassen's and Winograd's algorithms; we provide an algorithmic explanation of this phenomenon. We demonstrate that carrying the recursive layout down to the level of individual matrix elements is counter-productive, and that a combination of recursive layouts down to canonically ordered matrix tiles instead yields higher performance. We evaluate five recursive layouts with successively increasing complexity of address computation, and show that addressing overheads can be kept in control even for the most computationally demanding of these layouts. Finally, we provide a critique of the Cilk system that we used to parallelize our code.
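A small sketch of quadrant recursion with a canonical leaf tile, mirroring the finding that the recursion should stop at canonically ordered tiles rather than descend to individual elements. The paper's implementation uses C and Cilk; this numpy version, including the tile size and function names, is an illustrative assumption only.

```python
# Recursive matrix multiplication by quadrants with a canonically ordered leaf tile.
import numpy as np

TILE = 64  # assumed leaf size; below this the product uses an ordinary row-major tile


def recursive_matmul(A, B):
    """Multiply square matrices whose order is a power of two by quadrant recursion."""
    n = A.shape[0]
    if n <= TILE:
        return A @ B                      # canonically ordered leaf tile product
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    C = np.empty_like(A)
    C[:h, :h] = recursive_matmul(A11, B11) + recursive_matmul(A12, B21)
    C[:h, h:] = recursive_matmul(A11, B12) + recursive_matmul(A12, B22)
    C[h:, :h] = recursive_matmul(A21, B11) + recursive_matmul(A22, B21)
    C[h:, h:] = recursive_matmul(A21, B12) + recursive_matmul(A22, B22)
    return C


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 256))
    B = rng.standard_normal((256, 256))
    assert np.allclose(recursive_matmul(A, B), A @ B)
```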