parallelalgorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communicatio...
详细信息
ISBN:
(纸本)9781450315722
parallelalgorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdo″s- Rényi random graphs on distributedmemory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallelalgorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
High-end shared storage systems serving multiple independent work-loads must assure that concurrently executing clients will receive a fair or agreed-upon share of system I/O resources. In a parallel I/O system an app...
详细信息
ISBN:
(纸本)9781581139860
High-end shared storage systems serving multiple independent work-loads must assure that concurrently executing clients will receive a fair or agreed-upon share of system I/O resources. In a parallel I/O system an application makes requests for specific disks at different steps of its computation depending on the data layout and its computational state. Different applications contend for disk access making the problem of maintaining fair allocation challenging. We propose a model for differentiated disk bandwidth allocation based on lexicographic minimization, and provide new efficient scheduling algorithms to allocate the I/O bandwidth fairly among contending applications. A major contribution of our model is its ability to handle multiple parallel disks and contention for disks among the concurrent applications. Analysis and simulation-based evaluation shows that our algorithms provide performance isolation, weighted allocation of resources, and are work conserving. The solutions are also applicable to other shared resource environments dealing with non-uniform heterogeneous servers. Copyright 2005 acm.
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consist...
详细信息
ISBN:
(纸本)9780897917179
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the trade-off between the amount of local computation and the amount of inter-processor communication required for parallel sorting algorithms. We prove a lower bound of Ω(n log m/m) on the time to sort n numbers in an exclusive-read variant of the PRAM (m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that both the upper and the lower bound can be adapted to bridging models that address the issue of limited communication bandwidth: the LogP model and the BSP model. The lower bounds provide convincing evidence that efficient parallelalgorithms for sorting rely strongly on high communication bandwidth.
Some parallelalgorithms have the property that, as they are allowed to take more time, the total work that they do is reduced. This paper describes three such algorithms that find the strongly connected components of...
详细信息
We present Dreadlocks. an efficient new shared-memory spin lock that actively detects deadlocks. Instead of spinning on a Boolean value, each thread spins on the lock owner's per-thread digest, a compact represent...
详细信息
ISBN:
(纸本)9781595939739
We present Dreadlocks. an efficient new shared-memory spin lock that actively detects deadlocks. Instead of spinning on a Boolean value, each thread spins on the lock owner's per-thread digest, a compact representation of a portion of the lock's waits-for graph. Digests can be implemented either as bit vectors (for small numbers of threads) or as Bloom filters (for larger numbers of threads). Updates to digests are propagated dynamically as locks are acquired and released. Dreadlocks can be applied to any spin lock algorithm that allows threads to time out. Experimental results show that Dreadlocks outperform timeouts under many circumstances, and almost never do worse.
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of whic...
详细信息
ISBN:
(纸本)9780897918091
This paper addresses optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains including digital signal processing, image processing, and computer vision. The parameters of the performance of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which the data sets are processed). These two criterion are distinct since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general and realistic model of inter-task communication, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The main algorithms are based on dynamic programming and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
This paper presents a comparison of the pragmatic aspects of some parallelalgorithms for finding connected components, together with optimizations on these algorithms. The algorithms being compared are two similar al...
详细信息
We present an algorithm for matrix inversion that combines the practical requirement of an optimal number of arithmetic operations and the theoretical goal of a polylogarithmic critical path length. The algorithm redu...
详细信息
ISBN:
(纸本)9781450315722
We present an algorithm for matrix inversion that combines the practical requirement of an optimal number of arithmetic operations and the theoretical goal of a polylogarithmic critical path length. The algorithm reduces inversion to matrix multiplication. It uses Strassen's recursion scheme but on the critical path, it breaks the recursion early switching to an asymptotically inefficient yet fast use of Newton's method. We also show that the algorithm is numerically stable. Overall, we get a candidate for a massively parallel algorithm that scales to exascale systems even on relatively small inputs. Preliminary experiments on multicore machines give the surprising result that even on such moderately parallel machines the algorithm outperforms Intel's Math Kernel Library and that Strassen's algorithm seems to be numerically more stable than one might expect.
Dynamic programming is an important combinatorial optimization technique that has been widely used in various fields such as control theory, operations research, computational biology and computer science. Many author...
详细信息
Dynamic programming is an important combinatorial optimization technique that has been widely used in various fields such as control theory, operations research, computational biology and computer science. Many authors have described parallel dynamic programming algorithms for the family of multistage problems. More scarce is the literature for the more general class of problems where dependences appear between non-consecutive stages. Among the important problems falling in this class is the RNA base pairing problem. In this study we propose a new parallel scheme for a large class of recurrences with triangular iteration space and nonuniform dependences that includes the RNA base pairing problem. We derive two different instances of this scheme that correspond to an horizontal and a vertical traverse of the iteration domain. We develop and extend the tiling approach for this particular class. We formulate and analytically solve the optimization problem determining the tile size that minimizes the total execution time of the tiled program on a distributed memory parallel machine. Our analyze is based on the BSP model, which assures the portability of the obtained results. The computational experiments carried out on the CRAY T3E behave according to the predictions of our theoretical model.
We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feat...
详细信息
ISBN:
(纸本)9781581138405
We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together. These criteria are the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to be able to represent both user-oriented objectives and system administrator objectives. We propose an algorithm based on a batch policy with increasing batch sizes, with a smart selection of jobs in each batch. This algorithm is assessed by intensive simulation results, compared to a new lower bound (obtained by a relaxation of ILP) of the optimal schedules for both criteria separately. It is currently implemented in an actual real-size cluster platform.
暂无评论