Sparse matrix computations are ubiquitous in science and engineering. In particular, solving a sparse linear system of equations is the bottleneck in many computations. Of the many available algorithms, some are easy to parallelize because they reduce to sparse matrix-vector multiplication, for which good algorithms based on graph partitioning exist. But others, which are widely used sequentially because of their numerical properties, remain hard to parallelize scalably. In this talk we highlight the challenges of parallelizing two widely used methods: sparse Gaussian elimination with pivoting, and multigrid for linear systems arising from solid mechanics problems on irregular meshes. In both cases our current algorithms are among the fastest available (the first achieving parallel efficiencies up to 20% on 512 processors, and the latter up to 50% on 960 processors), but many algorithm design and performance analysis problems remain. Both problems are much more challenging to parallelize than their simpler counterparts (sparse Cholesky, and multigrid on regular meshes, which we will briefly review), and to some extent the parallelizability of these simpler algorithms provides `upper bounds' for us. In both algorithms we exploit certain tradeoffs between scalability and `accuracy,' so that the parallel algorithms are not ones we would run sequentially. We also plan to discuss some open problems, beginning with sparse Gaussian elimination.
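As background to the abstract's remark that methods reducing to sparse matrix-vector multiplication parallelize well, here is a minimal sketch of that kernel in the standard CSR (compressed sparse row) layout; each output row depends only on its own slice of the arrays, so rows can be distributed across processors independently. The function name and the 2x2 example matrix are illustrative, not from the talk.

```python
def csr_spmv(indptr, indices, data, x):
    """Compute y = A @ x for a CSR matrix A.

    indptr[row] .. indptr[row+1] delimits the nonzeros of each row,
    so every row's dot product is independent of the others.
    """
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += data[k] * x[indices[k]]
        y[row] = s
    return y

# Illustrative 2x2 matrix A = [[2, 0], [1, 3]] in CSR form:
indptr, indices, data = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
print(csr_spmv(indptr, indices, data, [1.0, 1.0]))  # [2.0, 4.0]
```

The row-independence visible here is what makes graph partitioning effective: a good partition assigns rows to processors so that most `indices[k]` references stay local.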
The papers submitted to the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures are presented. The issues considered include network protocols, parallel algorithms, multiprocessing systems, parallel processing systems, distributed computer systems, program compilers, sorting, computer science, data storage equipment, and video signal processing.
ISBN (print): 9780897919890
In this paper we develop models for and analyze several randomized work stealing algorithms in a dynamic setting. Our models represent the limiting behavior of systems as the number of processors grows to infinity using differential equations. The advantages of this approach include the ability to model a large variety of systems and to provide accurate numerical approximations of system behavior even when the number of processors is relatively small. We show how this approach can yield significant intuition about the behavior of work stealing algorithms in realistic settings.
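To illustrate the differential-equation approach the abstract describes, here is a toy mean-field sketch (not the paper's exact model): `s[i]` tracks the limiting fraction of processors holding at least `i` tasks, with arrivals at rate `lam`, unit-rate service, and a crude stealing term that moves work from long queues toward idle processors. All rates and the form of the steal term are hypothetical.

```python
def integrate(lam, steal, steps=20000, dt=0.001, maxq=50):
    """Euler-integrate a mean-field queue-length system.

    s[i] = fraction of processors with at least i tasks; s[0] == 1
    by definition. Each time step applies arrivals, services, and a
    hypothetical steal term active only for queues of length >= 2.
    """
    s = [1.0] + [0.0] * maxq
    for _ in range(steps):
        new = s[:]
        for i in range(1, maxq):
            grow = lam * (s[i - 1] - s[i])            # arrivals
            shrink = s[i] - s[i + 1]                  # service completions
            # idle processors (fraction 1 - s[1]) steal from longer queues
            stolen = steal * (1.0 - s[1]) * (s[i] - s[i + 1]) if i >= 2 else 0.0
            new[i] += dt * (grow - shrink - stolen)
        s = new
    return s

s = integrate(lam=0.7, steal=2.0)
busy = s[1]  # limiting fraction of busy processors
```

The appeal of this style of model, as the abstract notes, is that one system of ODEs captures the infinite-processor limit, yet its fixed point already approximates finite systems well.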
This paper envisions an extension to a standard instruction set that efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP). That is, it introduces Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation; new elements are added where needed.
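For flavor only, the following sketch emulates the spawn-many-fine-grained-threads style of PRAM programming with an ordinary thread pool; the actual XMT proposal is an instruction-set extension, not a library, and the `spawn` helper here is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def spawn(n, body):
    """Run body(i) for i in 0..n-1, conceptually all at once.

    Illustrative stand-in for a hardware-level spawn: a PRAM-style
    step launches one lightweight thread per data element.
    """
    with ThreadPoolExecutor() as pool:
        list(pool.map(body, range(n)))

out = [0] * 8

def body(i):
    out[i] = i * i  # each virtual thread writes its own slot

spawn(8, body)
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The point of the paradigm is that such per-element threads are cheap enough to be practical at instruction granularity, which a software thread pool only caricatures.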
We present lower bounds on the time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSM, and the distributed-memory model, the BSP. For each of these models, we also obtain lower bounds on the number of rounds needed to solve these problems using a randomized algorithm on a p-processor machine. Our results on `rounds' are of special interest in the context of designing work-efficient algorithms on a machine where latency and synchronization costs are high. Many of our lower bound results are complemented by upper bounds that match the lower bound or come close to it.
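As background for readers unfamiliar with BSP, the standard cost model (due to Valiant) charges each superstep as

```latex
T_{\text{superstep}} \;=\; \max_i w_i \;+\; g \cdot h \;+\; L
```

where \(w_i\) is the local computation on processor \(i\), \(h\) is the maximum number of words any processor sends or receives, \(g\) is the per-word communication cost, and \(L\) is the barrier-synchronization latency. Lower bounds on rounds matter precisely because each round pays the \(L\) term in full.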
This paper analyzes job scheduling for parallel computers by theoretical and experimental means. Based on existing architectures, we first present a machine and a job model. Then, we propose a simple on-line algorithm employing job preemption without migration and derive theoretical bounds for the performance of the algorithm. The algorithm is experimentally evaluated with trace data from a large computing facility. These experiments show that the algorithm is highly sensitive to parameter selection and that substantial performance improvements over existing (non-preemptive) scheduling methods are possible.
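As a toy illustration of why preemption can beat non-preemptive scheduling (this is not the paper's algorithm or trace data), compare non-preemptive FCFS with a preemptive shortest-remaining-time policy on a single processor for a batch of jobs arriving at time 0; with simultaneous arrivals, SRPT reduces to running jobs shortest-first. The job sizes below are hypothetical.

```python
def fcfs_avg_completion(sizes):
    """Mean completion time when jobs run to completion in arrival order."""
    t, total = 0.0, 0.0
    for size in sizes:
        t += size        # job finishes after all earlier jobs
        total += t
    return total / len(sizes)

def srpt_avg_completion(sizes):
    """Mean completion time under preemptive shortest-remaining-time.

    With all jobs present at time 0, SRPT never actually preempts and
    is equivalent to running jobs in increasing order of size.
    """
    return fcfs_avg_completion(sorted(sizes))

jobs = [10.0, 1.0, 1.0, 1.0]       # one long job ahead of three short ones
print(fcfs_avg_completion(jobs))   # 11.5
print(srpt_avg_completion(jobs))   # 4.75
```

The gap comes entirely from short jobs no longer waiting behind the long one, which is the intuition behind preemptive policies even when, as in the paper, migration between processors is disallowed.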