ISBN: 9781581131000 (print)
This paper develops a highly accurate LogGP model of a complex wavefront application that uses MPI communication on the IBM SP/2. Key features of the model include: (1) elucidation of the principal wavefront synchronization structure, and (2) explicit high-fidelity models of the MPI-send and MPI-receive primitives. The MPI-send/receive models are used to derive L, o, and G from simple two-node micro-benchmarks. Other model parameters are obtained by measuring small application problem sizes on four SP nodes. Results show that the LogGP model predicts, in seconds and with a high degree of accuracy, measured application execution time for large problems running on 128 nodes. Detailed performance projections are provided for very large future processor configurations that are expected to be available to the application developers. These results indicate that scaling beyond one or two thousand nodes yields greatly diminished improvements in execution time, and that synchronization delays are a principal factor limiting the scalability of the application.
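As a rough illustration of the parameters named in this abstract, the sketch below uses the textbook LogGP cost of a single k-byte message (sender overhead o, per-byte gap G, latency L, receiver overhead o). It is not the paper's detailed MPI-send/receive model, and the function names and sample timings are hypothetical.

```python
def loggp_message_time(k_bytes, L, o, G):
    """Textbook LogGP estimate of the end-to-end time of one k-byte message.

    Assumption: plain o + (k-1)*G + L + o; the paper's MPI-send and
    MPI-receive models are more detailed than this.
    """
    if k_bytes < 1:
        raise ValueError("message must contain at least one byte")
    return o + (k_bytes - 1) * G + L + o


def estimate_G(sample_small, sample_large):
    """Estimate the per-byte gap G from two (bytes, one_way_time) samples.

    Assumption: both samples come from the same two-node micro-benchmark,
    so o and L cancel and only the slope in message size remains.
    """
    (k1, t1), (k2, t2) = sample_small, sample_large
    if k1 == k2:
        raise ValueError("samples must use different message sizes")
    return (t2 - t1) / (k2 - k1)


if __name__ == "__main__":
    # Hypothetical timings in microseconds, for illustration only.
    G = estimate_G((1, 14.0), (101, 64.0))
    print(G, loggp_message_time(101, L=10.0, o=2.0, G=G))
```

Under this simple form, the difference between two one-way times depends only on G, which is why two message sizes are enough to estimate it.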
The proceedings contain 25 papers. Topics discussed include data and task parallelism, irregular applications, coherence protocols, shared memory, compilers, and performance issues.
ISBN: 9780897919067 (print)
We present a tool for mesh-partitioning parallelization of numerical programs working iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is then run in SPMD mode on a parallel architecture with distributed memory. It is necessary to add calls to communication routines at a few carefully selected locations in the code. The tool presented here uses the data-dependence information to mechanize the placement of these synchronizations. Additionally, we see that there is not a unique solution for placing these synchronizations, and performance depends on this choice.
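The overlap on sub-mesh boundaries can be pictured with a minimal sketch: given mesh edges and an assignment of nodes to sub-meshes, it collects, for each sub-mesh, the neighbouring nodes owned elsewhere whose values would have to be communicated. The function name and data layout are hypothetical; the tool described above derives communication placement from data-dependence analysis, which this sketch does not attempt.

```python
from collections import defaultdict


def overlap_nodes(edges, owner):
    """Return {sub_mesh: set of nodes it needs but does not own}.

    edges: iterable of (u, v) node pairs of the unstructured mesh.
    owner: dict mapping each node to the sub-mesh that owns it.
    Assumption: a one-edge-deep overlap is sufficient.
    """
    overlap = defaultdict(set)
    for u, v in edges:
        pu, pv = owner[u], owner[v]
        if pu != pv:
            # The edge crosses a partition boundary: each side needs
            # a copy of the other side's endpoint.
            overlap[pu].add(v)
            overlap[pv].add(u)
    return dict(overlap)


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3)]
    owner = {0: 0, 1: 0, 2: 1, 3: 1}
    print(overlap_nodes(edges, owner))
```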
Solving large problems is an important goal for parallel machines with multiple CPU and memory resources. This paper addresses the efficient execution of overhead-sensitive parallel irregular computations under memory constraints. The irregular parallelism is modeled by task dependence graphs with mixed granularities, and the trade-off in achieving both time and space efficiency is investigated. The main difficulty in designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis of correctness and performance is provided. This work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed memory machines. The effectiveness of the proposed techniques is verified on sparse Cholesky and LU factorization with partial pivoting. Experimental results on the Cray T3D show that solvable problem sizes can be increased substantially under limited memory capacities, and that the loss of execution efficiency caused by the extra memory management overhead is reasonable.
ISBN: 9780897919067 (print)
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach exploits the underlying system's cache coherence protocol to detect data sharing patterns that indicate potential performance bottlenecks and presents performance measurements in a data-centric manner. As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel algorithm for BDD construction targeted at shared memory multiprocessors and distributed shared memory systems. This algorithm focuses on improving memory access locality through specialized memory managers and partial breadth-first expansion, and on improving processor utilization through dynamic load balancing. The results on a shared memory system show speedups of over two on four processors and speedups of up to four on eight processors. The measured results clearly identify the main source of bottlenecks and point out some interesting directions for further improvements.
Many of today's high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. This paper presents a scheduling algorithm that is provably space-efficient and time-efficient for nested parallel languages. In addition to proving the space and time bounds of the parallel schedule generated by the algorithm, we demonstrate that it is efficient in practice. We have implemented a runtime system that uses our algorithm to schedule parallel threads. The results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communication and nonuniform computations associated with the elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with the proper language/compiler/runtime support. Our data parallel N-body code provides a much-needed 'benchmark' code for evaluating and improving HPF compilers.
High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task parallelism is important for the continued success of HPF for parallel programming. This paper presents a task parallelism model that is simple, elegant, and relatively easy to implement in an HPF environment. Task parallelism is exploited by mechanisms for dividing processors into subgroups and mapping computations and data onto processor subgroups. This model of task parallelism has been implemented in the Fx compiler at Carnegie Mellon University. The paper addresses the main issues in compiling integrated task and data parallel programs and reports on the use of this model for programming various flat and nested task structures. Performance results are presented for a set of programs spanning signal processing, image processing, computer vision, and environment modeling. A variant of this task model is a newly approved extension of HPF, and this paper offers insight into the power of expression and ease of implementation of this extension.
ISBN: 9780897919067 (print)
Multimedia applications operate on data streams. A large class of multimedia applications is described by the macro-dataflow graph model. This study examines how such multimedia applications can be compiled to run efficiently on parallel machines, optimizing both throughput (T) and latency (L), using two techniques based on task speedup functions. The first technique chooses an appropriate pipeline structure for the system, while the second exploits the dataset parallelism intrinsic in the periodic data stream and runs multiple datasets in parallel (task/cluster multiplicity) for each clustering. Both techniques were used to compile real-time image-processing problems on an NCUBE-2 multiprocessor, and both showed substantial performance gains.
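The interplay of throughput, latency, and multiplicity can be sketched with an idealized pipeline model. The function below is an illustrative assumption, not the study's method: it ignores communication costs and the task speedup functions mentioned above, and treats each replica of a stage as processing whole datasets independently.

```python
def pipeline_metrics(stage_times, replicas=None):
    """Return (throughput, latency) of an idealized pipeline.

    stage_times: time each stage (cluster) needs for one dataset.
    replicas: hypothetical multiplicity of each stage; defaults to 1 each.
    Assumption: replication raises throughput but does not shorten the
    time a single dataset spends in a stage.
    """
    if replicas is None:
        replicas = [1] * len(stage_times)
    if len(replicas) != len(stage_times):
        raise ValueError("one replica count is needed per stage")
    latency = sum(stage_times)
    # The slowest effective stage sets the period between finished datasets.
    period = max(t / r for t, r in zip(stage_times, replicas))
    return 1.0 / period, latency


if __name__ == "__main__":
    print(pipeline_metrics([2.0, 4.0, 1.0]))
    print(pipeline_metrics([2.0, 4.0, 1.0], replicas=[1, 2, 1]))
```

In this model, doubling the multiplicity of the slowest stage doubles throughput while leaving per-dataset latency unchanged.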