ISBN (print): 9780897919067
We present a tool for mesh-partitioning parallelization of numerical programs that work iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is then run in SPMD mode on a parallel architecture with distributed memory. Calls to communication routines must be added at a few carefully selected locations in the code. The tool presented here uses data-dependence information to mechanize the placement of these synchronizations. We also show that there is no unique solution for placing these synchronizations, and that performance depends on this choice.
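To make the overlap scheme concrete, here is a minimal Python sketch of the ghost-cell pattern the abstract describes, on a 1D mesh for brevity. All names are illustrative, not the tool's API; `exchange_ghosts` stands in for the communication calls the tool places automatically.

```python
def partition(values, nparts):
    """Split a 1D mesh into sub-meshes with one-cell overlap (ghosts)."""
    n = len(values)
    size = n // nparts
    parts = []
    for p in range(nparts):
        lo = max(p * size - 1, 0)          # left ghost cell, if any
        hi = min((p + 1) * size + 1, n)    # right ghost cell, if any
        parts.append(list(values[lo:hi]))
    return parts

def exchange_ghosts(parts):
    """Stand-in for the communication routines the tool would insert."""
    for p in range(len(parts) - 1):
        parts[p][-1] = parts[p + 1][1]     # right ghost <- neighbour's first owned cell
        parts[p + 1][0] = parts[p][-2]     # left ghost  <- neighbour's last owned cell

def smooth_step(part):
    """Local 3-point update; ghost cells supply the boundary data."""
    return [part[0]] + [
        (part[i - 1] + part[i] + part[i + 1]) / 3.0
        for i in range(1, len(part) - 1)
    ] + [part[-1]]

parts = partition([float(i) for i in range(16)], nparts=4)
for _ in range(10):                        # the iterative SPMD loop
    exchange_ghosts(parts)                 # synchronize overlap regions
    parts = [smooth_step(p) for p in parts]
```

Where in the loop the exchange happens is exactly the placement choice the abstract notes is not unique: refreshing ghosts less often trades accuracy or convergence for communication cost.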
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. The irregular parallelism is modeled by task dependence graphs with mixed granularities. The trade-off in achieving both time and space efficiency is investigated. The main difficulty in designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis of correctness and performance is provided. This work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed memory machines, and the effectiveness of the proposed techniques is verified on sparse Cholesky and LU factorization with partial pivoting. The experimental results on the Cray T3D show that solvable problem sizes can be increased substantially under limited memory capacities and that the loss of execution efficiency caused by the extra memory-management overhead is reasonable.
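The scheduling idea can be illustrated with a small sketch. This is not the RAPID system's actual algorithm, only the general flavor of memory-constrained task-graph scheduling: among ready tasks, prefer those that shrink the resident footprint, and never start a task that would exceed the budget. The task names, sizes, and budget below are made up.

```python
import heapq

# task -> (memory it allocates, memory it frees on completion); all invented
TASKS = {"a": (4, 0), "b": (3, 0), "c": (1, 4), "d": (2, 2)}
DEPS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
BUDGET = 6

indegree = {t: len(DEPS[t]) for t in TASKS}
children = {t: [u for u in TASKS if t in DEPS[u]] for t in TASKS}
ready = [(TASKS[t][0] - TASKS[t][1], t) for t in TASKS if indegree[t] == 0]
heapq.heapify(ready)                   # smallest net allocation first

resident, order = 0, []
while ready:
    net, t = heapq.heappop(ready)
    alloc, freed = TASKS[t]
    if resident + alloc > BUDGET:      # a real scheduler would defer, not fail
        raise MemoryError(f"task {t} would exceed the budget")
    resident += alloc - freed          # run the task, then release its inputs
    order.append(t)
    for u in children[t]:
        indegree[u] -= 1
        if indegree[u] == 0:
            heapq.heappush(ready, (TASKS[u][0] - TASKS[u][1], u))

print(order, "- peak resident footprint stayed within", BUDGET)
```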
ISBN (print): 9780897919067
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach exploits the underlying system's cache coherence protocol to detect data sharing patterns that indicate potential performance bottlenecks, and it presents performance measurements in a data-centric manner. As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.
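The data-centric idea can be sketched independently of Paradyn or Blizzard: attribute coherence-protocol events to the program data they touch, so that heavily ping-ponged structures stand out. The trace, addresses, and symbol names below are hypothetical.

```python
from collections import Counter

# hypothetical trace of (event, address) pairs from a coherence protocol
trace = [("invalidate", 0x1000), ("miss", 0x1000), ("invalidate", 0x1000),
         ("miss", 0x2000), ("invalidate", 0x1000), ("miss", 0x1000)]

def owner(addr):
    """Map an address to a program-level name, as a symbol table would."""
    return "shared_counter" if 0x1000 <= addr < 0x2000 else "work_queue"

events = Counter((owner(addr), ev) for ev, addr in trace)
for (name, ev), n in events.most_common():
    print(f"{name:14s} {ev:10s} x{n}")
# a high invalidate count on one structure suggests true or false sharing
```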
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area, and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel algorithm for BDD construction targeted at shared memory multiprocessors and distributed shared memory systems. The algorithm focuses on improving memory access locality through specialized memory managers and partial breadth-first expansion, and on improving processor utilization through dynamic load balancing. The results on a shared memory system show speedups of over two on four processors and of up to four on eight processors. The measured results clearly identify the main source of bottlenecks and point out some interesting directions for further improvement.
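For orientation, here is a minimal sequential sketch of the two core data structures any BDD builder maintains, a unique table and a computed table; the paper's contribution layers per-processor memory managers, partial breadth-first expansion, and dynamic load balancing on top of these. The code is illustrative, not the paper's.

```python
from functools import lru_cache

unique = {}                 # (level, lo, hi) -> node id: the "unique table"
nodes = [("F",), ("T",)]    # node 0 = false terminal, node 1 = true terminal

def mk(level, lo, hi):
    """Hash-cons a node so that equal sub-BDDs share a single id."""
    if lo == hi:                       # redundant test: skip the node
        return lo
    key = (level, lo, hi)
    if key not in unique:
        nodes.append(key)              # the paper pools nodes per level/processor
        unique[key] = len(nodes) - 1
    return unique[key]

def var(level):
    return mk(level, 0, 1)             # false branch, true branch

@lru_cache(maxsize=None)               # doubles as the standard "computed table"
def bdd_and(u, v):
    if u == 0 or v == 0: return 0
    if u == 1: return v
    if v == 1: return u
    lu, ulo, uhi = nodes[u]
    lv, vlo, vhi = nodes[v]
    top = min(lu, lv)                  # smaller level = closer to the root
    a0, a1 = (ulo, uhi) if lu == top else (u, u)
    b0, b1 = (vlo, vhi) if lv == top else (v, v)
    return mk(top, bdd_and(a0, b0), bdd_and(a1, b1))

x, y = var(0), var(1)
print(bdd_and(x, y))                   # one shared node for x AND y
```

Note the recursion here is depth-first; replacing it with (partial) breadth-first expansion is precisely what lets the parallel algorithm group work by level and improve locality.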
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communication and nonuniform computations associated with the elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with the proper language/compiler/runtime support. Our data parallel N-body code provides a much needed 'benchmark' code for evaluating and improving HPF compilers.
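Step (1) of the formulation can be sketched as follows: an irregular tree is flattened into a payload array plus a CSR-style child map, a layout a data-parallel language can block-distribute. This is a schematic Python rendering, not the authors' HPF code.

```python
from collections import deque

# an irregular tree as nested (payload, children) pairs
tree = ("root", [("a", [("a1", []), ("a2", [])]),
                 ("b", [("b1", [])])])

payload, child_ptr, child_idx = [], [0], []

def linearize(node):
    """Breadth-first flattening into payload[] plus a CSR-style child map."""
    ids, queue = {}, deque([node])
    while queue:                       # pass 1: assign consecutive ids
        name, kids = queue.popleft()
        ids[name] = len(payload)
        payload.append(name)
        queue.extend(kids)
    queue = deque([node])
    while queue:                       # pass 2: record children as index ranges
        name, kids = queue.popleft()
        child_idx.extend(ids[k[0]] for k in kids)
        child_ptr.append(len(child_idx))
        queue.extend(kids)

linearize(tree)
# children of node i live at child_idx[child_ptr[i]:child_ptr[i+1]];
# owner-computes data-parallel loops operate directly on these arrays
print(payload, child_ptr, child_idx)
```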
Many of today's high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. This paper presents a scheduling algorithm that is provably space-efficient and time-efficient for nested parallel languages. In addition to proving the space and time bounds of the parallel schedule generated by the algorithm, we demonstrate that it is efficient in practice. We have implemented a runtime system that uses our algorithm to schedule parallel threads. The results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.
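One common route to such space bounds, sketched here only as an illustration and not as this paper's exact algorithm, is to give threads priorities matching their serial depth-first execution order, so the parallel schedule deviates boundedly from the one-processor schedule:

```python
import heapq

ready = []           # min-heap of (dfs_priority, task)
counter = 0          # assigns depth-first-order priorities as tasks are spawned

def spawn(parent_priority, task):
    """Children get priorities just after their parent, in sibling order."""
    global counter
    counter += 1
    heapq.heappush(ready, ((parent_priority, counter), task))

def run(nprocs):
    while ready:
        # each step, the nprocs highest-priority ready tasks run; a plain
        # FIFO queue would run breadth-first instead and can blow up memory
        # on deeply nested parallel loops
        batch = [heapq.heappop(ready) for _ in range(min(nprocs, len(ready)))]
        for prio, task in batch:
            task(prio)

def make_task(name, children=()):
    def task(prio):
        print("run", name)
        for c in children:
            spawn(prio, c)
    return task

leaves = [make_task(f"leaf{i}") for i in range(4)]
spawn((), make_task("root", leaves))
run(nprocs=2)
```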
ISBN (print): 9780897919067
Multimedia applications operate on data streams. A large class of multimedia applications is described by the macro-dataflow graph model. This study examines how such multimedia applications can be compiled to run efficiently on parallel machines by optimizing both throughput (T) and latency (L), using two techniques based on task speedup functions. The first technique chooses an appropriate pipeline structure for the system, while the second exploits the dataset parallelism intrinsic in the periodic data stream and runs multiple datasets in parallel (task/cluster multiplicity) for each clustering. Both techniques were used to compile real-time image-processing problems on an NCUBE-2 multiprocessor and showed substantial performance gains.
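A small worked example (with made-up numbers, not taken from the paper) shows the trade-off the two techniques navigate: with stage times t_i, the pipeline's period is max(t_i) and its latency is sum(t_i); replicating a stage k ways divides its effective period but leaves latency untouched.

```python
stage_time = {"decode": 4.0, "filter": 9.0, "render": 5.0}   # ms per dataset
multiplicity = {"decode": 1, "filter": 3, "render": 1}       # datasets in flight

period = max(t / multiplicity[s] for s, t in stage_time.items())
throughput = 1.0 / period                 # datasets completed per ms
latency = sum(stage_time.values())        # each dataset still visits every stage

print(f"T = {throughput:.3f} datasets/ms, L = {latency:.1f} ms")
# without replication, the 9 ms filter stage caps T at 1/9 per ms; with
# multiplicity 3 the bottleneck moves to the 5 ms render stage
```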
High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task parallelism is important for the continued success of HPF for parallel programming. This paper presents a task parallelism model that is simple, elegant, and relatively easy to implement in an HPF environment. Task parallelism is exploited by mechanisms for dividing processors into subgroups and mapping computations and data onto processor subgroups. This model of task parallelism has been implemented in the Fx compiler at Carnegie Mellon University. The paper addresses the main issues in compiling integrated task and data parallel programs and reports on the use of this model for programming various flat and nested task structures. Performance results are presented for a set of programs spanning signal processing, image processing, computer vision, and environment modeling. A variant of this task model is a newly approved extension of HPF, and this paper offers insight into the expressive power and ease of implementation of this extension.
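A schematic sketch of the core mechanism, written in Python as a stand-in for the Fx/HPF directives; the group sizes, phases, and data below are invented for illustration.

```python
PROCS = list(range(8))

def subgroups(procs, sizes):
    """Divide the processor set into disjoint subgroups of the given sizes."""
    groups, start = [], 0
    for n in sizes:
        groups.append(procs[start:start + n])
        start += n
    return groups

def run_on(group, phase, data):
    """Inside a subgroup the phase runs data-parallel: here, a block
    distribution of `data` across the subgroup's processors."""
    block = (len(data) + len(group) - 1) // len(group)
    return [phase(data[i * block:(i + 1) * block]) for i in range(len(group))]

fft_group, filter_group = subgroups(PROCS, sizes=[4, 4])
# the two phases form a flat task-parallel pipeline over the subgroups
stage1 = run_on(fft_group, phase=sum, data=list(range(64)))
stage2 = run_on(filter_group, phase=max, data=stage1)
print(stage2)
```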
ISBN (print): 9780897919067
This paper presents a new parallel volume rendering algorithm and implementation, based on shear-warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed performance measurements on real machines and simulators to understand performance bottlenecks. This leads us to a new parallel implementation that substantially outperforms and out-scales the old one on a range of shared address space platforms, from bus-based centralized memory machines to hardware-coherent distributed memory machines to networks of computers connected by page-based shared virtual memory. The results demonstrate that real-time volume rendering is promising on general purpose multiprocessors, and they illustrate the utility of tool hierarchies, in conjunction with algorithmic and application knowledge, for understanding memory system interactions and improving parallel algorithms.
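For orientation, here is a toy, axis-aligned, nearest-neighbour compositing loop showing the shear-warp idea the renderer builds on; the final 2D warp step and the paper's parallelization are omitted, and all constants are made up.

```python
# volume[z][y][x] -> (opacity, value); a 4x4x4 toy volume
Z = Y = X = 4
volume = [[[(0.2, z + 1.0) for _ in range(X)] for _ in range(Y)] for z in range(Z)]

shear_x = 0.5                      # from the factored viewing transform
acc = [[(0.0, 0.0) for _ in range(X + Z)] for _ in range(Y)]  # intermediate image

for z in range(Z):                 # front-to-back over volume slices
    dx = int(z * shear_x)          # per-slice integer shear offset
    for y in range(Y):             # inner loops stay cache-friendly scanlines
        for x in range(X):
            a_in, c_in = acc[y][x + dx]
            if a_in >= 0.95:       # early-ray termination
                continue
            a_vox, c_vox = volume[z][y][x]
            # standard front-to-back "over" compositing
            acc[y][x + dx] = (a_in + (1 - a_in) * a_vox,
                              c_in + (1 - a_in) * a_vox * c_vox)

# a cheap 2D warp of `acc` would follow; the paper's parallel versions
# partition this compositing work among processors
print(acc[0][:6])
```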
Parallel algorithm designers need computational models that take first-order system costs into account but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o and P parameters directly from the LogP model and uses them to predict the cost of contention, C. This paper defines the LoPC model and derives its general form for parallel applications that communicate via active messages. Model modifications for systems that implement coherent shared memory abstractions are also discussed. We carry out the analysis for two important classes of applications that have irregular communication. For parallel applications with homogeneous all-to-any communication, such as sparse matrix computations, the analysis yields a simple rule of thumb and insight into contention costs. For parallel client-server algorithms, the LoPC analysis provides a simple and accurate calculation of the optimal allocation of nodes between clients and servers. The LoPC estimates for these applications are shown to be accurate when compared against event-driven simulation and against a sparse matrix computation on the MIT Alewife multiprocessor.
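The flavor of the C term can be conveyed with a back-of-the-envelope queueing estimate. This assumes an M/M/1 service model for simplicity; the paper derives contention differently, so the formula below is an illustration of the idea, not LoPC's actual derivation, and all parameter values are invented.

```python
def request_cost(L, o, P, think_time):
    """One request under LogP-style costs plus queueing delay at the target."""
    lam = (P - 1) / think_time        # aggregate request rate seen by one node
    mu = 1.0 / o                      # service rate: o time units per message
    assert lam < mu, "node saturated: the simple model breaks down"
    wait = lam / (mu * (mu - lam))    # M/M/1 mean time in queue, playing the role of C
    return L + 2 * o + wait           # send overhead + latency + receive overhead + C

for think in (100, 50, 25, 16):
    print(think, round(request_cost(L=5, o=2, P=8, think_time=think), 2))
# as nodes issue requests more often, the contention term comes to dominate L and o
```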