ISBN: 9781581131000 (print)
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct execution-driven, parallel simulator for performance prediction of programs that include both communication-intensive and I/O-intensive applications. The simulator has been used to predict the performance of such applications on both distributed-memory machines like the IBM SP and shared-memory machines like the SGI Origin 2000. The paper illustrates the usefulness of COMPASS as a versatile performance prediction tool. We use both real-world applications and synthetic benchmarks to study application scalability, sensitivity to communication latency, and the interplay between factors like communication pattern and parallel file system caching on application performance. We also show that the simulator is accurate in its predictions and that it uses parallel simulation efficiently to reduce its own execution time, in some cases achieving near-linear speedup.
ISBN: 9781581131000 (print)
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs. This is because traditional compilers for sequential programs do not account for the updates to a shared variable by different threads. We present a concurrent static single assignment (CSSA) form for parallel programs containing cobegin/coend and parallel do constructs and post/wait synchronization primitives. Based on the CSSA form, we present copy propagation and dead code elimination techniques. Also, a global value numbering technique that detects equivalent variables in parallel programs is presented. By using global value numbering and the CSSA form, we extend classical common subexpression elimination, redundant load/store elimination, and loop invariant detection to parallel programs without violating sequential consistency. These optimization techniques are the most commonly used techniques for sequential programs. By extending these techniques to parallel programs, we can guarantee the correctness of the optimized program and maintain single processor performance in a multiprocessor environment.
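The claim that sequential transformations can break parallel programs can be illustrated with a small, hypothetical example that is not taken from the paper. The sketch below enumerates every interleaving of two threads sharing a variable `x` and compares the outcomes of `a = x; b = a` with those of the naively copy-propagated `a = x; b = x`. All names (`interleavings`, `outcomes`, the step functions) are assumptions made for illustration, and the model is a plain interleaving of atomic steps rather than the CSSA form the abstract describes.

```python
from itertools import combinations


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that preserves each thread's own order."""
    n, m = len(t1), len(t2)
    for slots in combinations(range(n + m), n):
        slot_set = set(slots)
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if i in slot_set else next(it2) for i in range(n + m)]


def outcomes(t1, t2):
    """Run every interleaving from x = 0 and collect the final (a, b) pairs."""
    results = set()
    for schedule in interleavings(t1, t2):
        state = {"x": 0, "a": None, "b": None}
        for step in schedule:
            step(state)
        results.add((state["a"], state["b"]))
    return results


# Thread 1, original code: a = x; b = a
def load_a(s):
    s["a"] = s["x"]


def copy_b_from_a(s):
    s["b"] = s["a"]


# Thread 1 after naive (sequential-style) copy propagation: b = x
def copy_b_from_x(s):
    s["b"] = s["x"]


# Thread 2: x = 1
def store_x(s):
    s["x"] = 1


original = outcomes([load_a, copy_b_from_a], [store_x])
transformed = outcomes([load_a, copy_b_from_x], [store_x])

# Outcomes the transformed program allows but the original never produces.
print(sorted(transformed - original))
```

In this toy model the transformed program admits an outcome in which `a` and `b` differ, which the original program can never produce; this is the kind of behaviour a parallel-aware analysis has to rule out.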
ISBN: 9781581131000 (print)
In comparison to automatic parallelization, which is thoroughly studied in the literature [31, 33], classical analyses and optimizations of explicitly parallel programs have been more or less neglected. This may be due to the fact that naive adaptations of the sequential techniques fail [24], while straightforward correct adaptations have unacceptable costs caused by the interleavings, which manifest the possible executions of a parallel program. Recently, however, we showed that unidirectional bitvector analyses can be performed for parallel programs as easily and as efficiently as for sequential ones [17], a necessary condition for the successful transfer of the classical optimizations to the parallel setting. In this article we focus on possible subsequent code motion transformations, which turn out to require much more care than originally conjectured [17]. Essentially, this is due to the fact that interleaving semantics, although adequate for correctness considerations, fails when it comes to reasoning about the efficiency of parallel programs. This deficiency, however, can be overcome by strengthening the specific treatment of synchronization points.
ISBN: 9781581131000 (print)
Realistic interactive multimedia involving vision, animation, and multimedia collaboration is likely to become an important aspect of future computer applications. The scalable parallelism inherent in such applications, coupled with their computational demands, makes them ideal candidates for SMPs and clusters of SMPs. These applications have novel requirements that offer new kinds of challenges for parallel system builders. We have designed a programming system called Stampede that offers many functionalities needed to simplify development of such applications (such as high-level data sharing abstractions, dynamic cluster-wide threads, and multiple address spaces). We have built Stampede and it runs on clusters of SMPs. To date we have implemented two applications on Stampede, one of which is discussed here. In this paper we describe a part of Stampede called Space-Time Memory (STM). It is a novel data sharing abstraction that enables interactive multimedia applications to manage a collection of time-sequenced data items simply, efficiently, and transparently across a cluster. STM relieves the application programmer of low-level synchronization and data communication by providing a high-level interface that subsumes buffer management, inter-thread synchronization, and location transparency for data produced and accessed anywhere in the cluster. STM also automatically handles garbage collection of data items that will no longer be accessed by any of the application threads. We discuss ease-of-use issues for developing applications using STM, and present preliminary performance results to show that STM's overhead is low.
ISBN: 9781581131000 (print)
This paper develops a highly accurate LogGP model of a complex wavefront application that uses MPI communication on the IBM SP/2. Key features of the model include: (1) elucidation of the principal wavefront synchronization structure, and (2) explicit high-fidelity models of the MPI-send and MPI-receive primitives. The MPI-send/receive models are used to derive L, o, and G from simple two-node micro-benchmarks. Other model parameters are obtained by measuring small application problem sizes on four SP nodes. Results show that the LogGP model predicts, in seconds and with a high degree of accuracy, measured application execution time for large problems running on 128 nodes. Detailed performance projections are provided for very large future processor configurations that are expected to be available to the application developers. These results indicate that scaling beyond one or two thousand nodes yields greatly diminished improvements in execution time, and that synchronization delays are a principal factor limiting the scalability of the application.
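For orientation, the parameters L, o, and G mentioned above enter the textbook LogGP cost of a single point-to-point message as sender overhead, per-byte gap, network latency, and receiver overhead. The sketch below encodes only that generic formula; it is not the paper's higher-fidelity MPI-send/receive model, which the abstract does not spell out. The function name and the parameter values are hypothetical.

```python
def loggp_message_time(k_bytes, L, o, G):
    """End-to-end time of one k-byte message under the textbook LogGP model.

    L: network latency, o: per-message CPU overhead (paid once by the sender
    and once by the receiver), G: gap per byte for long messages.
    All times must use the same unit.
    """
    if k_bytes < 1:
        raise ValueError("a message must contain at least one byte")
    # sender overhead + (k - 1) per-byte gaps + wire latency + receiver overhead
    return o + (k_bytes - 1) * G + L + o


# Hypothetical parameter values in microseconds, chosen only for illustration.
example = loggp_message_time(k_bytes=101, L=10.0, o=2.0, G=0.5)
print(example)
```

With these made-up values the per-byte term dominates, which is the regime in which G matters more than L or o.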
The proceedings contain 25 papers. Topics discussed include data and task parallelism, irregular applications, coherence protocols, shared memory, compilers, and performance issues.
ISBN: 9780897919067 (print)
We present a tool for mesh-partitioning parallelization of numerical programs working iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is then run in SPMD mode on a parallel architecture with distributed memory. It is necessary to add calls to communication routines at a few carefully selected locations in the code. The tool presented here uses the data-dependence information to mechanize the placement of these synchronizations. Additionally, we see that there is not a unique solution for placing these synchronizations, and performance depends on this choice.
Solving large problems is an important goal for parallel machines with multiple CPU and memory resources. This paper addresses the efficient execution of overhead-sensitive parallel irregular computation under memory constraints. The irregular parallelism is modeled by task dependence graphs with mixed granularities. The trade-off in achieving both time and space efficiency is investigated. The main difficulty of designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis of correctness and performance is provided. This work is implemented in the context of the RAPID system [5], which provides run-time support for parallelizing irregular code on distributed-memory machines. The effectiveness of the proposed techniques is verified on sparse Cholesky factorization and on LU factorization with partial pivoting. The experimental results on the Cray T3D show that solvable problem sizes can be increased substantially under limited memory capacities and that the loss of execution efficiency caused by the extra memory management overhead is reasonable.
ISBN: 9780897919067 (print)
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach exploits the underlying system's cache coherence protocol to detect data sharing patterns that indicate potential performance bottlenecks, and it presents performance measurements in a data-centric manner. As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel algorithm for BDD construction targeted at shared memory multiprocessors and distributed shared memory systems. This algorithm focuses on improving memory access locality through specialized memory managers and partial breadth-first expansion, and on improving processor utilization through dynamic load balancing. The results on a shared memory system show speedups of over two on four processors and speedups of up to four on eight processors. The measured results clearly identify the main source of bottlenecks and point out some interesting directions for further improvements.
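The reported speedups can be put in perspective by converting them to parallel efficiency, that is, speedup divided by processor count. The short sketch below does this arithmetic for the two boundary figures quoted above (a speedup of two on four processors and of four on eight); the function name is an assumption, and the actual measured values in the paper may differ from these bounds.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    if processors < 1:
        raise ValueError("processor count must be positive")
    return speedup / processors


# Boundary values taken from the abstract's wording, not measured data.
eff_four = parallel_efficiency(2.0, 4)   # "over two" means slightly above this
eff_eight = parallel_efficiency(4.0, 8)  # "up to four" means at most this
print(eff_four, eff_eight)
```

Read this way, the results correspond to an efficiency of roughly one half on both configurations, which is consistent with the abstract's remark that bottlenecks remain to be addressed.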