this paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based...
详细信息
ISBN:
(纸本)9781450332057
this paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
Withthe increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verificatio...
详细信息
Withthe increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. this paper presents a parallel algorithm for BDD construction targeted at shared memory multiprocessors and distributed shared memory systems. this algorithm focuses on improving memory access locality through specialized memory managers and partial breadth-first expansion, and on improving processor utilization through dynamic load balancing. the results on a shared memory system show speedups of over two on four processors and speedups of up to four on eight processors. the measured results clearly identify the main source of bottlenecks and point out some interesting directions for further improvements.
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures (2) a data parallel im...
详细信息
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, (3) techniques for expressing irregular communication and nonuniform computations associated withthe elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible withthe proper language/compiler/runtime support. Our data parallel N-body code provides a much needed 'benchmark' code for evaluating and improving HPF compilers.
Many of today's high level parallel languages support dynamic, fine-grained parallelism. these languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than...
详细信息
Many of today's high level parallel languages support dynamic, fine-grained parallelism. these languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. this paper presents a scheduling algorithm that is provably space-efficient and time-efficient for nested parallel languages. In addition to proving the space and time bounds of the parallel schedule generated by the algorithm, we demonstrate that it is efficient in practice. We have implemented a runtime system that uses our algorithm to schedule parallelthreads. the results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.
作者:
Siegl, Kurt
Johannes Kepler University LinzA-4040 Austria
||MAPLE|| (speak: parallel Maple) is a portable system for parallel symbolic computation. the system is built as an interface between the parallel declarative programming language Strand and the sequential computer al...
详细信息
ISBN:
(纸本)0897915895
||MAPLE|| (speak: parallel Maple) is a portable system for parallel symbolic computation. the system is built as an interface between the parallel declarative programming language Strand and the sequential computer algebra system Maple, thus providing the elegance of Strand and the power of the existing sequential algorithms in Maple. the implementation of difFerent parallelprogramming paradigms shows that it is fairly easy to parallelize even complex algebraic algorithms using this system. Sample applications (among them algorithms solving multivariate nonlinear equation systems) are implemented on various parallel architectures. For example a straightforward parallelization of the complex and important problem of real root isolation has been parallelized using a generic Strand program of fewer than 20 Unes of code and a slight modification of 5 lines in the original sequential Maple source. Even with such a simple modification we gained a speed-up of 5 times, that is better than those reported by others in the literature.
Map-Reduce is a programming model that has its roots in early functional programming. In addition to producing short and elegant code for problems involving lists or collections, this model has proven very useful for ...
详细信息
In this paper, we describe a parallel Branch-and-Bound (B&B) algorithm with a history-based domination technique, and we apply it to the Sequential Ordering Problem (SOP). To the best of our knowledge, the propose...
详细信息
ISBN:
(纸本)9781450392044
In this paper, we describe a parallel Branch-and-Bound (B&B) algorithm with a history-based domination technique, and we apply it to the Sequential Ordering Problem (SOP). To the best of our knowledge, the proposed algorithm is the first parallel B&B algorithm that includes a history-based domination technique and is the first parallel B&B algorithm for solving the SOP using a pure B&B approach. the proposed algorithm takes a pool-based approach and employs a collection of novel techniques that we have developed to achieve effective parallel exploration of the solution space, including parallel history domination, history table memory management, and a thread restart technique. the proposed algorithm was experimentally evaluated using the SOPLIB and TSPLIB benchmarks. the results show that using ten threads with a time limit of one hour on the medium-difficulty instances, the proposed algorithm gives a geometric-mean speedup of 19.9 on SOPLIB and 10.23 on TSPLIB, with super-linear speedups up to 65x seen on 17 instances.
Multimedia applications operate on downstreams. A large class of multimedia applications is described by the macro-dataflow graph model. this study attempted to examine how such multimedia applications can be compiled...
详细信息
ISBN:
(纸本)9780897919067
Multimedia applications operate on downstreams. A large class of multimedia applications is described by the macro-dataflow graph model. this study attempted to examine how such multimedia applications can be compiled to run efficiently on parallel machines, by optimizing boththroughput (T) and latency (L), using two techniques based on task speedup functions. the first step chooses an appropriate pipeline structure for the system while the second exploits the dataset parallelism intrinsic in the period datastream, and runs multiple datasets in parallel (task/cluster multiplicity) for each clustering. Both techniques were used to compile real-time image-processing problems on an NCUBE-2 multiprocessor. the two techniques showed substantial performance gains.
Scalable busy-wait synchronization algorithms are essential for achieving good parallel program performance on large scale multiprocessors. Such algorithms include mutual exclusion locks, reader-writer locks, and barr...
详细信息
Scalable busy-wait synchronization algorithms are essential for achieving good parallel program performance on large scale multiprocessors. Such algorithms include mutual exclusion locks, reader-writer locks, and barrier synchronization. Unfortunately, scalable synchronization algorithms are particularly sensitive to the effects of multiprogramming: their performance degrades sharply when processors are shared among different applications, or even among processes of the same application. In this paper we describe the design and evaluation of scalable scheduler-conscious mutual exclusion locks, reader-writer locks, and barriers, and show that by sharing information across the kernel/application interface we can improve the performance of scheduler-oblivious implementations by more than an order of magnitude.
A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley threaded Abstract Machi...
详细信息
A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley threaded Abstract Machine. One implementation takes an Active Messages approach, maintaining a scheduling hierarchy in software in order to improve data cache performance. Another approach relies on the J-Machine's message queues and fast task switch, lowering the control costs at the expense of data locality. Our analysis measures the costs and benefits of each approach, for a variety of programs and cache configurations. the Active Messages implementation is strongest when miss penalties are high and for the finest-grained programs. the hardware-buffered implementation is strongest in direct-mapped caches, where it achieves substantially better instruction cache performance.
暂无评论