In this paper, we present the design and evaluation of a compiler system, called APE, for automatic parallelization of scientific and engineering applications on distributed memory computers. APE is built on top of the SUIF compiler. It extends SUIF with capabilities for parallelizing loops with non-uniform cross-iteration dependencies and for handling loops that have indirect access patterns. We have evaluated the effectiveness of SUIF with several CFD test codes, and found that SUIF handles uniform loops over dense and regular data structures very well. For non-uniform loops, an innovative and efficient parallelization approach based on convex theory has been proposed and is being implemented. We also present a class of scalable algorithms for parallel distribution and redistribution of unstructured data structures when parallelizing irregular loops.
A complete implementation of MPI for the Fujitsu AP1000+ is presented. The library can employ a number of different mechanisms in implementing the send and receive message passing operations. The arrival of new messages can be detected through either interrupt-driven or polling techniques. Message data is transferred either by sending it directly to the receiver "in-place", or by using a rendezvous method which allows the use of a fast noncopying nonblocking remote-fetching operation. The MPI library exhibits good performance compared to the native message passing library, and allows the user to decide at runtime which mechanisms will be used in order to achieve the best performance on a per-application basis.
PASM is a concept for a parallel processing system that allows experimentation with different architectural design alternatives. PASM is dynamically reconfigurable along three dimensions: partitionability into independent or communicating submachines, variable interprocessor connections, and mixed-mode SIMD/MIMD parallelism. With mixed-mode parallelism, a program can switch between SIMD (synchronous) and MIMD (asynchronous) parallelism at instruction-level granularity, allowing the use of both modes in a single machine. The PASM concept is presented, showing the ways in which reconfiguration can be accomplished. Trade-offs among SIMD, MIMD, and mixed-mode parallelism are explored. The small-scale PASM prototype with 16 processing elements is described. The ELP mixed-mode programming language used on the prototype is discussed. An example of a prototype-based study that demonstrates the potential of mixed-mode parallelism is given.
ISBN: (print) 0818674601
In this paper, we describe an object-based distributed shared memory called Adsmith. In an object-based DSM, the shared memory consists of many shared objects, through which the shared memory is accessed. Adsmith is built on top of PVM at the library layer using C++. PVM is used as the communication subsystem because it is a de facto standard and encapsulates many system-related details. Several mechanisms are used to improve the performance of Adsmith, including release memory consistency, load/store-like memory accesses, nonblocking accesses, and atomic operations. Performance results show that even though Adsmith is implemented on top of PVM, programs running on Adsmith can achieve performance comparable with those running directly on PVM.
A concurrent partitioner for partitioning unstructured finite element meshes on distributed memory architectures is developed. The partitioner uses an element-based partitioning strategy. Its main advantage over the more conventional node-based partitioning strategy is its modular programming approach to the development of parallel applications. The partitioner first partitions element centroids using a recursive inertial bisection algorithm. Elements and nodes then migrate according to the partitioned centroids, using a data request communication template for unpredictable incoming messages. Our scalable implementation is contrasted with a non-scalable implementation that is a straightforward parallelization of a sequential partitioner. The algorithms adopted in the partitioner scale logarithmically, as confirmed by actual timing measurements on up to 512 processors of the Intel Delta for scaled-size problems.
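The centroid-partitioning step can be illustrated with a minimal sequential sketch, assuming 2D centroids and a median split along the principal axis of each subset. The function names `inertial_bisect` and `recursive_inertial_bisection` are hypothetical; this is not the concurrent implementation from the abstract, and the migration of elements and nodes is not shown.

```python
import math


def inertial_bisect(points, ids):
    """Split the index list `ids` into two halves along the principal axis."""
    n = len(ids)
    cx = sum(points[i][0] for i in ids) / n
    cy = sum(points[i][1] for i in ids) / n
    # Entries of the 2x2 scatter (inertia) matrix about the center of mass.
    sxx = sum((points[i][0] - cx) ** 2 for i in ids)
    syy = sum((points[i][1] - cy) ** 2 for i in ids)
    sxy = sum((points[i][0] - cx) * (points[i][1] - cy) for i in ids)
    # Direction of largest spread of the point set.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ax, ay = math.cos(theta), math.sin(theta)
    # Project onto that axis and cut at the median.
    ordered = sorted(
        ids,
        key=lambda i: (points[i][0] - cx) * ax + (points[i][1] - cy) * ay,
    )
    half = n // 2
    return ordered[:half], ordered[half:]


def recursive_inertial_bisection(points, levels):
    """Return part[i] in range(2**levels) for every centroid index i."""
    part = [0] * len(points)

    def recurse(ids, level, label):
        if level == 0 or len(ids) < 2:
            for i in ids:
                # Shift so that labels from early stops never collide.
                part[i] = label << level
            return
        left, right = inertial_bisect(points, ids)
        recurse(left, level - 1, 2 * label)
        recurse(right, level - 1, 2 * label + 1)

    recurse(list(range(len(points))), levels, 0)
    return part
```

Calling `recursive_inertial_bisection(centroids, k)` assigns each centroid to one of up to 2**k parts; in an element-based scheme, each element would then follow its centroid to the owning processor.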
ISBN: (print) 9780818675829
Evaluates the High Performance Fortran (HPF) language for the compact expression and efficient implementation of conjugate-gradient iterative matrix solvers on high-performance computing and communications (HPCC) platforms. We discuss the use of intrinsic functions, data distribution directives and explicitly parallel constructs to optimize performance by minimizing communication requirements in a portable manner. We focus on implementations using the existing HPF definitions but also discuss issues that may influence a revised definition for HPF-2. Some of the codes discussed are available on the World Wide Web at http://***/hpfa/, along with other educational and discussion material related to applications in HPF.
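For orientation, the textbook conjugate-gradient iteration can be sketched as follows. This is a plain sequential Python sketch for a symmetric positive-definite matrix stored as a list of rows, not the HPF code the abstract refers to; the helper names `dot`, `matvec`, and `conjugate_gradient` are illustrative assumptions. In a distributed setting, the dot products and the matrix-vector product are the steps that would require communication.

```python
def dot(u, v):
    """Inner product; a global reduction when vectors are distributed."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product, with A stored as a list of rows."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)                       # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old                            # direction update
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Each iteration needs one matrix-vector product and two inner products, which is why data distribution and reduction intrinsics dominate the communication cost of this kind of solver.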
We study scalable parallel computational geometry algorithms for the coarse grained multicomputer model: p processors solving a problem on n data items, where each processor has O(n/p) ≫ O(1) local memory and all processors are connected via some arbitrary interconnection network (e.g. mesh, hypercube, fat tree). We present O(T_sequential/p + T_s(n,p)) time scalable parallel algorithms for several computational geometry problems, where T_s(n,p) refers to the time of a global sort operation. Our results are independent of the multicomputer's interconnection network. Their time complexities become optimal when T_sequential/p dominates T_s(n,p) or when T_s(n,p) is optimal. This is the case for several standard architectures, including meshes and hypercubes, and a wide range of ratios n/p that include many of the currently available machine configurations. Our methods also have some important practical advantages. For interprocessor communication, they use only a small, fixed number of calls to a single global routing operation, global sort; all other programming is in the sequential domain. Furthermore, our algorithms use only a small number of very large messages, which greatly reduces the overhead for the communication protocol between processors. (Note, however, that our time complexities account for the lengths of messages.) Experiments show that our methods are easy to implement and give good timing results.
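The role of the global sort as the only communication primitive can be sketched with a small single-process simulation. The abstract does not say which sorting algorithm is used; the sketch below assumes a sample sort with regular sampling, and the name `cgm_global_sort` is hypothetical. Each inner list stands for the local memory of one processor, and the sketch does not enforce the O(n/p) memory bound.

```python
import bisect


def cgm_global_sort(local_blocks):
    """Simulate one global sort over p processors holding about n/p keys each."""
    p = len(local_blocks)
    # Local step: every processor sorts its own block sequentially.
    sorted_blocks = [sorted(block) for block in local_blocks]
    # Communication step: gather p regular samples from each non-empty block.
    samples = sorted(
        block[(k * len(block)) // p]
        for block in sorted_blocks if block
        for k in range(p)
    )
    if not samples:
        return [[] for _ in range(p)]
    # Choose p - 1 splitters that divide the key range among the processors.
    splitters = [samples[(k * len(samples)) // p] for k in range(1, p)]
    # Communication step: route every key to the processor owning its range.
    received = [[] for _ in range(p)]
    for block in sorted_blocks:
        for key in block:
            received[bisect.bisect_right(splitters, key)].append(key)
    # Local step: every processor sorts the keys it received.
    return [sorted(block) for block in received]
```

After the call, concatenating the blocks in processor order yields the keys in sorted order, so any further work on each block can proceed with purely sequential code.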
The paper describes, from a software engineering perspective, a framework for the formal development of parallel algorithms on arbitrary architectures. The algorithms are synthesised in a transformational way, i.e. by a...
ISBN: (print) 081867038X
The proceedings contain 42 papers. The topics discussed include: improvement of duplication scheduling heuristic algorithm with nonstrict triggering of program graph nodes; cohesion: an efficient distributed shared memory system supporting multiple memory consistency models; supercompilers for massively parallel architectures; investigation of some hardware accelerators for relational algebra operations; implementing higher-order gamma on MasPar: a case study; a framework for visual parallel programming; parallelizing a PDE solver: experiences with PISCES-MP; efficient scalable mesh algorithms for merging, sorting and selection; and constructing parallel implementations with algebraic programming tools.
Author: Bode, Arndt (Institut für Informatik, Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München, München D-80290, Germany)
This article covers research at Technische Universität München on distributed and parallel architectures and applications. First, an overview of the parallel processing research organization is given. The se...