In this paper, we analyse and compare different parallel implementations of the Boundary Element Method on distributed memory computers. We deal with the computation of two-dimensional magnetostatic problems. The resulting linear system is solved using Householder transformations and Gaussian elimination. Experimental results are obtained on a Meiko Computing Surface with 32 T800 transputers.
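The abstract does not include code, but the kind of dense direct solve it mentions can be illustrated with a minimal sequential sketch of Gaussian elimination with partial pivoting. The small matrix and right-hand side below are placeholders, not the BEM system from the paper.

```c
#include <math.h>
#include <stdio.h>

#define N 3

/* Solve A x = b by Gaussian elimination with partial pivoting.
   Plain sequential sketch, not the paper's parallel BEM solver. */
static void gauss_solve(double a[N][N], double b[N], double x[N])
{
    for (int k = 0; k < N; ++k) {
        /* Partial pivoting: pick the row with the largest |a[i][k]|. */
        int p = k;
        for (int i = k + 1; i < N; ++i)
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        if (p != k) {
            for (int j = 0; j < N; ++j) { double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t; }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        /* Eliminate column k below the pivot row. */
        for (int i = k + 1; i < N; ++i) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j < N; ++j) a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }
    /* Back substitution on the resulting upper triangular system. */
    for (int i = N - 1; i >= 0; --i) {
        double s = b[i];
        for (int j = i + 1; j < N; ++j) s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }
}

int main(void)
{
    double a[N][N] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}}; /* placeholder system */
    double b[N] = {1, 2, 3};
    double x[N];
    gauss_solve(a, b, x);
    for (int i = 0; i < N; ++i) printf("x[%d] = %f\n", i, x[i]);
    return 0;
}
```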
The parallel 'Deutschland-Modell' and its implementation on distributed-memory parallel computers using the message passing library PARMACS 6.0 is described. Performance results on a Cray T3D are given and the problem of dynamic load imbalances is addressed. (C) 1997 Elsevier Science B.V.
In this paper a set of programming constructs for the implementation of data parallel algorithms on distributed-memory parallel computers is proposed. The load balancing problem for data parallel programs is cast in a special form, and its relation to the general load balancing problem is analyzed. The applicability of these constructs is demonstrated for a number of grid-oriented numerical applications. A software tool provides run-time support for data parallel programs based on the proposed constructs. While the application, following the data parallel programming paradigm, partitions the grid, the tool assigns the partitions to the processors using built-in mapping algorithms. The approach is general enough to accommodate data parallel algorithms with varying communication structure and variable calculation requirements using pseudo-dynamic load balancing strategies.
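The mapping step described in this abstract, where a tool assigns application-defined partitions to processors, can be illustrated by a simple greedy heuristic: place each partition, in order of decreasing workload, on the currently least-loaded processor. This is only a hedged sketch of one possible mapping algorithm; the workloads and counts are invented for illustration.

```c
#include <stdio.h>

#define NPART 8
#define NPROC 3

/* Greedy mapping sketch: assign partitions (with given workloads) to the
   least-loaded processor, taking partitions in order of decreasing work.
   All numbers are illustrative. */
int main(void)
{
    double work[NPART] = {9.0, 7.5, 6.0, 5.5, 4.0, 3.0, 2.0, 1.5}; /* sorted descending */
    int owner[NPART];
    double load[NPROC] = {0.0};

    for (int k = 0; k < NPART; ++k) {
        int best = 0;
        for (int p = 1; p < NPROC; ++p)
            if (load[p] < load[best]) best = p;   /* least-loaded processor so far */
        owner[k] = best;
        load[best] += work[k];
    }

    for (int k = 0; k < NPART; ++k)
        printf("partition %d -> processor %d\n", k, owner[k]);
    for (int p = 0; p < NPROC; ++p)
        printf("processor %d load: %.1f\n", p, load[p]);
    return 0;
}
```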
Runtime Incremental Parallel Scheduling (RIPS) is an alternative to the commonly used dynamic scheduling. In this strategy, the system scheduling activity alternates with the underlying computation work. RIPS uses advanced parallel scheduling techniques to achieve low-overhead, high-quality load balancing and to adapt to irregular applications. This paper presents methods for scheduling a single job on a dedicated parallel machine.
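As a hedged, toy illustration of the alternating structure described above (not the actual RIPS implementation), the sketch below interleaves a rebalancing phase with a computation phase on two simulated processors until the job is done; all names and numbers are invented.

```c
#include <stdio.h>

/* Toy sketch: a scheduling phase evens out remaining tasks between two
   "processors", then a computation phase consumes a fixed quantum of work.
   Illustrative only; not the RIPS system. */
int main(void)
{
    int tasks[2] = {20, 4};   /* deliberately imbalanced initial work */
    int round = 0;

    while (tasks[0] + tasks[1] > 0) {
        /* Scheduling phase: redistribute the remaining work evenly. */
        int total = tasks[0] + tasks[1];
        tasks[0] = (total + 1) / 2;
        tasks[1] = total / 2;

        /* Computation phase: each processor works on at most a fixed quantum. */
        for (int p = 0; p < 2; ++p) {
            int done = tasks[p] < 5 ? tasks[p] : 5;
            tasks[p] -= done;
        }
        printf("round %d: remaining = %d + %d\n", ++round, tasks[0], tasks[1]);
    }
    return 0;
}
```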
The internal representation of numerical data and the speed of its manipulation through efficient utilisation of the central processing unit, memory, and communication links are essential to all high-performance scientific computation. Machine parameters, in particular, reveal the accuracy and error bounds of computation, which are required for performance tuning of codes. This paper reports the diagnosis of machine parameters, measurement of the computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache and register blocking results in their optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces the cache inefficiency loss, which is known to be proportional to the number of processors. Of the Linux clusters ANUP16, HPC22 and HPC64, it has been found from the measurement of intrinsic parameters and from an application benchmark (a multi-block Euler code test run) that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers, with the added advantages of speed and a high degree of parallelism.
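The cache- and register-blocking idea mentioned in this abstract can be sketched for a dense matrix-vector product: the inner loops sweep a column block small enough that the corresponding slice of the vector stays in cache, while the running sum stays in a register. The matrix size and block size below are arbitrary placeholders, not values from the paper.

```c
#include <stdio.h>

#define N 256   /* placeholder problem size */
#define B 32    /* placeholder block size; tuned to the cache in practice */

static double a[N][N], x[N], y[N];

/* Blocked matrix-vector product y = A*x: process B columns at a time so the
   matching slice of x is reused from cache across all rows of the block. */
static void matvec_blocked(void)
{
    for (int i = 0; i < N; ++i) y[i] = 0.0;
    for (int jb = 0; jb < N; jb += B)
        for (int i = 0; i < N; ++i) {
            double sum = y[i];              /* register-resident accumulator */
            for (int j = jb; j < jb + B; ++j)
                sum += a[i][j] * x[j];
            y[i] = sum;
        }
}

int main(void)
{
    for (int i = 0; i < N; ++i) {
        x[i] = 1.0;
        for (int j = 0; j < N; ++j) a[i][j] = (i == j) ? 2.0 : 0.0;
    }
    matvec_blocked();
    printf("y[0] = %f (expected 2.0)\n", y[0]);
    return 0;
}
```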
This paper describes the parallel implementation of algorithms requiring run-time load redistribution with the aid of the parallel programming library LOCO. As a typical application, a 2D finite volume multiblock Euler/Navier-Stokes code with block-wise adaptive mesh refinement is discussed. The LOCO software handles the communication between blocks and the distribution of blocks among the processors, thereby performing automatic load balancing at run-time. The LOCO library is interfaced with both the native NX communication primitives on Intel iPSC hypercubes and the PVM software on workstation clusters. The parallel performance of the code on the Intel iPSC/860 and on a DEC Alpha workstation cluster is discussed. In particular, the effects of mesh refinement on the load balance are investigated.
In this paper we study possibilities for the reduction of communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m). The performance of these methods on massively parallel distributed-memory machines is often limited by the global communication required for the inner products. We investigate two ways of improvement. One is to assemble the results of a number of inner products collectively. The other is to create situations where communication can be overlapped with computation. The matrix-vector products may also introduce some communication overhead, but for many relevant problems this involves only communication with a few nearby processors, which is easily overlapped as well; so it may, but does not necessarily, further degrade the performance of the algorithm.
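One of the two improvements described above, overlapping the global reduction of an inner product with independent local computation, can be sketched with the MPI-3 non-blocking MPI_Iallreduce: the reduction is started early, local work proceeds, and the result is awaited only where it is needed. This is a hedged, minimal illustration, not the authors' CG/GMRES code.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000

/* Sketch: start a non-blocking global sum for a local dot product, overlap it
   with unrelated local work, then wait for the result only when it is needed. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double u[N], v[N], w[N];
    for (int i = 0; i < N; ++i) { u[i] = 1.0; v[i] = 2.0; w[i] = 0.0; }

    /* Local part of the inner product (u, v). */
    double local = 0.0, global = 0.0;
    for (int i = 0; i < N; ++i) local += u[i] * v[i];

    /* Start the global reduction without blocking. */
    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    /* Overlap: independent local work, e.g. part of a vector update. */
    for (int i = 0; i < N; ++i) w[i] = 0.5 * u[i] + v[i];

    /* The inner product is needed only now. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    if (rank == 0) printf("global inner product = %f\n", global);

    MPI_Finalize();
    return 0;
}
```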
This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed-memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.
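A minimal example of the two classes of routines mentioned above, point-to-point and collective communication, might look like the following; the message contents are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal MPI sketch: a point-to-point send/receive between ranks 0 and 1,
   followed by a collective sum over all ranks. Run with at least 2 processes. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Point-to-point: rank 0 sends an integer to rank 1. */
    if (rank == 0) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int msg;
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", msg);
    }

    /* Collective: every rank contributes its rank number; rank 0 gets the sum. */
    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}
```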
A directive-based parallelization tool called the Scalable Modeling System (SMS) is described. The user inserts directives in the form of comments into existing Fortran code. SMS translates the code and directives into a parallel version that runs efficiently on shared and distributed-memory high-performance computing platforms including the SGI Origin, IBM SP2, Cray T3E, Sun, and Alpha and Intel clusters. Twenty directives are available to support operations including array re-declarations, inter-process communications, loop translations, and parallel I/O operations. SMS also provides tools to support incremental parallelization and debugging that significantly reduce code parallelization time from months to weeks of effort. SMS is intended for applications using regular structured grids that are solved using finite difference approximation or spectral methods. It has been used to parallelize 10 atmospheric and oceanic models, but the tool is sufficiently general that it can be applied to other structured-grid codes. Recent performance comparisons demonstrate that the Eta, Hybrid Coordinate Ocean, and Regional Ocean Modeling System models, parallelized using SMS, perform as well as or better than their OpenMP or Message Passing Interface counterparts. (C) 2003 Elsevier B.V. All rights reserved.
In this paper we consider the problem of solving 3D diffusion problems on distributed memory computers. We present a parallel algorithm that is suitable for up to 8 processors. A pipelining method is used to extend the algorithm to as many as 64 processors. A computational grid decomposition method is proposed for heterogeneous clusters of workstations that preserves load balance across the machines. Numerical results for two clusters of workstations are given.
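The heterogeneous decomposition idea above can be sketched as follows: grid layers along one axis are distributed in proportion to each workstation's measured relative speed, so faster machines receive proportionally larger sub-grids. The speeds and grid size below are invented for illustration; this is not the decomposition code from the paper.

```c
#include <stdio.h>

#define NPROC 4
#define NZ 100   /* placeholder number of grid layers along the split axis */

/* Sketch: split NZ grid layers among NPROC heterogeneous workstations in
   proportion to their relative speeds, handing rounding leftovers to the fastest. */
int main(void)
{
    double speed[NPROC] = {1.0, 1.0, 2.0, 4.0};   /* illustrative relative speeds */
    int layers[NPROC];

    double total = 0.0;
    for (int p = 0; p < NPROC; ++p) total += speed[p];

    int assigned = 0, fastest = 0;
    for (int p = 0; p < NPROC; ++p) {
        layers[p] = (int)(NZ * speed[p] / total);
        assigned += layers[p];
        if (speed[p] > speed[fastest]) fastest = p;
    }
    layers[fastest] += NZ - assigned;   /* give rounding leftovers to the fastest */

    for (int p = 0; p < NPROC; ++p)
        printf("workstation %d (speed %.1f): %d layers\n", p, speed[p], layers[p]);
    return 0;
}
```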