In this paper we study possibilities for reducing the communication overhead introduced by inner products in the iterative solution methods CG and GMRES(m). The performance of these methods on massively parallel distributed-memory machines is often limited by the global communication required for the inner products. We investigate two approaches to improvement. One is to assemble the results of a number of inner products collectively. The other is to create situations where communication can be overlapped with computation. The matrix-vector products may also introduce some communication overhead, but for many relevant problems this involves only communication with a few nearby processors that is easily overlapped as well. So this may, but does not necessarily, further degrade the performance of the algorithm.
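The first idea, assembling several inner products in one global step, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "processors" are plain Python lists, and `allreduce_sum` is a hypothetical stand-in for a global reduction, not a routine from the paper or from any message-passing library.

```python
# Illustrative simulation only: no real communication takes place.
# allreduce_sum is a hypothetical stand-in for one global reduction step.

def allreduce_sum(partials_per_proc):
    """Sum equal-length lists of partial results element-wise."""
    return [sum(column) for column in zip(*partials_per_proc)]

def local_dot(x, y):
    """Inner product of the locally stored pieces of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def batched_inner_products(local_pairs_per_proc):
    """local_pairs_per_proc[p] lists the (x_local, y_local) pairs on processor p.

    Every processor first forms all of its partial inner products; a single
    reduction then assembles all results at once, instead of one reduction
    per inner product.
    """
    partials = [[local_dot(x, y) for x, y in pairs]
                for pairs in local_pairs_per_proc]
    return allreduce_sum(partials)

# Two simulated processors, each holding half of r = [1, 2, 3, 4]
# and half of z = [1, 1, 1, 1]. The pairs are (r, r) and (r, z).
proc0 = [([1.0, 2.0], [1.0, 2.0]), ([1.0, 2.0], [1.0, 1.0])]
proc1 = [([3.0, 4.0], [3.0, 4.0]), ([3.0, 4.0], [1.0, 1.0])]
results = batched_inner_products([proc0, proc1])
```

Here both inner products are obtained from one reduction of a two-element list, which is the effect the abstract describes as assembling results collectively.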
Computational methods based on the use of adaptively constructed nonuniform meshes reduce the amount of computation and storage necessary to perform many scientific calculations. The adaptive construction of such nonuniform meshes is an important part of these methods. In this paper, we present a parallel algorithm for adaptive mesh refinement that is suitable for implementation on distributed-memory parallel computers. Experimental results obtained on the Intel DELTA are presented to demonstrate that for scientific computations involving the finite element method, the algorithm exhibits scalable performance and has a small run time in comparison with other aspects of the scientific computations examined. It is also shown that the algorithm has a fast expected running time under the parallel random access machine (PRAM) computation model.
Parallel scheduling is a new approach for load balancing. In parallel scheduling, all processors cooperate to schedule work. Parallel scheduling is able to accurately balance the load by using global load information at compile-time or runtime. It provides high-quality load balancing. This paper presents an overview of the parallel scheduling technique. Scheduling algorithms for tree, hypercube, and mesh networks are presented. These algorithms can fully balance the load and maximize locality at runtime. Communication costs are significantly reduced compared to other existing algorithms.
In this paper we consider the solution of 3D diffusion problems on distributed-memory computers. We present a parallel algorithm that is suitable for up to 8 processors. Pipelining is used to extend the number of processors to 64. A computational grid decomposition method that preserves load balance is proposed for heterogeneous clusters of workstations. Numerical results for two clusters of workstations are given.
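The abstract does not say how the grid is decomposed for a heterogeneous cluster. One generic way to keep the load balanced, shown here purely as an assumption and not as the authors' method, is to assign grid planes in proportion to each workstation's relative speed. The function name `split_planes` and the speed values are hypothetical.

```python
def split_planes(n_planes, speeds):
    """Split n_planes grid planes among machines in proportion to their speeds.

    Assumption: the work per plane is uniform, so a machine that is twice
    as fast should receive about twice as many planes.
    """
    total = sum(speeds)
    ideal = [n_planes * s / total for s in speeds]
    counts = [int(v) for v in ideal]  # round every share down first
    # Give the planes lost to rounding to the largest fractional remainders.
    leftover = n_planes - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts

# Three hypothetical workstations, the middle one twice as fast as the others.
counts = split_planes(64, [1.0, 2.0, 1.0])
```

With uniform work per plane, each machine then needs roughly the same time per iteration, which is what load balancing on a heterogeneous cluster aims for.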
This paper describes the parallel implementation of algorithms requiring run-time load redistribution with the aid of the parallel programming library LOCO. As a typical application, a 2D finite volume multiblock Euler/Navier-Stokes code with block-wise adaptive mesh refinement is discussed. The LOCO software handles the communication between blocks and the distribution of blocks among the processors, thereby performing automatic load balancing at run-time. The LOCO library is interfaced with both the native NX communication primitives on Intel iPSC hypercubes and the PVM software on workstation clusters. The parallel performance of the code on the Intel iPSC/860 and on a DEC Alpha workstation cluster is discussed. In particular the effects of mesh refinement on the load balance are investigated.
This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed-memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups, communication contexts, and application topologies. While making use of new ideas where appropriate, the MPI standard is based largely on current practice.
Author:
Lanteri, S.
INRIA, 2004 Route des Lucioles, B.P. 93, 06902 Sophia-Antipolis Cedex, France
Defining a good strategy for the parallelisation of an unstructured mesh based solver is a challenge, particularly when one aims at reaching a high level of performance while maintaining portability of the source code between scalar, vector and parallel machines. In this paper, we present parallel solutions of realistic three-dimensional flows obtained on the Intel Paragon, the Cray T3D and the IBM SP2 MPPs (Massively Parallel Processors). The solver under consideration is a representative subset of an existing industrial code, N3S-MUSCL, which implements a mixed finite element/finite volume formulation on unstructured tetrahedral meshes. The adopted parallelisation strategy combines mesh partitioning techniques and a message-passing programming model. We compare in detail the performance results obtained with parallel solution strategies based on overlapping and non-overlapping mesh partitions.
In this paper a set of programming constructs for the implementation of data parallel algorithms on distributed-memory parallel computers is proposed. The load balancing problem for data parallel programs is cast in a special form. Its relation to the general load balancing problem is analyzed. The applicability of these constructs is asserted for a number of grid-oriented numerical applications. A software tool provides run-time support for data parallel programs based on the proposed constructs. While the application, according to the data parallel programming paradigm, partitions the grid, the tool assigns the partitions to the processors, using built-in mapping algorithms. The approach is general enough to accommodate data parallel algorithms with varying communication structure and variable calculation requirements using pseudo-dynamic load balancing strategies.
The efficient parallel execution of grid-oriented scientific calculations requires a partitioning of the grid that minimises both load imbalance and interprocessor communication. For unstructured static grids, good partitions are obtained with the recursive spectral bisection heuristic, applied to the interdependency graph of the grid. We will describe an alternative spectral bisection algorithm that yields better partitions than the standard algorithm, especially for interdependency graphs with a large variation in the weights of the edges. We will further describe how even in case of dynamically changing grids, grid-oriented problems can be formulated as graph partitioning problems for the purpose of load balancing. We will then partition these dynamically changing grids with the alternative spectral algorithm.
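For orientation, one step of standard spectral bisection on a weighted graph can be sketched as follows. This is the textbook technique, not the alternative algorithm the abstract announces, and it makes several assumptions: the graph is connected with at least two nodes, the Fiedler vector is approximated by plain power iteration, and the names `fiedler_vector` and `spectral_bisection` are illustrative.

```python
def fiedler_vector(n, edges, iterations=500):
    """Approximate the Fiedler vector of a connected weighted graph.

    edges is a list of (u, v, weight) tuples. Power iteration is applied to
    shift*I - L, where L is the graph Laplacian, while the constant vector
    is projected out in every step.
    """
    degree = [0.0] * n
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
    shift = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L

    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant component
        # y = (shift*I - L) x, using L = D - W
        y = [(shift - degree[i]) * x[i] for i in range(n)]
        for u, v, w in edges:
            y[u] += w * x[v]
            y[v] += w * x[u]
        norm = sum(yi * yi for yi in y) ** 0.5
        x = [yi / norm for yi in y]
    return x

def spectral_bisection(n, edges):
    """Split the nodes into two halves at the median Fiedler-vector entry."""
    f = fiedler_vector(n, edges)
    order = sorted(range(n), key=lambda i: f[i])
    half = n // 2
    return set(order[:half]), set(order[half:])

# Two triangles joined by the single edge (2, 3), all weights equal to 1.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
         (2, 3, 1.0)]
part_a, part_b = spectral_bisection(6, edges)
```

On this small example the bisection separates the two triangles, so only the bridging edge is cut. Recursive spectral bisection applies the same step again to each half.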
In this paper, we analyse and compare different parallel implementations of the Boundary Element Method on distributed-memory computers. We deal with the computation of two-dimensional magnetostatic problems. The resulting linear system is solved using Householder transformations and Gaussian elimination. Experimental results are obtained on a Meiko Computing Surface with 32 T800 transputers.
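As a reminder of what a Householder-based solve involves, here is a minimal sequential sketch for a small dense system. It is not the parallel implementation studied in the paper. It assumes a square nonsingular matrix stored as a list of rows, and the name `householder_solve` is illustrative.

```python
import math

def householder_solve(A, b):
    """Solve A x = b for a small dense nonsingular square matrix A.

    Householder reflections reduce A to upper triangular form; the same
    reflections are applied to b, followed by back substitution.
    """
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left unchanged.
    M = [[float(a) for a in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for k in range(n - 1):
        col = [M[i][k] for i in range(k, n)]
        # Choose the sign of alpha opposite to col[0] to avoid cancellation.
        alpha = -math.copysign(math.sqrt(sum(c * c for c in col)), col[0])
        v = col[:]
        v[0] -= alpha
        vv = sum(c * c for c in v)
        if vv == 0.0:
            continue  # the column is entirely zero, nothing to eliminate
        # Apply H = I - 2 v v^T / (v^T v) to rows k..n-1 of columns k..n.
        for j in range(k, n + 1):
            dot = sum(v[i] * M[k + i][j] for i in range(n - k))
            factor = 2.0 * dot / vv
            for i in range(n - k):
                M[k + i][j] -= factor * v[i]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

# Small dense example; b is chosen as A times [1, 2, 3].
A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.0], [2.0, 0.0, 5.0]]
b = [12.0, 7.0, 17.0]
x = householder_solve(A, b)
```

Boundary element discretisations lead to dense matrices, which is why direct dense solvers of this kind are a natural choice for such systems.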