The implementation of algorithms on distributed-memory multiprocessors requires regular exchange of certain intermediate results between the parallel processes. The less data that must be moved, the more efficient the parallelization. In this paper, concepts for the efficient implementation of multigrid methods with regular grid structure are presented, using the SUPRENUM supercomputer as an example. The main idea is the introduction of an optimized 'multicolor' relaxation scheme, combined with an adapted agglomeration technique. The speedup to be expected on SUPRENUM is discussed for the example of solving the Poisson equation in boundary-fitted coordinates.
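The multicolor idea can be illustrated with its simplest two-color case, red-black Gauss-Seidel relaxation for the Poisson equation on a regular grid: every grid point of one color depends only on neighbors of the other color, so all points of a color class can be updated simultaneously. The following serial sketch is my own illustration of that update rule, not the paper's SUPRENUM code:

```python
def red_black_relax(u, f, h, sweeps=1):
    """Red-black ('multicolor') Gauss-Seidel sweeps for -laplacian(u) = f
    on a uniform grid with spacing h. u and f are lists of rows; the
    boundary values of u are held fixed (Dirichlet conditions).
    Each color class touches only the other color, so in a parallel
    setting one color can be updated concurrently on all processors."""
    n, m = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):            # 0 = "red" points, 1 = "black"
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 == color:
                        u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                          + u[i][j - 1] + u[i][j + 1]
                                          + h * h * f[i][j])
    return u
```

With boundary values fixed at 1 and zero right-hand side, the interior converges to the constant solution 1, which makes the routine easy to check.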
A parallel algorithm that makes use of the classical three-term recursion formula to construct an orthogonal family of polynomials with respect to a discrete inner product is proposed. The algorithm requires O(N log N) parallel arithmetic steps on a distributed-memory multiprocessor with N + 1 processors to construct the polynomials p_i(x) for 0 ≤ i ≤ N. If hypercube topology is assumed, the algorithm can be implemented with the additional overhead of O(N log N) routing steps. In this case the implementation is quite simple, requiring only scalar single-node broadcast and accumulation procedures together with a Gray code mapping. The limited-processor version of the algorithm requires O(N²/p + N log p) arithmetic and O(N log p) routing steps on a hypercube with p ≤ N + 1 nodes. We present some experimental results obtained on an Intel cube.
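The three-term recursion in question is p_{i+1}(x) = (x - a_i) p_i(x) - b_i p_{i-1}(x), with the coefficients a_i, b_i determined by the discrete inner product. The serial reference version below (the Stieltjes procedure) is my own illustration; the paper's contribution is distributing the inner-product sums over N + 1 processors:

```python
def orthogonal_polys(xs, ws, N):
    """Build p_0..p_N orthogonal w.r.t. the discrete inner product
    <f, g> = sum_k w_k f(x_k) g(x_k) via the three-term recursion
    p_{i+1}(x) = (x - a_i) p_i(x) - b_i p_{i-1}(x).
    Returns the recursion coefficients and the value tables p_i(x_k)."""
    m = len(xs)
    p_prev = [0.0] * m          # p_{-1} = 0
    p_curr = [1.0] * m          # p_0 = 1
    table = [p_curr[:]]
    alphas, betas = [], []
    for i in range(N):
        norm = sum(w * p * p for w, p in zip(ws, p_curr))
        a = sum(w * x * p * p for w, x, p in zip(ws, xs, p_curr)) / norm
        if i == 0:
            b = 0.0
        else:
            prev_norm = sum(w * p * p for w, p in zip(ws, p_prev))
            b = norm / prev_norm
        p_next = [(x - a) * pc - b * pp
                  for x, pc, pp in zip(xs, p_curr, p_prev)]
        alphas.append(a)
        betas.append(b)
        p_prev, p_curr = p_curr, p_next
        table.append(p_curr[:])
    return alphas, betas, table
```

Each step needs only two weighted sums, which is exactly the part a distributed implementation computes with accumulation (reduction) procedures.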
Ray tracing is a well-known technique for generating life-like images based on models of light shading, reflection, and refraction. The massive computation and memory demands of ray tracing complex scenes have long motivated researchers to use parallel processing to reduce the ray tracing time. This paper presents a study of the parallel implementation of a ray tracing algorithm on a distributed-memory parallel computer. The computational cost of rendering pixels and the patterns of data access cannot be predicted until runtime. To parallelize such an application efficiently, the issues of database partitioning, data management, and load balancing must be addressed. In this paper, we discuss ways of partitioning the database and propose a dynamic data management scheme which exploits image coherence to reduce data communication time. A global load balancing mechanism is presented to ensure a good load balance among processors during ray tracing. The success of our implementation depends crucially on a number of parameters which are experimentally evaluated. (C) 1997 Published by Elsevier Science B.V.
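Because per-pixel cost is unknown until runtime, a common generic remedy is demand-driven tiling: idle workers pull image tiles from a shared queue, so expensive regions do not pile up on one processor. The sketch below is my own thread-based illustration of that pattern (the function names and the `shade` callback are hypothetical), not the paper's global load balancing mechanism:

```python
import queue
import threading

def render_parallel(width, height, tile, shade, workers=4):
    """Demand-driven ('task farm') tiling: workers repeatedly take the
    next tile from a shared queue and fill in its pixels by calling
    shade(x, y). Costly tiles naturally balance across workers."""
    tasks = queue.Queue()
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            tasks.put((x0, y0))
    image = [[0.0] * width for _ in range(height)]

    def worker():
        while True:
            try:
                x0, y0 = tasks.get_nowait()
            except queue.Empty:
                return                      # no tiles left
            for y in range(y0, min(y0 + tile, height)):
                for x in range(x0, min(x0 + tile, width)):
                    image[y][x] = shade(x, y)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image
```

Writes land in disjoint tiles, so the workers need no locking on the image itself.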
We present a parallel algorithm for an exact solution of an integer linear system of equations using the single modulus p-adic expansion technique. More specifically, we parallelize an algorithm of Dixon, and present our implementation results on a distributed-memory multiprocessor. The parallel algorithm presented here can be used together with the multiple moduli algorithms and parallel Chinese remainder algorithms for fast computation of the exact solution of a system of linear equations with integer entries. (C) 1997 Elsevier Science B.V.
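Dixon's method solves Ax = b by computing the solution digit by digit in base p: each step solves a residual system modulo p with a precomputed inverse, then divides the (exactly divisible) residual by p. The serial sketch below is my own illustration, and it assumes the exact solution has integer entries, so the rational-reconstruction step of the full algorithm is skipped:

```python
def mat_inv_mod(A, p):
    """Invert an n x n integer matrix modulo a prime p (Gauss-Jordan)."""
    n = len(A)
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, p)          # modular inverse of pivot
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def dixon_solve(A, b, p=1000003, iters=8):
    """Single-modulus p-adic expansion x = x_0 + x_1 p + ... mod p^iters.
    Assumes p does not divide det(A) and that the true solution is
    integral and small enough to lift from the symmetric residue."""
    n = len(A)
    Ainv = mat_inv_mod(A, p)
    digits, bi = [], list(b)
    for _ in range(iters):
        xi = [sum(Ainv[r][c] * bi[c] for c in range(n)) % p
              for r in range(n)]
        digits.append(xi)
        resid = [bi[r] - sum(A[r][c] * xi[c] for c in range(n))
                 for r in range(n)]
        assert all(v % p == 0 for v in resid)  # exact by construction
        bi = [v // p for v in resid]
    M = p ** iters
    x = [sum(d[r] * p ** i for i, d in enumerate(digits)) % M
         for r in range(n)]
    return [v - M if v > M // 2 else v for v in x]  # symmetric lift
```

All arithmetic stays on small residues except the final assembly, which is what makes the method attractive for exact solutions.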
We consider the problem of dynamic load balancing for multiprocessors, for which a typical application is a parallel finite element solution method using non-structured grids and adaptive grid refinement. This type of application requires communication between the subproblems, which arises from the interdependencies in the data. A load balancing algorithm should ideally not make any assumptions about the physical topology of the parallel machine. Further requirements are that the procedure should be both fast and accurate. A new multi-level algorithm is presented for solving the dynamic load balancing problem which has these properties and whose parallel complexity is logarithmic in the number of processors used in the computation.
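A classic example of a balancing step with logarithmic parallel complexity is the dimension-exchange scheme, in which each processor averages its load with one partner per round. This generic sketch is my own illustration of that complexity class only; it ignores the data interdependencies the paper's multi-level algorithm must also respect:

```python
def dimension_exchange(loads):
    """Hypercube-style pairwise balancing: in round d each processor i
    averages its load with partner i XOR d. After log2(p) rounds every
    processor holds the global average, so the parallel step count is
    logarithmic in the number of processors p."""
    p = len(loads)
    assert p & (p - 1) == 0, "p must be a power of two"
    loads = loads[:]               # do not mutate the caller's list
    d = 1
    while d < p:
        for i in range(p):
            j = i ^ d
            if i < j:              # each pair exchanges once per round
                avg = (loads[i] + loads[j]) / 2
                loads[i] = loads[j] = avg
        d <<= 1
    return loads
```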
In this paper, we consider the problem of reducing the communication cost for the parallel factorization of a sparse symmetric positive definite matrix on a distributed-memory multiprocessor. We define a parallel communication cost function and show, with a contrived example, that simply minimizing the height of the elimination tree is ineffective for minimizing the communication cost, and that the discrepancy may grow without bound. We propose an algorithm to find an ordering such that the communication cost to complete the parallel Cholesky factorization is minimum among all equivalent reorderings. Our algorithm runs in O(n log n + m) time, where n is the number of nodes and m is the sum of all maximal clique sizes in the filled graph. (C) 1999 Elsevier Science B.V. All rights reserved.
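The elimination tree that such cost functions are defined over can be computed directly from the sparsity pattern: parent[j] is the smallest i > j with a nonzero L[i][j] in the Cholesky factor. The sketch below is the standard serial algorithm with path compression (an illustration of the underlying structure, not the paper's reordering algorithm):

```python
def elimination_tree(rows, n):
    """Elimination tree of a sparse symmetric matrix A.
    rows[i] lists the column indices k < i with A[i][k] != 0.
    parent[j] = smallest i > j with L[i][j] != 0; -1 marks a root.
    Path-compressed ancestor links keep the running time near-linear."""
    parent = [-1] * n
    anc = [-1] * n                  # compressed ancestor links
    for i in range(n):
        for k in rows[i]:
            r = k
            while anc[r] != -1 and anc[r] != i:
                nxt = anc[r]
                anc[r] = i          # compress the path toward i
                r = nxt
            if anc[r] == -1:        # r is a new child of i
                anc[r] = i
                parent[r] = i
    return parent
```

For a tridiagonal-plus-corner pattern the tree degenerates to a chain, while a star pattern (last row dense) yields a flat tree of height one; the paper's point is that neither height alone determines communication cost.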
In this paper, we consider the symbolic factorization step in computing the Cholesky factorization of a sparse symmetric positive definite matrix on distributed-memory multiprocessor systems. By exploiting the supernodal structure in the Cholesky factor, the performance of a previous parallel symbolic factorization algorithm is improved. Empirical tests demonstrate that there can be drastic reduction in the execution time required by the new algorithm on an Intel iPSC/2 hypercube.
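The basic column-wise symbolic factorization (without the supernodal refinement the paper adds) merges each child's column structure into its parent's along the elimination tree: struct(L_j) is the union of column j of A with the structures of j's children, restricted to rows below j. A serial sketch, my own illustration, with adj_lower[j] holding the below-diagonal row indices of column j of A and parent[] the elimination tree:

```python
def symbolic_factorization(adj_lower, parent, n):
    """Column structures of the Cholesky factor L:
    struct(j) = adj_lower[j]  union  { i in struct(c) : i > j }
    over all children c of j in the elimination tree. Children have
    smaller indices than parents, so one increasing pass suffices."""
    struct = [set(adj_lower[j]) for j in range(n)]
    children = [[] for _ in range(n)]
    for c in range(n):
        if parent[c] != -1:
            children[parent[c]].append(c)
    for j in range(n):
        for c in children[j]:
            struct[j] |= {i for i in struct[c] if i > j}
    return [sorted(s) for s in struct]
```

The fill-in entry L[3][1] in the test below arises because column 0 connects rows 1 and 3.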
As parallel implementation of complex applications is becoming popular, the need for a high-performance interprocessor communication system becomes imminent, especially in loosely coupled distributed-memory multiprocessor networks. An important factor in the efficiency of these networks is the effectiveness of the message-passing system which manages the data exchanges among the processors of the network. This paper presents the modeling and performance evaluation of a new Message-Passing System (MPS) for distributed multiprocessor networks without shared memory, in which the processors or Processing Elements (PEs) are connected to each other by point-to-point communication links. For maximum performance, the MPS manages the communication and the synchronization between the different tasks of an application by means of three approaches. The first is an asynchronous send/receive approach which efficiently handles server-like tasks; the second is a synchronous send/receive approach which efficiently handles the streaming communication mode; and the third is a virtual channel approach which minimizes the overhead of the synchronization mechanism, efficiently handling the burst mode of heavy communication between tasks. The developed models of the MPS approaches enable the determination of analytical expressions for the different performance measures, and a comparison between analytical and experimental results reveals that the models predict the MPS performance with high accuracy. The MPS, written in parallel ANSI C, is studied on a mesh topology network of 16 T800 transputers. The MPS performance for each approach is studied and presented in terms of communication latency, throughput, computation efficiency, and memory consumption.
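The difference between the asynchronous and synchronous send/receive approaches can be sketched with two toy channel types (a thread-based illustration of the semantics only, not the MPS implementation): an asynchronous send deposits the message and returns at once, while a synchronous send blocks until the receiver has taken the message, i.e. a rendezvous.

```python
import queue
import threading

class AsyncChannel:
    """Asynchronous send/receive: send puts the message into the
    receiver's mailbox and returns immediately (server-like tasks)."""
    def __init__(self):
        self._q = queue.Queue()
    def send(self, msg):
        self._q.put(msg)        # never blocks
    def recv(self):
        return self._q.get()

class SyncChannel:
    """Synchronous send/receive: send blocks until the receiver has
    actually taken the message (rendezvous, suited to streaming)."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)
    def send(self, msg):
        self._q.put(msg)
        self._q.join()          # wait until the matching recv completes
    def recv(self):
        msg = self._q.get()
        self._q.task_done()     # release the blocked sender
        return msg
```

In the asynchronous case a task can queue several messages and proceed; in the synchronous case sender and receiver advance in lock-step.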
In this paper we propose a new medium-grain parallel algorithm for computing a matrix inverse on a hypercube multiprocessor. The algorithm implements Gauss-Jordan inversion with column interchanges. The hypercube network is configured as a two-dimensional subcube-grid to support submatrix partitionings. For some algorithms on some types of hypercubes, submatrix partitionings are known to have communication advantages not shared by partitions limited to rows or columns. We show that such advantages can be extended to Gauss-Jordan inversion on an Intel iPSC/860, the most recent third-generation hypercube, and that little extra programming effort is needed to include it in the subcube-grid library used in various other matrix computations. An actual aggregate execution rate of 200 MFLOPS (million floating-point operations per second) is achieved when inverting a 2000 × 2000 matrix (in double-precision Fortran 77) using 64 iPSC/860 processors configured as an 8 × 8 subcube-grid.
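The numerical method itself can be sketched serially as follows (my own illustration of Gauss-Jordan inversion with column interchanges; the paper's contribution is its subcube-grid distribution): at step k the pivot is the largest entry of row k among the remaining columns, and since swapping columns of A amounts to inverting A·P, the column permutation is undone as a row permutation of the inverse at the end.

```python
def gauss_jordan_inverse(A):
    """Gauss-Jordan inversion with column interchanges. The working
    matrix is [A | I]; column swaps apply only to the left block and
    are recorded in cols, then undone as row swaps of the result
    (because (A P)^-1 = P^T A^-1)."""
    n = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    cols = list(range(n))
    for k in range(n):
        # pivot: largest entry in row k among columns k..n-1
        j = max(range(k, n), key=lambda c: abs(M[k][c]))
        if j != k:
            cols[k], cols[j] = cols[j], cols[k]
            for row in M:
                row[k], row[j] = row[j], row[k]
        piv = M[k][k]
        M[k] = [v / piv for v in M[k]]
        for r in range(n):
            if r != k:
                f = M[r][k]
                if f:
                    M[r] = [v - f * w for v, w in zip(M[r], M[k])]
    inv = [None] * n
    for k in range(n):
        inv[cols[k]] = M[k][n:]     # undo the column permutation
    return inv
```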
A parallel algorithm is presented for the LU decomposition of a general sparse matrix on a distributed-memory MIMD multiprocessor with a square mesh communication network. In the algorithm, matrix elements are assigned to processors according to the grid distribution. Each processor represents the nonzero elements of its part of the matrix by a local, ordered, two-dimensional linked-list data structure. The complexity of important operations on this data structure and on several others is analysed. At each step of the algorithm, a parallel search for a set of m compatible pivot elements is performed. The Markowitz counts of the pivot elements are close to minimum, to preserve the sparsity of the matrix. The pivot elements also satisfy a threshold criterion, to ensure numerical stability. The compatibility of the m pivots enables the simultaneous elimination of m pivot rows and m pivot columns in a rank-m update of the reduced matrix. Experimental results on a network of 400 transputers are presented for a set of test matrices from the Harwell-Boeing sparse matrix collection.
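Markowitz-style pivot selection with a threshold test can be sketched as follows (a serial, dense-storage illustration of the selection criterion only; the paper searches for m compatible pivots in parallel over linked-list structures, and real sparse codes maintain the row and column counts incrementally):

```python
def markowitz_pivot(A, tau=0.1):
    """Return the position (i, j) of a pivot that minimizes the
    Markowitz count (r_i - 1) * (c_j - 1), where r_i and c_j are the
    nonzero counts of row i and column j, restricted to entries that
    pass the threshold test |a_ij| >= tau * max_k |a_ik| for stability."""
    n = len(A)
    r = [sum(1 for v in row if v) for row in A]
    c = [sum(1 for row in A if row[j]) for j in range(n)]
    best, best_count = None, None
    for i in range(n):
        row_max = max(abs(v) for v in A[i])
        for j in range(n):
            if A[i][j] and abs(A[i][j]) >= tau * row_max:
                count = (r[i] - 1) * (c[j] - 1)
                if best is None or count < best_count:
                    best, best_count = (i, j), count
    return best
```

A low Markowitz count bounds the fill-in a rank-1 update can create, while the threshold keeps the pivot numerically acceptable.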