The conjugate gradient squared (CGS) algorithm is a Krylov subspace method for rapidly solving linear systems (Ax=b) whose coefficient matrices (A) are complex, nonsymmetric, very large, and very sparse. Using electromagnetic scattering problems as examples, this paper studies the performance and scalability of the algorithm on two MIMD machines. It proposes a modified CGS (MCGS) algorithm in which the synchronization overhead is reduced by a factor of two by changing the computation sequence of the CGS algorithm. Both experimental and theoretical analyses are performed to investigate the impact of this modification on overall execution time. They show that CGS is faster than MCGS for a small number of processors, while MCGS outperforms CGS as the number of processors increases. Based on this observation, a set-of-algorithms approach is proposed, in which either CGS or MCGS is selected depending on the dimension of the matrix A (N) and the number of processors (P). The set approach provides an algorithm that is more scalable than either CGS or MCGS alone. Experiments on a 128-processor mesh Intel Paragon and on a 16-processor IBM SP2 with a multistage network indicate that MCGS is approximately 20% faster than CGS.
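For orientation, the following is a minimal serial sketch of the textbook (unpreconditioned) CGS iteration, not the paper's implementation and not the MCGS reordering, whose details the abstract does not give. The function names `matvec`, `dot`, and `cgs`, the dense list-of-rows matrix storage, and the default tolerance are assumptions chosen for illustration. The comments mark the two inner products per iteration, which on a distributed-memory machine would each require a global reduction and are therefore the likely synchronization points the abstract refers to.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows (illustrative only)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(a, b):
    """Conjugated inner product <a, b>.

    In a distributed setting each call would need a global reduction,
    i.e. a synchronization across processors.
    """
    return sum(ai.conjugate() * bi for ai, bi in zip(a, b))


def cgs(A, b, tol=1e-10, max_iter=200):
    """Textbook CGS for Ax = b, starting from x = 0.

    Returns (x, iterations). Raises ArithmeticError on breakdown.
    """
    n = len(b)
    x = [0j] * n
    r = list(b)              # initial residual r0 = b - A*0
    r_shadow = list(r)       # fixed shadow residual
    p = [0j] * n
    q = [0j] * n
    rho_prev = 1.0
    b_norm = abs(dot(b, b)) ** 0.5 or 1.0

    for k in range(max_iter):
        # Convergence check (a third reduction in this simple sketch).
        if abs(dot(r, r)) ** 0.5 <= tol * b_norm:
            return x, k

        rho = dot(r_shadow, r)            # inner product 1
        if rho == 0:
            raise ArithmeticError("CGS breakdown: rho == 0")

        if k == 0:
            u = list(r)
            p = list(u)
        else:
            beta = rho / rho_prev
            u = [ri + beta * qi for ri, qi in zip(r, q)]
            p = [ui + beta * (qi + beta * pi) for ui, qi, pi in zip(u, q, p)]

        v = matvec(A, p)
        sigma = dot(r_shadow, v)          # inner product 2
        if sigma == 0:
            raise ArithmeticError("CGS breakdown: sigma == 0")

        alpha = rho / sigma
        q = [ui - alpha * vi for ui, vi in zip(u, v)]
        uq = [ui + qi for ui, qi in zip(u, q)]
        x = [xi + alpha * wi for xi, wi in zip(x, uq)]
        a_uq = matvec(A, uq)
        r = [ri - alpha * wi for ri, wi in zip(r, a_uq)]
        rho_prev = rho

    return x, max_iter
```

Calling `cgs(A, b)` with a small complex nonsymmetric matrix returns the approximate solution and the iteration count. A production solver would use a sparse storage format and distribute the matrix-vector products, which this sketch deliberately omits.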
ISBN: (print) 0897918622
A 512-node IBM Scalable POWERParallel Systems SP2 was installed at the Cornell Theory Center in October 1994. During the past couple of months we have been porting and optimizing code for carrying out lattice QCD calculations. Present performance is far from ideal, however, and optimization efforts are still under way. The rate-limiting step in our code involves a rather generic inversion of a large, sparse system, based on a partial differential equation in a multidimensional space. The insights we have gained so far may be useful in diagnosing performance in a wide class of applications. Copyright 1995 by the Association for Computing Machinery, Inc. (ACM).