ISBN:
(Print) 9798350364613; 9798350364606
Preconditioned iterative methods based on the Krylov subspace technique are widely employed in various fields of scientific and technical computing. When utilizing large-scale parallel computing systems, the communication overhead tends to increase with the growth in the number of nodes, making its reduction a crucial challenge. In parallel finite element methods (FEM) and finite volume methods (FVM), overlapping of halo communication and computation (CC-overlapping) is commonly employed, often in conjunction with the dynamic loop scheduling feature of OpenMP. This approach has been applied primarily to sparse matrix-vector products (SpMV) and explicit solvers. Previous studies by the author have proposed reordering techniques for applying CC-overlapping to processes involving global data dependencies, such as the Conjugate Gradient method preconditioned by Incomplete Cholesky Factorization (ICCG). Successful implementations on massively parallel supercomputers demonstrated high parallel performance, but the application of CC-overlapping was limited to SpMV. In the present work, the author proposes a method to apply CC-overlapping to the forward and backward substitutions of the IC(0) smoother of the parallel Conjugate Gradient method preconditioned by Multigrid (MGCG). Using up to 4,096 nodes of Wisteria/BDEC-01 (Odyssey) with A64FX processors, a performance improvement of more than 40% was achieved compared to the original implementation, while an improvement of more than 20% was obtained on 1,024 nodes of the Oakbridge-CX system with Intel Xeon CPUs.
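The overlap idea described in this abstract can be illustrated with a short C sketch of the forward substitution, assuming the unknowns have been reordered into an interior block that references only locally owned data and a boundary block that may also read halo values, and assuming the non-blocking transfers of those halo values have already been posted by the caller. All identifiers (row_ptr, col_idx, low_val, diag_inv, n_int, ...) are hypothetical; this shows only the overlap pattern, not the reordering and communication schedule of the actual method.

    #include <mpi.h>

    /* Forward substitution of a local IC(0) factor with CC-overlapping.
     * Rows [0, n_int) touch only owned data; rows [n_int, n) may also read
     * halo entries of z, stored at indices >= n (z has length n + n_halo).
     * The caller is assumed to have posted the non-blocking transfers of
     * those halo values (reqs, n_req).  All identifiers are hypothetical. */
    void ic0_forward_overlap(int n, int n_int,
                             const int *row_ptr, const int *col_idx,
                             const double *low_val, const double *diag_inv,
                             const double *r, double *z,
                             int n_req, MPI_Request *reqs)
    {
        /* Interior block: no halo dependence, so it proceeds while the halo
         * messages are still in flight. */
        for (int i = 0; i < n_int; i++) {
            double s = r[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                s -= low_val[k] * z[col_idx[k]];   /* strictly lower triangle, local */
            z[i] = s * diag_inv[i];
        }

        /* Boundary block: may read halo entries of z, so complete the
         * exchange first, then finish the substitution. */
        MPI_Waitall(n_req, reqs, MPI_STATUSES_IGNORE);
        for (int i = n_int; i < n; i++) {
            double s = r[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                s -= low_val[k] * z[col_idx[k]];
            z[i] = s * diag_inv[i];
        }
    }

The backward substitution would follow the same pattern, traversing the strictly upper triangular part in reverse order with the halo exchange posted in the opposite direction.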
ISBN:
(Print) 9781538610442
Preconditioned parallel solvers based on the Krylov iterative method are widely used in scientific and engineering applications. Communication overhead is a critical issue when executing these solvers on large-scale massively parallel supercomputers. In this work, we introduced communication-computation (CC) overlapping with dynamic loop scheduling of OpenMP into the sparse matrix-vector multiplication (SpMV) process of a parallel iterative solver. We then used the solver to evaluate the performance of a parallel finite element application (GeoFEM/Cube) on multicore and manycore clusters. The dynamic loop scheduling of OpenMP improved the efficiency of CC overlapping in halo exchanges, and the developed method attained a significant performance improvement of 40-50% for parallel iterative solvers in strong scaling using up to 16,384 cores of a Fujitsu PRIMEHPC FX10 supercomputer and an Intel Xeon Phi (KNL) cluster. Finally, the developed method was applied to GeoFEM/Cube using a parallel BiCGSTAB solver with sparse approximate inverse (SAI) preconditioning, and a 15-20% performance improvement was obtained on 12,288 cores of the Fujitsu FX10 and the KNL cluster.
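The combination of non-blocking halo exchange, SpMV, and OpenMP dynamic loop scheduling described here can be sketched as follows in C. The matrix layout (local CSR with halo columns stored at indices >= n of x), the chunk size, and the assumption that the caller has already posted MPI_Isend/MPI_Irecv for the halo values are my own simplifications, not the authors' code.

    #include <mpi.h>
    #include <omp.h>

    /* Distributed CSR SpMV with CC-overlapping.  x has length n + n_halo;
     * halo entries live at indices >= n and are filled by the non-blocking
     * receives described by (reqs, n_req), posted by the caller. */
    void spmv_cc_overlap(int n, int n_int,
                         const int *row_ptr, const int *col_idx, const double *val,
                         double *x, double *y,
                         int n_req, MPI_Request *reqs)
    {
        #pragma omp parallel
        {
            /* "omp master" has no implied barrier, so the other threads start
             * on the interior rows at once while the master finishes the exchange. */
            #pragma omp master
            MPI_Waitall(n_req, reqs, MPI_STATUSES_IGNORE);

            /* Interior rows reference only owned entries of x.  Dynamic
             * scheduling lets the master pick up the remaining chunks after
             * MPI_Waitall returns. */
            #pragma omp for schedule(dynamic, 64)
            for (int i = 0; i < n_int; i++) {
                double s = 0.0;
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    s += val[k] * x[col_idx[k]];
                y[i] = s;
            }

            /* The implicit barrier of the loop above also guarantees the master
             * has completed MPI_Waitall, so the halo entries of x are valid. */
            #pragma omp for schedule(dynamic, 64)
            for (int i = n_int; i < n; i++) {
                double s = 0.0;
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    s += val[k] * x[col_idx[k]];
                y[i] = s;
            }
        }
    }

With static scheduling, the iterations assigned to the master thread would simply sit idle while it waits inside MPI_Waitall; dynamic scheduling avoids that load imbalance, which is the point of this construction.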
ISBN:
(Print) 9781467397957
Distributed memory systems (DMS), i.e., clusters, are one of the tools used by researchers to solve a wide spectrum of computationally intensive problems in a fraction of the time of a sequential approach. The nature of a DMS does not in itself enforce intense data sharing among computational nodes; such sharing arises when the problem under analysis is data-dependent in nature. The latency associated with dynamic data sharing in a DMS is well known to increase the total execution time. One of the possible techniques that can be used to reduce the negative effects associated with this latency is the overlapping of communication with computation. In this paper we show why a characterization of the overlapping capabilities of a cluster is important for justifying results.
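A characterization of this kind can be approximated with a small micro-benchmark. The sketch below is not the paper's code; it times a non-blocking exchange alone, a dummy compute kernel alone, and the two together, then reports how much of the communication time was hidden. Message size, kernel size, and the pairing of ranks are arbitrary choices.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NBYTES (1 << 22)    /* 4 MiB message (arbitrary) */
    #define NWORK  (1 << 24)    /* size of the dummy compute kernel (arbitrary) */

    /* Dummy floating-point work standing in for application computation. */
    static double compute(const double *w, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += w[i] * 1.0000001 + 0.5;
        return s;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size % 2 != 0) {                 /* ranks are paired: 0-1, 2-3, ... */
            if (rank == 0) fprintf(stderr, "run with an even number of ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        int peer = rank ^ 1;

        char   *sbuf = calloc(NBYTES, 1), *rbuf = malloc(NBYTES);
        double *w    = malloc(NWORK * sizeof(double));
        for (int i = 0; i < NWORK; i++) w[i] = 1.0;
        MPI_Request req[2];
        volatile double sink = 0.0;          /* keeps the kernel from being optimized away */

        /* 1. Communication only. */
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Irecv(rbuf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        double t_comm = MPI_Wtime() - t0;

        /* 2. Computation only. */
        t0 = MPI_Wtime();
        sink += compute(w, NWORK);
        double t_comp = MPI_Wtime() - t0;

        /* 3. Both together: the kernel runs while the messages are in flight. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        MPI_Irecv(rbuf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sbuf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req[1]);
        sink += compute(w, NWORK);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        double t_both = MPI_Wtime() - t0;

        if (rank == 0)
            printf("comm %.3f ms  comp %.3f ms  overlapped %.3f ms  hidden %.0f%%\n",
                   1e3 * t_comm, 1e3 * t_comp, 1e3 * t_both,
                   100.0 * (t_comm + t_comp - t_both) / t_comm);

        free(sbuf); free(rbuf); free(w);
        MPI_Finalize();
        return 0;
    }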
A GPU-accelerated Conjugate Gradient solver is tested on eight matrices with different structural and numerical characteristics. The first four matrices are obtained by discretizing the 3D Poisson's equation, which arises in many fields such as computational fluid dynamics, heat transfer, and so on. Their relatively low bandwidth and low condition numbers make them ideal targets for GPU acceleration. We chose the other four matrices from the opposite end of the spectrum: ill-conditioned and with very large bandwidth. This paper concentrates on the computational aspects of running the solver on multiple GPUs. We develop a fast distributed sparse matrix-vector multiplication routine using optimized data formats that allows the overlapping of communication with computation and, at the same time, the sharing of some of the work with the CPU. By a thorough analysis of the time spent in communication and computation, we show that the proposed overlapped implementation outperforms the non-overlapped one by a large margin and provides almost perfect strong scalability for large Poisson-type matrices. We then benchmark the performance of the entire solver, using both double precision and single precision combined with iterative refinement, and report up to 22x acceleration when using three GPUs as compared with one of the most powerful Intel Nehalem CPUs available today. Finally, we show that using GPUs as accelerators not only brings an order-of-magnitude speedup but also up to a 5x increase in power efficiency and over a 10x increase in cost effectiveness. Copyright (C) 2010 John Wiley & Sons, Ltd.
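The single-precision-plus-iterative-refinement strategy mentioned above can be illustrated with a minimal C sketch: the residual and the solution update are kept in double precision, while the inner solve is carried out in single precision. Here the inner solver is a plain Jacobi sweep on a small diagonally dominant test system chosen only for illustration; in the paper the inner solver is a GPU-accelerated CG.

    #include <stdio.h>
    #include <math.h>

    #define N 4

    /* Single-precision Jacobi sweeps, approximately solving A*e = r
     * (stand-in for the single-precision GPU inner solver). */
    static void inner_solve_sp(float A[N][N], const float r[N], float e[N], int sweeps)
    {
        for (int i = 0; i < N; i++) e[i] = 0.0f;
        for (int s = 0; s < sweeps; s++)
            for (int i = 0; i < N; i++) {
                float sum = r[i];
                for (int j = 0; j < N; j++)
                    if (j != i) sum -= A[i][j] * e[j];
                e[i] = sum / A[i][i];
            }
    }

    int main(void)
    {
        /* Small SPD, diagonally dominant test system (arbitrary example data). */
        const double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
        const double b[N]    = {1, 2, 3, 4};
        float As[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                As[i][j] = (float)A[i][j];

        double x[N] = {0};
        for (int it = 0; it < 20; it++) {
            /* 1. Double-precision residual r = b - A*x. */
            double r[N], nrm = 0.0;
            for (int i = 0; i < N; i++) {
                r[i] = b[i];
                for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
                nrm += r[i] * r[i];
            }
            printf("iteration %2d  ||r|| = %.3e\n", it, sqrt(nrm));
            if (sqrt(nrm) < 1e-12) break;

            /* 2. Single-precision inner solve A*e ~ r. */
            float rf[N], ef[N];
            for (int i = 0; i < N; i++) rf[i] = (float)r[i];
            inner_solve_sp(As, rf, ef, 50);

            /* 3. Double-precision update of the solution. */
            for (int i = 0; i < N; i++) x[i] += (double)ef[i];
        }
        return 0;
    }

For a well-conditioned system, each refinement step shrinks the residual by roughly the accuracy of the single-precision inner solve, so only a handful of outer iterations are needed to reach double-precision accuracy.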