ISBN:
(digital) 9783642387180
ISBN:
(print) 9783642387180; 9783642387173
The multigrid method with the OpenMP/MPI hybrid parallel programming model is expected to play an important role in large-scale scientific computing on post-peta/exa-scale supercomputer systems. Because the multigrid method involves many parameter choices, selecting the optimum combination of them is a critical issue. In the present work, we focus on the selection of single-threading or multi-threading in the procedures of parallel multigrid solvers using OpenMP/MPI hybrid parallel programming models. We propose a simple empirical method for automatic tuning (AT) of the related parameters. The performance of the proposed method is evaluated on the T2K Open Supercomputer (T2K/Tokyo), the Cray XE6, and the Fujitsu FX10 using up to 8,192 cores. The proposed AT method is effective, and the automatically tuned code provides twice the performance of the original one.
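As a rough illustration of the kind of choice this tuning addresses, the sketch below shows a hybrid MPI/OpenMP smoother that switches between single-threaded and multi-threaded execution per multigrid level depending on the local problem size. The four-level hierarchy, the point-relaxation placeholder, and the threshold kThreadingThreshold are assumptions for illustration only, not the authors' code.

```cpp
// Minimal sketch (not the authors' code): per-level choice between
// single-threaded and multi-threaded execution in a hybrid MPI/OpenMP
// multigrid smoother. Halo exchange and the real smoother are omitted.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstddef>

struct Level {
    std::vector<double> u, f;   // unknowns and right-hand side on this level
    std::size_t n = 0;          // local number of unknowns
};

// Hypothetical threshold: below this local size, threading overhead
// outweighs the benefit, so the smoother runs single-threaded.
constexpr std::size_t kThreadingThreshold = 10000;

void smooth(Level& lv, int max_threads) {
    const int nt = (lv.n >= kThreadingThreshold) ? max_threads : 1;
    #pragma omp parallel for num_threads(nt) schedule(static)
    for (std::size_t i = 0; i < lv.n; ++i) {
        // placeholder point-relaxation update; the real smoother is problem specific
        lv.u[i] += 0.5 * (lv.f[i] - lv.u[i]);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    std::vector<Level> hierarchy(4);          // assumed 4-level hierarchy
    for (std::size_t l = 0; l < hierarchy.size(); ++l) {
        hierarchy[l].n = 100000 >> (2 * l);   // coarser levels shrink quickly
        hierarchy[l].u.assign(hierarchy[l].n, 0.0);
        hierarchy[l].f.assign(hierarchy[l].n, 1.0);
    }
    const int max_threads = omp_get_max_threads();
    for (auto& lv : hierarchy) smooth(lv, max_threads);  // down-sweep only
    MPI_Finalize();
    return 0;
}
```

An automatic tuner in this spirit would replace the fixed threshold with values measured empirically on the target machine, per level and per procedure.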
This paper reports on an investigation into large-scale parallel time-harmonic electromagnetic field analysis based on the finite element method. A parallel geometric multigrid preconditioned iterative solver for the resulting linear system was developed on a cluster of shared-memory parallel computers. We propose a hybrid parallel ordering method for the parallelization of a multiplicative Schwarz smoother, which is a key component of the multigrid solver for electromagnetic field analysis. The method uses domain decomposition ordering for multi-process parallelism and introduces block multi-color ordering for multi-thread parallel processing, attaining a high convergence rate with a small number of message passing interface communications and thread synchronizations. Numerical tests confirm that the proposed method attains solver performance more than twice as good as the conventional method based on multi-color ordering. Furthermore, a problem with approximately 800 million degrees of freedom is successfully solved on 256 quad-core processors. (c) 2012 Elsevier B.V. All rights reserved.
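The general idea of block multi-color ordering for a thread-parallel multiplicative (Gauss-Seidel-type) sweep within one MPI subdomain can be sketched as follows: blocks of the same color carry no mutual data dependencies, so threads process them concurrently, while colors are visited one after another. The CSR layout, the contiguous block ranges, and the dynamic schedule are illustrative assumptions, not the paper's implementation.

```cpp
// Minimal sketch (assumptions throughout): block multi-color ordering for a
// thread-parallel Gauss-Seidel sweep inside one MPI subdomain.
#include <omp.h>
#include <vector>
#include <utility>
#include <cstddef>

struct CSRMatrix {
    std::vector<int> row_ptr, col_idx;
    std::vector<double> val;
};

// Each block is a contiguous [begin, end) row range (assumed layout);
// color_blocks[c] lists the row blocks assigned to color c.
using Block = std::pair<int, int>;

void block_multicolor_gs(const CSRMatrix& A,
                         std::vector<double>& x,
                         const std::vector<double>& b,
                         const std::vector<std::vector<Block>>& color_blocks) {
    for (const auto& blocks : color_blocks) {          // sequential over colors
        #pragma omp parallel for schedule(dynamic)     // parallel over same-color blocks
        for (std::size_t k = 0; k < blocks.size(); ++k) {
            for (int i = blocks[k].first; i < blocks[k].second; ++i) {
                double sigma = 0.0, diag = 1.0;
                for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
                    if (A.col_idx[p] == i) diag = A.val[p];
                    else sigma += A.val[p] * x[A.col_idx[p]];
                }
                x[i] = (b[i] - sigma) / diag;
            }
        }
        // the implicit barrier at the end of the parallel for separates colors
    }
}
```

The trade-off the abstract highlights is visible here: fewer colors mean fewer thread synchronizations per sweep, while coarser blocks preserve more of the multiplicative coupling that drives convergence.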
A multi-GPU implementation of the multilevel fast multipole algorithm (MLFMA) based on the hybrid OpenMP-CUDA parallel programming model (OpenMP-CUDA-MLFMA) is presented for computing electromagnetic scattering from a three-dimensional conducting object. The proposed hierarchical parallelization strategy ensures high computational throughput for the GPU calculation. The resulting OpenMP-based multi-GPU implementation is capable of solving real-life problems with over one million unknowns with a remarkable speed-up. The radar cross sections of a few benchmark objects are calculated to demonstrate the accuracy of the solution. The results are compared with those from the CPU-based MLFMA and with measurements. The capability and efficiency of the presented method are analyzed through the examples of a sphere, an aircraft, and a missile-like object. Compared with the 8-threaded CPU-based MLFMA, the OpenMP-CUDA-MLFMA method achieves total speed-up ratios of 5 to 20.
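A minimal sketch of the common OpenMP-per-GPU orchestration pattern that the OpenMP-CUDA combination suggests: one OpenMP thread binds to each device with cudaSetDevice and handles its share of MLFMA boxes. The box count, the static partitioning, and the omitted kernels are assumptions; only host-side CUDA runtime calls are shown, not the authors' implementation.

```cpp
// Minimal sketch of a multi-GPU orchestration pattern: one OpenMP thread per
// GPU, each processing its share of finest-level boxes. The aggregation,
// translation, and disaggregation kernels are omitted.
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus < 1) return 0;      // no device: nothing to do
    const int num_boxes = 1 << 16;   // assumed number of finest-level boxes

    #pragma omp parallel num_threads(num_gpus)
    {
        const int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);                       // bind this thread to one GPU

        // static partition of boxes across GPUs (assumed load-balancing scheme)
        const int begin = gpu * num_boxes / num_gpus;
        const int end   = (gpu + 1) * num_boxes / num_gpus;

        double* d_work = nullptr;
        cudaMalloc(&d_work, sizeof(double) * (end - begin));
        // ... launch per-box kernels here ...
        cudaDeviceSynchronize();
        cudaFree(d_work);

        std::printf("GPU %d handled boxes [%d, %d)\n", gpu, begin, end);
    }
    return 0;
}
```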
ISBN:
(print) 9781538610442
Preconditioned parallel solvers based on the Krylov iterative method are widely used in scientific and engineering applications. Communication overhead is a critical issue when executing these solvers on large-scale massively parallel supercomputers. In this work, we introduced communication-computation (CC) overlapping with dynamic loop scheduling of OpenMP to the sparse matrix-vector multiplication (SpMV) process of a parallel iterative solver. We then used the solver to evaluate the performance of a parallel finite element application (GeoFEM/Cube) on multicore and manycore clusters. The dynamic loop scheduling of OpenMP improved the efficiency of CC overlapping in halo exchanges, and the developed method attained a significant performance improvement of 40-50% for parallel iterative solvers in strong scaling using up to 16,384 cores of a Fujitsu PRIMEHPC FX10 supercomputer and an Intel Xeon Phi (KNL) cluster. Finally, the developed method was applied to GeoFEM/Cube using a parallel BiCGSTAB solver with sparse approximate inverse (SAI) preconditioning, and a 15-20% performance improvement was obtained on 12,288 cores of the Fujitsu FX10 and the KNL cluster.
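One common way to realize the CC overlapping described above is sketched below: halo Isend/Irecv requests are posted before the SpMV, interior rows (which need no halo data) are processed under schedule(dynamic) while communication proceeds, and boundary rows follow after MPI_Waitall. The interior-first row ordering, the chunk size, and the requirement of at least MPI_THREAD_SERIALIZED are illustrative assumptions, not the paper's code.

```cpp
// Minimal sketch of communication-computation overlap in a distributed SpMV.
// Requires MPI initialized with at least MPI_THREAD_SERIALIZED (assumption).
#include <mpi.h>
#include <omp.h>
#include <vector>

struct CSRMatrix {
    std::vector<int> row_ptr, col_idx;
    std::vector<double> val;
    int n_interior = 0, n_total = 0;  // interior rows first, boundary rows after (assumed ordering)
};

void spmv_overlapped(const CSRMatrix& A,
                     std::vector<double>& x /* local values followed by halo */,
                     std::vector<double>& y,
                     std::vector<MPI_Request>& requests /* pre-posted halo Isend/Irecv */) {
    #pragma omp parallel
    {
        // 1) interior rows need no halo data, so they overlap with communication;
        //    dynamic scheduling keeps threads busy at uneven row lengths
        #pragma omp for schedule(dynamic, 64) nowait
        for (int i = 0; i < A.n_interior; ++i) {
            double s = 0.0;
            for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
                s += A.val[p] * x[A.col_idx[p]];
            y[i] = s;
        }
        // 2) one thread completes the halo exchange; the rest wait at the barrier
        #pragma omp single
        MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                    MPI_STATUSES_IGNORE);
        // 3) boundary rows, now that the halo entries of x are valid
        #pragma omp for schedule(dynamic, 64)
        for (int i = A.n_interior; i < A.n_total; ++i) {
            double s = 0.0;
            for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p)
                s += A.val[p] * x[A.col_idx[p]];
            y[i] = s;
        }
    }
}
```

The benefit of dynamic scheduling reported in the abstract comes from step 1: threads that finish their chunks early immediately pick up new ones instead of idling while communication is still in flight.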
ISBN:
(print) 9781665422871
Rapidly changing computer architectures, such as those found at high-performance computing (HPC) facilities, present the need for mini-applications (miniapps) that capture essential algorithms used in large applications to test program performance and portability, aiding transitions to new systems. The COVID-19 pandemic has fueled a flurry of activity in computational drug discovery, including the use of supercomputers and GPU acceleration for massive virtual screens for therapeutics. Recent work targeting COVID-19 at the Oak Ridge Leadership Computing Facility (OLCF) used the GPU-accelerated program AutoDock-GPU to screen billions of compounds on the Summit supercomputer. In this paper we present the development of a new miniapp, miniAutoDock-GPU, that can be used to evaluate the performance and portability of GPU-accelerated protein-ligand docking programs on different computer architectures. These tests are especially relevant as facilities transition from petascale systems and prepare for upcoming exascale systems that will use a variety of GPU vendors. The key calculations, namely the Lamarckian genetic algorithm combined with a local search using a Solis-Wets based random optimization algorithm, are implemented. We developed versions of the miniapp using several different programming models for GPU acceleration, including a version using the CUDA runtime API for NVIDIA GPUs and a version using the Kokkos middleware API, which is built on C++ template libraries. A third version, currently in progress, uses the HIP programming model. These efforts will help facilitate the transition to exascale systems for this important emerging HPC application, as well as its use on a wide range of heterogeneous platforms.
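To make the named local search concrete, here is a generic, host-side sketch of a Solis-Wets style random optimization step of the kind such a miniapp ports across CUDA, Kokkos, and HIP. The scoring callback, the step-size schedule, and the bias heuristic are placeholders and do not reproduce AutoDock-GPU's implementation.

```cpp
// Minimal sketch of a Solis-Wets style random local search (placeholders only).
#include <random>
#include <vector>
#include <functional>
#include <cstddef>

std::vector<double> solis_wets(std::function<double(const std::vector<double>&)> score,
                               std::vector<double> x, int iters) {
    std::mt19937 rng(42);                     // fixed seed for reproducibility
    double rho = 1.0;                         // step size (assumed initial value)
    int successes = 0, failures = 0;
    std::vector<double> bias(x.size(), 0.0), cand(x.size()), delta(x.size());
    double best = score(x);

    for (int it = 0; it < iters; ++it) {
        std::normal_distribution<double> gauss(0.0, rho);
        for (std::size_t i = 0; i < x.size(); ++i) {
            delta[i] = gauss(rng);
            cand[i] = x[i] + bias[i] + delta[i];
        }
        double s = score(cand);
        if (s < best) {                        // improvement: accept and reinforce bias
            x = cand; best = s; ++successes; failures = 0;
            for (std::size_t i = 0; i < x.size(); ++i)
                bias[i] = 0.4 * delta[i] + 0.2 * bias[i];
        } else {                               // otherwise probe the opposite direction
            for (std::size_t i = 0; i < x.size(); ++i)
                cand[i] = x[i] - bias[i] - delta[i];
            s = score(cand);
            if (s < best) { x = cand; best = s; ++successes; failures = 0; }
            else { ++failures; successes = 0; }
        }
        if (successes > 4) { rho *= 2.0; successes = 0; }   // expand step size
        if (failures  > 4) { rho *= 0.5; failures  = 0; }   // contract step size
    }
    return x;
}
```

In a docking miniapp, score would be the protein-ligand scoring function and many such searches would run in parallel, one per genetic-algorithm individual, which is what makes the kernel a good portability benchmark.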
ISBN:
(print) 9781665435741
The alternating direction method of multipliers (ADMM) is an efficient algorithm for solving large-scale machine learning problems in a distributed environment. To make full use of the hierarchical memory model in modern high-performance computing systems, this paper implements a hybrid MPI/OpenMP parallelization of the asynchronous ADMM algorithm (AH-ADMM). The AH-ADMM algorithm updates local variables in parallel using OpenMP threads and exchanges information between MPI processes, which relieves memory and communication pressure by replacing multi-processing with multi-threading. Furthermore, for the SVM problem, the AH-ADMM algorithm speeds up the calculation of subproblems through an efficient parallel optimization strategy. This paper effectively combines the features of both algorithm design and the programming model. Experiments on the Ziqiang4000 high-performance cluster demonstrate that the AH-ADMM algorithm scales better and runs faster than existing distributed ADMM algorithms implemented with pure MPI. AH-ADMM reduces the communication overhead by up to 91.8% and increases the convergence rate by up to 36x. For large datasets, AH-ADMM scales well on a cluster with over 129 cores.
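The hybrid pattern described above can be sketched as a consensus-ADMM step in which each MPI process updates its local variables with OpenMP threads and the consensus variable is formed with MPI_Allreduce. The placeholder proximal update and the variable names are assumptions; this is not the AH-ADMM code, and in particular it shows the synchronous variant of the exchange.

```cpp
// Minimal sketch of one hybrid MPI/OpenMP consensus-ADMM iteration.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstddef>

void consensus_admm_step(std::vector<double>& x,   // local primal variable
                         std::vector<double>& u,   // local scaled dual variable
                         std::vector<double>& z,   // global consensus variable
                         int world_size) {
    const std::size_t d = x.size();

    // x-update: coordinates handled by OpenMP threads within the process
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < d; ++i)
        x[i] = z[i] - u[i];   // placeholder for argmin f_i(x) + (rho/2)||x - z + u||^2

    // z-update: consensus average of (x + u) across MPI processes
    std::vector<double> local(d), global(d);
    for (std::size_t i = 0; i < d; ++i) local[i] = x[i] + u[i];
    MPI_Allreduce(local.data(), global.data(), static_cast<int>(d),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    for (std::size_t i = 0; i < d; ++i) z[i] = global[i] / world_size;

    // u-update: scaled dual ascent, again thread-parallel
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < d; ++i)
        u[i] += x[i] - z[i];
}
```

Replacing one process per core with one multi-threaded process per node, as in this sketch, shrinks both the number of Allreduce participants and the duplicated per-process data, which is the memory and communication relief the abstract describes.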