ISBN: (print) 9781728143286
Modern heterogeneous high-end computing systems featuring many-core co-processors/accelerators make it difficult to parallelize and optimize real-world scientific codes. In this paper, we demonstrate highly scalable 3D Lattice Boltzmann multiphase flow simulations on the heterogeneous Tianhe-2 supercomputer using MPI+OpenMP+SIMD. We highlight the use of the OpenMP 4.5 accelerator programming model to coordinate CPUs and Intel Many Integrated Core (MIC) co-processors, and present a range of optimizations that exploit hierarchical parallelism on the heterogeneous architecture. With SIMD-friendly data structures, computation reordering, cache blocking and auto-vectorization, we substantially improve the single-thread performance of typical LBM computational kernels on a CPU core and on a MIC core. After implementing shared-memory OpenMP threading, we further improve the OpenMP performance by enabling huge pages and using the OpenMP collapse clause on the thread-rich MIC architecture. To enhance the collaborative efficiency among intra-node CPUs and co-processors, we propose a flexible load-balance model with heterogeneous domain decomposition for CPU-MIC task allocation, as well as asynchronous offloading to overlap the operations of CPUs and multiple MICs. We overlap all levels of CPU-MIC computation and communication and minimize the cost of halo exchanges as far as possible in large-scale scenarios. We present 3D liquid-gas multiphase cases simulating drop impact under gravity, using the D3Q19 Lattice Boltzmann discretization and the Shan-Chen BGK collision model, and achieve a weak-scaling parallel efficiency above 80% when going from 128 to 2048 compute nodes.
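For readers unfamiliar with the D3Q19 discretization mentioned above, the following is a minimal sketch of the standard D3Q19 velocity set and the textbook second-order BGK equilibrium distribution. It is not the paper's implementation: the function names `d3q19` and `bgk_equilibrium` are hypothetical, the Shan-Chen interaction force is omitted, and exact rational arithmetic is used only so the conservation properties can be checked without rounding error.

```python
from fractions import Fraction
from itertools import product

CS2 = Fraction(1, 3)  # squared lattice speed of sound for D3Q19


def d3q19():
    """Return (velocities, weights) of the standard D3Q19 lattice.

    Hypothetical helper for illustration: 1 rest velocity, 6 axis
    neighbours and 12 face-diagonal neighbours.
    """
    velocities, weights = [], []
    for c in product((-1, 0, 1), repeat=3):
        speed_sq = sum(x * x for x in c)
        if speed_sq == 0:
            w = Fraction(1, 3)
        elif speed_sq == 1:
            w = Fraction(1, 18)
        elif speed_sq == 2:
            w = Fraction(1, 36)
        else:
            continue  # the 8 cube corners are not part of D3Q19
        velocities.append(c)
        weights.append(w)
    return velocities, weights


def bgk_equilibrium(rho, u, velocities, weights):
    """Second-order equilibrium populations for density rho, velocity u."""
    u_sq = sum(x * x for x in u)
    feq = []
    for c, w in zip(velocities, weights):
        cu = sum(ci * ui for ci, ui in zip(c, u))
        feq.append(
            w * rho * (1 + cu / CS2 + cu * cu / (2 * CS2 * CS2) - u_sq / (2 * CS2))
        )
    return feq


if __name__ == "__main__":
    c, w = d3q19()
    rho = Fraction(2)
    u = (Fraction(1, 10), Fraction(-1, 20), Fraction(1, 50))
    feq = bgk_equilibrium(rho, u, c, w)
    # The equilibrium reproduces the macroscopic density and momentum.
    print(len(c), sum(w), sum(feq) == rho)
```

A production kernel would store the 19 populations in floating-point arrays laid out for SIMD access; this sketch only illustrates the lattice and the equilibrium that the BGK collision relaxes toward.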