Continuous casting blooms are large and the casting process is long, so simulating them is computationally expensive. Simulating the solidification structure with a traditional sequential CPU algorithm takes too long to meet the industrial need for process guidance. This study developed a multi-GPU cellular automaton (CA) model to accelerate the calculation. First, a heterogeneous GPU-CA parallel algorithm was developed to increase parallelism by eliminating data dependencies and data races among cells; in the capture process, a random arbitration mechanism decides which neighbor obtains the final capture right. Then, a multi-stream communication scheme was developed to overlap the calculation of the inner region with the data transfer and calculation of the halo region, hiding the overhead of data exchange between GPUs. Finally, the model was validated against the analytical LGK model and applied to simulate the solidification structure of a GCr15 bloom at a steel plant. The simulation shows a clear solidification structure with distinct columnar, equiaxed, and columnar-to-equiaxed transition (CET) zones. The proportions of the crystal zones agree well with low-magnification images from field experiments, with relative errors of 0.032%, 0.013%, and 0.025%. The multi-GPU application also calculates the temperature distribution during solidification with a maximum relative error of 0.013% compared to the field data. At nearly the same calculation precision as a single-core CPU, the model achieves a speedup of up to 700x, whereas a 20-core CPU achieves only about 14.2x.
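The random arbitration in the capture process described above can be illustrated with a minimal sketch. This is not the authors' GPU code: it is a serial Python stand-in that assumes a 2D grid, a four-cell neighborhood, and that every solid neighbor is already eligible to capture (the real model would decide eligibility from dendrite growth). The names `capture_step`, `LIQUID`, `SOLID`, and `NEIGHBORS` are hypothetical. Each liquid cell reads only the old arrays and writes only its own entry, which is the gather pattern that avoids write conflicts when one thread handles one cell.

```python
import random

LIQUID, SOLID = 0, 1
# Assumed 4-cell (von Neumann) neighborhood; the paper's neighborhood may differ.
NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def capture_step(state, grain, rng):
    """One capture pass. Reads the old grids, writes new ones (no in-place updates)."""
    ny, nx = len(state), len(state[0])
    new_state = [row[:] for row in state]
    new_grain = [row[:] for row in grain]
    for y in range(ny):
        for x in range(nx):
            if state[y][x] != LIQUID:
                continue
            # Gather the grain labels of all solid neighbors competing for this cell.
            candidates = [
                grain[y + dy][x + dx]
                for dy, dx in NEIGHBORS
                if 0 <= y + dy < ny and 0 <= x + dx < nx
                and state[y + dy][x + dx] == SOLID
            ]
            if candidates:
                # Random arbitration: exactly one competing neighbor wins the cell.
                new_state[y][x] = SOLID
                new_grain[y][x] = rng.choice(candidates)
    return new_state, new_grain

# Two grains (labels 1 and 2) compete for the liquid cell between them.
state = [[SOLID, LIQUID, SOLID]]
grain = [[1, 0, 2]]
new_state, new_grain = capture_step(state, grain, random.Random(0))
```

Because the winner is chosen by the captured cell rather than by the capturing neighbors, the result does not depend on the order in which cells are visited, so the loop body could in principle be mapped to one GPU thread per cell.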
Scalable parallel algorithms for particle transport are one of the main application fields in high-performance computing. The discrete ordinates method (S_n) is one of the most popular deterministic numerical methods for solving particle transport equations. In this paper, we introduce a new method for large-scale heterogeneous computing of one-energy-group, time-independent, deterministic discrete ordinates neutron transport in 3D Cartesian geometry (Sweep3D) on the Tianhe-2A supercomputer. For heterogeneous programming, we use the customized Basic Communication Library (BCL) and Accelerated Computing Library (ACL) to control and communicate between the CPU and the Matrix2000 accelerator. We use OpenMP directives to exploit thread-level parallelism on the Matrix2000. The test results show that applying OpenMP to the particle transport algorithm modified by our method yields a speedup of up to 11.3x. On the Tianhe-2A supercomputer, the parallel efficiency at 1.01 million cores relative to 170 thousand cores is 52%.
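As a rough reading of the 52% figure, the short calculation below converts it into a relative speedup. It assumes the strong-scaling definition (efficiency = relative speedup divided by the core-count ratio); the abstract does not state which definition was used, so the result is only indicative. The function name `relative_speedup` is hypothetical.

```python
def relative_speedup(base_cores, scaled_cores, efficiency):
    """Speedup of the larger run over the baseline run.

    Assumes strong scaling: efficiency = speedup / (scaled_cores / base_cores).
    """
    return efficiency * scaled_cores / base_cores

# Core counts and efficiency as quoted in the abstract.
speedup = relative_speedup(170_000, 1_010_000, 0.52)
print(f"core ratio: {1_010_000 / 170_000:.2f}x, relative speedup: {speedup:.2f}x")
```

Under that assumption, increasing the core count by a factor of about 5.9 makes the run roughly 3.1 times faster.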