This paper proposes a scalable two-level parallelization method for distributed hydrological models that can exploit parallelizability at both the sub-basin level and the basic simulation-unit level (e.g., grid cell) simultaneously. The approach first uses the message-passing programming model to dispatch parallel tasks at the sub-basin level to different nodes with multi-core CPUs in the cluster. Each node is responsible for some of the sub-basins. Parallel tasks for each sub-basin at the basic simulation-unit level are then dispatched to multiple cores within each node using the shared-memory programming model. A grid-based distributed hydrological model was parallelized to demonstrate the performance of the proposed method, which was tested in different scenarios (e.g., different data volumes, different numbers of sub-basins). Results show that the proposed two-level parallelization method had better scalability than parallel computation at the sub-basin level alone, and that parallel performance increased with data volume and the number of sub-basins. (C) 2016 Elsevier Ltd. All rights reserved.
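The two-level dispatch described in this abstract can be pictured with a minimal sketch. It is not the authors' implementation: Python threads stand in for the shared-memory level, a plain loop stands in for the message-passing level, and the round-robin assignment, the names `assign_subbasins`, `simulate_cell`, `simulate_subbasin`, and `run_node`, and the per-cell update are all hypothetical. Any dependencies between sub-basins are ignored here.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_subbasins(subbasin_ids, n_nodes):
    """Level 1: round-robin assignment of sub-basins to nodes (assumed policy)."""
    return [subbasin_ids[rank::n_nodes] for rank in range(n_nodes)]


def simulate_cell(value):
    """Hypothetical per-cell update; a real model would solve its water balance here."""
    return 0.5 * value


def simulate_subbasin(cells, n_threads):
    """Level 2: spread the cells of one sub-basin across the threads of a node."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(simulate_cell, cells))


def run_node(basins, my_subbasins, n_threads):
    """Work done by one node: process each sub-basin it owns."""
    return {sid: simulate_subbasin(basins[sid], n_threads) for sid in my_subbasins}


# Toy data: five sub-basins with four grid cells each.
basins = {sid: [float(sid + c) for c in range(4)] for sid in range(5)}
plan = assign_subbasins(sorted(basins), n_nodes=2)

results = {}
for node_subbasins in plan:  # in a message-passing program each rank runs only its own share
    results.update(run_node(basins, node_subbasins, n_threads=2))
```

With five sub-basins and two nodes, the plan gives one node three sub-basins and the other two, which hints at why the abstract reports better performance as the number of sub-basins grows: more sub-basins make an even split easier.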
We present a novel approach utilizing two-level dynamic load balancing for p-adaptive discontinuous Galerkin (DG) methods in compressible Computational Fluid Dynamics (CFD) simulations. A high-order explicit first stage, singly diagonally implicit Runge-Kutta (ESDIRK) method is employed for time integration, where pseudo-transient continuation is combined with the restarted generalized minimal residual (GMRES) method to solve the nonlinear equations at each stage of ESDIRK except the initial (explicit) stage. Relying on smoothness indicators, we carry out the refinement/coarsening process for p-adaptation with dynamic load balancing. This approach involves a coarse-level (distributed-memory) decomposition based on the MPI paradigm and a fine-level (shared-memory) decomposition based on the OpenMP paradigm, enhancing parallel efficiency. Dynamic load balancing is achieved by computing weights based on degrees of freedom, ensuring balanced computational loads across processors. The parallel computing framework adopts either graph-based partitioners (ParMETIS and Zoltan) or a space-filling-curve partitioner (GeMPa) for coarse-level partitioning, and graph-based partitioners (METIS and Zoltan) for fine-level partitioning. The effectiveness of the method is demonstrated through numerical examples, highlighting its potential to significantly improve the scalability and efficiency of compressible flow simulations. The numerical simulations were conducted using the CODA flow solver, a state-of-the-art tool developed collaboratively by the French National Aerospace Center (ONERA), the German Aerospace Center (DLR), and Airbus.
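To make the idea of degree-of-freedom weights concrete, here is a small sketch under stated assumptions. The abstract does not give the weight formula, so `dof_weight` assumes a tensor-product basis with (p+1)^3 degrees of freedom per element. A greedy "heaviest element to lightest part" heuristic is swapped in for the graph and space-filling-curve partitioners named above; unlike those, it ignores mesh connectivity entirely.

```python
import heapq


def dof_weight(p):
    """Assumed weight: (p + 1)^3 degrees of freedom for polynomial degree p."""
    return (p + 1) ** 3


def balance(degrees, n_parts):
    """Greedy heuristic: give the heaviest remaining element to the lightest part."""
    heap = [(0, part) for part in range(n_parts)]  # (current load, part id)
    owner = [None] * len(degrees)
    by_weight = sorted(range(len(degrees)), key=lambda e: -dof_weight(degrees[e]))
    for elem in by_weight:
        load, part = heapq.heappop(heap)
        owner[elem] = part
        heapq.heappush(heap, (load + dof_weight(degrees[elem]), part))
    loads = [0] * n_parts
    for elem, part in enumerate(owner):
        loads[part] += dof_weight(degrees[elem])
    return owner, loads


# Eight elements; p-adaptation has raised element 4 from degree 1 to degree 3.
degrees = [1, 1, 1, 1, 3, 1, 1, 1]
owner, loads = balance(degrees, n_parts=2)

# Splitting by element count instead (first four / last four) ignores the weights.
naive_loads = [sum(dof_weight(p) for p in degrees[:4]),
               sum(dof_weight(p) for p in degrees[4:])]
```

In this toy case the weighted split yields loads of 64 and 56, while the equal-count split yields 32 and 88, which illustrates why a partition has to be recomputed after the polynomial degrees change.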
In this paper, we present a scalable three-dimensional parallel Delaunay image-to-mesh conversion algorithm. A nested master-worker model is used to simultaneously explore process-level and thread-level parallelization. The mesh generation includes two stages: coarse and fine meshing. First, a coarse mesh is constructed in parallel by the threads of the master process. Then the coarse mesh is partitioned. Finally, the fine mesh refinement procedure is executed until all the elements in the mesh satisfy the quality and fidelity criteria. Communication and computation are separated during the fine mesh refinement procedure. The master thread of each process, which initializes the MPI environment, is in charge of the inter-node MPI communication for data (submesh) movement, while the worker threads of each process are responsible for the local mesh refinement within the node. We conducted a set of experiments to test the performance of the algorithm on distributed-memory clusters and observed that the granularity of the coarse-level data decomposition, which affects the coarse-level concurrency, has a significant influence on the performance of the algorithm. With a proper value of granularity, the algorithm is scalable to 45 distributed-memory compute nodes (900 cores). (C) 2017 Elsevier Ltd. All rights reserved.
This paper presents the recent improvements in the DeCART code for HTGR analysis. A new 190-group DeCART cross-section library based on ENDF/B-VII.0 was generated using the KAERI library processing system for HTGR. Two methods for the eigen-mode adjoint flux calculation were implemented. An azimuthal angle discretization method based on Gaussian quadrature was implemented to reduce the error from the azimuthal angle discretization. A two-level parallelization using MPI and OpenMP was adopted for massively parallel computations. A quadratic depletion solver was implemented to reduce the error involved in the Gd depletion. A module to generate equivalent group constants was implemented for the nodal codes. The geometry-handling capabilities of the DeCART code were improved, including an approximate treatment of a cylindrical outer boundary, an explicit border model, the R-G-B checkerboard model, and a super-cell model for a hexagonal geometry. The newly improved and implemented functionalities were verified against various numerical benchmarks, such as the OECD/MHTGR-350 benchmark phase III problems, two-dimensional high-temperature gas-cooled reactor benchmark problems derived from the MHTGR-350 reference design, and numerical benchmark problems based on the compact nuclear power source experiment, by comparing the DeCART solutions with the Monte Carlo reference solutions obtained using the McCARD code. (C) 2018 Korean Nuclear Society, Published by Elsevier Korea LLC.
In this paper, we present a scalable three-dimensional hybrid MPI+Threads parallel Delaunay image-to-mesh conversion algorithm. A nested master-worker communication model for parallel mesh generation is implemented which simultaneously explores process-level parallelization and thread-level parallelization: inter-node communication using MPI and inter-core communication inside one node using threads. In order to overlap communication (task requests and data movement) with computation (parallel mesh refinement), the inter-node MPI communication and the intra-node local mesh refinement are separated. The master thread that initializes the MPI environment is in charge of the inter-node MPI communication, while the worker threads of each process are only responsible for the local mesh refinement within the node. We conducted a set of experiments to test the performance of the algorithm on Turing, a distributed-memory cluster at the Old Dominion University High Performance Computing Center, and observed that the granularity of the coarse-level data decomposition, which affects the coarse-level concurrency, has a significant influence on the performance of the algorithm. With a proper value of granularity, the algorithm shows strong performance potential and is scalable to 30 distributed-memory compute nodes with 20 cores each (the maximum number of nodes available to us in the experiments). (C) 2016 The Authors. Published by Elsevier Ltd.
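The separation of a communicating master thread from refining worker threads can be sketched inside a single process as follows. This is an illustration only, not the authors' code: a plain list stands in for submeshes arriving over MPI, a thread-safe queue stands in for the task hand-off, and `refine`, `communicator`, `worker`, and `run_process` are hypothetical names. The "refinement" merely doubles an element count.

```python
import queue
import threading


def refine(submesh):
    """Hypothetical stand-in for local refinement: doubles the element count."""
    return submesh["id"], 2 * submesh["elements"]


def communicator(incoming, tasks, n_workers):
    """Master thread: the only thread that would perform inter-node communication."""
    for submesh in incoming:      # stands in for receiving submeshes from other nodes
        tasks.put(submesh)
    for _ in range(n_workers):
        tasks.put(None)           # one stop signal per worker


def worker(tasks, results, lock):
    """Worker thread: refines submeshes and never communicates with other nodes."""
    while True:
        submesh = tasks.get()
        if submesh is None:
            break
        sid, n_elements = refine(submesh)
        with lock:
            results[sid] = n_elements


def run_process(incoming, n_workers=3):
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    workers = [threading.Thread(target=worker, args=(tasks, results, lock))
               for _ in range(n_workers)]
    for t in workers:
        t.start()
    communicator(incoming, tasks, n_workers)  # runs on the calling (master) thread
    for t in workers:
        t.join()
    return results


incoming = [{"id": i, "elements": 10 * (i + 1)} for i in range(6)]
refined = run_process(incoming)
```

Because the workers start pulling from the queue while the master is still feeding it, data movement and refinement proceed at the same time. In an MPI program this division of labor would typically allow the funneled thread-support level, where only the thread that initialized MPI makes MPI calls.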