Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems, it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to 40 billion computational cells, comprising more than 400 billion floating point values. Checkpoint creation is shown to require only a few seconds, and the new checkpointing scheme scales almost perfectly up to more than 260,000 (2^18) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated into a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the materials sciences and with a lattice Boltzmann method implementation.
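A minimal sketch of what a diskless "buddy" checkpoint and a ULFM-based recovery step could look like; this is not the authors' scheme, and the pairing strategy, the helper names, and the handling of the failed rank's data are assumptions. It only relies on standard MPI plus the ULFM extensions (MPIX_Comm_revoke, MPIX_Comm_shrink).

// Hypothetical sketch: each rank keeps its own snapshot in memory and additionally stores
// a copy of a partner rank's snapshot, so a failed rank can be restored without any disk I/O.
#include <mpi.h>
#include <mpi-ext.h>   // ULFM extensions (MPIX_Comm_revoke, MPIX_Comm_shrink, ...)
#include <vector>

// Exchange local snapshots with a buddy rank (rank ^ 1 as a simple pairing; real schemes differ).
void create_diskless_checkpoint(MPI_Comm comm,
                                const std::vector<double>& local_state,
                                std::vector<double>& buddy_copy) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int buddy = rank ^ 1;
    if (buddy >= size) return;                    // odd number of ranks: last rank has no buddy here
    buddy_copy.resize(local_state.size());        // assumes equally sized partitions
    MPI_Sendrecv(local_state.data(), static_cast<int>(local_state.size()), MPI_DOUBLE, buddy, 0,
                 buddy_copy.data(),  static_cast<int>(buddy_copy.size()),  MPI_DOUBLE, buddy, 0,
                 comm, MPI_STATUS_IGNORE);
}

// On failure, ULFM lets the survivors observe the error and shrink the communicator; the buddy
// copy of the failed rank's partition would then be handed to a replacement rank.
MPI_Comm recover_after_failure(MPI_Comm broken) {
    MPI_Comm repaired;
    MPIX_Comm_revoke(broken);             // ensure every survivor sees the failure
    MPIX_Comm_shrink(broken, &repaired);  // communicator containing only the surviving ranks
    // ... redistribute the buddy copy of the failed partition to a spare rank (omitted) ...
    return repaired;
}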
In the last decades, simulations have been established in several fields of science and industry to study various phenomena by solving, inter alia, partial differential equations. For an efficient use of current and future high performance computing systems with many thousands of compute ranks, high node-level performance, scalable communication, and the omission of unnecessary calculations are of high priority in the development of new solvers. The challenge of contemporary simulation applications is to bridge the gap between the scales of the various physical processes. We introduce the NAStJA framework, a block-based MPI-parallel solver for arbitrary algorithms based on stencil codes or other regular grid methods. NAStJA decomposes the domain of spatially complex structures into small cuboid blocks. A special feature of NAStJA is the dynamic block adaption, which modifies the calculation domain around the region where the computation currently takes place and hence avoids unnecessary calculations; such localized activity occurs frequently in, inter alia, phase-field simulations. Block creation and deletion are managed autonomously within local neighborhoods. A basic load balancing mechanism allows a redistribution of newly created blocks to the involved compute ranks. The use of a multi-hop network to distribute information to the entire domain avoids collective all-gather communications, and thus we can demonstrate excellent scaling. The present scaling tests substantiate the enormous advantage of this adaptive method. For certain simulation scenarios, we can show that the calculation effort and memory consumption are reduced to only 3.5 percent of the classical full-domain reference simulation. The overhead of 70-100 percent for the dynamic adaptive block creation is significantly lower than the gain. The approach is not restricted to phase-field simulations and can be employed in other domains of computational science to exploit the sparsity of computing regions.
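An illustrative sketch of the dynamic block adaption idea (not the NAStJA API): a block stays alive only while the moving interface touches it, and it locally decides which neighbouring blocks must exist next. All type and function names here are invented for the example.

// A block marks itself active when the phase-field interface is present and asks for the
// creation of its 26 neighbouring blocks; blocks without interface may be deleted.
#include <array>
#include <vector>

struct Block {
    std::array<int, 3> coord;            // block position in the global block grid
    std::vector<double> phi;             // phase-field values of this cuboid block
    bool containsInterface() const {     // "interface" = values strictly between 0 and 1
        for (double v : phi)
            if (v > 0.0 && v < 1.0) return true;
        return false;
    }
};

// Decide locally which neighbouring blocks must exist for the next time step.
// A real framework would inspect only the halo layer, not the whole block.
std::vector<std::array<int, 3>> requiredNeighbours(const Block& b) {
    std::vector<std::array<int, 3>> result;
    if (!b.containsInterface()) return result;   // block carries no work and may be deleted
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dx || dy || dz)
                    result.push_back({b.coord[0] + dx, b.coord[1] + dy, b.coord[2] + dz});
    return result;
}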
In this article, we present a novel approach for block-structured adaptive mesh refinement (AMR) that is suitable for extreme-scale parallelism. All data structures are designed such that the size of the metadata in each distributed processor memory remains bounded independently of the number of processors. In all stages of the AMR process, we use only distributed algorithms. No central resources such as a master process or replicated data are employed, so that unlimited scalability can be achieved. For the dynamic load balancing in particular, we propose to exploit the hierarchical nature of the block-structured domain partitioning by creating a lightweight, temporary copy of the core data structure. This copy acts as a local and fully distributed proxy data structure. It does not contain simulation data but only provides topological information about the domain partitioning into blocks. Ultimately, this approach enables an inexpensive, local, diffusion-based dynamic load balancing scheme. We demonstrate the excellent performance and the full scalability of our new AMR implementation on two architecturally different petascale supercomputers. Benchmarks on an IBM Blue Gene/Q system with a mesh containing 3.7 trillion unknowns distributed to 458,752 processes confirm the applicability for future extreme-scale parallel machines. The algorithms proposed in this article operate on blocks that result from the domain partitioning. This concept and its realization support the storage of arbitrary data. In consequence, the software framework can be used for different simulation methods, including mesh-based and meshless methods. In this article, we demonstrate fluid simulations based on the lattice Boltzmann method.
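A rough sketch of the proxy/diffusion idea under simplifying assumptions, not the framework's implementation: the proxy stores only block identifiers, weights, and target processes, so a diffusion-style balancing step can be iterated cheaply before any simulation data is migrated. All names are hypothetical.

#include <vector>

struct ProxyBlock {
    long id;              // global block identifier
    double weight;        // estimated computational load of the block
    int targetProcess;    // process the block will be assigned to after balancing
};

// One diffusion iteration with a single neighbouring process: offer blocks to the less
// loaded side until roughly half of the load difference has been transferred.
void diffuseLoad(std::vector<ProxyBlock>& myBlocks, double myLoad,
                 double neighbourLoad, int neighbourRank) {
    double excess = (myLoad - neighbourLoad) / 2.0;
    for (ProxyBlock& b : myBlocks) {
        if (excess <= 0.0) break;
        if (b.targetProcess != neighbourRank) {
            b.targetProcess = neighbourRank;   // mark block for migration; no data moves yet
            excess -= b.weight;
        }
    }
}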
ISBN (Print): 9781728101767
Efficiently performing collective communications in current high-performance computing systems is a time-consuming task, and on future exascale systems this communication time will increase further. However, global information is frequently required in various physical models. By exploiting domain knowledge of the model behavior, globally needed information can be distributed more efficiently, using only peer-to-peer communication that spreads the information to all processes asynchronously over multiple communication steps. In this article, we introduce a multi-hop Manhattan Street Network (MSN) for global information exchange and show the conditions under which a local neighbor exchange is sufficient for exchanging distributed information. Beyond the MSN, in various models global information is only needed in a spatially limited region inside the simulation domain. Therefore, a second network, the local exchange network, is introduced to exploit this spatial assumption. Both non-collective global exchange networks are implemented in the massively parallel NAStJA framework. Based on two models, a phase-field model for droplet simulations and the cellular Potts model for biological tissue simulations, we demonstrate the wide applicability of these networks. Scaling tests of the networks show nearly ideal scaling behavior with an efficiency of over 90%. A theoretical prediction of the communication time on future exascale systems shows an enormous O(1) advantage of the presented exchange methods, obtained by exploiting the domain knowledge.
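A minimal sketch of a multi-hop exchange on a 2D periodic process grid, assuming a scalar quantity reduced with max and MSN-like alternating directions per row/column; it is not the NAStJA networks themselves, only an illustration that global information can spread through repeated neighbour-only communication.

#include <mpi.h>
#include <algorithm>

// cart is assumed to be a 2D periodic Cartesian communicator created with MPI_Cart_create.
double multiHopMax(double localValue, MPI_Comm cart, int steps) {
    int coords[2], dims[2], periods[2];
    MPI_Cart_get(cart, 2, dims, periods, coords);
    double value = localValue;
    for (int s = 0; s < steps; ++s) {
        for (int dim = 0; dim < 2; ++dim) {
            // Direction alternates with the orthogonal coordinate, as in a Manhattan Street Network.
            int dir = (coords[1 - dim] % 2 == 0) ? +1 : -1;
            int src, dst;
            MPI_Cart_shift(cart, dim, dir, &src, &dst);
            double recv = value;
            MPI_Sendrecv(&value, 1, MPI_DOUBLE, dst, 0,
                         &recv,  1, MPI_DOUBLE, src, 0, cart, MPI_STATUS_IGNORE);
            value = std::max(value, recv);    // merge the neighbour's information locally
        }
    }
    return value;   // after enough steps every rank holds the global maximum
}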
The moving contact line problem plays an important role in fluid-fluid interface motion on solid surfaces. The problem can be described by a phase-field model consisting of the coupled Cahn-Hilliard and Navier-Stokes equations with the generalized Navier boundary condition (GNBC). Accurate simulation of the interface and contact line motion requires very fine meshes, and the computation in 3D is even more challenging. Thus, the use of high performance computers and scalable parallel algorithms is indispensable. In this paper, we generalize the GNBC to surfaces with complex geometry and introduce a finite element method on unstructured 3D meshes with a semi-implicit time integration scheme. A highly parallel solution strategy using different solvers for the different components of the discretization is presented. More precisely, we apply a restricted additive Schwarz preconditioned GMRES method to solve the systems arising from the implicit discretization of the Cahn-Hilliard equation and the velocity equation, and an algebraic multigrid preconditioned CG method to solve the pressure Poisson system. Numerical experiments show that the strategy is efficient and scalable for 3D problems with complex geometry and on a supercomputer with a large number of processors.
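One possible way to realize the described solver split is sketched below with PETSc; the abstract does not state the authors' software stack, so this is an assumption, and the matrices and vectors are assumed to be assembled elsewhere. GMRES with a restricted additive Schwarz preconditioner serves the implicit Cahn-Hilliard and velocity systems, CG with algebraic multigrid serves the pressure Poisson system.

#include <petscksp.h>

// Implicit Cahn-Hilliard or velocity system: GMRES + restricted additive Schwarz (RAS).
void solveImplicitSystem(Mat A, Vec b, Vec x) {
    KSP ksp; PC pc;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPGMRES);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCASM);                 // additive Schwarz preconditioner
    PCASMSetType(pc, PC_ASM_RESTRICT);    // restricted variant
    KSPSolve(ksp, b, x);
    KSPDestroy(&ksp);
}

// Pressure Poisson system: CG + algebraic multigrid preconditioner.
void solvePressurePoisson(Mat Ap, Vec b, Vec p) {
    KSP ksp; PC pc;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, Ap, Ap);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCGAMG);
    KSPSolve(ksp, b, p);
    KSPDestroy(&ksp);
}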
The lattice Boltzmann method exhibits excellent scalability on current supercomputing systems and has thus increasingly become an alternative method for large-scale nonstationary flow simulations, reaching up to a trillion (10^12) grid nodes. Additionally, grid refinement can lead to substantial savings in memory and compute time. These savings, however, come at the cost of much more complex data structures and algorithms. In particular, the interface between subdomains with different grid sizes must receive special treatment. In this article, we present parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework WALBERLA in order to support large-scale, massively parallel lattice Boltzmann based simulations on nonuniform grids. Additionally, we evaluate the performance of our approach on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling benchmarks with refined grids are executed with almost 2 million threads, demonstrating not only near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell updates per second. On an Intel-based system, strong scaling of a simulation with refined grids and a total of more than 8.5 million cells is demonstrated to reach a time of less than 1 millisecond per time step. This enables simulations with complex, nonuniform grids and 4 million time steps per hour of compute time.
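A conceptual sketch of level-wise time stepping on a refined lattice Boltzmann grid, as commonly used in LBM grid refinement: each finer level advances with half the time step and therefore performs two steps per coarse step, with interface treatment in between. The function names are illustrative placeholders, not the framework's API.

#include <cstdio>

// Placeholder kernels; a real code performs the LBM update and the scale interface treatment here.
static void collideAndStream(int level)          { std::printf("LBM step on level %d\n", level); }
static void coarseToFineInterpolation(int level) { std::printf("  interpolate %d -> %d\n", level, level + 1); }
static void fineToCoarseRestriction(int level)   { std::printf("  restrict    %d -> %d\n", level + 1, level); }

// Recursive time stepping: dt on level L+1 is half of dt on level L.
void advanceLevel(int level, int finestLevel) {
    collideAndStream(level);
    if (level < finestLevel) {
        advanceLevel(level + 1, finestLevel);    // first fine step
        advanceLevel(level + 1, finestLevel);    // second fine step
        coarseToFineInterpolation(level);        // fill fine ghost layers from coarse data
        fineToCoarseRestriction(level);          // update coarse interface cells from fine data
    }
}

int main() { advanceLevel(0, 2); }   // three refinement levels: 0 (coarse) to 2 (fine)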
ISBN (Print): 9781509008551
Astrophysics is the branch of astronomy that employs the principles of physics and chemistry "to ascertain the nature of the heavenly bodies, rather than their positions or motions in space". We describe a new version of our AstroPhi code for the simulation of the dynamics of astrophysical objects and other physical processes on hybrid supercomputers equipped with Intel Xeon Phi accelerators. The new version of the AstroPhi code was rewritten following a co-design approach, meaning that knowledge of the latest Intel Xeon Phi generation was taken into account during code development. Results obtained with the AstroPhi code on an Intel Xeon Phi based massively parallel supercomputer are presented in this paper. The RSC PetaStream architecture is used to simulate astrophysical problems at high resolution. Galaxy collision problems with chemodynamics and spiral galaxy formation tests are presented as a demonstration of the AstroPhi code.
ISBN (Print): 9781509028863
We present the development of a scalable parallel algorithm and solver for computational electromagnetics based on a double higher order method of moments in the surface integral equation formulation, in conjunction with a direct hierarchically semiseparable (HSS) structures solver. Multiscale modeling using the new method, for electrically very large structures that also include electrically very small details, is discussed, together with several advancement strategies.
ISBN (Print): 9781479980062
The AstroPhi code is designed for simulating the dynamics of astrophysical objects on hybrid supercomputers equipped with Intel Xeon Phi computation accelerators. The new RSC PetaStream massively parallel architecture is used for the simulations. The results of accelerating AstroPhi in the Intel Xeon Phi native and offload execution modes are presented in this paper. The RSC PetaStream architecture makes it possible to simulate astrophysical problems at high resolution. The AGNES simulation tool was used to model the scalability of the AstroPhi code. Several gravitational collapse problems are presented as a demonstration of the AstroPhi code.
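A minimal illustration of the two Xeon Phi execution modes mentioned above, using generic Intel LEO offload pragmas of that era; it is not code from AstroPhi, and the kernel is a toy example. In native mode the whole program is compiled for and run on the coprocessor, so no pragma is needed; in offload mode the host ships selected loops to the card.

#include <cstddef>

void advect(float* rho, const float* flux, std::size_t n) {
#ifdef __INTEL_OFFLOAD
    // Offload mode: copy inputs to the coprocessor, run the loop there, copy results back.
    #pragma offload target(mic:0) in(flux : length(n)) inout(rho : length(n))
#endif
    #pragma omp parallel for
    for (std::size_t i = 1; i < n; ++i)
        rho[i] += flux[i] - flux[i - 1];   // toy first-order update, for illustration only
}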
ISBN (Print): 9781450313506
Parallel marking algorithms use multiple threads to walk through the object heap graph and mark each reachable object as live. Parallel marker threads mark an object "live" by atomically setting a bit in a mark-bitmap or a bit in the object header. Most of these parallel algorithms strive to improve the marking throughput by using work-stealing algorithms for load-balancing and to ensure that all participating threads are kept busy. A purely "processor-centric" load-balancing approach, in conjunction with the need to atomically set the mark bit, results in significant contention during parallel marking. This limits the scalability and throughput of parallel marking algorithms. We describe a new non-blocking and lock-free work-sharing algorithm, the primary goal being to reduce contention during atomic updates of the mark-bitmap by parallel task-threads. Our work-sharing mechanism uses the address of a word in the mark-bitmap as the key to stripe work among parallel task-threads, with only a subset of the task-threads working on each stripe. This filters out most of the contention during parallel marking, with approximately 20% improvements in performance. In the case of concurrent and on-the-fly collector algorithms, mutator threads also generate marking work for the marking task-threads. In these schemes, mutator threads are also provided with thread-local marking stacks where they collect references to potentially "gray" objects, i.e., objects that haven't been "marked-through" by the collector. We note that since this work is generated by mutators when they reference these objects, there is a high likelihood that these objects are still present in the processor cache. We describe and evaluate a scheme to distribute mutator-generated marking work among the collector's task-threads that is cognizant of the processor and cache topology. We prototype both our algorithms within the C4 [28] collector that ships as part of an industrial-strength JVM for the Linux-X86 platform.
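An illustrative sketch of the striping idea only, not the algorithm from the paper: the index of the mark-bitmap word that an object maps to selects a stripe, and each stripe is served by a subset of the marking threads, so two threads rarely race on the same bitmap word.

#include <atomic>
#include <cstdint>
#include <vector>

struct MarkBitmap {
    std::vector<std::atomic<uint64_t>> words;
    explicit MarkBitmap(std::size_t numObjects) : words((numObjects + 63) / 64) {}

    // Returns true if this call actually set the bit (i.e., the object was newly marked).
    bool mark(std::size_t objectIndex) {
        std::size_t word = objectIndex / 64;
        uint64_t bit = uint64_t(1) << (objectIndex % 64);
        uint64_t old = words[word].fetch_or(bit, std::memory_order_relaxed);
        return (old & bit) == 0;
    }

    // Stripe selection: the bitmap word index (derived from the object's address range)
    // decides which group of task-threads is responsible for marking this object.
    std::size_t stripeOf(std::size_t objectIndex, std::size_t numStripes) const {
        return (objectIndex / 64) % numStripes;
    }
};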