Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems, it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to 40 billion computational cells, comprising more than 400 billion floating point values. Checkpoint creation is shown to require only a few seconds, and the new checkpointing scheme scales almost perfectly up to more than 260,000 (2^18) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated into a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the materials sciences and with a lattice Boltzmann method implementation.
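A minimal sketch of what a diskless "buddy" checkpoint and a ULFM-based recovery step could look like; this is not the authors' scheme, and the pairing strategy, the helper names, and the handling of the failed rank's data are assumptions. It only relies on standard MPI plus the ULFM extensions (MPIX_Comm_revoke, MPIX_Comm_shrink).

// Hypothetical sketch: each rank keeps its own snapshot in memory and additionally stores
// a copy of a partner rank's snapshot, so a failed rank can be restored without any disk I/O.
#include <mpi.h>
#include <mpi-ext.h>   // ULFM extensions (MPIX_Comm_revoke, MPIX_Comm_shrink, ...)
#include <vector>

// Exchange local snapshots with a buddy rank (rank ^ 1 as a simple pairing; real schemes differ).
void create_diskless_checkpoint(MPI_Comm comm,
                                const std::vector<double>& local_state,
                                std::vector<double>& buddy_copy) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int buddy = rank ^ 1;
    if (buddy >= size) return;                    // odd number of ranks: last rank has no buddy here
    buddy_copy.resize(local_state.size());        // assumes equally sized partitions
    MPI_Sendrecv(local_state.data(), static_cast<int>(local_state.size()), MPI_DOUBLE, buddy, 0,
                 buddy_copy.data(),  static_cast<int>(buddy_copy.size()),  MPI_DOUBLE, buddy, 0,
                 comm, MPI_STATUS_IGNORE);
}

// On failure, ULFM lets the survivors observe the error and shrink the communicator; the buddy
// copy of the failed rank's partition would then be handed to a replacement rank.
MPI_Comm recover_after_failure(MPI_Comm broken) {
    MPI_Comm repaired;
    MPIX_Comm_revoke(broken);             // ensure every survivor sees the failure
    MPIX_Comm_shrink(broken, &repaired);  // communicator containing only the surviving ranks
    // ... redistribute the buddy copy of the failed partition to a spare rank (omitted) ...
    return repaired;
}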
In the last decades, simulations have been established in several fields of science and industry to study various phenomena by solving, inter alia, partial differential equations. For an efficient use of current and future high performance computing systems with many thousands of compute ranks, high node-level performance, scalable communication, and the omission of unnecessary calculations are of high priority in the development of new solvers. The challenge of contemporary simulation applications is to bridge the gap between the scales of the various physical processes. We introduce the NAStJA framework, a block-based MPI-parallel solver for arbitrary algorithms based on stencil codes or other regular grid methods. NAStJA decomposes the domain of spatially complex structures into small cuboid blocks. A special feature of NAStJA is the dynamic block adaption, which modifies the calculation domain around the region where the computation currently takes place and hence avoids unnecessary calculations; such localized activity occurs frequently in, inter alia, phase-field simulations. Block creation and deletion are managed autonomously within local neighborhoods. A basic load balancing mechanism allows a redistribution of newly created blocks to the involved compute ranks. The use of a multi-hop network to distribute information to the entire domain avoids collective all-gather communications, and thus we can demonstrate excellent scaling. The present scaling tests substantiate the enormous advantage of this adaptive method. For certain simulation scenarios, we can show that the calculation effort and memory consumption are reduced to only 3.5 percent of the classical full-domain reference simulation. The overhead of 70-100 percent for the dynamic adaptive block creation is significantly lower than the gain. The approach is not restricted to phase-field simulations and can be employed in other domains of computational science to exploit the sparsity of computing regions.
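An illustrative sketch of the dynamic block adaption idea (not the NAStJA API): a block stays alive only while the moving interface touches it, and it locally decides which neighbouring blocks must exist next. All type and function names here are invented for the example.

// A block marks itself active when the phase-field interface is present and asks for the
// creation of its 26 neighbouring blocks; blocks without interface may be deleted.
#include <array>
#include <vector>

struct Block {
    std::array<int, 3> coord;            // block position in the global block grid
    std::vector<double> phi;             // phase-field values of this cuboid block
    bool containsInterface() const {     // "interface" = values strictly between 0 and 1
        for (double v : phi)
            if (v > 0.0 && v < 1.0) return true;
        return false;
    }
};

// Decide locally which neighbouring blocks must exist for the next time step.
// A real framework would inspect only the halo layer, not the whole block.
std::vector<std::array<int, 3>> requiredNeighbours(const Block& b) {
    std::vector<std::array<int, 3>> result;
    if (!b.containsInterface()) return result;   // block carries no work and may be deleted
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dx || dy || dz)
                    result.push_back({b.coord[0] + dx, b.coord[1] + dy, b.coord[2] + dz});
    return result;
}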
In this article, we present a novel approach for block-structured adaptive mesh refinement (AMR) that is suitable for extreme-scale parallelism. All data structures are designed such that the size of the metadata in each distributed processor memory remains bounded independently of the number of processors. In all stages of the AMR process, we use only distributed algorithms. No central resources such as a master process or replicated data are employed, so that unlimited scalability can be achieved. For the dynamic load balancing in particular, we propose to exploit the hierarchical nature of the block-structured domain partitioning by creating a lightweight, temporary copy of the core data structure. This copy acts as a local and fully distributed proxy data structure. It does not contain simulation data but only provides topological information about the domain partitioning into blocks. Ultimately, this approach enables an inexpensive, local, diffusion-based dynamic load balancing scheme. We demonstrate the excellent performance and the full scalability of our new AMR implementation on two architecturally different petascale supercomputers. Benchmarks on an IBM Blue Gene/Q system with a mesh containing 3.7 trillion unknowns distributed to 458,752 processes confirm the applicability for future extreme-scale parallel machines. The algorithms proposed in this article operate on blocks that result from the domain partitioning. This concept and its realization support the storage of arbitrary data. In consequence, the software framework can be used for different simulation methods, including mesh-based and meshless methods. In this article, we demonstrate fluid simulations based on the lattice Boltzmann method.
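A rough sketch of the proxy/diffusion idea under simplifying assumptions, not the framework's implementation: the proxy stores only block identifiers, weights, and target processes, so a diffusion-style balancing step can be iterated cheaply before any simulation data is migrated. All names are hypothetical.

#include <vector>

struct ProxyBlock {
    long id;              // global block identifier
    double weight;        // estimated computational load of the block
    int targetProcess;    // process the block will be assigned to after balancing
};

// One diffusion iteration with a single neighbouring process: offer blocks to the less
// loaded side until roughly half of the load difference has been transferred.
void diffuseLoad(std::vector<ProxyBlock>& myBlocks, double myLoad,
                 double neighbourLoad, int neighbourRank) {
    double excess = (myLoad - neighbourLoad) / 2.0;
    for (ProxyBlock& b : myBlocks) {
        if (excess <= 0.0) break;
        if (b.targetProcess != neighbourRank) {
            b.targetProcess = neighbourRank;   // mark block for migration; no data moves yet
            excess -= b.weight;
        }
    }
}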
ISBN (Print): 9781728101767
Efficiently performing collective communications in current high-performance computing systems is a time-consuming task, and on future exascale systems this communication time will increase further. However, global information is frequently required in various physical models. By exploiting domain knowledge of the model behavior, globally needed information can be distributed more efficiently, using only peer-to-peer communication that spreads the information to all processes asynchronously over multiple communication steps. In this article, we introduce a multi-hop Manhattan Street Network (MSN) for global information exchange and show the conditions under which a local neighbor exchange is sufficient for exchanging distributed information. Beyond the MSN, in various models global information is only needed in a spatially limited region inside the simulation domain. Therefore, a second network, the local exchange network, is introduced to exploit this spatial assumption. Both non-collective global exchange networks are implemented in the massively parallel NAStJA framework. Based on two models, a phase-field model for droplet simulations and the cellular Potts model for biological tissue simulations, we demonstrate the wide applicability of these networks. Scaling tests of the networks show nearly ideal scaling behavior with an efficiency of over 90%. A theoretical prediction of the communication time on future exascale systems shows an enormous O(1) advantage of the presented exchange methods, obtained by exploiting the domain knowledge.
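A minimal sketch of a multi-hop exchange on a 2D periodic process grid, assuming a scalar quantity reduced with max and MSN-like alternating directions per row/column; it is not the NAStJA networks themselves, only an illustration that global information can spread through repeated neighbour-only communication.

#include <mpi.h>
#include <algorithm>

// cart is assumed to be a 2D periodic Cartesian communicator created with MPI_Cart_create.
double multiHopMax(double localValue, MPI_Comm cart, int steps) {
    int coords[2], dims[2], periods[2];
    MPI_Cart_get(cart, 2, dims, periods, coords);
    double value = localValue;
    for (int s = 0; s < steps; ++s) {
        for (int dim = 0; dim < 2; ++dim) {
            // Direction alternates with the orthogonal coordinate, as in a Manhattan Street Network.
            int dir = (coords[1 - dim] % 2 == 0) ? +1 : -1;
            int src, dst;
            MPI_Cart_shift(cart, dim, dir, &src, &dst);
            double recv = value;
            MPI_Sendrecv(&value, 1, MPI_DOUBLE, dst, 0,
                         &recv,  1, MPI_DOUBLE, src, 0, cart, MPI_STATUS_IGNORE);
            value = std::max(value, recv);    // merge the neighbour's information locally
        }
    }
    return value;   // after enough steps every rank holds the global maximum
}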
The moving contact line problem plays an important role in fluid-fluid interface motion on solid surfaces. The problem can be described by a phase-field model consisting of the coupled Cahn-Hilliard and Navier-Stokes equations with the generalized Navier boundary condition (GNBC). Accurate simulation of the interface and contact line motion requires very fine meshes, and the computation in 3D is even more challenging. Thus, the use of high performance computers and scalable parallel algorithms is indispensable. In this paper, we generalize the GNBC to surfaces with complex geometry and introduce a finite element method on unstructured 3D meshes with a semi-implicit time integration scheme. A highly parallel solution strategy using different solvers for the different components of the discretization is presented. More precisely, we apply a restricted additive Schwarz preconditioned GMRES method to solve the systems arising from the implicit discretization of the Cahn-Hilliard equation and the velocity equation, and an algebraic multigrid preconditioned CG method to solve the pressure Poisson system. Numerical experiments show that the strategy is efficient and scalable for 3D problems with complex geometry and on a supercomputer with a large number of processors.
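One possible way to realize the described solver split is sketched below with PETSc; the abstract does not state the authors' software stack, so this is an assumption, and the matrices and vectors are assumed to be assembled elsewhere. GMRES with a restricted additive Schwarz preconditioner serves the implicit Cahn-Hilliard and velocity systems, CG with algebraic multigrid serves the pressure Poisson system.

#include <petscksp.h>

// Implicit Cahn-Hilliard or velocity system: GMRES + restricted additive Schwarz (RAS).
void solveImplicitSystem(Mat A, Vec b, Vec x) {
    KSP ksp; PC pc;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPGMRES);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCASM);                 // additive Schwarz preconditioner
    PCASMSetType(pc, PC_ASM_RESTRICT);    // restricted variant
    KSPSolve(ksp, b, x);
    KSPDestroy(&ksp);
}

// Pressure Poisson system: CG + algebraic multigrid preconditioner.
void solvePressurePoisson(Mat Ap, Vec b, Vec p) {
    KSP ksp; PC pc;
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, Ap, Ap);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCGAMG);
    KSPSolve(ksp, b, p);
    KSPDestroy(&ksp);
}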
The lattice Boltzmann method exhibits excellent scalability on current supercomputing systems and has thus increasingly become an alternative method for large-scale nonstationary flow simulations, reaching up to a trillion (10^12) grid nodes. Additionally, grid refinement can lead to substantial savings in memory and compute time. These savings, however, come at the cost of much more complex data structures and algorithms. In particular, the interface between subdomains with different grid sizes must receive special treatment. In this article, we present parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework WALBERLA in order to support large-scale, massively parallel lattice Boltzmann based simulations on nonuniform grids. Additionally, we evaluate the performance of our approach on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling benchmarks with refined grids are executed with almost 2 million threads, demonstrating not only near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell updates per second. On an Intel-based system, strong scaling of a simulation with refined grids and a total of more than 8.5 million cells is demonstrated to reach a time of less than 1 millisecond per time step. This enables simulations with complex, nonuniform grids and 4 million time steps per hour of compute time.
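A conceptual sketch of level-wise time stepping on a refined lattice Boltzmann grid, as commonly used in LBM grid refinement: each finer level advances with half the time step and therefore performs two steps per coarse step, with interface treatment in between. The function names are illustrative placeholders, not the framework's API.

#include <cstdio>

// Placeholder kernels; a real code performs the LBM update and the scale interface treatment here.
static void collideAndStream(int level)          { std::printf("LBM step on level %d\n", level); }
static void coarseToFineInterpolation(int level) { std::printf("  interpolate %d -> %d\n", level, level + 1); }
static void fineToCoarseRestriction(int level)   { std::printf("  restrict    %d -> %d\n", level + 1, level); }

// Recursive time stepping: dt on level L+1 is half of dt on level L.
void advanceLevel(int level, int finestLevel) {
    collideAndStream(level);
    if (level < finestLevel) {
        advanceLevel(level + 1, finestLevel);    // first fine step
        advanceLevel(level + 1, finestLevel);    // second fine step
        coarseToFineInterpolation(level);        // fill fine ghost layers from coarse data
        fineToCoarseRestriction(level);          // update coarse interface cells from fine data
    }
}

int main() { advanceLevel(0, 2); }   // three refinement levels: 0 (coarse) to 2 (fine)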
ISBN (Print): 9781509008551
Astrophysics is the branch of astronomy that employs the principles of physics and chemistry "to ascertain the nature of the heavenly bodies, rather than their positions or motions in space". We describe a new version of our AstroPhi code for the simulation of the dynamics of astrophysical objects and other physical processes on hybrid supercomputers equipped with Intel Xeon Phi accelerators. The new version of the AstroPhi code was rewritten following a co-design approach, meaning that knowledge of the latest Intel Xeon Phi generation was taken into account during code development. Results obtained with the AstroPhi code on an Intel Xeon Phi based massively parallel supercomputer are presented in this paper. The RSC PetaStream architecture is used to simulate astrophysical problems at high resolution. Galaxy collision problems with chemodynamics and spiral galaxy formation tests are presented as a demonstration of the AstroPhi code.
ISBN (Print): 9781509028863
We present the development of a scalable parallel algorithm and solver for computational electromagnetics based on a double higher order method of moments in the surface integral equation formulation, in conjunction with a direct hierarchically semiseparable (HSS) structures solver. Multiscale modeling using the new method, for electrically very large structures that also include electrically very small details, is discussed, together with several advancement strategies.
ISBN (Print): 9781479980062
The AstroPhi code is designed for simulating the dynamics of astrophysical objects on hybrid supercomputers equipped with Intel Xeon Phi computation accelerators. The new RSC PetaStream massively parallel architecture is used for the simulations. The results of accelerating AstroPhi in the Intel Xeon Phi native and offload execution modes are presented in this paper. The RSC PetaStream architecture makes it possible to simulate astrophysical problems at high resolution. The AGNES simulation tool was used to model the scalability of the AstroPhi code. Several gravitational collapse problems are presented as a demonstration of the AstroPhi code.
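A minimal illustration of the two Xeon Phi execution modes mentioned above, using generic Intel LEO offload pragmas of that era; it is not code from AstroPhi, and the kernel is a toy example. In native mode the whole program is compiled for and run on the coprocessor, so no pragma is needed; in offload mode the host ships selected loops to the card.

#include <cstddef>

void advect(float* rho, const float* flux, std::size_t n) {
#ifdef __INTEL_OFFLOAD
    // Offload mode: copy inputs to the coprocessor, run the loop there, copy results back.
    #pragma offload target(mic:0) in(flux : length(n)) inout(rho : length(n))
#endif
    #pragma omp parallel for
    for (std::size_t i = 1; i < n; ++i)
        rho[i] += flux[i] - flux[i - 1];   // toy first-order update, for illustration only
}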
ISBN (Print): 9781450313506
Parallel marking algorithms use multiple threads to walk through the object heap graph and mark each reachable object as live. Parallel marker threads mark an object "live" by atomically setting a bit in a mark-bitmap or a bit in the object header. Most of these parallel algorithms strive to improve the marking throughput by using work-stealing algorithms for load-balancing and to ensure that all participating threads are kept busy. A purely "processor-centric" load-balancing approach, in conjunction with the need to atomically set the mark bit, results in significant contention during parallel marking. This limits the scalability and throughput of parallel marking algorithms. We describe a new non-blocking and lock-free work-sharing algorithm, the primary goal being to reduce contention during atomic updates of the mark-bitmap by parallel task-threads. Our work-sharing mechanism uses the address of a word in the mark-bitmap as the key to stripe work among parallel task-threads, with only a subset of the task-threads working on each stripe. This filters out most of the contention during parallel marking, with approximately 20% improvements in performance. In the case of concurrent and on-the-fly collector algorithms, mutator threads also generate marking work for the marking task-threads. In these schemes, mutator threads are also provided with thread-local marking stacks where they collect references to potentially "gray" objects, i.e., objects that haven't been "marked-through" by the collector. We note that since this work is generated by mutators when they reference these objects, there is a high likelihood that these objects are still present in the processor cache. We describe and evaluate a scheme to distribute mutator-generated marking work among the collector's task-threads that is cognizant of the processor and cache topology. We prototype both our algorithms within the C4 [28] collector that ships as part of an industrial-strength JVM for the Linux-X86 platform.
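An illustrative sketch of the striping idea only, not the algorithm from the paper: the index of the mark-bitmap word that an object maps to selects a stripe, and each stripe is served by a subset of the marking threads, so two threads rarely race on the same bitmap word.

#include <atomic>
#include <cstdint>
#include <vector>

struct MarkBitmap {
    std::vector<std::atomic<uint64_t>> words;
    explicit MarkBitmap(std::size_t numObjects) : words((numObjects + 63) / 64) {}

    // Returns true if this call actually set the bit (i.e., the object was newly marked).
    bool mark(std::size_t objectIndex) {
        std::size_t word = objectIndex / 64;
        uint64_t bit = uint64_t(1) << (objectIndex % 64);
        uint64_t old = words[word].fetch_or(bit, std::memory_order_relaxed);
        return (old & bit) == 0;
    }

    // Stripe selection: the bitmap word index (derived from the object's address range)
    // decides which group of task-threads is responsible for marking this object.
    std::size_t stripeOf(std::size_t objectIndex, std::size_t numStripes) const {
        return (objectIndex / 64) % numStripes;
    }
};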