ISBN: 9781665463393 (print)
The proceedings contain 3 papers. The topics discussed include: a selective nesting approach for the sparse multi-threaded Cholesky factorization; from merging frameworks to merging stars: experiences using HPX, Kokkos and SIMD types; and broad performance measurement support for asynchronous multi-tasking with APEX.
ISBN: 9781665463393 (print)
APEX (Autonomic Performance Environment for eXascale) is a performance measurement library for distributed, asynchronous multitasking runtime systems. It provides support for both lightweight measurement and high concurrency. To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs. APEX also provides a runtime adaptation system based on the observed system performance. In this paper, we describe the evolution of APEX from its original design for HPX to its support for an array of programming models and abstraction layers, and we describe some of the features that have evolved to help understand the asynchrony and high concurrency of asynchronous tasking models.
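To make the idea of a dependency chain more concrete, here is a small Python sketch. It is not the APEX API: the names `TaskRecord`, `ToyTracer`, `start`, `stop` and `dependency_edges` are hypothetical and exist only for this illustration. The assumption is that each task records the task that created it, rather than whatever happens to be on the call stack of the worker thread, so a task dependency graph can be rebuilt afterwards.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    task_id: int
    name: str
    parent_id: Optional[int]  # the task that spawned this one, not the stack caller
    start: float = 0.0
    stop: float = 0.0


class ToyTracer:
    """Hypothetical tracer that links tasks by explicit parent, not by call stack."""

    def __init__(self):
        self.records = {}
        self._next_id = 0

    def start(self, name, parent=None):
        # Record the creating task so the link survives user-level thread switches.
        parent_id = parent.task_id if parent is not None else None
        rec = TaskRecord(self._next_id, name, parent_id, start=time.perf_counter())
        self.records[rec.task_id] = rec
        self._next_id += 1
        return rec

    def stop(self, rec):
        rec.stop = time.perf_counter()

    def dependency_edges(self):
        # Aggregate (parent name, child name) pairs into a weighted edge list.
        edges = Counter()
        for rec in self.records.values():
            if rec.parent_id is not None:
                parent = self.records[rec.parent_id]
                edges[(parent.name, rec.name)] += 1
        return edges
```

Passing `parent=` explicitly is the design choice that matters in this sketch: with user-level threading a task may resume on a different worker, so the call stack alone would not say which task it depends on.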
Authors: Dufaud, Thomas; Tsuji, Miwako; Sato, Mitsuhisa
Affiliations:
Univ Paris Sud, CEA, CNRS, Digiteo Labs, UVSQ, INRIA, Maison Simulat, USR 3441, Bat 565, F-91191 Gif Sur Yvette, France
UVSQ, EA 7432, LI PaRAD, 45 Ave Etats Unis, F-78035 Versailles, France
RIKEN, R CCS, Ctr Computat Sci, Chuo Ku, 7-1-26 Minatojima Minami Machi, Kobe, Hyogo 6500047, Japan
ISBN: 9781728101781 (print)
As the complexity of both algorithms and architectures increases, developing scientific software becomes a challenge. To exploit future architectures, we consider a Multi-SPMD workflow programming model. In such a model, data transfer between tasks during computation depends heavily on the architecture and middleware used. In this study we design an adaptive system for data management in a parallel programming environment that can express two levels of parallelism. We show how considering multiple strategies based on I/O and direct message passing can improve performance and fault tolerance in the YML-XMP environment. On a real application with a sufficiently large amount of local data, we obtain speedups over our original design ranging from 1.36 for a mixed strategy to 1.73 for a direct message passing method.
ISBN: 9781728101781 (print)
In this paper we describe the basic idea, implementation and achieved performance of Formura, our DSL for stencil computation, on systems based on the PEZY-SC2 many-core processor. From a high-level description of the differential equation and a simple description of the finite-difference stencil, Formura generates the entire simulation code, including MPI parallelization with overlapped communication and calculation, advanced temporal blocking, and parallelization for many-core processors. The achieved performance is 4.78 PF, or 21.5% of the theoretical peak performance, for an explicit scheme for compressive CFD with fourth-order accuracy in space and third-order accuracy in time. For a slightly modified implementation of the same scheme, efficiency was slightly lower (17.5%) but the calculation time per timestep was 25% shorter. Temporal blocking improved the performance by up to 70%. Even though the B/F number of the PEZY-SC2 is low, around 0.02, we achieved an efficiency comparable to that of highly optimized CFD codes on machines with much higher memory bandwidth, such as the K computer. We have demonstrated that automatic generation of code with temporal blocking is an effective way to use very large-scale machines with low memory bandwidth for large-scale CFD calculations.
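The following Python sketch illustrates the general idea of temporal blocking on a toy problem. It is not code generated by Formura, and the abstract does not say which blocking variant Formura uses. The sketch assumes a 1D explicit 3-point stencil with fixed end values and uses overlapped tiles with redundant halo computation; the names `step`, `run_naive` and `run_blocked` and the parameters `tile` and `tb` are hypothetical.

```python
def step(u, c):
    """One explicit 3-point update; the two end cells are held fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = u[i] + c * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def run_naive(u, c, steps):
    """Sweep the whole array once per time step."""
    u = list(u)
    for _ in range(steps):
        u = step(u, c)
    return u


def run_blocked(u, c, steps, tile, tb):
    """Advance each tile by up to `tb` time steps per pass over memory."""
    u = list(u)
    n = len(u)
    done = 0
    while done < steps:
        t = min(tb, steps - done)
        out = list(u)
        for s in range(0, n, tile):
            e = min(s + tile, n)
            # Load the tile plus a halo of width t on each side (clipped at the domain ends).
            lo, hi = max(0, s - t), min(n, e + t)
            local = u[lo:hi]
            # Stale values creep inward one cell per step from any artificial edge,
            # so after t steps only the halo is contaminated and the tile is exact.
            for _ in range(t):
                local = step(local, c)
            out[s:e] = local[s - lo:e - lo]
        u = out
        done += t
    return u
```

In this toy version each tile is read once and advanced several time steps before being written back, at the cost of recomputing halo cells. Trading extra arithmetic for fewer passes over memory is the kind of exchange that can pay off when the bytes-per-flop ratio is low.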
ISBN: 9781665463409 (print)
APEX (Autonomic Performance Environment for eXascale) is a performance measurement library for distributed, asynchronous multitasking runtime systems. It provides support for both lightweight measurement and high concurrency. To support performance measurement in systems that employ user-level threading, APEX uses a dependency chain in addition to the call stack to produce traces and task dependency graphs. APEX also provides a runtime adaptation system based on the observed system performance. In this paper, we describe the evolution of APEX from its original design for HPX to its support for an array of programming models and abstraction layers, and we describe some of the features that have evolved to help understand the asynchrony and high concurrency of asynchronous tasking models.
ISBN: 9781665463409 (print)
Sparse linear algebra routines are fundamental building blocks of a large variety of scientific applications. Direct solvers, which solve linear systems by factorizing matrices into products of triangular matrices, are commonly used in many contexts. The Cholesky factorization is the fastest direct method for symmetric positive definite matrices. This paper presents selective nesting, a method to determine the optimal task granularity for the parallel Cholesky factorization based on the structure of sparse matrices. We propose the Opt-D algorithm, which automatically and dynamically applies selective nesting. Opt-D leverages matrix sparsity to drive complex task-based parallel workloads in the context of direct solvers. We run an extensive evaluation campaign considering a heterogeneous set of 35 sparse matrices and a parallel machine featuring the A64FX processor. Opt-D delivers an average performance speedup of 1.75× with respect to the best state-of-the-art parallel methods to run direct solvers.
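As a minimal illustration of the direct-solver idea mentioned in the abstract (factorize a symmetric positive definite matrix A into L Lᵀ, then solve with two triangular substitutions), the following dense, sequential Python sketch may help. It is not the Opt-D algorithm and it exploits neither sparsity nor task parallelism; the function names `cholesky` and `solve_spd` are hypothetical and chosen only for this example.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a must be symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares of the entries already computed in row j.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l


def solve_spd(a, b):
    """Solve a x = b using L y = b (forward) and then L^T x = y (backward)."""
    n = len(a)
    l = cholesky(a)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x
```

In a sparse, task-based setting, the column updates inside `cholesky` are the work that would be split into tasks. How finely to split them is the granularity question that the abstract says selective nesting addresses.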
ISBN: 9781665463409 (print)
Octo-Tiger, a large-scale 3D AMR code for the merger of stars, uses a combination of HPX, Kokkos and explicit SIMD types, aiming to achieve performance portability across a broad range of heterogeneous hardware. However, on A64FX CPUs, we encountered several missing pieces that hindered performance by causing problems with the SIMD vectorization. Therefore, we add std::experimental::simd as an option to use in Octo-Tiger’s Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extensions) SIMD backend. Additionally, we add the missing SIMD implementations to the Kokkos kernels within Octo-Tiger’s hydro solver. We test our changes by running Octo-Tiger on three different CPUs: an A64FX, an Intel Ice Lake and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We get a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also experience a scaling issue on the EPYC CPU.