ISBN:
(print) 9783030742232
The proceedings contain 5 papers. The special focus in this conference is on accelerator programming using directives. The topics include: GPU Acceleration of the FINE/FR CFD Solver in a Heterogeneous Environment with OpenACC Directives; Performance and Portability of a Linear Solver Across Emerging Architectures; ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems.
ISBN:
(print) 9783030499426
The proceedings contain 7 papers. The special focus in this conference is on accelerator programming using directives. The topics include: Accelerating the Performance of Modal Aerosol Module of E3SM Using OpenACC; Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices; Performance of the RI-MP2 Fortran Kernel of GAMESS on GPUs via Directive-Based Offloading with Math Libraries; Performance Portable Implementation of a Kinetic Plasma Simulation Mini-App; A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures.
ISBN:
(print) 9783030742232; 9783030742249
Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today's systems to tomorrow's. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges on emerging systems. This work focuses on applying and demonstrating OpenMP offloading directives on five proxy applications. We observe that the performance varies widely from one compiler to the other; a crucial aspect of our work is reporting best practices to application developers who use OpenMP offloading compilers. While some issues can be worked around by the developer, there are other issues that must be reported to the compiler vendors. By restructuring OpenMP offloading directives, we gain an 18x speedup for the su3 proxy application on NERSC's Cori system when using the Clang compiler, and a 15.7x speedup by switching max reductions to add reductions in the laplace mini-app when using the Cray-llvm compiler on Cori.
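The reduction restructuring the abstract mentions can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's laplace mini-app): a Jacobi-style sweep whose convergence check accumulates residuals with an add reduction, the kind of pattern the authors report performing better than a max reduction under some offloading compilers.

```c
#include <assert.h>
#include <math.h>

#define N 1024

/* Illustrative sketch, not the paper's code: one Jacobi-style sweep whose
 * convergence metric uses an add reduction (sum of absolute residuals)
 * rather than a max reduction over the residuals.  Without an offload
 * device the directive falls back to host execution. */
double sweep(const double *in, double *out) {
    double err = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:err) \
        map(to: in[0:N]) map(tofrom: out[0:N])
    for (int i = 1; i < N - 1; ++i) {
        out[i] = 0.5 * (in[i - 1] + in[i + 1]);
        err += fabs(out[i] - in[i]);   /* add reduction across iterations */
    }
    return err;
}
```

Compiled with an offloading toolchain (e.g. `clang -fopenmp -fopenmp-targets=...`), the loop and the reduction run on the device; without OpenMP support the pragma is ignored and the loop runs serially on the host.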
ISBN:
(print) 9783030742232; 9783030742249
A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult in the face of varying vector lengths offered by each. A similar challenge occurs on GPU architectures, where it is essential to have coalesced memory accesses to utilize memory bandwidth effectively. In this work, we demonstrate that restructuring a computation, and possibly data layout, with regard to architecture is essential to achieve optimal performance by establishing a performance benchmark for each target architecture in a low-level language such as vector intrinsics or CUDA. In doing so, we demonstrate how a linear solver kernel can be mapped to Intel Xeon and Xeon Phi, Marvell ThunderX2, NEC SX-Aurora TSUBASA Vector Engine, and NVIDIA and AMD GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA, and Intel oneAPI/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
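The data-layout restructuring described above can be illustrated with a small, hypothetical example (the type and function names are not from the paper): moving a field from an array-of-structs (AoS) layout, where lane i strides by the struct size, to a struct-of-arrays (SoA) layout, where consecutive lanes touch consecutive memory, the unit-stride pattern that vector loads on CPUs and coalesced accesses on GPUs both require.

```c
#include <assert.h>

#define N 8

/* AoS layout: the fields of one row sit next to each other, so lane i
 * accessing diag strides by sizeof(Row) -- hostile to vector loads and
 * to GPU coalescing.  Shown only for contrast. */
typedef struct { double diag; double rhs; } Row;

/* SoA layout: each field is a contiguous array, so lane i reads diag[i]
 * with unit stride -- vectorizable on CPUs, coalesced on GPUs. */
typedef struct { double diag[N]; double rhs[N]; } System;

/* Trivial diagonal solve over the SoA layout; both inner loads and the
 * store are unit-stride. */
void solve_diag_soa(const System *s, double *x) {
    for (int i = 0; i < N; ++i)
        x[i] = s->rhs[i] / s->diag[i];
}
```

The same transformation is what an OCCA, OpenACC, or SYCL port would express through its memory-layout abstractions; the point of the abstract is that the layout decision, not the programming model, dominates performance.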
ISBN:
(print) 9783030499426; 9783030499433
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8x-4.3x speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9x and 48.2x speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
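The row-tiling idea behind the out-of-core kernels can be sketched as follows. This is a hedged, host-only illustration (the function name and the use of SpMV rather than SpMM are simplifications, not the paper's code): a CSR matrix is processed in blocks of rows so that each tile's data could be staged to a device whose memory is smaller than the whole problem, in contrast to handing the full arrays to Unified Memory and letting the driver page them.

```c
#include <assert.h>

/* Illustrative sketch: CSR sparse matrix-vector product processed one
 * tile of rows at a time.  In an offload version, the slices of
 * rowptr/col/val for rows [r0, r1) would be copied to the device at the
 * marked point, bounding device-memory use per tile. */
void spmv_tiled(int n, int tile,
                const int *rowptr, const int *col, const double *val,
                const double *x, double *y) {
    for (int r0 = 0; r0 < n; r0 += tile) {
        int r1 = (r0 + tile < n) ? r0 + tile : n;
        /* device staging of the [r0, r1) slice would happen here */
        for (int r = r0; r < r1; ++r) {
            double sum = 0.0;
            for (int k = rowptr[r]; k < rowptr[r + 1]; ++k)
                sum += val[k] * x[col[k]];
            y[r] = sum;
        }
    }
}
```

The trade-off the abstract measures is exactly this: explicit per-tile transfers add bookkeeping but keep data movement predictable, which is why the tiled version wins by a wide margin over driver-managed paging on the slower PCIe Gen3 interconnect.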
ISBN:
(print) 9783319748962; 9783319748955
Using the GPUs embedded in mobile devices allows for increasing the performance of the applications running on them while reducing the energy consumption of their execution. This article presents a task-based solution for adaptive, collaborative heterogeneous computing in mobile cloud environments. To implement our proposal, we extend the COMPSs-Mobile framework, an implementation of the COMPSs programming model for building mobile applications that offload part of the computation to the Cloud, to support offloading computation to GPUs through OpenCL. To evaluate our solution, we subject the prototype to three benchmark applications representing different application patterns.
ISBN:
(digital) 9783030742249
ISBN:
(print) 9783030742232
This book constitutes the proceedings of the 7th International Workshop on Accelerator Programming Using Directives, WACCPD 2020, which took place on November 20, 2021. The workshop was initially planned to take place in Atlanta, GA, USA, and changed to an online format due to the COVID-19 pandemic.