Hybrid computer systems combine compute units (CUs) of different natures, such as CPUs, GPUs, and FPGAs. Simultaneously exploiting the computing power of these CUs requires a careful decomposition of applications into balanced parallel tasks, according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for OpenMP hybrid GPU-CPU applications when mixed with GPU-oriented programming models (e.g., CUDA/HIP). The paper makes the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite. The implementation exploits both coarse-grain and fine-grain parallelism, mapped to compute units of different natures (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art schedulers for OpenMP: static and dynamic task scheduling. We then extend the set of schedulers with two additional variants: a memorizing-dynamic task scheduling and a profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.250 GHz (64 cores and 2 threads/core, totalling 128 threads per node) and two AMD Radeon Instinct MI50 GPUs with 32 GB each, hybrid executions achieve speedups from 1.10x up to 3.5x with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.
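The profile-based static scheduling mentioned above can be sketched as a proportional partitioner: after a profiling run measures each compute unit's throughput, loop iterations are split so that each CU receives work in proportion to its speed. This is a minimal illustration of the idea, not the paper's actual runtime API; the function name and the example throughput values are assumptions.

```python
def profile_based_partition(total_iters, throughputs):
    """Split `total_iters` loop iterations among compute units in
    proportion to their profiled throughputs (iterations/second).
    Returns a list of (start, end) half-open ranges, one per CU."""
    total_rate = sum(throughputs)
    ranges, start = [], 0
    for i, rate in enumerate(throughputs):
        if i == len(throughputs) - 1:
            end = total_iters  # last CU absorbs any rounding remainder
        else:
            end = start + round(total_iters * rate / total_rate)
        ranges.append((start, end))
        start = end
    return ranges

# Hypothetical profiled rates: one CPU socket and two GPUs,
# the GPUs being 4x and 5x faster than the CPU on this kernel.
chunks = profile_based_partition(1000, [1.0, 4.0, 5.0])
```

With these assumed rates, the two GPUs receive 40% and 50% of the iterations and the CPU the remaining 10%, which is the load balance a static profile-driven scheduler would fix once before execution.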
Heterogeneous systems with several kinds of devices, such as multi-core CPUs, GPUs, and FPGAs, are now commonplace. Exploiting all these devices with device-oriented programming models, such as CUDA or OpenCL, requires expertise and knowledge about the underlying hardware to tailor the application to each specific device, thus degrading performance portability. Higher-level proposals simplify the programming of these devices, but their current implementations lack efficient support for problems that include frequent bursts of computation and communication, or input/output operations. In this work we present CtrlEvents, a new heterogeneous runtime solution which automatically overlaps computation and communication whenever possible, simplifying and improving the efficiency of data-dependency analysis and the coordination of both device computations and host tasks that include generic I/O operations. Our solution outperforms other state-of-the-art implementations in most situations, presenting a good balance between portability, programmability, and efficiency. (c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
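The benefit of automatically overlapping computation and communication, which CtrlEvents targets, can be estimated with a simple double-buffering timeline model. The sketch assumes one copy engine and one compute queue per device; it illustrates only the overlap principle, not the CtrlEvents runtime itself.

```python
def serial_time(n_chunks, t_copy, t_comp):
    """Wall time when every chunk is copied, then computed, strictly in order."""
    return n_chunks * (t_copy + t_comp)

def pipeline_time(n_chunks, t_copy, t_comp):
    """Estimated wall time when each chunk's copy is overlapped with the
    previous chunk's computation (double buffering): the first copy and the
    last computation are exposed, everything else hides behind the slower
    of the two stages."""
    if n_chunks == 0:
        return 0.0
    return t_copy + (n_chunks - 1) * max(t_copy, t_comp) + t_comp

# Example: 8 chunks, copies 3x cheaper than computation.
overlapped = pipeline_time(8, 1.0, 3.0)   # 1 + 7*3 + 3 = 25
sequential = serial_time(8, 1.0, 3.0)     # 8 * 4 = 32
```

When computation dominates, the model shows the transfer cost almost disappearing, which is the effect event-driven runtimes try to achieve automatically.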
ISBN:
(Print) 9783030122744; 9783030122737
The latest production version of the fusion particle simulation code, the Gyrokinetic Toroidal Code (GTC), has been ported to and optimized for the next-generation exascale GPU supercomputing platform. Heterogeneous programming using directives has been utilized to balance the continuously implemented physical capabilities against rapidly evolving software/hardware systems. The original code has been refactored into a set of unified functions/calls to enable acceleration for all species of particles. Extensive GPU optimization has been performed on GTC to boost the performance of the particle push and shift operations. To identify the hotspots, the code was first benchmarked on up to 8,000 nodes of the Titan supercomputer, showing an overall speedup of about 2-3x when comparing NVIDIA M2050 GPUs to Intel Xeon X5670 CPUs. This Phase I optimization was followed by further optimizations in Phase II, where single-node tests show an overall speedup of about 34x on SummitDev and 7.9x on Titan. Real physics tests on the Summit machine showed impressive scaling properties, reaching roughly 50% efficiency on 928 nodes of Summit. The GPU+CPU speedup over CPU-only execution is over 20x, leading to unprecedented speed.
Heterogeneous hardware is central to modern advances in performance and efficiency. Mainstream programming models for heterogeneous architectures, however, sacrifice safety and expressiveness in favor of low-level control over performance details. The interfaces between hardware units consist of verbose, unsafe APIs; hardware-specific languages make it difficult to move code between units; and brittle preprocessor macros complicate the task of specializing general code for efficient accelerated execution. We propose a unified low-level programming model for heterogeneous systems that offers control over performance, safe communication constructs, cross-device code portability, and hygienic metaprogramming for specialization. The language extends constructs from multi-stage programming to separate code for different hardware units, to communicate between them, and to express compile-time code optimization. We introduce static staging, a different take on multi-stage programming that lets the compiler generate all code and communication constructs ahead of time. To demonstrate our approach, we use static staging to implement BraidGL, a real-time graphics programming language for CPU-GPU systems. Current real-time graphics software in OpenGL uses stringly-typed APIs for communication and unsafe preprocessing to generate specialized GPU code variants. In BraidGL, programmers instead write hybrid CPU-GPU software in a unified language. The compiler statically generates target-specific code and guarantees safe communication between the CPU and the graphics pipeline stages. Example scenes demonstrate the language's productivity advantages: BraidGL eliminates the safety and expressiveness pitfalls of OpenGL and makes common specialization techniques easy to apply. The case study demonstrates how static staging can express core placement and specialization in general heterogeneous programming.
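A toy analogue of static staging, in Python rather than BraidGL's syntax: code for a specialized variant is generated and compiled ahead of its use, with the specialization parameter resolved at generation time rather than at every call. All names here are illustrative, not part of the paper's language.

```python
def stage_power(n):
    """'Stage' x**n into straight-line multiplications: the exponent is
    known when the code is generated, so the generated function contains
    no loop and no reference to n. This mimics, in miniature, how a
    staging compiler emits fully specialized device code ahead of time."""
    body = "1"
    for _ in range(n):
        body = f"({body}) * x"
    src = f"def power_{n}(x):\n    return {body}\n"
    env = {}
    exec(src, env)  # generation-time compilation step
    return env[f"power_{n}"]

cube = stage_power(3)  # generated source: return (((1) * x) * x) * x
```

The key property, shared with static staging, is that all specialization decisions are made before execution, so the runtime artifact is plain straight-line code.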
The emergence of heterogeneity in high-performance computing, which harnesses under one integrated system several platforms of different architectures, has also led to the development of innovative cross-platform programming models. Along with the expectation that these models will yield computationally intensive performance, there is demand for them to provide a reasonable degree of performance portability. Therefore, new tools and metrics are being developed to measure and calculate the level of performance portability of applications and programming models. The ultimate measure of performance portability is performance efficiency. Performance efficiency refers to the achieved performance as a fraction of some peak theoretical or practical baseline performance. Application efficiency approaches are the most popular and attractive performance efficiency measures among researchers because they are simple to measure and calculate. Unfortunately, the way they are used yields results that are not meaningful, while violating one of the basic criteria that define and characterize performance portability metrics. In this paper, we demonstrate how researchers currently use application efficiency to calculate the performance portability of applications and explain why this method deviates from its original definition. Then, we show why the obtained results are not meaningful and propose practical solutions that satisfy the definition and criteria of performance portability metrics. Finally, we present a new performance efficiency approach called portability efficiency, which is immune to the shortcomings of application efficiency and substantially improves the portability aspect of calculating performance portability.
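The efficiency values discussed above are usually aggregated with the harmonic-mean performance portability metric of Pennycook et al., where an application that fails on any platform in the set scores zero. A minimal implementation, assuming efficiencies are given as fractions in [0, 1]:

```python
def performance_portability(efficiencies):
    """Harmonic-mean performance portability over a platform set H:
    PP = |H| / sum(1/e_i), defined as 0 if the application does not run
    (efficiency 0) on some platform. `efficiencies` holds one efficiency
    value (application or architectural) per platform in H."""
    if not efficiencies or any(e == 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)
```

The harmonic mean penalizes uneven efficiency across platforms, so the choice of baseline used to compute each e_i, which is precisely what the paper critiques, directly shifts the final score.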
The explosive growth of graph data sets has led to an increase in the computing power and storage resources required for graph computing. To handle large-scale graph processing, heterogeneous platforms have become necessary to provide sufficient computing power and storage. The most popular scheme for this is the CPU-GPU architecture. However, the steep learning curve and complex concurrency control of heterogeneous platforms pose a challenge for developers. Additionally, GPUs from different vendors have varying software stacks, making cross-platform porting and verification challenging. Recently, Intel proposed a unified programming model, named oneAPI, to manage multiple heterogeneous devices at the same time. It provides a friendlier programming model for C++ developers and a convenient concurrency control scheme, allowing devices from different vendors to be managed simultaneously. Hence, there is an opportunity to utilize oneAPI to design a general cross-architecture framework for large-scale graph computing. In this paper, we propose a large-scale graph computing framework for multiple types of accelerators built on Intel oneAPI, which we name OneGraph. Our approach significantly reduces data transfer between GPU and CPU and masks latency through asynchronous transfers, which significantly improves performance. We conducted rigorous performance tests on the framework using four classical graph algorithms. The experimental results show that our approach achieves an average speedup of 3.3x over state-of-the-art partitioning-based approaches. Moreover, thanks to the cross-architecture model of Intel oneAPI, the framework can be deployed on different GPU platforms without code modification. Our evaluation shows that OneGraph incurs less than 1% performance loss compared to a dedicated GPU programming model in large-scale graph computing.
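Breadth-first search is typical of the classical graph kernels such frameworks accelerate (the abstract does not name the four algorithms tested, so BFS here is an assumption). A plain-Python sketch over the compressed sparse row (CSR) layout commonly shared between host and device:

```python
from collections import deque

def bfs_levels(indptr, indices, src):
    """Level-synchronous BFS over a graph in CSR form: vertex u's
    out-neighbors are indices[indptr[u]:indptr[u+1]]. Returns the BFS
    level of every vertex, or -1 if unreachable from `src`."""
    n = len(indptr) - 1
    level = [-1] * n
    level[src] = 0
    frontier = deque([src])
    while frontier:
        u = frontier.popleft()
        for v in indices[indptr[u]:indptr[u + 1]]:
            if level[v] == -1:
                level[v] = level[u] + 1
                frontier.append(v)
    return level

# Diamond graph: 0->1, 0->2, 1->3, 2->3, plus isolated handling via CSR.
levels = bfs_levels([0, 2, 3, 4, 4], [1, 2, 3, 3], 0)
```

The CSR arrays are exactly the kind of flat, contiguous buffers a framework like OneGraph would partition and transfer asynchronously; the traversal logic itself is device-agnostic.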
Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Moreover, the creation of such systems requires verbose and unsafe programming. Recent developments in domain-specific and unified shading languages aim to mitigate these issues. However, current programming models primarily address data layout consistency, neglecting other persistent challenges. In this paper, we introduce RenderKernel, a programming model designed to simplify the development of real-time rendering systems. Recognizing the need for a high-level approach, RenderKernel addresses the specific challenges of real-time rendering, enabling development on heterogeneous systems as if they were homogeneous. The model allows for early detection and prevention of errors due to system heterogeneity at compile time. Furthermore, RenderKernel enables the use of common programming patterns from homogeneous environments, freeing developers from the complexities of the underlying heterogeneous systems. Developers can focus on coding unique application features, thereby enhancing productivity and reducing the cognitive load associated with real-time rendering system development.
ISBN:
(Print) 9798350395099
Similar to other programming models, compilers for SYCL, the open C++-based programming model for heterogeneous computing, would benefit from access to higher-level intermediate representations. The loss of high-level structure and semantics caused by premature lowering to low-level intermediate representations, and the inability to reason about host and device code simultaneously, present major challenges for SYCL compilers. The MLIR compiler framework, through its dialect mechanism, makes it possible to model domain-specific, high-level intermediate representations and provides the necessary facilities to address these challenges. This work therefore describes practical experience with the design and implementation of an MLIR-based SYCL compiler. By modeling key elements of the SYCL programming model in host and device code in the MLIR dialect framework, the presented approach enables the implementation of powerful device-code optimizations as well as analyses across host and device code. Compared to two LLVM-based SYCL implementations, this yields speedups of up to 4.3x on a collection of SYCL benchmark applications. Finally, this work also discusses challenges encountered in the design and implementation, and how these could be addressed in the future.
Modern unified programming models (such as CUDA and SYCL) that combine host (CPU) code and GPU code into the same programming language, same file, and same lexical scope lack adequate support for GPU code specialization, which is a key optimization in real-time graphics. Furthermore, current methods used to implement specialization do not translate to a unified environment. In this paper, we create a unified shader programming environment in C++ that provides first-class support for specialization by co-opting C++'s attribute and virtual function features and reimplementing them with alternate semantics to express the services required. By co-opting existing features, we enable programmers to use familiar C++ programming techniques to write host and GPU code together, while still achieving efficient generated C++ and HLSL code via our source-to-source translator.
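The kind of specialization the paper targets can be imitated in plain Python: feature branches are resolved once, when the specialized variant is generated, rather than at every invocation. This mimics only the principle; the paper's actual mechanism co-opts C++ attributes and virtual functions, and the names below are invented for illustration.

```python
def specialize_shader(use_texture, use_fog):
    """Generate a 'shader' variant with its feature branches resolved
    ahead of time, so the emitted code contains only the enabled steps.
    This stands in for compile-time shader specialization; it is not
    the paper's C++/HLSL translation pipeline."""
    steps = ["c = base"]
    if use_texture:
        steps.append("c = c * tex")
    if use_fog:
        steps.append("c = c + fog")
    src = "def shader(base, tex, fog):\n"
    src += "".join(f"    {s}\n" for s in steps)
    src += "    return c\n"
    env = {}
    exec(src, env)  # generation-time compilation of the variant
    return env["shader"]

textured = specialize_shader(use_texture=True, use_fog=False)
```

Each flag combination yields a distinct straight-line variant, which is why real-time renderers compile one specialized program per feature set instead of branching per pixel.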
ISBN:
(Print) 9781665481069
Nowadays, a new parallel paradigm for energy-efficient heterogeneous hardware infrastructures is required to achieve better performance at a reasonable cost in high-performance computing applications. Under this new paradigm, some application parts are offloaded to specialized accelerators that run faster or are more energy-efficient than CPUs. Field-Programmable Gate Arrays (FPGAs) are one such type of accelerator and are becoming widely available in data centers. This paper proposes OmpSs@cloudFPGA, which includes novel extensions to parallel task-based programming models that enable easy and efficient programming of heterogeneous clusters with FPGAs. The programmer only needs to annotate, with OpenMP-like pragmas, the tasks of the application that should be accelerated in the cluster of FPGAs. Next, the proposed programming-model framework automatically extracts the parts annotated with High-Level Synthesis (HLS) pragmas and synthesizes them into hardware accelerator cores for FPGAs. Additionally, our extensions include and support two novel features: 1) FPGA-to-FPGA direct communication, using an MPI-like Application Programming Interface (API) with one-to-one and collective communications, to alleviate the host communication channel bottleneck; and 2) creating and spawning work from inside the FPGAs to their own accelerator cores, based on an MPI rank-like identification. These features break the classical host-accelerator model, where the host (typically the CPU) generates all the work and distributes it to each accelerator. We also present an evaluation of OmpSs@cloudFPGA for different parallel strategies of the N-Body application on the IBM cloudFPGA research platform. Results show that for cluster sizes of up to 56 FPGAs, the performance scales linearly. To the best of our knowledge, this is the best performance obtained for N-body on FPGA platforms, reaching 344 Gpairs/s with 56 FPGAs. Finally, we compare the performance and power consumption
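The reported throughput implies a per-device rate that can be checked with simple arithmetic; the helper names below are illustrative only, not part of OmpSs@cloudFPGA.

```python
def pair_rate_per_device(total_gpairs_per_s, n_devices):
    """Per-device N-body interaction rate implied by an aggregate
    throughput, assuming the reported linear scaling holds."""
    return total_gpairs_per_s / n_devices

def scaling_efficiency(rate_n, rate_1, n):
    """Fraction of ideal linear scaling achieved with n devices."""
    return rate_n / (n * rate_1)

# 344 Gpairs/s across 56 FPGAs -> roughly 6.14 Gpairs/s per FPGA.
per_fpga = pair_rate_per_device(344.0, 56)
```

Under perfectly linear scaling, re-deriving the aggregate from the per-device rate recovers an efficiency of exactly 1.0, which is what the abstract's "scales linearly" claim asserts for clusters up to 56 FPGAs.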