ISBN (print): 9783319985213; 9783319985206
In High-Performance Computing (HPC), the Field Programmable Gate Array (FPGA) is attracting increased attention as an accelerator because its performance has improved dramatically in recent years. At the same time, task-based programming, supported since OpenMP 4.0, makes it possible to expose substantial parallelism by executing parts of a program as tasks in a task graph. To accelerate a task-based parallel program with an FPGA, it is useful to offload dominant tasks that are frequently executed in parallel to the FPGA as asynchronous FPGA tasks. We present a performance optimization for OpenMP task-based programming with FPGA tasks that exploits the trade-off between kernel size and the number of kernels executed asynchronously in parallel, so as to use FPGA hardware resources efficiently. Since a "program" for an FPGA is converted directly into hardware, the hardware resource limitation raises a new optimization question: which tasks to offload to the FPGA, and how. Taking task-based block Cholesky factorization as a motivating example, we analyze this trade-off for the dominant "GEMM" task, which is frequently executed in parallel during task-graph execution. We found that, under the hardware resource limitation, multiple small kernels are better than a single big high-performance kernel because they deliver higher throughput and a higher kernel clock frequency.
ISBN (print): 9781479961238
The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it is yet to be demonstrated that industrial parallel codes, which rely on the classical Message Passing Interface (MPI) standard and accumulate decades of expertise (and countless lines of code), may be revisited to turn them into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for seismic imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node, and can more efficiently exploit complex hardware such as a cache-coherent Non-Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator.
ISBN (print): 9783031556722; 9783031556739
The increasing complexity of modern exascale computers, with a growing number of cores per node, poses a challenge to traditional programming models. To address this challenge, Asynchronous Many-Task (AMT) runtimes such as the C++-based HPX divide computational problems into smaller tasks that are executed asynchronously by the runtime. By unifying the syntax and semantics of local and remote task execution, scalability for distributed execution is enhanced. The asynchronous execution model conceals communication latency in distributed systems and eliminates global synchronization barriers, which improves the overall utilization of computational resources. While HPX and other AMT runtimes often support GPUs, there is still a lack of support for other accelerators, such as FPGAs, or more coarse-grained AI processing elements such as AMD's AI Engines (AIE). In this work, we extend the TaPaSCo framework so that TaPaSCo FPGA and AIE tasks can be transparently integrated into HPX applications. We show results for both microbenchmarks and the complete LULESH proxy HPC application to demonstrate this concept and evaluate the overheads. Both applications show that the combination of TaPaSCo and HPX can be used efficiently for cooperative computing between CPU software and FPGA/AIE hardware. Compared to CPU-only execution, we achieve a speedup of up to 2.4x in our stencil microbenchmark and a wall-clock speedup of 1.37x for the entire LULESH application, with 2.12x in the accelerated kernels themselves. Our TaPaSCo/HPX integration is released as open source.
This paper highlights the most significant enhancements made to PaRSEC, a scalable task-based runtime system designed for hybrid machines, during the Exascale Computing Project (ECP). The enhancements focus on expanding the capabilities of PaRSEC to address the evolving landscape of parallel computing. Notable achievements include the integration of support for three major types of accelerators (NVIDIA, AMD, and Intel GPUs), the refinement and increased flexibility of the communication subsystem, and the introduction of new programming interfaces tailored for irregular applications. Additionally, the project resulted in the development of powerful debugging and performance analysis tools aimed at assisting users in understanding and optimizing their applications. We present a comprehensive demonstration of these advancements through a series of benchmarks and applications within ECP and beyond. These results showcase the enhanced capabilities of PaRSEC across the diverse architectures within the ECP and provide valuable insight into the runtime system's adaptability and performance across varied computing environments.
ISBN (digital): 9783031617638
ISBN (print): 9783031617621; 9783031617638
Task-based programming models significantly improve the efficiency of parallel systems. The Sequential Task Flow (STF) model relies on static task sizes fixed within the task graph, but determining the optimal granularity at graph-submission time is tedious. To overcome this, we extend StarPU's STF recursive-task model, enabling the dynamic transformation of tasks into subgraphs. Early evaluations on homogeneous shared memory reveal that this just-in-time adaptation enhances performance.
ISBN (print): 9798350360691; 9798350360684
A common way of improving the performance of applications on multi-core processors is to exploit parallelism. In deep learning (DL), training or tuning parameters uses users' sensitive data, so preserving privacy is critical. Hardware-assisted protection mechanisms (i.e., trusted execution environments, TEEs) offer a practical privacy-preserving solution, nowadays available in both private and public data centers. We present SGX-OMPSS, a new approach combining a task-based programming model (OmpSs) with TEEs (Intel Software Guard Extensions). SGX-OMPSS supports asynchronous task parallelism and hardware heterogeneity by using the data dependencies between the tasks of an application, easily specified through code annotations. We evaluate SGX-OMPSS via several microbenchmarks and state-of-the-art DL applications and datasets (e.g., YOLO and MNIST). SGX-OMPSS achieves a 94% speedup gain while offering additional security guarantees.
ISBN (digital): 9783031617638
ISBN (print): 9783031617621; 9783031617638
The goal of the SpiniFEL project was to write, from scratch, a single particle imaging code for exascale supercomputers. The original vision was to have two versions of the code, one in MPI and one in Pygion, a Python-based interface to the Legion task-based runtime. We describe the motivation for the project, some of the programming challenges we encountered along the way, what worked and what didn't, and why only the Pygion code eventually succeeded in running at scale.
ISBN (print): 9798350395679; 9798350395662
Most contemporary HPC programming models assume an inelastic runtime in which the resources allocated to an application remain fixed throughout its execution. Conversely, elastic runtimes can expand and shrink resources based on availability and/or dynamic application requirements. In this paper, we implement elasticity for PaRSEC, a task-based dataflow runtime, using inter-node GPU work stealing. In addition to supporting elasticity, we demonstrate that inter-node GPU work stealing can enhance the performance of imbalanced applications by up to 45%.
ISBN (digital): 9783031725678
ISBN (print): 9783031725661; 9783031725678
The OpenMP® API offers both task-based and data-parallel concepts to scientific computing. While it provides descriptive and prescriptive annotations, it deliberately leaves unspecified, in many places, how its annotations are to be implemented. As the predominant OpenMP implementations share design rationales, they introduce "quasi-standards" for how certain annotations behave. By means of a task-based astrophysical simulation code, we highlight situations where this "quasi-standard" reference behaviour introduces performance flaws. We therefore propose prescriptive clauses to constrain OpenMP implementations. Simulated task traces uncover the clauses' potential, while a discussion of their realization highlights that they would require only incremental changes to any OpenMP runtime supporting task priorities.
ISBN (digital): 9781665498562
ISBN (print): 9781665498562
Shared-memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well suited to cope with the ubiquitous multi- and many-core systems, since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. It is the general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss of efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aimed at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and is able to compete with OpenMP in shared memory.