ISBN (print): 9783319985213; 9783319985206
In High-Performance Computing (HPC), the Field Programmable Gate Array (FPGA) is attracting increased attention as an accelerator because its performance has improved dramatically in recent years. At the same time, task-based programming, supported since OpenMP 4.0, makes it possible to expose substantial parallelism by executing parts of a program as tasks in a task graph. To accelerate a task-based parallel program with an FPGA, it is useful to offload dominant tasks that are frequently executed in parallel to the FPGA as asynchronous FPGA tasks. We present a performance optimization for OpenMP task-based programming with FPGA tasks that exploits the trade-off between kernel size and the number of kernels executed asynchronously in parallel, so as to use FPGA hardware resources efficiently. Since a "program" for an FPGA is converted directly into hardware, the hardware resource limitation raises a new optimization question: which tasks to offload to the FPGA, and how. Taking task-based block Cholesky factorization as a motivating example, we analyze this trade-off for the dominant "GEMM" task, which is frequently executed in parallel during task-graph execution. We found that, under the hardware resource limitation, multiple small kernels are better than a single big high-performance kernel because they deliver higher throughput and a higher kernel clock frequency.
ISBN (print): 9781479961238
The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it is yet to be demonstrated that industrial parallel codes, which rely on the classical Message Passing Interface (MPI) standard and accumulate decades of expertise (and countless lines of code), may be revisited to turn them into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for seismic imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node, and can more efficiently exploit complex hardware such as a cache-coherent Non-Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator.
ISBN (print): 9783031556722; 9783031556739
The increasing complexity of modern exascale computers, with a growing number of cores per node, poses a challenge to traditional programming models. To address this challenge, Asynchronous Many-Task (AMT) runtimes such as the C++-based HPX divide computational problems into smaller tasks that are executed asynchronously by the runtime. By unifying the syntax and semantics of local and remote task execution, scalability for distributed execution is enhanced. The asynchronous execution model conceals communication latency in distributed systems and eliminates global synchronization barriers, which improves the overall utilization of computational resources. While HPX and other AMT runtimes often support GPUs, there is still a lack of support for other accelerators, such as FPGAs, or more coarse-grained AI processing elements such as AMD's AI Engines (AIE). In this work, we extend the TaPaSCo framework so that TaPaSCo FPGA and AIE tasks can be transparently integrated into HPX applications. We show results for both microbenchmarks and the complete LULESH proxy HPC application to demonstrate this concept and evaluate the overheads. Both applications show that the combination of TaPaSCo and HPX can be used efficiently for cooperative computing between CPU software and FPGA/AIE hardware. Compared to CPU-only execution, we achieve a speedup of up to 2.4x in our stencil microbenchmark and a wall-clock speedup of 1.37x for the entire LULESH application, with 2.12x in the accelerated kernels themselves. Our TaPaSCo/HPX integration is released as open source.
This paper highlights the most significant enhancements made to PaRSEC, a scalable task-based runtime system designed for hybrid machines, during the Exascale Computing Project (ECP). The enhancements focus on expanding the capabilities of PaRSEC to address the evolving landscape of parallel computing. Notable achievements include the integration of support for three major types of accelerators (NVIDIA, AMD, and Intel GPUs), the refinement and increased flexibility of the communication subsystem, and the introduction of new programming interfaces tailored for irregular applications. Additionally, the project resulted in the development of powerful debugging and performance analysis tools aimed at assisting users in understanding and optimizing their applications. We present a comprehensive demonstration of these advancements through a series of benchmarks and applications within ECP and beyond. These results showcase the enhanced capabilities of PaRSEC across the diverse architectures within the ECP and provide valuable insight into the runtime system's adaptability and performance across varied computing environments.
ISBN (digital): 9783031617638
ISBN (print): 9783031617621; 9783031617638
Task-based programming models significantly improve the efficiency of parallel systems. The Sequential Task Flow (STF) model relies on static task sizes fixed within the task graph, but determining the optimal granularity at graph-submission time is tedious. To overcome this, we extend StarPU's STF recursive-task model, enabling the dynamic transformation of tasks into subgraphs. Early evaluations on homogeneous shared memory reveal that this just-in-time adaptation enhances performance.
ISBN (print): 9798350360691; 9798350360684
A common way of improving the performance of applications on multi-core processors is to exploit parallelism. In deep learning (DL), training or tuning parameters uses users' sensitive data, so preserving privacy is critical. Hardware-assisted protection mechanisms (i.e., trusted execution environments, TEEs) offer a practical privacy-preserving solution, nowadays available in both private and public data centers. We present SGX-OMPSS, a new approach combining a task-based programming model (OmpSs) with TEEs (Intel Software Guard Extensions). SGX-OMPSS supports asynchronous task parallelism and hardware heterogeneity by using the data dependencies between the tasks of an application, easily specified through code annotations. We evaluate SGX-OMPSS via several microbenchmarks and state-of-the-art DL applications and datasets (e.g., YOLO and MNIST). SGX-OMPSS achieves a 94% speedup gain while offering additional security guarantees.
ISBN (digital): 9783031617638
ISBN (print): 9783031617621; 9783031617638
The goal of the SpiniFEL project was to write, from scratch, a single particle imaging code for exascale supercomputers. The original vision was to have two versions of the code, one in MPI and one in Pygion, a Python-based interface to the Legion task-based runtime. We describe the motivation for the project, some of the programming challenges we encountered along the way, what worked and what didn't, and why only the Pygion code eventually succeeded in running at scale.
ISBN (print): 9798350395679; 9798350395662
Most contemporary HPC programming models assume an inelastic runtime in which the resources allocated to an application remain fixed throughout its execution. Conversely, elastic runtimes can expand and shrink resources based on availability and/or dynamic application requirements. In this paper, we implement elasticity for PaRSEC, a task-based dataflow runtime, using inter-node GPU work stealing. In addition to supporting elasticity, we demonstrate that inter-node GPU work stealing can enhance the performance of imbalanced applications by up to 45%.
ISBN (digital): 9783031725678
ISBN (print): 9783031725661; 9783031725678
The OpenMP® API offers both task-based and data-parallel concepts to scientific computing. While it provides descriptive and prescriptive annotations, it deliberately leaves unspecified, in many places, how its annotations are to be implemented. As the predominant OpenMP implementations share design rationales, they introduce "quasi-standards" for how certain annotations behave. By means of a task-based astrophysical simulation code, we highlight situations where this "quasi-standard" reference behaviour introduces performance flaws. We therefore propose prescriptive clauses to constrain OpenMP implementations. Simulated task traces uncover the clauses' potential, while a discussion of their realization highlights that they would require only incremental changes to any OpenMP runtime supporting task priorities.
ISBN (digital): 9781665498562
ISBN (print): 9781665498562
Shared-memory parallel programming models strive to provide low-overhead execution environments. Task-based programming models, in particular, are well suited to cope with the ubiquitous multi- and many-core systems, since they allow applications to express all available concurrency to a scheduler, which is tasked with exploiting the available hardware resources. It is the general consensus that atomic operations should be preferred over locks and mutexes to avoid inter-thread serialization and the resulting loss of efficiency. However, even atomic operations may serialize threads if not used judiciously. In this work, we discuss several optimizations applied to TTG and the underlying PaRSEC runtime system aimed at removing contentious atomic operations to reduce the overhead of task management to a few hundred clock cycles. The result is an optimized data-flow programming system that seamlessly scales from a single node to distributed execution and is able to compete with OpenMP in shared memory.