ISBN (Print): 9783319321493; 9783319321486
Transistor size reduction and more aggressive power modes in HPC platforms make chip components more error prone. In this context, HPC applications can have diverse levels of tolerance to memory errors, which may change the execution in different ways. Because tolerance to memory errors depends on write frequency and access patterns, different programming models may exhibit different failure rates and alleviate the performance loss caused by the overhead of fault-tolerance mechanisms. In this paper, we explore how tolerant the two main parallel programming models, message passing and shared memory, are to memory errors: we perform a memory vulnerability analysis and conduct error-propagation experiments to observe the effect of memory errors on program flow. Our results show the need for soft-error resiliency methods based on the memory behavior of programs, and for evaluating the trade-offs between performance and reliability.
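The error-propagation experiments contrast a clean run with one perturbed by an injected fault. Below is a minimal, hypothetical sketch of that methodology (inject_bit_flip is illustrative, not the authors' tool): it flips one random bit in an application buffer to emulate a soft memory error, then reports how far the result drifts from the golden run.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Flip one random bit in a buffer, emulating a soft memory error.
    void inject_bit_flip(void* data, size_t bytes, std::mt19937& rng) {
        std::uniform_int_distribution<size_t> byte_at(0, bytes - 1);
        std::uniform_int_distribution<int> bit_at(0, 7);
        auto* p = static_cast<uint8_t*>(data);
        p[byte_at(rng)] ^= uint8_t(1) << bit_at(rng);
    }

    int main() {
        std::mt19937 rng(42);
        std::vector<double> v(1 << 20, 1.0);
        double clean = 0.0, faulty = 0.0;
        for (double x : v) clean += x;                        // golden run
        inject_bit_flip(v.data(), v.size() * sizeof(double), rng);
        for (double x : v) faulty += x;                       // perturbed run
        std::printf("clean=%.2f faulty=%.2f drift=%e\n",
                    clean, faulty, faulty - clean);
    }

Repeating such a trial many times, over buffers with different write frequencies and access patterns, is what lets a vulnerability analysis distinguish benign upsets from those that corrupt results.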
Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but using these devices complicates the programming task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way, so the spare time can be used to do other work while waiting for device operations. Implemented on top of a task-based framework, the experimental evaluation of AMA on a quad-GPU node shows that we reach the performance of hand-tuned native CUDA code, with the advantage of fully hiding the device management. In addition, we obtain more than a 2x speed-up with respect to the original framework implementation.
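The blocking-free overlap that AMA exploits can be illustrated at host level with standard C++ futures. This is a hedged analogue, not AMA's API: fake_device_op stands in for an asynchronous transfer or kernel launch, and the host performs other work instead of waiting.

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <future>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Stand-in for an asynchronous device transfer or kernel launch.
    double fake_device_op(const std::vector<double>& a) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return std::accumulate(a.begin(), a.end(), 0.0);
    }

    int main() {
        std::vector<double> a(1 << 20, 0.5);
        // Launch the "device" operation without blocking the host.
        auto pending = std::async(std::launch::async, fake_device_op,
                                  std::cref(a));
        // Blocking-free: do other useful work while it is in flight.
        double host_work = 0.0;
        for (int i = 0; i < 1000; ++i) host_work += i * 0.001;
        // Synchronize only when the device result is actually needed.
        std::printf("device=%.1f host=%.3f\n", pending.get(), host_work);
    }

In AMA itself this pattern is driven by the task-based runtime, which fills the wait time with ready tasks automatically rather than with hand-written host code.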
ISBN (Print): 9781509000883
Parallel systems that employ CPUs and GPUs as two heterogeneous computational units have become immensely popular due to their ability to maximize performance under restrictive thermal budgets. However, programming heterogeneous systems via traditional programming models like OpenCL or CUDA involves rewriting large portions of application code, and it leads to code that is not performance portable across different architectures, or even across different generations of the same architecture. In this paper, we evaluate the current state of two emerging parallel programming models: C++ AMP and OpenACC. These paradigms require minimal code changes and rely on compilers to interact with the low-level hardware, thereby producing performance-portable code from an application standpoint. We analyze the performance and productivity of the emerging programming models and compare them with OpenCL using a diverse set of applications on two different architectures: a CPU coupled with a discrete GPU, and an Accelerated Processing Unit (APU). Our experiments demonstrate that while the emerging programming models improve programmer productivity, they do not yet expose enough flexibility to extract maximum performance compared to traditional programming models.
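As a flavor of the directive-based style evaluated here, the sketch below expresses SAXPY with a single OpenACC directive; contrast this with the contexts, buffers, and kernel boilerplate OpenCL requires for the same loop. Without an OpenACC compiler (e.g., nvc++ -acc), the pragma is ignored and the loop simply runs serially. The problem size and data are illustrative, not the paper's benchmarks.

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 3.0f);
        float* xp = x.data();
        float* yp = y.data();
        // One directive offloads the loop and describes data movement;
        // the compiler handles the low-level hardware interaction.
        #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
        for (int i = 0; i < n; ++i)
            yp[i] = alpha * xp[i] + yp[i];
        std::printf("y[0]=%f\n", yp[0]);   // expect 5.0
    }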
Since Aurora entered the TOP500 list in November 2023, the top ten systems have seen shifts in the ratio of GPU vendors represented. With each vendor supplying its own preferred programming models for its hardware, it becomes relevant to compare the portability of these models on other hardware platforms. For the present paper, we implemented the N-body problem with different optimizations using native and portable programming frameworks. For each framework, we determined the best-performing optimized version on one target architecture and compared the performance achieved on each platform.
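The computational core being ported is the classic direct-sum force calculation. A minimal C++ version is sketched below; the softening constant and problem size are illustrative choices, not the paper's configuration.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Body { float x, y, z, m; };

    // Direct O(N^2) acceleration accumulation -- the kernel each
    // framework implements and optimizes for its target hardware.
    void accelerations(const std::vector<Body>& b, std::vector<float>& ax,
                       std::vector<float>& ay, std::vector<float>& az) {
        const float eps2 = 1e-6f;   // softening to avoid division by zero
        for (size_t i = 0; i < b.size(); ++i) {
            float fx = 0, fy = 0, fz = 0;
            for (size_t j = 0; j < b.size(); ++j) {
                float dx = b[j].x - b[i].x, dy = b[j].y - b[i].y,
                      dz = b[j].z - b[i].z;
                float r2 = dx * dx + dy * dy + dz * dz + eps2;
                float inv = 1.0f / std::sqrt(r2);
                float s = b[j].m * inv * inv * inv;
                fx += dx * s; fy += dy * s; fz += dz * s;
            }
            ax[i] = fx; ay[i] = fy; az[i] = fz;
        }
    }

    int main() {
        std::vector<Body> b(256, Body{0, 0, 0, 1.0f});
        for (size_t i = 0; i < b.size(); ++i) b[i].x = float(i);
        std::vector<float> ax(b.size()), ay(b.size()), az(b.size());
        accelerations(b, ax, ay, az);
        std::printf("ax[0]=%e\n", ax[0]);
    }

Typical optimizations then vary by platform: tiling the inner loop through shared or local memory on GPUs, and vectorizing it on CPUs.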
Over the years, Field-Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension of OmpSs-2 that aims to unify FPGA clusters through a message-passing interface compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), in which the user does not need to call any message-passing API; instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
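The idea behind IMP can be approximated with standard OpenMP task dependencies, sketched below. This is an analogue, not the actual OmpSs@FPGA syntax: the in/out annotations carry exactly the information a runtime needs to infer data movement between nodes on the user's behalf. Compiled without OpenMP support, the pragmas are ignored and the code runs serially.

    #include <cstdio>

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {0};
        #pragma omp parallel
        #pragma omp single
        {
            // Producer task: writes b. Under IMP, the runtime would
            // also know b must be shipped to whichever node or FPGA
            // runs the consumer task.
            #pragma omp task depend(out: b)
            for (int i = 0; i < 4; ++i) b[i] = 2 * a[i];

            // Consumer task: the in-dependency makes the transfer
            // implicit; the user never calls a message-passing API.
            #pragma omp task depend(in: b)
            std::printf("b[3]=%f\n", b[3]);
        }
        return 0;
    }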
Within the computing continuum, SBCs (single-board computers) are essential in the Edge and Fog, with many featuring multiple processing cores and GPU accelerators. Parallel computing therefore plays a crucial role in unlocking the full computational potential of SBCs. However, selecting the best-suited solution in this context is inherently complex due to the intricate interplay between PPI (parallel programming interface) strategies, SBC architectural characteristics, and application characteristics and constraints. To our knowledge, no existing solution discusses these three aspects in combination. To tackle this problem, this article provides a benchmark of the best-suited PPIs for a given set of hardware and application characteristics and requirements. Compared to existing benchmarks, we introduce new metrics, additional applications, various parallelism interfaces, and extra hardware devices. Our contributions are, therefore, a methodology to benchmark parallelism on SBCs and a characterization of the best-performing PPIs and parallelism strategies for given situations. We are confident that parallel computing will become mainstream for processing in edge and fog computing; thus, our solution provides the first insights into which kind of application and parallel programming interface is best suited to a particular SBC hardware.
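The kind of measurement such a benchmark performs can be sketched with standard C++ threads: time one kernel at several thread counts and compare. The kernel, problem size, and thread counts below are illustrative placeholders, not the article's workloads or metrics.

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // A simple compute-bound kernel chunk.
    double work_chunk(long begin, long end) {
        double s = 0.0;
        for (long i = begin; i < end; ++i) s += 1.0 / double(i + 1);
        return s;
    }

    // Run the kernel split evenly across a given number of threads.
    double run(int threads, long n) {
        std::vector<std::thread> pool;
        std::vector<double> part(threads, 0.0);
        long chunk = n / threads;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back([&, t] {
                part[t] = work_chunk(t * chunk, (t + 1) * chunk);
            });
        for (auto& th : pool) th.join();
        double s = 0.0;
        for (double p : part) s += p;
        return s;
    }

    int main() {
        const long n = 1L << 24;
        for (int t : {1, 2, 4}) {   // typical SBC core counts
            auto t0 = std::chrono::steady_clock::now();
            volatile double s = run(t, n);
            (void)s;
            auto ms = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0).count();
            std::printf("%d threads: %.1f ms\n", t, ms);
        }
    }

A full benchmark would repeat this across PPIs (e.g., OpenMP, threads, GPU interfaces), applications, and boards, and record energy alongside time.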
From edge to exascale, computer architectures are becoming more heterogeneous and complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their software framework for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more agile and proactive software framework is essential to increase performance portability and improve user productivity. To this end, we have designed and implemented IRIS: a performance-portable framework for cross-platform heterogeneous computing. IRIS can discover available resources, manage multiple diverse programming platforms (e.g., CUDA, Hexagon, HIP, Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data movement proactively, and provide for user-configurable scheduling. To simplify data movement, IRIS introduces a shared virtual device memory with relaxed consistency among different heterogeneous devices. IRIS also adds an automatic kernel workload partitioning technique using the polyhedral model so that it can resize kernels for a wide range of devices. Our evaluation on three architectures, ranging from Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of diverse heterogeneous architectures with negligible overhead.
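As a conceptual sketch only (hypothetical types and names, not the IRIS API), the core idea reduces to one task abstraction dispatched to whichever registered backend the scheduling policy selects:

    #include <cstdio>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // A task names a kernel and the data it operates on.
    struct Task { std::string kernel; std::vector<float>* data; };

    using Backend = std::function<void(Task&)>;

    int main() {
        std::vector<float> v(8, 1.0f);
        // Registered backends; a real runtime would discover CUDA, HIP,
        // Level Zero, OpenCL, or OpenMP devices at startup and choose
        // one per task via its scheduler and data-locality information.
        std::map<std::string, Backend> backends = {
            {"openmp", [](Task& t) { for (auto& x : *t.data) x *= 2.0f; }},
            {"serial", [](Task& t) { for (auto& x : *t.data) x *= 2.0f; }},
        };
        Task t{"scale2", &v};
        // User-configurable policy, reduced here to "first available".
        backends.begin()->second(t);
        std::printf("v[0]=%f\n", v[0]);
    }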
ISBN (Print): 9783031733697; 9783031733703
The evolution of parallel computing architectures presents new challenges for developing efficient parallelized codes. The emergence of heterogeneous systems has given rise to multiple programming models, each requiring careful adaptation to maximize performance. In this context, we propose reevaluating memory layout designs for computational tasks within larger nodes by comparing various architectures. To gain insight into the performance discrepancies between shared-memory and shared-address-space settings, we systematically measure the bandwidth between cores and sockets using different methodologies. Our findings reveal significant performance differences, suggesting that MPI running inside UNIX processes may not fully utilize its intra-node bandwidth potential. In light of our work on the MPC thread-based MPI runtime, which can leverage shared memory to achieve higher performance due to its optimized layout, we advocate for enabling the use of shared memory within the MPI standard.
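For reference, MPI-3 already exposes a restricted form of intra-node shared memory through shared windows, the closest existing mechanism to what the authors advocate extending. A minimal sketch is below (all calls are from the MPI standard; synchronization is simplified, and a strict implementation would add MPI_Win_sync around the load/store accesses):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        // Group the ranks that share a physical node.
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int rank;
        MPI_Comm_rank(node, &rank);

        // Rank 0 allocates the shared segment; others attach with size 0.
        double* base = nullptr;
        MPI_Win win;
        MPI_Aint bytes = (rank == 0) ? 1024 * sizeof(double) : 0;
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                node, &base, &win);

        // Every rank obtains a direct pointer to rank 0's segment and
        // accesses it with plain loads/stores -- no messages involved.
        MPI_Aint qbytes; int disp; double* shared = nullptr;
        MPI_Win_shared_query(win, 0, &qbytes, &disp, &shared);

        if (rank == 0) shared[0] = 42.0;
        MPI_Barrier(node);              // order the write before the reads
        std::printf("rank %d sees %f\n", rank, shared[0]);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

Compile with mpicxx and run with mpirun on a single node; a thread-based runtime such as MPC can offer the same direct access without this explicit window setup, which is the layout advantage the abstract refers to.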
In a bid to improve living standards, the African Union and the African Development Bank are encouraging the free movement of goods and productivity via technological innovation. In 2018, they even signed a continental free ...