ISBN (Print): 9783030122744; 9783030122737
In recent years, the HPC landscape has shifted away from traditional multi-core CPU systems toward energy-efficient architectures, such as many-core CPUs and accelerators like GPUs, to achieve high performance. The goal of performance portability is to enable developers to rapidly produce applications that run efficiently on a variety of these architectures, with little to no architecture-specific code adaptation required. We implement a key kernel from a material science application using OpenMP 3.0, OpenMP 4.5, OpenACC, and CUDA on Intel architectures, Xeon and Xeon Phi, and on NVIDIA GPUs, P100 and V100. We compare the performance of the OpenMP 4.5 implementation with that of the more architecture-specific implementations, examine the performance of the OpenMP 4.5 implementation on CPUs after back-porting, share our experience optimizing large reduction loops, and discuss the latest compiler status for OpenMP 4.5 and OpenACC.
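To make the reduction-loop discussion concrete, the following is a minimal sketch of an OpenMP 4.5 target offload with a large reduction, the pattern the abstract refers to; the array size and the kernel body are illustrative placeholders, not the paper's material science kernel.

    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal sketch of an OpenMP 4.5 offloaded reduction loop. The same
     * directive back-ports to CPUs: without a device the target region
     * falls back to the host. The kernel body is a placeholder. */
    int main(void)
    {
        const long n = 1 << 24;
        double *a = malloc(n * sizeof *a);
        for (long i = 0; i < n; ++i)
            a[i] = 1.0 / (double)(i + 1);

        double sum = 0.0;
        #pragma omp target teams distribute parallel for \
                map(to: a[0:n]) reduction(+: sum)
        for (long i = 0; i < n; ++i)
            sum += a[i] * a[i];

        printf("sum = %f\n", sum);
        free(a);
        return 0;
    }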
ISBN (Print): 9781467373111
The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new software/hardware architectures for future Cyber-Physical Systems (CPSs). These systems are expected to react in real time, provide enough computational power for the assigned tasks, consume the least possible energy for those tasks (energy efficiency), scale up through modularity, allow for easy programmability across performance scaling, and make the best use of existing standards at minimal cost. Current solutions for providing enough computational power are mainly based on multi- or many-core architectures. For example, some current research projects (such as ADEPT or P-SOCRATES) are already investigating how to join efforts from the High-Performance Computing (HPC) and Embedded Computing domains, which are both focused on high power efficiency, while GPUs and new dataflow platforms such as Maxeler, or FPGAs in general, are claimed to be the most energy-efficient. We present the project's initial approach, ideas, and key concepts, and describe the preliminary AXIOM architecture. Our starting point uses power-efficient multi-core nodes with ARM cores and FPGA accelerators on the same die, as in the Xilinx Zynq. We will work to provide an integrated environment that supports programmability of the parallel, interconnected nodes that form a CPS system, and evaluate our ideas using demanding test application scenarios.
Today, data-parallel programming models are the most successful programming models for parallel computers, both in terms of efficiency of execution and ease of use for the programmer. However, there is no parallel programming model that is conceptually simple and abstract, and that can be ported efficiently to the variety of parallel architectures available. The nested data-parallel programming model has some of the desired properties of a parallel programming model. In contrast to flat data-parallel models, with this model it is possible to express irregular data structures and irregular parallel computations directly. In this paper, a collection-oriented approach to nested data parallelism is introduced. The state of the art of related research is presented and open questions are identified.
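As an illustration of the irregular, nested parallelism that the nested data-parallel model targets (this is a plain C/OpenMP rendering, not the paper's collection-oriented notation), the sketch below computes a sparse matrix-vector product over a CSR structure: the outer collection is processed in parallel while each element owns an inner segment of a different length.

    #include <stdio.h>

    /* Illustrative only: irregular nested parallelism as a CSR sparse
     * matrix-vector product. Each row owns a variable-length inner
     * segment, the kind of nesting a flat data-parallel model cannot
     * express directly. Compiles and runs serially without -fopenmp. */
    static void spmv_csr(int nrows, const int *rowptr, const int *col,
                         const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic)
        for (int r = 0; r < nrows; ++r) {
            double acc = 0.0;
            for (int k = rowptr[r]; k < rowptr[r + 1]; ++k)  /* irregular inner extent */
                acc += val[k] * x[col[k]];
            y[r] = acc;
        }
    }

    int main(void)
    {
        /* 3x3 matrix [[1,0,2],[0,3,0],[4,5,6]] in CSR form */
        int rowptr[] = {0, 2, 3, 6};
        int col[]    = {0, 2, 1, 0, 1, 2};
        double val[] = {1, 2, 3, 4, 5, 6};
        double x[]   = {1, 1, 1}, y[3];
        spmv_csr(3, rowptr, col, val, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);
        return 0;
    }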
In state-of-the-art FPGAs, especially in chiplet-based devices, place and route has become an important challenge due to an increase in device size and complexity. In the same way, off-chip memory resources have grown ...
ISBN (Print): 9781479961238
Task-based parallel programming models with explicit data dependencies, such as OmpSs, are gaining popularity, due to the ease of describing parallel algorithms with complex and irregular dependency patterns. These advantages, however, come at a steep cost of runtime overhead incurred by dynamic dependency resolution. Hardware support for task management has been proposed in previous work as a possible solution. We present VSs, a runtime library for the OmpSs programming model that integrates the Nexus++ hardware task manager, and evaluate the performance of the VSs-Nexus++ system. Experimental results show that applications with fine-grain tasks can achieve speedups of up to 3.4x, while applications optimized for current runtimes attain 1.3x. Providing support for hardware task managers in runtime libraries is therefore a viable approach to improve the performance of OmpSs applications.
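For readers unfamiliar with the model, the following is a minimal OmpSs-style sketch of tasks with explicit data dependencies; it only illustrates how fine-grain dependency patterns are declared, not the Nexus++ hardware path, and the clause spelling follows common OmpSs/Mercurium usage, which may vary between versions. With a plain C compiler the pragmas are ignored and the program runs serially.

    #include <stdio.h>

    /* Two producer tasks and one consumer; with an OmpSs runtime their
     * execution order is derived from the declared in/out dependencies,
     * not from program order. */
    #pragma omp task out(*x)
    void produce(int *x, int v) { *x = v; }

    #pragma omp task in(*a, *b) out(*c)
    void consume(const int *a, const int *b, int *c) { *c = *a + *b; }

    int main(void)
    {
        int a, b, c;
        produce(&a, 1);        /* task: writes a */
        produce(&b, 2);        /* task: writes b, independent of the first */
        consume(&a, &b, &c);   /* task: runs once both producers finish */
        #pragma omp taskwait   /* wait for the task graph before reading c */
        printf("c = %d\n", c);
        return 0;
    }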
ISBN (Print): 9780769549712
As new heterogeneous systems and hardware accelerators appear, high-performance computers can reach a higher level of computational power. Nevertheless, this does not come for free: the more heterogeneity the system presents, the more complex the programming task becomes in terms of resource management. OmpSs is a task-based programming model and framework focused on the runtime exploitation of parallelism from annotated sequential applications. This paper presents a set of extensions to this framework: we show how the application programmer can expose different specialized versions of tasks (i.e., pieces of specific code targeted and optimized for a particular architecture) and how the system can choose among these versions at runtime to obtain the best achievable performance for the given application. From the results obtained on a multi-GPU system, we show that our proposal adds flexibility to the application's source code and can potentially increase the application's performance.
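The sketch below shows how such task versioning is commonly expressed in OmpSs, with one generic implementation and one alternative registered for the same task; the clause spelling (target/device/implements) follows public OmpSs documentation and may not match the paper's extension exactly, and the second body is a stand-in for architecture-specific (e.g., CUDA) code. With a plain C compiler the pragmas are ignored.

    #include <stdio.h>

    /* Generic task version. */
    #pragma omp target device(smp)
    #pragma omp task in([n] x) out([n] y)
    void scale(const float *x, float *y, int n, float s)
    {
        for (int i = 0; i < n; ++i) y[i] = s * x[i];
    }

    /* Alternative version of the same task; with an OmpSs toolchain the
     * runtime may choose either implementation per invocation. A real
     * specialization would use device(cuda) and a GPU kernel body. */
    #pragma omp target device(smp) implements(scale)
    #pragma omp task in([n] x) out([n] y)
    void scale_alt(const float *x, float *y, int n, float s)
    {
        for (int i = n - 1; i >= 0; --i) y[i] = s * x[i];
    }

    int main(void)
    {
        float x[4] = {1, 2, 3, 4}, y[4];
        scale(x, y, 4, 2.0f);
        #pragma omp taskwait
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
        return 0;
    }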
ISBN (Print): 9798400704352
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes, in the context of a mostly message-passing interface enhanced with the ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and (b) by leveraging lock-free data structures in shared memory to achieve high-performance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups of up to 2.1x on the CoMD molecular dynamics and miniAMR adaptive mesh refinement applications, scaling up to 4,096 cores.
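To illustrate the kind of shared-memory structure the abstract alludes to for intra-node messaging (this is a generic single-producer/single-consumer lock-free ring in C11 atomics, not Pure's actual implementation or API), a short sketch follows.

    #include <stdatomic.h>
    #include <stdio.h>

    #define CAP 1024   /* ring capacity, power of two */

    /* Single-producer/single-consumer lock-free ring buffer: the producer
     * only writes tail, the consumer only writes head, so no locks are
     * needed for rank-to-rank messages within a node. */
    typedef struct {
        _Atomic unsigned head;   /* advanced by the consumer */
        _Atomic unsigned tail;   /* advanced by the producer */
        int buf[CAP];
    } spsc_ring;

    static int ring_push(spsc_ring *r, int v)
    {
        unsigned t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        unsigned h = atomic_load_explicit(&r->head, memory_order_acquire);
        if (t - h == CAP) return 0;                       /* full */
        r->buf[t & (CAP - 1)] = v;
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return 1;
    }

    static int ring_pop(spsc_ring *r, int *v)
    {
        unsigned h = atomic_load_explicit(&r->head, memory_order_relaxed);
        unsigned t = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (t == h) return 0;                             /* empty */
        *v = r->buf[h & (CAP - 1)];
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return 1;
    }

    int main(void)
    {
        static spsc_ring r;   /* zero-initialized */
        int v;
        ring_push(&r, 42);
        if (ring_pop(&r, &v)) printf("received %d\n", v);
        return 0;
    }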
ISBN (Print): 9781479968602
This article describes the development of a software and hardware environment for integrating individual cluster systems into a single, integrated parallel HPC system. The elaborated applications can be ported to the resources of the integrated HPC system. To help students acquire the necessary theoretical and practical skills in using regional HPC clusters, an interactive educational course on parallel programming, HPC clusters, and the use of parallel software is currently being prepared.
ISBN (Print): 9781450388160
Cutting-edge functionalities in embedded systems require the use of parallel architectures to meet their performance requirements. This imposes the introduction of a new layer in the software stacks of embedded systems: the parallel programming model. Unfortunately, the tools used to analyze embedded systems fall short in characterizing the performance of parallel applications at the parallel programming model level and correlating it with information about non-functional requirements such as real-time behavior, energy, memory usage, etc. HPC tools, like Extrae, are designed with that level of abstraction in mind, but their main focus is on performance evaluation. Overall, providing insightful information about the performance of parallel embedded applications at the parallel programming model level, and relating it to the non-functional requirements, is of paramount importance to fully exploit the performance capabilities of parallel embedded architectures. This paper contributes to the state of the art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted memory, limited drivers) in order to support HPC analysis tools; (2) porting Extrae, a powerful tracing tool from the HPC domain, to the GR740 platform, a SoC used in the space domain; and (3) augmenting Extrae with new features needed to correlate the parallel execution with the following non-functional requirements: energy, temperature, and memory usage. Finally, the paper demonstrates the usefulness of Extrae for characterizing OpenMP applications and their non-functional requirements, evaluating different aspects of the applications running on the GR740.
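As a rough illustration of how an application can be correlated with such metrics through Extrae's user-event API (the event type 1000 and the read_soc_temperature() helper are hypothetical placeholders, not part of Extrae or the GR740 software stack), consider the sketch below.

    #include "extrae_user_events.h"   /* Extrae user-event API */

    /* Hypothetical metric hook; a real port would read a platform sensor. */
    static long long read_soc_temperature(void) { return 42; }

    int main(void)
    {
        Extrae_init();

        Extrae_event(1000, read_soc_temperature());   /* sample before the region */
        #pragma omp parallel
        {
            /* parallel work to be characterized */
        }
        Extrae_event(1000, read_soc_temperature());   /* sample after the region */

        Extrae_fini();
        return 0;
    }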
ISBN (Print): 9781450399166
Shared memory platforms provide a memory consistency specification (MCS) so that developers can reason about the behaviors of their parallel programs. Unfortunately, ensuring that a platform conforms to its MCS is difficult, as is exemplified by numerous bugs in well-used platforms. While existing MCS testing approaches find bugs, their efficacy depends on the testing environment (e.g., whether synthetic memory pressure is applied). MCS testing environments are difficult to evaluate, since legitimate MCS violations are too rare to use as an efficacy metric. As a result, prior approaches have missed critical MCS bugs. This work proposes a mutation testing approach for evaluating MCS testing environments: MC Mutants. This approach mutates MCS tests such that the mutants simulate bugs that might occur. A testing environment can then be evaluated using a mutation score. We utilize MC Mutants in two novel contributions: (1) a parallel testing environment, and (2) an MCS testing confidence strategy that is parameterized by a time budget and a confidence threshold. We implement our contributions in WebGPU, a new web-based GPU programming specification, and evaluate our techniques across four GPUs. We improve testing speed by three orders of magnitude over prior work, empowering us to create a conformance test suite that reproduces many mutated tests with high confidence and requires only 64 seconds per test. We identified two bugs in WebGPU implementations, one of which led to a specification change. Moreover, the official WebGPU conformance test suite has adopted our approach due to its efficiency, effectiveness, and broad applicability.
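For context on what an MCS test and its mutant look like (this is a generic message-passing litmus test in C11 atomics and pthreads, not WGSL and not the MC Mutants harness), a small sketch follows: under relaxed ordering the outcome r1 == 1 && r2 == 0 is permitted, and a mutant of the test would instead flag that outcome as a violation, simulating a bug the testing environment should be able to reproduce.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Message-passing litmus test: the writer publishes data then a flag;
     * the reader observes them in the opposite order. With relaxed atomics
     * the weak outcome r1==1 && r2==0 is allowed by the memory model. */
    static _Atomic int flag, data;
    static int r1, r2;

    static void *writer(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&data, 1, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        r1 = atomic_load_explicit(&flag, memory_order_relaxed);
        r2 = atomic_load_explicit(&data, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, writer, NULL);
        pthread_create(&b, NULL, reader, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r1=%d r2=%d%s\n", r1, r2,
               (r1 == 1 && r2 == 0) ? "  (weak behavior observed)" : "");
        return 0;
    }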