OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and acce...
详细信息
ISBN:
(纸本)9781665490207
OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers’ implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will cover the following aspects: (a) the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP ...
详细信息
This paper aims to present a multi-tiered approach to designing learning experiences in HPC for undergraduate students that significantly reinforce comprehension of CS topics while working with new concepts in paralle...
详细信息
This paper aims to present a multi-tiered approach to designing learning experiences in HPC for undergraduate students that significantly reinforce comprehension of CS topics while working with new concepts in parallel and distributed computing. The paper will detail the experience of students working in the design, construction, and testing of a computing cluster including budgeting, hardware purchase and setup, software installation and configuration, interconnection networks, communication, benchmarking, and running parallel code using MPI and OpenMP. The case study of building a relatively low-cost, small-scale computing cluster that can be used as a template for CS senior projects or independent studies, also yielded an opportunity to involve students in the creation of teaching tools for parallel computing at many levels of the CS curriculum.
Contingency analysis (CA) is one of the critical tools of a static security assessment (SSA). It is used to forecast the operating states of a power system under one or more outages of generators, transmission lines, ...
详细信息
Contingency analysis (CA) is one of the critical tools of a static security assessment (SSA). It is used to forecast the operating states of a power system under one or more outages of generators, transmission lines, transformers, etc. To perform SSA, repetitive load flow analyses are required for obtaining the bus voltages, bus injections, and line flows considering each possible outage. A repetitive load flow analysis demands huge computational efforts like efficient system modelling for faster load flow solutions, parallel programming and High performance computing (HPC). In this paper, an N-1-1 CA has been analysed using fast decoupled load flow (FDLF) with a strategy of screening and ranking the catastrophic contingencies. This paper explores a computationally efficient method to analyze the severity and the ranking of N-1-1 contingencies for large power system SSA. The performance of the FDLF based SSA method is demonstrated on two standard IEEE 14 and 118 bus systems.
The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers...
详细信息
The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed precision arithmetic as opposed to the HPL benchmark which measures double precision performance. Pushing the limits of systems at extreme scale is nontrivial ---little public literature explores optimization of mixed precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed for delivering portable performance on both AMD and NVIDIA GPUs at extreme scale.
Code generation and programming have become ever more challenging over the last decade due to the shift towards parallel processing. Emerging processor architectures such as multi-cores and GPUs exploit increasingly p...
详细信息
Code generation and programming have become ever more challenging over the last decade due to the shift towards parallel processing. Emerging processor architectures such as multi-cores and GPUs exploit increasingly parallelism, requiring programmers and compilers to deal with aspects such as threading, concurrency, synchronization, and complex memory partitioning. We advocate that programmers and compilers can greatly benefit from a structured classification of program code. Such a classification can help programmers to find opportunities for parallelization, reason about their code, and interact with other programmers. Similarly, parallelising compilers and source-to-source compilers can take threading and optimization decisions based on the same classification. In this work, we introduce algorithmic species, a classification of affine loop nests based on the polyhedral model and targeted for both automatic and manual use. Individual classes capture information such as the structure of parallelism and the data reuse. To make the classification applicable for manual use, a basic vocabulary forms the base for the creation of a set of intuitive classes. To demonstrate the use of algorithmic species, we identify 115 classes in a benchmark set. Additionally, we demonstrate the suitability of algorithmic species for automated uses by showing a tool to automatically extract species from program code, a species-based source-to-source compiler, and a species-based performance prediction model.
We introduce SpDISTAL, a compiler for sparse tensor algebra that targets distributed systems. SpDISTAL combines separate descriptions of tensor algebra expressions, sparse data structures, data distribution, and compu...
详细信息
We introduce SpDISTAL, a compiler for sparse tensor algebra that targets distributed systems. SpDISTAL combines separate descriptions of tensor algebra expressions, sparse data structures, data distribution, and computation distribution. Thus, it enables distributed execution of sparse tensor algebra expressions with a wide variety of sparse data structures and data distributions. SpDISTAL is implemented as a C++ library that targets a distributed task-based runtime system and can generate code for nodes with both multi-core CPUs and multiple GPUs. SpDISTAL generates distributed code that achieves performance competitive with hand-written distributed functions for specific sparse tensor algebra expressions and that outperforms general interpretation-based systems by one to two orders of magnitude.
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of...
详细信息
The British Geological Survey's global geomagnetic model, Model of the Earth's Magnetic Environment (MEME), is an important tool for calculating the strength and direction of the Earth's magnetic field, wh...
详细信息
The British Geological Survey's global geomagnetic model, Model of the Earth's Magnetic Environment (MEME), is an important tool for calculating the strength and direction of the Earth's magnetic field, which is continually in flux. While the ability to collect data from ground-based observation sites and satellites has grown rapidly, the memory bound nature of the original code has proved a significant limitation on the size of the modelling problem required. In this paper, we describe work done replacing the bespoke, sequential, eigensolver with that of the PETSc/SLEPc package for solving the system of normal equations. Adopting PETSc/SLEPc also required fundamental changes in how we built and distributed the data structures, and as such, we describe an approach for building symmetric matrices that provides good load balance and avoids the need for close coordination between the processes or replication of work. We also study the memory bound nature of the code from an irregular memory accesses perspective and combine detailed profiling with software cache prefetching to significantly optimise this. Performance and scaling characteristics are explored on ARCHER, a Cray XC30, where we achieved a speed up for the solver of 294 times by replacing the model's bespoke approach with SLEPc.
This work aims at contributing with a structured parallel programming abstraction for Rust in order to provide ready-to-use parallel patterns that abstract low-level and architecture-dependent details from application...
详细信息
This work aims at contributing with a structured parallel programming abstraction for Rust in order to provide ready-to-use parallel patterns that abstract low-level and architecture-dependent details from application programmers. We focus on stream processing applications running on shared-memory multi-core architectures (i.e, video processing, compression, and others). Therefore, we provide a new high-level and efficient parallel programming abstraction for expressing stream parallelism, named Rust-SSP. We also created a new stream benchmark suite for Rust that represents real-world scenarios and has different application characteristics and workloads. Our benchmark suite is an initiative to assess existing parallelism abstraction for this domain, as parallel implementations using these abstractions were provided. The results revealed that Rust-SSP achieved up to 41.1% better performance than other solutions. In terms of programmability, the results revealed that Rust-SSP requires the smallest number of extra lines of code to enable stream parallelism.
暂无评论