ISBN (Print): 9781665465007
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is the lack of a standard programming interface. Implementing and debugging an application on multiple FPGA boards is difficult without one. The Message Passing Interface (MPI) is a standard parallel programming interface commonly used in distributed-memory systems. This paper presents a tool-independent MPI library called FiC-MPI that can be used in HLS for multi-FPGA systems in which the FPGA nodes are directly connected. With FiC-MPI, a variety of parallel software, including general-purpose benchmarks, can be implemented easily. FiC-MPI was implemented and evaluated on the M-KUBOS cluster, which consists of Zynq MPSoC boards connected by a static time-division multiplexing network. Using the FiC-MPI simulator, parallel programs can be debugged before being deployed on the real machine. As a case study, the Himeno-BMT benchmark was implemented with FiC-MPI. It achieved 178.7 MFLOPS on a single node and scaled to 643.7 MFLOPS on four nodes and 896.9 MFLOPS on six nodes of the M-KUBOS cluster. The implementation demonstrates the ease of developing parallel programs with FiC-MPI on multi-FPGA systems.
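For reference, the sketch below illustrates the message-passing style that an MPI library for HLS adopts, written against the standard MPI C API; FiC-MPI's actual HLS-facing function names and signatures are not given in the abstract, so the standard calls shown here are purely illustrative.

    // Sketch of an MPI-style collective reduction across nodes (ranks).
    // Written against standard MPI; FiC-MPI's own API is not shown in the abstract.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each node owns a slice of the data; a single value stands in for it here.
        double local = static_cast<double>(rank + 1);
        double global = 0.0;

        // Collective reduction across all nodes.
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("sum over %d ranks = %f\n", size, global);

        MPI_Finalize();
        return 0;
    }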
ISBN (Print): 9781665454452
Python's ease of use and rich collection of numeric libraries make it an excellent choice for rapidly developing scientific applications. However, composing these libraries to take advantage of complex heterogeneous nodes is still difficult. To simplify writing multi-device code, we created Parla, a heterogeneous task-based programming framework that fully supports Python's scientific programming stack. Parla's API is based on Python decorators and allows users to wrap code in Parla tasks for parallel execution. Parla arrays enable automatic movement of data between devices. The Parla runtime handles resource-aware mapping, scheduling, and execution of tasks. Compared to other Python tasking systems, Parla is unique in its parallelization of tasks within a single process, its GPU context and resource-aware runtime, and its design around gradual adoption to provide easy migration of and integration into existing Python applications. We show that Parla can achieve performance competitive with hand-optimized code while improving ease of development.
ISBN (Print): 9781665460224
The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and HPC. While the adoption of ONNX as a de facto standard for DNN model description provides portability across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. SYCL provides a C++-based portable parallel programming model to target various devices, so enabling a SYCL backend for an AI framework can lead to a hardware-agnostic model for heterogeneous systems. This paper proposes a SYCL backend for ONNXRuntime as a possible solution towards the performance portability of deep learning algorithms. The proposed backend uses the existing state-of-the-art SYCL-DNN and SYCL-BLAS libraries to invoke tuned SYCL kernels for DNN operations. Our performance evaluation shows that the proposed approach can achieve performance comparable to state-of-the-art optimized vendor-specific libraries.
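For context, the portability argument rests on SYCL's single-source C++ kernels. The minimal sketch below shows the style of kernel that libraries such as SYCL-DNN and SYCL-BLAS provide in tuned form; it is a generic element-wise kernel written against the SYCL 2020 API, not code from the proposed backend.

    // Generic SYCL 2020 element-wise kernel: one source, any SYCL device.
    #include <sycl/sycl.hpp>
    #include <vector>
    #include <cstdio>

    int main() {
        constexpr size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        sycl::queue q;  // selects a default device: CPU, GPU, or accelerator
        {
            sycl::buffer<float> bufA(a.data(), sycl::range<1>(n));
            sycl::buffer<float> bufB(b.data(), sycl::range<1>(n));
            sycl::buffer<float> bufC(c.data(), sycl::range<1>(n));

            q.submit([&](sycl::handler& h) {
                sycl::accessor A(bufA, h, sycl::read_only);
                sycl::accessor B(bufB, h, sycl::read_only);
                sycl::accessor C(bufC, h, sycl::write_only, sycl::no_init);
                // Element-wise addition expressed once, portable across devices.
                h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                    C[i] = A[i] + B[i];
                });
            });
        }  // buffer destruction copies the result back to the host vector

        std::printf("c[0] = %f\n", c[0]);
        return 0;
    }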
ISBN (Print): 9781665473675
oneAPI is a major initiative by Intel aimed at making it easier to program heterogeneous architectures used in high-performance computing using a unified application programming interface (API). While raising the abstraction level via a unified API represents a promising step for the current generation of students and practitioners to embrace high-performance computing, we argue that a curriculum of well-developed software engineering methods and well-crafted exemplars will be necessary to ensure interest by this audience and those who teach them. We aim to bridge the gap by developing a curriculum, codenamed UnoAPI, that takes a more holistic approach by looking beyond language and framework to include the broader development ecosystem, similar to the experience found in popular HPC languages such as Python. We hope to make parallel programming a more attractive option by making it look more like general application development in the modern languages used by most students and educators today. Our curriculum emanates from the perspective of well-crafted exemplars grounded in the foundations of computer systems, given that most HPC architectures of interest begin from the systems tradition, with an integrated treatment of essential principles of distributed systems, programming languages, and software engineering. We argue that a curriculum should cover the essence of these topics to attract students to HPC and enable them to confidently solve computational problems using oneAPI. At the time of this submission, we have shared our materials with a small group of undergraduate sophomores, and their responses have been encouraging in terms of self-reported comprehension and ability to reproduce the compilation and execution of exemplars on their personal systems. We plan a follow-up study with a larger cohort by incorporating some of our materials into our existing course on High-Performance Computing.
Graph Neural Networks (GNNs) have shown great success in graph learning, including physics systems, protein interfaces, disease classification, and molecular fingerprints. Due to the complexity of real-world tasks and the size of graph datasets, current GNN models are becoming bigger and more complicated in order to improve learning ability and prediction accuracy. In addition, GNN processing consists of two major components, graph operations and neural network (NN) operations, which are executed alternately. This interleaved, complex processing poses a challenge for many computational platforms, especially those without accelerators. Optimization frameworks designed solely for deep learning or for graph computing cannot achieve good performance on GNNs. In this work, we first investigate the performance bottlenecks of GNN processing on multi-core processors. We then apply a set of optimization strategies to leverage the capabilities of multi-core processors for GNN processing. Specifically, we build a set of microkernels for the graph operations of GNNs using assembly instructions that exploit the SIMD processing units within a core, and we implement a task allocator based on a greedy algorithm to balance the workload across CPU cores. In addition, we optimize the NN operations according to their characteristics in GNNs. Experimental results show that the proposed methods achieve up to 2.88x and 4.08x performance improvements for various GNN models on Phytium 2000+ and Kunpeng 920, respectively, over a state-of-the-art GNN framework.
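The abstract mentions a greedy task allocator for balancing work across cores. A minimal sketch of one common greedy strategy, assigning each task (largest estimated cost first) to the currently least-loaded core, is shown below; the paper's actual cost model and data structures are not described in the abstract.

    // Greedy load balancing: sort tasks by estimated cost (descending) and
    // assign each to the least-loaded core. Generic sketch, not the paper's code.
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    std::vector<int> greedy_assign(const std::vector<double>& cost, int num_cores) {
        std::vector<int> order(cost.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return cost[a] > cost[b]; });

        // Min-heap of (accumulated load, core id).
        using Load = std::pair<double, int>;
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> cores;
        for (int c = 0; c < num_cores; ++c) cores.push({0.0, c});

        std::vector<int> assignment(cost.size());
        for (int t : order) {
            auto [load, core] = cores.top();
            cores.pop();
            assignment[t] = core;
            cores.push({load + cost[t], core});
        }
        return assignment;
    }

    int main() {
        std::vector<double> cost = {5.0, 3.0, 8.0, 1.0, 4.0, 2.0};  // hypothetical task costs
        auto a = greedy_assign(cost, 2);
        for (size_t i = 0; i < a.size(); ++i)
            std::printf("task %zu -> core %d\n", i, a[i]);
        return 0;
    }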
ISBN (Print): 9781665490207
OpenACC is a high-level, directive-based parallel programming model that can manage the complexity of heterogeneous architectures and abstract it from the user. The model's portability across CPUs and accelerators has earned it a wide variety of users, which makes it crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite that verifies OpenACC implementations across various compilers, together with an infrastructure for more streamlined execution. This paper covers the following aspects: (a) the new developments since the last publication on the testsuite, (b) an outline of the use of the infrastructure, (c) a discussion of tests that highlight our workflow process, (d) an analysis of the results from executing the testsuite on various systems, and (e) an outline of future developments.
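As an illustration of what such directive feature tests typically look like, the sketch below offloads a loop with an OpenACC directive and verifies the result on the host; it is a generic example, not a test taken from the V&V testsuite.

    // Shape of a typical directive feature test: offload a loop with OpenACC,
    // then verify the result on the host. Illustrative only.
    // Compile with an OpenACC compiler, e.g. "nvc++ -acc" or "g++ -fopenacc".
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 16;
        std::vector<float> x(n, 2.0f), y(n, 1.0f);
        float* px = x.data();
        float* py = y.data();
        const float a = 3.0f;

        #pragma acc parallel loop copyin(px[0:n]) copy(py[0:n])
        for (int i = 0; i < n; ++i)
            py[i] = a * px[i] + py[i];

        // Host-side verification: every element must equal a*2 + 1 = 7.
        int errors = 0;
        for (int i = 0; i < n; ++i)
            if (py[i] != 7.0f) ++errors;

        if (errors == 0) std::printf("PASS\n");
        else std::printf("FAIL: %d mismatches\n", errors);
        return 0;
    }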
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP ...
This paper presents a multi-tiered approach to designing learning experiences in HPC for undergraduate students that significantly reinforces comprehension of CS topics while introducing new concepts in parallel and distributed computing. The paper details the experience of students working on the design, construction, and testing of a computing cluster, including budgeting, hardware purchase and setup, software installation and configuration, interconnection networks, communication, benchmarking, and running parallel code using MPI and OpenMP. The case study of building a relatively low-cost, small-scale computing cluster, which can be used as a template for CS senior projects or independent studies, also yielded an opportunity to involve students in the creation of teaching tools for parallel computing at many levels of the CS curriculum.
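A minimal hybrid MPI + OpenMP program of the kind students might run on such a cluster is sketched below; it estimates pi by numerical integration and is illustrative only, not code from the paper.

    // Hybrid MPI + OpenMP: each rank integrates part of the interval for pi,
    // using OpenMP threads within the node. Illustrative only.
    // Build e.g.: mpicxx -fopenmp pi.cpp -o pi && mpirun -np 4 ./pi
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 100000000;     // total quadrature points
        const double h = 1.0 / n;
        double local = 0.0;

        // Strided decomposition across ranks; OpenMP parallelizes within a rank.
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < n; i += size) {
            double x = (i + 0.5) * h;
            local += 4.0 / (1.0 + x * x);
        }

        double pi = 0.0;
        MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("pi ~= %.12f\n", pi * h);

        MPI_Finalize();
        return 0;
    }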
Contingency analysis (CA) is one of the critical tools of static security assessment (SSA). It is used to forecast the operating states of a power system under one or more outages of generators, transmission lines, transformers, etc. Performing SSA requires repetitive load flow analyses to obtain the bus voltages, bus injections, and line flows for each possible outage. Such repetitive load flow analysis demands a huge computational effort, which calls for efficient system modelling for faster load flow solutions, parallel programming, and high-performance computing (HPC). In this paper, N-1-1 CA is analysed using the fast decoupled load flow (FDLF) with a strategy for screening and ranking the catastrophic contingencies. The paper explores a computationally efficient method to analyze the severity and ranking of N-1-1 contingencies for large power system SSA. The performance of the FDLF-based SSA method is demonstrated on the standard IEEE 14-bus and 118-bus systems.
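One common textbook approach to contingency screening and ranking is to compute a line-overload performance index for each outage and sort by severity. The sketch below uses made-up post-contingency flows and a generic index; it is not the paper's exact formulation, which is not given in the abstract.

    // Generic contingency screening sketch: rank each outage by a line-overload
    // performance index PI = sum_l (P_l / P_l_max)^(2m). Hypothetical data.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Contingency {
        std::string label;               // e.g. the outaged element
        std::vector<double> flow;        // post-contingency line flows (MW)
        std::vector<double> limit;       // line ratings (MW)
    };

    double performance_index(const Contingency& c, int m = 1) {
        double pi = 0.0;
        for (size_t l = 0; l < c.flow.size(); ++l)
            pi += std::pow(c.flow[l] / c.limit[l], 2 * m);
        return pi;
    }

    int main() {
        std::vector<Contingency> cases = {
            {"line 1-2 out", {90, 40, 70}, {100, 80, 100}},
            {"line 2-3 out", {60, 95, 30}, {100, 80, 100}},
            {"line 4-5 out", {50, 45, 55}, {100, 80, 100}},
        };

        // Higher PI means a more severe contingency; sort descending.
        std::sort(cases.begin(), cases.end(), [](const auto& a, const auto& b) {
            return performance_index(a) > performance_index(b);
        });

        for (const auto& c : cases)
            std::printf("%-14s PI = %.3f\n", c.label.c_str(), performance_index(c));
        return 0;
    }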
The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed precision arithmetic, as opposed to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores the optimization of mixed precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed for delivering portable performance on both AMD and NVIDIA GPUs at extreme scale.
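The core idea that HPL-AI exercises is to perform the expensive factorization and solve in low precision and then recover double-precision accuracy through iterative refinement. The toy sketch below uses float as the low precision (standing in for FP16 on GPUs) and refines in double; it is illustrative only and far from the benchmark's blocked, distributed implementation, which factors once and reuses the factors.

    // Mixed-precision iterative refinement: solve in float, refine in double.
    #include <cstdio>
    #include <vector>

    // Solve A x = b in float with Gaussian elimination (no pivoting; adequate
    // for the diagonally dominant toy matrix below).
    std::vector<float> solve_low_precision(std::vector<float> A,
                                           std::vector<float> b, int n) {
        for (int k = 0; k < n; ++k) {
            for (int i = k + 1; i < n; ++i) {
                float m = A[i * n + k] / A[k * n + k];
                for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
                b[i] -= m * b[k];
            }
        }
        std::vector<float> x(n);
        for (int i = n - 1; i >= 0; --i) {
            float s = b[i];
            for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
            x[i] = s / A[i * n + i];
        }
        return x;
    }

    int main() {
        const int n = 4;
        std::vector<double> A = {10, 1, 2, 0,
                                  1, 12, 0, 3,
                                  2, 0, 9, 1,
                                  0, 3, 1, 11};
        std::vector<double> b = {13, 16, 12, 15};

        std::vector<float> Af(A.begin(), A.end());   // low-precision copy of A
        std::vector<double> x(n, 0.0);

        // Refinement loop: residual in double, correction solved in float.
        for (int iter = 0; iter < 5; ++iter) {
            std::vector<double> r(n);
            for (int i = 0; i < n; ++i) {
                r[i] = b[i];
                for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
            }
            std::vector<float> rf(r.begin(), r.end());
            std::vector<float> d = solve_low_precision(Af, rf, n);
            for (int i = 0; i < n; ++i) x[i] += static_cast<double>(d[i]);
        }

        for (int i = 0; i < n; ++i) std::printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }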