ISBN: (Print) 9781450368186
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and inefficient utilization of network and/or GPU resources. In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact with the network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.
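The Jacobi 2D stencil cited above is a structured halo-exchange workload: each sweep averages a point's four neighbors, and only the boundary (halo) rows and columns must be communicated between processing elements. A minimal single-process C++ sketch of one sweep (function and grid names are illustrative, not from the paper; in GIO the halo data would arrive via intra-kernel OpenSHMEM operations rather than being local):

```cpp
#include <vector>
#include <cstddef>

// One Jacobi sweep over the interior of an (n+2) x (n+2) grid stored
// row-major with a one-cell halo. In a distributed run, the halo cells
// would be filled by neighbor exchange before each sweep.
void jacobi_sweep(const std::vector<double>& in, std::vector<double>& out,
                  std::size_t n) {
    const std::size_t stride = n + 2;
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= n; ++j)
            out[i * stride + j] = 0.25 * (in[(i - 1) * stride + j] +
                                          in[(i + 1) * stride + j] +
                                          in[i * stride + j - 1] +
                                          in[i * stride + j + 1]);
}
```

Because the halo exchange sits on the critical path of every sweep, moving its initiation from the host into the kernel, as GIO does, directly shortens each iteration.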
ISBN: (Print) 9781450362306
The automotive industry is embracing new challenges to deliver self-driving cars, and this in turn requires increasingly complex hardware and software. Software developers are leveraging artificial intelligence, and in particular machine learning, to deliver the capabilities required for an autonomous vehicle to operate. This has driven automotive systems to become increasingly heterogeneous, offering multi-core processors and custom co-processors capable of performing the intense algorithms required for artificial intelligence and machine learning. These new processors can be used to vastly speed up common operations used in AI (Artificial Intelligence) and machine learning.
The R-Car V3H system-on-chip (SoC) from the Renesas Autonomy™ platform for ADAS (Advanced Driver Assistance Systems) and automated driving supports Level 3 and above (as defined by SAE's automation level definitions). It follows the heterogeneous IP concept of the Renesas Autonomy™ platform, giving the developer the choice of high-performance computer vision at low power consumption, as well as the flexibility to implement the latest algorithms such as those used in machine learning.
By examining the architecture of the R-Car hardware we can understand how it differs from HPC and desktop heterogeneous systems, and how it can be mapped to the SYCL and OpenCL programming models. When both power consumption and performance are important, as is the case in the automotive industry, implementing OpenCL and SYCL on these hardware platforms requires a balanced approach. The memory capacity and layout must be used in the most effective way to build a pipeline that provides the best throughput. The R-Car hardware provides DMA and on-chip memory, which are used to facilitate efficient data transfer on the device, and its memory hierarchy maps efficiently to OpenCL.
The R-Car hardware also offers many fixed-function IP blocks, each performing …
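The balanced pipeline the abstract describes, overlapping data movement with compute via DMA and on-chip memory, follows the standard double-buffering pattern. A host-side C++ sketch under stated assumptions (the tile layout, function names, and the plain-copy stand-in for the DMA engine are illustrative; the actual R-Car DMA API is not shown):

```cpp
#include <vector>
#include <cstddef>

// Double-buffered tile pipeline: while the accelerator works on one
// buffer, the next tile is staged into the other. The staging copy here
// stands in for a DMA transfer into on-chip memory; the "compute" is a
// simple sum so the pattern stays self-contained.
double process_tiles(const std::vector<std::vector<double>>& tiles) {
    std::vector<double> buf[2];
    double acc = 0.0;
    if (tiles.empty()) return acc;
    buf[0] = tiles[0];                        // stage the first tile
    for (std::size_t t = 0; t < tiles.size(); ++t) {
        if (t + 1 < tiles.size())
            buf[(t + 1) % 2] = tiles[t + 1];  // "DMA" the next tile
        for (double v : buf[t % 2])           // compute on current tile
            acc += v;
    }
    return acc;
}
```

On hardware with an asynchronous DMA engine, the staging copy and the compute loop run concurrently, which is what keeps the pipeline's throughput bounded by the slower of the two stages rather than their sum.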
In this paper, we review the background and the state of the art of the distributed computing software stack. We aim to provide the readers with a comprehensive overview of this area by supplying a detailed big-picture of the latest technologies. First, we introduce the general background of distributed computing and propose a layered top-bottom classification of the latest available software. Next, we focus on each abstraction layer, i.e. Application Development (including Task-based Workflows, Dataflows, and Graph Processing), Platform (including Data Sharing and Resource Management), Communication (including Remote Invocation, Message Passing, and Message Queuing), and Infrastructure (including Batch and Interactive systems). For each layer, we give a general background, discuss its technical challenges, review the latest programming languages, programming models, frameworks, libraries, and tools, and provide a summary table comparing the features of each alternative. Finally, we conclude this survey with a discussion of open problems and future directions. (C) 2021 Elsevier Inc. All rights reserved.