ISBN (print): 9798350364613; 9798350364606
AMD AI Engines (AIEs) extend the design space and open up new options for coarse-grained processing in reconfigurable accelerators. Pure FPGA designs for machine learning often struggle to compete with the high clock frequencies of GPUs for data-intensive workloads with only limited control flow. Having AIEs available on-chip with an FPGA fabric allows for low-latency co-processing and permits parts of an application to be placed on the most suitable kind of processing unit. Many data-heavy workloads, particularly in the AI domain, benefit from data streaming. With TaPaSCo-AIE, we present a framework for heterogeneous systems centered around data streams. Our framework focuses on AMD Versal devices and incorporates AI Engines and 100G networking. We demonstrate the efficient use of TaPaSCo-AIE in a real-world evaluation based on a neural network, achieving significant performance improvements over CPUs and even exceeding the performance of an AMD GPU.
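The abstract does not show TaPaSCo-AIE's host API, so the following is a purely illustrative Python sketch of the stream-centric composition idea: two pipeline stages, standing in for a PL (FPGA-fabric) kernel and an AIE kernel, connected by queues that model hardware streams. All names and the toy computation are hypothetical.

```python
# Illustrative only: models stream-centric PL/AIE co-processing in plain
# Python. Hardware streams become queues; the PL and AIE kernels become
# pipeline stages. None of these names come from the TaPaSCo-AIE API.
import queue
import threading

def pl_preprocess(inp: queue.Queue, out: queue.Queue) -> None:
    """Stand-in for an FPGA-fabric (PL) kernel: normalizes each chunk."""
    while (chunk := inp.get()) is not None:
        out.put([x / 255.0 for x in chunk])
    out.put(None)  # propagate end-of-stream downstream

def aie_infer(inp: queue.Queue, out: queue.Queue) -> None:
    """Stand-in for an AI Engine kernel: a toy 'inference' reduction."""
    while (chunk := inp.get()) is not None:
        out.put(sum(chunk) / len(chunk))
    out.put(None)

if __name__ == "__main__":
    s0, s1, s2 = queue.Queue(), queue.Queue(), queue.Queue()
    threading.Thread(target=pl_preprocess, args=(s0, s1)).start()
    threading.Thread(target=aie_infer, args=(s1, s2)).start()
    for chunk in ([0, 128, 255], [64, 64, 64]):  # host streams data in
        s0.put(chunk)
    s0.put(None)
    while (result := s2.get()) is not None:
        print(result)
```

The point of the pattern is that neither stage waits for the whole dataset: data flows through both processing units concurrently, which is what makes low-latency co-processing between the fabric and the AIE array attractive.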
ISBN (print): 9783031396977; 9783031396984
Near-storage data processing and computational storage have recently received considerable attention from industry as energy- and cost-efficient ways to improve system performance. This paper introduces a computational-storage solution to enhance the performance and energy efficiency of an AI training system, especially for training a deep learning model with large datasets or high-dimensional data. Our system leverages dimensionality reduction effectively by offloading its operations to computational storage in a systematic manner. Our experimental results show that it can reduce the training time of a deep learning model by over 40.3% while lowering energy consumption by 38.2%.
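A minimal sketch of the offloading idea: reduce dimensionality before the data ever reaches the training host, so only low-dimensional projections cross the I/O path. The abstract does not name a specific reduction method, so PCA is my stand-in, and the function merely simulates what would run inside the storage device.

```python
# Schematic of near-storage dimensionality reduction. In the paper's
# system this step runs inside computational storage; here the function
# call simply stands in for that device-side work. PCA is an assumed
# stand-in method, not necessarily what the paper uses.
import numpy as np

def reduce_on_storage(X: np.ndarray, k: int) -> np.ndarray:
    """PCA via SVD -- imagine this executing in the storage device,
    so only k-dimensional projections travel to the training host."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 512))   # high-dimensional raw samples
Z = reduce_on_storage(X, k=32)       # what the host actually receives
print(X.nbytes / Z.nbytes)           # ~16x less data moved to the host
```

The bandwidth saving is the whole story: the host trains on Z instead of X, so both I/O time and the energy spent moving data shrink roughly with the reduction ratio.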
The growing need to perform neural network inference with low latency is giving rise to a broad spectrum of heterogeneous devices with deep learning capabilities. Therefore, obtaining the best performance from each d...
ISBN (print): 9798350387117; 9798350387124
With the wide adoption of deep neural network (DNN) models for various applications, enterprises and cloud providers have built deep learning clusters and increasingly deploy specialized accelerators, such as GPUs and TPUs, for DNN training jobs. To arbitrate cluster resources among multi-user jobs, existing schedulers fall short, either lacking fine-grained heterogeneity awareness or being hardly generalizable to various scheduling policies. To fill this gap, we propose a novel design of a task-level heterogeneity-aware scheduler, Hadar, based on an online optimization framework that can express other scheduling algorithms. Hadar leverages the performance traits of DNN jobs on a heterogeneous cluster, characterizes the task-level performance heterogeneity in the optimization problem, and makes scheduling decisions across both spatial and temporal dimensions. The primal-dual framework is employed, with our design of a dual subroutine, to solve the optimization problem and guide the scheduling design. Extensive trace-driven simulations with representative DNN models demonstrate that Hadar improves the average job completion time (JCT) by 3x over an Apache YARN-based resource manager used in production. Moreover, Hadar outperforms Gavel [1], the state-of-the-art heterogeneity-aware scheduler, by 2.5x in average JCT, shortens the queuing delay by 13%, and improves finish-time fairness (FTF) by 1.5%.
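The abstract names an online primal-dual framework with a custom dual subroutine but gives no details. The sketch below shows the generic primal-dual pattern for heterogeneity-aware allocation, not Hadar's actual subroutine: each accelerator type carries a dual "price", a task goes where its throughput minus the price is largest, and prices rise as capacity fills. The GPU names, throughput table, and price update are made up for illustration.

```python
# Generic online primal-dual allocation sketch (not Hadar's algorithm).
capacity = {"V100": 4, "P100": 4}          # GPUs per type (assumed)
used = {g: 0 for g in capacity}
price = {g: 0.0 for g in capacity}         # dual variables per type

# Task-level throughput of each job on each GPU type (made-up numbers
# standing in for the heterogeneity profile Hadar characterizes).
throughput = {"resnet": {"V100": 1.0, "P100": 0.6},
              "bert":   {"V100": 1.0, "P100": 0.3}}

def schedule(job):
    # Primal step: pick the type with the largest price-adjusted gain.
    best = max(capacity, key=lambda g: throughput[job][g] - price[g])
    if used[best] >= capacity[best] or throughput[job][best] <= price[best]:
        return None                        # not worth running anywhere now
    used[best] += 1
    # Dual step: raise the price as the type fills (multiplicative update).
    price[best] = 0.1 * 2 ** used[best]
    return best

for job in ["bert", "resnet", "resnet", "bert", "bert"]:
    print(job, "->", schedule(job))
```

Running it shows the prices steering later jobs away from the fast-but-crowded type, which is the mechanism a primal-dual scheduler uses to trade per-job speed against cluster-wide efficiency.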
A distributed denial-of-service (DDoS) attack involves overwhelming a network with a large amount of traffic that aims to disrupt its normal functioning. DDoS attacks can cause a variety of problems, such...
The introduction of the Internet of Things (IoT) has driven the growth of complex and intelligent solutions, which has in turn expanded the variety of innovative services and the processing capacity that a...
This article presents a graphics processing unit (GPU) scheduling scheme that maximizes the exploitation of data locality in deep neural networks (DNNs). Convolution is one of the fundamental operations used in DNNs and accounts for more than 90% of the total execution time. To leverage massive thread-level parallelism (TLP) in a GPU, deeply nested convolution loops are lowered (or unrolled) into large matrix multiplication, which trades memory capacity and bandwidth for TLP augmentation. A large workspace matrix is split into tiles of general matrix multiplication (GEMM) and concurrently executed by many thread blocks. Notably, the workspace is filled with duplicated data originating from the same sources in the input feature map during the lowering process. However, conventional GPU scheduling is oblivious to data duplication patterns in the workspace, and thread blocks are assigned to streaming multiprocessors (SMs) irrespective of data similarity between GEMM tiles. Such scheduling misses a significant opportunity to exploit data locality manifested in the DNN convolution. This article proposes a GPU scheduling technique called Locality-Aware Scheduling (LAS) that i) identifies which thread blocks share the largest amount of identical data based on the lowered patterns of a DNN convolution and ii) allocates such thread blocks showing the greatest data similarity to the same SM. In this way, small caches in SMs can efficiently utilize the data locality of the DNN convolution. Experimental results show that LAS with tensor cores achieves 20.1% performance improvements on average with 14.8% increases in L1 cache hit rates.
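The duplication LAS exploits is easy to make concrete. Lowering (commonly called im2col) copies each overlapping receptive field into its own workspace row, so rows produced for adjacent output pixels share most of their entries. A toy single-channel example:

```python
# Demonstrates the data duplication created by lowering a convolution
# into a GEMM workspace (im2col), which is the locality LAS exploits.
import numpy as np

def im2col(x: np.ndarray, k: int) -> np.ndarray:
    h, w = x.shape
    rows = [x[i:i + k, j:j + k].ravel()           # one k*k patch per row
            for i in range(h - k + 1) for j in range(w - k + 1)]
    return np.stack(rows)                          # one row per output pixel

x = np.arange(16).reshape(4, 4)                    # 4x4 input feature map
ws = im2col(x, k=3)                                # 4 rows x 9 cols workspace
print(ws)
print("unique source values:", np.unique(ws).size, "of", ws.size)
# -> 16 of 36: over half the workspace slots repeat input data, and rows
#    for horizontally adjacent output pixels overlap in 6 of 9 entries.
```

Thread blocks computing GEMM tiles over overlapping rows therefore read largely identical data; co-locating them on the same SM lets the small L1 cache serve those repeated reads, which is exactly the scheduling decision LAS makes.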
ISBN (print): 9781713871088
We consider a distributed online convex optimization problem where streaming data are distributed among computing agents over a connected communication network. Since the data are high-dimensional or the network is large-scale, communication load can be a bottleneck for the efficiency of distributed algorithms. To tackle this bottleneck, we apply a state-of-the-art data compression scheme to the fundamental GD-based distributed online algorithms. Three algorithms with difference-compressed communication are proposed for full-information feedback (DC-DOGD), one-point bandit feedback (DC-DOBD), and two-point bandit feedback (DC-DO2BD), respectively. We obtain regret bounds explicitly in terms of time horizon, compression ratio, decision dimension, agent number, and network parameters. Our algorithms are proved to be no-regret and match the same regret bounds, w.r.t. time horizon, as their uncompressed versions for both convex and strongly convex losses. Numerical experiments are given to validate the theoretical findings and illustrate that the proposed algorithms can effectively reduce the total transmitted bits for distributed online training compared with the uncompressed baseline.
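The difference-compression mechanism can be shown in isolation: instead of transmitting its full iterate, an agent sends a compressed difference against a reference state that its neighbors track identically, so both sides stay consistent while only a fraction of the entries cross the network. Top-k is my stand-in compressor; the abstract does not specify which compression scheme the paper uses.

```python
# Difference-compressed communication in isolation (one agent, one step).
# Top-k sparsification is an assumed stand-in for the paper's compressor.
import numpy as np

def top_k(v: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries of v."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

d = 1000
x = np.random.default_rng(1).normal(size=d)   # current local iterate
x_ref = np.zeros(d)                           # reference both sides track

delta = top_k(x - x_ref, k=50)                # only 50 of 1000 entries sent
x_ref += delta                                # receiver applies same update
print("relative error:", np.linalg.norm(x - x_ref) / np.linalg.norm(x))
```

Over repeated rounds the reference chases the iterate, so the compression error stays bounded; that bounded error is what lets the compressed algorithms retain the regret order of their uncompressed counterparts.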
ISBN (print): 9781713871088
This paper considers the problem of recovering the policies of multiple interacting experts by estimating their reward functions and constraints, where the demonstration data of the experts are distributed to a group of learners. We formulate this problem as a distributed bi-level optimization problem and propose a novel bi-level "distributed inverse constrained reinforcement learning" (D-ICRL) algorithm that allows the learners to collaboratively estimate the constraints in the outer loop and learn the corresponding policies and reward functions in the inner loop from the distributed demonstrations through intermittent communications. We formally guarantee that the distributed learners asymptotically achieve consensus on a point belonging to the set of stationary points of the bi-level optimization problem. Simulations are conducted to validate the proposed algorithm.
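A schematic of the bi-level structure only, under heavy assumptions: each learner runs a local inner-loop update against its own demonstrations, then intermittently averages its outer-loop constraint estimate with its neighbors (a consensus step). The loss, update rule, and mixing matrix below are placeholders, not the paper's D-ICRL objectives.

```python
# Skeleton of bi-level learning with intermittent consensus.
# The inner update is a dummy stand-in for policy/reward learning.
import numpy as np

rng = np.random.default_rng(0)
n_learners, dim = 4, 3
theta = rng.normal(size=(n_learners, dim))            # constraint estimates
W = np.full((n_learners, n_learners), 1 / n_learners)  # doubly stochastic mixing

def inner_update(th):
    # Placeholder for learning against local demonstrations:
    # a gradient step pulling toward a dummy local optimum at 1.0.
    return th - 0.1 * (th - 1.0)

for t in range(200):
    theta = np.array([inner_update(th) for th in theta])
    if t % 5 == 0:                  # intermittent communication rounds
        theta = W @ theta           # outer-loop consensus step
print(theta.round(3))               # all rows agree: consensus reached
```

The interplay shown here, local descent interleaved with occasional averaging, is the generic mechanism behind the paper's consensus guarantee, though the actual inner and outer objectives are far richer.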
This paper investigates the formation tracking control problem for autonomous surface vehicles (ASVs) with dynamic uncertainties and external disturbances under secure and privacy-preserving interaction. An innovative hierarchical information security control (HISC) framework is proposed to solve the estimation problem in a secure and privacy-preserving way and the formation tracking problem for ASVs. The information processing layer of the HISC framework focuses on the distributed secure and privacy-preserving estimator (DSPE) algorithm under sampled-data interaction, while the local control layer provides a robust neuro-adaptive controller, requiring no model information, for the formation of networked ASVs under communication delay. Through systematic analysis, sufficient conditions are given for guaranteeing the stability and convergence of the studied closed-loop system. Finally, simulation results are presented to verify the effectiveness of the proposed control scheme.
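For intuition only: one standard way to make a distributed consensus estimator privacy-preserving is to have each agent perturb what it shares with geometrically decaying noise, hiding its raw state from neighbors while the network still converges. This is a generic construction, not the paper's DSPE algorithm, and the weights and decay rate below are assumptions.

```python
# Generic privacy-preserving consensus sketch (not the paper's DSPE).
# Each vehicle shares a noise-masked estimate; the noise decays, so
# consensus is still reached, but raw states are never exposed.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=5)            # each ASV's local estimate
A = np.full((5, 5), 0.2)          # doubly stochastic averaging weights

for t in range(100):
    noise = 0.5 ** t * rng.normal(size=5)  # geometrically decaying masks
    shared = x + noise                     # neighbors never see raw x
    x = A @ shared                         # sampled-data consensus update
print(x.round(4))                 # all agents agree on a (slightly noisy) average
```

The decaying mask is the key design choice: early rounds, when states are most revealing, get the strongest protection, while the shrinking noise leaves the asymptotic consensus value essentially intact.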