Field-programmable gate arrays (FPGAs) utilize multiple programmable elements and non-programmable blocks. After synthesizing an input Hardware Description Language (HDL) design into a circuit, optimizations are used to di...
ISBN:
(Print) 9781450361378
In this paper we describe Xilinx's Versal Adaptive Compute Acceleration Platform (ACAP). ACAP is a hybrid compute platform that tightly integrates traditional FPGA programmable fabric, software-programmable processors, and software-programmable accelerator engines. ACAP improves over the programmability of traditional reconfigurable platforms by introducing newer compute models in the form of software-programmable accelerators and by separating the data-movement architecture from the compute architecture. The Versal architecture includes a host of new capabilities, including a chip-pervasive programmable Network-on-Chip (NoC), Imux registers, a compute shell, more advanced SSIT, adaptive deskew of global clocks, faster configuration, and other new programmable elements, as well as enhancements to the CLB and interconnect. We discuss these architectural developments and highlight their key motivations and differences in relation to traditional FPGA architectures.
ISBN:
(Print) 9781450356145
A frame object detection problem consists of two problems: one is a regression problem for spatially separated bounding boxes, and the second is the classification of the associated objects, both within a real-time frame rate. Object detection is widely used in embedded systems such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. This paper implements the YOLO (You Only Look Once) object detector, which is fast and accurate, on an FPGA. It is based on a deep convolutional neural network (CNN), which dominates both the performance and the area. However, an object detector based on a CNN consists of a bounding-box prediction (regression) and a class estimation (classification), so a conventional fully binarized CNN fails to recognize objects in most cases. In this paper, we propose a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and parallel support vector regression (SVR) for both classification and localization. To our knowledge, this is the first time binarized CNNs have been successfully used in object detection. We implement a pipeline-based architecture for the lightweight YOLOv2 on the Xilinx ZCU102 board, which carries a Xilinx Zynq UltraScale+ MPSoC. The implemented object detector achieved 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, dissipated 1.1 times more power, and its performance per watt was 158.9 times better. Compared with the NVIDIA Pascal embedded GPU, it was 27.5 times faster, dissipated 1.5 times less power, and its performance per watt was 42.9 times better. Thus, our method is suitable as a frame object detector for an embedded vision system.
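As a concrete illustration of why the parallel SVR head maps well to hardware: at inference time a bank of linear SVRs is just one matrix product, so every regressor for every grid cell can run concurrently. The sketch below is a minimal Python/NumPy rendering under assumed shapes (a 13x13 grid, 256-dimensional features, 5 boxes with 4 coordinates and 20 classes); it is not the authors' implementation.

import numpy as np

# Hypothetical shapes: 13x13 grid, 256-dim features per cell,
# 5 boxes x (4 box coordinates + 20 class scores) outputs per cell.
GRID, FEAT, OUT = 13 * 13, 256, 5 * (4 + 20)

rng = np.random.default_rng(0)
features = rng.standard_normal((GRID, FEAT))  # stand-in for binarized-CNN features
W = rng.standard_normal((FEAT, OUT)) * 0.01   # stand-in for trained SVR weights
b = np.zeros(OUT)                             # SVR biases

# Every linear SVR is one dot product, so the whole bank is one matrix
# multiply -- the data parallelism a pipelined FPGA design can exploit.
predictions = features @ W + b                # shape (169, 120)
print(predictions.shape)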
ISBN:
(Print) 9781450356145
Three-dimensional convolutional neural networks (3D CNNs) are used effectively in many computer vision applications. Most previous work in this area has concentrated on designing and optimizing accelerators for 2D CNNs, with few attempts made to accelerate 3D CNNs on FPGAs. We find accelerating 3D CNNs on FPGAs to be challenging due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, in order to accelerate 2D and 3D CNNs within a uniform framework, we propose a uniform template-based architecture that uses templates based on the Winograd algorithm to enable fast development of 2D and 3D CNN accelerators. Furthermore, we develop a uniform analytical model to facilitate efficient design-space exploration of 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On an S2C VUS440, we achieve up to 1.13 TOPS and 1.11 TOPS under low resource utilization for VGG16 and C3D, respectively. End-to-end comparisons with CPU and GPU solutions demonstrate that our implementation of C3D achieves gains of up to 13x and 60x in performance and energy relative to a CPU solution, and a 6.4x energy-efficiency gain over a GPU solution.
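To make the Winograd template concrete: the 1D minimal-filtering algorithm F(2,3), a standard building block of such accelerators, computes two outputs of a 3-tap filter with 4 multiplications instead of 6. Below is a minimal NumPy sketch using the standard transform matrices; it is illustrative only, not the paper's templates.

import numpy as np

# Standard F(2,3) Winograd transforms: Y = A^T [(G g) * (B^T d)].
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # one input tile of 4 samples
g = np.array([0.5, 0.25, 0.125])     # 3-tap filter

winograd = AT @ ((G @ g) * (BT @ d))            # 4 elementwise multiplies
direct = np.convolve(d, g[::-1], mode='valid')  # reference: 6 multiplies
print(winograd, direct)                          # identical: [1.375 2.25]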
ISBN:
(Print) 9781450356145
We propose a framework to generate highly efficient accelerators for CNN inference on FPGAs. Our framework consists of multiple algorithmic optimizations for computation-complexity and communication-volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve the throughput of frequency-domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique is well known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves on OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP, used in conjunction with OaA, enables us to choose a fixed FFT size at design time and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency-domain loop-tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device through fast design-space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size, and external memory bandwidth in a device coefficient, and we identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable Verilog. We demonstrate the framework by generating high-throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughputs of 780.6 GOPS, 669.1 GOPS, and 552.1 GOPS for AlexNet, VGG16, and FCN-16s, respectively. These correspond to 6.8x (AlexNet) and 4.9x (VGG16) improvements over state-of-the-art implementations.
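For reference, the Overlap-and-Add baseline that CaP improves on works as follows in 1D: the input is cut into blocks of L = N - K + 1 samples so that each block's linear convolution with the K-tap kernel fits in a fixed N-point FFT, and the per-block results are summed at overlapping offsets. A minimal NumPy sketch with an assumed N = 16 (the paper's 2D, hardware version differs):

import numpy as np

def overlap_add(x, h, N=16):
    K = len(h)
    L = N - K + 1                    # samples consumed per block
    H = np.fft.rfft(h, N)            # kernel transformed once, reused for all blocks
    y = np.zeros(len(x) + K - 1)
    for start in range(0, len(x), L):
        block = x[start:start + L]
        Y = np.fft.rfft(block, N) * H          # pointwise product in frequency domain
        out_len = len(block) + K - 1           # linear-convolution length of this block
        y[start:start + out_len] += np.fft.irfft(Y, N)[:out_len]  # overlap-add the tail
    return y

x = np.random.default_rng(1).standard_normal(100)
h = np.array([0.25, 0.5, 0.25])
print(np.allclose(overlap_add(x, h), np.convolve(x, h)))  # True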
ISBN:
(Print) 9781450356145
Recently, significant accuracy improvements have been achieved for acoustic recognition systems by increasing the model size of Long Short-Term Memory (LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to inefficient designs on FPGAs due to limited on-chip resources. Previous work proposes a pruning-based compression technique to reduce the model size and thus speed up inference on FPGAs. However, the random nature of pruning transforms the dense matrices of the model into highly unstructured sparse ones, which leads to unbalanced computation and irregular memory accesses and thus hurts overall performance and energy efficiency. In contrast, we propose a structured compression technique that not only reduces the LSTM model size but also eliminates the irregularities in computation and memory accesses. This approach employs block-circulant rather than sparse matrices to compress the weight matrices, reducing the storage requirement from O(k^2) to O(k). The Fast Fourier Transform algorithm is utilized to further accelerate inference by reducing the computational complexity from O(k^2) to O(k log k). The datapath and activation functions are quantized to 16 bits to improve resource utilization. More importantly, we propose a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs. According to the experimental results, C-LSTM achieves up to 18.8X and 33.5X gains in performance and energy efficiency, respectively, compared with the state-of-the-art LSTM implementation under the same experimental setup, with very small accuracy degradation.
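The complexity reduction rests on a standard property of circulant matrices: a k x k circulant block is fully determined by its first column c, and multiplying it by a vector is a circular convolution, computable with FFTs in O(k log k). A minimal NumPy sketch of this trick (illustrative only, not the C-LSTM datapath):

import numpy as np

def circulant_matvec(c, x):
    # Circulant-by-vector product via FFT: O(k log k) instead of O(k^2),
    # and only the k-element first column c is stored instead of k^2 weights.
    return np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(x), len(c))

k = 8
rng = np.random.default_rng(2)
c, x = rng.standard_normal(k), rng.standard_normal(k)

# Dense reference: column j of the circulant matrix is c rolled down by j.
C = np.stack([np.roll(c, j) for j in range(k)], axis=1)
print(np.allclose(circulant_matvec(c, x), C @ x))  # True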
ISBN:
(Print) 9781450356145
Sparse matrix factorization using Stochastic Gradient Descent (SGD) is a popular technique for deriving latent features from observations. SGD is widely used for Collaborative Filtering (CF), itself a well-known machine learning technique for recommender systems. In this paper, we develop an FPGA-based accelerator, FASTCF, to accelerate the SGD-based CF algorithm. FASTCF consists of parallel, pipelined processing units that concurrently process distinct user ratings by accessing a shared on-chip buffer. We design FASTCF through a holistic analysis of the specific design challenges in accelerating SGD-based CF on FPGAs. Based on this analysis, we develop a bipartite graph processing approach with a novel 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of on-chip feature-vector data to significantly accelerate the processing of this bipartite graph. First, we develop a fast heuristic to partition the input graph into induced subgraphs; this enables FASTCF to efficiently buffer vertex data for reuse and completely hide communication overhead. Second, we partition all the edges of each subgraph into matchings to extract the maximum parallelism. Third, we schedule the execution of the edges inside each matching so as to reduce concurrent memory-access conflicts on the shared on-chip buffer. Compared with non-optimized baseline designs, the hierarchical partitioning approach results in up to 60x data-dependency reduction, 4.2x bank-conflict reduction, and 15.4x speedup. We implement FASTCF on a state-of-the-art FPGA and evaluate its performance using three large real-life datasets. Experimental results show that FASTCF sustains a high throughput of up to 217 billion floating-point operations per second (GFLOPS). Compared with state-of-the-art multi-core and GPU implementations, FASTCF demonstrates 13.3x and 12.7x speedups, respectively.
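For context, the per-rating SGD update that FASTCF parallelizes is short: given an observed rating r for user u and item v, refine the two latent feature vectors against the prediction error. Ratings whose users and items are all distinct (a matching, in the paper's terms) touch disjoint vectors and can safely update concurrently. A minimal NumPy sketch with assumed hyperparameters (lr, reg) and dimensions:

import numpy as np

def sgd_step(P, Q, u, v, r, lr=0.05, reg=0.02):
    e = r - P[u] @ Q[v]                      # prediction error for this rating
    pu, qv = P[u].copy(), Q[v].copy()        # use old values for both updates
    P[u] += lr * (e * qv - reg * pu)         # gradient step on user features
    Q[v] += lr * (e * pu - reg * qv)         # gradient step on item features
    return e

rng = np.random.default_rng(3)
P = rng.standard_normal((100, 16)) * 0.1     # user latent-feature vectors
Q = rng.standard_normal((50, 16)) * 0.1      # item latent-feature vectors
for _ in range(500):
    sgd_step(P, Q, u=7, v=3, r=4.0)
print(P[7] @ Q[3])                           # converges near the rating 4.0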
ISBN:
(Print) 9781450361378
Field-programmable gate arrays (FPGAs) are becoming a promising choice as a heterogeneous computing component for scientific computing now that floating-point-optimized architectures are being added to current FPGAs. Maturing high-level synthesis (HLS) tools, such as the Intel FPGA SDK for OpenCL, provide a streamlined design flow that facilitates developing parallel applications on FPGAs. In this paper, we evaluate and optimize the OpenCL implementations of three nuclear reactor simulation applications (XSBench, RSBench, and the SimpleMOC kernel) on a heterogeneous computing platform that consists of a general-purpose CPU and an FPGA. We introduce the applications and describe their OpenCL implementations and optimization methods on an Arria 10-based FPGA platform. Compared with the baseline kernel implementations, our optimizations increase the performance of the three kernels by factors of 35, 295, and 102, respectively. We compare the performance, power, and performance per watt of the three applications on an Intel Xeon 16-core CPU, an NVIDIA Tesla K80 GPU, and an Intel Arria 10 GX1150 FPGA. The performance per watt on the FPGA is competitive. For XSBench, the performance per watt on the FPGA is 1.43X higher than on the CPU and 2.58X lower than on the GPU. For RSBench, it is 3.6X higher than on the CPU and 5.8X lower than on the GPU. For the SimpleMOC kernel, it is 1.74X higher than on the CPU and 1.65X lower than on the GPU.
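For clarity on the comparison metric: performance per watt is simply measured throughput divided by measured power, and the ratios quoted above are ratios of that quantity across devices. A trivial Python sketch with made-up placeholder numbers (not the paper's measurements):

def perf_per_watt(throughput, power_watts):
    # Throughput in any fixed unit (e.g., lookups/s) divided by average power.
    return throughput / power_watts

# Placeholder values only -- a device with lower raw throughput can still
# win on performance per watt if its power draw is low enough.
cpu = perf_per_watt(throughput=100.0, power_watts=150.0)
fpga = perf_per_watt(throughput=50.0, power_watts=25.0)
print(f"FPGA/CPU perf-per-watt ratio: {fpga / cpu:.2f}x")  # 3.00x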
ISBN:
(Print) 9781450356145
General Matrix to Matrix multiplication (GEMM) is the cornerstone for a wide gamut of applications in high-performance computing (HPC), scientific computing (SC), and, more recently, deep learning. In this work, we present a customizable matrix-multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single-precision floating-point and reduced-precision workloads. Our framework supports arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software, and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching; interleaving and block-size control; and fused deep-learning-specific operations. The framework currently supports single-precision floating point (FP32); 16-, 8-, 4-, and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, and BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by the optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet, and ResNet), we illustrate that reduced-precision representations such as binary achieve the best performance, and that HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to 50x improvement in execution time compared with single-precision floating point, and runtime configuration options can improve the efficiency of certain layers in AlexNet by up to 4x, achieving an overall 1.3x improvement over the entire network.
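To illustrate the two central ideas, tiling and reduced precision, in software terms: a blocked GEMM accumulates one tile at a time (mirroring what a hardware template does spatially), and an INT8 path quantizes the operands, multiplies with wide accumulation, and rescales. The NumPy sketch below assumes a naive symmetric quantization scheme and block size; the framework's actual hardware datapath is far more elaborate.

import numpy as np

def gemm_blocked(A, B, block=64):
    # Tiled matrix multiply: accumulate C one (block x block) tile at a time.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(0, M, block):
        for j in range(0, N, block):
            for k in range(0, K, block):
                C[i:i+block, j:j+block] += A[i:i+block, k:k+block] @ B[k:k+block, j:j+block]
    return C

def quantize_int8(X):
    # Naive symmetric quantization (an assumption, not the framework's scheme).
    scale = np.abs(X).max() / 127.0
    return np.round(X / scale).astype(np.int8), scale

rng = np.random.default_rng(4)
A, B = rng.standard_normal((128, 128)), rng.standard_normal((128, 128))
Aq, sa = quantize_int8(A)
Bq, sb = quantize_int8(B)
# INT8 multiplies with 32-bit accumulation, rescaled back to floating point.
C_int8 = (Aq.astype(np.int32) @ Bq.astype(np.int32)) * (sa * sb)
print(np.abs(C_int8 - gemm_blocked(A, B)).max())  # small quantization error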
ISBN:
(Print) 9781450361378
The reconfigurability, energy efficiency, and massive parallelism of FPGAs make them one of the best choices for implementing efficient deep learning accelerators. However, state-of-the-art implementations seldom consider the balance between high computational throughput and the ability of the memory subsystem to support it. In this paper, we implement a framework on FPGAs that combines sparse Winograd convolution, clusters of small-scale systolic arrays, and a tailored recursive Z-Morton memory layout. We also provide an analytical model for the general Winograd convolution algorithm as a design reference. Experimental results on various CNN models show that our design achieves very high compute-resource utilization, 20x~30x energy efficiency, and more than 5x speedup compared with a dense implementation.
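For background on the layout choice: a Z-Morton (Z-order) index interleaves the bits of the row and column coordinates, so tiles that are close in 2D stay close in linear memory, which suits the recursive, tiled access pattern of Winograd convolution. A minimal Python sketch of the encoding (illustrative; the paper's tailored layout adds more structure on top of it):

def morton_encode(row, col, bits=16):
    # Interleave the bits of (row, col): row bits land at odd positions,
    # col bits at even positions of the linear index.
    z = 0
    for i in range(bits):
        z |= ((row >> i) & 1) << (2 * i + 1)
        z |= ((col >> i) & 1) << (2 * i)
    return z

# Prints the classic 4x4 Z-order pattern:
#  0  1  4  5
#  2  3  6  7
#  8  9 12 13
# 10 11 14 15
for r in range(4):
    print([morton_encode(r, c) for c in range(4)])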