ISBN (print): 9781450394178
Multi-bit-width neural networks offer a promising approach to high-performance yet energy-efficient edge computing, owing to their balance between algorithmic accuracy and hardware efficiency. To date, the FPGA has been one of the core hardware platforms for deploying neural networks. However, it remains difficult to fully exploit the dedicated digital signal processing (DSP) blocks in FPGAs for accelerating multi-bit-width networks. In this work, we develop a state-of-the-art multi-bit-width convolutional neural network (CNN) accelerator with a novel systolic-in-systolic dataflow and a single-DSP multiple-multiplication (SDMM) INT2/4/8 execution scheme. Multi-level optimizations are adopted to further improve performance, including a group-vector systolic array that maximizes circuit efficiency while minimizing systolic delay, and a differential neural architecture search (NAS) method for generating high-accuracy multi-bit-width networks. The proposed accelerator has been deployed on a Xilinx ZCU102, accelerating NAS-optimized VGG16 and ResNet18 networks as case studies. Average performance on the convolutional layers of VGG16 and ResNet18 is 1289 GOPS and 1155 GOPS, respectively. Throughput for running the full multi-bit-width VGG16 network is 870.73 GOPS at 250 MHz, exceeding all previous CNN accelerators on the same platform.
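The abstract does not detail the SDMM circuit, but the general idea behind packing several low-precision multiplications into one wide DSP multiply can be sketched as follows. This is a minimal illustration assuming unsigned INT4 operands and a hypothetical guard-bit width; the paper's actual INT2/4/8 scheme is not reproduced here.

```python
# Sketch of the idea behind single-DSP multiple multiplication (SDMM):
# two low-bit-width products come from one wide multiply by packing two
# activations into one operand, separated by guard bits. Generic DSP-packing
# illustration only, not the paper's exact circuit.

GUARD = 12  # guard bits: a0 * w fits in 8 bits for unsigned INT4, so 12 is safe

def sdmm_pair(a1, a0, w):
    """Compute (a1*w, a0*w) for unsigned 4-bit a1, a0, w with one multiply."""
    packed = (a1 << GUARD) | a0        # one wide operand holds both activations
    product = packed * w               # single wide multiplication (one DSP)
    p0 = product & ((1 << GUARD) - 1)  # low field: a0 * w
    p1 = product >> GUARD              # high field: a1 * w (no carry reaches it)
    return p1, p0

print(sdmm_pair(9, 13, 7))  # (63, 91)
```

Because the low product never exceeds the guard field, the two results can be sliced out of the single wide product without interference.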
Multi-bit-width convolutional neural networks (CNNs) maintain a balance between network accuracy and hardware efficiency, offering a promising approach to accurate yet energy-efficient edge computing. In this work, we develop a state-of-the-art multi-bit-width accelerator for NAS-optimized deep neural networks. To process multi-bit-width network inference efficiently, multi-level optimizations are proposed. First, a differential neural architecture search (NAS) method is adopted for generating high-accuracy multi-bit-width networks. Second, a hybrid Booth-based multi-bit-width multiply-add-accumulate (MAC) unit is developed for data processing. Third, a vector systolic array is proposed to accelerate matrix multiplications effectively; with vector-style systolic dataflow, both processing time and logic resource consumption are reduced compared with the classical systolic array. Finally, the proposed multi-bit-width CNN acceleration scheme has been deployed on a Xilinx ZCU102 FPGA platform. Average performance on the full NAS-optimized VGG16 network is 784.2 GOPS, and peak performance on the convolutional layers reaches 871.26 GOPS for INT8, 1676.96 GOPS for INT4, and 2863.29 GOPS for INT2, which is among the best results in previous CNN accelerator benchmarks.
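The hybrid Booth-based MAC unit is not specified in detail here, but the radix-4 (modified) Booth recoding commonly underlying such units can be sketched as follows. This is a generic software model of the recoding, assuming two's-complement operands, not the paper's hardware design.

```python
# Minimal sketch of radix-4 (modified) Booth multiplication: the multiplier y
# is recoded into digits in {-2, -1, 0, +1, +2}, so the product is formed from
# shifts and adds of x alone. Generic recoding model, not the paper's hybrid
# INT2/4/8 MAC circuit.

def booth_mul(x, y, bits=8):
    """Signed multiply via radix-4 Booth recoding of y (two's complement)."""
    yb = y & ((1 << bits) - 1)             # two's-complement bit pattern of y
    acc = 0
    for i in range(0, bits, 2):            # one Booth digit per 2 bits of y
        b_lo = (yb >> (i - 1)) & 1 if i > 0 else 0
        b0 = (yb >> i) & 1
        b1 = (yb >> (i + 1)) & 1
        digit = -2 * b1 + b0 + b_lo        # Booth digit in {-2,-1,0,1,2}
        acc += (digit * x) << i            # shift-and-add partial product
    return acc

print(booth_mul(-93, 57))  # -5301
```

Radix-4 recoding halves the number of partial products versus a plain shift-and-add multiplier, which is why Booth-based units map well to compact MAC hardware.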
ISBN (print): 9781450391498
Neural architecture search (NAS) optimized multi-bit-width convolutional neural networks (CNNs) maintain a balance between network performance and efficiency, offering a promising approach to accurate yet energy-efficient edge computing. In this work, we propose a high-throughput three-dimensional (3D) systolic accelerator for NAS-optimized CNNs, in which the input feature matrix, weight matrix, and output feature matrix travel vertically, horizontally, and perpendicularly through the systolic array, respectively. With 3D systolic dataflow, both processing time and logic resource consumption are reduced compared to the classical non-stationary systolic array. In addition, a Booth-based multi-bit-width (INT2/4/8) multiply-add-accumulate (MAC) unit is developed within the 3D systolic accelerator. Deployed on a Xilinx ZCU102 FPGA platform, peak performance on the convolutional layers reaches 2775 GOPS for INT2, 1650 GOPS for INT4, and 816 GOPS for INT8. Average performance on the full NAS VGG16 network is 647 GOPS.
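The timing that makes a systolic array work can be sketched with a small cycle-by-cycle model: skewing row i of A by i cycles and column j of B by j cycles makes the matching operand pair A[i][k], B[k][j] arrive at PE(i,j) at cycle t = i + j + k. This is a generic output-stationary 2D schedule for illustration, not the paper's 3D dataflow.

```python
# Cycle-by-cycle sketch of an output-stationary systolic matrix multiply.
# Each PE(i,j) holds C[i][j]; the input skew guarantees that at cycle t the
# operands A[i][k] and B[k][j] with k = t - i - j meet at that PE.

def systolic_matmul(A, B):
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]        # accumulators, one per PE
    for t in range(M + N + K - 2):         # cycles until the array drains
        for i in range(M):
            for j in range(N):
                k = t - i - j              # operand pair arriving this cycle
                if 0 <= k < K:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))  # [[19, 22], [43, 50]]
```

Every PE performs at most one MAC per cycle, and the whole product finishes in M + N + K - 2 cycles, which is the latency advantage systolic schedules trade against the skewing logic.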