IEEE Transactions on Circuits and Systems I: Regular Papers

A High Performance Multi-Bit-Width Booth Vector Systolic Accelerator for NAS Optimized Deep Learning Neural Networks

Authors: Huang, Mingqiang; Liu, Yucen; Man, Changhai; Li, Kai; Cheng, Quan; Mao, Wei; Yu, Hao

Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; School of Microelectronics, Southern University of Science and Technology, Shenzhen 518055, China; Department of Communications and Computer Engineering, Kyoto University, Kyoto 606, Japan

Published in: IEEE Transactions on Circuits and Systems I: Regular Papers (IEEE Trans. Circuits Syst. Regul. Pap.)

Year/Volume/Issue: 2022, Vol. 69, No. 9

Pages: 3619-3631

Subject classification: 0808 [Engineering - Electrical Engineering]; 0809 [Engineering - Electronic Science and Technology (Engineering or Science degree conferrable)]; 08 [Engineering]; 0835 [Engineering - Software Engineering]; 0812 [Engineering - Computer Science and Technology (Engineering or Science degree conferrable)]

Keywords: Multi-bit-width CNN; FPGA; CNN; NAS; systolic array

Abstract: Multi-bit-width convolutional neural networks (CNNs) balance network accuracy against hardware efficiency, making them a promising approach to accurate yet energy-efficient edge computing. In this work, we develop a state-of-the-art multi-bit-width accelerator for NAS-optimized deep learning neural networks. To process multi-bit-width network inference efficiently, optimizations are proposed at multiple levels. First, a differentiable Neural Architecture Search (NAS) method is adopted to generate high-accuracy multi-bit-width networks. Second, a hybrid Booth-based multi-bit-width multiply-add-accumulation (MAC) unit is developed for data processing. Third, a vector systolic array is proposed to accelerate matrix multiplications effectively; with the vector-style systolic dataflow, both processing time and logic resource consumption are reduced compared with a classical systolic array. Finally, the proposed multi-bit-width CNN acceleration scheme has been deployed on the Xilinx ZCU102 FPGA platform. Average performance when accelerating the full NAS-optimized VGG16 network is 784.2 GOPS, and peak performance on the convolutional layers reaches 871.26 GOPS for INT8, 1676.96 GOPS for INT4, and 2863.29 GOPS for INT2, which is among the best results in previous CNN accelerator benchmarks.
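The hybrid Booth-based MAC unit mentioned in the abstract builds its partial products with radix-4 (modified) Booth encoding. As a rough software illustration of that encoding step only (a minimal sketch of the standard algorithm, not the authors' hardware design; all names here are assumptions):

```python
# Minimal radix-4 (modified) Booth multiplier sketch.
# Illustrative only: the paper's hybrid Booth MAC is an RTL design;
# this function and its names are hypothetical, not from the paper.

def booth_radix4_multiply(m: int, r: int, bits: int = 8) -> int:
    """Multiply signed integers m * r via radix-4 Booth recoding of r.

    Both operands are assumed to fit in `bits`-bit two's complement.
    """
    # Recoding table: 3-bit overlapping window -> digit in {-2,-1,0,1,2}
    digits = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
              0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    ru = r & ((1 << bits) - 1)   # two's-complement bit pattern of r
    acc, prev = 0, 0             # prev holds the bit right of the window (b_-1 = 0)
    for i in range(0, bits, 2):
        window = (((ru >> i) & 0b11) << 1) | prev
        acc += digits[window] * m << i   # digit weight is 4**(i // 2) == 2**i
        prev = (ru >> (i + 1)) & 1
    return acc
```

Each window yields one partial product, so an n-bit multiply needs only n/2 of them; presumably this is the property that lets a Booth-based datapath be shared across INT8/INT4/INT2 operands, though the paper's actual sharing scheme is not reproduced here.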
