This article presents a communication-aware processing-in-memory deep neural network accelerator that implements an in-memory entry-counting scheme for low bit-width quantized multiply-and-accumulate (MAC) operations. To maintain good accuracy on ImageNet, the proposed design adopts a full-stack co-design methodology spanning algorithms, circuits, and architectures. At the algorithm level, an entry-counting-based MAC is proposed to fit the learned step-size quantization scheme and to intrinsically exploit the sparsity of both activations and weights. At the circuit level, content-addressable memory cells and multiplexed arrays are developed in the processing-in-memory macro. At the architecture level, the proposed design is compatible with different stationary dataflow mappings, further reducing memory accesses. An in-memory entry-counting silicon prototype, together with all of its peripheral circuits, is fabricated in 65-nm LP CMOS technology with an active area of 0.76 × 0.66 mm². The 7.36-Kb processing-in-memory macro with 128 search entries reduces the number of multiplications by 12.8×. The peak throughput is 3.58 GOPS, achieved at a clock rate of 143 MHz and a supply voltage of 1.23 V. The peak energy efficiency of the processing-in-memory macro is 11.6 TOPS/W, achieved at a clock rate of 40 MHz and a supply voltage of 1.01 V. Notably, the physical design of the entry-counting memory is completed in a standard digital placement-and-routing flow by augmenting the library with two dedicated memory cells. A 3-bit quantized ResNet-18 is evaluated on the ImageNet dataset, achieving a top-1 accuracy of 64.4%.
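The core algorithmic idea can be illustrated with a minimal sketch. This is not the authors' exact circuit-level scheme, only an assumed software analogue: because a low bit-width quantized weight takes one of a small set of levels, the dot product can accumulate (count and sum) all activations paired with each weight level first, then perform a single multiplication per level, while zero weights and zero activations are skipped for free.

```python
import numpy as np


def entry_counting_mac(activations, weights, levels):
    """Dot product via per-level accumulation: one multiply per weight level,
    instead of one multiply per entry. Hypothetical illustration only."""
    acc = 0
    multiplies = 0
    for w_level in levels:
        if w_level == 0:
            continue  # zero weights contribute nothing (weight sparsity)
        # Sum all activations whose weight matches this level; zero
        # activations are skipped as well (activation sparsity).
        partial = sum(a for a, w in zip(activations, weights)
                      if w == w_level and a != 0)
        if partial != 0:
            acc += w_level * partial  # a single multiply per weight level
            multiplies += 1
    return acc, multiplies


# Example: 3-bit symmetric weight levels, a 128-entry vector (the macro in
# the article searches 128 entries; the level set here is an assumption).
rng = np.random.default_rng(0)
levels = [-3, -2, -1, 0, 1, 2, 3]
w = rng.choice(levels, size=128)
a = rng.integers(0, 8, size=128)  # quantized activations
out, n_mul = entry_counting_mac(a.tolist(), w.tolist(), levels)
assert out == int(np.dot(a, w))   # same result as a dense dot product
assert n_mul <= len(levels) - 1   # far fewer than 128 multiplications
```

With at most one multiplication per nonzero weight level rather than one per entry, a 128-entry search reduces the multiplication count by more than an order of magnitude, which is consistent in spirit with the 12.8× reduction reported for the fabricated macro.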