ISBN:
(Print) 9798350322255
In-memory computing (IMC) has emerged as a promising paradigm for energy-, throughput-, and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full system performance on real workloads. This paper addresses the lack of a quantitative comparison of analog in-memory computing (AIMC) and digital in-memory computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessment on different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may achieve better energy efficiency than DIMC on convolutional and pointwise layers, which can exploit high spatial unrolling. Conversely, DIMC with small macro sizes outperforms AIMC on depthwise layers, which offer limited spatial unrolling opportunities inside a macro.
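The spatial-unrolling argument can be sketched with a toy utilization model (a hypothetical simplification for illustration, not the paper's validated framework): if input channels and kernel positions unroll along macro rows and output channels along columns, a depthwise layer leaves most of a large macro idle, while a pointwise layer fills it well.

```python
def macro_utilization(rows, cols, c_in, c_out, fx=1, fy=1):
    """Toy spatial-utilization model for one IMC macro (illustrative only).

    Assumes input channels and kernel positions (fx * fy) unroll over rows,
    and output channels unroll over columns.
    """
    used_rows = min(rows, c_in * fx * fy)
    used_cols = min(cols, c_out)
    return (used_rows * used_cols) / (rows * cols)

# Pointwise (1x1) layer with many channels fills a large 1024x256 macro well:
u_pointwise = macro_utilization(1024, 256, c_in=256, c_out=256)
# Depthwise 3x3 layer: only one input channel contributes per output channel,
# so just 9 rows are active and most of the macro sits idle:
u_depthwise = macro_utilization(1024, 256, c_in=1, c_out=256, fx=3, fy=3)
```

Under these assumptions the depthwise utilization is roughly 30× lower than the pointwise one, which is consistent with the abstract's claim that small macros suit depthwise layers.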
ISBN:
(Print) 9798350304206
In-memory computing (IMC) provides energy-efficient solutions for deep neural networks (DNNs). Most IMC designs for DNNs employ fixed-point precision. However, floating-point precision is still required for DNN training and complex inference models to maintain high accuracy. No floating-point IMC work in the literature immerses the floating-point computation into the weight memory storage. In this work, we propose a novel floating-point IMC macro with a configurable architecture that supports both standard 8-bit floating point (FP8) and 8-bit block floating point (BF8) with a shared exponent. The proposed FP-IMC macro, implemented in 28 nm CMOS, demonstrates 12.1 TOPS/W for FP8 precision and 66.6 TOPS/W for BF8 precision, improving energy efficiency beyond state-of-the-art FP IMC macros.
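The shared-exponent idea behind BF8 can be illustrated with a minimal software sketch (the function names and the sign-plus-7-bit-mantissa split are assumptions for illustration, not the paper's exact hardware format): one exponent is chosen for a whole block of values, and each element stores only a signed mantissa.

```python
import math

def to_block_fp(values, mant_bits=7):
    """Quantize a block of floats to a shared-exponent block format (sketch).

    The shared exponent is taken from the largest magnitude in the block,
    so every element fits in a signed mant_bits-wide mantissa.
    """
    max_abs = max(abs(v) for v in values)
    shared_exp = math.frexp(max_abs)[1] if max_abs != 0 else 0
    scale = 2.0 ** (shared_exp - mant_bits)
    # Clamp guards against rounding pushing the largest mantissa out of range.
    lo, hi = -(1 << mant_bits), (1 << mant_bits) - 1
    mantissas = [max(lo, min(hi, int(round(v / scale)))) for v in values]
    return shared_exp, mantissas

def from_block_fp(shared_exp, mantissas, mant_bits=7):
    """Reconstruct approximate floats from the shared-exponent block."""
    scale = 2.0 ** (shared_exp - mant_bits)
    return [m * scale for m in mantissas]
```

Because the exponent is stored once per block instead of once per element, the per-element MAC hardware reduces to fixed-point arithmetic on the mantissas, which is the usual motivation for block floating point in IMC macros.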
ISBN:
(Print) 9798350384406
In-memory computing (IMC) has emerged as a promising approach to address the von Neumann bottleneck in deep learning applications. This work proposes FP-ATM, a 6T SRAM-based all-digital design for multiply-accumulate (MAC) operations, featuring a flexible NOR adder tree for in-memory computing. The proposed macro is data-aware and supports input activations and weights in INT8 and BF16 number formats for convolutional neural networks. Multiple macros in different configurations can support neural networks with different topologies. The macro is based on bit-serial multiplication and parallel adder trees, an architecture that achieves massively parallel MAC operations with high energy efficiency and throughput. It achieves a peak energy efficiency of 267.7 TFLOPS/W at 0.65 V, 8.5× that of the state-of-the-art. The maximum frequency is 1.67 GHz, with a throughput of 2.67 GFLOPS/Kb at 0.9 V.
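The bit-serial-multiplication-plus-adder-tree scheme can be sketched as a small software model (an illustrative sketch assuming unsigned activations, not the FP-ATM hardware): each cycle, one activation bit per element gates its weight into an adder tree, and the running sum is shifted and accumulated.

```python
def bit_serial_mac(activations, weights, act_bits=8):
    """Bit-serial MAC sketch: activations are consumed one bit per cycle,
    MSB first. Each cycle an adder tree sums the weights whose activation
    bit is 1; the accumulator is shifted left and the partial sum added.
    Assumes unsigned integer activations for clarity.
    """
    acc = 0
    for bit_pos in reversed(range(act_bits)):
        # Adder tree: sum weights selected by the current activation bit.
        partial = sum(w for a, w in zip(activations, weights)
                      if (a >> bit_pos) & 1)
        acc = (acc << 1) + partial
    return acc
```

The latency grows with the activation bit width, but every cycle performs one addition per weight in parallel, which is what makes the adder-tree organization attractive for massively parallel MACs.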
ISBN:
(Print) 9798350377217; 9798350377200
This paper describes a bit-serial MAC accelerator building-block architecture optimized for low-power, cost-constrained systems. The architecture is implemented with both a mixed-signal and a fully digital approach to compare performance. The accelerators are implemented in 40 nm CMOS and achieve up to 12.5 TOPS/W at a 0.65 V supply voltage and 15 MHz clock frequency.