In this study, a very-large-scale integration implementation of convolutional neural network (CNN) inference for abnormal heartbeat detection was proposed. Four-lead electrocardiogram signals were used to detect abnormal heartbeat conditions, such as premature ventricular complexes. 1D CNNs and fully connected layers were utilised in the proposed chip to achieve high-speed, small-area, and high-accuracy arrhythmia detection. The proposed chip was implemented using a 90-nm complementary metal-oxide-semiconductor process and operated at 125 MHz with a 0.67 mm² core area. The power consumption was 4.18 mW at the high-speed operating frequency (125 MHz) and 3.79 µW at 10 kHz for low-power applications. The detection accuracy was 95.14% on the MIT-BIH arrhythmia database. Consequently, the properties of high speed, low power, small area, and high accuracy were established in the proposed accelerator chip.
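The layer types named in this abstract (1D convolutions over each ECG lead followed by a fully connected classifier) can be illustrated with a minimal NumPy sketch. This is not the paper's architecture; the kernel sizes, lead count, and two-class head are illustrative assumptions, and the data is random stand-in for real ECG windows.

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution of a single-channel signal with one kernel."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) + b for i in range(len(x) - k + 1)])

def relu(x):
    return np.maximum(x, 0)

# Toy 4-lead ECG window: 4 leads x 16 samples (random stand-in data).
rng = np.random.default_rng(0)
ecg = rng.standard_normal((4, 16))

# One length-3 kernel per lead, then a fully connected layer over all features.
kernels = rng.standard_normal((4, 3))
features = np.concatenate([relu(conv1d(ecg[l], kernels[l], 0.0)) for l in range(4)])

fc_w = rng.standard_normal((2, features.size))  # 2 classes: normal / abnormal
logits = fc_w @ features
prediction = int(np.argmax(logits))             # 0 or 1
```

The 1D structure keeps the multiply-accumulate pattern simple and regular, which is what makes this workload amenable to a small fixed-function accelerator.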
In the realm of convolutional neural networks (CNNs), convolution operations exhibit a high degree of data reuse. However, the current Weight Stationary (WS) dataflow cannot adequately exploit the potential local data reuse. In addition, the local data reuse of a Systolic Array (SA) implementing WS is limited by the array size. To address this issue, we propose a novel dataflow called Enhanced Weight Stationary (EWS) for SA-based CNN accelerators with a customized architecture to enhance data reuse. Our approach focuses on expanding the flexibility of weight mapping on Processing Elements (PEs) through the use of Weight Register Files (WRF) in the PE array. Additionally, by incorporating Activation Register Files (ARF) and Partial Sum Register Files (PRF), our accelerator enables convolutional reuse of input feature maps (ifmaps) and facilitates reuse of partial sums (psums) during channel-wise accumulation, which effectively reduces accesses to on-chip SRAM. Experimental results demonstrate the effectiveness of our CNN accelerator employing EWS, achieving 1.22–1.72× throughput and 1.35–2.48× energy efficiency over the WS dataflow as the array size varies from 16 to 64.
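The "convolutional reuse" this abstract exploits comes from overlapping sliding windows: a stride-1 K×K convolution touches each interior input pixel K×K times, so keeping activations in local register files (rather than refetching from SRAM) pays off. A small sketch, independent of the paper's architecture, counts that reuse directly:

```python
import numpy as np

def reuse_counts(H, W, K):
    """How many KxK stride-1 sliding windows cover each pixel of an HxW input."""
    counts = np.zeros((H, W), dtype=int)
    for i in range(H - K + 1):
        for j in range(W - K + 1):
            counts[i:i + K, j:j + K] += 1
    return counts

c = reuse_counts(8, 8, 3)
print(c.max())   # interior pixels are reused K*K = 9 times
print(c[0, 0])   # corner pixels are used only once
```

Every one of those reuses served from a register file instead of SRAM is an access saved, which is the mechanism behind the reported energy-efficiency gain.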
ISBN (print): 9798350330991; 9798350331004
Convolutional neural networks require a huge amount of computation in video applications. For some specific tasks, such as surveillance, differential frame convolution reuses inter-frame data and significantly reduces multiply-accumulate operations. However, there are still challenges in improving the energy efficiency of differential frame convolution on chips. First, differential frame convolution requires additional on-chip storage for reusing inter-frame data. Second, the post-processing of differential frame convolution involves more memory accesses and arithmetic logic operations, so a sparse working mode is of vital importance for the post-processing. In response to these challenges, this work proposes an on-chip fusion storage architecture for energy-efficient differential frame convolution and a pixel-level pipeline dataflow that supports feature sparsity. Simulation of our accelerator implemented in a 28 nm CMOS process shows a 3.09× improvement in energy efficiency over other state-of-the-art works.
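The mathematical basis of differential frame convolution is the linearity of convolution: conv(f_t) = conv(f_{t−1}) + conv(f_t − f_{t−1}), and in surveillance footage the difference frame is mostly zeros, so most of its MACs can be skipped. A minimal NumPy sketch (illustrative, not the paper's pipeline) verifies the identity and the sparsity:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D correlation of x with kernel k (stride 1)."""
    H, W = x.shape
    K = k.shape[0]
    return np.array([[np.sum(x[i:i + K, j:j + K] * k)
                      for j in range(W - K + 1)] for i in range(H - K + 1)])

rng = np.random.default_rng(1)
kernel = rng.standard_normal((3, 3))
prev = rng.standard_normal((8, 8))

# A "surveillance" frame: identical to the previous one except a small region.
curr = prev.copy()
curr[2:4, 2:4] += 1.0

diff = curr - prev                                  # mostly zeros -> sparse
out = conv2d(prev, kernel) + conv2d(diff, kernel)   # reuse previous result

assert np.allclose(out, conv2d(curr, kernel))
print(np.count_nonzero(diff), diff.size)            # 4 of 64 pixels changed
```

The two challenges the abstract names fall out of this identity: the previous result must be kept on chip (extra storage), and the final addition plus sparse bookkeeping is exactly the post-processing whose memory accesses must be kept cheap.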
ISBN (print): 9781450393225
State-of-the-art convolutional neural network (CNN) accelerators are typically communication-dominated architectures. To reduce the energy consumption of data accesses while maintaining high performance, researchers have adopted large amounts of on-chip register resources and proposed various methods to concentrate communication on on-chip register accesses. As a result, the on-chip register accesses become the energy bottleneck. To further reduce energy consumption, in this work we propose an in-SRAM accumulation architecture to replace the conventional register files and digital accumulators in the processing elements of CNN accelerators. Compared with existing in-SRAM computing approaches (which may not be targeted at CNN accelerators), the presented architecture not only realizes in-memory accumulation but also solves the structural contention problem that occurs frequently when embedding in-memory architectures into CNN accelerators. HSPICE simulation results based on a 45 nm technology demonstrate that with the proposed in-SRAM accumulator, the overall energy efficiency of a state-of-the-art communication-optimal CNN accelerator is increased by 29% on average.
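The operation being moved into memory here is the partial-sum update, which conventionally costs a register read, a digital add, and a register write per MAC. A behavioral model (a sketch of the access pattern only, not the paper's circuit) shows what an accumulating memory word collapses that sequence into:

```python
class AccumSRAM:
    """Behavioral model of an accumulating memory word: a write adds to the
    stored value instead of replacing it, merging the read-add-write of a
    conventional register-file accumulator into a single access."""

    def __init__(self, size):
        self.words = [0] * size
        self.accesses = 0

    def accumulate(self, addr, value):
        self.words[addr] += value   # one in-memory operation
        self.accesses += 1

# Channel-wise psum accumulation for one output pixel:
# conventional PE: read + write back = 2 register accesses per MAC;
# in-SRAM accumulation: 1 access per MAC.
mem = AccumSRAM(1)
for psum in [3, 5, -2]:
    mem.accumulate(0, psum)
print(mem.words[0], mem.accesses)  # 6 3
```

The structural-contention problem the abstract mentions arises because, in a real PE array, many such updates target the shared memory concurrently; this toy model deliberately ignores that timing aspect.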
ISBN (print): 9781665401746
Recently, various deep learning accelerators have been studied through dataflow structure improvement and memory access optimization. Among them, the encoder-decoder model is widely used in object detection and semantic segmentation, showing good performance. However, because of the deconvolution operation that produces a high-resolution feature map in the decoder, the memory access and computational complexity are higher than in the existing encoder-only structure, which is a major obstacle to the implementation of encoder-decoder accelerators. Most previous studies have focused only on the encoder part. This paper applies the fusion approach, which was effective for the convolution layers of the encoder, to the deconvolution of the decoder and shows the possibility of reducing processing time and hardware complexity.
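Why deconvolution inflates cost can be seen from its standard reduction to ordinary convolution: insert zeros between input pixels, pad, and convolve with the flipped kernel. The sketch below (a generic stride-2 transposed convolution, not the paper's accelerator) shows the output resolution growing while most of the upsampled operand is zeros, i.e. many MACs multiply by zero unless the hardware skips them:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D correlation of x with kernel k (stride 1)."""
    K = k.shape[0]
    return np.array([[np.sum(x[i:i + K, j:j + K] * k)
                      for j in range(x.shape[1] - K + 1)]
                     for i in range(x.shape[0] - K + 1)])

def deconv2d_stride2(x, k):
    """Transposed convolution (stride 2, no output padding) via
    zero insertion + full padding + convolution with the flipped kernel."""
    H, W = x.shape
    up = np.zeros((2 * H - 1, 2 * W - 1))
    up[::2, ::2] = x                      # insert zeros between pixels
    K = k.shape[0]
    up = np.pad(up, K - 1)                # full padding
    return conv2d(up, k[::-1, ::-1])      # flip kernel

x = np.ones((4, 4))
k = np.ones((3, 3))
y = deconv2d_stride2(x, k)
print(y.shape)  # (9, 9): output resolution roughly doubles
```

The larger feature map explains the higher memory traffic, and the zero-padded operand explains why a naive mapping wastes computation; fusing the deconvolution with adjacent layers avoids materializing the intermediate high-resolution map.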
Digit-serial arithmetic has emerged as a viable approach for designing hardware accelerators, reducing interconnections, area utilization, and power consumption. However, conventional methods suffer from performance and latency issues. To address these challenges, we propose an accelerator design using left-to-right (LR) arithmetic, which performs computations in a most-significant-digit-first (MSDF) manner, enabling digit-level pipelining. This leads to substantial performance improvements and reduced latency. The processing engine is designed for convolutional neural networks (CNNs) and includes low-latency LR multipliers and adders for digit-level parallelism. The proposed DSLR-CNN accelerator is implemented in Verilog, synthesized with the Synopsys Design Compiler using GSCL 45 nm technology, and evaluated on the AlexNet, VGG-16, and ResNet-18 networks. Results show significant improvements across key performance metrics, including response time, peak performance, power consumption, operational intensity, area efficiency, and energy efficiency. The peak performance (GOPS) of the proposed design is 4.37× to 569.11× higher than contemporary designs, and it achieves 3.58× to 44.75× higher peak energy efficiency (TOPS/W), outperforming conventional bit-serial designs.
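The key property of MSDF operation is that output digits begin to appear before all input digits have arrived, so successive arithmetic stages can overlap at digit granularity. The toy sketch below only illustrates that idea: it feeds the digits of a fractional operand most-significant first and refines a product estimate per digit. Real LR/online arithmetic units use redundant signed-digit representations with a small fixed online delay, which this simplification omits.

```python
from fractions import Fraction

def msdf_scale(digits, y, radix=10):
    """Feed the fractional digits of x most-significant-digit first and emit
    successively refined estimates of x * y; later digits only tighten the
    result, so a downstream stage can start before all digits arrive."""
    est = Fraction(0)
    for i, d in enumerate(digits, start=1):
        est += Fraction(d, radix ** i) * y   # contribution of digit i
        yield est

# x = 0.625 fed as digits 6, 2, 5; scale by y = 4.
estimates = list(msdf_scale([6, 2, 5], 4))
print([float(e) for e in estimates])  # [2.4, 2.48, 2.5]
```

Because each estimate is usable as soon as it is emitted, a chain of such units forms a digit-level pipeline, which is the latency advantage the abstract claims over conventional LSB-first bit-serial designs.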