As an emerging computing architecture, computing-in-memory (CIM) exhibits significant potential for energy efficiency and computing power in artificial intelligence applications. However, the intrinsic non-idealities of CIM devices, which manifest as random interference on the weights of a neural network, can significantly degrade inference accuracy. In this paper, we propose a novel training algorithm designed to mitigate the impact of weight noise. The algorithm minimizes the cross-entropy loss while concurrently refining the intermediate-layer feature representations to emulate those of an ideal, noise-free network. This dual-objective approach not only preserves the accuracy of the neural network but also enhances its robustness against noise-induced degradation. Empirical validation across several benchmark datasets confirms that our algorithm sets a new accuracy benchmark for CIM-enabled neural network applications. Compared with the most commonly used forward-noise training methods, our approach yields approximately a 2% accuracy boost on ResNet32 with the CIFAR-10 dataset at a weight noise scale of 0.2, and a minimum performance gain of 1% on ResNet18 with the ImageNet dataset under the same noise quantization conditions.
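The dual-objective loss described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the toy network sizes, the multiplicative-Gaussian weight-noise model, and the weighting factor `alpha` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny 2-layer network; W1, W2 stand in for a CIM-mapped model.
W1 = rng.normal(0.0, 0.5, (8, 4))
W2 = rng.normal(0.0, 0.5, (3, 8))

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, w1, w2):
    h = relu(w1 @ x)              # intermediate feature representation
    return h, softmax(w2 @ h)

def dual_objective_loss(x, label, noise_scale=0.2, alpha=0.5):
    # Clean (noise-free) pass: provides the target features to emulate.
    h_clean, _ = forward(x, W1, W2)
    # Noisy pass: multiplicative Gaussian noise models CIM weight disturbance.
    W1n = W1 * (1 + rng.normal(0.0, noise_scale, W1.shape))
    W2n = W2 * (1 + rng.normal(0.0, noise_scale, W2.shape))
    h_noisy, p = forward(x, W1n, W2n)
    ce = -np.log(p[label] + 1e-12)             # cross-entropy term
    feat = np.mean((h_noisy - h_clean) ** 2)   # feature-matching term
    return ce + alpha * feat

x = rng.normal(0.0, 1.0, 4)
loss = dual_objective_loss(x, label=1)
print(round(float(loss), 4))
```

In a real training loop, both terms would be backpropagated through the noisy pass so that the noisy network's features are pulled toward the clean ones.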
High-performance vertical-channel flash (HVF) memory cells were fabricated on the single-crystalline Si (c-Si) sidewalls of cylindrical deep wells in a c-Si substrate. To investigate the effects of the diameter of the cylindrical deep wells, namely the channel holes, on HVF cells, channel holes with diameters ranging from 65 nm to 260 nm were made. Memory gate stacks of SiO2/Al2O3/HfO2/Al2O3/TiN/W were formed by ozone oxidation followed by ALD, with deposition thicknesses of 1/5/7/8/2/150 nm, respectively. For devices with diameters equal to or greater than 150 nm, the electrical properties, such as Vt, SS, DIBL, and program/erase characteristics, are close. As expected, DIBL and SS improve as the diameter increases, owing to better gate control at larger diameters. However, large changes occurred for the devices with diameters of 90 nm and 65 nm. A simple cylinder-bulk model for vertical flash memory devices was presented to obtain an approximate analytical solution for the depletion width and to explain our experimental data. For the devices with a diameter of 150 nm, a high on/off current ratio of 10^7 and a relatively large memory window of 4.5 V were achieved. However, programming/erasing efficiency degraded as the hole diameter decreased.
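A cylinder-bulk depletion model of this kind can be illustrated numerically. The sketch below is an assumption-laden reconstruction, not the paper's model: it integrates the 1-D cylindrical Poisson equation for a uniformly doped p-type pillar of radius R depleted from the sidewall inward to radius a, then bisects for the depletion width; the doping level and surface potential are hypothetical.

```python
import numpy as np

Q = 1.602e-19                 # elementary charge (C)
EPS_SI = 11.7 * 8.854e-12     # permittivity of silicon (F/m)

def surface_potential(R, a, Na):
    # Band bending across the depleted shell a <= r <= R of a p-type pillar,
    # from integrating (1/r) d/dr(r dpsi/dr) = q*Na/eps with E(a) = 0:
    #   psi(R) - psi(a) = q*Na/(2*eps) * [(R^2 - a^2)/2 - a^2 * ln(R/a)]
    return Q * Na / (2 * EPS_SI) * ((R**2 - a**2) / 2 - a**2 * np.log(R / a))

def depletion_width(R, Na, psi_s):
    # If even full depletion (a -> 0) cannot support psi_s, the pillar is
    # fully depleted and the depletion width equals the radius.
    if surface_potential(R, 1e-12, Na) < psi_s:
        return R
    lo, hi = 1e-12, R         # potential decreases monotonically with a
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if surface_potential(R, mid, Na) > psi_s:
            lo = mid
        else:
            hi = mid
    return R - 0.5 * (lo + hi)

NA = 5e23    # acceptor doping, 5e17 cm^-3 (hypothetical)
PSI = 0.8    # sidewall band bending in volts (hypothetical)

w_small = depletion_width(32.5e-9, NA, PSI)   # 65 nm diameter channel hole
w_large = depletion_width(130e-9, NA, PSI)    # 260 nm diameter channel hole
print(w_small * 1e9, w_large * 1e9)           # depletion widths in nm
```

With these assumed numbers, the 65 nm hole comes out fully depleted while the 260 nm hole does not, which is consistent with the qualitative diameter dependence the abstract reports.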
Convolutional neural networks (CNNs) play a key role in deep learning applications. However, the high computational complexity and high energy consumption of CNNs hamper their deployment in hardware accelerators. Computing-in-memory (CIM) is the technique of running calculations entirely in memory (in our design, we use SRAM). The CIM architecture has demonstrated great potential for efficiently computing large-scale matrix-vector multiplications. A CIM-based architecture for event detection is designed to trigger the next stage of precision inference. To implement an SRAM-based CIM accelerator, a software-hardware co-design approach must consider the CIM macro's hardware limitations when mapping the weights onto AI edge devices. In this paper, we design a hierarchical AI architecture to optimize end-to-end system power in AIoT applications. In the experiments, the CIM-aware algorithm with 4-bit activations and 8-bit weights is evaluated on the hand-gesture and CIFAR-10 datasets, achieving 99.70% and 70.58% accuracy, respectively. A profiling tool is also developed to measure the efficiency of the proposed architecture. Running at an operating frequency of 100 MHz on the hand-gesture and CIFAR-10 datasets, with nine convolutional layers and one FC layer as its network, the proposed system achieves a frame rate of 662 FPS, 37.6% processing-unit utilization, and a power consumption of 0.853 mW.
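The 4-bit-activation/8-bit-weight scheme can be illustrated with a uniform-quantization sketch. This is an assumption, since the paper's actual quantizer is not specified here: activations become unsigned 4-bit codes, weights become signed 8-bit codes, the MAC runs in the integer domain as a CIM macro would, and the result is rescaled afterwards.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.0, 1.0, 64)       # post-ReLU activations (float reference)
w = rng.normal(0.0, 0.1, 64)        # weights (float reference)

# 4-bit unsigned activation codes in [0, 15], with an assumed range of [0, 1]:
sa = 1.0 / 15
qa = np.clip(np.round(a / sa), 0, 15).astype(np.int32)

# 8-bit signed weight codes in [-127, 127], scaled to the max |w|:
sw = np.abs(w).max() / 127
qw = np.clip(np.round(w / sw), -127, 127).astype(np.int32)

acc = int(np.dot(qa, qw))           # integer MAC, as the CIM macro computes it
approx = acc * sa * sw              # dequantize with the product of the scales
exact = float(np.dot(a, w))
print(approx, exact)
```

The integer accumulator only needs to be rescaled once per output, which is what makes low-bit weight/activation pairs attractive for an SRAM CIM macro.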
ISBN (Print): 9781450366618
Convolutional neural networks (CNNs) are power-hungry and resource-consuming applications, which makes them hard to deploy on end devices. We propose a method to perform convolution operations in NOR flash memory. Experimental results show that our method achieves strong performance and high energy efficiency.
ISBN (Print): 9781450379441
Utilizing emerging nonvolatile memories to accelerate deep neural networks (DNNs) has been considered one of the promising approaches to solving the data-transfer bottleneck during multiplication and accumulation (MAC). Among these, spintronic memories show a tempting prospect due to their low access power, fast access speed, high density, and relatively mature process. As shown in Fig. 1, the approaches can be divided into three technical routes according to the principle used to achieve DNN computing. The first is an "analog" method [1, 2], shown in Fig. 1(a). By transforming the digital input signals into multi-level voltage signals and applying them to different columns of the memory array, the MAC results can be obtained from the columns using a current integrator and an analog-to-digital converter (ADC). In addition, the WL drivers can control the pulse width of different rows to achieve the effect of multi-bit weights. This method can theoretically achieve high energy efficiency and computing speed. However, the variation of the magnetic tunnel junction (MTJ) may affect computing accuracy, and the power consumption and area overhead of the ADC are also challenging. The other two methods are "digital": they realize MAC computing through row-by-row read/write operations. Fig. 1(b) shows the second, reading-based method [3]. The weights of the neural network are stored in the memory cells. By feeding the input signal to a modified sense amplifier (SA), the XOR function, which is the core of binary NNs, can be computed against the content stored in the memory cell. Nevertheless, the modification to the SA usually adds extra transistors to the read path, which increases the bit error rate. Fig. 1(c) shows the diagram of the last method, which is based on "stateful logic" [4]. The input data is sent to the modified write driver while the WL receives weight signals from the outside I/O. Based on a unique logic paradigm, it can real
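The "analog" route of Fig. 1(a) can be sketched as a small simulation. All device values here are hypothetical, as is the 5% Gaussian MTJ variation: digital inputs become multi-level voltages, per-column currents give the MAC result, and device variation perturbs it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Binary weights stored as MTJ conductances (hypothetical values, siemens):
G_P, G_AP = 20e-6, 10e-6                     # parallel / antiparallel states
W = rng.integers(0, 2, (16, 4))              # 16 rows (inputs) x 4 columns
G = np.where(W == 1, G_P, G_AP)

# Device-to-device MTJ variation, assumed 5% Gaussian on the conductance:
G_var = G * (1 + rng.normal(0.0, 0.05, G.shape))

x = rng.integers(0, 4, 16)                   # 2-bit digital inputs
V = x * 0.1                                  # DAC: multi-level voltages (V)

I_ideal = V @ G                              # per-column MAC via Kirchhoff's law
I_real = V @ G_var                           # the same MAC under MTJ variation
err = np.abs(I_real - I_ideal) / (I_ideal + 1e-18)
print(I_real, err)
```

The relative column-current error stays small here because independent per-device errors partially cancel in the summation, which is why analog MAC tolerates moderate variation but still needs variation-aware design at scale.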
Computation-in-memory using memristive devices is a promising approach to overcoming the performance limitations of conventional computing architectures introduced by the von Neumann bottleneck, also known as the memory wall and the power wall. It has been shown that accelerators based on memristive devices can deliver higher energy efficiency and data throughput than conventional architectures. Among the vast multitude of memristive devices, bipolar resistive switches based on the valence change mechanism (VCM) are particularly interesting due to their low-power operation, non-volatility, high integration density, and CMOS compatibility. While a wide range of possible applications is considered, many of them, such as artificial neural networks, rely heavily on vector-matrix multiplication (VMM) as a mathematical operation. These VMMs are made up of large numbers of multiplication and accumulation (MAC) operations. The MAC operation can be realized with memristive devices in an analog fashion using Ohm's law and Kirchhoff's law. However, VCM devices exhibit a range of non-idealities that affect VMM performance, which in turn impacts the overall accuracy of the application. These non-idealities can be classified as time-independent (programming variability) and time-dependent (read disturb and read noise). Additionally, peripheral circuits such as analog-to-digital converters can introduce errors during digitization. In this work, we experimentally and theoretically investigate the impact of device- and circuit-level effects on the VMM in VCM crossbars. Our analysis shows that the variability of the low-resistive state plays a key role and that reading in the RESET direction should be favored over reading in the SET direction.
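The non-ideality classes listed above can be mocked up in a small crossbar simulation (illustrative only; all device values are assumed): programming variability is drawn once when the array is written, read noise is drawn fresh on every read, and an n-bit ADC quantizes the column currents.

```python
import numpy as np

rng = np.random.default_rng(3)

# Target HRS/LRS conductances for a 32x8 VCM crossbar (hypothetical, siemens):
g_hrs, g_lrs = 1e-6, 50e-6
W = rng.integers(0, 2, (32, 8))
g_target = np.where(W == 1, g_lrs, g_hrs)

# Time-independent non-ideality: programming variability, drawn once.
g_prog = g_target * (1 + rng.normal(0.0, 0.08, g_target.shape))

v = rng.uniform(0.0, 0.2, 32)     # read voltages applied to the rows

def read_columns(n_bits=6):
    # Time-dependent non-ideality: read noise, drawn fresh on every read.
    g_read = g_prog * (1 + rng.normal(0.0, 0.02, g_prog.shape))
    i = v @ g_read                # analog MAC: Ohm's law + Kirchhoff's law
    # Peripheral non-ideality: an n-bit ADC quantizes the column currents.
    lsb = i.max() / (2 ** n_bits - 1)
    return np.round(i / lsb) * lsb

i_ideal = v @ g_target
i_meas = read_columns()
rel_err = np.abs(i_meas - i_ideal) / i_ideal
print(rel_err)
```

Because the LRS conductance is ~50x the HRS conductance, the LRS variability (8% here) dominates the column-current error, consistent with the abstract's finding that low-resistive-state variability plays the key role.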