Accurate programming of non-volatile memory (NVM) devices in analog in-memory computing (AIMC) cores is critical to achieve high matrix-vector multiplication (MVM) accuracy during deep learning inference workloads. In this paper, we propose a novel programming approach that directly minimizes the MVM error by performing stochastic gradient descent optimization with synthetic random input data. The MVM error is significantly reduced compared to conventional unit-cell-by-unit-cell iterative programming. We demonstrate that the optimal hyperparameters in our method are agnostic to the weights being programmed, enabling large-scale deployment across multiple AIMC cores without further fine-tuning. The method also eliminates the need for high-resolution analog-to-digital converters (ADCs) to decipher the small unit-cell conductances during programming. We experimentally validate this approach by demonstrating an inference accuracy increase of 1.26% on ResNet-9. The experiments were performed using phase change memory (PCM)-based AIMC cores fabricated in 14-nm CMOS technology.
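The programming idea described above can be sketched numerically: treat the programmed conductances as free parameters and run SGD on the MVM error measured with synthetic random inputs. The following is a minimal NumPy sketch of that idea only; the array sizes, learning rate, batch size, and normalized conductance range are illustrative assumptions, not the paper's values or hardware model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small tile; a real AIMC core would be far larger.
rows, cols = 8, 4
W_target = rng.uniform(-0.8, 0.8, size=(rows, cols))  # desired weights
G = np.zeros_like(W_target)                           # programmed conductances (normalized)

lr, batch, steps = 0.05, 32, 2000
for _ in range(steps):
    x = rng.normal(size=(batch, rows))     # synthetic random input vectors
    err = x @ (G - W_target)               # MVM error observed at the output
    grad = x.T @ err / batch               # gradient of the mean-squared MVM error
    G -= lr * grad                         # SGD update applied to the conductances
    G = np.clip(G, -1.0, 1.0)              # stay within the programmable range

# Residual MVM error on fresh random inputs after programming.
mvm_err = np.abs(rng.normal(size=(64, rows)) @ (G - W_target)).mean()
```

Because the inputs are random, no per-device readout of small conductances is needed; only the aggregate MVM output drives the updates, which is the property the abstract highlights.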
ISBN:
(Print) 9781665451093
Analog in-memory computing architectures demand high-speed analog-to-digital converters, for which a dynamic comparator is a crucial building block. Speed and common-mode insensitivity are the critical features of such dynamic comparators. Most reported dynamic comparators achieve high speed only for a narrow range of input common-mode voltages; their performance degrades at the extremities of the common-mode range. We propose a common-mode-insensitive cascode cross-coupled dynamic comparator to overcome this drawback. The proposed comparator is designed, simulated, and compared with state-of-the-art techniques in 65-nm CMOS technology. At a 1.1-V supply voltage, the proposed comparator shows a delay of 37 ps for an input difference of 10 mV at a common-mode voltage of 400 mV.
ISBN:
(Print) 9798350300116
In-memory computing is a promising architecture to meet the exploding demand of data-intensive workloads, including deep neural networks. In particular, analog in-memory computing (AIMC) is a promising way to build matrix-multiplication accelerators that take full advantage of data parallelism and reusability. However, most AIMC designs use voltage-readout circuits that gain no benefit from CMOS scaling, which is an obstacle to improving computational density. We propose a method that combines capacitive AIMC with near-memory time-subtraction readout, which is theoretically scalable with respect to miniaturization and row/column parallelism, and is adjustable in output resolution. We evaluated the signed multi-bit dot-product operation in post-layout simulation using circuits designed in a 180-nm process. Even with a 16× increase in row parallelism (9 to 144), the time resolution required for readout was successfully reduced to a variation of 0.39%.
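The signed readout idea can be illustrated behaviorally: split signed weights into positive and negative columns, read each unsigned MAC result out as the time at which a ramp crosses the accumulated voltage, and recover the signed dot product as the difference of the two times. This is a schematic Python sketch of that concept only, with illustrative constants, not a model of the paper's capacitive circuit.

```python
import numpy as np

rng = np.random.default_rng(3)

def ramp_crossing_time(voltage, ramp_rate=1.0):
    # Time at which a linear ramp (slope = ramp_rate) crosses the column voltage.
    return voltage / ramp_rate

x = rng.integers(0, 4, size=16)              # multi-bit unsigned inputs
w = rng.integers(-2, 3, size=16)             # signed weights

# Signed weights are split across a positive and a negative column.
w_pos, w_neg = np.maximum(w, 0), np.maximum(-w, 0)
v_pos = (x * w_pos).sum() * 0.01             # unsigned MAC result, + column (arb. units)
v_neg = (x * w_neg).sum() * 0.01             # unsigned MAC result, - column

# Near-memory time subtraction: the signed result is a difference of times,
# so no explicit analog voltage subtraction or high-resolution ADC is needed.
t_signed = ramp_crossing_time(v_pos) - ramp_crossing_time(v_neg)
```

Encoding the result in time rather than voltage is what makes the readout scale with CMOS miniaturization, since time resolution improves with faster logic.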
ISBN:
(Print) 9781728182810
This paper presents a Flash A/D converter to be integrated at the periphery of mixed-signal computing memories for convolutional neural networks. We investigate the feasibility of true time-multiplexing, which allows the ADC requirements on area and aspect ratio to be greatly relaxed without sacrificing the data throughput of the memory array. The ADC, based on a strong-arm latched comparator combining built-in reference generation, body bias, and offset calibration, exhibits 29.8-dB SNDR at 3.2 GS/s with 1.5-mW power consumption and a silicon area of 900 µm². Integrated with the memory array, the converter enables up to 32-to-1 column multiplexing with 20 ns of A/D conversion latency.
ISBN:
(Print) 9781665409599
Although resistive random-access memories (RRAMs) are capable of analog in-memory computing and can be utilized to accelerate applications such as neural networks, the analog-digital interface incurs considerable overhead and may even counteract the benefits brought by RRAM-based in-memory computing. In this paper, we introduce how to reduce or eliminate the overhead of the analog-digital interface in RRAM-based neural-network accelerators and linear-solver accelerators. In the former, we create an analog inference flow and introduce a new methodology to accelerate the entire analog flow using resistive content-addressable memories (RCAMs), eliminating redundant analog-to-digital conversions. In the latter, we provide an approach to map classical iterative solvers onto RRAM-based crossbar arrays such that the hardware obtains the solution in O(1) time complexity without actual iterations; intermediate analog-to-digital and digital-to-analog conversions are thus completely eliminated. Simulation results demonstrate the superiority of our approaches in performance and energy efficiency. The accuracy of RRAM-based analog computing remains a focus for future research.
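The solver mapping rests on a standard observation: an iterative scheme such as Jacobi has a fixed point that is exactly the solution of Ax = b, so an analog feedback circuit storing the splitting as conductances settles directly to that fixed point without digital iterations or intermediate conversions. The sketch below is a plain digital emulation of that fixed-point idea (explicit iteration stands in for the analog settling); the system here is a made-up diagonally dominant example, not the paper's hardware.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical diagonally dominant system A x = b (guarantees Jacobi convergence).
n = 6
A = rng.uniform(0, 1, size=(n, n)) + n * np.eye(n)
b = rng.uniform(-1, 1, size=n)

# Jacobi splitting A = D + R; a crossbar would hold D^{-1} and R as conductances,
# and the analog feedback loop settles to the fixed point of this recurrence.
D_inv = 1.0 / np.diag(A)
R = A - np.diag(np.diag(A))

x = np.zeros(n)
for _ in range(200):              # digital stand-in for the analog settling
    x = D_inv * (b - R @ x)       # fixed point satisfies A x = b
```

In the analog realization the "iterations" are continuous-time circuit dynamics, which is why the solution appears in O(1) time rather than after a step count that grows with the problem.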
As the demands of big-data applications and deep learning continue to rise, the industry is increasingly looking to artificial intelligence (AI) accelerators. Analog in-memory computing (AiMC) with emerging nonvolatile devices enables good hardware solutions due to its high energy efficiency in accelerating the multiply-and-accumulate (MAC) operation. Herein, an Applied Materials custom-designed system-on-chip (SoC) targeting AI applications, with analog in-memory computing using resistive random-access memory (ReRAM) as the compute element, is demonstrated. The first silicon achieves high energy efficiency in MAC operations. The chip implements the LeNet-1 neural network on ReRAM tiles and is demonstrated on Modified National Institute of Standards and Technology (MNIST) classification with accuracy matching that predicted in simulation. A simulation framework, AI Sim, is also developed to evaluate system performance for large-scale applications and to guide bitcell development and design choices.
ISBN:
(Print) 9798350322255
In-memory computing (IMC) has emerged as a promising paradigm for energy-, throughput-, and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full-system performance on real workloads. This paper addresses the lack of a quantitative comparison of analog in-memory computing (AIMC) and digital in-memory computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessments on different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may have better energy efficiency than DIMC on convolutional and pointwise layers, which can exploit high spatial unrolling. On the other hand, DIMC with small macro sizes outperforms AIMC on depthwise layers, which feature limited spatial-unrolling opportunities inside a macro.
As genome sequencing finds utility in a wide variety of domains beyond the confines of traditional medical settings, its computational pipeline faces two significant challenges. First, the creation of up to 0.5 GB of data per minute imposes substantial communication and storage overheads. Second, the sequencing pipeline is bottlenecked at the basecalling step, which consumes >40% of genome-analysis time. A range of proposals have attempted to address these challenges, with limited success. We propose to address them with a Compute-in-Memory Basecalling Accelerator (CiMBA), the first embedded (~25 mm²) accelerator capable of real-time, on-device basecalling, coupled with AL-Dorado, a new family of analog-focused basecalling DNNs. The resulting hardware/software co-design greatly reduces data-communication overhead, achieves a throughput of 4.77 million bases per second (24× that required for real-time operation), and delivers 17×/27× power/area efficiency over the best prior embedded basecalling accelerator while maintaining accuracy comparable to state-of-the-art software basecallers.
Recently, specialized training algorithms for analog cross-point-array-based neural network accelerators have been introduced to counteract device non-idealities such as update asymmetry and cycle-to-cycle variation, achieving software-level performance in neural network training. However, a quantitative analysis of how these algorithms relax device specifications has yet to be conducted. This study provides such an analysis by elucidating the device prerequisites for training with the Tiki-Taka algorithm versions 1 (TTv1) and 2 (TTv2), which leverage the dynamics between multiple arrays to compensate for device non-idealities. A multiparameter simulation is conducted to assess the impact of device non-idealities, including asymmetry, retention, number of pulses, and cycle-to-cycle variation, on neural network training. Using pattern-recognition accuracy as a performance metric, the required device specifications for each algorithm are revealed. The results demonstrate that the standard stochastic gradient descent algorithm requires stringent device specifications. Conversely, TTv2 permits more lenient device specifications than TTv1 across all examined non-idealities. The analysis provides guidelines for the development, optimization, and utilization of devices for high-performance neural network training using Tiki-Taka algorithms. This study investigates the device specifications required for neural network training on analog resistive cross-point arrays with these algorithms. By demonstrating robustness against non-ideal update characteristics, it quantitatively shows how hardware-aware training can relax device specifications, paving the way for successful implementation of analog deep-learning accelerators with actual devices. © 2024 WILEY-VCH GmbH
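The compensation mechanism the Tiki-Taka family exploits can be illustrated with a deliberately simplified scalar model: noisy, asymmetric gradient updates land on a fast auxiliary weight A, which drifts toward its symmetry point, while a slow main weight W receives periodic transfers from A. The sketch below is a toy caricature of that two-array structure, not the published TTv1/TTv2 algorithms; the asymmetry model, learning rates, and transfer period are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def asymmetric_update(w, dw, beta=0.5):
    # Device non-ideality: potentiation and depression have different gains.
    gain = np.where(dw > 0, 1.0 + beta, 1.0 - beta)
    return float(np.clip(w + gain * dw, -1.0, 1.0))

# Toy scalar regression y = w_true * x trained with a two-"array" weight pair.
w_true, A, W = 0.6, 0.0, 0.0
lr, transfer_lr = 0.1, 0.1

for step in range(6000):
    x = rng.normal()
    grad = (W * x - w_true * x) * x            # MSE gradient w.r.t. the weight
    A = asymmetric_update(A, -lr * grad)       # non-ideal update lands on A, not W
    A *= 0.99                                  # drift toward the symmetry point
    if step % 5 == 0:
        W = asymmetric_update(W, transfer_lr * A)  # periodic A -> W transfer
```

Even though every individual update is asymmetric, the main weight converges near the target, which is the qualitative sense in which these algorithms relax device specifications relative to plain SGD applied directly to one array.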
Electrochemical random-access memory (ECRAM) is an emerging three-terminal nonvolatile memory (NVM) with highly controllable channel conductance, which makes it promising for use as an analog memory (or synapse) in analog in-memory computing (IMC) systems. Energy-efficient analog IMC is particularly desirable for power-constrained, high-radiation environments such as satellites. However, little is known about the suitability of ECRAM for use in a total-ionizing-dose (TID) environment. This work investigates the effect of Co-60 gamma radiation on the channel conductance and noise, two properties critical for analog IMC systems, of a TaOx-based ECRAM up to 17.3 Mrad(SiO2) for both low- and high-channel-conductance-state devices. A transient increase in conductance is observed in response to radiation, consisting of two elements: an immediate increase in conductivity due to photocurrent, and a secondary increase that rises and saturates more slowly and can persist for hours after exposure. This secondary, persistent photoconductivity is attributed to charging caused by hole trapping. These transient effects would likely not occur in a space environment, owing to its low dose rate compared with this experiment. No permanent change is found in the low-conductance state (LCS) following exposure, and the minor shift in the high-conductance state would be less significant than the regular retention decay in that state. A permanent increase in random telegraph noise is observed, possibly due to traps created in the channel. This work demonstrates that TaOx-based ECRAM is suitable for use in spaceborne analog IMC systems subject to significant TID.