ISBN (Print): 9781450366694
Memory-augmented neural networks are attracting increasing attention from researchers because they can perform inference using the previous history stored in memory. Among these memory-augmented neural networks, memory networks in particular are known for their strong reasoning power and their ability to learn from a larger number of inputs than other networks. As the size of input datasets rapidly grows, the need for large-scale memory networks continuously arises. Such large-scale memory networks provide excellent reasoning power; however, the current computer infrastructure cannot achieve scalable performance due to its limited system architecture. In this paper, we propose MnnFast, a novel system architecture for large-scale memory networks that achieves fast and scalable reasoning performance. We identify the performance problems of the current architecture by conducting an extensive performance bottleneck analysis. Our in-depth analysis indicates that the current architecture suffers from three major performance problems: high memory bandwidth consumption, heavy computation, and cache contention. To overcome these problems, we propose three novel optimizations. First, to reduce memory bandwidth consumption, we propose a new column-based algorithm with streaming, which minimizes the size of data spills and hides most of the off-chip memory access overhead. Second, to decrease the high computational overhead, we propose a zero-skipping optimization that bypasses a large amount of output computation. Lastly, to eliminate cache contention, we propose a dedicated embedding cache to efficiently cache the embedding matrix. Our evaluations show that MnnFast is significantly effective across various types of hardware: CPU, GPU, and FPGA. MnnFast improves overall throughput by up to 5.38x, 4.34x, and 2.01x on CPU, GPU, and FPGA, respectively. Also, compared to CPU-based MnnFast, our FPGA-based MnnFast achieves 6.54x higher energy efficiency.
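The zero-skipping idea can be illustrated with a short sketch. In a memory network, the output step forms a softmax-weighted sum over the output embedding matrix, and after the softmax most attention weights are negligibly small. The Python sketch below uses an illustrative threshold `eps`, which is our assumption rather than MnnFast's exact skipping criterion:

```python
import numpy as np

# Hedged sketch of zero-skipping in a memory network's output step.
# Rows of the output embedding matrix C whose attention weight falls
# below `eps` contribute almost nothing and are skipped entirely.
def output_with_zero_skipping(q, M, C, eps=1e-6):
    scores = M @ q                      # inner products of query with memories
    p = np.exp(scores - scores.max())   # numerically stable softmax
    p /= p.sum()
    keep = p > eps                      # memories worth computing
    return p[keep] @ C[keep]            # skip near-zero contributions

# Example: 10,000 memory slots with 128-dimensional embeddings.
q = np.random.rand(128)
M = np.random.rand(10_000, 128)
C = np.random.rand(10_000, 128)
o = output_with_zero_skipping(q, M, C)
```

Because the softmax concentrates probability mass on a few memories, the skipped rows never need to be fetched, which is what couples this optimization to the bandwidth savings described above.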
The Low-Power Object Detection Challenge (LPODC) at the IEEE/ACM Design Automation Conference is a premier contest in low-power object detection and algorithm (software)-hardware co-design for edge artificial intelligence, and it has been a success over the past five years. LPODC focuses on designing and implementing novel algorithms on edge platforms for object detection in images taken from unmanned aerial vehicles (UAVs), and it has attracted hundreds of teams from dozens of countries. Our team, SEUer, participated in this competition for three consecutive years from 2020 to 2022, placing sixth in both 2020 and 2021 and winning the championship in 2022. In this paper, we present the LPODC for UAV object detection from 2018 to 2022, including its dataset, hardware platform, and evaluation method. In addition, we introduce and discuss the details of the methods proposed by each year's top three teams from 2018 to 2022 in terms of network, accuracy, quantization method, hardware performance, and total score. We also conduct an in-depth analysis of the selected entries and results and summarize representative methodologies. This analysis serves as a valuable practical resource for researchers and engineers deploying UAV applications on edge platforms and enhancing their feasibility and reliability. The analysis and discussion make it evident that a hardware-algorithm co-design approach is paramount in the context of tiny machine learning (TinyML): it surpasses the mere optimization of software and hardware as separate entities and proves essential for achieving optimal performance and efficiency in TinyML applications.
Graph Neural Networks (GNNs) are of great value in numerous applications and promote the development of cognitive intelligence thanks to their capability to model non-Euclidean data structures. However, their inherent irregularity makes GNNs memory-bound, and their hybrid computing paradigm poses significant challenges for efficient deployment on existing hardware architectures. Near-Memory Processing (NMP) is a promising solution to the memory wall problem. In this paper, we present G-NMP, a practical and efficient DIMM-based NMP solution for accelerating GNNs, which for the first time accelerates both the sparse aggregation and the dense combination computations on DIMMs. We propose a novel G-NMP hardware architecture that efficiently exploits rank-level memory parallelism, together with G-ISA instructions that significantly reduce host memory requests. We apply several data-flow optimizations to G-NMP to improve memory-compute overlap and realize efficient matrix computation. We then develop an adaptive data-allocation strategy for diverse vector sizes to further exploit feature-level parallelism. We also propose a novel memory-request scheduling method to achieve flexible and low-overhead DRAM ownership transitions between the host and G-NMP. Overall, G-NMP achieves consistent performance advantages across diverse GNN models and datasets, offering 1.46x higher performance and 1.29x better energy efficiency on average than the state-of-the-art work.
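To make the hybrid computing paradigm concrete, the sketch below shows the two GNN phases G-NMP targets: the sparse, memory-bound aggregation over the adjacency matrix and the dense, compute-bound combination with the layer weights. All shapes and the sparsity density are illustrative assumptions:

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Hedged sketch of one GNN layer split into the two phases that a
# DIMM-based NMP design like G-NMP would offload.
num_nodes, in_dim, out_dim = 1024, 64, 32
A = sparse_random(num_nodes, num_nodes, density=0.01, format='csr')  # adjacency (sparse)
X = np.random.rand(num_nodes, in_dim)                                # node features
W = np.random.rand(in_dim, out_dim)                                  # layer weights

H = A @ X   # sparse aggregation: irregular memory accesses, memory-bound
Z = H @ W   # dense combination: regular matrix multiply, compute-bound
```

The irregular gather in the first line is why GNNs hit the memory wall on conventional hosts, and the contrast between the two lines is why accelerating both phases near memory, rather than only one, matters.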
Matrix algebra and deep neural networks represent foundational classes of computational algorithms across multiple emerging applications such as augmented and virtual reality, autonomous navigation (cars, drones, robots), data science, and various artificial-intelligence-driven solutions. An accelerator-based architecture can provide performance and energy efficiency for fixed functions through customized datapaths. However, constrained edge systems that must efficiently support multiple applications and diverse matrix operations cannot afford numerous custom accelerators. In this article, we present Mxcore, a unified architecture comprising tightly coupled vector and programmable cores that share data through highly optimized interconnects, along with a configurable hardware scheduler managing their co-execution. We submit Mxcore as a generalized approach to flexibly accelerate multiple matrix-algebra and deep-learning applications across a range of sparsity levels. Unified compute resources improve overall resource utilization and performance per unit area. Aggressive and novel microarchitecture techniques, along with block-level sparsity support, optimize compute and data reuse to minimize bandwidth and power requirements, enabling ultra-low-latency applications for low-power and cost-sensitive edge deployments. Mxcore requires a small silicon footprint of 0.2068 mm² in a modern 7-nm process at 1 GHz, achieves 0.15 FP32 and 0.62 INT8 TMAC/mm², and dissipates only 11.66 µW of leakage power. At iso-technology and iso-frequency, Mxcore provides 651.4x, 159.9x, 104.8x, and 124.2x better energy efficiency than the 128-core Nvidia Maxwell GPU for dense general matrix multiply, sparse deep neural networks, Cholesky decomposition, and triangular matrix solve, respectively.
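Block-level sparsity, mentioned above, can be sketched briefly: instead of testing individual elements, the kernel checks whole operand blocks and skips all-zero ones. The Python sketch below is a minimal illustration under an assumed 16x16 block size, not Mxcore's actual tile geometry or datapath:

```python
import numpy as np

# Hedged sketch of block-level sparsity in matrix multiplication:
# whole zero blocks of the left operand are skipped, saving both the
# multiply-accumulates and the bandwidth to fetch B's matching rows.
def block_sparse_matmul(A, B, block=16):
    n, k = A.shape
    C = np.zeros((n, B.shape[1]))
    for i in range(0, n, block):
        for j in range(0, k, block):
            blk = A[i:i+block, j:j+block]
            if not blk.any():                    # skip all-zero blocks entirely
                continue
            C[i:i+block] += blk @ B[j:j+block]
    return C
```

Testing one flag per block rather than one per element keeps the skip logic cheap in hardware, which is why block granularity suits area- and power-constrained edge designs.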
Organic neuromorphic devices can accelerate neural networks and integrate with biological systems. Devices based on the biocompatible and conductive polymer PEDOT:PSS are fast, consume little energy, and perform well in crossbar simulations. However, parasitic electrochemical reactions lead to self-discharge and to the fading of the learned conductance states over time. This limits a neural network's operating time and requires complex compensation mechanisms. Spiking neural networks (SNNs) take inspiration from biology to implement local and always-on learning. We show that these SNNs can function on organic neuromorphic hardware and compensate for self-discharge by continuously relearning and reinforcing forgotten states. In this work, we use a high-resolution charge-transport model to describe the behavior of organic neuromorphic devices and create a computationally efficient surrogate model. By integrating the surrogate model into a Brian 2 simulation, we can describe the behavior of SNNs on organic neuromorphic hardware. A biologically plausible two-layer network for recognizing 28x28-pixel MNIST images is trained and observed during self-discharge. For its size, the network achieves competitive recognition results of up to 82.5%. Building the network from forgetful devices even yields superior training accuracy of 84.5% compared to ideal devices. However, trained networks without active spike-timing-dependent plasticity quickly lose their predictive performance. We show that online learning can keep the performance at a steady level close to the initial accuracy, even for idle rates of up to 90%. This performance is maintained even when the output neurons' labels are not revalidated for up to 24 h. These findings reconfirm the potential of organic neuromorphic devices for brain-inspired computing. Their biocompatibility and demonstrated adaptability to SNNs open the path toward close integration with multi-electrode arrays, drug-delivery devices, and ot…
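The compensation mechanism, continuous relearning through spike-timing-dependent plasticity (STDP), can be sketched in Brian 2, the simulator named in the abstract. In this minimal sketch the synaptic weight passively decays toward zero to imitate self-discharge; the exponential decay form and its time constant are our assumptions, whereas the paper drives its simulation with a surrogate of a high-resolution charge-transport model:

```python
from brian2 import *

# Hedged Brian 2 sketch: standard STDP plus a passive weight decay
# that stands in for the device's parasitic self-discharge.
tau_pre = tau_post = 20*ms
tau_discharge = 7200*second      # assumed self-discharge time constant
w_max = 0.01
A_pre = 0.01
A_post = -A_pre * tau_pre / tau_post * 1.05

G = NeuronGroup(2, 'dv/dt = -v / (10*ms) : 1',
                threshold='v > 1', reset='v = 0', method='exact')

S = Synapses(G, G,
             '''dw/dt = -w / tau_discharge : 1 (clock-driven)
                dapre/dt = -apre / tau_pre : 1 (event-driven)
                dapost/dt = -apost / tau_post : 1 (event-driven)''',
             on_pre='''v_post += w
                       apre += A_pre
                       w = clip(w + apost, 0, w_max)''',
             on_post='''apost += A_post
                        w = clip(w + apre, 0, w_max)''',
             method='exact')
S.connect(i=0, j=1)
run(1*second)   # weights fade unless spike pairs keep relearning them
```

Because the STDP updates fire on every spike pair, active synapses are continually nudged back toward their learned values, which is the always-on relearning effect the abstract describes.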
Organic neuromorphic device networks can accelerate neural network algorithms and directly integrate with microfluidic systems or living tissues. Proposed devices based on the biocompatible conductive polymer PEDOT:PSS have shown high switching speeds and low energy demand. However, as electrochemical systems, they are prone to self-discharge through parasitic electrochemical reactions, so the network's synapses forget their trained conductance states over time. This work integrates single-device high-resolution charge-transport models to simulate entire neuromorphic device networks and analyze the impact of self-discharge on network performance. Simulation of a single-layer nine-pixel image-classification network commonly used in experimental demonstrations reveals no significant impact of self-discharge on training efficiency, and even though the network's weights drift significantly during self-discharge, its predictions remain 100% accurate for over ten hours. On the other hand, a multi-layer network for approximating the circle function is shown to degrade significantly over twenty minutes, reaching a final mean-squared-error loss of 0.4. We propose to counter this effect by periodically reminding the network, based on a map between a synapse's current state, the time since the last reminder, and the resulting weight drift. We show that this method, with a map obtained through validated simulations, can reduce the effective loss to below 0.1 even under worst-case assumptions. Finally, while the training of this network is affected by self-discharge, good classification is still obtained. Electrochemical organic neuromorphic devices have not yet been integrated into larger device networks. This work predicts their behavior under nonideal conditions, mitigates the worst-case effects of parasitic self-discharge, and opens the path toward implementing fast and efficient neural networks on organic neuromorphic hardware.
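The reminding scheme lends itself to a short sketch: model each weight's drift since the last reminder, then invert that map to recover and rewrite the target conductance. The exponential drift model, its time constant, and the rest state below are placeholder assumptions; the paper derives its map from validated charge-transport simulations:

```python
import numpy as np

# Hedged sketch of the periodic "reminder": predict how far a synapse
# has drifted, invert the drift map, and program the weight back.
TAU = 1200.0    # assumed self-discharge time constant in seconds
W_REST = 0.0    # assumed fully discharged conductance

def drift(w, dt):
    """Predicted conductance after dt seconds of self-discharge."""
    return W_REST + (w - W_REST) * np.exp(-dt / TAU)

def remind(w_measured, dt_since_reminder):
    """Invert the drift map to estimate the trained weight for rewriting."""
    decay = np.exp(-dt_since_reminder / TAU)
    return W_REST + (w_measured - W_REST) / decay

# Example: a weight trained to 0.8 drifts for 10 minutes, then is restored.
w_now = drift(0.8, 600.0)
w_restored = remind(w_now, 600.0)   # ~0.8 again, written back to the device
```

The key design point is that the reminder needs only the measured state and the elapsed time, so no reference copy of the trained weights has to survive on ideal storage.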