Although vision transformers (ViTs) have achieved significant success, their intensive computations and substantial memory overheads challenge their deployment on edge devices. To address this, efficient ViTs have emerged, typically featuring convolution-transformer hybrid architectures to enhance both accuracy and hardware efficiency. While prior work has explored quantization for efficient ViTs to marry the hardware efficiency of hybrid architectures with that of quantization, it focuses on uniform quantization and overlooks the potential advantages of mixed quantization. Meanwhile, although several works have studied mixed quantization for standard ViTs, they are not directly applicable to hybrid ViTs due to their distinct algorithmic and hardware characteristics. To bridge this gap, we present M²ViT to accelerate convolution-transformer hybrid efficient ViTs with two-level mixed quantization (M²Q). Specifically, we introduce a hardware-friendly M²Q strategy, characterized by both mixed quantization precision and mixed quantization schemes (uniform and power-of-two (PoT)), to exploit the architectural properties of efficient ViTs. We further build a dedicated accelerator with heterogeneous computing engines to translate algorithmic benefits into real hardware improvements. Experimental results validate our effectiveness, showcasing an average of 80% energy-delay product (EDP) saving with comparable quantization accuracy compared to prior work. Codes are available at https://***/lybbill/M2ViT.
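Since the abstract contrasts uniform with power-of-two (PoT) quantization, a minimal NumPy sketch of the two schemes may help; the bit-widths, clipping rule, and function names below are illustrative assumptions, not M²ViT's actual strategy.

```python
import numpy as np

def uniform_quantize(x, bits=8):
    # Uniform scheme: evenly spaced levels across the tensor's range.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale  # dequantize for easy comparison

def pot_quantize(x, bits=4):
    # Power-of-two (PoT) scheme: every value snaps to sign * 2^k, so a
    # multiply becomes a simple bit shift in hardware.
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), 1e-12)
    exp = np.clip(np.round(np.log2(mag)), -(2 ** (bits - 1)), 0)
    return sign * np.exp2(exp)

w = np.random.randn(8).astype(np.float32)
print(uniform_quantize(w), pot_quantize(w))
```

Mixing the two, as the M²Q strategy does, lets multiplication-heavy layers use shift-friendly PoT weights while accuracy-sensitive layers keep uniform levels.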
This work presents an energy-efficient ECG processor designed for cardiac arrhythmia classification. The processor integrates a pre-processing unit and a neural network accelerator, achieved through algorithm-hardware co-design to optimize hardware resources. We propose a lightweight two-stage neural network architecture, where the first stage includes a discrete wavelet transform and an ultra-low-parameter multilayer perceptron (MLP) network, and the second stage utilizes group convolution and channel shuffle. Both stages leverage neural networks for hardware resource reuse and feature a reconfigurable processing-element array and memory blocks adapted to the proposed two-stage structure, efficiently handling the various convolution and MLP layer operations in the two-stage network. Additionally, an optimized power-of-two (OPOT) quantization technique is proposed to enhance accuracy under low-bit quantization, and a multiplier-less processing-element structure tailored to the OPOT weight quantization is introduced. The ECG processor was implemented in a 65-nm CMOS process with 4 kB of SRAM, achieving an energy consumption of 0.15 µJ per inference at a 1 V supply, a 64% energy saving compared to the most recent state-of-the-art work. Under 4-bit weight precision, the 5-class ECG signal classification accuracy reached 98.59% on the MIT-BIH arrhythmia dataset.
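The first stage's discrete wavelet transform is the kind of lightweight pre-processing that is easy to picture in a few lines. Below is a one-level Haar decomposition as a sketch; the actual wavelet family and decomposition depth used by the processor are assumptions here.

```python
import numpy as np

def haar_dwt_level(signal):
    # One level of a Haar discrete wavelet transform: split the signal
    # into a coarse approximation and a detail band.
    s = np.asarray(signal, dtype=float)
    s = s[: len(s) - len(s) % 2]               # require even length
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)  # low-pass (approximation)
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)  # high-pass (detail)
    return approx, detail

a, d = haar_dwt_level(np.sin(np.linspace(0, 8, 256)))  # toy ECG-like input
```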
Vision Transformers (ViTs) have achieved excellent performance on various computer vision tasks, but their high computation and memory costs pose challenges for practical deployment. To address this issue, token-level pruning is used as an effective method to compress ViTs, discarding unimportant image tokens that contribute little to predictions. However, directly applying unstructured token pruning to window-based ViTs damages their regular feature map structure, resulting in load imbalance when deployed on mobile devices. In this work, we propose an efficient algorithm-hardware co-optimized framework to accelerate window-based ViTs via adaptive Mixed-Granularity Sparsity (MGS). At the algorithm level, a hardware-friendly MGS algorithm is developed by integrating the inherent sparsity, global window pruning, and local N:M token pruning to balance model accuracy against computational complexity. At the hardware level, we present a dedicated accelerator equipped with a sparse computing core and two lightweight auxiliary processing units to execute window-based calculations efficiently under MGS. Additionally, we devise a dynamic pipeline-interleaving dataflow to achieve on-chip layer fusion, which reduces processing latency and maximizes data reuse. Experimental results demonstrate that, at similar computational complexity, our highly structured MGS algorithm achieves comparable or even better accuracy than previous compression methods. Moreover, compared to existing FPGA-based accelerators for Transformers, our design achieves 1.80x to 6.52x and 1.16x to 12.05x improvements in throughput and energy efficiency, respectively.
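To make the local N:M token pruning idea concrete, here is a small sketch: within every group of M tokens, keep only the N highest-scoring ones, which yields the hardware-friendly structured pattern the abstract describes. The scoring function and grouping are assumptions and may differ from the paper's.

```python
import numpy as np

def nm_token_prune(scores, n=2, m=4):
    # Local N:M token pruning in spirit: per group of m tokens,
    # keep the n tokens with the largest importance scores.
    scores = np.asarray(scores, dtype=float)
    keep = np.zeros(scores.shape, dtype=bool)
    for start in range(0, len(scores) - len(scores) % m, m):
        group = scores[start:start + m]
        top = np.argsort(group)[-n:]   # indices of the n largest scores
        keep[start + top] = True
    return keep

mask = nm_token_prune(np.random.rand(16))  # boolean keep-mask over tokens
```

Because every group retains exactly N tokens, all processing elements receive equal work, avoiding the load imbalance of unstructured pruning.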
The costly multiplications challenge the deployment of modern deep neural networks (DNNs) on resource-constrained devices. To promote hardware efficiency, prior works have built multiplication-free models. However, they are generally inferior to their multiplication-based counterparts in accuracy, calling for multiplication-reduced hybrid models that marry the benefits of both approaches. To achieve this goal, recent works, i.e., NASA and NASA+, have developed Neural Architecture Search (NAS) and acceleration frameworks to search for and accelerate such hybrid models via a tailored differentiable NAS (DNAS) engine and dedicated ASIC-based accelerators. In this paper, we delve deeper into the inherent advantages of FPGAs and present an enhanced approach called NASA-F, which focuses on FPGA-oriented search and acceleration for hybrid models. Specifically, at the algorithm level, we develop a tailored one-shot supernet-based NAS engine to streamline the search for hybrid models, eliminating the need to execute NAS for each deployment as well as additional training/finetuning steps. At the hardware level, we develop a chunk-based accelerator to fully leverage the diverse hardware resources available on FPGAs for accelerating the heterogeneous layers in hybrid models, enhancing both hardware utilization and throughput. Extensive experimental results consistently validate the superiority of our NASA-F framework; e.g., we gain ↑0.67% top-1 accuracy over the prior work NASA on CIFAR100 even without additional training steps for searched models. Additionally, we achieve up to ↑1.86x throughput and ↑2.16x FPS with ↑0.39% top-1 accuracy over the state-of-the-art multiplication-based system on Tiny-ImageNet. Codes are available at https://***/shihuihong214/NASA-F.
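A one-shot supernet search, as named in the abstract, amounts to sampling subnets that share trained supernet weights and scoring them without per-candidate retraining. The sketch below uses made-up operator names and a toy scoring proxy purely to illustrate that loop; it is not NASA-F's search space or objective.

```python
import random

# Hypothetical per-layer operator pools for a multiplication-reduced
# hybrid model: each layer picks a multiplication-based conv or a
# multiplication-free alternative (operator names are invented).
SEARCH_SPACE = [["conv3x3", "shift3x3", "adder3x3"] for _ in range(8)]

def sample_subnet(space):
    # One-shot NAS: subnets inherit supernet weights, so search reduces
    # to sampling architectures and scoring them, with no retraining.
    return [random.choice(ops) for ops in space]

best_arch, best_score = None, float("-inf")
for _ in range(100):
    arch = sample_subnet(SEARCH_SPACE)
    score = -arch.count("conv3x3")  # toy proxy: prefer fewer multiplications
    if score > best_score:
        best_arch, best_score = arch, score
```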
Vision Transformers (ViTs) have achieved remarkable success in computer vision (CV) and are increasingly recognized as the new backbone for vision-language multi-modal tasks. Despite their success, the high computational cost associated with ViTs hinders their inference efficiency. In this paper, we introduce BSViT, a bit-serial Vision Transformer accelerator enhanced by algorithm-hardware co-design. BSViT can efficiently accelerate both plain and hierarchical Vision Transformer inference. At the algorithm level, we propose a post-training quantization scheme named dynamic patch and weight bit-group quantization. We first introduce a dynamic patch quantization (DPQ) scheme to dynamically allocate bit-width to different image patches based on their importance, thus reducing bit-width and saving computation without significantly impacting accuracy. Second, we propose a weight bit-group quantization (BGQ) scheme to evenly distribute bits within groups and achieve workload balance across processing elements (PEs). At the hardware level, we propose a term-separate bit-serial accelerator to efficiently support DPQ and BGQ. We introduce dense and sparse bit-serial PEs to handle the dense least-significant-term (LST) and sparse most-significant-term (MST) workloads. A dense-sparse hybrid dataflow is devised to efficiently balance the two kinds of workloads. Our experiments show that BSViT achieves up to 1.95x speedup and 2.72x energy efficiency compared to state-of-the-art (SOTA) bit-serial accelerators and up to 3.69x energy efficiency compared to SOTA Transformer accelerators.
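The core of DPQ is a per-patch bit-width decision driven by importance. The sketch below assigns a high bit-width to the top fraction of patches and a low bit-width to the rest; the importance metric, the two bit-widths, and the keep ratio are illustrative assumptions, not BSViT's exact rule.

```python
import numpy as np

def dynamic_patch_bits(patch_scores, low_bits=4, high_bits=8, keep_ratio=0.25):
    # DPQ in spirit: important patches get more bits, the rest get fewer,
    # trading a little accuracy for large bit-serial compute savings.
    scores = np.asarray(patch_scores, dtype=float)
    k = max(1, int(len(scores) * keep_ratio))
    threshold = np.sort(scores)[-k]  # score of the k-th most important patch
    return np.where(scores >= threshold, high_bits, low_bits)

bits = dynamic_patch_bits(np.random.rand(196))  # e.g., a 14x14 patch grid
```

In a bit-serial datapath, each saved bit of precision directly removes one processing cycle per patch, which is why per-patch allocation translates into speedup.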
This letter presents an ultralow-power (ULP) H.264/AVC intra-frame image compression accelerator tailored for intelligent event-driven ULP IoT imaging systems. The H.264/AVC intra-frame codec is customized to enable compression of arbitrary nonrectangular change-detected regions. To optimize the energy and latency of image memory accesses, novel algorithm-hardware co-designs are proposed for intra-frame predictions, reducing the overhead of neighbor macroblock (McB) accesses by 2.6x at a negligible quality loss. With split control of the major processing phases, latency is optimized by exploiting data dependency and pipelining. The area and leakage of major computation units are reduced through datapath micro-architecture reconfiguration. Fabricated in 40 nm, the accelerator occupies a mere 0.32 mm² with 4 kB of SRAM. At 0.6 V and 153 kHz, it consumes only 1.21 µW, with a compression energy efficiency of 30.9 pJ/pixel that rivals state-of-the-art designs. For an event-driven IoT imaging system, the combination of the proposed accelerator and change detection brings a 133x reduction in the overall energy for compressing an image of a change-detected region of interest.
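For readers unfamiliar with intra-frame prediction, the neighbor-macroblock accesses being optimized come from modes like the textbook DC prediction sketched below, where a block is predicted from reconstructed neighbor samples and only the residual is coded. This is a standard H.264 mode, not the paper's customized non-rectangular region handling.

```python
import numpy as np

def dc_intra_predict(left_col, top_row):
    # H.264-style DC intra prediction for a 16x16 macroblock: predict
    # every pixel as the mean of the reconstructed neighbor samples.
    neighbors = np.concatenate([left_col, top_row])
    return np.full((16, 16), np.round(neighbors.mean()))

pred = dc_intra_predict(np.arange(16), np.arange(16))
```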
Large Language Models (LLMs) have achieved significant success in various Natural Language Processing (NLP) tasks, becoming essential to modern intelligent computing. However, their large memory footprint and high computational cost hinder efficient deployment. Post-Training Quantization (PTQ) is a promising technique to alleviate this issue and accelerate LLM inference. However, the presence of outliers impedes the advancement of LLM quantization to lower bit levels. In this paper, we introduce OFQ-LLM, an algorithm-hardware co-design solution that adopts outlier-flexing quantization to efficiently accelerate LLMs at low bit levels. The key insight of OFQ-LLM is that normal data can be efficiently quantized in a slightly reduced data encoding space, while the remaining encoding space can be used for flexible outlier values. During quantization, we use rescale-based clipping (RBC) to optimize accuracy for normal data and group outlier clustering (GOC) to flexibly represent outlier values. At the hardware level, we introduce a memory-aligned outlier-flexing encoding scheme to encode the activations and weights of LLMs at a low bit level. An outlier-normal mixed hardware architecture is devised to leverage the encoding scheme and accelerate LLMs with high speed and high energy efficiency. Our experiments show that OFQ-LLM achieves better accuracy compared to state-of-the-art (SOTA) low-bit LLM PTQ works. The OFQ-LLM-based accelerator surpasses SOTA outlier-aware accelerators by up to 2.69x core energy efficiency, up to 3.83x speedup and 2.44x energy reduction in the LLM prefilling phase, and up to 2.01x speedup and 2.88x energy reduction in the LLM decoding phase, with superior accuracy.
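The "slightly reduced encoding space" idea can be sketched directly: quantize normal values with a few integer codes held back, and let those reserved codes index a small table of full-precision outliers. The paper's RBC clipping and GOC clustering are simplified away here, and the outlier-selection rule is an assumption.

```python
import numpy as np

def outlier_flexing_quantize(x, bits=4, n_outliers=2):
    # Sketch of the outlier-flexing idea only: normal values use a
    # reduced code space; the freed top codes point into an outlier table.
    x = np.asarray(x, dtype=float)
    n_codes = 2 ** bits - n_outliers            # codes left for normal data
    order = np.argsort(np.abs(x))
    out_idx = order[-n_outliers:]               # largest magnitudes -> outliers
    scale = np.abs(x[order[:-n_outliers]]).max() / (n_codes // 2)
    codes = np.clip(np.round(x / scale), -(n_codes // 2), n_codes // 2)
    outlier_table = x[out_idx]                  # kept at full precision
    codes[out_idx] = n_codes + np.arange(n_outliers)  # reserved code values
    return codes.astype(int), scale, outlier_table

codes, scale, table = outlier_flexing_quantize(np.random.randn(64))
```

Because outliers no longer stretch the quantization range, the normal values keep fine resolution even at 4 bits, which is the accuracy lever the abstract points to.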
This paper proposes an algorithm-hardware co-design of an event-driven spiking neural network (SNN) accelerator for classification tasks on event-based data from dynamic vision sensors (DVS), which can implement a feed-forward SNN with a maximum network size of 1 million synapses. Configurable structured sparsity is introduced between the first and second layers to improve energy efficiency and balance the workload across different processing elements (PEs). The number of available neurons in the accelerator is sparsity-dependent and ranges from 1024 to 4096. A modified leaky integrate-and-fire (LIF) neuron model and an event-driven neuron update scheme are employed in both the algorithm and the hardware to fully utilize the natural sparsity of the DVS event stream. An early-stop inference strategy on the hardware enables a trade-off between inference accuracy and efficiency. A three-layer fully connected SNN is trained through backpropagation through time (BPTT) and is implemented and evaluated on a Xilinx ZCU104 FPGA. Our design achieves 96.0% accuracy on the N-MNIST dataset and 79.0% accuracy on the DVS128-Gesture dataset, both at 50% sparsity. The peak performance of the accelerator on the ZCU104 is 3.22 GSOP/s and 3.99 GSOP/W at 250 MHz.
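For reference, a vanilla LIF update is shown below; the paper uses a modified LIF with event-driven updates (the neuron state only changes when input spikes arrive), and the leak factor and threshold here are placeholder values.

```python
def lif_step(v, input_current, leak=0.9, threshold=1.0):
    # One timestep of a leaky integrate-and-fire neuron:
    # decay the membrane potential, integrate input, fire on threshold.
    v = leak * v + input_current
    spike = v >= threshold
    if spike:
        v = 0.0            # reset membrane potential on firing
    return v, spike

v, spikes = 0.0, []
for current in [0.3, 0.5, 0.6, 0.1]:
    v, s = lif_step(v, current)
    spikes.append(s)
```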
With the growing demand for processing deep learning applications on edge devices, on-device DNN training has become a major workload for executing a variety of vision tasks suited to users. Accordingly, architectures employing algorithm-hardware co-design to accelerate the training process have been steadily studied. However, previous solutions are mostly extended versions of inference studies, covering sparsity, dataflow, quantization, etc. Moreover, most works examine their schemes on from-scratch training, which cannot tolerate inaccurate computing. Consequently, factors that hinder the overall speed of the DNN training process remain unaddressed in practical workloads. In this work, we propose a runtime convergence monitor to achieve massive computational savings in practical on-device training workloads (i.e., transfer-learning-based task adaptation). By monitoring the network output data, we determine the training intensity of incoming tasks and adaptively detect convergence over iteration intervals when training diverse datasets. Furthermore, we enable the computation skip of converged images, determined by the monitored prediction probability, to enhance the training speed within an iteration. As a result, we achieve accurate but fast convergence in model training for task adaptation with minimal overhead. Unlike previous approximation methods, our monitoring system enables runtime optimization and is easily applicable to any type of accelerator, attaining significant speedup. Evaluation results on various datasets show a geometric mean speedup of 2.2x when applied to any systolic architecture, and a further enhancement of 3.6x when applied to accelerators dedicated to on-device training.
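The per-image computation skip can be sketched as a simple filter over a batch: images whose monitored prediction probability already exceeds a confidence threshold are treated as converged and dropped from further iterations. The threshold and skip rule below are illustrative, not the paper's exact policy.

```python
def filter_unconverged(batch, pred_probs, skip_threshold=0.95):
    # Monitor-style computation skip in spirit: only samples the model is
    # still unsure about continue to consume training compute.
    return [img for img, p in zip(batch, pred_probs) if p < skip_threshold]

survivors = filter_unconverged(["img0", "img1", "img2"], [0.99, 0.62, 0.97])
# -> only "img1" remains in the training iteration
```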
ISBN (print): 9798350348606; 9783981926385
Spiking neural networks (SNNs) offer a promising alternative to traditional analog neural networks (ANNs), especially for sequential tasks, with enhanced energy efficiency. The internal memory that SNNs obtain through the membrane potential equips them with innate, lightweight temporal processing capabilities. However, the unique advantages of this temporal dimension of SNNs have not yet been effectively harnessed. To that end, this article delves deeper into the what, why, and where of SNNs. By considering event-based optical flow as an exemplary task in vision-based navigation, we highlight that the true potential of SNNs lies in sequential tasks. The event-driven recurrent dynamics of a spiking neuron, merged harmoniously with event-camera inputs, enable SNNs to outperform corresponding ANNs with fewer parameters for optical flow. Furthermore, we demonstrate that SNNs can be synergistically combined with ANNs to form SNN-ANN hybrids that obtain the best of both worlds in terms of accuracy, energy, memory, and training efficiency. Additionally, the emergence of various near-memory and in-memory computing techniques has propelled efficient implementation of these approaches. Overall, the immediate future of SNNs looks exciting as we discover their niche: sequential tasks with low power requirements.