Although vision transformers (ViTs) have achieved significant success, their intensive computations and substantial memory overheads challenge their deployment on edge devices. To address this, efficient ViTs have emerged, typically featuring convolution-transformer hybrid architectures to enhance both accuracy and hardware efficiency. While prior work has explored quantization for efficient ViTs to marry the hardware efficiency of hybrid architectures with that of quantization, it focuses on uniform quantization and overlooks the potential advantages of mixed quantization. Meanwhile, although several works have studied mixed quantization for standard ViTs, they are not directly applicable to hybrid ViTs due to their distinct algorithmic and hardware characteristics. To bridge this gap, we present M²ViT to accelerate convolution-transformer hybrid efficient ViTs with two-level mixed quantization (M²Q). Specifically, we introduce a hardware-friendly M²Q strategy, characterized by both mixed quantization precision and mixed quantization schemes (uniform and power-of-two (PoT)), to exploit the architectural properties of efficient ViTs. We further build a dedicated accelerator with heterogeneous computing engines to translate algorithmic benefits into real hardware improvements. Experimental results validate our effectiveness, showcasing an average of 80% energy-delay product (EDP) saving with comparable quantization accuracy compared to prior work. Codes are available at https://***/lybbill/M2ViT.
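Since the abstract contrasts uniform with power-of-two (PoT) quantization, a minimal NumPy sketch of the two schemes may help; the bit-widths, clipping rule, and function names below are illustrative assumptions, not M²ViT's actual strategy.

```python
import numpy as np

def uniform_quantize(x, bits=8):
    # Uniform scheme: evenly spaced levels across the tensor's range.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale  # dequantize for easy comparison

def pot_quantize(x, bits=4):
    # Power-of-two (PoT) scheme: every value snaps to sign * 2^k, so a
    # multiply becomes a simple bit shift in hardware.
    sign = np.sign(x)
    mag = np.maximum(np.abs(x), 1e-12)
    exp = np.clip(np.round(np.log2(mag)), -(2 ** (bits - 1)), 0)
    return sign * np.exp2(exp)

w = np.random.randn(8).astype(np.float32)
print(uniform_quantize(w), pot_quantize(w))
```

Mixing the two, as the M²Q strategy does, lets multiplication-heavy layers use shift-friendly PoT weights while accuracy-sensitive layers keep uniform levels.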
This work presents an energy-efficient ECG processor designed for cardiac arrhythmia classification. The processor integrates a pre-processing unit and a neural network accelerator, achieved through algorithm-hardware co-design to optimize hardware resources. We propose a lightweight two-stage neural network architecture, where the first stage includes a discrete wavelet transform and an ultra-low-parameter multilayer perceptron (MLP) network, and the second stage utilizes group convolution and channel shuffle. Both stages leverage neural networks for hardware resource reuse and feature a reconfigurable processing-element array and memory blocks adapted to the proposed two-stage structure, efficiently handling the various convolution and MLP layer operations in the two-stage network. Additionally, an optimized power-of-two (OPOT) quantization technique is proposed to enhance accuracy under low-bit quantization, and a multiplier-less processing-element structure tailored to the OPOT weight quantization is introduced. The ECG processor was implemented in a 65-nm CMOS process with 4 kB of SRAM, achieving an energy consumption of 0.15 µJ per inference at a 1 V supply, a 64% energy saving compared to the most recent state-of-the-art work. Under 4-bit weight precision, the 5-class ECG signal classification accuracy reached 98.59% on the MIT-BIH arrhythmia dataset.
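The first stage's discrete wavelet transform is the kind of lightweight pre-processing that is easy to picture in a few lines. Below is a one-level Haar decomposition as a sketch; the actual wavelet family and decomposition depth used by the processor are assumptions here.

```python
import numpy as np

def haar_dwt_level(signal):
    # One level of a Haar discrete wavelet transform: split the signal
    # into a coarse approximation and a detail band.
    s = np.asarray(signal, dtype=float)
    s = s[: len(s) - len(s) % 2]               # require even length
    approx = (s[0::2] + s[1::2]) / np.sqrt(2)  # low-pass (approximation)
    detail = (s[0::2] - s[1::2]) / np.sqrt(2)  # high-pass (detail)
    return approx, detail

a, d = haar_dwt_level(np.sin(np.linspace(0, 8, 256)))  # toy ECG-like input
```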
Vision Transformers (ViTs) have achieved excellent performance on various computer vision tasks, but their high computation and memory costs pose challenges for practical deployment. To address this issue, token-level pruning is used as an effective method to compress ViTs, discarding unimportant image tokens that contribute little to predictions. However, directly applying unstructured token pruning to window-based ViTs damages their regular feature map structure, resulting in load imbalance when deployed on mobile devices. In this work, we propose an efficient algorithm-hardware co-optimized framework to accelerate window-based ViTs via adaptive Mixed-Granularity Sparsity (MGS). At the algorithm level, a hardware-friendly MGS algorithm is developed by integrating the inherent sparsity, global window pruning, and local N:M token pruning to balance model accuracy against computational complexity. At the hardware level, we present a dedicated accelerator equipped with a sparse computing core and two lightweight auxiliary processing units to execute window-based calculations efficiently under MGS. Additionally, we devise a dynamic pipeline-interleaving dataflow to achieve on-chip layer fusion, which reduces processing latency and maximizes data reuse. Experimental results demonstrate that, at similar computational complexity, our highly structured MGS algorithm achieves comparable or even better accuracy than previous compression methods. Moreover, compared to existing FPGA-based accelerators for Transformers, our design achieves 1.80x to 6.52x and 1.16x to 12.05x improvements in throughput and energy efficiency, respectively.
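To make the local N:M token pruning idea concrete, here is a small sketch: within every group of M tokens, keep only the N highest-scoring ones, which yields the hardware-friendly structured pattern the abstract describes. The scoring function and grouping are assumptions and may differ from the paper's.

```python
import numpy as np

def nm_token_prune(scores, n=2, m=4):
    # Local N:M token pruning in spirit: per group of m tokens,
    # keep the n tokens with the largest importance scores.
    scores = np.asarray(scores, dtype=float)
    keep = np.zeros(scores.shape, dtype=bool)
    for start in range(0, len(scores) - len(scores) % m, m):
        group = scores[start:start + m]
        top = np.argsort(group)[-n:]   # indices of the n largest scores
        keep[start + top] = True
    return keep

mask = nm_token_prune(np.random.rand(16))  # boolean keep-mask over tokens
```

Because every group retains exactly N tokens, all processing elements receive equal work, avoiding the load imbalance of unstructured pruning.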
The costly multiplications challenge the deployment of modern deep neural networks (DNNs) on resource-constrained devices. To promote hardware efficiency, prior works have built multiplication-free models. However, they are generally inferior to their multiplication-based counterparts in accuracy, calling for multiplication-reduced hybrid models that marry the benefits of both approaches. To achieve this goal, recent works, i.e., NASA and NASA+, have developed Neural Architecture Search (NAS) and acceleration frameworks to search for and accelerate such hybrid models via a tailored differentiable NAS (DNAS) engine and dedicated ASIC-based accelerators. In this paper, we delve deeper into the inherent advantages of FPGAs and present an enhanced approach called NASA-F, which focuses on FPGA-oriented search and acceleration for hybrid models. Specifically, at the algorithm level, we develop a tailored one-shot supernet-based NAS engine to streamline the search for hybrid models, eliminating the need to execute NAS for each deployment as well as additional training/finetuning steps. At the hardware level, we develop a chunk-based accelerator to fully leverage the diverse hardware resources available on FPGAs for accelerating the heterogeneous layers in hybrid models, enhancing both hardware utilization and throughput. Extensive experimental results consistently validate the superiority of our NASA-F framework; e.g., we gain ↑0.67% top-1 accuracy over the prior work NASA on CIFAR100 even without additional training steps for searched models. Additionally, we achieve up to ↑1.86x throughput and ↑2.16x FPS with ↑0.39% top-1 accuracy over the state-of-the-art multiplication-based system on Tiny-ImageNet. Codes are available at https://***/shihuihong214/NASA-F.
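A one-shot supernet search, as named in the abstract, amounts to sampling subnets that share trained supernet weights and scoring them without per-candidate retraining. The sketch below uses made-up operator names and a toy scoring proxy purely to illustrate that loop; it is not NASA-F's search space or objective.

```python
import random

# Hypothetical per-layer operator pools for a multiplication-reduced
# hybrid model: each layer picks a multiplication-based conv or a
# multiplication-free alternative (operator names are invented).
SEARCH_SPACE = [["conv3x3", "shift3x3", "adder3x3"] for _ in range(8)]

def sample_subnet(space):
    # One-shot NAS: subnets inherit supernet weights, so search reduces
    # to sampling architectures and scoring them, with no retraining.
    return [random.choice(ops) for ops in space]

best_arch, best_score = None, float("-inf")
for _ in range(100):
    arch = sample_subnet(SEARCH_SPACE)
    score = -arch.count("conv3x3")  # toy proxy: prefer fewer multiplications
    if score > best_score:
        best_arch, best_score = arch, score
```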
Vision Transformers (ViTs) have achieved remarkable success in computer vision (CV) and are increasingly recognized as the new backbone for vision-language multi-modal tasks. Despite their success, the high computational cost associated with ViTs hinders their inference efficiency. In this paper, we introduce BSViT, a bit-serial Vision Transformer accelerator enhanced by algorithm-hardware co-design. BSViT can efficiently accelerate both plain and hierarchical Vision Transformer inference. At the algorithm level, we propose a post-training quantization scheme named dynamic patch and weight bit-group quantization. We first introduce a dynamic patch quantization (DPQ) scheme to dynamically allocate bit-width to different image patches based on their importance, thus reducing bit-width and saving computation without significantly impacting accuracy. Second, we propose a weight bit-group quantization (BGQ) scheme to evenly distribute bits within groups and achieve workload balance across processing elements (PEs). At the hardware level, we propose a term-separate bit-serial accelerator to efficiently support DPQ and BGQ. We introduce dense and sparse bit-serial PEs to handle the dense least-significant-term (LST) and sparse most-significant-term (MST) workloads. A dense-sparse hybrid dataflow is devised to efficiently balance the two kinds of workloads. Our experiments show that BSViT achieves up to 1.95x speedup and 2.72x energy efficiency compared to state-of-the-art (SOTA) bit-serial accelerators and up to 3.69x energy efficiency compared to SOTA Transformer accelerators.
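The core of DPQ is a per-patch bit-width decision driven by importance. The sketch below assigns a high bit-width to the top fraction of patches and a low bit-width to the rest; the importance metric, the two bit-widths, and the keep ratio are illustrative assumptions, not BSViT's exact rule.

```python
import numpy as np

def dynamic_patch_bits(patch_scores, low_bits=4, high_bits=8, keep_ratio=0.25):
    # DPQ in spirit: important patches get more bits, the rest get fewer,
    # trading a little accuracy for large bit-serial compute savings.
    scores = np.asarray(patch_scores, dtype=float)
    k = max(1, int(len(scores) * keep_ratio))
    threshold = np.sort(scores)[-k]  # score of the k-th most important patch
    return np.where(scores >= threshold, high_bits, low_bits)

bits = dynamic_patch_bits(np.random.rand(196))  # e.g., a 14x14 patch grid
```

In a bit-serial datapath, each saved bit of precision directly removes one processing cycle per patch, which is why per-patch allocation translates into speedup.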
This letter presents an ultralow-power (ULP) H.264/AVC intra-frame image compression accelerator tailored for intelligent event-driven ULP IoT imaging systems. The H.264/AVC intra-frame codec is customized to enable compression of arbitrary nonrectangular change-detected regions. To optimize the energy and latency of image memory accesses, novel algorithm-hardware co-designs are proposed for intra-frame predictions, reducing the overhead of neighbor macroblock (McB) accesses by 2.6x at a negligible quality loss. With split control of the major processing phases, latency is optimized by exploiting data dependency and pipelining. The area and leakage of major computation units are reduced through datapath micro-architecture reconfiguration. Fabricated in 40 nm, the accelerator occupies a mere 0.32 mm² with 4 kB of SRAM. At 0.6 V and 153 kHz, it consumes only 1.21 µW, with a compression energy efficiency of 30.9 pJ/pixel that rivals state-of-the-art designs. For an event-driven IoT imaging system, the combination of the proposed accelerator and change detection brings a 133x reduction in the overall energy for compressing an image of a change-detected region of interest.
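For readers unfamiliar with intra-frame prediction, the neighbor-macroblock accesses being optimized come from modes like the textbook DC prediction sketched below, where a block is predicted from reconstructed neighbor samples and only the residual is coded. This is a standard H.264 mode, not the paper's customized non-rectangular region handling.

```python
import numpy as np

def dc_intra_predict(left_col, top_row):
    # H.264-style DC intra prediction for a 16x16 macroblock: predict
    # every pixel as the mean of the reconstructed neighbor samples.
    neighbors = np.concatenate([left_col, top_row])
    return np.full((16, 16), np.round(neighbors.mean()))

pred = dc_intra_predict(np.arange(16), np.arange(16))
```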
Large Language Models (LLMs) have achieved significant success in various Natural Language Processing (NLP) tasks, becoming essential to modern intelligent computing. However, their large memory footprint and high computational cost hinder efficient deployment. Post-Training Quantization (PTQ) is a promising technique to alleviate this issue and accelerate LLM inference. However, the presence of outliers impedes the advancement of LLM quantization to lower bit levels. In this paper, we introduce OFQ-LLM, an algorithm-hardware co-design solution that adopts outlier-flexing quantization to efficiently accelerate LLMs at low bit levels. The key insight of OFQ-LLM is that normal data can be efficiently quantized in a slightly reduced data encoding space, while the remaining encoding space can be used for flexible outlier values. During quantization, we use rescale-based clipping (RBC) to optimize accuracy for normal data and group outlier clustering (GOC) to flexibly represent outlier values. At the hardware level, we introduce a memory-aligned outlier-flexing encoding scheme to encode the activations and weights of LLMs at a low bit level. An outlier-normal mixed hardware architecture is devised to leverage the encoding scheme and accelerate LLMs with high speed and high energy efficiency. Our experiments show that OFQ-LLM achieves better accuracy compared to state-of-the-art (SOTA) low-bit LLM PTQ works. The OFQ-LLM-based accelerator surpasses SOTA outlier-aware accelerators by up to 2.69x core energy efficiency, up to 3.83x speedup and 2.44x energy reduction in the LLM prefilling phase, and up to 2.01x speedup and 2.88x energy reduction in the LLM decoding phase, with superior accuracy.
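The "slightly reduced encoding space" idea can be sketched directly: quantize normal values with a few integer codes held back, and let those reserved codes index a small table of full-precision outliers. The paper's RBC clipping and GOC clustering are simplified away here, and the outlier-selection rule is an assumption.

```python
import numpy as np

def outlier_flexing_quantize(x, bits=4, n_outliers=2):
    # Sketch of the outlier-flexing idea only: normal values use a
    # reduced code space; the freed top codes point into an outlier table.
    x = np.asarray(x, dtype=float)
    n_codes = 2 ** bits - n_outliers            # codes left for normal data
    order = np.argsort(np.abs(x))
    out_idx = order[-n_outliers:]               # largest magnitudes -> outliers
    scale = np.abs(x[order[:-n_outliers]]).max() / (n_codes // 2)
    codes = np.clip(np.round(x / scale), -(n_codes // 2), n_codes // 2)
    outlier_table = x[out_idx]                  # kept at full precision
    codes[out_idx] = n_codes + np.arange(n_outliers)  # reserved code values
    return codes.astype(int), scale, outlier_table

codes, scale, table = outlier_flexing_quantize(np.random.randn(64))
```

Because outliers no longer stretch the quantization range, the normal values keep fine resolution even at 4 bits, which is the accuracy lever the abstract points to.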
This paper proposes an algorithm-hardware co-design of an event-driven spiking neural network (SNN) accelerator for classification tasks on event-based data from dynamic vision sensors (DVS), which can implement a feed-forward SNN with a maximum network size of 1 million synapses. Configurable structured sparsity is introduced between the first and second layers to improve energy efficiency and balance the workload across different processing elements (PEs). The number of available neurons in the accelerator is sparsity-dependent and ranges from 1024 to 4096. A modified leaky integrate-and-fire (LIF) neuron model and an event-driven neuron update scheme are employed in both the algorithm and the hardware to fully utilize the natural sparsity of the DVS event stream. An early-stop inference strategy on the hardware enables a trade-off between inference accuracy and efficiency. A three-layer fully connected SNN is trained through backpropagation through time (BPTT) and is implemented and evaluated on a Xilinx ZCU104 FPGA. Our design achieves 96.0% accuracy on the N-MNIST dataset and 79.0% accuracy on the DVS128-Gesture dataset, both at 50% sparsity. The peak performance of the accelerator on the ZCU104 is 3.22 GSOP/s and 3.99 GSOP/W at 250 MHz.
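For reference, a vanilla LIF update is shown below; the paper uses a modified LIF with event-driven updates (the neuron state only changes when input spikes arrive), and the leak factor and threshold here are placeholder values.

```python
def lif_step(v, input_current, leak=0.9, threshold=1.0):
    # One timestep of a leaky integrate-and-fire neuron:
    # decay the membrane potential, integrate input, fire on threshold.
    v = leak * v + input_current
    spike = v >= threshold
    if spike:
        v = 0.0            # reset membrane potential on firing
    return v, spike

v, spikes = 0.0, []
for current in [0.3, 0.5, 0.6, 0.1]:
    v, s = lif_step(v, current)
    spikes.append(s)
```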
With the growing demand for processing deep learning applications on edge devices, on-device DNN training has become a major workload for executing a variety of vision tasks suited to users. Accordingly, architectures employing algorithm-hardware co-design to accelerate the training process have been steadily studied. However, previous solutions are mostly extended versions of inference studies, covering sparsity, dataflow, quantization, etc. Moreover, most works examine their schemes on from-scratch training, which cannot tolerate inaccurate computing. Consequently, factors that hinder the overall speed of the DNN training process remain unaddressed in practical workloads. In this work, we propose a runtime convergence monitor to achieve massive computational savings in practical on-device training workloads (i.e., transfer-learning-based task adaptation). By monitoring the network output data, we determine the training intensity of incoming tasks and adaptively detect convergence over iteration intervals when training diverse datasets. Furthermore, we enable the computation skip of converged images, determined by the monitored prediction probability, to enhance the training speed within an iteration. As a result, we achieve accurate but fast convergence in model training for task adaptation with minimal overhead. Unlike previous approximation methods, our monitoring system enables runtime optimization and is easily applicable to any type of accelerator, attaining significant speedup. Evaluation results on various datasets show a geometric mean speedup of 2.2x when applied to any systolic architecture, and a further enhancement of 3.6x when applied to accelerators dedicated to on-device training.
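The per-image computation skip can be sketched as a simple filter over a batch: images whose monitored prediction probability already exceeds a confidence threshold are treated as converged and dropped from further iterations. The threshold and skip rule below are illustrative, not the paper's exact policy.

```python
def filter_unconverged(batch, pred_probs, skip_threshold=0.95):
    # Monitor-style computation skip in spirit: only samples the model is
    # still unsure about continue to consume training compute.
    return [img for img, p in zip(batch, pred_probs) if p < skip_threshold]

survivors = filter_unconverged(["img0", "img1", "img2"], [0.99, 0.62, 0.97])
# -> only "img1" remains in the training iteration
```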
ISBN (print): 9798350348606; 9783981926385
Spiking neural networks (SNNs) offer a promising alternative to traditional analog neural networks (ANNs), especially for sequential tasks, with enhanced energy efficiency. The internal memory that SNNs obtain through the membrane potential equips them with innate, lightweight temporal processing capabilities. However, the unique advantages of this temporal dimension of SNNs have not yet been effectively harnessed. To that end, this article delves deeper into the what, why, and where of SNNs. By considering event-based optical flow as an exemplary task in vision-based navigation, we highlight that the true potential of SNNs lies in sequential tasks. The event-driven recurrent dynamics of a spiking neuron, merged harmoniously with event-camera inputs, enable SNNs to outperform corresponding ANNs with fewer parameters for optical flow. Furthermore, we demonstrate that SNNs can be synergistically combined with ANNs to form SNN-ANN hybrids that obtain the best of both worlds in terms of accuracy, energy, memory, and training efficiency. Additionally, the emergence of various near-memory and in-memory computing techniques has propelled efficient implementation of these approaches. Overall, the immediate future of SNNs looks exciting as we discover their niche: sequential tasks with low power requirements.