Details
ISBN (digital): 9798350330991
ISBN (print): 9798350331004
This paper proposes a hardware-algorithm co-design of an event-driven Spiking Neural Network (SNN) accelerator with structured sparsity for Dynamic Vision Sensor (DVS) applications. The accelerator accommodates up to 1024 neurons and 1 million synapses for a feed-forward, fully connected SNN implementation. Configurable structured sparsity is introduced through modular arithmetic in both the algorithm and the hardware to improve energy efficiency, reduce the memory requirement, and balance the workload across processing elements. With an event-driven neuron update scheme, the accelerator fully exploits the structured sparsity and can directly process DVS output data for classification tasks without encoding. A three-layer SNN trained with backpropagation through time (BPTT) is implemented on a Xilinx ZCU104 FPGA, achieving 96% accuracy on the N-MNIST dataset and 79% accuracy on the DVS-Gesture dataset, both at 50% sparsity. The peak performance of the accelerator on the ZCU104 is 3.82 GSOP/s at 250 MHz, with an energy efficiency of 5.31 GSOP/W.
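To illustrate the kind of modular-arithmetic connectivity rule the abstract describes, a minimal sketch follows. The specific rule `(i + j) mod k < r`, the function name, and the parameters are assumptions for illustration, not the paper's exact scheme; the sketch only shows how a modular rule yields configurable, workload-balanced structured sparsity.

```python
import numpy as np

def modular_sparsity_mask(n_pre, n_post, k=2, r=1):
    """Hypothetical structured-sparsity mask: synapse (i, j) is kept only
    when (i + j) mod k < r, giving sparsity 1 - r/k (50% for k=2, r=1).
    Every row and every column keeps exactly r out of each k synapses, so
    the workload is balanced across processing elements by construction."""
    i = np.arange(n_pre)[:, None]   # presynaptic indices as a column
    j = np.arange(n_post)[None, :]  # postsynaptic indices as a row
    return ((i + j) % k) < r        # boolean connectivity mask

mask = modular_sparsity_mask(1024, 1024)
print(mask.mean())  # fraction of synapses kept: 0.5
```

Because the kept positions follow directly from the indices, hardware need not store the mask explicitly, which is one way such a rule can reduce memory requirements.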
Transformer-based models have achieved huge success in various artificial intelligence (AI) tasks, e.g., natural language processing (NLP) and computer vision (CV). However, they suffer from high computational density, making them difficult to deploy on resource-constrained devices such as field-programmable gate arrays (FPGAs). Within the overall transformer pipeline, self-attention contributes most of the computation load and becomes the bottleneck of transformer-based models. In this paper, we propose TransFRU, a novel FPGA-based accelerator for the self-attention mechanism with full utilization of hardware resources. Specifically, we first leverage 4-bit and 8-bit processing elements (PEs) to pack multiple signed multiplications into one DSP block. Second, we skip the zero and near-zero values in the intermediate results of self-attention using a sorting engine. The sorting engine is also responsible for operand sharing to boost the computation efficiency of each DSP block. Experimental results show that TransFRU achieves $7.86-49.16 \times$ speedup and $151.1 \times$ energy efficiency compared with a CPU, and $1.41 \times$ speedup and $5.9 \times$ energy efficiency compared with a GPU. Furthermore, we observe $1.91-13.56 \times$ better throughput per DSP block and $3.53-9.62 \times$ better energy efficiency compared with previous FPGA accelerators.
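The operand-packing idea behind multi-multiply DSP utilization can be sketched arithmetically. The packing width `k`, the helper name, and the restriction to non-negative operands are assumptions for illustration; real signed packing on DSP blocks (as TransFRU performs) additionally requires a sign-correction step that the abstract does not detail.

```python
def packed_mul(a1, a2, b, k=16):
    """Sketch of operand packing: a single wide multiplier computes
    (a1 * 2**k + a2) * b = a1*b * 2**k + a2*b, so two small products are
    recovered from one DSP-style multiplication. Correct here only for
    non-negative operands with a2*b < 2**k (the two result fields must
    not overlap); signed operands need an extra correction in hardware."""
    wide = (a1 << k) | a2        # pack both multiplicands into one word
    prod = wide * b              # one hardware multiply
    p2 = prod & ((1 << k) - 1)   # low field holds a2*b
    p1 = prod >> k               # high field holds a1*b
    return p1, p2

print(packed_mul(7, 5, 9))  # (63, 45)
```

With 4-bit or 8-bit operands, the fields stay far apart in a 27×18-bit-class multiplier, which is what makes packing several multiplications per DSP block feasible.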
Details
ISBN (digital): 9798350330991
ISBN (print): 9798350331004
With the increasing resolution of Dynamic Vision Sensors (DVS), efficient compression algorithms for event streams are urgently needed. Conventional DVS systems encode output event data in address event representation (AER) while ignoring the data redundancy arising from the correlation between events. To address this challenge, this paper first analyzes the spatiotemporal characteristics of the event stream and the impact of the readout circuits. Based on this analysis, context-based encoding strategies for the spatial address, timestamp, and polarity of events are proposed, taking the data flow in DVS hardware into consideration. In addition, a highly parallel hardware architecture is presented to implement the compression algorithm, achieving high throughput at an affordable cost. The hardware is implemented in a 55-nm process as part of a 512×512-resolution DVS. The experimental results demonstrate that our method achieves a higher average compression ratio than conventional and DVS-specific coding algorithms.
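A minimal sketch of why event correlation enables compression, using plain first-order delta coding of timestamps. This is not the paper's context-based coder; the function names and the choice of delta coding are illustrative assumptions. The point is that correlated event timestamps produce small deltas clustered near zero, which a downstream context/entropy coder can then compress well.

```python
def delta_encode_timestamps(ts):
    """Store the first timestamp absolutely, then first-order differences.
    Consecutive DVS events are temporally correlated, so the deltas are
    small and highly compressible by an entropy coder."""
    return [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]

def delta_decode_timestamps(deltas):
    """Lossless inverse: running sum over the deltas."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ts = [1000, 1003, 1003, 1010, 1011]
enc = delta_encode_timestamps(ts)
print(enc)  # [1000, 3, 0, 7, 1]
assert delta_decode_timestamps(enc) == ts
```

The same idea extends to spatial addresses and polarity, where readout order imposes additional structure that a context model can exploit.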
Details
ISBN (digital): 9798350372434
ISBN (print): 9798350372441
Stable Diffusion has become one of the mainstream image synthesis algorithms. Its mainstream computing platform is the GPU; however, deploying Stable Diffusion on GPUs still faces power consumption problems. With dedicated hardware design and optimization, an FPGA-based Stable Diffusion accelerator can achieve better energy efficiency. In this paper, we propose SDAcc for efficient inference of Stable Diffusion on FPGA. SDAcc is 4.40× faster than a CPU, and achieves 1.27× and 19.66× energy efficiency improvements over a GPU and a CPU, respectively.
Instant-NGP is the state-of-the-art (SOTA) algorithm for Neural Radiance Fields (NeRF) and shows great potential for adoption in AR/VR. However, the high memory and computation cost limits Instant-NGP's implementation on edge devices. In light of this, we propose Booth-NeRF, a novel FPGA-based accelerator designed to reduce power consumption. Booth-NeRF adopts a fully pipelined design built upon the Booth algorithm. In addition, it introduces a new instruction set to accommodate Multi-Layer Perceptrons (MLPs) of different sizes, ensuring flexibility and efficiency. Moreover, we propose an FPGA-friendly multiplier architecture for matrix multiplication that can perform exact or approximate multiplication using the Booth algorithm and the select-shift-add technique. Evaluations on a Xilinx Kintex XC7K325T board show that Booth-NeRF achieves $2.20\times$ speedup and $1.31\times$ energy efficiency compared with the NVIDIA Jetson Xavier NX-16G GPU.
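For readers unfamiliar with the Booth algorithm the multiplier builds on, a minimal radix-2 sketch follows. This is the generic textbook recoding, not Booth-NeRF's actual datapath; the function name and bit width are assumptions. Booth recoding replaces each run of 1s in the multiplier with one add and one subtract, which maps naturally onto select-shift-add hardware.

```python
def booth_multiply(m, r, bits=16):
    """Radix-2 Booth multiplication of m by a non-negative r-bit multiplier.
    Scans bit pairs (r[i], r[i-1]): 10 starts a run of 1s (subtract m<<i),
    01 ends a run (add m<<i), 00/11 do nothing."""
    prev = 0
    acc = 0
    for i in range(bits):
        bit = (r >> i) & 1
        if (bit, prev) == (0, 1):    # run of 1s just ended: add m << i
            acc += m << i
        elif (bit, prev) == (1, 0):  # run of 1s just started: subtract m << i
            acc -= m << i
        prev = bit
    if prev == 1:                    # run extends past the top scanned bit
        acc += m << bits
    return acc

print(booth_multiply(13, 11))  # 143
```

An approximate variant, as in select-shift-add multipliers, can truncate low-order partial products of this recoding to trade a small error for area and power.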
A high-frequency broadband current-mode logic static divider circuit fabricated in a 40-nm CMOS process is presented. In the proposed circuit, inductive peaking is utilized to improve the locking frequency and working...
详细信息
In this work, a lateral PIN diode is fabricated on an RF-SOI substrate. The diode is further used in UV photodetection and high-speed optical communication. It exhibits excellent linearity under various light i...
详细信息
Field-programmable gate array (FPGA) macro placement plays a crucial role in the FPGA physical design flow, since it substantially influences the subsequent stages of cell placement and routing. With the increasing...
详细信息
This paper presents a 6-bit 800-MS/s successive approximation register (SAR) analog-to-digital converter (ADC) in 28-nm CMOS with a grouped digital-to-analog converter (DAC) capacitor array. High-speed operation is achiev...
详细信息
In this work, we explored an efficient automatic layout routing algorithm for connecting the power and ground pins in analog integrated circuits. A rectilinear minimal spanning tree (RMST) algorithm for two sets of pi...
详细信息