The field-programmable gate array (FPGA) is an ideal candidate for accelerating graph neural networks (GNNs). However, FPGA reconfiguration is a time-consuming process when updating or switching between diverse GNN models across different applications. This paper proposes Graph-OPU, a highly integrated FPGA-based overlay processor for GNN acceleration. Graph-OPU provides excellent flexibility and software-like programmability for GNN end-users, as the executable code of GNN models is automatically compiled and reloaded without requiring FPGA reconfiguration. First, we customize the instruction sets for the inference process of different GNN models. Second, we propose a microarchitecture that ensures a fully pipelined GNN inference process. Third, we design a unified matrix multiplication unit that handles both sparse-dense matrix multiplication and general matrix multiplication to increase Graph-OPU performance. Finally, we implement a hardware prototype on the Xilinx Alveo U50 and test mainstream GNN models on various datasets. Graph-OPU takes an average of only 2 minutes to switch between different GNN models, an average 128× speedup compared with related works. In addition, Graph-OPU outperforms state-of-the-art end-to-end overlay accelerators for GNNs, reducing latency by an average of 1.36× and improving energy efficiency by an average of 1.41×. Moreover, Graph-OPU achieves up to 1654× and 63× speedup, as well as up to 5305× and 422× energy-efficiency gains, compared with CPU and GPU implementations, respectively. To the best of our knowledge, Graph-OPU represents the first in-depth study of an FPGA-based overlay processor for GNNs, offering high flexibility, speedup, and energy efficiency.
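The abstract does not disclose the unified datapath itself. As a rough software analogy, the sketch below runs both SpMM (sparse adjacency × dense features) and GEMM (dense × dense) through one CSR-style inner loop; the CSR framing, function names, and test matrices are illustrative assumptions, not Graph-OPU's design.

```python
# Minimal sketch of a "unified matrix multiplication": one routine serves both
# SpMM and GEMM by expressing every left operand in a CSR-like form.
import numpy as np

def to_csr(dense, keep_zeros=False):
    """Convert a dense matrix to (indptr, indices, data) CSR arrays."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if keep_zeros or v != 0.0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data, dtype=np.float64)

def unified_matmul(indptr, indices, data, B):
    """Row-wise product A @ B with A in CSR form; the identical loop handles a
    sparse A (SpMM) or a fully populated A (GEMM)."""
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            out[i] += data[k] * B[indices[k]]    # accumulate one partial row
    return out

A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])   # sparse adjacency
X = np.random.rand(3, 4)                                   # dense features
assert np.allclose(unified_matmul(*to_csr(A), X), A @ X)   # SpMM path
W = np.random.rand(3, 3)
assert np.allclose(unified_matmul(*to_csr(W, keep_zeros=True), X), W @ X)  # GEMM path
```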
The Dynamic Vision Sensor (DVS) is a new type of bionic vision image sensor that offers the advantages of low latency, low power consumption, and high dynamic range compared to conventional sensors. However, background activity (BA) noise degrades the quality of the DVS output data and leads to unnecessary bandwidth overhead. In dark environments, the pixel array generates abundant noise, and conventional spatiotemporal filters can hardly achieve satisfactory results. To solve this problem, we exploit the difference in event-density distribution between actual events and noise, and propose a denoising method that uses the event densities within two neighborhoods of different radii. Compared to spatiotemporal filters, our approach reduces the error rate on synthetic datasets by at least 35%. Meanwhile, our results are subjectively more visually appealing. With our denoising method, DVS performance in dark conditions is improved.
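The exact decision rule and thresholds are not given in the abstract; the sketch below only illustrates the general two-radius idea of comparing event densities in a small and a large neighborhood, with assumed parameter values and event format.

```python
# Illustrative two-radius denoising sketch (parameters are assumptions).
def two_radius_denoise(events, r_small=1, r_large=3, dt=10_000, ratio=0.5):
    """events: iterable of (x, y, t, p). Keep an event if the density inside the
    small neighborhood is a large enough fraction of the density inside the
    large neighborhood over the last dt microseconds."""
    kept, history = [], []                       # history holds recent events
    for x, y, t, p in events:
        history = [(hx, hy, ht) for hx, hy, ht in history if t - ht <= dt]
        near = sum(1 for hx, hy, _ in history
                   if max(abs(hx - x), abs(hy - y)) <= r_small)
        far = sum(1 for hx, hy, _ in history
                  if max(abs(hx - x), abs(hy - y)) <= r_large)
        # Real activity clusters tightly, so its inner/outer density ratio is
        # high; isolated BA noise spreads its few neighbors over the large ring.
        if far > 0 and near / far >= ratio:
            kept.append((x, y, t, p))
        history.append((x, y, t))
    return kept
```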
Dynamic vision sensors (DVS) have significant potential in scenes involving high-speed motion and extreme light. However, DVS is sensitive to background activity noise, which degrades the quality of the output. The ordinary $O(N^{2})$-space spatiotemporal filter has high memory complexity: it needs $N\times N$ memory cells, where $N\times N$ is the sensor resolution. Some works reduce memory complexity at the cost of filtering performance. To preserve the filtering effect while reducing memory complexity, this paper proposes a novel filter: the queue-based spatiotemporal filter. Moreover, building on the queue-based spatiotemporal filter, this paper proposes a clustering algorithm that clusters while filtering. Experiments show that the proposed filter's performance is similar to that of the $O(N^{2})$-space spatiotemporal filter while having lower memory complexity. In addition, using the proposed clustering algorithm, objects in motion can be clustered with low computational complexity.
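The paper's queue policy and parameters are not specified in the abstract; the following sketch only conveys the idea of replacing the $N\times N$ timestamp map with one bounded FIFO of recent events, with assumed depth, radius, and time window.

```python
# Queue-based spatiotemporal filtering sketch: memory is O(depth), not O(N^2).
from collections import deque

def queue_filter(events, depth=64, radius=2, dt=5_000):
    """events: iterable of (x, y, t, p). An event survives if any of the last
    `depth` events lies within `radius` pixels and `dt` microseconds of it."""
    q, kept = deque(maxlen=depth), []
    for x, y, t, p in events:
        supported = any(abs(x - qx) <= radius and abs(y - qy) <= radius
                        and t - qt <= dt for qx, qy, qt in q)
        if supported:
            kept.append((x, y, t, p))
        q.append((x, y, t))                      # oldest entry is evicted automatically
    return kept
```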
In this paper, we present the Integer Lightweight Softmax (ILS) algorithm for approximating the Softmax activation function. An exact implementation of Softmax on an FPGA can be highly resource-intensive and memory-hungry. We implement ILS on a Xilinx XCKU040 FPGA to evaluate its effectiveness. Evaluations on CIFAR-10, CIFAR-100, and ImageNet show that ILS achieves up to $2.47\times$, $40\times$, and $323\times$ speedup over a CPU implementation, and $4\times$, $63\times$, and $51\times$ speedup over a GPU implementation, respectively. Compared with previous FPGA-based Softmax implementations, ILS strikes a better balance between resource consumption and accuracy.
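The ILS algorithm itself is not described in the abstract; the sketch below shows a common integer-friendly approximation in the same spirit, replacing $e^{x}$ with powers of two so that the exponential reduces to shifts on fixed-point values. All names and bit widths are assumptions.

```python
# Shift-based integer softmax approximation (illustrative, not the ILS design).
import numpy as np

def int_softmax_approx(logits_q, frac_bits=8):
    """logits_q: integer logits. Returns fixed-point probabilities with
    `frac_bits` fractional bits, using only integer operations."""
    x = np.asarray(logits_q, dtype=np.int64)
    shift = np.max(x) - x                        # non-negative, so 2^-shift <= 1
    one = 1 << frac_bits                         # fixed-point representation of 1.0
    pow2 = one >> np.minimum(shift, frac_bits)   # approximate 2^(x - max) by shifting
    denom = int(np.sum(pow2))
    return (pow2 * one) // denom                 # normalized fixed-point probabilities

print(int_softmax_approx([10, 8, 8, 2]) / 256)   # ~[0.66, 0.16, 0.16, 0.0], sums near 1.0
```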
Modern circuits often contain standard cells of different threshold voltages (multi-VTs) to achieve a better trade-off between timing and power consumption. Due to their heterogeneous cell structures, multi-VT cells impose various implant-layer constraints, further complicating the already time-consuming filler-cell insertion process. In this paper, we present a fast and near-optimal algorithm to solve the filler insertion problem with complex implant-layer rules and minimum filler-width constraints. We first propose an inference-driven detection algorithm to accurately identify each design-rule violation. Then, a dynamic-programming-based insertion method is developed to reduce implant-layer violations. Finally, we design a contour-driven violation refinement strategy to further improve manufacturability. Experimental results show that our algorithm significantly reduces the number of violations compared with state-of-the-art works. Moreover, with our violation identifier applied in the legalization stage, we can avoid conflicts in advance and resolve almost all violations after filler insertion in industrial cases.
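The abstract does not give the DP formulation; as a heavily simplified illustration of dynamic-programming-based filler selection under a minimum-width constraint (implant-layer rules omitted entirely), the sketch below tiles a single placement gap with fillers from a hypothetical library.

```python
# Toy DP sketch: choose filler cells to exactly tile a gap of width `gap`.
def fill_gap(gap, filler_widths, min_width):
    widths = sorted(w for w in filler_widths if w >= min_width)
    # best[w] = a list of filler widths tiling a sub-gap of width w, or None
    best = [None] * (gap + 1)
    best[0] = []
    for w in range(1, gap + 1):
        for fw in widths:
            if fw <= w and best[w - fw] is not None:
                best[w] = best[w - fw] + [fw]    # extend a feasible tiling
                break
    return best[gap]                             # None means the gap cannot be filled

print(fill_gap(7, [1, 2, 3], min_width=2))       # prints [3, 2, 2] (widths summing to 7)
```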
Depth completion aims to predict dense depth maps from sparse depth measurements provided by a depth sensor. Currently, Convolutional Neural Network (CNN) based models are the most popular methods applied to depth completion tasks. However, despite their excellent performance, they suffer from a limited receptive field. To overcome this drawback of CNNs, a more effective and powerful alternative has emerged: the Transformer, a sequence-to-sequence model built on adaptive self-attention. However, the computational cost of the standard Transformer's key-query dot-product grows quadratically with input resolution, which makes it poorly suited to depth completion tasks. In this work, we propose a window-based Transformer architecture for depth completion, named the Sparse-to-Dense Transformer (SDformer). The network consists of an input module that extracts and concatenates depth-map and RGB-image features, a U-shaped encoder-decoder Transformer that extracts deep features, and a refinement module. Specifically, we first concatenate the depth-map features with the RGB-image features through the input module. Then, instead of computing self-attention over the whole feature maps, we apply different window sizes to extract long-range depth dependencies. Finally, we refine the predicted features from the input module and the U-shaped encoder-decoder Transformer module to obtain enriched depth features, and employ a convolution layer to produce the dense depth map. In practice, SDformer achieves state-of-the-art results against CNN-based depth completion models on the NYU Depth V2 and KITTI DC datasets, with lower computational load and fewer parameters.
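As a minimal illustration of window-based self-attention (not the exact SDformer block), the sketch below attends only within non-overlapping windows, so the cost scales with the window size rather than the full resolution; the window size and the Q=K=V simplification are assumptions.

```python
# Window-based self-attention sketch over an (H, W, C) feature map.
import numpy as np

def window_attention(x, win=4):
    """x: (H, W, C) feature map, with H and W divisible by `win`."""
    H, W, C = x.shape
    out = np.empty_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            tokens = x[i:i + win, j:j + win].reshape(-1, C)   # (win*win, C) local tokens
            scores = tokens @ tokens.T / np.sqrt(C)           # simplified Q = K = V
            attn = np.exp(scores - scores.max(axis=1, keepdims=True))
            attn /= attn.sum(axis=1, keepdims=True)
            out[i:i + win, j:j + win] = (attn @ tokens).reshape(win, win, C)
    return out

feat = np.random.rand(8, 8, 16)
print(window_attention(feat).shape)              # (8, 8, 16)
```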
Symmetric Sparse Matrix-Vector Multiplication (SSpMV) is a prevalent operation in numerous application domains (e.g., physical simulations, machine learning, and graph processing). Existing research focuses on SSpMV implementations and their improvement on high-performance computing platforms but ignores resource-limited edge platforms because of two main challenges: memory-access overhead and limited feasible computing parallelism. To this end, this paper proposes an embedded-FPGA-based hardware accelerator for SSpMV, called eSSpMV. We first propose an optimized data format, named Symmetric Compressed Sparse Row (SCSR), to reduce memory consumption. Moreover, a fully pipelined computation unit is designed to be compatible with this data format. Experimental results show that eSSpMV outperforms the state-of-the-art FPGA implementation with a 2.9× speedup, while reducing computing resources by 39.3% (LUT) and 32.3% (DSP). Compared with edge CPU and GPU implementations, eSSpMV achieves a 9.3× speedup over the CPU and a 13.1× better power-latency product than the GPU.
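The SCSR layout itself is not detailed in the abstract; the sketch below only illustrates the underlying symmetry trick of storing one triangle of the matrix and letting each off-diagonal entry contribute to two output rows, which roughly halves the stored nonzeros.

```python
# Symmetric SpMV sketch with only the lower triangle stored in CSR form.
import numpy as np

def symmetric_spmv(indptr, indices, data, x):
    """y = A @ x for symmetric A, given only its lower triangle plus diagonal."""
    y = np.zeros_like(x, dtype=np.float64)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j, v = indices[k], data[k]
            y[i] += v * x[j]                     # lower-triangle contribution
            if j != i:
                y[j] += v * x[i]                 # mirrored upper-triangle term
    return y

# Lower triangle of [[4, 1, 0], [1, 5, 2], [0, 2, 6]]
indptr, indices = np.array([0, 1, 3, 5]), np.array([0, 0, 1, 1, 2])
data, x = np.array([4., 1., 5., 2., 6.]), np.array([1., 2., 3.])
print(symmetric_spmv(indptr, indices, data, x))  # [ 6. 17. 22.]
```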
Placement is one of the critical stages in the physical design of very large scale integrated circuits (VLSI), which has a significant impact on the performance of subsequent stages. Modern placement algorithms need t...
As designs grow to multi-billion transistors, synthesis runtime becomes an important issue, particularly for design verification and prototyping, where synthesis may be run many times as the design changes. Module-by-module synthesis with multi-threading is a natural solution for fast synthesis, but it comes at the cost of quality-of-results (QoR) degradation. Moreover, the multi-thread speedup can be limited by very uneven module sizes. In this paper, we propose a design-hierarchy-restructuring-based multi-thread synthesis algorithm for large-scale designs. Small modules are flattened and large modules are partitioned to create moderately sized design modules. Our experimental results show that our algorithm produces results within a 3% area increase while achieving a 21.3× speedup over the flat synthesis flow.
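As a toy illustration of the restructuring policy only, the sketch below flattens small modules into their parent job and splits large modules into pieces, yielding moderately sized synthesis jobs; the thresholds and the even-split partitioning are assumptions, not the paper's heuristics.

```python
# Hierarchy-restructuring sketch: balance per-thread synthesis job sizes.
def restructure(sizes, lo=1_000, hi=100_000):
    """sizes: dict module_name -> gate count. Returns synthesis jobs after
    flattening modules below `lo` and partitioning modules above `hi`."""
    jobs, absorbed = [], 0
    for name, size in sizes.items():
        if size < lo:
            absorbed += size                     # flattened into the parent job
        elif size > hi:
            parts = -(-size // hi)               # ceiling division
            jobs += [(f"{name}_part{i}", size // parts) for i in range(parts)]
        else:
            jobs.append((name, size))
    if absorbed:
        jobs.append(("flattened_top", absorbed))
    return jobs

print(restructure({"alu": 500, "core": 250_000, "fpu": 40_000}))
```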
In this work, we explore an efficient automatic layout routing algorithm for connecting the power and ground pins in analog integrated circuits. A rectilinear minimum spanning tree (RMST) algorithm for two sets of pins is developed, in which a minimum spanning tree forms the initial connections between pins. An obstacle-avoiding maze routing algorithm is then used to break and reconnect the power and ground nets to avoid short circuits. A genetic algorithm (GA) is further introduced to optimize the total wirelength. We also expand the wire width to mitigate electromigration and IR drop.
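Only the initial spanning-tree step is sketched below; the obstacle-avoiding maze rerouting and the GA wirelength optimization described above are omitted, and the pin coordinates are hypothetical.

```python
# Prim's minimum spanning tree over Manhattan distances for one set of pins.
def manhattan_mst(pins):
    """pins: list of (x, y). Returns MST edges as ((x1, y1), (x2, y2)) pairs."""
    in_tree, edges = {pins[0]}, []
    while len(in_tree) < len(pins):
        best = min(((a, b) for a in in_tree for b in pins if b not in in_tree),
                   key=lambda e: abs(e[0][0] - e[1][0]) + abs(e[0][1] - e[1][1]))
        edges.append(best)                       # cheapest edge leaving the tree
        in_tree.add(best[1])
    return edges

vdd_pins = [(0, 0), (4, 1), (1, 5), (6, 6)]      # hypothetical power-pin set
print(manhattan_mst(vdd_pins))
```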