Target detection is widely applied in fields such as face recognition, autonomous driving, and industrial automation. However, when deploying target detection models based on convolutional neural networks on resource-...
To solve the hardware deployment problem caused by the vast computational complexity of convolutional layers and the limited hardware resources available for network inference, a look-up table (LUT)-based convolution architecture is proposed, built on a field-programmable gate array (FPGA) using integer multipliers and addition trees. With the help of the Winograd algorithm, the convolution multiplications are optimized to reduce computational complexity. The LUT-based operator is further optimized to construct a processing unit (PE). Simultaneously, optimized storage streams improve memory-access efficiency and resolve bandwidth limitations, and the data toggle rate is reduced to lower power consumption. Experimental results show that using the Winograd algorithm to build the basic processing units significantly reduces the number of multipliers and accelerates hardware deployment, while time-division multiplexing of the processing units improves resource utilization. Under these experimental conditions, compared with the traditional convolution method, the architecture reduces computing-resource usage by 2.25 times and improves peak throughput by 19.3 times. The LUT-based Winograd accelerator can effectively solve the deployment problem caused by limited hardware resources.
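The 2.25x saving in computing resources is consistent with the arithmetic of Winograd's minimal filtering: F(2x2, 3x3) produces a 2x2 output tile with 16 elementwise multiplications instead of the 36 required by direct 3x3 convolution (36/16 = 2.25). Below is a minimal NumPy sketch of that tile computation, in floating point for clarity; the paper's hardware uses integer multipliers, which would require a fixed-point scaling of the G matrix.

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin-Gray formulation).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float64)

def winograd_f2x2_3x3(d, g):
    """One 2x2 output tile from a 4x4 input tile d and a 3x3 filter g.
    Uses 16 elementwise multiplications instead of the 36 needed by
    direct 3x3 convolution of a 2x2 output tile (36/16 = 2.25x)."""
    U = G @ g @ G.T          # 4x4 transformed filter
    V = B_T @ d @ B_T.T      # 4x4 transformed input tile
    M = U * V                # 16 multiplications (the only ones)
    return A_T @ M @ A_T.T   # 2x2 output tile

# Check against direct correlation on one tile.
d = np.random.randn(4, 4)
g = np.random.randn(3, 3)
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```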
Convolutional neural networks (CNNs) have proven to be promising in various applications such as audio recognition, image classification, and video understanding. The Winograd algorithm helps to reduce the computational complexity of a convolution but suffers from poor compatibility with different convolution shapes. This work introduces a dynamic dimension-level fusion architecture based on Winograd for accelerating CNNs of different dimensions. We explore this Winograd architecture by designing Dimension Fusion, a dimension-level processing engine that dynamically fuses to match the convolution shape of individual CNN layers. The proposed architecture is the first Winograd-based design compatible with all convolution shapes (dimension, stride, and filter size), and it achieves up to 1.55x higher PE efficiency and up to 3.3x higher energy efficiency compared with state-of-the-art accelerators.
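The abstract does not detail the fusion mechanism, but a standard prerequisite for shape-agnostic Winograd engines is reducing strided convolutions to stride-1 sub-convolutions, to which Winograd tiles apply directly. The following 1-D sketch illustrates that decomposition; the function name and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def conv1d_stride2_via_stride1(x, w):
    """Decompose a 1-D stride-2 convolution (correlation form) into two
    stride-1 sub-convolutions on the even/odd phases of the input and
    kernel. Each sub-convolution is stride-1 and Winograd-friendly."""
    x_e, x_o = x[0::2], x[1::2]   # even/odd input phases
    w_e, w_o = w[0::2], w[1::2]   # even/odd kernel taps
    n_out = (len(x) - len(w)) // 2 + 1
    y = np.zeros(n_out)
    for n in range(n_out):
        y[n] = (x_e[n:n+len(w_e)] @ w_e) + (x_o[n:n+len(w_o)] @ w_o)
    return y

x, w = np.random.randn(16), np.random.randn(3)
direct = np.array([x[2*n:2*n+3] @ w for n in range((len(x) - 3)//2 + 1)])
assert np.allclose(conv1d_stride2_via_stride1(x, w), direct)
```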
In this paper, a fast Fourier-like transform over GF(2^8) is developed to compute the syndromes of the transmitted codewords and the roots of the error-location polynomial. The new algorithm is based on the conjugates of GF(2^8) together with Winograd's algorithm and the Goertzel-Blahut (GB) algorithm. The simplified transform decoder over GF(2^8) is implemented in a program on a digital computer. It is expected that this new type of transform can be employed for syndrome evaluation in decoding the Reed-Solomon (RS) code of block length 255.
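For context, a syndrome S_i is simply the received polynomial evaluated at alpha^i over GF(2^8); the transform in the paper accelerates exactly this evaluation. Below is a minimal direct (Horner-rule) sketch for comparison, assuming the common primitive polynomial 0x11D; the paper's field polynomial is not given here, and the names are illustrative.

```python
def gf_mul(a, b, poly=0x11D):
    """Carry-less multiplication in GF(2^8), reduced by `poly`."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndromes(received, num_syndromes, alpha=0x02):
    """S_i = r(alpha^i) for i = 1..2t; all zero iff no detectable error.
    received[0] is taken as the highest-degree coefficient."""
    out = []
    for i in range(1, num_syndromes + 1):
        x = gf_pow(alpha, i)
        s = 0
        for coeff in received:      # Horner's rule: s = s*x + coeff
            s = gf_mul(s, x) ^ coeff
        out.append(s)
    return out

# A block-length-255 RS(255, 239) codeword has 16 syndromes (t = 8).
rx = [0] * 255
rx[10] = 0x37                       # inject a single byte error
print(syndromes(rx, 16)[:4])        # nonzero values -> error detected
```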
ISBN (Print): 9781479953424
Abstract: This paper presents an efficient memory-based fast Fourier transform (FFT) processor supporting 35 different transform sizes for LTE systems. A factorization method named high-radix-small-butterfly, combined with a conflict-free addressing scheme for 2^p·3^q·5^r-point memory-based FFT processing, is proposed. The processor provides both conflict-free concurrent data access to different memory banks and a continuous-flow working mode. Moreover, the prime-factor algorithm is exploited to reduce the number of multiplications and the twiddle-factor storage. In addition, a unified Winograd Fourier transform algorithm (WFTA) butterfly core was designed for the small 2-, 3-, 4-, and 5-point DFTs. The FFT processor was implemented in a SMIC 55 nm CMOS process with a core area of 1.063 mm^2. The chip consumes 40.8 mW at a 122.88 MHz operating frequency with a 1.08 V supply voltage.
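As an illustration of why WFTA cores suit the small DFTs, the Winograd 3-point DFT needs only two coefficient multiplications, versus the four complex multiplications of the direct form. A short sketch, checked against numpy.fft:

```python
import math
import numpy as np

def winograd_dft3(x0, x1, x2):
    """Winograd 3-point DFT: two coefficient multiplications total."""
    s1 = x1 + x2
    s2 = x1 - x2
    X0 = x0 + s1
    m1 = (math.cos(2 * math.pi / 3) - 1) * s1    # multiplication 1
    m2 = (-1j * math.sin(2 * math.pi / 3)) * s2  # multiplication 2
    u = X0 + m1                                  # = x0 + cos(2*pi/3)*s1
    return np.array([X0, u + m2, u - m2])

x = np.random.randn(3) + 1j * np.random.randn(3)
assert np.allclose(winograd_dft3(*x), np.fft.fft(x))
```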
ISBN (Print): 9781450361378
Transposed convolution, which is often used to scale up feature maps in various computer vision tasks, is a structural inverse of convolution. Where present, convolution and transposed convolution together account for the majority of the computation in deep neural network inference. While convolution has been studied extensively, there are few investigations into accelerating transposed convolution. In this paper, we propose a fast algorithm, FTConv, to reduce the computation of transposed convolution using the Winograd algorithm, which has also been used for convolution with small kernels. Specifically, a transposed convolution can be converted into multiple convolutions by dividing the kernel into several congruence classes. These convolutions can then be accelerated with a modified Winograd algorithm, and the transposed-convolution result is obtained by interleaving the output feature elements of each congruence class. We also design a Winograd ALU with four pipeline stages to further accelerate the computation on FPGA. By carefully designing a sliding window for on-chip buffer reuse according to the memory-access pattern of transposed convolution, we save 88.2% of memory bandwidth compared with a straightforward method. We evaluate FTConv on FSRCNN-s, a neural network for super-resolution; the number of multiplications in its transposed-convolution layer is reduced by 69% compared with direct computation.
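A 1-D sketch of the congruence-class idea described above: kernel taps with index k ≡ c (mod stride) form class c, each class reduces to an ordinary stride-1 convolution (where a Winograd kernel could then be applied), and the per-class outputs are interleaved. Names and shapes are illustrative, not from the FTConv implementation.

```python
import numpy as np

def transposed_conv1d_direct(x, w, stride=2):
    """Reference: scatter each input sample through the shifted kernel."""
    y = np.zeros(stride * (len(x) - 1) + len(w))
    for n, xn in enumerate(x):
        y[stride * n: stride * n + len(w)] += xn * w
    return y

def transposed_conv1d_congruence(x, w, stride=2):
    """Split kernel taps into congruence classes k = c (mod stride);
    each class is an ordinary stride-1 convolution, and the per-class
    outputs are interleaved back into the transposed-conv result."""
    y = np.zeros(stride * (len(x) - 1) + len(w))
    for c in range(stride):
        w_c = w[c::stride]                  # taps of congruence class c
        if len(w_c) == 0:
            continue
        y[c::stride] = np.convolve(x, w_c)  # stride-1: Winograd-friendly
    return y

x, w = np.random.randn(8), np.random.randn(3)
assert np.allclose(transposed_conv1d_direct(x, w),
                   transposed_conv1d_congruence(x, w))
```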