Computation of the inner products is frequently used in machine learning (ML) algorithms apart from signal processing and communication applications. distributedarithmetic (DA) has been frequently employed for area-t...
详细信息
ISBN:
(纸本)9798350330991;9798350331004
Computation of the inner products is frequently used in machine learning (ML) algorithms apart from signal processing and communication applications. distributedarithmetic (DA) has been frequently employed for area-time efficient inner-product implementations. In conventional DA-based architectures, one of the vectors is constant and known a priori. Hence, the traditional DA architectures are not suitable when both vectors are variable. However, computing the inner product of a pair of variable vectors is frequently used for matrix multiplication of various forms and convolutional neural networks. In this paper, we present a novel DA-based architecture for computing the inner product of variable vectors. To derive the proposed architecture, the inner product of any given length is decomposed into a set of short-length inner products, such that the inner product could be computed by successive accumulation of the results of shortlength inner products. We have designed a DA-based architecture for the computation of the short-length inner-product of variable vectors and used that in successive clock cycles to compute the whole inner-product by successive accumulation. The post-layout synthesis results using Cadence Innovus with a GPDK 90nm technology library show that the proposed DA-based parallel architecture offers significant advantages in area-delay product and energy consumption over the bit-serial DA architecture.
In this letter, an area-efficient architecture for the hardware implementation of the real-time prime factor Fourier transform (PFFT) is presented. In the proposed architecture, a prime length DFT module with the one-...
详细信息
In this letter, an area-efficient architecture for the hardware implementation of the real-time prime factor Fourier transform (PFFT) is presented. In the proposed architecture, a prime length DFT module with the one-point-per-cycle (OPPC) property is implemented by the parallel distributed arithmetic (DA), and a cyclic convolution feature is exploited to simplify the structure of the DA cells. Based on the proposed architecture, a real-time 65-point PFFT processor is designed, and the synthesis results show that it saves over 8% gates compared to the existing real-time 64-point DFT designs.
暂无评论