This manuscript proposes a high-throughput, low-latency DA-FIR filter design that integrates an Approximate Karatsuba Multiplier (AKM) and a Variable Latency Carry Skip Adder (VLCSA) for noise removal in SDR. The AKM and VLCSA are used to decrease the number of partial products in the DA framework, although no multiplication is explicitly performed, which yields a significant reduction in the accumulation circuits. The main implementation problem of DA is that the size of the lookup table (LUT) increases exponentially with the length of the inner product. To further reduce memory complexity, approximate DA architectures based on truncation are needed: partial products in DA are generated by truncating the least significant bits (LSBs) of the inputs. The proposed hybrid technique reduces the number of LUTs by truncating the LSBs of the input operands. To further reduce latency, the design employs one of the fastest multipliers, the Approximate Karatsuba Multiplier, for the accumulation of partial products. The proposed design is implemented in Verilog and simulated with Xilinx ISE 14.5. The performance of the proposed DA-FIR-Hyb AKE-VLCSA filter is evaluated in terms of delay and static power and compared with existing filters.
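As a rough illustration of the two DA properties mentioned in this abstract (a LUT with 2^N entries for an N-term inner product, and LSB truncation of the inputs), the following Python sketch models a plain bit-serial DA unit for unsigned samples. It is not the paper's architecture: it does not model the AKM or the VLCSA, and the names `build_lut`, `da_inner_product`, and `truncate_lsbs` are hypothetical. In a bit-parallel design each processed bit plane would typically get its own LUT, so skipping planes would remove LUTs; here it only skips iterations.

```python
def build_lut(coeffs):
    """Precompute all 2**N sums of coefficient subsets (the DA lookup table)."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(coeffs, samples, bits=8, truncate_lsbs=0):
    """Bit-serial DA inner product for unsigned samples of width `bits`.

    Bit planes below `truncate_lsbs` are skipped, which models truncating
    that many LSBs of every input (an illustrative assumption).
    """
    lut = build_lut(coeffs)
    acc = 0
    for b in range(truncate_lsbs, bits):
        # Build the LUT address from bit b of every sample.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        # Shift-accumulate: no multiplier is used anywhere.
        acc += lut[addr] << b
    return acc


coeffs = [3, -1, 4, 2]
samples = [200, 17, 96, 255]
exact = sum(c * x for c, x in zip(coeffs, samples))
full = da_inner_product(coeffs, samples)
approx = da_inner_product(coeffs, samples, truncate_lsbs=3)
print(exact, full, approx)
```

With no truncation the DA result equals the direct inner product; with three LSBs dropped it equals the inner product of the truncated samples, so the error is bounded by the discarded bit planes.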
ISBN: 9798350340570 (print)
Distributed arithmetic (DA) implementation of finite impulse response (FIR) filters on field-programmable gate arrays (FPGAs) is highly desirable in digital signal processing because of its fast computational speed and low power consumption. However, traditional LUT-based DA implementation on FPGAs is challenging because of its high memory requirements. LUT-partition and MUX-incorporation techniques have been proposed to reduce memory space, but they also increase FPGA resource utilization. Furthermore, the inherently serial nature of DA computation can limit data throughput. Processing multiple bits in parallel can improve computational performance, but at the cost of chip area. It is therefore beneficial to combine optimization methods to achieve the desired performance. This paper proposes a comprehensive approach for optimizing memory space, computational performance, and chip area by analyzing different LUT partitions and incorporating MUX configurations. The proposed method is evaluated on a Xilinx Zynq 7010 FPGA, demonstrating its effectiveness.
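The memory-versus-logic trade-off of LUT partitioning can be made concrete with a small counting sketch. It assumes the usual partitioning scheme in which the taps are split into groups, each group gets its own LUT, and the group outputs are summed by extra adders. The function names are hypothetical, only LUT words are counted (word width and MUX incorporation are ignored), and this is not the paper's evaluation method.

```python
def lut_words(taps, partition):
    """Total LUT words when `taps` inputs are split into groups of at most `partition`."""
    full_groups, rest = divmod(taps, partition)
    words = full_groups * (1 << partition)
    if rest:
        # A final, smaller group needs a smaller LUT.
        words += 1 << rest
    return words


def extra_adders(taps, partition):
    """Adders needed to combine the partial results of all groups."""
    groups = -(-taps // partition)  # ceiling division
    return groups - 1


for p in (16, 8, 4, 2):
    print(p, lut_words(16, p), extra_adders(16, p))
```

For a 16-tap filter, moving from one 16-input LUT to four 4-input LUTs cuts the table from 65536 words to 64, at the price of three additional adders.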
ISBN: 9781450393225 (print)
In this paper, an approximate distributed arithmetic (DA) based parallel MAC is proposed. First, by adopting three kinds of approximation methods, the novel structure significantly reduces hardware complexity. Then, the result is compensated according to a probability analysis to enhance precision. The hardware and error metric evaluation demonstrates that the proposed MAC achieves a 25% reduction in power-delay product while maintaining better precision. Finally, a Gaussian blur application is used to verify the proposed DA-based MAC, which shows a 6 dB average PSNR improvement compared with recent state-of-the-art work.
This paper presents an efficient two-dimensional (2-D) adaptive FIR filter architecture using the distributed arithmetic (DA) algorithm. DA-based filter architectures essentially require look-up tables (LUTs). In the proposed filter architecture, the RAM- or ROM-based LUT is replaced by a structure built from adders and logic gates that generates the LUT value corresponding to the input. The MAC unit therefore requires fewer logic gates and adders in the DA-based realization. In addition, memory sharing in the architecture reduces the number of delay elements. The complexity of the LUT hardware for higher-order filters is reduced by dividing the internal MAC block into parallel sections for the DA decomposition, which offers a higher degree of modularity and parallelism. The 2-D delayed LMS algorithm is used to update the filter coefficient weights. Two-stage pipelining is used to shorten the critical path of the architecture and also makes the critical-path delay independent of the filter order. ASIC synthesis results show that the proposed structure reduces area, power, ADP and EDP by 54%, 48.19%, 55% and 49%, respectively, compared with the existing architecture for filter size 8 x 8.
ISBN: 9781665424615 (print)
Approximate computing (AC) provides an efficient solution for reducing the power, area, and complexity of digital systems. When combined with distributed arithmetic (DA), AC makes it possible to implement inner-product units that are highly efficient in terms of area, power, and delay. Such units can be used in any inherently resilient application. This paper presents a novel approximate inner-product scheme based on parallel DA for low-power fault-tolerant applications, supported by a novel in-situ sliding window algorithm. Our model eliminates the need for an explicit error correction scheme, which further reduces the overhead while improving the accuracy. The experimental results show that our model achieves state-of-the-art performance in terms of power-delay product (PDP) and area-power product (APP), with reductions of 39.26% and 48.83%, respectively.
ISBN: 9781665440868 (print)
One of the essential components of a Digital Signal Processing (DSP) system is the Finite Impulse Response (FIR) filter. The FIR filter uses the Multiply and Accumulate (MAC) operation for its computation. Conventional MAC units are slow and consume high power, making them unsuitable for energy-constrained devices. The MAC operations in an FIR filter use constant filter coefficients as one of their inputs, which makes them well suited to a bit-serial technique such as distributed arithmetic (DA). However, traditional DA has the drawback of using large memory resources as the filter order increases. This paper proposes an efficient LUT-less modified distributed arithmetic architecture to solve the memory problem. By using multiplexers and adders, the architecture removes the need to precompute the weighted sums that a DA LUT would store. The architecture is also designed to extend the range of input values. A 16-tap FIR filter is designed, synthesized with Xilinx ISE, and implemented on an XC4VSX35-FF668-10 FPGA to measure the performance of this architecture. The implementation results show that the design uses fewer resources and achieves faster filtering than previous implementations of the filter.
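The general idea of replacing the LUT with multiplexers and adders can be sketched in Python as follows. Each input bit drives a 2:1 multiplexer that passes either its coefficient or zero, and an adder sums the selected coefficients for each bit plane. This is a minimal behavioural model under stated assumptions (signed two's-complement samples, hypothetical function names `mux` and `lutless_da`); it does not reproduce the paper's actual architecture or its input-range extension.

```python
def mux(select, value):
    """2:1 multiplexer: pass `value` when the select bit is 1, otherwise 0."""
    return value if select else 0


def lutless_da(coeffs, samples, bits=8):
    """DA inner product for signed two's-complement samples, without a LUT."""
    mask = (1 << bits) - 1
    acc = 0
    for b in range(bits):
        # What a LUT would have stored is computed on the fly by muxes and adders.
        plane_sum = sum(mux(((x & mask) >> b) & 1, c)
                        for c, x in zip(coeffs, samples))
        if b == bits - 1:
            # The sign bit plane carries negative weight in two's complement.
            acc -= plane_sum << b
        else:
            acc += plane_sum << b
    return acc


coeffs = [3, -1, 4, 2]
samples = [-128, 17, -5, 127]
result = lutless_da(coeffs, samples)
print(result)
```

The memory cost drops from 2^N stored words to none, while each bit plane now needs up to N-1 additions, which is the trade that a LUT-less design makes.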
Distributed arithmetic (DA)-based approximate structures are used for efficient implementation of inner products in various error-resilient applications. In the existing literature, most of these approximate architectures are developed by truncating the least significant bits (LSBs) of the inputs and/or the multiplying coefficients. The existing works do not provide any analytical study for evaluating and designing an approximate structure. To address this issue, this paper proposes an analytical framework. The analytical results are shown to match Monte Carlo simulation results very closely. The proposed framework reveals that truncating the LSBs of partial inner products is a promising alternative for designing more efficient DA architectures with less error. Following these observations, the paper presents a novel approach to truncating the LSBs of partial inner products, namely a weight-dependent truncation strategy, together with two variants that use a suitable error compensation function. Synthesis results, accuracy analysis, and evaluation in two commonly used error-tolerant applications demonstrate the superiority of the proposed architectures over state-of-the-art DA-based approximate structures.
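To show what comparing an analytical error estimate with a Monte Carlo simulation can look like, the sketch below handles only the simplest case: truncating t LSBs of independent, uniformly distributed unsigned inputs. Under that assumption each input loses on average (2^t - 1)/2, so the mean inner-product error is that value times the sum of the coefficients. This is not the paper's framework, and it does not cover its partial-inner-product truncation; all names are hypothetical.

```python
import random


def truncate(x, t):
    """Clear the t least significant bits of a non-negative integer."""
    return (x >> t) << t


def mean_truncation_error(coeffs, t, bits=8, trials=20000, seed=0):
    """Monte Carlo estimate of the mean error (exact minus approximate)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        samples = [rng.randrange(1 << bits) for _ in coeffs]
        exact = sum(c * x for c, x in zip(coeffs, samples))
        approx = sum(c * truncate(x, t) for c, x in zip(coeffs, samples))
        total += exact - approx
    return total / trials


def analytic_mean_error(coeffs, t):
    """Closed-form mean error, assuming uniform independent inputs."""
    return sum(coeffs) * ((1 << t) - 1) / 2


coeffs = [3, -1, 4, 2]
mc = mean_truncation_error(coeffs, t=3)
an = analytic_mean_error(coeffs, t=3)
print(mc, an)
```

The non-zero mean is the bias that an error compensation term would aim to cancel.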
Two real-time functions of a digital subcarrier cross-connect (DSXC) are experimentally demonstrated for the first time using distributed arithmetic (DA) on a field programmable gate array (FPGA) platform. Both frequency translation and channel selection in the DSXC are implemented using DA-based resampling filters, achieving flexible modulation formats and fine data-rate granularity for many concurrent subcarrier channels. Compared with traditional resampling filters that rely on multipliers, the DA-based approach eliminates the need for DSP slices in the FPGA implementation and significantly reduces the hardware cost. Because it requires only a few clock periods, the DA-based resampling filter is also significantly faster than conventional FIR filters, whose overall latency is proportional to the filter order. The DA-based DSXC is therefore able to achieve improved spectral efficiency and programmability of multiple orthogonal subcarrier channels, while keeping cross-connection latency low and requiring only low-cost hardware resources when implemented on an FPGA platform.
In this article, we propose the internal architecture of dedicated hardware for 1D/2D convolution-based 9/7 and 5/3 DWT filters, exploiting bit-parallel distributed arithmetic (DA) to reduce the computation time of the DWT design while keeping the area comparable to that of other recent designs. Despite using memory-intensive bit-parallel DA, we achieve a 90% reduction in memory size compared with other notable architectures. With the proposed architecture, both the 9/7 and 5/3 DWT filters can be realized and are chosen through a selection input, mode. With the introduction of DA, we have incorporated pipelining and parallelism into the proposed convolution-based 1D/2D DWT architectures. We reduce the area by 38.3% and the memory requirement by 90% compared with the latest notable designs. The critical-path delay of our design is roughly half that of the other latest designs. We have successfully applied our prototype 2D design to real-time image decomposition. The quality of the architecture for real-time image decomposition is measured by peak signal-to-noise ratio and computation time, where the proposed design outperforms other similar software- and hardware-based implementations.
A novel high-speed, memory-optimized distributed arithmetic (DA)-based architecture is developed and modeled for the 3D discrete wavelet transform (DWT). The memory requirement of the proposed architecture is 9 x 9N + 36 pixels of dynamic memory space and 52W of ROM. The proposed 3D-DWT architecture implements 9/7 Daubechies wavelet filters, synthesizes 7127 bytes of memory for temporary storage, and uses 758 adders, 36 16:1 multiplexers and 36 up counters to realize the 3D-DWT hardware. The 3D-DWT engine is implemented and tested on a Xilinx Virtex-5 XC5VLX155T FPGA with high area and power efficiency. The maximum delay in the timing path is 2.676 ns, and the 3D-DWT works at a maximum clock frequency of 381 MHz.