Distributed arithmetic (DA) plays an important role in designing digital signal processing modules for FPGA architectures. It allows multiply-and-accumulate (MAC) operations to be replaced with combinational blocks. The quality of DA-based implementations strongly depends on the efficiency of the methods that map the combinational DA block onto FPGA resources. Since modern FPGAs have a heterogeneous structure, there is a need for quality algorithms that target these structures and for flexible architecture exploration that aids appropriate mapping. The paper presents a modification of the DA concept that allows very efficient implementation in heterogeneous FPGA architectures.
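To make the idea of replacing MAC operations with a combinational block concrete, the following is a minimal software sketch of textbook distributed arithmetic, not the modified scheme the paper proposes. It assumes integer coefficients and signed two's-complement samples; the names `build_lut` and `da_inner_product` are hypothetical and chosen only for illustration.

```python
def build_lut(coeffs):
    """Precompute every combination sum of the coefficients (2**K entries).

    Bit i of the table address selects whether coeffs[i] is in the sum.
    In hardware this table is the combinational DA block.
    """
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_inner_product(lut, samples, bits):
    """Compute sum(coeffs[i] * samples[i]) with lookups, shifts and adds only.

    Assumption: every sample fits in `bits`-bit two's complement.
    """
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into one table address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= ((x >> b) & 1) << i
        term = lut[addr] << b
        # The sign bit of two's complement carries negative weight.
        acc += -term if b == bits - 1 else term
    return acc


coeffs = [3, -5, 7, 2]
samples = [12, -7, 0, -128]
lut = build_lut(coeffs)
result = da_inner_product(lut, samples, bits=8)
direct = sum(c * x for c, x in zip(coeffs, samples))
```

Here `result` and `direct` agree, and no multiplication of a coefficient by a sample takes place in `da_inner_product`. The table grows as 2**K with the number of coefficients K, which is why mapping the block efficiently onto FPGA resources matters.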
Signal processing algorithms and architectures can use dynamic reconfiguration to exploit variations in signal statistics with the objectives of improved performance and reduced power consumption. Parameters provide a simple and formal way to characterize incremental changes to a computation and its computing mechanism. This paper examines five parameterized computations which are typically implemented in hardware for a wireless multimedia terminal: (1) motion estimation, (2) discrete cosine transform, (3) Lempel-Ziv lossless compression, (4) 3D graphics light rendering and (5) Viterbi decoding. Each computation is examined for its capability to dynamically adapt the algorithm and architecture parameters to variations in its input signals. Dynamically reconfigurable low-power implementations of each computation are currently underway.
The real-time implementation of an efficient signal compression technique, vector quantization (VQ), is of great importance to many digital signal coding applications. In this paper, we describe a new family of bit-level systolic VLSI architectures which offer an attractive solution to this problem. These architectures are based on a bit-serial, word-parallel approach, and high performance and efficiency can be achieved for VQ applications across a wide range of bandwidths. Compared with their bit-parallel counterparts, these bit-serial circuits provide better alternatives for VQ implementations in terms of performance and cost.
In this contribution, the potential of parallelized software that implements digital signal processing algorithms on a multicore processor platform is analyzed. For this purpose, various digital signal processing tasks have been implemented on a prototyping platform, i.e. an ARM MPCore featuring four ARM11 processor cores. In order to analyze the effect of parallelization on the resulting performance-power ratio, influencing parameters such as the number of issued program threads have been studied. For parallelization, the OpenMP programming model has been used, which can be applied efficiently at the C level. In order to develop power-efficient code, a functional and instruction-level power model of the MPCore has also been derived which features a high estimation accuracy. Using this power model and exploiting the capabilities of OpenMP, a variety of exemplary tasks could be efficiently parallelized. The general efficiency potential of parallelization for multiprocessor architectures can be assembled.
ISBN: (print) 0780366336
Low power is an extremely important issue for future mobile radio systems. Channel decoders are essential building blocks of baseband signal processing units in mobile terminal architectures. Thus low-power implementations of advanced channel decoding techniques are mandatory. In this paper we present a low-power implementation of the most sophisticated channel decoding algorithm (turbo decoding) on programmable architectures. Low-power optimization is performed on two abstraction levels: on the system level by the use of an intelligent cancellation technique, and on the implementation level by the use of dynamic voltage scaling. With these techniques we can reduce the worst-case energy consumption to 55% using data of state-of-the-art processors. Our approach is also applicable to hardware implementations. To the best of our knowledge, this is the first in-depth study of low-power implementations of turbo decoders based on voltage scheduling for third-generation wireless systems.
Adaptive filtering constitutes an important class of DSP algorithms employed in several handheld mobile devices for applications such as echo cancellation, signal de-noising, and channel equalization. In this paper, a new hardware architecture using conjugate distributed arithmetic (CDA), which is suitable for high-throughput hardware implementations of LMS adaptive filters, is presented. Unlike a traditional distributed arithmetic (DA) implementation, where all possible combination sums of the filter coefficients are stored in a look-up table (LUT), the CDA architecture stores all possible combination sums of the input signal samples in the LUT and updates them at the arrival of every sample using an efficient update procedure. We describe the design of CDA adaptive filters and show that practical implementations of CDA adaptive filters have very high throughput relative to multiply-and-accumulate architectures. We also show that CDA adaptive filters have a potential area and power consumption advantage over DSP microprocessor architectures for a given throughput.
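The contrast between DA and CDA can be illustrated with a small sketch in which the table holds combination sums of the input samples and is addressed by bit slices of the coefficients. The abstract does not specify the paper's update procedure, so the one-addition-per-entry update in `shift_in` below is only an assumed illustration that follows from the table's definition. The function names are hypothetical, the coefficients are assumed to be integers in two's complement, and the LMS coefficient adaptation itself is not shown.

```python
def shift_in(lut, x_in):
    """Slide the sample window by one and return the updated table.

    Address bit 0 selects the newest sample and bit i the sample i steps
    older. Entry a therefore equals x_in (if bit 0 is set) plus the old
    entry for the remaining bits moved down by one position. The oldest
    sample drops out because a >> 1 never sets the top address bit.
    """
    return [(a & 1) * x_in + lut[a >> 1] for a in range(len(lut))]


def cda_output(lut, coeffs, bits):
    """Filter output sum(coeffs[i] * window[i]) via coefficient bit slices.

    Assumption: every coefficient fits in `bits`-bit two's complement.
    """
    acc = 0
    for b in range(bits):
        addr = 0
        for i, c in enumerate(coeffs):
            addr |= ((c >> b) & 1) << i
        term = lut[addr] << b
        # The coefficient sign bit carries negative weight.
        acc += -term if b == bits - 1 else term
    return acc


coeffs = [3, -1, 2]            # hypothetical 4-bit filter coefficients
lut = [0] * (1 << len(coeffs)) # window starts as all zeros
window = [0] * len(coeffs)
outputs, expected = [], []
for x in [5, -2, 9, 4]:
    lut = shift_in(lut, x)
    window = [x] + window[:-1]
    outputs.append(cda_output(lut, coeffs, bits=4))
    expected.append(sum(c * s for c, s in zip(coeffs, window)))
```

Because the coefficients only form table addresses, they can change after every sample, as in an LMS filter, without the table being rebuilt; only the arrival of a new sample touches the table.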
This paper surveys the most recent parallel coprocessors and highlights recent trends. It is shown that single-chip massively parallel processor implementations seem to be disappearing from scientific investigations (with the exception of low-level near-sensor image processing). Meanwhile, the formerly developed architectures have moved inside complex systems-on-chip and microprocessors. The common aspects of the recent architectures are the advanced processing elements and internal interconnection solutions, and the dominant mid-grain parallelism (i.e. up to a hundred processing elements per chip).
An advanced defect-tolerant systolic array implementation of the 2D convolution algorithm for real-time image processing applications has been full-custom designed and fabricated using standard CMOS technology. The bit-serial systolic array incorporates new architectural concepts and circuit techniques that fit a defect-tolerant design approach. As a result, high performance and high yield enhancement are achieved. The defect tolerance techniques are based on software-controlled defect localization and reconfiguration with programmable switches by a host processor or a VLSI tester. The chip's functionality differs from that of available convolution chips in its maximum kernel size of 256 taps, its ability to convolve one video signal with up to four independent coefficient masks, its support for adaptive filtering, its on-chip line delays and its special processing of frame borders. High-performance implementations of signal processing algorithms require large chip die sizes. The presented defect tolerance techniques and architectural concepts make large-area systolic implementations of signal processing algorithms feasible.
Future wireless systems are required to provide higher data rates, improved spectral efficiency and greater capacity. This can be achieved at the cost of increased signal processing complexity. The successful implementation of advanced algorithms and dedicated hardware architectures to tackle the demanding signal processing tasks calls for an integrated development process. It must effectively exploit the many interrelations between the different levels of the design hierarchy and efficiently bridge the gap between system concepts and their VLSI circuit realization. This paper presents the algorithm and architecture level design of interference suppression techniques for advanced wireless receivers based on the use of multiple antenna elements in combination with appropriate signal combining. A systematic approach to architecture exploration is demonstrated which leads to efficient implementations in terms of both power consumption and silicon area.