Floating-point division is a very costly operation in FPGA designs. High-frequency implementations of the classic digit-recurrence algorithms for division have long latencies (on the order of the number of fraction bits) and consume large amounts of logic. Additionally, these implementations require significant routing resources, making timing closure difficult in complete designs. In this paper we present two multiplier-based architectures for division which make efficient use of the DSP resources in recent Altera FPGAs. By balancing resource usage between logic, memory and DSP blocks, the presented architectures maintain high frequencies in full designs. Additionally, compared to classical algorithms, the proposed architectures have significantly lower latencies. The architectures target faithfully rounded results, similar to most elementary-function implementations for FPGAs, but can also be transformed into correctly rounded architectures with a small overhead. The presented architectures are built using the Altera DSP Builder advanced framework and will be part of the default blockset.
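A common way to build division from multipliers is to look up a reciprocal seed in a small table, refine it with Newton-Raphson steps x ← x(2 − yx), and multiply by the dividend. The sketch below illustrates that generic technique with Python floats. It is an assumption-laden illustration, not the two architectures of the paper, whose internals the abstract does not describe. The table size, iteration count and function names are hypothetical, and a real FPGA datapath works on fixed-point mantissas rather than doubles.

```python
import math

INDEX_BITS = 8  # hypothetical: top 8 fraction bits of the divisor address the seed table


def build_seed_table(index_bits: int) -> list[float]:
    """Reciprocal seeds for y in [1, 2): entry i approximates 1/y on the
    interval [1 + i/n, 1 + (i+1)/n) by the reciprocal of its midpoint."""
    n = 1 << index_bits
    return [1.0 / (1.0 + (i + 0.5) / n) for i in range(n)]


SEED_TABLE = build_seed_table(INDEX_BITS)


def reciprocal(y: float, iterations: int) -> float:
    """Approximate 1/y for y in [1, 2): table seed, then Newton-Raphson steps."""
    if not 1.0 <= y < 2.0:
        raise ValueError("y must lie in [1, 2)")
    x = SEED_TABLE[int((y - 1.0) * (1 << INDEX_BITS))]
    for _ in range(iterations):
        # Newton-Raphson for f(x) = 1/x - y: the relative error e = 1 - y*x
        # becomes e**2 after each step, so the number of correct bits roughly doubles.
        x = x * (2.0 - y * x)
    return x


def divide(a: float, b: float, iterations: int = 2) -> float:
    """Multiplier-based a / b for positive, finite a and b (no special cases)."""
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sketch handles positive operands only")
    mantissa, exponent = math.frexp(b)  # b = mantissa * 2**exponent, mantissa in [0.5, 1)
    y = 2.0 * mantissa                  # normalize: b = y * 2**(exponent - 1), y in [1, 2)
    return a * math.ldexp(reciprocal(y, iterations), 1 - exponent)
```

In this toy model the seed error is at most 2^-9, so two iterations bring the relative error near 2^-36. Widening the table or adding an iteration trades memory or multipliers for accuracy, which loosely mirrors the logic/memory/DSP balancing mentioned in the abstract.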
This paper surveys the most recent parallel coprocessors and highlights current trends. It is shown that single-chip massively parallel processor implementations seem to be disappearing from scientific investigations (with the exception of low-level near-sensor image processing). Meanwhile, the formerly developed architectures have moved inside complex systems-on-chip and microprocessors. The recent architectures have in common advanced processing-element and internal-interconnection solutions, and predominantly mid-grain parallelism (i.e., up to a hundred processing elements per chip).
ISBN: 0780366336 (print)
Low power is an extremely important issue for future mobile radio systems. Channel decoders are essential building blocks of baseband signal processing units in mobile terminal architectures, so low-power implementations of advanced channel decoding techniques are mandatory. In this paper we present a low-power implementation of the most sophisticated channel decoding algorithm (turbo decoding) on programmable architectures. Low-power optimization is performed at two abstraction levels: at the system level by an intelligent cancellation technique, and at the implementation level by dynamic voltage scaling. With these techniques we can reduce the worst-case energy consumption to 55%, using data from state-of-the-art processors. Our approach is also applicable to hardware implementations. To the best of our knowledge, this is the first in-depth study of low-power turbo-decoder implementations based on voltage scheduling for third-generation wireless systems.
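How the two optimization levels can compound is easiest to see in a toy energy model. This is a minimal sketch under loudly stated assumptions, not the authors' model: it assumes that "cancellation" means stopping the iterative decoder once a block has converged, that clock frequency scales linearly with supply voltage, that dynamic energy per iteration scales with V², and that the iteration count of each block is known before decoding starts (an idealized oracle). The voltages, function names and traces are hypothetical, and the sketch does not attempt to reproduce the reported 55% figure.

```python
V_MAX = 1.0  # hypothetical nominal supply voltage (normalized)
V_MIN = 0.3  # hypothetical lowest usable supply voltage (normalized)


def block_energy(iterations_used: int, max_iterations: int) -> float:
    """Relative dynamic energy to decode one block within a fixed time slot.

    Toy assumptions: frequency is proportional to voltage, energy per
    iteration is proportional to V**2, and the slot is sized so that
    max_iterations just fit at V_MAX.
    """
    if not 1 <= iterations_used <= max_iterations:
        raise ValueError("iterations_used must lie in [1, max_iterations]")
    # Oracle assumption: with fewer iterations the clock (and thus the
    # voltage) can be lowered until the work exactly fills the slot.
    v = max(V_MIN, V_MAX * iterations_used / max_iterations)
    return iterations_used * v ** 2


def compare(iteration_trace: list[int], max_iterations: int) -> dict[str, float]:
    """Average per-block energy relative to always running max_iterations at V_MAX."""
    blocks = len(iteration_trace)
    baseline = max_iterations * V_MAX ** 2  # no cancellation, no voltage scaling
    cancel_only = sum(iteration_trace) * V_MAX ** 2 / blocks  # stop early, stay at V_MAX
    cancel_dvs = sum(block_energy(n, max_iterations) for n in iteration_trace) / blocks
    return {
        "cancellation": cancel_only / baseline,
        "cancellation+dvs": cancel_dvs / baseline,
    }
```

For example, `compare([4, 4], 8)` reports 0.5 for cancellation alone and 0.125 with voltage scaling added, because halving the work also halves the required voltage in this linear model. A realistic scheduler must predict the iteration count, which is where the reported study's voltage-scheduling analysis comes in.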
Next-generation radar systems place high performance demands on the signal processing chain. Examples include advanced image-creating sensor systems in which complex calculations must be performed on huge data sets in real time. Many-core architectures are gaining attention as a means to meet the computational requirements of complex radar signal processing by exploiting the massive parallelism inherent in the algorithms in an energy-efficient manner. In this paper, we evaluate a many-core architecture, namely a 16-core Epiphany processor, by implementing two significantly large case studies, viz. an autofocus criterion calculation and the fast factorized back-projection algorithm, both key components in modern synthetic aperture radar systems. The implementation results from the two case studies are compared on the basis of achieved performance and programmability. One of the Epiphany implementations demonstrates the usefulness of the architecture for the streaming-based algorithm (the autofocus criterion calculation) by achieving a speedup of 8.9x over a sequential implementation on a state-of-the-art general-purpose processor that belongs to a later silicon technology generation and operates at a 2.7x higher clock speed. In the other case study, a highly memory-intensive algorithm (fast factorized back-projection), the Epiphany architecture shows a speedup of 4.25x. For embedded signal processing, low power dissipation is as important as computational performance. In our case studies, the Epiphany implementations of the two algorithms are, respectively, 78x and 38x more energy efficient.
ISBN: 9783981080100 (print)
We describe the VLSI implementation of MIMO detectors that exhibit close-to-optimum error-rate performance but still achieve high throughput at low silicon area. In particular, algorithms and VLSI architectures for sphere decoding (SD) and K-best detection are considered, and the corresponding trade-offs between uncoded error-rate performance, silicon area, and throughput are explored. We show that SD with a per-block run-time constraint is best suited for practical implementations.
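K-best detection is a breadth-first tree search that keeps a fixed number of candidates per layer. The minimal sketch below assumes the channel has already been reduced to a real-valued model ỹ ≈ Rx with R upper triangular (for example via a QR decomposition) and a finite real alphabet; the function name and the example matrix are hypothetical. It is a software illustration of the generic algorithm, not the VLSI architecture described in the paper.

```python
def k_best_detect(R: list[list[float]], y_tilde: list[float],
                  alphabet: list[int], k: int) -> tuple[list[int], float]:
    """K-best search for y_tilde ~= R @ x with R upper triangular.

    Returns the best surviving symbol vector and its squared distance.
    """
    n = len(y_tilde)
    # Each survivor is (partial Euclidean distance, symbols for layers i..n-1).
    survivors: list[tuple[float, list[int]]] = [(0.0, [])]
    for i in range(n - 1, -1, -1):  # detect the last layer first
        children = []
        for ped, tail in survivors:
            # tail[j] holds the symbol already decided for layer i + 1 + j.
            interference = sum(R[i][i + 1 + j] * s for j, s in enumerate(tail))
            for s in alphabet:
                residual = y_tilde[i] - interference - R[i][i] * s
                children.append((ped + residual * residual, [s] + tail))
        # Keep only the k children with the smallest partial distance.
        children.sort(key=lambda c: c[0])
        survivors = children[:k]
    best_ped, best_x = survivors[0]
    return best_x, best_ped
```

The work per layer is fixed by K, which gives predictable throughput. With K at least |alphabet|^(n−1) no path is pruned before the final layer and the search becomes exhaustive (maximum likelihood); smaller K trades error-rate performance for less hardware.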
We present the "direct inverse scale transform", which extends the "direct scale transform" method originally proposed by Williams, Zalubas and Hero III (see Advanced Signal Processing Algorithms, Architectures, and Implementations VI, Proc. SPIE, vol. 2846, pp. 262-272, 1996). This scheme completes the calculation of the analysis and synthesis equations for the scale transform pair and is especially suitable for non-integer dilation or compression factors. Several examples of transformed and reconstructed synthetic and real 1-D and 2-D signals are included.
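For reference, the continuous scale transform pair in its commonly used form is given below. The analysis equation defines D_f(c), and the synthesis equation recovers f(t) for t > 0. The symbols D_f, c and the dilation factor a are chosen here for illustration; the discretization used in the paper is not reproduced.

$$
D_f(c) = \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} f(t)\, t^{-jc - 1/2}\, dt,
\qquad
f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} D_f(c)\, t^{\,jc - 1/2}\, dc .
$$

The property that makes the pair useful for dilation and compression follows from the substitution u = at. For a dilated copy g(t) = \sqrt{a}\, f(at) with a > 0:

$$
D_g(c) = \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} \sqrt{a}\, f(at)\, t^{-jc - 1/2}\, dt
       = \frac{1}{\sqrt{2\pi}} \int_{0}^{\infty} f(u) \Big(\frac{u}{a}\Big)^{-jc - 1/2} \frac{\sqrt{a}}{a}\, du
       = a^{jc}\, D_f(c),
\qquad |D_g(c)| = |D_f(c)| .
$$

Equivalently, with t = e^{\tau}, the analysis equation is the Fourier transform of f(e^{\tau})\, e^{\tau/2}, which is why a dilation by any real factor a (integer or not) appears only as the phase factor a^{jc}.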