The authors propose an extension to the popular {2^n - 1, 2^n, 2^n + 1} moduli set by adding a fourth modulus, 2^(n+1) + 1. This extension leads to higher parallelism while keeping the forward conversion and modular arithmetic units simple. The main challenge of efficient reverse conversion is met by three techniques described for the first time. Firstly, we reverse convert linear combinations of moduli, thereby reducing the number of non-zero bits in the Booth encoded multiplicands from n to merely 2. Secondly, it is shown that division by 3, if introduced at the right stage, can be implemented very efficiently and can, in turn, reduce the cost of the converter. Thirdly, to implement VLSI-efficient modulo reduction, we propose two techniques: multiple split tables (MST) and a modified division algorithm (MDA). It is shown that the MST can reduce exponential ROM requirements to quadratic ROM requirements, while the MDA can reduce these further to linear requirements. As a result of these innovations, the proposed reverse converter uses simple shift and add operations and needs a lookup with only 6 entries. The delay of the converter is approximately 10n+13 full adder delays and the area cost is quadratic in n.
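To make the moduli set concrete, the following is a minimal Python sketch of forward conversion and a textbook Chinese Remainder Theorem (CRT) reverse conversion over {2^n - 1, 2^n, 2^n + 1, 2^(n+1) + 1}. It is not the converter proposed in the abstract (it uses none of the Booth encoding, MST, or MDA techniques); the function names are hypothetical and chosen only for illustration. The sketch assumes odd n, since for even n the moduli 2^n - 1 and 2^(n+1) + 1 share the factor 3 and the plain CRT formula no longer applies.

```python
from math import gcd, prod  # math.prod and pow(x, -1, m) need Python 3.8+


def moduli_set(n):
    """Return the four-moduli set {2^n - 1, 2^n, 2^n + 1, 2^(n+1) + 1}."""
    return [2**n - 1, 2**n, 2**n + 1, 2**(n + 1) + 1]


def pairwise_coprime(moduli):
    """True if every pair of moduli has gcd 1 (required by the plain CRT)."""
    return all(
        gcd(a, b) == 1
        for i, a in enumerate(moduli)
        for b in moduli[i + 1:]
    )


def forward(x, moduli):
    """Forward conversion: integer -> tuple of residues."""
    return [x % m for m in moduli]


def reverse_crt(residues, moduli):
    """Textbook CRT reverse conversion (illustrative, not the paper's method)."""
    if not pairwise_coprime(moduli):
        raise ValueError("plain CRT needs pairwise coprime moduli")
    M = prod(moduli)  # dynamic range
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the multiplicative inverse of Mi modulo m
        total += r * Mi * pow(Mi, -1, m)
    return total % M


if __name__ == "__main__":
    mods = moduli_set(3)          # odd n, so the set is pairwise coprime
    res = forward(1234, mods)
    print(mods, res, reverse_crt(res, mods))
```

With n = 3 the set is {7, 8, 9, 17}, giving a dynamic range of 8568. The generic CRT above needs full-width modular multiplications, which is the cost that a hardware reverse converter tries to avoid.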
This paper reports how VLSI cost metrics (area, delay, power) of residue reverse converters scale with the cardinality and dynamic range of moduli sets. The study uses CMAC reverse converters, reported previously by the authors to be the most efficient known to date in terms of area and delay. In all, 134 reverse converters with dynamic ranges from 32 to 120 bits and set cardinalities ranging from 4 to 20 are constructed and analyzed. It is seen that area, delay and power costs are insensitive to cardinality once the cardinality exceeds a threshold (usually between five and eight). For cardinalities beyond this threshold, conversion costs depend essentially on dynamic range alone. This insensitivity is explained in detail by noting the counterbalancing effects of the various sub-units of a CMAC reverse converter. Since practical implementations of RNS usually employ cardinalities beyond these thresholds, the significance of this study is its conclusion that increasing the set cardinality in most implementations will have a marginal effect, if any, on VLSI reverse conversion costs.
This paper reports our ongoing investigation of a new paradigm to realize high performance DSP architectures suitable for embedded ASICs. The reasons for the significant gap between achievable MAC bandwidth and that delivered by current embedded DSP architectures are analyzed in detail. A processing engine composed of a general purpose DSP core closely coupled with an application-specific version of Renaissance, our previously developed vector co-processor with a residue arithmetic datapath, is proposed as a solution to close this gap. In the first step, code transformations are applied to firmware to expose the vector-like nature of DSP computation. Then, Renaissance's instruction set, datapath and control are personalized for the vector primitives thus exposed. The most important advantages of this approach are that it is highly amenable to automation, it captures most of the compute intensive routines (>70%) quite well, and it makes Renaissance reusable across applications. This paradigm has resulted in throughput gains ranging from 33% to over 200% when firmware for actual communications and speech coding applications was recoded. In Ren-AC, a Renaissance version optimized for a modem bank application, the system-wide increase in MAC throughput was higher than 50%.