In this paper, we present a new retiming-based technology mapping algorithm for look-up table-based field-programmable gate arrays. The algorithm is based on a novel iterative procedure for computing all k-cuts of all nodes in a sequential circuit in the presence of retiming. The algorithm completely avoids flow computation, which is the bottleneck of previous algorithms. Because k is very small in practice, the procedure for computing all k-cuts is very fast. Experimental results indicate that the overall algorithm is very efficient in practice.
ISBN: 9781450305549 (print)
Memory-related constraints (memory bandwidth, cache size) are nowadays the performance bottleneck of most computational applications. Especially with multiple cores, performance in many cases does not scale with the number of cores. In our work, we present our FPGA-based solution for the 3D Reverse Time Migration (RTM) algorithm. As the most computationally demanding imaging algorithm in current oil and gas exploration, RTM involves various computational challenges, such as a high demand for storage size and bandwidth and poor cache behavior. Combining optimizations from both the algorithmic and architectural perspectives, our FPGA-based solution manages to remove the memory constraints and provide high performance that scales well with the amount of computational resources available. Compared with an optimized CPU implementation using two quad-core Intel Nehalem CPUs, our solution achieves a 4x speedup on two Virtex-5 FPGAs and an 8x speedup on two Virtex-6 FPGAs. Our projection demonstrates that the performance will continue to scale with future increases in FPGA capacity.
ISBN: 1595932925 (print)
Due to their generic and highly programmable nature, FPGAs provide the ability to implement a wide range of applications. However, it is this nonspecific nature that has limited the use of FPGAs in scientific applications that require floating-point arithmetic. Even simple floating-point operations consume a large amount of computational resources. In this paper, we introduce embedding floating-point multiply-add units in an island-style FPGA. This has been shown to yield an average area savings of 55.0% and an average increase of 40.7% in clock rate over existing architectures. Copyright 2006 ACM.
The routing channels of an FPGA consist of wire segments of various types, providing a trade-off between performance and routability. The routing architectures of recently developed FPGAs (e.g., Virtex-II) offer more versatile wire types and richer connections between them than those of older generations of FPGAs (e.g., XC4000). To fully exploit the potential of the new routing architectures, it is beneficial to perform wire-type assignment for all channels as an intermediate stage between global routing and detailed routing. In this paper, we present a wire-type assignment algorithm that is based on iteratively applying a min-cost max-flow technique to route many nets simultaneously. At each stage of the network flow computation, the result is guaranteed to be optimal in terms of routability and delay cost. We use the routing architecture of the Virtex-II FPGAs from Xilinx as the target architecture in our experiments. Experimental results show that our algorithm outperforms the traditional sequential net-by-net approach.
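As an illustration only, a single min-cost max-flow computation on a toy wire-type assignment can be sketched as follows. This is not the paper's iterative procedure: the graph layout (source, nets, wire types, sink), the capacities, and the delay costs are all assumptions made up for the example, and the solver is a plain successive-shortest-path method using Bellman-Ford.

```python
def min_cost_max_flow(n, edges, s, t):
    """Successive shortest paths; edges is a list of (u, v, capacity, cost)."""
    graph = [[] for _ in range(n)]      # adjacency lists of edge indices
    to, cap, cost = [], [], []
    for u, v, c, w in edges:
        # Forward edge at an even index, its residual twin at index ^ 1.
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)
    flow = total_cost = 0
    inf = float("inf")
    while True:
        # Bellman-Ford tolerates the negative costs on residual edges.
        dist = [inf] * n
        prev_edge = [-1] * n
        dist[s] = 0
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for e in graph[u]:
                    if cap[e] > 0 and dist[u] + cost[e] < dist[to[e]]:
                        dist[to[e]] = dist[u] + cost[e]
                        prev_edge[to[e]] = e
                        updated = True
            if not updated:
                break
        if dist[t] == inf:              # no augmenting path is left
            return flow, total_cost
        # Find the bottleneck along the path, then push flow along it.
        push, v = inf, t
        while v != s:
            e = prev_edge[v]
            push = min(push, cap[e])
            v = to[e ^ 1]
        v = t
        while v != s:
            e = prev_edge[v]
            cap[e] -= push
            cap[e ^ 1] += push
            v = to[e ^ 1]
        flow += push
        total_cost += push * dist[t]


# Hypothetical instance: 3 nets, a "fast" wire type (1 track, delay cost 1)
# and a "slow" wire type (2 tracks, delay cost 3).
SOURCE, NET1, NET2, NET3, FAST, SLOW, SINK = range(7)
toy_edges = [(SOURCE, net, 1, 0) for net in (NET1, NET2, NET3)]
toy_edges += [(net, FAST, 1, 1) for net in (NET1, NET2, NET3)]
toy_edges += [(net, SLOW, 1, 3) for net in (NET1, NET2, NET3)]
toy_edges += [(FAST, SINK, 1, 0), (SLOW, SINK, 2, 0)]
routed, delay_cost = min_cost_max_flow(7, toy_edges, SOURCE, SINK)
```

In this toy instance all three nets are routed, one on the fast wire and two on the slow wires, which is the cheapest assignment that routes every net.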
ISBN: 9781595939340 (print)
We consider packing in the commercial FPGA context and examine the speed, performance and power trade-offs associated with packing in a state-of-the-art FPGA, the Xilinx® Virtex™-5 FPGA. Two aspects of packing are discussed: 1) packing for general logic blocks, and 2) packing for large IP blocks. Virtex-5 logic blocks contain dual-output 6-input look-up tables (LUTs). Such LUTs can implement any single logic function requiring no more than 6 inputs, or any two logic functions requiring no more than 5 distinct inputs. The second LUT output is associated with slower speed and therefore must be used judiciously. We present placement-based techniques for dual-output LUT packing that lead to improved area efficiency and power, with minimal performance degradation. We then move on to address packing for large IP blocks, specifically block RAMs and DSPs. We present a packing optimization, widely applicable in DSP designs, that leads to significantly improved design performance.
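The input-count rule stated in the abstract can be written as a small feasibility check. This is a minimal sketch: the function names and the representation of a logic function by its set of input signal names are assumptions for illustration, and the check says nothing about the placement-based packing decisions the paper describes.

```python
def fits_single_lut(inputs):
    """A single function fits one LUT if it needs no more than 6 inputs."""
    return len(set(inputs)) <= 6


def can_share_dual_output_lut(inputs_f, inputs_g):
    """Two functions can share a dual-output LUT if, taken together,
    they need no more than 5 distinct inputs."""
    return len(set(inputs_f) | set(inputs_g)) <= 5


# Hypothetical signal names: f and g overlap on "b" and "c".
f_inputs = ["a", "b", "c"]
g_inputs = ["b", "c", "d", "e"]
h_inputs = ["p", "q", "r"]

share_fg = can_share_dual_output_lut(f_inputs, g_inputs)  # 5 distinct inputs
share_gh = can_share_dual_output_lut(g_inputs, h_inputs)  # 7 distinct inputs
```

A feasible pair is only a candidate: because the second output is slower, a packer would still weigh timing before combining the two functions.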
In this paper, we present techniques for energy-efficient design at the algorithm level using FPGAs. We then use these techniques to create energy-efficient designs for two signal processing kernel applications: fast Fourier transform (FFT) and matrix multiplication. We evaluate the performance, in terms of both latency and energy efficiency, of FPGAs in performing these tasks. Using a Xilinx Virtex-II as the target FPGA, we compare the performance of our designs to those from the Xilinx library as well as to conventional algorithms run on the PowerPC core embedded in the Virtex-II Pro and on the Texas Instruments TMS320C6415. Our evaluations are done both through estimation based on energy and latency equations and through low-level simulation. For FFT, our designs dissipated an average of 60% less energy than the design from the Xilinx library and 56% less than the DSP. Our designs showed a factor of 10 improvement over the embedded processor. These results provide concrete evidence to substantiate the idea that FPGAs can outperform DSPs and embedded processors in signal processing. Further, they show that FPGAs can achieve this performance while still dissipating less energy than the other two types of devices.
ISBN: 9781605584102 (print)
Clock network power in field-programmable gate arrays (FPGAs) is considered and two complementary approaches for power reduction in the Xilinx® Virtex™-5 FPGA are presented. The approaches are unique in that they leverage specific architectural aspects of Virtex-5 to achieve reductions in dynamic power consumed by the clock network. The first approach comprises a placement-based technique to reduce interconnect resource usage on the clock network, thereby reducing capacitance and power (up to 12%). The second approach borrows the "clock gating" notion from the ASIC domain and applies it to FPGAs. Clock enable signals on flip-flops are selectively migrated to use the dedicated clock enable available on the FPGA's built-in clock network, leading to reduced toggling on the clock interconnect and lower power (up to 28%). Power reductions are achieved without any performance penalty, on average. Copyright 2009 ACM.
ISBN: 1595932925 (print)
Division is one of the most complicated and expensive arithmetic operations. Both clock frequency and operation delay are limited by the memory wall, even in LUT-based FPGA devices. To overcome the memory limitation, we propose a hybrid division algorithm that employs Prescaling, Series expansion and Taylor expansion (PST) algorithms. The proposed algorithm efficiently accelerates very-high-radix division. The algorithm is multiplicative and is feasible for modern FPGA devices with built-in multipliers. The algorithm is implemented in Altera Stratix II FPGA devices and compared with the division IP core generated by MegaWizard. The results show that the PST algorithm achieves a higher clock frequency, lower execution time and lower power consumption. Copyright 2006 ACM.
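The general idea of multiplicative division by prescaling followed by a truncated series can be sketched in a few lines. This is not the PST algorithm itself, whose details the abstract does not give: the table size, the number of series terms, the floating-point arithmetic, and all function names below are assumptions chosen for illustration.

```python
def approx_reciprocal(d, bits=4):
    """Coarse reciprocal of d in [1, 2), as a small lookup table would give.

    The table is indexed by the top `bits` fractional bits of d and returns
    the reciprocal of the midpoint of that interval.
    """
    index = int((d - 1.0) * (1 << bits))
    midpoint = 1.0 + (index + 0.5) / (1 << bits)
    return 1.0 / midpoint


def divide(x, d, bits=4, terms=3):
    """Approximate x / d for d in [1, 2) using only multiplies and adds."""
    if not 1.0 <= d < 2.0:
        raise ValueError("this sketch assumes a divisor normalized to [1, 2)")
    r = approx_reciprocal(d, bits)
    # Prescaling: d * r is close to 1, so write d * r = 1 - e with small e.
    e = 1.0 - d * r
    # Series expansion: 1 / (1 - e) = 1 + e + e^2 + ..., truncated.
    series = sum(e ** i for i in range(terms))
    return x * r * series


quotient = divide(7.0, 1.3)
```

With these assumed parameters the relative error is on the order of e cubed, where the magnitude of e is at most about 0.03, so three series terms already give roughly four to five correct decimal digits.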
ISBN: 9781595939340 (print)
To improve FPGA performance for arithmetic circuits, this paper proposes a new architecture for FPGA logic cells that includes a 6:2 compressor. The new cell features additional fast carry chains that concatenate adjacent compressors and can be routed locally without the global routing network. Unlike previous carry chains for binary and ternary addition, the carry chain used by the new cell spans only 2 logic blocks, which significantly improves the delay of multi-input addition operations mapped onto the FPGA. The delay and area overhead that arises from augmenting a traditional FPGA logic cell with the new compressor structure is minimal. Using this new cell, we observed an average speedup of 1.41 in combinational delay compared to adder trees synthesized using ternary adders. Copyright 2008 ACM.
ISBN: 1595932925 (print)
The paper presents several improvements to the state of the art in FPGA technology mapping, exemplified by a recent advanced technology mapper, DAOmap [Chen and Cong, ICCAD '04]. Improved cut enumeration computes all K-feasible cuts without pruning for up to 7 inputs for the largest MCNC benchmarks. A new technique for on-the-fly cut dropping reduces by orders of magnitude the memory needed to represent cuts for large designs. Improved area recovery leads to mappings with area on average 7% smaller than DAOmap, while preserving delay optimality when starting from the same optimized netlists. Applying mapping with structural choices derived by a synthesis flow reduces delay by 7% and area by 14% on average, compared to DAOmap. Copyright 2006 ACM.
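For readers unfamiliar with the term, basic K-feasible cut enumeration on a combinational netlist can be sketched as below. This is the textbook bottom-up merge, not the improved enumeration or the cut-dropping technique of the paper; the netlist encoding (a dict from each node to its fanin list, inserted in topological order) and the function name are assumptions for the example.

```python
from itertools import product


def enumerate_cuts(fanins, k):
    """Return all cuts with at most k leaves for every node.

    `fanins` maps node -> list of fanin nodes and must be in topological
    order (fanins before fanouts); primary inputs have an empty list.
    Dominated cuts are kept, so this sketch does no pruning of its own.
    """
    cuts = {}
    for node, ins in fanins.items():
        node_cuts = {frozenset([node])}          # the trivial cut
        if ins:
            # Pick one cut per fanin and merge the leaf sets.
            for combo in product(*(cuts[i] for i in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:             # keep only K-feasible cuts
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts


# Hypothetical netlist: x = f(a, b), y = g(x, c).
netlist = {"a": [], "b": [], "c": [], "x": ["a", "b"], "y": ["x", "c"]}
cuts_k3 = enumerate_cuts(netlist, 3)
cuts_k2 = enumerate_cuts(netlist, 2)
```

With K = 3, node `y` has the cuts {y}, {x, c} and {a, b, c}; with K = 2 the three-leaf cut is discarded. The number of cuts grows quickly with K, which is why memory for cut storage becomes a concern on large designs.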