ISBN: (Print) 9782839918442
This paper describes the methodology and algorithms behind the extra pipeline analysis tools released in the Xilinx Vivado Design Suite version 2015.3. Extra pipelining is one of the most effective ways to improve the performance of FPGA applications. Manual pipelining, however, often requires significant effort from FPGA designers, who need to explore various changes in the RTL and re-run the flow iteratively. The automatic pipelining approach described in this paper, in contrast, allows FPGA users to explore latency vs. performance trade-offs of their designs before investing time and effort into modifying RTL. We describe the algorithms behind these tools, which use simple cut heuristics to maximize performance improvement while minimizing additional latency and register overhead. To demonstrate the effectiveness of the proposed approach, we analyse a set of 93 commercial FPGA applications and IP blocks mapped to Xilinx UltraScale+ and UltraScale generations of FPGAs. The results show that extra pipelining can provide from 18% to 29% potential Fmax improvement on average. They also show that the distribution of improvements is bimodal, with almost half of the benchmark suite designs showing no improvement due to the presence of large loops. Finally, we demonstrate that highly pipelined designs map well to the UltraScale+ and UltraScale FPGA architectures. Our approach demonstrates 19% and 20% Fmax improvement potential for the UltraScale+ and UltraScale architectures respectively, with the majority of applications reaching their loop limit through pipelining.
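The loop limit mentioned in this abstract is the key constraint. As a rough illustration only (an assumption about the general principle, not the Vivado tool's actual algorithm), the Python sketch below estimates the Fmax available from extra pipelining: a feed-forward critical path can be split into additional stages, while any feedback loop bounds the achievable clock period by its combinational-delay-to-register ratio.

```python
# Hypothetical sketch (not the Vivado tool's actual algorithm): estimate the
# Fmax improvement available from extra pipelining, assuming the achievable
# clock period is bounded below by the design's "loop limit" -- the maximum
# ratio of combinational delay to register count around any feedback loop.

def loop_limit(loops):
    """loops: iterable of (total_combinational_delay_ns, num_registers) per cycle."""
    return max(delay / regs for delay, regs in loops)

def pipelined_fmax_estimate(critical_path_ns, extra_stages, loops):
    """Splitting a feed-forward critical path into (extra_stages + 1) segments
    can at best divide its delay evenly; feedback loops cap the gain."""
    feed_forward_period = critical_path_ns / (extra_stages + 1)
    period = max(feed_forward_period, loop_limit(loops) if loops else 0.0)
    return 1000.0 / period  # MHz

if __name__ == "__main__":
    # Hypothetical numbers: 5 ns critical path, one loop with 4 ns over 2 registers.
    base = 1000.0 / 5.0
    piped = pipelined_fmax_estimate(5.0, extra_stages=3, loops=[(4.0, 2)])
    print(f"baseline {base:.0f} MHz -> pipelined estimate {piped:.0f} MHz")
```

In this toy example the design reaches its loop limit (a 2 ns period) before the extra stages are fully exploited, which mirrors the bimodal behaviour the abstract reports for designs dominated by large loops.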
ISBN: (Print) 9781424419609
Exploiting the underutilisation of variable-length DSP algorithms during normal operation is vital when seeking to maximise the achievable functionality of an application within a peak power budget. A system-level, low-power design methodology for FPGA-based, variable-length DSP IP cores is presented. Algorithmic commonality is identified and resources are mapped with a configurable datapath to increase achievable functionality. It is applied to a digital receiver application, where a 100% increase in operational capacity is achieved in certain modes without significant power or area budget increases. Measured results show that the resulting architectures require 19% less peak power, 33% fewer multipliers and 12% fewer slices than existing architectures.
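As a rough illustration of the resource-sharing idea (an assumption about the general technique, not the paper's actual IP core), the sketch below folds a variable-length FIR filter onto a small, fixed number of shared multiply-accumulate units, so one configurable datapath can serve several filter lengths.

```python
# Minimal sketch (assumption, not the paper's IP core): folding a
# variable-length FIR filter onto a fixed number of shared multipliers, so the
# same configurable datapath serves several operating modes.

def fir_folded(samples, coeffs, num_macs=2):
    """Compute an N-tap FIR by time-multiplexing taps over `num_macs`
    multiply-accumulate units; each folded pass below processes num_macs taps."""
    n = len(coeffs)
    out = []
    history = [0.0] * n
    for x in samples:
        history = [x] + history[:-1]          # shift register of input samples
        acc = 0.0
        for start in range(0, n, num_macs):   # one folded pass per hardware cycle
            for h, c in zip(history[start:start + num_macs],
                            coeffs[start:start + num_macs]):
                acc += h * c
        out.append(acc)
    return out

# Impulse response of a 4-tap moving average, computed on 2 shared MAC units.
print(fir_folded([1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25], num_macs=2))
```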
ISBN: (Print) 9781424410590
The traditional approach to FPGA packing and CLB-level placement has been shown to yield significantly worse quality than approaches which allow BLEs to move during placement. In practice, however, modern FPGA architectures require expensive DRC checks which can render full BLE-level placement impractical. We address this problem by proposing a novel clustering framework that uses physical information to produce better initial packings which can, in turn, reduce the amount of BLE-level placement that is required. We quantify our packing technique across accepted benchmarks and show that it produces results with 16% less wire length, 19% smaller minimum channel widths, and 8% less critical path delay, on average, than leading methods.
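The following toy sketch illustrates the general idea of physically informed packing (an assumption, not the paper's actual clustering framework): each BLE is greedily added to the open cluster with the highest attraction, where attraction mixes netlist connectivity with physical proximity taken from a rough placement.

```python
# Toy sketch (assumption about the general idea, not the paper's algorithm):
# greedy packing of BLEs into CLBs where attraction to a cluster mixes
# netlist connectivity with physical proximity from a rough placement.

import math

def attraction(ble, cluster, shared_nets, positions, alpha=0.5):
    """Higher is better: alpha weights connectivity against distance."""
    cx = sum(positions[b][0] for b in cluster) / len(cluster)
    cy = sum(positions[b][1] for b in cluster) / len(cluster)
    dist = math.hypot(positions[ble][0] - cx, positions[ble][1] - cy)
    return alpha * shared_nets(ble, cluster) - (1 - alpha) * dist

def pack(bles, positions, shared_nets, clb_size=4):
    clusters = []
    for ble in bles:
        open_clbs = [c for c in clusters if len(c) < clb_size]
        if open_clbs:
            best = max(open_clbs,
                       key=lambda c: attraction(ble, c, shared_nets, positions))
            best.append(ble)
        else:
            clusters.append([ble])
    return clusters

# Hypothetical example: four BLEs on a line, connectivity ignored for brevity.
pos = {"a": (0, 0), "b": (1, 0), "c": (10, 0), "d": (11, 0)}
print(pack(list(pos), pos, shared_nets=lambda b, c: 0, clb_size=2))
```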
ISBN: (Print) 9781424419609
Recent generations of FPGA devices take advantage of the speed and density benefits resulting from heterogeneous FPGA architectures, in which several basic LUTs can be combined to form one larger LUT called a Macro. Large Macros not only decrease network depth efficiently but also reduce area. In this paper, a new technology mapping algorithm named MacroMap is proposed for heterogeneous FPGAs with effective area estimation, to overcome the main disadvantage that traditional technology mapping algorithms only generate one typical kind of K-LUT and cannot make full use of LUTs of different sizes (basic LUTs and Macros). Experimental results show that MacroMap obtains a 19% gain in area while keeping the network depth optimal, compared with the existing heterogeneous FPGA mapping algorithm heteromap [8].
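To illustrate the kind of trade-off such a mapper exploits (under an assumed, simplified cost model rather than the paper's actual area estimation), the sketch below compares covering a cut with one basic K-LUT, one Macro, or a cascade of basic LUTs, in terms of logic depth and area.

```python
# Illustrative sketch only (assumed cost model, not MacroMap itself): when a
# cut's input count exceeds the basic LUT size K, a Macro built from several
# basic LUTs can absorb it in one level instead of a deeper K-LUT cascade.

import math

def cover_cost(num_inputs, k=4, macro_inputs=6, macro_area=2):
    """Return (depth_levels, area_in_basic_LUT_equivalents) for one cut."""
    if num_inputs <= k:
        return 1, 1                       # single basic K-LUT
    if num_inputs <= macro_inputs:
        return 1, macro_area              # one Macro, still one logic level
    # Otherwise cascade basic K-LUTs: each extra level absorbs k-1 new inputs.
    levels = 1 + math.ceil((num_inputs - k) / (k - 1))
    return levels, levels

for fanin in (3, 5, 6, 8):
    print(fanin, cover_cost(fanin))
```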
Quaternary logic has been shown to be a promising alternative for implementing FPGAs, since voltage-mode quaternary circuits can reduce circuit cost and at the same time reduce power consumption. In this pape...
ISBN: (Print) 9781424410590
This paper describes an FPGA-based accelerator for maze routing applications such as integrated circuit detailed routing. The accelerator efficiently supports multiple layers, multi-terminal nets, and rip-up and reroute. By time-multiplexing multiple layers over a two-dimensional array of processing elements, this approach can support multi-layer grids large enough for detailed routing while providing a 1-2 orders of magnitude speedup over software running on a modern desktop computer. The current implementation supports a 32 × 32 routing grid with up to 16 layers in a single Xilinx XC2V6000 FPGA. Up to 64 × 64 routing grids are feasible in larger commercially available FPGAs. Performance measurements (including interface overhead) show a speedup of 29X-93X over the classic Lee Algorithm and 5X-19X over the A* Algorithm. An improved interface design could yield significantly larger speedups.
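For reference, the classic Lee Algorithm that serves as the software baseline in this comparison is a breadth-first wave expansion followed by a backtrace. The minimal single-layer sketch below shows the algorithm itself; the grid and obstacles are made-up examples, and the accelerator in the paper parallelizes this expansion across an array of processing elements.

```python
# Software reference point: a minimal single-layer Lee maze router
# (BFS wave expansion plus backtrace). Grid contents are made-up examples.

from collections import deque

def lee_route(grid, src, dst):
    """grid: 2-D list, 0 = free, 1 = blocked. Returns a path src->dst or None."""
    rows, cols = len(grid), len(grid[0])
    dist = {src: 0}
    frontier = deque([src])
    while frontier:                                  # wave expansion
        r, c = frontier.popleft()
        if (r, c) == dst:
            break
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not grid[nr][nc] \
                    and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                frontier.append((nr, nc))
    if dst not in dist:
        return None
    path = [dst]                                     # backtrace along decreasing labels
    while path[-1] != src:
        r, c = path[-1]
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if dist.get(nb) == dist[path[-1]] - 1:
                path.append(nb)
                break
    return path[::-1]

maze = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(lee_route(maze, (0, 0), (2, 0)))
```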
A method for the development of a test plan for BIST-based exhaustive testing of a circuit implemented with an in-system reconfigurable FPGA is presented. A test plan for application-dependent testing of an FPGA is ba...
We present an extension of a procedure for self-testing of an FPGA that implements a user-defined function. This extension, intended to improve the detectability of FPGA delay faults, exploits the reconfigurability of...
Wireless communication systems operate within a wide variety of dynamic ranges, variable bandwidths and carrier frequencies. New high-density re-programmable logic arrays are a suitable technology basis providing suff...
This paper presents a new type of coarse-grained reconfigurable architecture (CGRA) for the object inference domain in machine learning. The proposed CGRA is optimized for stream processing, and a corresponding programming model called the dual-track model is proposed. The CGRA is realized in Verilog HDL and implemented in a SMIC 55 nm process, with a footprint of 3.79 mm² and consuming 1.79 W at 500 MHz. To evaluate its performance, eight machine-learning algorithms, including HOG, CNN, k-means, PCA, SPM, linear-SVM, Softmax and Joint-Bayesian, are selected as benchmarks. These algorithms cover a general machine-learning flow in the object inference domain: feature extraction, feature selection and inference. The experimental results show that the proposed CGRA achieves 1443× average energy efficiency compared to the Intel i7-3770 CPU and 7.82× energy efficiency compared to a high-performance FPGA solution [19].
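As a note on how such energy-efficiency ratios are typically derived (operations per joule = throughput / power), the arithmetic sketch below uses the CGRA's 1.79 W at 500 MHz from the abstract together with throughput and CPU power figures that are placeholders, not the paper's measurements.

```python
# Hedged arithmetic sketch: energy efficiency as operations per joule.
# The CGRA power (1.79 W) is quoted from the abstract; the throughput and
# CPU power numbers below are placeholders, not the paper's measured data.

def energy_efficiency(ops_per_second, watts):
    return ops_per_second / watts            # operations per joule

cgra = energy_efficiency(ops_per_second=100e9, watts=1.79)   # placeholder throughput
cpu  = energy_efficiency(ops_per_second=10e9,  watts=77.0)   # placeholder figures
print(f"CGRA vs CPU energy-efficiency ratio: {cgra / cpu:.0f}x")
```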