Placement and routing run-times continue to dominate the automated FPGA design flow. As the size of FPGA architectures continues to grow exponentially, it remains critical to develop parallel tools for FPGA design where the amount of exposed concurrent work scales with the size of the designs to be synthesized. In this paper, we propose a novel algorithm for parallel placement, based on simulated annealing, where the amount of parallel work scales directly with the size of the net-list to be placed. Our approach concurrently evaluates and conditionally applies very large sets of non-conflicting swaps using common parallel computing primitives, including stream compaction, category reduction, and sort. While our design is suitable for all modern parallel computing platforms, we present results from an implementation targeting NVIDIA's CUDA platform, where we achieve a mean speed-up of 19x over VPR with post-routing critical-path delay and wire-length quality that matches or exceeds VPR. We believe that this work is an important step towards the development of a scalable, high-quality placement tool.
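The idea of evaluating a batch of non-conflicting swaps and then conditionally applying them can be illustrated with a small sequential sketch in Python. This is not the paper's CUDA implementation: the cost function (half-perimeter wire length), the conflict rule (two swaps conflict if they share a block or a net), and all function names here are assumptions chosen for illustration. The list comprehensions stand in for the parallel map and stream-compaction primitives.

```python
import math
import random

def hpwl(nets, pos):
    """Half-perimeter wire length summed over the given nets."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def nets_of(nets, n_blocks):
    """Map each block to the indices of the nets it belongs to."""
    table = [set() for _ in range(n_blocks)]
    for i, net in enumerate(nets):
        for b in net:
            table[b].add(i)
    return table

def select_non_conflicting(candidates, block_nets):
    """Greedily keep swaps sharing no block and no net with an earlier kept swap."""
    used_blocks, used_nets, kept = set(), set(), []
    for a, b in candidates:
        touched = block_nets[a] | block_nets[b]
        if a == b or a in used_blocks or b in used_blocks or touched & used_nets:
            continue
        kept.append((a, b))
        used_blocks.update((a, b))
        used_nets |= touched
    return kept

def swap_delta(swap, nets, block_nets, pos):
    """Cost change of one swap, measured against the unmodified placement."""
    a, b = swap
    affected = [nets[i] for i in block_nets[a] | block_nets[b]]
    trial = list(pos)
    trial[a], trial[b] = pos[b], pos[a]
    return hpwl(affected, trial) - hpwl(affected, pos)

def annealing_step(nets, pos, candidates, temperature, rng):
    """Evaluate all non-conflicting swaps, then apply the accepted ones."""
    block_nets = nets_of(nets, len(pos))
    swaps = select_non_conflicting(candidates, block_nets)
    # "Map": each delta depends only on the snapshot, so these are independent.
    deltas = [swap_delta(s, nets, block_nets, pos) for s in swaps]
    # Metropolis criterion: always take improvements, sometimes take regressions.
    flags = [d <= 0 or rng.random() < math.exp(-d / temperature) for d in deltas]
    # "Stream compaction": keep only the swaps whose flag is set.
    accepted = [(s, d) for s, d, f in zip(swaps, deltas, flags) if f]
    new_pos = list(pos)
    for (a, b), _ in accepted:
        new_pos[a], new_pos[b] = new_pos[b], new_pos[a]
    return new_pos, accepted
```

Because the kept swaps share no nets under this conflict rule, each net is affected by at most one swap, so the individual deltas add up exactly to the total change in cost. A looser conflict rule would expose more parallel work at the price of approximate deltas.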
This paper describes the methodology and algorithms behind the extra pipeline analysis tools released in the Xilinx Vivado Design Suite version 2015.3. Extra pipelining is one of the most effective ways to improve the performance of FPGA applications. Manual pipelining, however, often requires significant effort from FPGA designers, who need to explore various changes in the RTL and re-run the flow iteratively. The automatic pipelining approach described in this paper, in contrast, allows FPGA users to explore latency vs. performance trade-offs of their designs before investing time and effort in modifying RTL. We describe the algorithms behind these tools, which use simple cut heuristics to maximize performance improvement while minimizing additional latency and register overhead. To demonstrate the effectiveness of the proposed approach, we analyse a set of 93 commercial FPGA applications and IP blocks mapped to the Xilinx UltraScale+ and UltraScale generations of FPGAs. The results show that extra pipelining can provide from 18% to 29% potential Fmax improvement on average. They also show that the distribution of improvements is bimodal, with almost half of the designs in the benchmark suite showing no improvement due to the presence of large loops. Finally, we demonstrate that highly pipelined designs map well to the UltraScale+ and UltraScale FPGA architectures. Our approach demonstrates 19% and 20% Fmax improvement potential for the UltraScale+ and UltraScale architectures respectively, with the majority of applications reaching their loop limit through pipelining.
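The role of loops in limiting the benefit of pipelining can be illustrated with a deliberately idealised model in Python. This is not the cut heuristic used by the Vivado tools. It assumes that extra registers split every feed-forward path into perfectly balanced stages, that each loop contains a single register and cannot be cut, and that delays are given in arbitrary time units. The function names and parameters are hypothetical.

```python
def min_period(feedforward_delays, loop_delays, extra_stages):
    """Idealised clock period after adding `extra_stages` balanced pipeline cuts.

    Loops are assumed uncuttable, so the slowest loop is a floor on the period.
    """
    loop_limit = max(loop_delays, default=0.0)
    worst_path = max(feedforward_delays, default=0.0) / (extra_stages + 1)
    return max(loop_limit, worst_path)

def fmax_gain(feedforward_delays, loop_delays, extra_stages):
    """Relative Fmax improvement over the unpipelined design (0.25 means +25%).

    Assumes the design has at least one path or loop with non-zero delay.
    """
    base = min_period(feedforward_delays, loop_delays, 0)
    new = min_period(feedforward_delays, loop_delays, extra_stages)
    return base / new - 1.0
```

Under these assumptions, a design whose slowest loop is already as slow as its slowest feed-forward path gains nothing from any number of extra stages, while a design with short loops improves until the period reaches the loop limit. This is one way to read the bimodal distribution described in the abstract.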