In recent years, the RapidSmith CAD tool [1] has been used with ISE to create custom CAD tools targeting Xilinx FPGAs. This tool flow was based on the Xilinx Design Language (XDL), a human-readable representation of a...
Molecular dynamics (MD) is of central importance to computational chemistry. Here the authors show that MD can be implemented efficiently on a commercial off-the-shelf (COTS) field-programmable gate array (FPGA) board, and that speed-ups from 31x to 88x over a PC implementation can be obtained. Although the extent of speed-up depends on the stability required, 46x can be obtained with virtually no detriment, and the upper end of the range is apparently viable in many cases. The authors sketch the FPGA implementations and describe the effects of precision on the trade-off between the performance and quality of the MD simulation.
ISBN: 9789090304281 (print)
The latest published studies with extensive explorations of look-up table and cluster sizes are now more than a decade old. However, CMOS technology as well as CAD and transistor modeling tools have improved so much since then that it is reasonable to wonder whether the conclusions of such studies still hold. One of the major difficulties of conducting these studies, especially in academia, is producing credible delay and area models. In this paper, we take advantage of a recently developed architecture modeling tool to re-evaluate the effect of the various cluster parameters on the FPGA. We considerably extend the exploration space beyond that of the classic studies to include sparse crossbars and fracturable LUTs, and show some results that go against the current tenets of FPGA architecture.
ISBN: 9781728148847 (print)
Most internal FPGA debug methods require the use of Block-RAM (BRAM) memory for trace buffers. Recent work has shown the viability of replacing BRAMs with distributed, LUT-based memory. Distributed memory (DIME) trace buffers are lean and can be utilized in large designs where other debug methods are unlikely to fit. Since LUTs are abundant on FPGA devices, there are nearly always some left unused after the user's design is placed, even for designs that utilize more than 90% of the FPGA's resources. DIME trace buffers are inserted into highly utilized designs within minutes using RapidWright. In this paper we contrast the previously used method of scavenging leftover LUT resources with a preallocation scheme that ensures a certain number of memory LUTs are left available for distributed memory trace buffers. While causing virtually no penalty to the user design, preallocating memory LUT resources allows the very largest designs to utilize higher numbers of distributed memory trace buffers at lower timing penalties. We also show that the depth of DIME trace buffers can be extended from 16 to 256 bits.
ISBN: 9782839918442 (print)
Deploying advanced Simultaneous Localisation and Mapping (SLAM) algorithms in autonomous low-power robotics will enable emerging new applications which require an accurate and information-rich reconstruction of the environment. This has not been achieved so far because accuracy and dense 3D reconstruction come with a high computational complexity. This paper discusses custom hardware design on a novel platform for embedded SLAM, an FPGA-SoC, combining an embedded CPU and programmable logic on the same chip. The use of programmable logic, tightly integrated with an efficient multicore embedded CPU, stands to provide an effective solution to this problem. In this work an average framerate of more than 4 frames/second for a resolution of 320x240 has been achieved with an estimated power of less than 1 Watt for the custom hardware. In comparison to the software-only version, running on a dual-core ARM processor, an acceleration of 2x has been achieved for LSD-SLAM, without any compromise in the quality of the result.
ISBN: 9781467381239 (print)
Firstly, we present VTR-to-Bitstream v2.0, the latest version of our open-source toolchain that takes Verilog input and produces a packed, placed, and now routed solution that can be programmed onto the Xilinx commercial FPGA architecture. Secondly, we apply this updated tool to measure the gap between academic and industrial FPGA tools by examining the quality of results at each of the three main compilation stages: synthesis, packing & placement, and routing. Our findings indicate that the delay gap (according to Xilinx static timing analysis) for academic tools breaks down into a 31% degradation with synthesis, 10% with packing & placement, and 15% with routing. This leads us to believe that opportunities for improvement exist not only within VPR, but also in the front-end tools that lie upstream.
ISBN: 9781424438914 (print)
This paper describes an approach to the placement of self-timed circuits onto commercial FPGAs, using only conventional synchronous tools available on the market. Different parts of the design are constrained in order to maintain the timing relationship required for guaranteeing the correct circuit functionality and to keep the wiring influence on system delays bounded and fixed across the different iterations. This work is part of the extension to the CodeSimulink co-design environment we made in order to allow the synthesis of asynchronous circuits from Simulink specifications.
ISBN: 9781538685174 (digital)
ISBN: 9781538685174 (print)
High-Level Synthesis (HLS) tools enable rapid hardware development, but design expertise and effort are necessary to tune the high-level descriptions into optimized circuits. To improve designer productivity, automated design-space exploration techniques have been proposed. However, the optimization processes sample expensive CAD flows. In this paper, we adapt multi-fidelity optimization methods to incorporate low-fidelity estimates available in the FPGA CAD flow and speed up tuning of HLS parameters. We find that multi-fidelity optimization techniques can significantly reduce optimization time compared to previous approaches.
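To illustrate why cheap estimates help when each full CAD run is expensive, here is a minimal sketch of a two-stage screening loop. It is not the multi-fidelity method of the paper, only a simple stand-in: the function names (`screen_then_evaluate`, `cheap_estimate`, `full_flow`) and the `keep` parameter are hypothetical, and lower cost is assumed to be better.

```python
from typing import Callable, Sequence, Tuple, TypeVar

C = TypeVar("C")  # a candidate HLS configuration (hypothetical representation)


def screen_then_evaluate(
    candidates: Sequence[C],
    cheap_estimate: Callable[[C], float],
    full_flow: Callable[[C], float],
    keep: int,
) -> Tuple[C, float]:
    """Rank all candidates with a low-fidelity estimate, then run the
    expensive high-fidelity flow only on the `keep` most promising ones.

    Returns the best candidate found and its high-fidelity cost.
    """
    if keep < 1 or not candidates:
        raise ValueError("need at least one candidate and keep >= 1")

    # Stage 1: low-fidelity ranking (e.g. an early HLS report; assumption).
    shortlist = sorted(candidates, key=cheap_estimate)[:keep]

    # Stage 2: high-fidelity evaluation (e.g. full place-and-route; assumption).
    scored = [(full_flow(c), c) for c in shortlist]
    best_cost, best = min(scored, key=lambda pair: pair[0])
    return best, best_cost
```

With five candidates and `keep=2`, the expensive flow runs twice instead of five times; the risk is that a poor low-fidelity estimate screens out the true optimum, which is the trade-off that proper multi-fidelity methods are designed to manage.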
ISBN: 9781424438914 (print)
This paper presents an analytical model that relates the architectural parameters of an FPGA to the place-and-route runtimes of the FPGA CAD tools. We consider both a simulated-annealing-based placement algorithm employing a bounding-box wirelength cost function, and a negotiation-based A* router. We also show an example application of the model in early architecture evaluation.
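As a rough illustration of what a bounding-box wirelength cost function computes, the sketch below uses the plain half-perimeter form: each net costs the half-perimeter of the smallest rectangle enclosing its pins. The data layout (`nets` mapping net names to block names, `placement` mapping block names to grid coordinates) is an assumption made for this example, and production placers may add correction factors that are omitted here.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (x, y) grid location of a placed block


def bounding_box_wirelength(pins: List[Point]) -> int:
    """Half-perimeter of the smallest rectangle enclosing all pins of one net."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def placement_cost(nets: Dict[str, List[str]], placement: Dict[str, Point]) -> int:
    """Sum the bounding-box wirelength over every net in the design."""
    return sum(
        bounding_box_wirelength([placement[block] for block in blocks])
        for blocks in nets.values()
    )


# Tiny example: three blocks, two nets.
nets = {"n1": ["a", "b"], "n2": ["a", "b", "c"]}
before = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
after = {"a": (0, 0), "b": (1, 1), "c": (3, 4)}  # blocks b and c swapped

print(placement_cost(nets, before))  # 14
print(placement_cost(nets, after))   # 9
```

A simulated-annealing placer proposes moves such as the swap above and accepts or rejects them based on the change in this cost, so the time to evaluate it per move feeds directly into placement runtime.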
ISBN: 9789090304281 (print)
Bufferless, deflection-routed Butterfly Fat Trees (BFTs) can outperform state-of-the-art FPGA overlay NoCs such as Hoplite by as much as 2-5x on throughput and approximately 5x on worst-case latency at identical PE counts, and by approximately 1.5x on throughput at identical resource costs (>16K LUTs) for statistical traffic patterns. In this paper, we show how to modify the tree connectivity and routing function to support deflection routing on the BFT topology. We introduce the idea of localized deflections that trap deflected packets within a single level of a multi-level BFT to avoid the long round-trip penalty traditionally associated with deflection routing. Across a range of statistical traffic patterns, we show a sustained throughput improvement of 2-5x over Hoplite for system sizes as large as 512 PEs at injection rates above 20% when using localized deflections. We also show how the configurable bisection bandwidth of the BFT, modeled with the Rent parameter 0 [...] the best performing NoC at a desired cost. For instance, our NoC generator can produce simple trees (p=0) for low-cost applications (<2K LUTs), mesh-equivalent BFTs (p=0.5) for real-world applications with locality (<10K LUTs), and crossbars (p=1) when cost is not a constraint (>64K LUTs). For workloads with locality, we recommend the BFT topology with p=0.67.
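The three named design points can be related through the generic Rent's-rule scaling, in which a subtree containing n PEs gets on the order of n^p links toward the rest of the network. The sketch below assumes exactly that scaling with a unit constant; the function name `upward_links` is hypothetical, and the abstract does not state how the actual NoC generator rounds or provisions links.

```python
def upward_links(subtree_pes: int, p: float) -> int:
    """Approximate number of upward links for a subtree of `subtree_pes` PEs,
    assuming Rent-style scaling links ~ n**p with a constant of 1."""
    if subtree_pes < 1:
        raise ValueError("subtree must contain at least one PE")
    if not 0.0 <= p <= 1.0:
        raise ValueError("Rent parameter p must lie in [0, 1]")
    return max(1, round(subtree_pes ** p))


# Design points named in the abstract, for a 64-PE subtree:
for p in (0.0, 0.5, 0.67, 1.0):
    print(p, upward_links(64, p))
```

Under this assumption p=0 gives a single link regardless of subtree size (a simple tree), p=0.5 gives square-root growth comparable to the bisection of a 2D mesh, and p=1 gives one link per PE, which matches the tree, mesh-equivalent, and crossbar labels in the text.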