This paper introduces a modified packing and placement algorithm for FPGAs that uses logic duplication to improve performance. The modified packing algorithm is designed to leave unused basic logic elements (BLEs) in timing-critical clusters as potential targets for logic duplication. The modified placement algorithm adds a new stage after placement in which logic duplication is performed to shorten the critical path. We show that, in a representative FPGA architecture using 0.18 μm technology, the length of the final critical path can be reduced by an average of 14.1%. Approximately half of this gain comes directly from the changes to the packing algorithm, while the other half comes from the logic duplication performed during placement.
This paper describes the hardware implementation of a real-time, large-scale, multi-chip FPGA (Field-Programmable Gate Array) based emulation engine with a capacity of 10 million ASIC (Application-Specific Integrated Circuit) equivalent gates. The attainable system operating frequency can exceed 60 MHz, and the system throughput has been empirically verified to reach 600 billion 16-bit additions per second. The emulator is custom designed to maximize performance and resource utilization for a range of telecommunication and digital signal processing applications. With its high-speed interconnect architecture and large external I/O bandwidth, the emulator excels in prototyping real-time systems that have strict timing, logic capacity, and data rate requirements. Our development efforts are guided by ongoing projects such as ultra-wide band (UWB) and multi-channel-multi-antenna (MCMA) radio systems research.
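As a rough sanity check on the figures above, the quoted throughput can be converted into additions per clock cycle. The sketch below assumes the throughput is measured at exactly 60 MHz (the abstract only says the frequency can exceed 60 MHz), so the result is an illustrative estimate, not a number reported by the authors. The function name is hypothetical.

```python
# Figures quoted in the abstract.
THROUGHPUT_ADDS_PER_S = 600e9  # 600 billion 16-bit additions per second
CLOCK_HZ = 60e6                # assumption: clock is exactly 60 MHz


def adds_per_cycle(throughput: float, clock_hz: float) -> float:
    """Return how many additions must complete in each clock cycle."""
    return throughput / clock_hz


if __name__ == "__main__":
    # Under the 60 MHz assumption this is the number of 16-bit additions
    # that must finish every cycle to sustain the quoted throughput.
    print(adds_per_cycle(THROUGHPUT_ADDS_PER_S, CLOCK_HZ))
```

At a higher clock frequency the same throughput would need proportionally fewer additions per cycle.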
This paper describes the Altera Stratix logic and routing architecture. The primary goals of the architecture were to achieve high performance and logic density. We give an overview of the entire device, and then focus on the logic and routing architecture. The Stratix logic architecture is based on a cluster of ten 4-input LUTs, and its routing consists of staggered routing lines. We describe the development of the routing architecture, including its directional bias and its direct-drive routing, which reduces both area and delay. We also describe the logic array block and logic cell design, as well as new routing structures within the logic array block and logic element features.
As integrated circuits become more and more complex, the ability to make post-fabrication changes will become more and more attractive. This ability can be realized using programmable logic cores. Currently, such cores are available from vendors in the form of a "hard" layout. In this paper, we focus on an alternative approach: vendors supply a synthesizable version of their programmable logic core (a "soft" core), and the integrated circuit designer synthesizes the programmable logic fabric using standard cells. Although this technique incurs increased speed, density, and power overhead, the task of integrating such cores is far easier than the task of integrating "hard" cores into an ASIC. For very small amounts of logic, this ease of use may be more important than the increased overhead. This paper presents two synthesizable programmable logic core architectures, describes the associated place and route CAD tools, and compares the two architectures to each other and to a "hard" programmable logic core. It also shows how these cores can be made more efficient by creating a non-rectangular architecture, an option not available to "hard" core vendors.
How does multilevel metallization impact the design of FPGA interconnect? The availability of a growing number of metal layers presents the opportunity to use wiring in the third dimension to reduce switch requirements. Unfortunately, traditional FPGA wiring schemes are not designed to exploit these additional metal layers. We introduce an alternate topology, based on Leighton's Mesh-of-Trees, which carefully exploits hierarchy to allow additional metal layers to support arbitrary device scaling. When wiring layers grow sufficiently fast with aggregate network size (N), our network requires only O(N) area; this is in stark contrast to traditional Manhattan FPGA routing schemes, where switching requirements alone grow superlinearly in N. In practice, we show that, even for the admittedly small designs in the Toronto "FPGA Place and Route Challenge," the Mesh-of-Trees networks require 10% fewer switches than the standard Manhattan FPGA routing scheme.
To truly exploit FPGAs for rapid turn-around development and prototyping, placement times must be reduced to seconds; late-bound, reconfigurable computing applications may demand placement times as short as microseconds. In this paper, we show how a systolic structure can accelerate placement by assigning one processing element to each possible location for an FPGA LUT from a design netlist. We demonstrate that our technique approaches the same quality point as traditional simulated annealing, as measured by a simple linear wirelength metric. Experimental results look ahead to compare quality against VPR's fast placer when considering the minimum channel width required to route as the primary optimization criterion. Preliminary results from an FPGA implementation show the feasibility of accelerating simulated annealing by three orders of magnitude using this approach. This means we can place the largest design in the University of Toronto's "FPGA Placement and Routing Challenge" in around 4 ms.
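For readers unfamiliar with the baseline mentioned above, the following is a minimal software sketch of simulated-annealing placement scored by a linear wirelength metric. It is not the systolic hardware structure the paper proposes. The half-perimeter bounding-box metric, the swap move, the cooling schedule, and all names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def wirelength(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets (a linear metric)."""
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(cells, nets, width, height, steps=20000, t0=5.0, alpha=0.9995, seed=0):
    """Place cells on a width x height grid by swapping location contents."""
    if len(cells) > width * height or width * height < 2:
        raise ValueError("grid too small for this design")
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(slots)
    grid = {slot: None for slot in slots}  # location -> cell name, or None if empty
    placement = {}                         # cell name -> location
    for cell, slot in zip(cells, slots):
        grid[slot] = cell
        placement[cell] = slot
    cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    temp = t0
    for _ in range(steps):
        a, b = rng.sample(slots, 2)
        cell_a, cell_b = grid[a], grid[b]
        if cell_a is not None or cell_b is not None:
            # Tentatively swap the contents of locations a and b.
            grid[a], grid[b] = cell_b, cell_a
            if cell_a is not None:
                placement[cell_a] = b
            if cell_b is not None:
                placement[cell_b] = a
            delta = wirelength(placement, nets) - cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp).
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost += delta
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:
                # Reject the move and restore the previous state.
                grid[a], grid[b] = cell_a, cell_b
                if cell_a is not None:
                    placement[cell_a] = a
                if cell_b is not None:
                    placement[cell_b] = b
        temp *= alpha  # geometric cooling
    return best, best_cost


if __name__ == "__main__":
    cells = ["a", "b", "c", "d"]
    nets = [("a", "b"), ("b", "c"), ("c", "d")]
    print(anneal(cells, nets, width=2, height=2))
```

The sketch recomputes the full wirelength after every move for simplicity; a practical placer would update the cost incrementally for only the nets touching the swapped cells.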
This paper presents a flexible FPGA architecture evaluation framework, named fpgaEVA-LP, for power efficiency analysis of LUT-based FPGA architectures. Our work has several contributions: (i) We develop a mixed-level FPGA power model that combines switch-level models for interconnects and macromodels for LUTs; (ii) We develop a tool that automatically generates a back-annotated gate-level netlist with post-layout extracted capacitances and delays; (iii) We develop a cycle-accurate power simulator based on our power model. It carries out gate-level simulation under a real delay model and is able to capture glitch power; (iv) Using the framework fpgaEVA-LP, we study the power efficiency of FPGAs, in 0.10 μm technology, under various settings of architecture parameters such as LUT sizes, cluster sizes, and wire segmentation schemes, and reach several important conclusions. We also present the detailed power consumption distribution among different FPGA components and shed light on the potential opportunities for power optimization in future FPGA designs (e.g., ≤ 0.10 μm technology).
This paper presents a new power-saving, high-speed FPGA design that enhances a previous SiGe CML FPGA based on the Xilinx 6200 FPGA. The design aims at higher performance while minimizing power consumption. The new SiGe process has traded off the circuit's performance for reduced power consumption. The power supply voltage has been reduced from 3.4 V to 2.0 V. The structure of the Basic Cell, including the Configurable Logic Block (CLB) and routing multiplexers (MUXs), has been modified so that the supply voltage reduction can be attained. Simulations have shown that the gate delay of the new Basic Cell is reduced from 130 ps in the prior design to 51 ps. The total power consumption for each Basic Cell has been reduced by 94%, from 71 mW to 4.2 mW, making a large-scale FPGA feasible. This design is currently under fabrication for testing.
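The percentage quoted above follows directly from the before and after figures. The short sketch below reproduces that arithmetic for the power numbers and applies the same formula to the gate delay. The helper name is hypothetical, and the delay percentage is derived here, not stated in the abstract.

```python
def percent_reduction(old: float, new: float) -> float:
    """Return the relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


if __name__ == "__main__":
    # Power per Basic Cell: 71 mW -> 4.2 mW (figures from the abstract).
    print(f"power reduction: {percent_reduction(71.0, 4.2):.1f}%")
    # Gate delay: 130 ps -> 51 ps (figures from the abstract).
    print(f"delay reduction: {percent_reduction(130.0, 51.0):.1f}%")
```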
FPGA place and route is time-consuming. It is often the major obstacle to a fast edit-compile-test loop in prototyping and development, and the major obstacle preventing late-bound hardware and design mapping for reconfigurable systems. Previous work showed that hardware-assisted routing can accelerate fanout-free routing on Fat-Trees by three orders of magnitude with modest modifications to the network itself. In this paper, we show how these techniques can be applied to any FPGA and how they can be implemented on top of LUT networks in cases where modification of the FPGA itself is not justified. We further show how to accommodate fanout and how to achieve route quality comparable to software-based methods. For a tree network, we estimate that an FPGA implementation of our routing logic could route the Toronto Place and Route Benchmarks at least two orders of magnitude faster than a software Pathfinder while achieving within 3% of the aggregate quality. Preliminary results on small mesh benchmarks achieve within one track of vpr -fast.
In this paper, we study the problem of placement-driven technology mapping for table-lookup based FPGA architectures to optimize circuit performance. Early work on technology mapping for FPGAs, such as Chortle-d [14] and Flowmap [3], aims to optimize the depth of the mapped solution without consideration of interconnect delay. Later works such as Flowmap-d [7], Bias-Clus [4], and EdgeMap consider interconnect delays during mapping, but do not take into consideration the effects of their mapping solution on the final placement. Our work focuses on the interaction between the mapping and placement stages. First, the interconnect delay information is estimated from the placement and used during the labeling process. A placement-based mapping solution that considers both global cell congestion and local cell congestion is then developed. Finally, a legalization step and detailed placement are performed to realize the design. We have implemented our algorithm in a LUT-based FPGA technology mapping package named PDM (Placement-Driven Mapping) and tested the implementation on a set of MCNC benchmarks. We use the tool VPR [1][2] for placement and routing of the mapped netlist. Experimental results show that the longest path delay on a set of large MCNC benchmarks decreased by 12.3% on average.