fpga place and route is time consuming, often serving as the major obstacle inhibiting a fast edit-compile-test loop in prototyping and development and the major obstacle preventing late-bound hardware and design mapp...
详细信息
fpga place and route is time consuming, often serving as the major obstacle inhibiting a fast edit-compile-test loop in prototyping and development and the major obstacle preventing late-bound hardware and design mapping for reconfigurable systems. Previous work showed that hardware-assisted routing can accelerate fanout-free routing on Fat-Trees by three orders of magnitude with modest modifications to the network itself. In this paper, we show how these techniques can be applied to any fpga and how they can be implemented on top of LUT networks in cases where modification of the fpga itself is not justified. We further show how to accommodate fanout and how to achieve comparable route quality to software-based methods. For a tree network, we estimate an fpga implementation of our routing logic could route the Toronto Place and Route Benchmarks at least two orders of magnitude faster than a software Pathfinder while achieving within 3% of the aggregate quality. Preliminary results on small mesh benchmarks achieve within one track of vpr -fast.
In this paper, we study the problem of placement-driven technology mapping for table-lookup based fpga architectures to optimize circuit performance. Early work on technology mapping for fpgas such as Chortle-d[14] an...
详细信息
In this paper, we study the problem of placement-driven technology mapping for table-lookup based fpga architectures to optimize circuit performance. Early work on technology mapping for fpgas such as Chortle-d[14] and Flowmap[3] aim to optimize the depth of the mapped solution without consideration of interconnect delay. Later works such as Flowmap-d[7], Bias-Clus[4] and EdgeMap consider interconnect delays during mapping, but do not take into consideration the effects of their mapping solution on the final placement. Our work focuses on the interaction between the mapping and placement stages. First, the interconnect delay information is estimated from the placement, and used during the labeling process. A placement-based mapping solution which considers both global cell congestion and local cell congestion is then developed. Finally, a legalization step and detailed placement is performed to realize the design. We have implemented our algorithm in a LUT based fpga technology mapping package named PDM (Placement-Driven Mapping) and tested the implementation on a set of MCNC benchmarks. We use the tool VPR[1][2] for placement and routing of the mapped netlist. Experimental results show the longest path delay on a set of large MCNC benchmarks decreased by 12.3% on the average.
This paper describes the Altera Stratix logic and routing architecture. The primary goals of the architecture were to achieve high performance and logic density. We give an overview of the entire device, and then focu...
详细信息
This paper describes the Altera Stratix logic and routing architecture. The primary goals of the architecture were to achieve high performance and logic density. We give an overview of the entire device, and then focus on the logic and routing architecture. The Stratix logic architecture is based on a cluster of ten 4-input LUTs and its routing consists of staggered routing lines. We describe the development of the routing architecture, including its directional bias, its direct-drive routing which reduces both area and delay. The logic array block and logic cell design is also described, and new routing structures with in the logic array block, and logic element features are described.
The routing channels of an fpga consist of wire segments of various types providing the tradeoff between performance and routability. In the routing architectures of recently developed fpgas (e.g., Virtex-II), there a...
详细信息
The routing channels of an fpga consist of wire segments of various types providing the tradeoff between performance and routability. In the routing architectures of recently developed fpgas (e.g., Virtex-II), there are more versatile wire types and richer connections between them than those of the older generations of fpgas (e.g. XC4000). To fully exploit the potential of the new routing architectures, it is beneficial to perform wire type assignment for all channels as an intermediate stage between global routing and detailed routing. In this paper, we present a wire-type assignment algorithm that is based on iteratively applying min-cost maxflow technique to simultaneously route many nets. At each stage of the network flow computation, we have guaranteed optimal result in terms of routability and delay cost. We use the routing architecture of the Virtex-II fpgas from Xilinx as a target architecture in our experiments. Experimental results show that our algorithm outperforms the traditional sequential net-by-net approach.
As integrated circuits become more and more complex, the ability to make post-fabrication changes will become more and more attractive. This ability can be realized using programmable logic cores. Currently, such core...
详细信息
As integrated circuits become more and more complex, the ability to make post-fabrication changes will become more and more attractive. This ability can be realized using programmable logic cores. Currently, such cores are available from vendors in the form of a "hard" layout. In this paper, we focus on an alternative approach: vendors supply a synthesizable version of their programmable logic core (a "soft" core) and the integrated circuit designer synthesizes the programmable logic fabric using standard cells. Although this technique suffers increased speed, density, and power overhead, the task of integrating such cores is far easier than the task of integrating "hard" cores into an ASIC. For very small amounts of logic, this ease of use may be more important than the increased overhead. This paper presents two synthesizable programmable logic core architectures, describes the associated place and route CAD tools, and compares the two architectures to each other, and to a "hard" programmable logic core. It also shows how these cores can be made more efficient by creating a non-rectangular architecture, an option not available to "hard" core vendors.
We present a pipelining-aware router for fpgas. The problem of routing pipelined signals is different from the conventional fpga routing problem. For example, the two terminal N-Delay pipelined routing problem is to f...
详细信息
We present a pipelining-aware router for fpgas. The problem of routing pipelined signals is different from the conventional fpga routing problem. For example, the two terminal N-Delay pipelined routing problem is to find the lowest cost route between a source and sink that goes through at least N (N > 1) distinct pipelining resources. In the case of a multi-terminal pipelined signal, the problem is to find a Minimum Spanning Tree that contains sufficient pipelining resources such that the delay constraint at each sink is satisfied. We begin this work by proving that the two terminal N-Delay problem is NP-Complete. We then propose an optimal algorithm for finding a lowest cost 1-Delay route. Next, the optimal 1-Delay router is used as the building block for a greedy two terminal N-Delay router. Finally, a multi-terminal routing algorithm (PipeRoute) that effectively leverages the 1-Delay and N-Delay routers is proposed. PipeRoute's performance is evaluated by routing a set of retimed benchmarks on the RaPiD [2] architecture. Our results show that the architecture overhead incurred in routing retimed netlists on RaPiD is less than a factor of two. Further, the results indicate a possible trend between the architecture overhead and the percentage of pipelined signals in a netlist.
Although fpgas are a cost-efficient alternative for both ASICs and general purpose processors, they still result in designs which are more than an order of magnitude more costly and slower than their equivalents imple...
详细信息
Although fpgas are a cost-efficient alternative for both ASICs and general purpose processors, they still result in designs which are more than an order of magnitude more costly and slower than their equivalents implemented in dedicated logic. This efficiency gap makes fpgas less suitable for high-volume cost-sensitive applications (e.g. embedded systems). We show that the intrinsic cost of traditional general-purpose fpgas can be reduced if they are designed to target an application domain or a class of applications only. We propose a method of the application-domain characterization and apply it to characterize DSP. A novel fpga logic block architecture derived based on such an analysis, and which exploits properties of target applications, is presented. Its key feature is the 'mixed-level granularity' being a trade-off between fine and coarse granularity required for the implementation of datapath and random logic functions, respectively. This leads to a factor of four improvement in the LUT memory size compared to commercial fpgas. and, assuming a standard-cell implementation, a 1.6-2.8 lower datapath mapping cost. A modified mixed-grain architecture with the ALU-like functionality reduces the LUT memory size by a factor of 16 compared to commercial fpgas, and mapped onto standard cells has a 1.9-3.3 times higher datapath mapping efficiency. For these reasons, the proposed fpga architectures may be an interesting alternative to the traditional general-purpose fpga devices, especially if characteristics of a target application domain are known a priori.
In this paper, we present techniques for energy-efficient design at the algorithm level using fpgas. We then use these techniques to create energy-efficient designs for two signal processing kernel applications: fast ...
详细信息
In this paper, we present techniques for energy-efficient design at the algorithm level using fpgas. We then use these techniques to create energy-efficient designs for two signal processing kernel applications: fast Fourier transform (FFT) and matrix multiplication. We evaluate the performance, in terms of both latency and energy efficiency, of fpgas in performing these tasks. Using a Xilinx Virtex-II as the target fpga. we compare the performance of our designs to those from the Xilinx library as well as to conventional algorithms run on the PowerPC core embedded in the Virtex-II Pro and the Texas Instruments TMS320C6415. Our evaluations are done both through estimation based on energy and latency equations and through low-level simulation. For FFT, our designs dissipated an average of 60% less energy than the design from the Xilinx library and 56% less than the DSP. Our designs showed a factor of 10 improvement over the embedded processor. These results provide concrete evidence to substantiate the idea that fpgas can outperform DSPs and embedded processors in signal processing. Further, they show that fpgas can achieve this performance while still dissipating less energy than the other two types of devices.
One of the most difficult and time-consuming steps in the creation of an fpga is its transistor-level design and physical layout. Modern commercial fpgas typically consume anywhere from 50 to 200 man-years simply in t...
详细信息
One of the most difficult and time-consuming steps in the creation of an fpga is its transistor-level design and physical layout. Modern commercial fpgas typically consume anywhere from 50 to 200 man-years simply in the layout step. To date, automated tools have only been employed in small parts of the periphery and programming circuitry. The core tiles, which are repeated many times, are subject to painstaking manual design and layout. In this paper we present a new system (called GILES, for Good Instant Layout of Erasable Semiconductors) that automatically generates a transistor-level schematic from a high-level architectural specification of an fpga. It also generates a cell-level netlist that is placed and routed automatically. The architectural specification is the one used as input to the VPR [3] architectural exploration tool. The output is the mask-level layout of a single tile that can be replicated to form an fpga array. We describe a new placement tool that simultaneously places and compacts the layout to minimize white space and wiring demand, and a special-purpose router built for this task. GILES can place and route a tile consisting of four 4-input LUT logic cells and all of its programmable wires in a 0.18 μm CMOS process using 8 layers of metal and 25983 μm2 of area. When we generate the layout of an architecture similar to-the Xilinx Virtex-E fpga (built in a 0.1 μm process) GILES requires only 47% more area than the original. The layout area of an architecture similar to the Altera Apex 20K400E (also built in a 0.18 μm process) constructed by GILES requires 97% more area than the original.
暂无评论