ISBN: 9781581133417 (print)
In this paper we present a performance-driven mapping algorithm, PLAmap, for CPLD architectures which consist of a large number of PLA-style logic cells. The primary goal of our mapping algorithm is to minimize the depth of the mapped circuit. Meanwhile, we have successfully reduced the area of the mapped circuits by applying several heuristic techniques, including threshold control of PLA fanouts and product terms, slack-time relaxation, and PLA-packing. We compare our PLAmap with a recently-published algorithm TEMPLA [1] and a commercial tool, Altera's MAX+PLUS II [16]. Experimental results on various MCNC benchmarks show that overall TEMPLA uses 8 to 11% less area at the cost of 96 to 106% more mapping depth, and MAX+PLUS II uses 12% less area but 58% more delay compared with our mapper.
ISBN: 9781581133417 (print)
In this paper we discuss new techniques for timing-driven placement and adaptive delay computation for hierarchical PLD architectures. Our algorithm follows the natural recursive k-way partitioning-based approach to placement on such devices. Our contributions include a specification of the overall TDC (timing-driven compilation) algorithm, an analysis of heuristics such as a variant of multi-start partitioning, a new method for adaptive delay computation, and a discussion of the structure of critical paths and sub-graphs on modern PLD designs. This algorithm has been implemented in a production quality commercial tool, and we report on the results with and without the implementation of the new techniques. The basic result is a substantial 38.5% average (36.3% median) improvement in register-to-register performance across a range of real designs in modern density ranges, at a cost of approximately 3.65X average (2.88X median) place-and-route CPU time. (These improvements and costs are relative to the same tool prior to the efforts described in this paper.) A partial implementation of the new algorithm shows approximately half the performance gain, with approximately half the compile time cost.
ISBN: 9781581133417 (print)
In mapping the k-means algorithm to FPGA hardware, we examined algorithm-level transforms that dramatically increased the achievable parallelism. We apply the k-means algorithm to multi-spectral and hyper-spectral images, which have tens to hundreds of channels per pixel of data. K-means is an iterative algorithm that assigns to each pixel a label indicating which of K clusters the pixel belongs to. K-means is a common solution to the segmentation of multi-dimensional data. The standard software implementation of k-means uses floating-point arithmetic and Euclidean distances. Floating-point arithmetic and the multiplication-heavy Euclidean distance calculation are fine on a general-purpose processor, but they have large area and speed penalties when implemented on an FPGA. In order to get the best performance of k-means on an FPGA, the algorithm needs to be transformed to eliminate these operations. We examined the effects of using two other distance measures, Manhattan and Max, that do not require multipliers. We also examined the effects of using fixed precision and truncated bit widths in the algorithm. It is important to explore algorithm-level transforms and tradeoffs when mapping an algorithm to reconfigurable hardware. A direct translation of the standard software implementation of k-means would result in a very inefficient use of FPGA hardware resources. Analysis of the algorithm and data is necessary for a more efficient implementation. Our resulting implementation exhibits a speedup of approximately 200 times over a software implementation.
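The multiplier-free distance measures named in this abstract can be illustrated with a minimal Python sketch. This is not the authors' hardware design: the function names, the integer pixel values, and the cluster centers are all hypothetical, chosen only to show that the Manhattan and Max distances need just subtraction, absolute value, addition, and comparison, while the squared Euclidean distance needs one multiplication per channel.

```python
def sq_euclidean(pixel, center):
    # Squared Euclidean distance: one multiplication per channel.
    return sum((a - b) ** 2 for a, b in zip(pixel, center))


def manhattan(pixel, center):
    # Manhattan distance: sum of absolute channel differences, no multipliers.
    return sum(abs(a - b) for a, b in zip(pixel, center))


def max_dist(pixel, center):
    # Max distance: largest absolute channel difference, no multipliers.
    return max(abs(a - b) for a, b in zip(pixel, center))


def assign_labels(pixels, centers, dist):
    # k-means assignment step: label each pixel with its nearest center's index.
    return [
        min(range(len(centers)), key=lambda k: dist(p, centers[k]))
        for p in pixels
    ]


# Hypothetical 3-channel integer pixels and two cluster centers.
pixels = [(10, 12, 200), (11, 10, 198), (240, 5, 7)]
centers = [(10, 10, 200), (250, 0, 0)]

labels_manhattan = assign_labels(pixels, centers, manhattan)
labels_max = assign_labels(pixels, centers, max_dist)
```

On well-separated data such as this toy input, all three measures produce the same labels; they can disagree for pixels near a cluster boundary, which is the kind of tradeoff the abstract says was examined.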
ISBN: 9781581133417 (print)
The routing architecture of an FPGA consists of the length of the wires, the type of switch used to connect wires (buffered, unbuffered, fast or slow) and the topology of the interconnection of the switches and wires. FPGA routing architecture has a major influence on the logic density and speed of FPGA devices. Previous work [1] based on a 0.35 µm CMOS process has suggested an architecture consisting of length-4 wires (where the length of a wire is measured in terms of the number of logic blocks it passes before being switched) in which half of the programmable switches are active buffers and half are pass transistors. In that work, however, the topology of the routing architecture prevented buffered tracks from connecting to pass-transistor tracks. This restriction prevents the creation of interconnection trees for high-fanout nets that have a mixture of buffers and pass transistors. Electrical simulations suggest that connections closer to the leaves of interconnection trees are faster using pass transistors, but it is essential to buffer closer to the source. This latter effect is well known in regular ASIC routing [2]. In this work we propose a new routing architecture that allows liberal switching between buffered and pass-transistor tracks. We explore various versions of the architecture to determine the density-speed trade-off. We show that one version of the new architecture results in FPGAs with 10% faster critical path delay while using the same area as the previous architecture that does not allow such switching. We also show that the new architecture allows a useful area-speed trade-off, and that several versions of the new architecture result in FPGAs with an 8% gain in area-delay product over the previous architecture that does not allow the switching.
ISBN: 9781581133417 (print)
As the complexity of integrated circuits increases, the ability to make post-fabrication changes to fixed ASIC chips will become more and more attractive. This ability can be realized using programmable logic cores. These cores are blocks of programmable logic that can be embedded into a fixed-function ASIC or a custom chip. Such cores differ from stand-alone FPGAs in that they can take on a variety of shapes and sizes. With this in mind, we investigate the detailed routing characteristics of rectangular programmable logic cores. We quantify the effects of having different x and y channel capacities, and show that the optimum ratio between the x and y channel widths for a rectangular core is between 1.2 and 1.5. We also present a new switch block family optimized for rectangular cores. Compared to a simple extension of an existing switch block, our new architecture leads to an 8.7% improvement in density with little effect on speed. Finally, we show that if the channel widths and switch block are chosen carefully, the penalty for using a rectangular core (compared to a square core with the same logic capacity) is small; for a core with an aspect ratio of 2:1, the area penalty is 1.6% and the speed penalty is 1.1%.
ISBN: 9781581133417 (print)
The Streams-C compiler ([5]) synthesizes hardware circuits for reconfigurable FPGA-based computers from parallel C programs. The Streams-C language consists of a small number of libraries and intrinsic functions added to a synthesizable subset of C, and supports a communicating process programming model. The processes may be either software or hardware processes, and the compiler manages communication among the processes transparently to the programmer. For the hardware processes, the compiler generates Register-Transfer-Level (RTL) VHDL, targeting multiple FPGAs with dedicated memories. For the software processes, a multi-threaded software program is generated. The Streams-C language and compiler offer a very high level of expressivity for reconfigurable computing application development, particularly for stream-processing applications. We find this is reflected in productivity, with up to a factor of 10 improvement in the time to produce a program. However, use of the tool in the "real world" is predicated on performance: only if such a compiler can deliver performance comparable to hand-coded performance will it be used in practice. This paper presents an application study of the Streams-C compiler. Four applications have been written in Streams-C and compiled to the AMC Wildforce board containing Xilinx 4036's. Those same applications have been hand-coded in a combination of RTL and structural VHDL. We compare the performance of the generated code with that of the hand-optimized code. Our study shows that the compiler-generated designs are 1.37-4 times the area and 1/2-1 times the clock frequency of the hand designs. We find that the compiler, based on the SUIF infrastructure, can be greatly improved through various standard compiler optimizations that are not currently being exploited. Thus we are currently rewriting a public domain version of Streams-C to better optimize and target the Virtex chip.
ISBN: 9781581133479 (print)
FPGAs have been growing at a rapid rate in the past few years. Their ever-increasing gate densities and performance capabilities are making them very popular in the design of digital systems. In this paper we discuss the state of the art in FPGA physical design. Compared to physical design in traditional ASICs, FPGAs pose a different set of requirements and challenges. Consequently, the algorithms in FPGA physical design have evolved differently from their ASIC counterparts. Apart from allowing FPGA users to implement their designs on FPGAs, FPGA physical design is also used extensively in developing and evaluating new FPGA architectures. Finally, the future of FPGA physical design is discussed along with how it is interacting with the latest FPGA technologies.
A Boolean-based router expresses the routing constraints as a Boolean function which is satisfiable if and only if the layout is routable. Compared to traditional routers, Boolean-based routers offer two unique features: (1) simultaneous embedding of all nets regardless of net ordering, and (2) the ability to demonstrate routing infeasibility by proving the unsatisfiability of the generated routing constraint Boolean function. In this paper, we introduce a new Boolean-based FPGA detailed routing formulation that yields an easy-to-evaluate and more scalable routability Boolean function than the previous methods. The routability constraints are expressed in terms of a set of "route" variables, each of which designates a specific detailed route for a given net. Experimental results clearly show the superiority of this formulation over an earlier formulation that expressed the constraints in terms of "track" variables.
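The idea of "route" variables can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the formulation from the paper: each net is given a hypothetical list of candidate detailed routes, each route is modeled as a set of routing-resource names, and an exhaustive search stands in for the SAT solver a real Boolean-based router would use. A choice of one route per net satisfies the constraints when no two chosen routes share a resource; if no such choice exists, the function returns `None`, which corresponds to the unsatisfiable (unroutable) case.

```python
from itertools import product


def find_routing(candidates):
    """Pick one candidate route per net so that no routing resource is shared.

    candidates: dict mapping a net name to a list of routes, each route being
    a frozenset of resource names (hypothetical data, for illustration only).
    Returns a dict net -> chosen route index, or None if no assignment exists.
    Exhaustive search is used here purely as a stand-in for a SAT solver.
    """
    nets = list(candidates)
    index_ranges = [range(len(candidates[net])) for net in nets]
    for choice in product(*index_ranges):
        used = set()
        conflict = False
        for net, idx in zip(nets, choice):
            route = candidates[net][idx]
            if used & route:
                # Two different nets would occupy the same resource.
                conflict = True
                break
            used |= route
        if not conflict:
            return dict(zip(nets, choice))
    return None


# Net A has two candidate routes, net B only one; both want track "t0".
routable = {
    "A": [frozenset({"t0"}), frozenset({"t1"})],
    "B": [frozenset({"t0"})],
}
# Both nets can only use "t0", so no conflict-free assignment exists.
unroutable = {
    "A": [frozenset({"t0"})],
    "B": [frozenset({"t0"})],
}

routable_result = find_routing(routable)
unroutable_result = find_routing(unroutable)
```

Because every net is assigned at once, the result does not depend on the order in which nets are listed, which mirrors the first of the two features the abstract attributes to Boolean-based routers.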
The FPGA architectural issue of the effect of logic block functionality on FPGA performance and density is investigated. In particular, in the context of lookup-table (LUT), cluster-based, island-style FPGAs, the effect of LUT size and cluster size on the speed and logic density of an FPGA is analyzed. A fully timing-driven experimental flow is used, in which a set of benchmark circuits is synthesized into different cluster-based logic block architectures, which contain groups of LUTs and flip-flops.
ISBN: 9781581131932 (print)
Field-programmable gate arrays (FPGAs) are being used to provide fast Internet Protocol (IP) packet routing and advanced queuing in a highly scalable network switch. A new module, called the Field-programmable Port Extender (FPX), is being built to augment the Washington University Gigabit Switch (WUGS) with reprogrammable logic. FPX modules reside at the edge of the WUGS switching fabric. Physically, the module is inserted between an optical line card and the WUGS gigabit switch backplane. The hardware used for this project allows ports of the switch populated with an FPX to operate at rates up to 2.4 Gigabits/second. The aggregate throughput of the system scales with the number of switch ports. Logic on the FPX module is implemented with two FPGA devices. The first device is used to interface between the switch and the line card, while the second is used to prototype new networking functions and protocols. The logic on the second FPGA can be reprogrammed dynamically via control cells sent over the network.