The placement phase of the compile process and an ultrafast placement algorithm targeted to field-programmable gate arrays (FPGAs) are presented. The algorithm is based on a combination of multiple-level, bottom-up clustering and hierarchical simulated annealing. It provides superior area results over a known high-quality placement tool on a set of large benchmark circuits when both are restricted to a short run time. In addition, operating in its fastest mode, this tool can provide an accurate estimate of the wirelength achievable with good-quality placement. This estimate can be used in conjunction with a routing predictor to determine the routability of a given circuit on a given FPGA device.
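
As a rough illustration of the annealing half of such a placer, the sketch below runs flat simulated annealing over a small grid and scores placements by half-perimeter wirelength (HPWL). It is not the algorithm from the abstract: the multi-level clustering and hierarchy are omitted, and the cost function, swap-only moves, cooling schedule, and all names (`hpwl`, `anneal`, `t_start`, `alpha`, and so on) are assumptions chosen for brevity.

```python
import math
import random


def hpwl(placement, nets):
    """Half-perimeter wirelength: sum of bounding-box half-perimeters of all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(blocks, nets, width, height, seed=0,
           t_start=5.0, t_end=0.05, alpha=0.9, moves_per_t=200):
    """Flat simulated annealing (hypothetical parameters); returns best placement and cost."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(sites)
    # Random initial placement: one distinct site per block.
    placement = {b: sites[i] for i, b in enumerate(blocks)}
    cost = hpwl(placement, nets)
    best, best_cost = dict(placement), cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            # Propose swapping the sites of two blocks.
            a, b = rng.sample(blocks, 2)
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = hpwl(placement, nets)
            delta = new_cost - cost
            # Accept improvements always, and worsening moves with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:
                # Rejected: undo the swap.
                placement[a], placement[b] = placement[b], placement[a]
        t *= alpha
    return best, best_cost


blocks = list(range(8))
nets = [(i, i + 1) for i in range(7)]  # a simple chain of two-pin nets
best, best_cost = anneal(blocks, nets, width=4, height=4)
print(best_cost)
```

The final HPWL plays the role of the wirelength estimate mentioned in the abstract; a shorter schedule (larger `t_end`, fewer `moves_per_t`) trades estimate quality for run time.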
This paper describes the hardware implementation of the Generalized Profile Search algorithm using online arithmetic and redundant data representation. This is part of the GenStorm project, aimed at providing a dedicated computer for biological sequence processing based on reconfigurable hardware using FPGAs. The serial evaluation of the result, made possible by a redundant data representation, leads to a significant increase in data throughput in comparison with standard non-redundant data coding.
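
To make the term "redundant data representation" concrete, the following sketch uses radix-2 signed digits (each digit is -1, 0, or 1), one common redundant encoding. The abstract does not say which encoding GenStorm uses, so the digit set and the function names here are assumptions. The point is only that one value has several encodings and can be consumed most-significant digit first, one digit per step.

```python
def sd_value(digits):
    """Evaluate radix-2 signed digits (-1, 0 or 1), most significant digit first."""
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("digit out of range")
        value = 2 * value + d
    return value


def sd_prefix_values(digits):
    """Partial values after each digit arrives, as in digit-serial evaluation."""
    out, value = [], 0
    for d in digits:
        value = 2 * value + d
        out.append(value)
    return out


# Two different digit strings encode the same value (4 - 1 and 2 + 1).
print(sd_value([1, 0, -1]), sd_value([0, 1, 1]))
print(sd_prefix_values([1, 0, -1]))
```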
An FPGA configuration method named configuration cloning is developed to exploit spatial and temporal regularity and locality in algorithms and architectures by copying and operating on the configuration bit-stream already resident in an FPGA. The method improves speed and power over off-chip partial reconfiguration techniques without requiring additional interconnects or control hardware. Cloning requires only a small amount of hardware overhead. Digital signal processing applications are discussed to demonstrate order-of-magnitude reductions in configuration time and power.
A striped FPGA, or pipeline-reconfigurable FPGA, provides hardware virtualization by supporting fast run-time reconfiguration. In this paper we show that the performance of a striped FPGA depends on the reconfiguration pattern, that is, the run-time scheduling of configurations through the FPGA. We study two main configuration scheduling approaches: Configuration Caching and Data Caching. We present a quantitative analysis of these scheduling techniques that computes their total execution cycles, taking into account the overhead caused by I/O with the external memory. Based on the analysis, we can determine which scheduling technique works better for a given application and given hardware parameters.
FPGA users often view the ability of an FPGA to route designs with high LUT (gate) utilization as a feature, leading them to demand high gate utilization from vendors. We present initial evidence from a hierarchical array design showing that high LUT utilization is not directly correlated with efficient silicon usage. Rather, since interconnect resources consume most of the area on these devices (often 80-90%), we can achieve more area-efficient designs by allowing some LUTs to go unused, which lets us use the dominant resource, interconnect, more efficiently. This extends the "sea-of-gates" philosophy, familiar from mask-programmable gate arrays, to FPGAs. Also introduced in this work is an algorithm for "depopulating" the gates in a hierarchical network to match the limited wiring resources.
One of the major overheads in reconfigurable computing is the time it takes to reconfigure the devices in the system. The configuration compression algorithm presented in our previous paper [Hauck98c] is one efficient technique for reducing this overhead. In this paper, we develop an algorithm for finding Don't Care bits in configurations to improve the compatibility of the configuration data. With the help of the Don't Cares, higher configuration compression ratios can be achieved by using our modified configuration compression algorithm. The modified algorithm compresses configurations by a factor of 7, where our original algorithm achieved only a factor of 4.
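
The abstract does not spell out either compression algorithm, so the sketch below is only a generic illustration of why Don't Care bits help: configuration words that differ only in Don't Care positions become compatible and can share one stored pattern. It uses a simple run-length scheme, which is an assumption and not the method of [Hauck98c]; the word format (strings over `0`, `1`, `X`) and all function names are hypothetical.

```python
def compatible(a, b):
    """Two words are compatible if every position matches or one side is a Don't Care."""
    return len(a) == len(b) and all(x == y or 'X' in (x, y) for x, y in zip(a, b))


def merge(a, b):
    """Combine two compatible words, keeping every specified bit from either one."""
    return ''.join(y if x == 'X' else x for x, y in zip(a, b))


def run_length_with_dont_cares(words):
    """Greedy run-length coding: extend the current run while words stay compatible."""
    runs = []  # each entry is [pattern, count]
    for w in words:
        if runs and compatible(runs[-1][0], w):
            runs[-1][0] = merge(runs[-1][0], w)
            runs[-1][1] += 1
        else:
            runs.append([w, 1])
    # Any Don't Care still unresolved can take an arbitrary value; 0 is used here.
    return [(pattern.replace('X', '0'), count) for pattern, count in runs]


words = ["10X1", "1XX1", "1001", "0X10", "0110"]
print(run_length_with_dont_cares(words))  # [('1001', 3), ('0110', 2)]
```

Five words collapse to two stored patterns because the Don't Care positions are free to take whatever value their neighbours need.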
A reconfigurable architecture optimized for media processing, based on 4-bit arithmetic logic units (ALUs) and interconnect, is described. Together, these allow the area devoted to configuration bits and routing switches to be about 50% of the area of the basic CHESS array, leaving the rest available for user-visible functional units. The flexibility of CHESS in application mapping is largely due to the ability to feed ALUs with instruction streams generated within the array, generous provision of embedded block random access memory, and the ability to trade routing switches for small memories.
A new reprogrammable FPGA architecture is described which is specifically designed to be of very low cost. It covers a range of 35 K to a million usable gates. In addition, it delivers high performance and is synthesis-efficient. This architecture is loosely based on an earlier reprogrammable Actel architecture named ES. By changing the structure of the interconnect and by making other improvements, we achieved an average cost reduction by a factor of three per usable gate. The first member of the family based on this architecture is fabricated in a 2.5 V standard 0.25 μm CMOS technology with a gate count of up to 130 K, and also includes 36 K bits of two-port RAM. The gate count of this part is verified in a fully automatic design flow starting from a high-level description followed by synthesis, technology mapping, place and route, and timing extraction.
The Embedded System Block (ESB) of the APEX20K programmable logic device family from Altera Corporation includes the capability of implementing product-term macrocells in addition to flexibly configurable ROM and dual-port RAM. In product-term mode, each ESB has 16 macrocells built out of 32 product terms with 32 literal inputs. The ability to reconfigure memory blocks in this way represents a new and innovative use of resources in a programmable logic device, requiring creative solutions in both the hardware and software domains. The architecture and features of this Embedded System Block are described.
This paper presents the emulation of an embedded system with hard real-time constraints and response times of about 220 μs. We show that for such fast reactive systems, the software overhead of a Real Time Operating System (RTOS) becomes a limiting factor, consuming up to 77% of the total execution performance. We analyze features of different FPGA architectures in order to resolve this system performance bottleneck. We show that moving functionality from software to hardware, by exploiting the fine-grained on-chip SRAM capability of the Xilinx XC4000 architecture, eliminates the RTOS overhead at the cost of only a slight increase, about 28%, in the FPGA CLB resources used. These investigations have been conducted using our own emulation environment, called SPYDER-CORE-P1.