This paper presents the emulation of an embedded system with hard real time constraints and response times of about 220 μs. We show that for such fast reactive systems, the software overhead of a Real Time Operating ...
详细信息
This paper presents the emulation of an embedded system with hard real time constraints and response times of about 220 μs. We show that for such fast reactive systems, the software overhead of a Real Time Operating System (RTOS) becomes a limiting factor, consuming up to 77% of the total execution performance. We analyze features of different FPGA architectures in order to solve the system performance bottleneck. We show that moving functionality from software to hardware through exploiting the fine grained on-chip SRAM capability of the Xilinx XC4000 architecture, that feature eliminates the RTOS overhead by only a slight increase of about 28% of the used FPGA CLB resources. These investigations have been conducted using our own emulation environment called SPYDER-CORE-P1.
This paper describes the Vantis VF1 FPGA architecture, an innovative architecture based on 0.25 u (drawn) (0.18 u Leff)/4-metal technology. It was designed from scratch for high performance, routability and ease-of-us...
详细信息
This paper describes the Vantis VF1 FPGA architecture, an innovative architecture based on 0.25 u (drawn) (0.18 u Leff)/4-metal technology. It was designed from scratch for high performance, routability and ease-of-use. It supports system level functions (including wide gating functions, dual-port SRAMs, high speed carry chains, and high speed IO blocks) with a symmetrical structure. Additionally, the architecture of each of the critical elements including: variable-grain logic blocks, variable-length-interconnects, dual-port embedded SRAM blocks, I/O blocks and on-chip PLL functions will be described.
Floorplanning is an important problem in FPGA circuit mapping. As FPGA capacity grows, new innovative approaches will be required for efficiently mapping circuits to FPGAs. In this paper we present a macro based floor...
详细信息
Floorplanning is an important problem in FPGA circuit mapping. As FPGA capacity grows, new innovative approaches will be required for efficiently mapping circuits to FPGAs. In this paper we present a macro based floorplanning methodology suitable for mapping large circuits to large, high density FPGAs. Our method uses clustering techniques to combine macros into clusters, and then uses a tabu search based approach to place clusters while enhancing both circuit routability and performance. Our method is capable of handling both hard (fixed size and shape) macros and sob (fixed size and variable shape) macros. We demonstrate our methodology on several macro based circuit designs and compare the execution speed and quality of results with commercially available CAE tools. Our approach shows a dramatic speedup in execution time without any negative impact on quality.
There is no inherent characteristic forcing fieldprogrammablegate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve applic...
详细信息
There is no inherent characteristic forcing fieldprogrammablegate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve application clock rates close to their processor cousins because (1) resources in the FPGAs are not balanced appropriately for high-speed operation, (2) FPGA CAD does not automatically provide the requisite transforms to support this operation, and (3) interconnect delays can be large and vary almost continuously, complicating high frequency mapping. We introduce a novel reconfigurable computing array, the High-Speed, Hierarchical Synchronous Reconfigurable Array (HSRA), and its supporting tools. This package demonstrates that computing arrays can achieve efficient, high-speed operation. We have designed and implemented a prototype component in a 0.4 μm logic design on a DRAM process which will support 250 MHz operation for CAD mapped designs.
Procedural textures can be effectively used to enhance the visual realism of computer rendered images. Procedural textures can provide higher realism for 3-D objects than traditional hardware texture mapping methods w...
详细信息
Procedural textures can be effectively used to enhance the visual realism of computer rendered images. Procedural textures can provide higher realism for 3-D objects than traditional hardware texture mapping methods which use memory to store 2-D texture images. This paper proposes a new method of hardware texture mapping in which texture images are synthesized using FPGAs. This method is very efficient for texture mapping procedural textures of more than two input variables. By synthesizing these textures on the fly, the large amount of memory required to store their multidimensional texture images is eliminated, making texture mapping of 3-D textures and parameterized textures feasible in hardware. This paper shows that using FPGAs, procedural textures can be synthesized at high speed, with a small hardware cost. Data on the performance and the hardware cost of synthesizing procedural textures in FPGAs are presented. This paper also presents, the FPGA implementations of two Perlin noise based 3-D procedural textures.
In this paper, we investigate the speed and area-efficiency of FPGAs employing `logic clusters' containing multiple LUTs and registers as their logic block. We introduce a new, timing-driven tool (T-VPack) to `pac...
详细信息
In this paper, we investigate the speed and area-efficiency of FPGAs employing `logic clusters' containing multiple LUTs and registers as their logic block. We introduce a new, timing-driven tool (T-VPack) to `pack' LUTs and registers into these logic clusters, and we show that this algorithm is superior to an existing packing algorithm. Then, using a realistic routing architecture and sophisticated delay and area models, we empirically evaluate FPGAs composed of clusters ranging in size from one to twenty LUTs, and show that clusters of size seven through ten provide the best area-delay trade-off. Compared to circuits implemented in an FPGA composed of size one clusters, circuits implemented in an FPGA with size seven clusters have 30% less delay (a 43% increase in speed) and require 8% less area, and circuits implemented in an FPGA with size ten clusters have 34% less delay (a 52% increase in speed), and require no additional area.
As custom computing machines evolve, it is clear that a major bottleneck is the slow interconnection architecture between the logic and memory. This paper describes the architecture of a custom computing machine that ...
详细信息
As custom computing machines evolve, it is clear that a major bottleneck is the slow interconnection architecture between the logic and memory. This paper describes the architecture of a custom computing machine that overcomes the interconnection bottle-neck by closely integrating a fixed-logic processor, a reconfigurable logic array, and memory into a single chip, called OneChip-98. The OneChip-98 system has a seamless programming model that enables the programmer to easily specify instructions without additional complex instruction decoding hardware. As well, there is a simple scheme for mapping instructions to the corresponding programming bits. To allow the processor and the reconfigurable array to execute concurrently, the programming model utilizes a novel memory-consistency scheme implemented in the hardware. To evaluate the feasibility of the OneChip-98 architecture, a 32-bit MIPS-like processor and several performance enhancement applications were mapped to the Transmogrifier-2 fieldprogrammable system. For two typical applications, the 2-dimensional discrete cosine transform and the 64-tap FIR filter, we were capable of achieving a performance speedup of over 30 times that of a stand-alone state-of-the-art processor.
In this work we investigate the routing architecture of FPGAs, focusing primarily on determining the best distribution of routing segment lengths and the best mix of pass transistor and tri-state buffer routing switch...
详细信息
In this work we investigate the routing architecture of FPGAs, focusing primarily on determining the best distribution of routing segment lengths and the best mix of pass transistor and tri-state buffer routing switches. While most commercial FPGAs contain many length 1 wires (wires that span only one logic block) we find that wires this short lead to FPGAs that are inferior in terms of both delay and routing area. Our results show instead that it is best for FPGA routing segments to have lengths of 4 to 8 logic blocks. We also show that 50% to 80% of the routing switches in an FPGA should be pass transistors, with the remainder being tri-state buffers. Architectures that employ the best segmentation distributions and the best mixes of pass transistor and tri-state buffer switches found in this paper are not only 11% to 18% faster than a routing architecture very similar to that of the Xilinx XC4000X but also considerably simpler. These results are obtained using an architecture investigation infrastructure that contains a fully timing-driven router and detailed area and delay models.
Cut enumeration is a common approach used in a number of FPGA synthesis and mapping algorithms for consideration of various possible LUT implementations at each node in a circuit. Such an approach is very general and ...
详细信息
Cut enumeration is a common approach used in a number of FPGA synthesis and mapping algorithms for consideration of various possible LUT implementations at each node in a circuit. Such an approach is very general and flexible, but often suffers high computational complexity and poor scalability. In this paper, we develop several efficient and effective techniques on cut enumeration, ranking and pruning. These techniques lead to much better runtime and scalability of the cut-enumeration based algorithms;they can also be used to compute a tight lower-bound on the size of an area-minimum mapping solution. For area-oriented FPGA mapping, experimental results show that the new techniques lead to over 160 X speed-up over the original optimal duplication-free mapping algorithm, achieve mapping solutions with 5-21% smaller area for heterogeneous FPGAs compared to those by Chortle-crf [6], MIS-pga-new [9], and TOS-TUM [4], yet with over 100X speed-up over MIS-pga-new [9] and TOS-TUM [4].
Multi-FPGA systems are used as custom computing machines to solve compute intensive problems and also in the verification and prototyping of large circuits. In this paper, we address the problem of routing multi-termi...
详细信息
Multi-FPGA systems are used as custom computing machines to solve compute intensive problems and also in the verification and prototyping of large circuits. In this paper, we address the problem of routing multi-terminal nets in a multi-FPGA system that uses partial crossbars as interconnect structures. First, we model the multi-terminal routing problem as a partitioned bin packing problem and formulate it as an integer linear programming problem where the number of variables is exponential. A fast heuristic is applied to compute an upper bound on the routing solution. Then, a column generation technique is used to solve the linear relaxation of the initial master problem in order to obtain a lower bound on the routing solution. This is followed by an iterative branch-and-price procedure that attempts to find a routing solution somewhere between the two established bounds. In this regard, the proposed algorithm guarantees an exact routing solution by searching a branch-and-price tree. Due to the tightness of the bounds, the branch-and-price tree is small resulting in shorter execution times. Experimental results are provided for different netlists and board configurations in order to demonstrate the algorithm's performance. The obtained results show that the algorithm finds an exact routing solution in a very short time.
暂无评论