ISBN: 1595932925 (print)
The performance benefits of a monolithically stacked 3D-FPGA, in which the programming overhead of an FPGA is stacked on top of a standard CMOS layer containing the logic blocks and interconnects, are investigated. A Virtex-II style 2D-FPGA fabric is used as a baseline for quantifying the relative improvements in logic density, delay, and power consumption achieved by such a 3D-FPGA. It is assumed that only the pass-transistor switches and configuration memory cells can be moved to the top layers and that the 3D-FPGA employs the same logic block and programmable interconnect architecture as the baseline 2D-FPGA. Assuming a configuration memory cell with at most 0.7 times the area of an SRAM cell, and pass-transistor switches with the same characteristics as nMOS devices in the CMOS layer, it is shown that a monolithically stacked 3D-FPGA can achieve 3.2 times higher logic density, 1.7 times lower critical path delay, and 1.7 times lower total dynamic power consumption than the baseline 2D-FPGA fabricated in the same 65 nm technology node. Copyright 2006 ACM.
ISBN: 9780897919784 (print)
To implement high-density and high-speed FPGA circuits, designers need tight control over the circuit implementation process. However, current design tools are unsuited for this purpose as they lack fast turnaround times, interactivity, and integration. We present a system for the Xilinx XC6200 FPGA that addresses these issues. It consists of a suite of tightly integrated tools for the XC6200 architecture centered around an architecture-independent tool framework. The system lets the designer easily intervene at various stages of the design process and features design cycle times (from an HDL specification to a complete layout) on the order of seconds.
ISBN: 9781450338561 (print)
Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications. To improve the performance and applicability of FPGA-based SpMV, we propose an approach for exploiting properties of the input matrix to generate optimised custom architectures. The architectures generated by our approach are between 3.8 and 48 times faster than the worst-case architectures for each matrix, showing the benefits of instance-specific design for SpMV.
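For readers unfamiliar with the kernel, the following is a minimal software sketch of SpMV over the compressed sparse row (CSR) format. It is not the architecture described in the abstract: the choice of CSR and the function names `spmv_csr` and `nonzeros_per_row` are assumptions made for illustration, and the per-row nonzero count is only one hypothetical example of a matrix property that an instance-specific design might inspect.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Return y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Multiply-accumulate over the nonzeros of this row only.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def nonzeros_per_row(row_ptr):
    """Count the nonzeros in each row (hypothetical example of a matrix property)."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [7.0, 4.0, 18.0]
print(nonzeros_per_row(row_ptr))                            # [2, 1, 2]
```

The inner loop length varies with the number of nonzeros in each row, which is one reason the best hardware organisation can depend on the particular input matrix.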
ISBN: 9781450370998 (print)
Technology scaling makes metal delay ever more problematic, but routing between Look-Up Tables (LUTs) still passes through a series of transistors. It seems wise to avoid the corresponding delay whenever possible. Direct connections between LUTs, both within and across multiple clusters, can eschew the transistor delays of crossbars, connection blocks, and switch blocks. In this paper we investigate the usefulness of enhancing classical Field-Programmable Gate Array (FPGA) architectures with direct connections between LUTs. We present an efficient algorithm for automatically searching for the most interesting patterns of such direct connections. Despite our methods being fairly conservative and relying on the use of unmodified standard CAD tools, we obtain a 2.77% improvement of the geometric mean critical path delay of a standard benchmark set, with improvement ranging from -0.17% to 7.3% for individual circuits. As modest as these results may seem at first glance, we believe that they position direct connections between LUTs as a promising topic for future research. Extending this work with dedicated CAD algorithms and exploiting the increased possibilities for optimal buffering, diagonal routing, and pipelining could prove direct connections important to the continuation of performance improvement into next-generation FPGAs.
ISBN: 9781450311557 (print)
Dynamically reconfigurable systems increase design density and flexibility by allowing hardware modules to be swapped at run time. Systems that employ checkpointing, periodic or phased execution, preemptive multitasking, and resource defragmentation may also need to be able to save and restore the state of a module that is being reconfigured. Existing tools verify the functionality of a system that is undergoing reconfiguration. These tools can also be employed if state is accessed using application logic. However, when state is accessed via the configuration port, functional verification is hindered because the FPGA fabric, which mediates the transfer of state between the application logic and the configuration port, is not being simulated. We describe how to efficiently simulate those aspects of the fabric that are used in accessing module state. To the best of our knowledge, this work is the first to allow cycle-accurate simulation of a system partially reconfiguring both its logic and state. A case study shows that our method is effective in detecting device-independent design errors.
ISBN: 9780897919784 (print)
This paper presents a vector generation approach for testing interconnects in configurable (SRAM-based) field-programmable gate arrays (FPGAs). The proposed approach detects bridging faults and is based on quiescent current (IDDQ) monitoring. Compared with previous voltage-based methods, IDDQ testing has the advantage of utilizing a small number of programming phases for configuring the FPGA during the test process with negligible observability requirements, even under multiple faults. Algorithms for test generation that exploit the homogeneous nature of the FPGA array are described. An example using the XC4000 is described in detail. For testing the XC4000 series interconnect, a total of 20 phases and 11 vectors are required: 11 phases for S (switch) block testing, and 9 phases for C (connection) block testing.
This work describes the architecture of a new FPGA DSP block supporting both fixed- and floating-point arithmetic. Each DSP block can be configured to provide one single-precision IEEE-754 floating multiplier and one I...
ISBN: 9781450356145 (print)
QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance in software-programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources, which, in combination with distributed memories and the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping of a new QRD algorithm to parallel structures with inter-vector connectivity. Based on a Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the best previously known implementations. Mapped to an Intel Arria 10 device, we achieve 80 µs for a 256×256 single-precision real matrix, for a 417 GFLOP equivalent. This corresponds to a 95% sustained-to-peak ratio for the portion of the device used for this work.
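As background for the abstract above, here is a minimal sketch of the textbook Modified Gram-Schmidt QR decomposition in plain Python. It is the standard column-by-column formulation, not the reorganized loop structure the authors describe, and the names `mgs_qr` and `gflops_equivalent` are hypothetical. The throughput helper assumes the conventional 2n³ operation count for an n×n matrix; the count actually used by the authors is not stated in the abstract.

```python
import math


def mgs_qr(a):
    """Textbook Modified Gram-Schmidt QR of an m x n matrix given as a list of rows.

    Returns (q_cols, r): q_cols holds the n orthonormal columns of Q, and r is
    the n x n upper-triangular factor. Assumes full column rank.
    """
    m, n = len(a), len(a[0])
    v = [[a[i][j] for i in range(m)] for j in range(n)]  # working copies of the columns
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = math.sqrt(sum(t * t for t in v[j]))
        qj = [t / r[j][j] for t in v[j]]
        q_cols.append(qj)
        # Remove the q_j component from every remaining column immediately;
        # this eager update is what distinguishes MGS from classical Gram-Schmidt.
        for k in range(j + 1, n):
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    return q_cols, r


def gflops_equivalent(n, seconds):
    """Throughput in GFLOP/s implied by an assumed 2 * n**3 operation count."""
    return 2 * n ** 3 / seconds / 1e9


a = [[1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
q_cols, r = mgs_qr(a)

print(gflops_equivalent(256, 80e-6))  # about 419
```

Under the assumed operation count, a 256×256 decomposition in 80 µs works out to roughly 419 GFLOP/s, which is close to the 417 figure quoted in the abstract if the 80 µs time is a rounded value.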
In this paper, we present the first exact algorithm to solve the constrained I/O placement problem for FPGAs that support multiple I/O standards. We derive a compact integer linear programming formulation for the constrained I/O placement problem. The size of the integer linear program derived is independent of the number of I/O objects to be placed and hence is scalable to very large design instances. For example, for a Xilinx Virtex-E FPGA, the number of integer variables required is never more than 32 and is much smaller for practical design instances. Extensive experimental results using a non-commercial integer linear program solver show that it takes only seconds to solve the resultant integer linear program in practice.
FPGAs provide a speed advantage in processing for embedded systems, especially when processing is moved close to the sensors. Perhaps the ultimate embedded system is a neural prosthetic, where probes are inserted into the brain and recorded electrical activity is analyzed to determine which neurons have fired. In turn, this information can be used to manipulate an external device such as a robot arm or a computer mouse. To make the detection of these signals possible, some baseline data must be processed to correlate impulses with particular neurons. One method for processing this data uses a statistical clustering algorithm called Expectation Maximization, or EM. In this paper, we examine the EM clustering algorithm, determine the most computationally intensive portion, map it onto a reconfigurable device, and show several areas of performance gain.
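To make the EM clustering algorithm mentioned above concrete, the following is a minimal sketch of one EM iteration for a one-dimensional Gaussian mixture. It is a generic textbook formulation, not the authors' implementation: the one-dimensional model, the toy data, and the names `gaussian_pdf` and `em_step` are assumptions, and the sketch makes no claim about which portion the paper maps onto hardware.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances, var_floor=1e-6):
    """Run one EM iteration and return updated (weights, means, variances).

    Assumes every point has nonzero likelihood under at least one component
    and that no component ends up with zero total responsibility.
    """
    k = len(means)
    # E-step: responsibility of each component for each data point.
    resp = []
    for x in data:
        p = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(p)
        resp.append([pj / total for pj in p])
    # M-step: re-estimate parameters from the responsibility-weighted data.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(row[j] for row in resp)
        mean = sum(row[j] * x for row, x in zip(resp, data)) / nj
        var = sum(row[j] * (x - mean) ** 2 for row, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, var_floor))  # floor keeps the density well defined
    return new_w, new_m, new_v


# Toy data with two well-separated groups (illustrative values only).
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
weights, means, variances = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
for _ in range(10):
    weights, means, variances = em_step(data, weights, means, variances)

print(means)  # means close to 1.0 and 5.0
```

The E-step evaluates every component's density for every data point, so its cost grows with the product of the number of points and the number of clusters.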