ISBN: (print) 9781605584102
In this paper we present an implementation of a Cholesky decomposition core with IEEE 754 single-precision arithmetic. The datapaths are generated using fused datapath synthesis, created with an experimental floating-point compiler tool capable of fitting hundreds of floating-point operators into a single device. We present a scalable architecture for both real and complex matrices, and report results for real matrices of up to 128×128. The concepts of fused datapath synthesis for FPGA floating-point designs are reviewed, and the application to the Cholesky algorithm is detailed. Experimental results show that the accuracy of this method is superior to that expected from a traditional design flow based on IEEE 754 cores. Copyright 2009 ACM.
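For readers unfamiliar with the algorithm itself, the decomposition can be sketched in a few lines of software. This is only a plain reference version in Python (double precision, not the single-precision fused datapath described above); the function name `cholesky` and the test matrix are illustrative assumptions, not taken from the paper.

```python
import math

def cholesky(a):
    """Return lower-triangular L such that L * L^T equals the
    symmetric positive-definite matrix a (given as a list of rows)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the squares already placed in row j.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            dot = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - dot) / L[j][j]
    return L

# Hypothetical 3x3 example with an exact integer factor.
A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
```

The inner sums are long chains of multiply-add operations followed by one square root or divide per entry, which is where rounding error accumulates in a floating-point implementation.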
Clock network power in field-programmable gate arrays (FPGAs) is considered and two complementary approaches for power reduction in the Xilinx Virtex-5 FPGA are presented. The approaches are unique in that they leverage specific architectural aspects of Virtex-5 to achieve reductions in dynamic power consumed by the clock network. The first approach comprises a placement-based technique to reduce interconnect resource usage on the clock network, thereby reducing capacitance and power (up to 12%). The second approach borrows the "clock gating" notion from the ASIC domain and applies it to FPGAs. Clock enable signals on flip-flops are selectively migrated to use the dedicated clock enable available on the FPGA's built-in clock network, leading to reduced toggling on the clock interconnect and lower power (up to 28%). Power reductions are achieved without any performance penalty, on average. Copyright 2009 ACM.
Via-programmable gate arrays (VPGAs) offer a middle ground between application-specific integrated circuits and field-programmable gate arrays in terms of flexibility, manufacturing cost, speed, power and area. In this paper, we present a VPGA logic cell, the complementary universal logic gate (CULG), which can be used to implement both sequential and combinatorial elements. Its performance is compared with a number of other designs, including transmission gate, differential cascode voltage switch with pass gate, and standard cell. The CULG is found to have comparable delay product and process variation sensitivity to the other designs while offering the lowest power consumption. Copyright 2009 ACM.
Performance of field-programmable gate arrays (FPGAs) used for floating-point applications is poor due to the complexity of floating-point arithmetic. Implementing floating-point units on FPGAs consumes a large amount of resources. This makes FPGAs less attractive for use in floating-point intensive applications. Therefore, there is a need for embedded floating-point units (FPUs) in FPGAs. However, if unutilized, embedded FPUs waste space on the FPGA die. To overcome this issue, we propose a flexible multi-mode embedded FPU for FPGAs that can be configured to perform a wide range of operations. The floating-point adder and multiplier in our embedded FPU can each be configured to perform one double-precision operation or two single-precision operations in parallel. To increase flexibility further, access to the large integer multiplier, adder and shifters in the FPU is provided. Benchmark circuits were implemented on both a standard Xilinx Virtex-II FPGA and on our FPGA with embedded FPU blocks. The results using our embedded FPUs showed a mean area improvement of 5.2 times and a mean delay improvement of 5.8 times for the double-precision benchmarks, and a mean area improvement of 4.4 times and a mean delay improvement of 4.2 times for the single-precision benchmarks. Copyright 2009 ACM.
FPGA user clocks are slow enough that only a fraction of the interconnect's bandwidth is actually used. There may be an opportunity to use throughput-oriented interconnect to decrease routing and wire area using on-chip serial signaling, especially for datapath designs which operate on words instead of bits. To do so, these links must operate reliably at very high bit rates. We compare wave pipelining and surfing source-synchronous schemes in the presence of power supply and crosstalk noise. In particular, noise is a critical modeling challenge; better models are needed for FPGA power grids. Our results show that wave pipelining can operate at rates as high as 5 Gbps for short links, but it is sensitive to noise in longer links and must run much slower to be reliable. In contrast, surfing achieves a stable operating bit rate of 3 Gbps and is relatively insensitive to noise. Copyright 2009 ACM.
This paper presents a new architecture for time-to-digital conversion enabling a time resolution of 17 ps over a range of 50 ns with a conversion rate of 20 MS/s. The proposed architecture, implemented in a 65 nm FPGA system, consists of a pipelined interpolating time-to-digital converter (TDC). The TDC comprises a coarse time discriminator and a fine delay line, capable of sustained operation at a clock of 300 MHz. A Turbo version of the circuit implements a pipelined interpolating TDC with suppressed dead time to reach a conversion rate of 300 MS/s at the expense of a systematic asymmetry that requires fast error correction. The TDCs proposed in this paper can be compensated for process, voltage, and temperature (PVT) variations using conventional charge pump based feedback or a digital technique. Results demonstrate the suitability of the approach for a variety of applications involving precision ultra-fast time discrimination, such as optical sensing, time-of-flight cameras, high throughput comlinks, RADARs, etc. Copyright 2009 ACM.
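The figures quoted above can be related with some simple arithmetic. The sketch below is an illustration only: it assumes a generic interpolating scheme in which a coarse counter counts whole clock periods and the fine delay line resolves the remainder in 17 ps steps, added on top of the coarse count. The function names and the choice to add (rather than subtract) the fine code are assumptions, not details taken from the paper.

```python
import math

CLOCK_MHZ = 300      # coarse clock, from the abstract
FINE_LSB_PS = 17     # fine time resolution, from the abstract
RANGE_NS = 50        # measurement range, from the abstract

CLOCK_PERIOD_PS = 1_000_000 / CLOCK_MHZ   # about 3333.3 ps

def taps_per_period():
    """Fine steps needed to span one coarse clock period."""
    return math.ceil(CLOCK_PERIOD_PS / FINE_LSB_PS)

def coarse_periods_for_range():
    """Whole clock periods needed to cover the measurement range."""
    return math.ceil(RANGE_NS * CLOCK_MHZ / 1000)

def timestamp_ps(coarse_count, fine_code):
    """Combine a coarse count and a fine code into one time value
    (assumed convention: fine code measures time after the clock edge)."""
    return coarse_count * CLOCK_PERIOD_PS + fine_code * FINE_LSB_PS
```

Under these assumptions, one 300 MHz period needs about 197 fine steps of 17 ps, and the 50 ns range corresponds to 15 coarse periods.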
Carbon nanotubes (CNTs), with their unique electronic properties, are promising materials for building nanoscale circuits. In this paper, we present a new CNT-based FPGA architecture known as FPCNA. We define novel CNT and nanoswitch based components and characterize these components considering nanospecific process variations, including the variation caused by the random mixture of metallic and semiconducting CNTs. To evaluate the architecture, we develop a variation-aware physical design flow which can handle both Gaussian and non-Gaussian random variables using variation-aware placement and routing. When FPCNA is evaluated with this CAD flow, we see a 2.67× performance gain over a baseline CMOS FPGA at the same technology node (at a 95% performance yield). In addition, FPCNA offers a 4.5× footprint reduction compared to the baseline FPGA. These results demonstrate the potential of using CNTs and nanoswitches to build high-performance FPGA circuits. Copyright 2009 ACM.
Packet classification is an important operation for applications such as routers, firewalls or intrusion detection systems. Many algorithms and hardware architectures for packet classification have been created, but none of them can compete with the speed of TCAMs in the worst case. We propose a new hardware-based algorithm for packet classification. The solution is based on problem decomposition and is aimed at the highest network speeds. A unique property of the algorithm is its constant time complexity in terms of external memory accesses: the algorithm performs exactly two external memory accesses to classify a packet. Using an FPGA and one commodity SRAM chip, a throughput of 150 million packets per second can be achieved. This corresponds to a throughput of 100 Gbps for the shortest packets. Further performance scaling is possible with more or faster SRAM chips. Copyright 2009 ACM.
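The relation between 150 million packets per second and 100 Gbps can be checked with a short calculation. The sketch below assumes the shortest packets are minimum-size Ethernet frames (64 bytes) plus the usual 8-byte preamble and 12-byte inter-frame gap; the abstract does not state this, so the constants and function names are assumptions for illustration.

```python
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including FCS (assumed)
PREAMBLE_SFD_BYTES = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames

def max_packet_rate(link_bps, frame_bytes=MIN_FRAME_BYTES):
    """Packets per second a link can carry at the given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits

def sram_accesses_per_second(packets_per_second, accesses_per_packet=2):
    """External memory access rate implied by a fixed access count per packet."""
    return packets_per_second * accesses_per_packet

line_rate_pps = max_packet_rate(100e9)          # roughly 148.8 million
sram_rate = sram_accesses_per_second(150e6)     # 300 million accesses per second
```

Under these assumptions a 100 Gbps link delivers at most about 148.8 million minimum-size packets per second, so a classifier sustaining 150 million packets per second keeps up with line rate, and the two accesses per packet translate into 300 million SRAM accesses per second.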
In recent years, the maximum logic capacity of each successive FPGA family has been increasing by more than 50%, which motivates scalable solutions. Meanwhile, academic research in logic synthesis has been fruitful, but these advances have been demonstrated on academic architectures and benchmark designs which are not representative of modern industrial FPGAs. This paper presents a framework (SmartOpt) for mapping complex FPGA architectures to a simple netlist model, which can be supported by academic tools. SmartOpt was applied to leverage the algorithms implemented in the ABC package and to study their relative contributions. This work is integrated into the Xilinx ISE 11.1 software flow for FPGAs and shows significant improvements in both the LUT count and performance of large industrial circuits described in HDL. The Xilinx Synthesis Technology (XST) reference flow was compared experimentally against the same flow augmented by SmartOpt. When applied to a set of 20 large industrial Virtex-5 benchmarks ranging from 17K to 69K 6-LUTs, the augmented flow produced 8.3% fewer LUTs and led to 2.1% higher operating frequency while keeping runtimes reasonable. With dual-LUT-merging, the LUT count is reduced by 22.7%, while increasing the operating frequency only by 0.7%. Copyright 2009 ACM.
Aggressive scaling increases the number of devices we can integrate per square millimeter but makes it increasingly difficult to guarantee that each device fabricated has the intended operational characteristics. Without careful mitigation, component yield rates will fall, potentially negating the economic benefits of scaling. The fine-grained reconfigurability inherent in FPGAs is a powerful tool that can allow us to drop the stringent requirement that every device be fabricated perfectly in order for a component to be useful. To exploit inherent FPGA reconfigurability while avoiding full CAD mapping, we propose lightweight techniques compatible with the current single bitstream model that can avoid defective devices, reducing yield loss at high defect rates. In particular, by embedding testing operations and alternative path configurations into the bitstream, each FPGA can avoid defects by making only simple, greedy decisions at bitstream load time. With 20% additional tracks above the minimum routable channel width, routes can tolerate 0.01% switch defect rates, raising yield from essentially 0% to near 100%. Copyright 2009 ACM.
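The load-time decision described above can be pictured with a small sketch. It assumes a hypothetical data model in which the bitstream carries, for each net, an ordered list of precomputed alternative paths (each a list of switch identifiers) that do not conflict with other nets, and in which a test callback reports whether a switch is defective. The names `choose_paths` and `is_defective` are illustrative and do not come from the paper.

```python
def choose_paths(alternatives, is_defective):
    """Greedily pick, for each net, the first alternative path whose
    switches all pass the defect test.

    alternatives: dict mapping net name -> list of candidate paths,
                  each path being a list of switch identifiers.
    is_defective: callable taking a switch identifier, returning bool.
    """
    chosen = {}
    for net, paths in alternatives.items():
        for path in paths:
            # Accept the first candidate with no defective switch.
            if not any(is_defective(sw) for sw in path):
                chosen[net] = path
                break
        else:
            # No embedded alternative survives on this particular chip.
            raise RuntimeError(f"no defect-free path for {net}")
    return chosen

# Hypothetical example: two nets, two alternatives each, two bad switches.
alternatives = {
    "net_a": [["s1", "s2"], ["s3", "s4"]],
    "net_b": [["s5"], ["s6"]],
}
defects = {"s2", "s5"}
routes = choose_paths(alternatives, lambda sw: sw in defects)
```

Each decision looks only at one net's own candidates, so no placement or routing needs to be recomputed on the device; the cost is the extra routing resources reserved for the alternatives.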