ISBN (print): 9781605584102
Via-programmable gate arrays (VPGAs) offer a middle ground between application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) in terms of flexibility, manufacturing cost, speed, power and area. In this paper, we present a VPGA logic cell, the complementary universal logic gate (CULG), which can be used to implement both sequential and combinatorial elements. Its performance is compared with a number of other designs, including transmission gate, differential cascode voltage switch with pass gate, and standard cell designs. The CULG is found to have a comparable delay product and process variation sensitivity to the other designs while offering the lowest power consumption. Copyright 2009 ACM.
In this paper we present an implementation of a Cholesky decomposition core with IEEE 754 single-precision arithmetic. The datapaths are generated using fused datapath synthesis, implemented with an experimental floating-point compiler tool capable of fitting hundreds of floating-point operators into a single device. We present a scalable architecture for both real and complex matrices, and report results for real matrices of up to 128×128. The concepts of fused datapath synthesis for FPGA floating-point designs are reviewed, and their application to the Cholesky algorithm is detailed. Experimental results show that the accuracy of this method is superior to that expected from a traditional IEEE 754 core-based design flow. Copyright 2009 ACM.
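To make the computation the core accelerates concrete, here is a minimal software sketch of the textbook Cholesky factorization A = L·Lᵀ for a real symmetric positive-definite matrix. It is a plain reference loop, not the fused-datapath architecture described above; it covers only the real case, and it uses Python's double-precision floats rather than IEEE 754 single precision. The function name `cholesky` is our own.

```python
import math

def cholesky(a):
    """Return lower-triangular L with a == L * L^T (Cholesky-Banachiewicz order).

    `a` is a real symmetric positive-definite matrix given as a list of lists.
    Python floats are IEEE 754 double precision, not single precision.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Dot product of the already-computed parts of rows i and j.
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

# Classic worked example: the factor has small integer entries.
example = cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]])
```

The row-by-row order makes each entry depend only on entries to its left and above, which is the same data dependence a hardware pipeline has to respect; the square root and the division on the diagonal are the costly floating-point operators.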
Clock network power in field-programmable gate arrays (FPGAs) is considered, and two complementary approaches for power reduction in the Xilinx® Virtex™-5 FPGA are presented. The approaches are unique in that they leverage specific architectural aspects of Virtex-5 to achieve reductions in the dynamic power consumed by the clock network. The first approach comprises a placement-based technique to reduce interconnect resource usage on the clock network, thereby reducing capacitance and power (by up to 12%). The second approach borrows the "clock gating" notion from the ASIC domain and applies it to FPGAs. Clock enable signals on flip-flops are selectively migrated to use the dedicated clock enable available on the FPGA's built-in clock buffers, leading to reduced toggling on the clock interconnect and lower power (by up to 28%). On average, these power reductions are achieved without any performance penalty. Copyright 2009 ACM.
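Both techniques can be read against the standard first-order model of CMOS dynamic power. This is a textbook relation, not one stated in the abstract: α is the switching activity, C the switched capacitance, V_DD the supply voltage and f_clk the clock frequency. The placement-based technique targets C, while clock gating targets α.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f_{\mathrm{clk}}
\qquad
\frac{P_{\mathrm{dyn}}'}{P_{\mathrm{dyn}}} = \frac{\alpha'}{\alpha} \cdot \frac{C'}{C}
\quad \text{(with } V_{DD} \text{ and } f_{\mathrm{clk}} \text{ unchanged)}
```

Because the two factors enter as a product, the model gives one way to see why the approaches are complementary: each acts on a different term.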
We present in this paper the first reported FPGA implementation of the Position-Specific Iterated BLAST (PSI-BLAST) algorithm, a heuristic biological sequence alignment algorithm widely used in bioinformatics and computational biology to detect weak homologs. The architecture of our FPGA implementation is parameterized in terms of sequence lengths, scoring matrix, gap penalties, and cut-off and threshold values. It is composed of various blocks, each of which performs one step of the algorithm in parallel. This results in high-performance implementations that easily outperform equivalent software implementations by one order of magnitude or more. Furthermore, the core was captured in an FPGA-platform-independent language, namely Handel-C, and no device-specific resource inference or placement constraints were applied to it. This makes our core portable across different FPGA families and architectures. Copyright 2009 ACM.
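The iterative part of PSI-BLAST rests on a position-specific scoring matrix (PSSM) built from the alignments found in the previous round, which is then used to search again. The sketch below shows that idea in its simplest form and is not the paper's hardware design: log2-odds scores against a uniform background with a flat pseudocount. Real PSI-BLAST also weights sequences, handles gaps and derives pseudocounts from a substitution matrix. The names `build_pssm` and `score` are hypothetical, and the DNA alphabet is used only to keep the example small (PSI-BLAST is normally run on proteins).

```python
import math

def build_pssm(aligned, alphabet, pseudocount=1.0):
    """Toy PSSM from gap-free aligned sequences: one {residue: score} dict per column.

    Scores are log2 odds against a uniform background with a flat pseudocount.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to equal length")
    background = 1.0 / len(alphabet)
    pssm = []
    for pos in range(length):
        counts = {r: pseudocount for r in alphabet}
        for seq in aligned:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pssm.append({r: math.log2((counts[r] / total) / background) for r in alphabet})
    return pssm

def score(pssm, seq):
    """Ungapped score of `seq` against the profile, one residue per column."""
    return sum(column[r] for column, r in zip(pssm, seq))

profile = build_pssm(["ACGT", "ACGA", "ACGT"], "ACGT")
```

Conserved columns get strongly positive scores for their consensus residue and negative scores for everything else, which is what lets a later iteration pick up weak homologs that a fixed scoring matrix would miss.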
The performance of field-programmable gate arrays (FPGAs) in floating-point applications is poor due to the complexity of floating-point arithmetic. Implementing floating-point units on FPGAs consumes a large amount of resources, which makes FPGAs less attractive for floating-point-intensive applications. There is therefore a need for embedded floating-point units (FPUs) in FPGAs. However, if unutilized, embedded FPUs waste space on the FPGA die. To overcome this issue, we propose a flexible multi-mode embedded FPU for FPGAs that can be configured to perform a wide range of operations. The floating-point adder and multiplier in our embedded FPU can each be configured to perform one double-precision operation or two single-precision operations in parallel. To increase flexibility further, access to the large integer multiplier, adder and shifters in the FPU is provided. Benchmark circuits were implemented both on a standard Xilinx Virtex-II FPGA and on our FPGA with embedded FPU blocks. Using our embedded FPUs, the results showed a mean area improvement of 5.2 times and a mean delay improvement of 5.8 times for the double-precision benchmarks, and a mean area improvement of 4.4 times and a mean delay improvement of 4.2 times for the single-precision benchmarks. Copyright 2009 ACM.
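The abstract does not describe how operands are laid out in the two modes. As a purely illustrative behavioural model (our assumption, not the paper's design), the sketch treats each 64-bit operand word as either one IEEE 754 binary64 value or two binary32 values in its low and high halves, and adds them accordingly. The function name `fpu_add` and the mode strings are hypothetical.

```python
import struct

def fpu_add(x_bits, y_bits, mode):
    """Add two 64-bit operand words in one of two hypothetical modes.

    mode == "double": each word holds one IEEE 754 binary64 value.
    mode == "dual_single": each word holds two binary32 values (lane 0 in the
    low 32 bits, lane 1 in the high 32 bits), added independently.
    """
    if mode == "double":
        (x,) = struct.unpack("<d", struct.pack("<Q", x_bits))
        (y,) = struct.unpack("<d", struct.pack("<Q", y_bits))
        return struct.unpack("<Q", struct.pack("<d", x + y))[0]
    if mode == "dual_single":
        x0, x1 = struct.unpack("<2f", struct.pack("<Q", x_bits))
        y0, y1 = struct.unpack("<2f", struct.pack("<Q", y_bits))
        # Each lane sum is formed in binary64 and rounded once to binary32 when
        # packed; for a single addition this equals the correctly rounded
        # binary32 sum. A sum beyond the binary32 range raises OverflowError
        # here instead of producing infinity as hardware would.
        return struct.unpack("<Q", struct.pack("<2f", x0 + y0, x1 + y1))[0]
    raise ValueError(f"unknown mode: {mode!r}")
```

The point of the model is that the same 64-bit datapath width serves both modes; only the interpretation of the bits, and where carries and normalization are allowed to cross the lane boundary, changes.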
This paper presents a new architecture for time-to-digital conversion enabling a time resolution of 17 ps over a range of 50 ns with a conversion rate of 20 MS/s. The proposed architecture, implemented in a 65 nm FPGA system, consists of a pipelined interpolating time-to-digital converter (TDC). The TDC comprises a coarse time discriminator and a fine delay line, capable of sustained operation at a clock frequency of 300 MHz. A Turbo version of the circuit implements a pipelined interpolating TDC with suppressed dead time to reach a conversion rate of 300 MS/s, at the expense of a systematic asymmetry that requires fast error correction. The TDCs proposed in this paper can be compensated for process, voltage, and temperature (PVT) variations using conventional charge-pump-based feedback or a digital technique. Results demonstrate the suitability of the approach for a variety of applications involving precise, ultra-fast time discrimination, such as optical sensing, time-of-flight cameras, high-throughput communication links, and RADARs. Copyright 2009 ACM.
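The coarse/fine split of an interpolating TDC can be sketched as follows. This is a generic model under stated assumptions, not the paper's circuit: a coarse counter on the 300 MHz clock supplies whole periods, and the fine delay line is captured as a thermometer code whose taps are each assumed to be worth the quoted 17 ps. The capture convention and all names are hypothetical.

```python
CLOCK_HZ = 300e6   # coarse clock frequency quoted in the abstract
TAP_PS = 17.0      # assumption: one delay-line tap equals the 17 ps resolution

def decode_thermometer(bits):
    """Number of taps the hit has passed, from a captured thermometer code.

    Counting ones (rather than locating the first 0) tolerates isolated
    "bubbles" in the code, a common simple decoding choice.
    """
    return sum(bits)

def hit_time_ps(coarse_edge, bits):
    """Hit time in picoseconds relative to coarse clock edge 0.

    Assumed convention: the delay-line state is sampled on coarse edge
    `coarse_edge`, so the taps already passed measure how long before that
    edge the hit arrived.
    """
    period_ps = 1e12 / CLOCK_HZ
    return coarse_edge * period_ps - decode_thermometer(bits) * TAP_PS

def interval_ps(start, stop):
    """Time between two (coarse_edge, thermometer_bits) measurements."""
    return hit_time_ps(*stop) - hit_time_ps(*start)
```

With these numbers, a delay line spanning one 3.33 ns coarse period needs just under 200 taps (3333 / 17 ≈ 196), which is why in-system PVT compensation of the tap delay matters for the stated resolution.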
Packet classification is an important operation for applications such as routers, firewalls and intrusion detection systems. Many algorithms and hardware architectures for packet classification have been created, but none of them can compete with the speed of TCAMs in the worst case. We propose a new hardware-based algorithm for packet classification. The solution is based on problem decomposition and is aimed at the highest network speeds. A unique property of the algorithm is its constant time complexity in terms of external memory accesses: it performs exactly two external memory accesses to classify a packet. Using an FPGA and one commodity SRAM chip, a throughput of 150 million packets per second can be achieved, which corresponds to 100 Gbps for the shortest packets. Further performance scaling is possible with more or faster SRAM chips. Copyright 2009 ACM.
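The 150 million packets per second / 100 Gbps equivalence can be checked with a short calculation. We assume the "shortest packets" are minimum-size 64-byte Ethernet frames carrying the standard 20 bytes of per-frame overhead (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap); the abstract does not state the framing. The function names are our own.

```python
MIN_FRAME_BYTES = 64          # assumption: minimum-size Ethernet frame
PER_FRAME_OVERHEAD = 8 + 12   # preamble + SFD (8 bytes) and minimum inter-frame gap (12 bytes)
ACCESSES_PER_PACKET = 2       # external memory accesses per classification (from the abstract)

def packets_per_second(line_rate_bps, frame_bytes=MIN_FRAME_BYTES):
    """Packet rate of a fully loaded link carrying frames of `frame_bytes`."""
    return line_rate_bps / ((frame_bytes + PER_FRAME_OVERHEAD) * 8)

def sram_accesses_per_second(pps):
    """External memory access rate implied by a constant two accesses per packet."""
    return pps * ACCESSES_PER_PACKET

required_pps = packets_per_second(100e9)          # roughly 148.8 million packets/s
required_sram = sram_accesses_per_second(150e6)   # 300 million accesses/s
```

Because the access count per packet is fixed, the memory side scales linearly: the SRAM must sustain twice the packet rate, which is why more or faster SRAM chips translate directly into higher throughput.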
FPGA user clocks are slow enough that only a fraction of the interconnect's bandwidth is actually used. There may be an opportunity to use throughput-oriented interconnect to decrease routing and wire area using on-chip serial signaling, especially for datapath designs, which operate on words instead of bits. To do so, these links must operate reliably at very high bit rates. We evaluate wave pipelining and surfing source-synchronous schemes in the presence of power supply and crosstalk noise. In particular, noise is a critical modeling challenge; better models are needed for FPGA power grids. Our results show that wave pipelining can operate at rates as high as 5 Gbps for short links, but it is sensitive to noise in longer links and must run much slower to be reliable. In contrast, surfing achieves a stable operating bit rate of 3 Gbps and is relatively insensitive to noise. Copyright 2009 ACM.
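The wire-area argument comes down to simple arithmetic: a serial link fast enough to carry several user-clock bits per cycle replaces several parallel wires. The sketch below computes a lower bound on the number of serial wires for a word-wide datapath. The function name and the 200 MHz user clock in the example are hypothetical, and the bound ignores framing, serializer overhead and any strobe or clock wire a source-synchronous link may need.

```python
import math

def serial_wires_needed(word_bits, user_clock_hz, link_bps):
    """Lower bound on serial wires needed to move one `word_bits`-wide word per user clock."""
    if link_bps <= 0:
        raise ValueError("link_bps must be positive")
    return math.ceil(word_bits * user_clock_hz / link_bps)

# Hypothetical 32-bit datapath at 200 MHz: 6.4 Gbit/s of payload.
wave_pipelined = serial_wires_needed(32, 200e6, 5e9)   # short links at 5 Gbps
surfing = serial_wires_needed(32, 200e6, 3e9)          # surfing at 3 Gbps
```

Under these assumptions a 32-wire parallel bus shrinks to 2 wires at the 5 Gbps wave-pipelining rate or 3 wires at the 3 Gbps surfing rate; the results above suggest the surfing figure is the one that survives noise on longer links.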
Carbon nanotubes (CNTs), with their unique electronic properties, are promising materials for building nanoscale circuits. In this paper, we present a new CNT-based FPGA architecture called FPCNA. We define novel CNT- and nanoswitch-based components and characterize these components considering nano-specific process variations, including the variation caused by the random mixture of metallic and semiconducting CNTs. To evaluate the architecture, we develop a variation-aware physical design flow that can handle both Gaussian and non-Gaussian random variables using variation-aware placement and routing. When FPCNA is evaluated with this CAD flow, we see a 2.67× performance gain over a baseline CMOS FPGA at the same technology node (at a 95% performance yield). In addition, FPCNA offers a 4.5× footprint reduction compared to the baseline FPGA. These results demonstrate the potential of using CNTs and nanoswitches to build high-performance FPGA circuits. Copyright 2009 ACM.
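"Performance at a 95% performance yield" means comparing the critical-path delay that 95% of manufactured instances meet, rather than the nominal delay. A minimal way to estimate that from Monte Carlo samples is an empirical quantile, which makes no Gaussian assumption and so also works for the non-Gaussian variables mentioned above. This is a generic sketch, not the paper's CAD flow; the function names and the synthetic delay distribution in the example are hypothetical.

```python
import math
import random

def delay_at_yield(delay_samples, yield_target=0.95):
    """Smallest sampled delay that at least `yield_target` of the samples meet."""
    ordered = sorted(delay_samples)
    index = max(0, math.ceil(yield_target * len(ordered)) - 1)
    return ordered[index]

def gain_at_yield(baseline_samples, new_samples, yield_target=0.95):
    """Speed-up of the new design over the baseline, both evaluated at the same yield."""
    return delay_at_yield(baseline_samples, yield_target) / delay_at_yield(new_samples, yield_target)

# Synthetic example only: normalized delays with 5% standard deviation.
rng = random.Random(0)
baseline = [rng.gauss(1.0, 0.05) for _ in range(10000)]
```

Evaluating both designs at the same yield penalizes architectures with wide delay spread, such as those exposed to the metallic/semiconducting CNT mix, instead of crediting them for a fast nominal corner.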
The future of high-performance computing is likely to rely on the ability to efficiently exploit huge amounts of parallelism. One way of taking advantage of this parallelism is to formulate problems as "embarrassingly parallel" Monte Carlo simulations, which allow applications to achieve a linear speedup over multiple computational nodes without requiring a super-linear increase in inter-node communication. However, such applications are reliant on a cheap supply of high-quality random numbers, particularly for the three maximum-entropy distributions: uniform, used as a source of randomness; Gaussian, for discrete-time simulations; and exponential, for discrete-event simulations. In this paper we look at four different types of platform: multi-core CPUs (Intel Core2), GPUs (NVidia 200), FPGAs (Xilinx Virtex-5), and Massively Parallel Processor Arrays (Ambric AM2000). For each platform we determine the most appropriate algorithm for generating each type of number, then calculate the peak generation rate and estimated power efficiency for each device. Copyright 2009 ACM.
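As an illustration of how Gaussian and exponential streams are commonly derived from a uniform source, the sketch below uses two textbook methods: the Box-Muller transform and inversion of the exponential CDF. The abstract does not say which algorithm was chosen for each platform, so these are not necessarily the paper's choices. Python's `random.Random` (a Mersenne Twister) stands in for the uniform generator, and the function names are our own.

```python
import math
import random

def uniform01(rng):
    """Uniform variate in (0, 1]; excludes 0 so that log() below is safe."""
    return 1.0 - rng.random()   # random() returns values in [0, 1)

def box_muller(rng):
    """Two independent standard Gaussian variates from two uniforms."""
    u1, u2 = uniform01(rng), uniform01(rng)
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

def exponential(rng, rate=1.0):
    """Exponential variate with the given rate, by inverting the CDF: -ln(U) / rate."""
    return -math.log(uniform01(rng)) / rate

rng = random.Random(42)
gaussians = [g for _ in range(20000) for g in box_muller(rng)]
exps = [exponential(rng, rate=2.0) for _ in range(20000)]
```

Both transforms need transcendental functions (log, sqrt, sin/cos), which is one reason the best generator differs between platforms: operations that are cheap on a CPU or GPU can be costly in FPGA fabric, where table- or recurrence-based methods may be preferred.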