ISBN: 9781450305549 (print)
The ability to measure the delay of arbitrary circuits on an FPGA offers many opportunities for on-chip characterisation and optimisation. This paper describes an improved delay measurement method that monitors the transition probability at the output nodes as the operating frequency is swept. The new method uses optimised test vector generation to improve the accuracy of the test method. It is demonstrated on a 4th-order IIR filter circuit implemented on an Altera Cyclone III FPGA.
ISBN: 9781450305549 (print)
This paper introduces the Delaware Enhanced Emulation Platform (DEEP), an FPGA-based emulation system for hardware/software co-verification of many-core chip architectures. The platform has three characteristics: fast compilation of logic designs, debugging support, and affordability. It is based on a novel iterative emulation methodology for hardware design and verification. We also designed and integrated a new architectural feature that provides Full/Empty bit fine-grain synchronization for the IBM Cyclops-64 many-core architecture, and evaluated its performance against existing synchronization constructs.
ISBN: 9781450311557 (print)
High-speed IP lookup remains a challenging problem in next-generation routers due to the ever-increasing line rate and routing table size. In addition, the evolution towards IPv6 requires long prefix lengths, sparse prefix distributions, and potentially very large routing tables. In this paper, we propose a novel Combined Length-Infix Pipelined Search (CLIPS) architecture for IPv6 routing table lookup on FPGA. CLIPS solves the longest prefix match (LPM) problem by combining prefix length search and infix pattern search. Binary search in prefix length is performed on the 64-bit routing prefix of IPv6 down to an 8-bit length range in log2(64/8) = 3 phases; each phase performs a fully pipelined infix pattern search with only one external memory access. A fourth and final phase then finds the LPM (if any) within the 8-bit length range in a compressed multi-bit trie. We describe the algorithms and data structures used for CLIPS construction, run-time operation, dynamic update and false-positive avoidance. The proposed solution improves the on-chip memory efficiency on FPGA and maximizes the external SRAM utilization; additional properties that make the scheme practical include modular construction, easy dynamic update, and simple resource allocation. Using a state-of-the-art FPGA, our CLIPS prototype supports up to 2.7 million IPv6 prefixes when employing 33 Mbits of BRAM and 4 channels of external SRAM. The prototype achieves a sustained throughput of 264 million IPv6 lookups per second, or 135 Gbps with minimum-size (64-byte) packets.
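The two figures quoted in this abstract (the number of binary-search phases and the line rate) can be checked with a few lines of arithmetic. The sketch below is only a sanity check of those numbers; the function names are hypothetical and are not taken from the paper.

```python
import math


def binary_search_phases(prefix_bits: int, range_bits: int) -> int:
    """Halving steps needed to narrow `prefix_bits` candidate lengths
    down to a window of `range_bits` lengths (both assumed powers of two)."""
    return int(math.log2(prefix_bits // range_bits))


def line_rate_gbps(lookups_per_sec: float, packet_bytes: int) -> float:
    """Line rate in Gbps when every packet needs exactly one lookup."""
    return lookups_per_sec * packet_bytes * 8 / 1e9


phases = binary_search_phases(64, 8)      # 64-bit routing prefix, 8-bit range
throughput = line_rate_gbps(264e6, 64)    # 264 M lookups/s, 64-byte packets
print(phases, round(throughput, 1))
```

The 64-byte packet size and one lookup per packet are the assumptions stated in the abstract; the result rounds to the quoted 135 Gbps.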
ISBN: 9781450305549 (print)
The popularity of FPGAs is rapidly growing due to the unique advantages that they offer. However, their distinctive features also raise new questions concerning the security and communication capabilities of an FPGA-based hardware platform. In this paper, we explore some of the limits of FPGA side-channel communication. Specifically, we identify a previously unexplored capability that significantly increases both the potential benefits and risks associated with side-channel communication on an FPGA: an in-device receiver. We designed and implemented three new communication mechanisms: speed modulation, timing modulation and pin hijacking. These non-traditional interfacing techniques have the potential to provide reliable communication with estimated maximum bandwidths of 3.3 bits/sec, 8 Kbits/sec, and 3.4 Mbits/sec, respectively.
ISBN: 9781450305549 (print)
Monte-Carlo arithmetic is a form of self-validating arithmetic that accounts for the effect of rounding errors. We have implemented a floating point unit that can perform either IEEE 754 or Monte-Carlo floating point computation, allowing hardware-accelerated validation of results during execution. Experiments show that our approach has a modest hardware overhead and allows the propagation of rounding error to be accurately estimated.
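To illustrate the general idea behind Monte-Carlo arithmetic, the following is a minimal software sketch, not the hardware unit described in the abstract. It assumes a common formulation in which each result is perturbed by uniform random noise at a chosen virtual precision of t bits, and the spread over repeated runs indicates how much rounding error has accumulated. The names `perturb`, `mca_sum` and `estimate`, and the default t = 24, are hypothetical choices made for this example.

```python
import math
import random
import statistics


def perturb(x: float, t: int, rng: random.Random) -> float:
    """Add uniform noise of relative size about 2**-t to x (assumed noise model)."""
    if x == 0.0:
        return x
    exponent = math.frexp(x)[1]        # x = m * 2**exponent, 0.5 <= |m| < 1
    xi = rng.uniform(-0.5, 0.5)
    return x + xi * 2.0 ** (exponent - t)


def mca_sum(values, t: int, rng: random.Random) -> float:
    """Sum the values, randomly perturbing every intermediate result."""
    total = 0.0
    for v in values:
        total = perturb(total + v, t, rng)
    return total


def estimate(values, t: int = 24, samples: int = 200, seed: int = 0):
    """Return (mean, standard deviation) of the sum over repeated random runs."""
    rng = random.Random(seed)
    results = [mca_sum(values, t, rng) for _ in range(samples)]
    return statistics.mean(results), statistics.stdev(results)


# A sum with catastrophic cancellation shows a far larger spread than a
# well-conditioned one, flagging that few digits of its result are reliable.
print(estimate([1e8, 1.0, -1e8]))
print(estimate([1.0, 2.0, 3.0]))
```

A large standard deviation relative to the mean suggests that the computed result has lost most of its significant digits.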
ISBN: 9781450305549 (print)
The programmable interconnection resources are one aspect that distinguishes FPGAs from other devices. The abundance of these resources in modern devices almost always assures us that the most complex design can be routed. This underutilized resource can be used for other unintended purposes. One such use, explored here, is to concatenate large networks together to form pseudo-equipotential geometric shapes. These shapes can then be evaluated in terms of their ability to radiate (modulated) energy off the chip to a nearby receiver. In this paper, an unconventional method of building such transmitters on an FPGA is proposed. Arbitrarily shaped antennas are created using a unique flow involving an experimental router and binary images. An experimental setup is used to measure the performance of the antennas created.
ISBN: 9781450305549 (print)
Memory-related constraints (memory bandwidth, cache size) are nowadays the performance bottleneck of most computational applications. Especially with multiple cores, performance in many cases does not scale with the number of cores. In our work, we present our FPGA-based solution for the 3D Reverse Time Migration (RTM) algorithm. As the most computationally demanding imaging algorithm in current oil and gas exploration, RTM involves various computational challenges, such as a high demand for storage size and bandwidth, and poor cache behavior. Combining optimizations from both the algorithmic and architectural perspectives, our FPGA-based solution manages to remove the memory constraints and provide high performance that scales well with the amount of computational resources available. Compared with an optimized CPU implementation using two quad-core Intel Nehalem CPUs, our solution achieves a 4x speedup on two Virtex-5 FPGAs, and an 8x speedup on two Virtex-6 FPGAs. Our projection demonstrates that the performance will continue to scale with future increases in FPGA capacity.
ISBN: 9781450305549 (print)
This paper analyses different hardware sorting architectures in order to implement a highly scalable sorter for solving huge problems at high performance up to the GB range in linear time complexity. It will be proven that a combination of a FIFO-based merge sorter and a tree-based merge sorter results in the best performance at low cost. Moreover, we will demonstrate how partial run-time reconfiguration can be used for saving almost half the FPGA resources or alternatively for improving the speed. Experiments show a sustainable sorting throughput of 2 GB/s for problems fitting into the on-chip FPGA memory and 1 GB/s when using external memory. These values surpass the best published results on large problem sorting implementations on FPGAs, GPUs, and the Cell processor.
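The basic operation behind a FIFO-based merge sorter can be sketched in software: two sorted streams sit in FIFOs, and each step compares only the two head elements and emits the smaller one. The code below is a plain Python illustration of that idea under this assumption, not the hardware architecture from the paper; `fifo_merge` and `merge_sort_runs` are hypothetical names, and the software version makes no claim about the linear-time behaviour reported for the hardware.

```python
from collections import deque


def fifo_merge(a, b):
    """Merge two sorted sequences by comparing only the FIFO heads."""
    a, b = deque(a), deque(b)
    out = []
    while a and b:
        # Emit the smaller head; ties take from `a` to keep the merge stable.
        out.append(a.popleft() if a[0] <= b[0] else b.popleft())
    out.extend(a)   # at most one of the two FIFOs still holds elements
    out.extend(b)
    return out


def merge_sort_runs(values):
    """Sort by repeatedly merging runs taken from a queue of sorted runs."""
    runs = deque([v] for v in values)   # every element starts as a run of length 1
    while len(runs) > 1:
        runs.append(fifo_merge(runs.popleft(), runs.popleft()))
    return runs[0] if runs else []


print(merge_sort_runs([5, 2, 9, 1, 5, 6]))
```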
ISBN: 9781450305549 (print)
In recent years, the classic method of Coordinate Rotation by Digital Computer (CORDIC) arithmetic has been widely implemented as part of the computational requirements of the well-known QR-RLS (Recursive Least Squares) algorithm. In order to perform Givens rotation on a complex number system, double angle complex rotation (DACR) was adopted to simplify the computational requirement of Complex Givens Rotation. This paper presents a new architecture for a high-speed CORDIC-based single Processor Element (PE) that can be used to accomplish the complex-valued QR-update-based RLS. Implementation results on a Xilinx FPGA demonstrate that the proposed structure achieves lower latency and lower cost.
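For readers unfamiliar with CORDIC, the sketch below shows the textbook rotation-mode iteration, in which a vector is rotated through a sequence of fixed micro-angles atan(2^-i) using only shift-like scalings and additions. It is a generic floating-point illustration, not the DACR-based processor element proposed in the paper; a hardware version would use fixed-point shifts, and the function name and iteration count are assumptions.

```python
import math


def cordic_rotate(x: float, y: float, angle: float, iterations: int = 32):
    """Rotate (x, y) by `angle` radians (|angle| up to about 1.74)
    using rotation-mode CORDIC micro-rotations."""
    # Each micro-rotation stretches the vector; accumulate the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle                              # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0        # rotate towards zero residual
        shift = 2.0 ** -i                  # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)          # a lookup-table constant in hardware
    return x / gain, y / gain              # undo the accumulated gain


x, y = cordic_rotate(1.0, 0.0, math.pi / 6)
print(x, y)   # close to cos(pi/6), sin(pi/6)
```

Givens rotations in QR decomposition typically also need the vectoring mode, which drives y to zero instead of the residual angle; that mode is omitted here for brevity.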
ISBN: 9781450305549 (print)
Long FPGA CAD runtime has emerged as a limitation to the future scaling of FPGA densities. Already, compile times on the order of a day are common, and the situation will only get worse as FPGAs get larger. Without a concerted effort to reduce compile times, further scaling of FPGAs will eventually become impractical. Previous works have presented fast CAD tools that trade off quality of result for compile time. In this paper, we take a different but complementary approach. We show that the architecture of the FPGA itself can be designed to be amenable to fast compilation. If not done carefully, this can lead to lower-quality mapping results, so a careful trade-off between area, delay, power, and compile run-time is essential. We investigate the extent to which run-time can be reduced by employing high-capacity logic blocks. We extend previous studies on logic block architectures by quantifying the area, delay and CAD run-time trade-offs for large-capacity blocks, and also investigate some multi-level logic block architectures. In addition, we present an analytically derived equation to guide the design of logic block I/O requirements.