ISBN: 9781595939340 (print)
This paper introduces a method of constructing random number generators from four of the basic primitives provided by FPGAs: Flip-Flops, Lookup-Tables, Shift Registers, and RAMs. The construction method is designed to ensure maximum clock rates while using the minimum of resources and providing statistical quality at the level of the best software generators. On all platforms tested, the generators are limited in speed only by the clock distribution network or the maximum clock speed of the underlying RAM primitives, using a platform-independent VHDL description with no placement or other hints. The area utilisation is also very low, with a Virtex-5 generator requiring just one Block-RAM and 41 slices to produce 48Gb/s at 550MHz: over 14 times faster than the commonly used Mersenne-Twister RNG on an Opteron at 2.2GHz, while providing the same level of quality.
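As a rough illustration of why shift registers and XOR logic suit random number generation, the sketch below models a 16-bit Galois linear feedback shift register in software. It is not the construction from the paper: the function names, the 16-bit width, and the tap mask 0xB400 (a commonly cited maximal-length choice) are assumptions chosen for illustration only.

```python
def lfsr_step(state: int, taps: int = 0xB400) -> int:
    """Advance a 16-bit Galois LFSR by one clock and return the new state.

    Hypothetical helper for illustration; in hardware the shift maps to
    flip-flops and the conditional XOR maps to lookup tables.
    """
    lsb = state & 1          # bit shifted out this cycle
    state >>= 1              # shift the register right by one
    if lsb:
        state ^= taps        # feed the output bit back into the tap positions
    return state


def lfsr_period(seed: int = 0xACE1, taps: int = 0xB400) -> int:
    """Count the clocks needed for the register to return to its seed."""
    if seed & 0xFFFF == 0:
        raise ValueError("the all-zero state never changes; use a non-zero seed")
    state = lfsr_step(seed & 0xFFFF, taps)
    count = 1
    while state != seed & 0xFFFF:
        state = lfsr_step(state, taps)
        count += 1
    return count


if __name__ == "__main__":
    print("period:", lfsr_period())
```

A single small LFSR like this falls well short of the statistical quality the abstract claims; it only shows the kind of bit-level recurrence that FPGA primitives implement cheaply.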
Leakage power has been overshadowed by dynamic power minimization techniques in FPGAs, yet it is a growing concern in programmable logic. This paper proposes a dual threshold voltage implementation of the FPGA architecture for leakage power reduction. A CAD flow is developed for assigning high threshold voltage to the logic elements within the logic blocks of the FPGA. The CAD flow ensures that all the logic blocks remain identical with respect to the number of high and low threshold voltage logic elements that each logic block contains. This CAD flow leads to a dual threshold voltage implementation for the FPGA architecture. Results indicate that over 95% of the logic elements in the FPGA can be assigned high threshold voltage. On average, leakage savings of 60% can be achieved, rising to 70% for some benchmarks. The proposed CAD flow forms a basis on which other dual threshold voltage implementations of FPGAs can be evaluated. We investigate the design trade-offs between the ratio of high-Vt to low-Vt logic elements in a cluster and the leakage savings. We also investigate the impact of cluster size on leakage savings for the dual threshold voltage implementation.
ISBN: 9781450305549 (print)
FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGA's lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify development and to improve an application's portability and scalability.
We consider active leakage power dissipation in FPGAs and present a "no cost" approach for active leakage reduction. It is well known that the leakage power consumed by a digital CMOS circuit depends strongly on the state of its inputs. Our leakage reduction technique leverages a fundamental property of basic FPGA logic elements (look-up tables) that allows a logic signal in an FPGA design to be interchanged with its complemented form without any area or delay penalty. We apply this property to select polarities for logic signals so that FPGA hardware structures spend the majority of time in low-leakage states. In an experimental study, we optimize active leakage power in circuits mapped into a state-of-the-art 90nm commercial FPGA. Results show that the proposed approach reduces active leakage by 25%, on average.
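The look-up-table property mentioned in the abstract above can be sketched in a few lines: if an input signal is replaced by its complement, the LUT contents can be permuted so that the output is unchanged, which is why the swap costs no extra logic. The truth-table representation and the names `lut_eval` and `absorb_inversion` are assumptions made for this illustration, not the authors' tooling.

```python
from itertools import product


def lut_eval(table, inputs):
    """Evaluate a k-input LUT; inputs[0] is the least significant address bit."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return table[addr]


def absorb_inversion(table, pin):
    """Return LUT contents that compute the same function when the signal
    on input `pin` arrives complemented (hypothetical helper)."""
    mask = 1 << pin
    # Entry `addr` of the new table takes the value the old table held at
    # the address with the chosen input bit flipped.
    return [table[addr ^ mask] for addr in range(len(table))]


if __name__ == "__main__":
    and2 = [0, 0, 0, 1]                   # 2-input AND, address = a + 2*b
    and2_inv_a = absorb_inversion(and2, 0)
    for a, b in product((0, 1), repeat=2):
        # Driving the rewritten LUT with NOT a gives the original output.
        assert lut_eval(and2_inv_a, (1 - a, b)) == lut_eval(and2, (a, b))
    print(and2_inv_a)
```

The table size stays the same after the rewrite, so only the stored bits change; choosing which polarity leaves the hardware in a low-leakage state is the optimization the abstract describes and is not modelled here.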
The purpose of this paper is to introduce a modified packing and placement algorithm for FPGAs that utilizes logic duplication to improve performance. The modified packing algorithm was designed to leave unused basic logic elements (BLEs) in timing-critical clusters, to allow potential targets for logic duplication. The modified placement algorithm consists of a new stage after placement in which logic duplication is performed to shorten the length of the critical path. In this paper, we show that in a representative FPGA architecture using 0.18 μm technology, the length of the final critical path can be reduced by an average of 14.1%. Approximately half of this gain comes directly from the changes to the packing algorithm, while the other half comes from the logic duplication performed during placement.
In this paper, we present a new functional unit to replace the LUT in an FPGA-like computational fabric designed specifically to accelerate instance-specific sparse integer matrix multiplication. We use a suit...
This paper introduces a methodology for prototyping Globally Asynchronous Locally Synchronous (GALS) circuits on synchronous commercial FPGAs. A library of required elements for implementing GALS circuits is proposed, and general design considerations for successfully implementing a GALS circuit on an FPGA are discussed. The library includes clock generators, arbiters, and different port controllers. Different implementations of these circuits and their advantages and disadvantages are explored. Finally, we present a GALS Reed-Solomon decoder as a practical example. The results show that the GALS approach improves the performance of the circuit by 11% and reduces the power consumption by 18.7% to 19.6%, depending on the error rate. On the other hand, the area of the circuit is increased by 51%, which is acceptable considering that a purely synchronous circuit with a central controller is decomposed to generate the GALS system, and 29% of this overhead comes from distributing the controller across different modules. Deploying better decomposition methods can reduce this overhead substantially.
In designing FPGAs, it is important to achieve a good balance between the number of logic blocks, such as Look-Up Tables (LUTs), and wiring resources. It is difficult to find an optimal solution. In this paper, we present an FPGA design methodology to efficiently find well-balanced FPGA architectures. The method covers all aspects of FPGA development from the architecture-decision process to physical implementation. It has been used to develop a new FPGA that can implement circuits that are twice as large as those implementable with the previous version but with half the number of logic blocks. This indicates that the methodology is effective in developing well-balanced FPGAs.
ISBN: 9781581134520 (print)
As programmable logic grows more viable for implementing full design systems, performance has become a primary issue for programmable logic device architectures. This paper presents the high-level design of Dali, a PLD architecture specifically aimed at performance-driven applications. We will present significant portions of the background research that contributed to our architectural decisions, an overview of the core routing architecture and benchmarking experiments used to evaluate the prototype device.
This paper presents a new approach to timing optimization for FPGA designs, namely incremental physical resynthesis, to answer the challenge of effectively integrating logic and physical optimizations without incurring unmanageable runtime complexity. Unlike previous approaches to this problem, which limit the types of operations and/or architectural features, we take advantage of many architectural characteristics of modern FPGA devices and utilize many types of optimizations, including cell repacking, signal rerouting, resource retargeting, and logic restructuring, accompanied by efficient incremental placement, to gradually transform a design via a series of localized logic and physical optimizations that verifiably improve overall compliance with timing constraints. This procedure works well on small and large designs, and can be administered through either an automatic optimizer or an interactive user interface. Our preliminary experiments showed that this approach is very effective in fixing or reducing timing violations that cannot be reduced by other optimization techniques: for a set of test cases to which this is applicable, the worst timing violation is reduced by an average of 42.8%.