We present the design of a high-performance, highly pipelined asynchronous FPGA. We describe a very fine-grain pipelined logic block and routing interconnect architecture, and show how asynchronous logic can efficiently take advantage of this large amount of pipelining. Our FPGA, which does not use a clock to sequence computations, automatically "self-pipelines" its logic without the designer needing to be explicitly aware of all pipelining details. This property makes our FPGA ideal for throughput-intensive applications, and only minimal place-and-route support is required to achieve good performance. Benchmark circuits taken from both the asynchronous and clocked design communities yield throughputs in the neighborhood of 300-400 MHz in a TSMC 0.25μm process and 500-700 MHz in a TSMC 0.18μm process.
ISBN: (Print) 9781450361378
Convolutional neural networks (CNNs) have achieved great success in machine learning applications, and much attention has been paid to their acceleration on field-programmable gate arrays (FPGAs). The most demanding computational complexity of CNNs is found in the convolutional layers, which account for 90% of the total operations. The fact that parameters in convolutional layers do not change over a long time interval in weight-stationary CNNs allows the use of reconfiguration to reduce the resource requirements. This work proposes several alternative reconfiguration schemes that significantly reduce the complexity of sum-of-products operations. The proposed direct configuration schemes provide the least resource requirements and fast reconfiguration times of 32 clock cycles but require additional memory for the pre-computed configurations. The proposed online reconfiguration scheme uses an online computation of the LUT contents to avoid this memory overhead. Finally, a scheme that duplicates the reconfigurable LUTs is proposed, for which the reconfiguration time can be completely hidden in the computation time. Combined with a few online reconfiguration circuits, this provides the same configuration memory and configuration time as a conventional parallel kernel but offers large resource reductions of up to 80% of the LUTs.
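The idea of computing LUT contents online from fixed weights can be illustrated with distributed arithmetic, a standard LUT-based sum-of-products technique; the weight and activation values below are illustrative examples, not taken from the paper.

```python
# Sketch of distributed arithmetic: with fixed (weight-stationary) kernel
# weights, a 2^K-entry LUT indexed by one bit-slice of K activations
# replaces the multipliers in a sum-of-products. Values are illustrative.

def build_lut(weights):
    """Precompute a 2^K-entry LUT: entry 'addr' holds the sum of the
    weights whose corresponding address bit is set."""
    k = len(weights)
    return [sum(w for i, w in enumerate(weights) if (addr >> i) & 1)
            for addr in range(1 << k)]

def sum_of_products(xs, lut, nbits):
    """Evaluate sum_i w_i * x_i one bit-slice at a time: for each bit
    position b, gather bit b of every activation into a LUT address,
    then shift-accumulate the LUT output."""
    acc = 0
    for b in range(nbits):
        addr = sum(((x >> b) & 1) << i for i, x in enumerate(xs))
        acc += lut[addr] << b
    return acc

weights = [3, -1, 4, 2]   # fixed convolution weights (example)
lut = build_lut(weights)  # "online reconfiguration" recomputes this table
xs = [5, 7, 2, 9]         # unsigned 4-bit activations (example)
assert sum_of_products(xs, lut, nbits=4) == sum(w * x for w, x in zip(weights, xs))
```

The `build_lut` step corresponds to what a reconfiguration circuit would write into the FPGA's LUTs; only when the weights change does the table need to be regenerated.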
This paper discusses architectural issues arising from the use of dynamic reconfiguration and shows a possible use of dynamic reconfiguration to extend and accelerate a computation performed in system-on-a-chip designs with microprocessors with fixed instruction sets. Further, a sample application is discussed that uses a dynamically reconfigurable FPGA to implement different floating-point calculations in hardware, reconfigured as required by the execution of the user code. The implementation data for two dynamically reconfigurable platforms available on the market - the Xilinx Virtex2 family FPGAs and the Atmel FPSLIC family FPGAs - is compared in terms of resource requirements, operating frequency, and power consumption.
In this work, we parameterize and explore the interconnect structure of pipelined FPGAs. Specifically, we explore the effects of interconnect register population, the length of registered routing track segments, registered I/O terminals of logic units, and the flexibility of the interconnect structure on the performance of a pipelined FPGA. Our experiments with the RaPiD architecture identify tradeoffs that must be made while designing the interconnect structure of a pipelined FPGA. The post-exploration architecture that we found shows a 19% improvement over RaPiD, while the area overhead incurred in placing and routing benchmark netlists on the post-exploration architecture is 18%.
ISBN: (Print) 9781595930293
This paper presents an analysis of the potential yield loss in FPGAs due to random defects in metal layers. A proven yield model is adapted to target the FPGA interconnect layers in order to predict the manufacturing yield. Defect parameters from the 2003 SIA roadmap are used to investigate the trend in yield loss due to defects in interconnect layers in the future. It is shown that the low yield predicted for the 45nm technology node and beyond is a cause for concern. The potential impact on yield of two different approaches, namely redundant circuits and fault-tolerant design, is also presented. Copyright 2005 ACM.
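The abstract does not reproduce the specific yield model it adapts; as an illustration of the kind of calculation involved, the sketch below uses the standard negative-binomial yield model, with defect densities that are examples rather than roadmap values.

```python
# Sketch of random-defect yield prediction, assuming the widely used
# negative-binomial model: Y = (1 + A*D0/alpha)^(-alpha), where A is the
# critical area, D0 the defect density, and alpha the clustering factor.
# All numbers below are illustrative, not from the paper or SIA roadmap.

def interconnect_yield(area_cm2, d0_per_cm2, alpha):
    """Negative-binomial yield for one layer of critical area 'area_cm2'."""
    return (1.0 + area_cm2 * d0_per_cm2 / alpha) ** (-alpha)

# Yield falls as defect density rises for a fixed 1 cm^2 critical area:
for d0 in (0.1, 0.5, 1.0, 2.0):  # defects per cm^2 (example values)
    y = interconnect_yield(area_cm2=1.0, d0_per_cm2=d0, alpha=2.0)
    print(f"D0 = {d0:.1f}/cm^2 -> yield = {y:.3f}")
```

For multiple metal layers, per-layer yields of this form are typically multiplied, which is why adding interconnect layers at constant defect density compounds the yield loss.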
ISBN: (Print) 9781595930293
Advanced Microelectronics Department at Sandia National Laboratories. We present an automatic logic synthesis method targeted for high-performance asynchronous FPGA (AFPGA) architectures. Our method transforms sequential programs as well as high-level descriptions of asynchronous circuits into fine-grain asynchronous process netlists suitable for an AFPGA. The resulting circuits are inherently pipelined and can be physically mapped onto our AFPGA with standard partitioning and place-and-route algorithms. For a wide variety of benchmarks, our automatic synthesis method not only yields logic densities and performance comparable to those achieved by hand placement, but also attains a throughput close to the peak performance of the FPGA. Copyright 2005 ACM.
This paper presents an FPGA-specific implementation of the floating-point tangent function. The implementation inputs values in the interval [-π/2,π/2], targets the IEEE-754 single-precision format and has an accura...
Recent developments have shown FPGAs to be effective for data centre applications, but debugging support in that environment has not evolved correspondingly. This presents an additional barrier to widespread adoption....
ISBN: (Print) 9781450311557
In floating-point datapaths synthesized on FPGAs, the shifters that perform mantissa alignment and normalization consume a disproportionate number of LUTs. Shifters are implemented using several rows of small multiplexers; unfortunately, multiplexer-based logic structures map poorly onto LUTs. FPGAs, meanwhile, contain a large number of multiplexers in the programmable routing network; these multiplexers are placed under static control of the FPGA's configuration bitstream. In this work, we modify some of the routing multiplexers in the intra-cluster routing network of a CLB in an FPGA to implement shifters for floating-point mantissa alignment and normalization; the number of CLBs required for these operations is reduced by 67%. If shifting is not required, the routing multiplexers that have been modified can be configured to operate as normal routing multiplexers, so no functionality is sacrificed. The area overhead incurred by these modifications is small, and there is no need to modify every routing multiplexer in the FPGA. Experiments show that there is no negative impact in terms of clock frequency or routability for benchmarks that do not use the dynamic multiplexers.
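The "several rows of small multiplexers" structure is a logarithmic barrel shifter: stage s either passes bits straight through or shifts by 2^s, selected by bit s of the shift amount. The sketch below models that structure bit-by-bit; the 8-bit width is an illustrative choice, not a parameter from the paper.

```python
# Sketch of a logarithmic right-shifter built from rows of 2:1 multiplexers,
# the mux structure that the paper maps onto FPGA routing multiplexers.
# Width is an illustrative 8 bits; mantissa shifters would be wider.

def mux2(sel, a, b):
    """2:1 multiplexer: output b when sel is 1, else a."""
    return b if sel else a

def barrel_shift_right(bits, shamt, width):
    """Shift a bit-list right by 'shamt' using log2(width) mux rows.
    bits[0] is the LSB; vacated high positions fill with 0."""
    stage = 0
    while (1 << stage) < width:
        sel = (shamt >> stage) & 1      # bit 'stage' of the shift amount
        k = 1 << stage                  # this row shifts by 2^stage
        bits = [mux2(sel, bits[i], bits[i + k] if i + k < width else 0)
                for i in range(width)]
        stage += 1
    return bits

def to_bits(v, w):
    return [(v >> i) & 1 for i in range(w)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

# An 8-bit shifter needs 3 mux rows (shift-by-1, -2, -4):
assert from_bits(barrel_shift_right(to_bits(0b10110100, 8), 3, 8)) == 0b10110100 >> 3
```

Each row is `width` 2:1 muxes under a single select bit, which is exactly the shape that maps poorly onto LUTs but naturally onto statically (or here, dynamically) controlled routing multiplexers.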
ISBN: (Print) 9781450305549
Memory-related constraints (memory bandwidth, cache size) are nowadays the performance bottleneck of most computational applications. Especially in the scenario of multiple cores, the performance does not scale with the number of cores in many cases. In our work, we present our FPGA-based solution for the 3D Reverse Time Migration (RTM) algorithm. As the most computationally demanding imaging algorithm in current oil and gas exploration, RTM involves various computational challenges, such as a high demand for storage size and bandwidth, and poor cache behavior. Combining optimizations from both the algorithmic and architectural perspectives, our FPGA-based solution manages to remove the memory constraints and provide high performance that scales well with the amount of computational resources available. Compared with an optimized CPU implementation using two quad-core Intel Nehalem CPUs, our solution achieves a 4x speedup on two Virtex-5 FPGAs and an 8x speedup on two Virtex-6 FPGAs. Our projection demonstrates that the performance will continue to scale with future increases in FPGA capacities.
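The memory pressure in RTM comes from its core kernel, a 3D finite-difference time step of the acoustic wave equation that touches a large neighborhood of every grid point per update. The sketch below shows a minimal version of that kernel; the 7-point stencil order, grid size, and coefficient are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of one RTM forward-propagation time step, assuming a 2nd-order
# 7-point stencil: p_next = 2*p - p_prev + c * laplacian(p), with
# c = (v*dt/dx)^2. Grid size and coefficient are illustrative.

def rtm_step(p_prev, p_cur, c):
    """Advance the wavefield one time step on a 3D grid of nested lists,
    updating interior points only (boundaries stay zero)."""
    n = len(p_cur)
    p_next = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                lap = (p_cur[i-1][j][k] + p_cur[i+1][j][k]
                       + p_cur[i][j-1][k] + p_cur[i][j+1][k]
                       + p_cur[i][j][k-1] + p_cur[i][j][k+1]
                       - 6.0 * p_cur[i][j][k])
                p_next[i][j][k] = (2.0 * p_cur[i][j][k]
                                   - p_prev[i][j][k] + c * lap)
    return p_next

n = 8
zero = [[[0.0] * n for _ in range(n)] for _ in range(n)]
src = [[[0.0] * n for _ in range(n)] for _ in range(n)]
src[n // 2][n // 2][n // 2] = 1.0   # impulsive point source at the center
out = rtm_step(zero, src, c=0.1)    # wave spreads to the 6 face neighbors
```

Each update reads 7 points from the current wavefield plus one from the previous one, so with grids far larger than on-chip memory the kernel is bandwidth-bound — which is why streaming the grid through an FPGA datapath can outperform cache-based CPUs here.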