ISBN (print): 9781424419609
Exploiting the underutilisation of variable-length DSP algorithms during normal operation is vital when seeking to maximise the achievable functionality of an application within a peak power budget. A system-level, low-power design methodology for FPGA-based, variable-length DSP IP cores is presented. Algorithmic commonality is identified and resources are mapped onto a configurable datapath to increase achievable functionality. The methodology is applied to a digital receiver application, where a 100% increase in operational capacity is achieved in certain modes without significant increases in the power or area budget. Measured results show the resulting architectures require 19% less peak power, 33% fewer multipliers and 12% fewer slices than existing architectures.
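As a rough illustration of the resource-sharing idea (not the paper's actual architecture; all names and numbers below are invented), the sketch shows how a datapath folded for the longest filter mode leaves idle multiplier-cycles in shorter modes, which is the headroom that can host extra functionality without raising the peak budget:

    # Illustrative only (names and numbers invented): a datapath sized for the
    # longest filter mode leaves idle multiplier-cycles in shorter modes, and
    # that headroom can host extra functions without raising the peak budget.
    MAC_POOL = 16                     # multipliers in the shared datapath
    CYCLES_PER_OUTPUT = 4             # folding factor fixed by the longest mode

    MODES = {"long": 64, "medium": 32, "short": 16}   # taps per mode (hypothetical)

    for mode, taps in MODES.items():
        budget = MAC_POOL * CYCLES_PER_OUTPUT   # MAC-cycles available per output
        spare = budget - taps                   # MAC-cycles left for extra functions
        print(f"{mode:>6}: {taps}/{budget} MAC-cycles used, {spare} spare")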
ISBN (print): 9781424438914
Fast carry chains featuring dedicated adder circuitry are a distinctive feature of modern FPGAs. The carry chains bypass the general routing network and are embedded in the logic blocks of FPGAs for fast addition. Conventional intuition holds that such carry chains can be used only for implementing carry-propagate addition; state-of-the-art FPGA synthesizers can exploit the carry chains only for these specific circuits. This paper demonstrates that the carry chains can be used to build compressor trees, i.e., multi-input addition circuits used for parallel accumulation and partial-product reduction in parallel multipliers implemented in FPGA logic. The key to our technique is to program the lookup tables (LUTs) in the logic blocks to stop the propagation of carry bits along the carry chain at appropriate points. This approach significantly improves the area of compressor trees compared to previous methods that synthesized compressor trees solely on LUTs, without compromising the performance gain over trees built from ternary carry-propagate adders.
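The functional idea behind a compressor tree, reducing many addends to a carry-save form before a single final carry-propagate addition, can be conveyed with a short behavioural sketch; this is only a software analogue of the arithmetic, not the paper's LUT/carry-chain mapping:

    # Behavioural sketch of a compressor tree: repeatedly apply 3:2 carry-save
    # compression until two operands remain, then do one carry-propagate add.
    def compress_3_2(a, b, c):
        """One layer of full adders over three operands (carry-save form)."""
        s = a ^ b ^ c                                 # per-bit sum
        carry = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carry, shifted left
        return s, carry

    def compressor_tree(operands):
        ops = list(operands)
        while len(ops) > 2:
            a, b, c = ops.pop(), ops.pop(), ops.pop()
            ops.extend(compress_3_2(a, b, c))
        return sum(ops)                               # final carry-propagate addition

    assert compressor_tree([13, 7, 250, 3, 99]) == 13 + 7 + 250 + 3 + 99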
ISBN (print): 9782839918442
Convolutional neural networks (CNNs) are revolutionizing a variety of machine learning tasks, but they present significant computational challenges. Recently, FPGA-based accelerators have been proposed to improve the speed and efficiency of CNNs. Current approaches construct an accelerator optimized to maximize the overall throughput of iteratively computing the CNN layers. However, this approach leads to dynamic resource underutilization because the same accelerator is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator design that improves the dynamic resource utilization. Using the same FPGA resources, we build multiple accelerators, each specialized for specific CNN layers. Our design achieves 1.3x higher throughput than the state of the art when evaluating the convolutional layers of the popular AlexNet CNN on a Xilinx Virtex-7 FPGA.
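The underutilisation argument can be made concrete with a toy throughput model; the layer dimensions and array widths below are invented for illustration and do not correspond to the paper's design points:

    # Toy model (invented numbers) of the underutilisation argument: one
    # accelerator sized for a fixed array width wastes compute on layers whose
    # dimensions do not fill it evenly, while per-layer-specialised accelerators
    # built from the same total resources keep each pipeline busy.
    LAYERS = [(96, 55), (256, 27), (384, 13)]   # hypothetical (filters, fmap) pairs

    def utilisation(array_w, work_w):
        """Fraction of an array of width array_w doing useful work on work_w."""
        full, rem = divmod(work_w, array_w)
        passes = full + (1 if rem else 0)
        return work_w / (passes * array_w)

    single = min(utilisation(64, f) for f, _ in LAYERS)    # one shared 64-wide array
    specialised = [utilisation(f, f) for f, _ in LAYERS]   # one array per layer
    print("worst-case utilisation, shared array:", round(single, 2))
    print("per-layer utilisation, specialised:", [round(u, 2) for u in specialised])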
ISBN (print): 9781467381239
Resistive Random Access Memory (RRAM)-based FPGA architectures employ RRAMs not only as memories to store the configuration but also embed them in the datapaths of programmable routing resources to propagate signals with improved performance. Sources of power consumption have been studied intensively for conventional Static Random Access Memory (SRAM)-based FPGAs, but very few works have so far examined the power characteristics of RRAM-based FPGAs. In this paper, we first analyze the power characteristics of RRAM-based multiplexers at the circuit level and then use electrical simulations to study the power consumption of RRAM-based FPGA architectures. Experimental results show that RRAM-based FPGAs achieve a power-delay product reduced by 50% compared to SRAM-based FPGAs at nominal voltage and by 20% compared to near-Vt SRAM-based FPGAs.
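The figure of merit quoted here is the standard power-delay product; stated explicitly, the reported reductions amount to

    \mathrm{PDP} = P_{\mathrm{avg}} \cdot t_{d}, \qquad
    \frac{\mathrm{PDP}_{\mathrm{RRAM}}}{\mathrm{PDP}_{\mathrm{SRAM,\ nominal}}} \approx 0.5, \qquad
    \frac{\mathrm{PDP}_{\mathrm{RRAM}}}{\mathrm{PDP}_{\mathrm{SRAM,\ near}\text{-}V_t}} \approx 0.8

where P_avg is the average power and t_d the critical-path delay of the routing structure under test.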
ISBN (print): 9781424419609
Field-Programmable Gate Arrays (FPGAs) have gained wide acceptance among low- to medium-volume applications. However, there are gaps between FPGA and custom implementations in terms of area, performance and power consumption. In recent years, specialized blocks - memories and multipliers in particular - have been shown to help reduce this gap. However, their usefulness has not been studied formally on a broad spectrum of designs. As FPGAs are prefabricated, an FPGA family must contain members of various sizes and combinations of specialized blocks to satisfy diverse design resource requirements. We formulate the family selection process as an "FPGA family composition" problem and propose an efficient algorithm to solve it. The technique was applied to an architecture similar to Xilinx Virtex FPGAs. The results show that the smart composition technique can reduce the expected silicon area by up to 55%. Providing multiplier blocks in FPGAs is also shown to reduce total area by 20% using the proposed algorithm.
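One way to picture the family-composition problem (this brute-force sketch is only an illustration of the optimisation target, not the paper's efficient algorithm, and the design mix is invented) is: given the resource requirements of a design population, pick a small set of device sizes so every design fits and the expected silicon area is minimised:

    # Illustrative formulation only (not the paper's algorithm): choose k family
    # members so every design fits some member and the frequency-weighted
    # expected silicon area is minimised.
    from itertools import combinations

    designs = [120, 300, 450, 800, 1500, 2400]      # hypothetical LUT counts
    weights = [0.30, 0.25, 0.20, 0.12, 0.08, 0.05]  # hypothetical design mix

    def expected_area(members):
        """Each design maps to the smallest family member that fits it."""
        return sum(w * min(m for m in members if m >= d)
                   for d, w in zip(designs, weights))

    k = 3
    best = min((c for c in combinations(designs, k) if max(designs) in c),
               key=expected_area)
    print("family:", best, "expected area:", round(expected_area(best), 1))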
ISBN (print): 9782839918442
FPGAs are promising platforms for efficiently executing distributed graph algorithms. Unfortunately, they are notoriously hard to program, especially as the problem size and system complexity increase. In this paper, we propose GraVF, a high-level design framework for distributed graph processing on FPGAs. It leverages the vertex-centric paradigm, which is naturally distributed and requires the user to define only very small kernels and their associated message semantics for the target application. The user design may subsequently be elaborated and compiled to the target system automatically by the framework. To demonstrate the flexibility and capabilities of the proposed framework, four graph algorithms with distinct requirements have been implemented, namely breadth-first search, PageRank, single-source shortest path, and connected components. Results show that the proposed framework is capable of producing FPGA designs with performance comparable to similar custom designs while requiring only minimal input from the user.
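The vertex-centric style such frameworks expose can be conveyed with a small software analogue; the kernel structure and names below are invented for illustration and are not the GraVF API:

    # Software analogue of a vertex-centric BFS kernel (invented names, not the
    # GraVF API): each superstep, active vertices send level + 1 to neighbours;
    # a vertex accepts the first level it receives.
    from collections import defaultdict

    def bfs_vertex_centric(edges, root):
        out = defaultdict(list)
        for u, v in edges:
            out[u].append(v)
            out[v].append(u)
        level = {root: 0}
        frontier = [root]
        while frontier:                          # one "superstep" per iteration
            messages = [(v, level[u] + 1) for u in frontier for v in out[u]]
            frontier = []
            for v, lvl in messages:              # "apply": keep the first level seen
                if v not in level:
                    level[v] = lvl
                    frontier.append(v)
        return level

    print(bfs_vertex_centric([(0, 1), (1, 2), (0, 3), (3, 4)], root=0))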
ISBN (print): 9781424410590
We demonstrate a hybrid reconfigurable cluster-on-chip architecture with a cross-platform Message Passing Interface (MPI), a cross-platform parallel image processing library and a sample application. We describe the system, the network architecture, and the implementations of the MPI library and the parallel image processing library. We validate the performance, scalability and suitability of MPI as a software interface for enabling cross-platform application parallelism on reconfigurable hybrid cluster-on-chip systems and desktop cluster systems. The presented results are promising, showing the suitability, scalability and performance of parallelising image processing algorithms with a cross-platform MPI implementation.
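The scatter/compute/gather pattern underlying MPI-parallel image processing looks roughly like the mpi4py sketch below; this is a generic desktop-side illustration, not the cross-platform library demonstrated in the paper, and the filter is a trivial stand-in:

    # Generic MPI scatter/compute/gather pattern for row-parallel image
    # processing (mpi4py sketch, not the paper's library). Run with, e.g.:
    #   mpiexec -n 4 python mpi_image_sketch.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        image = [[(r * 31 + c * 7) % 256 for c in range(8)] for r in range(8)]
        chunks = [image[i::size] for i in range(size)]    # deal rows round-robin
    else:
        chunks = None

    rows = comm.scatter(chunks, root=0)                   # each rank gets some rows
    rows = [[255 - p for p in row] for row in rows]       # per-rank work: invert pixels
    result = comm.gather(rows, root=0)                    # collect processed rows

    if rank == 0:
        print("processed", sum(len(r) for r in result), "row groups on", size, "ranks")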
ISBN (print): 9781424419609
Rapid increases in transistor density and clock speeds, together with competition from custom ICs, have escalated the demand for aggressive solutions to combat rising operating temperatures in programmable fabrics. In this work, we make several key contributions to temperature management in FPGAs. We develop a novel and robust simulation framework for exploring adaptive techniques to reduce on-chip temperatures in the reconfigurable core. We implement a thermally driven voltage scaling algorithm based on temperature and performance feedback. Our performance estimation model is an accurate empirical relation between delay, supply voltage and temperature, with an average error of 9%. Our final results show significant temperature reductions of up to 13.37 degrees C, accompanied by the added benefit of power savings averaging 13.48%. Overheads are limited to an average reduction in worst-case operating frequency of 10.78% and a voltage swing of 0.61 V.
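The feedback structure of such a scheme can be sketched as: read the temperature, estimate the critical-path delay at a candidate supply voltage, and lower Vdd only while timing is still met. The delay model below is a generic alpha-power-law stand-in with invented coefficients, not the paper's 9%-error empirical fit; only the control loop is the point:

    # Sketch of a thermally driven voltage-scaling loop (coefficients invented).
    def estimated_delay_ns(vdd, temp_c, k=2.0, vth=0.4, alpha=1.3, tc=0.002):
        """Delay grows as Vdd/(Vdd - Vth)^alpha and degrades mildly with temperature."""
        return k * vdd / (vdd - vth) ** alpha * (1.0 + tc * (temp_c - 25.0))

    def scale_voltage(temp_c, clock_period_ns, vdd=1.0, vdd_min=0.8, step=0.01):
        """Lower Vdd while the estimated critical-path delay still meets timing."""
        while (vdd - step >= vdd_min and
               estimated_delay_ns(vdd - step, temp_c) <= clock_period_ns):
            vdd -= step
        return round(vdd, 2)

    print(scale_voltage(temp_c=85.0, clock_period_ns=5.5))   # hotter die -> less headroom
    print(scale_voltage(temp_c=45.0, clock_period_ns=5.5))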
ISBN (print): 9781424419609
This paper describes an analytical model, based principally on Rent's Rule, that relates logic architectural parameters to the area efficiency of an FPGA. In particular, the model relates the lookup-table size, the cluster size, and the number of inputs per cluster to the amount of logic that can be packed into each lookup-table and cluster, and to the number of used inputs per cluster. Comparison to experimental results shows that our models are accurate. This accuracy, combined with the simple form of the equations, makes them a powerful tool for FPGA architects to better understand and guide the development of future FPGA architectures.
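For reference, Rent's Rule in its standard form, together with the commonly cited cluster-input sizing rule it helps justify, reads as follows; these are the textbook statements with the usual symbols, not the paper's specific area-efficiency model or notation:

    T = t \, N_{c}^{\,p}
        % T: external pins of a block of N_c logic cells,
        % t: Rent coefficient, p: Rent exponent

    I \approx \frac{K}{2}\,(N + 1)
        % I: distinct inputs needed by a cluster of N K-input LUTs
        % (widely used empirical sizing rule)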
ISBN (print): 9789090304281
This paper explores hardware acceleration to significantly improve the runtime of computing the forward algorithm on Pair-HMM models, a crucial step in analyzing mutations in sequenced genomes. We describe 1) the design and evaluation of a novel accelerator architecture that can efficiently process real sequence data without performing wasteful work; and 2) aggressive memoization techniques that can significantly reduce the number of invocations of, and the amount of data transferred to, the accelerator. We demonstrate the design on a Xilinx Virtex-7 FPGA in an IBM Power8 system. Our design achieves 14.85x higher throughput than an 8-core CPU baseline (which uses SIMD and multi-threading) and a 147.49x improvement in throughput per unit of energy expended on the NA12878 sample.
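The recurrence such accelerators implement is the standard Pair-HMM forward dynamic program over match, insertion and deletion states; the sketch below uses placeholder transition and emission constants (real pipelines derive them from base qualities), and the lru_cache merely stands in for the paper's far more aggressive memoization of repeated read/haplotype work:

    # Minimal Pair-HMM forward algorithm (placeholder probabilities, not the
    # paper's accelerator datapath or its memoization scheme).
    from functools import lru_cache

    T = {("M","M"): 0.9, ("M","I"): 0.05, ("M","D"): 0.05,
         ("I","M"): 0.8, ("I","I"): 0.2,
         ("D","M"): 0.8, ("D","D"): 0.2}
    P_MATCH, P_MISMATCH, P_GAP = 0.98, 0.02, 0.25

    @lru_cache(maxsize=None)
    def forward(read, hap):
        n, m = len(read), len(hap)
        M = [[0.0]*(m+1) for _ in range(n+1)]
        I = [[0.0]*(m+1) for _ in range(n+1)]
        D = [[0.0]*(m+1) for _ in range(n+1)]
        for j in range(m+1):
            D[0][j] = 1.0 / m              # free alignment start along the haplotype
        for i in range(1, n+1):
            for j in range(1, m+1):
                emit = P_MATCH if read[i-1] == hap[j-1] else P_MISMATCH
                M[i][j] = emit * (T[("M","M")]*M[i-1][j-1] +
                                  T[("I","M")]*I[i-1][j-1] +
                                  T[("D","M")]*D[i-1][j-1])
                I[i][j] = P_GAP * (T[("M","I")]*M[i-1][j] + T[("I","I")]*I[i-1][j])
                D[i][j] = T[("M","D")]*M[i][j-1] + T[("D","D")]*D[i][j-1]
        return sum(M[n][j] + I[n][j] for j in range(1, m+1))

    print(forward("ACGT", "ACGTT"))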