ISBN: 9781538685174 (digital)
ISBN: 9781538685174 (print)
Matrix-matrix multiplication is a key computational kernel for numerous applications in science and engineering, with ample parallelism and data locality that lend themselves well to high-performance implementations. Many applications that depend on matrix multiplication can use reduced-precision integer or fixed-point representations to increase their performance and energy efficiency while still offering adequate quality of results. However, precision requirements may vary between different application phases or depend on input data, rendering constant-precision solutions ineffective. We present BISMO, a vectorized bit-serial matrix multiplication overlay for reconfigurable computing. BISMO utilizes the excellent binary-operation performance of FPGAs to offer a matrix multiplication performance that scales with required precision and parallelism. We characterize the resource usage and performance of BISMO across a range of parameters to build a hardware cost model, and demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.
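The bit-serial idea in this abstract can be illustrated in software. The following is a minimal sketch, not BISMO's hardware design or API: it assumes unsigned integer operands, and the function names (`bit_plane`, `binary_matmul`, `bit_serial_matmul`) and the width parameters are hypothetical. Each operand is split into binary bit planes, every pair of planes is multiplied using only AND and popcount-style sums, and the partial products are weighted by the matching power of two.

```python
def bit_plane(matrix, bit):
    """Return the 0/1 matrix holding bit number `bit` of every entry."""
    return [[(x >> bit) & 1 for x in row] for row in matrix]


def binary_matmul(lhs, rhs):
    """Multiply two 0/1 matrices; each dot product is a sum of ANDs."""
    rhs_columns = list(zip(*rhs))
    return [
        [sum(x & y for x, y in zip(row, col)) for col in rhs_columns]
        for row in lhs
    ]


def bit_serial_matmul(a, b, a_width, b_width):
    """Compute a @ b for unsigned integers, one pair of bit planes at a time.

    a_width and b_width are the assumed operand precisions in bits, so the
    amount of work scales with a_width * b_width.
    """
    rows, cols = len(a), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(a_width):
        a_plane = bit_plane(a, i)
        for j in range(b_width):
            partial = binary_matmul(a_plane, bit_plane(b, j))
            for r in range(rows):
                for c in range(cols):
                    # Weight the binary partial product by 2^(i + j).
                    result[r][c] += partial[r][c] << (i + j)
    return result
```

Because the loop count is the product of the two bit widths, lowering the precision of either operand reduces the work proportionally, which is the scaling property the abstract describes.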
ISBN: 9781424438914 (print)
This work introduces an FPGA implementation for vessel-tree extraction on retinal images. The retinal vessel-tree can be used in disease diagnosis, e.g. for diabetes, or in person authentication. In such cases, a portable, high-performance device may be needed. The FPGA implementation discussed here, although application-oriented, features a fully programmable SIMD architecture, allowing for an efficient realization of low-level image processing algorithms. It is mapped onto a Spartan 3 and comprises 90 processing elements. The on-chip memory used is 1.4 MB and stores 8 gray images of 144 x 160 px. The working frequency is 53 MHz, allowing for a 3 x 3 convolution in less than 110 µs.
ISBN: 9781424438914 (print)
Fast carry chains featuring dedicated adder circuitry are a distinctive feature of modern FPGAs. The carry chains bypass the general routing network and are embedded in the logic blocks of FPGAs for fast addition. Conventional intuition is that such carry chains can be used only for implementing carry-propagate addition; state-of-the-art FPGA synthesizers can only exploit the carry chains for these specific circuits. This paper demonstrates that the carry chains can be used to build compressor trees, i.e., multi-input addition circuits used for parallel accumulation and partial product reduction for parallel multipliers implemented in FPGA logic. The key to our technique is to program the lookup tables (LUTs) in the logic blocks to stop the propagation of carry bits along the carry chain at appropriate points. This approach improves the area of compressor trees significantly compared to previous methods that synthesized compressor trees solely on LUTs, without compromising the performance gain over trees built from ternary carry-propagate adders.
ISBN: 9781728148847 (print)
Driven by the strong need in data processing applications, Field-Programmable Gate Arrays (FPGAs) are playing an ever-increasing role as programmable accelerators in modern computing systems. To fully unlock processing capabilities for domain-specific applications, FPGA architectures have to be tailored for seamless cooperation with other computing resources. However, prototyping a customized FPGA and bringing it to production is a costly and complex endeavor even for industrial vendors. In this paper, we introduce OpenFPGA, an open-source framework that enables rapid prototyping of customizable FPGA architectures through a semi-custom design approach. We propose an XML-to-Prototype design flow, where the Verilog netlists of a full FPGA fabric can be autogenerated using an extension of the XML language from the VTR framework and then fed into a back-end flow to generate production-ready layouts. OpenFPGA also includes a general-purpose Verilog-to-Bitstream generator for any FPGA described by the XML language. We demonstrate the capability of this automatic design flow with a Stratix IV-like FPGA architecture using a commercial 40 nm technology node, and perform a detailed comparison to its academic and commercial counterparts. Compared to the current state-of-the-art academic results, our FPGA fabric reduces the area by 1.75x and the delay by 3x on average. In addition, OpenFPGA significantly reduces the gap between semi-custom-designed FPGAs and fully optimized commercial products, with a penalty of only 60% in area and 30% in delay.
ISBN: 9782839918442 (print)
Convolutional neural networks (CNNs) are revolutionizing a variety of machine learning tasks, but they present significant computational challenges. Recently, FPGA-based accelerators have been proposed to improve the speed and efficiency of CNNs. Current approaches construct a single accelerator optimized to maximize the overall throughput of iteratively computing the CNN layers. However, this leads to dynamic resource underutilization because the same accelerator is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator design that improves the dynamic resource utilization. Using the same FPGA resources, we build multiple accelerators, each specialized for specific CNN layers. Our design achieves 1.3x higher throughput than the state of the art when evaluating the convolutional layers of the popular AlexNet CNN on a Xilinx Virtex-7 FPGA.
ISBN: 9781467381239 (print)
Resistive Random Access Memory (RRAM)-based FPGA architectures employ RRAMs not only as memories to store the configuration but also embed them in the datapaths of programmable routing resources to propagate signals with improved performance. Sources of power consumption have been intensively studied for conventional Static Random Access Memory (SRAM)-based FPGAs. However, very few works have so far focused on studying the power characteristics of RRAM-based FPGAs. In this paper, we first analyze the power characteristics of RRAM-based multiplexers at the circuit level and then use electrical simulations to study the power consumption of RRAM-based FPGA architectures. Experimental results show that RRAM-based FPGAs achieve a Power-Delay Product reduced by 50% compared to SRAM-based FPGAs at nominal voltage and by 20% compared to near-Vt SRAM-based FPGAs.
ISBN: 9782839918442 (print)
FPGAs are promising platforms to efficiently execute distributed graph algorithms. Unfortunately, they are notoriously hard to program, especially when the problem size and system complexity increase. In this paper, we propose GraVF, a high-level design framework for distributed graph processing on FPGAs. It leverages the vertex-centric paradigm, which is naturally distributed and requires the user to define only very small kernels and their associated message semantics for the target application. The user design may subsequently be elaborated and compiled to the target system automatically by the framework. To demonstrate the flexibility and capabilities of the proposed framework, four graph algorithms with distinct requirements have been implemented, namely breadth-first search, PageRank, single-source shortest path, and connected components. Results show that the proposed framework is capable of producing FPGA designs with performance comparable to similar custom designs while requiring only minimal input from the user.
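The vertex-centric paradigm mentioned in this abstract can be sketched in plain software. The example below is an illustrative assumption, not GraVF's interface: the names `bfs_kernel` and `run_vertex_centric` are hypothetical, the graph is a dictionary of adjacency lists, and messages carry candidate BFS levels. The user-written part is only the small kernel; the surrounding loop stands in for what a framework would generate.

```python
def bfs_kernel(state, messages):
    """Hypothetical per-vertex kernel for breadth-first search.

    state is the vertex's current level (None if unvisited) and messages
    are candidate levels from neighbors. Returns the new state and the
    value to send to neighbors, or None to stay silent.
    """
    best = min(messages)
    if state is None or best < state:
        return best, best + 1
    return state, None


def run_vertex_centric(graph, source):
    """Deliver messages in rounds until no vertex has anything to send."""
    levels = {vertex: None for vertex in graph}
    inbox = {source: [0]}  # seed the source vertex with level 0
    while inbox:
        outbox = {}
        for vertex, messages in inbox.items():
            levels[vertex], outgoing = bfs_kernel(levels[vertex], messages)
            if outgoing is not None:
                for neighbor in graph[vertex]:
                    outbox.setdefault(neighbor, []).append(outgoing)
        inbox = outbox
    return levels
```

Swapping in a different kernel and message meaning (for example, tentative distances for shortest paths) would reuse the same driver loop, which is the division of labor the abstract attributes to the paradigm.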
ISBN: 9781424410590 (print)
We demonstrate a hybrid reconfigurable cluster-on-chip architecture with a cross-platform Message Passing Interface (MPI), a cross-platform parallel image processing library and a sample application. We describe the system, the network architecture, and the implementations of the MPI library and the parallel image processing library. We validate the performance, scalability and suitability of MPI as a software interface to enable cross-platform application parallelism on reconfigurable hybrid cluster-on-chip systems and desktop cluster systems. The presented results are promising, showing the suitability, scalability and performance of parallelising image processing algorithms with a cross-platform MPI implementation.
ISBN: 9781424403127 (print)
The affective content of a video is defined as the expected amount and type of emotion contained in the video. Utilizing this affective content will extend the current scope of application possibilities. The dimensional approach to representing emotion can play an important role in the development of an affective video content analyzer. The three basic affect dimensions are defined as valence, arousal and control [1]. This paper presents a novel FPGA-based system for modeling the arousal content of a video based on user saliency and film grammar. The design is implemented on a Xilinx Virtex-II xc2v6000 on an RC300 board, and it runs 25 times faster than a Pentium 4-based PC at 3.4 GHz.
ISBN: 9781424403127 (print)
This paper presents Archlog, a language and framework for designing multiprocessor architectures in the logic programming domain. Our goal is to enable application developers in areas such as machine learning and cognitive robotics to produce high-performance designs for reconfigurable devices, without detailed knowledge of hardware development. The Archlog framework provides a high level of abstraction, enabling rapid system generation while supporting high performance. In this paper we present the Archlog language and its library-based compilation framework, which makes use of a customisable logic programming processor. The system generates multiple designs, with different trade-offs in the use of reconfigurable logic and embedded memories. An implementation of a multiprocessor for the machine learning system Progol on a 40 MHz XC2V6000 FPGA is 10 times faster than a 2 GHz Pentium 4 processor.