ISBN: 9782839918442 (print)
With the introduction of the Stratix V family, the FPGA vendor Altera now fully supports partial reconfiguration in all its recent FPGA devices. A distinctive feature of the Altera architecture is that reconfigurable regions can be defined arbitrarily, which is made possible by writing a configuration mask before writing the actual configuration data to the FPGA fabric. In this paper, we present the details and the flow for implementing partial reconfiguration on Altera FPGAs, as well as a study of configuration bitstream sizes and configuration speeds for various resource and bounding-box aspect-ratio variants. The results are used to build a partial reconfiguration controller featuring a lightweight but effective bitstream decompression module that greatly improves configuration speed on a DE5-Net board.
Nowadays, FPGAs are integrated in high-performance computing systems, servers, or even used as accelerators in System-on-Chip (SoC) platforms. Since the execution is performed in hardware, FPGA gives much higher perfo...
ISBN: 9782839918442 (print)
Image features are broadly used in embedded computer vision applications, from object detection and tracking to motion estimation and 3D reconstruction. Efficient feature extraction and description are crucial because such applications must process a constant stream of input data in real time. High-speed computation typically comes at the cost of high power dissipation, yet embedded systems are often highly power constrained, which makes power-aware solutions especially critical for these systems. In this paper, we present a power and performance evaluation of three low-cost feature detection and description algorithms implemented on various embedded systems (embedded CPUs, GPUs, and FPGAs). We show that FPGAs in particular offer attractive solutions for both performance and power, and we describe several design techniques used to accelerate feature extraction and description algorithms on low-cost Zynq SoC FPGAs.
ISBN: 9782839918442 (print)
Field-programmable gate arrays (FPGAs) benefit from the most advanced CMOS technology nodes in order to meet the increasing demand for high-performance, low-power digital integrated circuits. This makes them susceptible to various reliability challenges at the nanoscale. In this paper, we focus on aging degradation of the look-up table (LUT) in FPGAs. We characterize the delay degradation of a LUT as a function of the duty cycle of the stress vectors. We also find that, owing to the NBTI aging mechanism, the duty cycle strongly affects the fall delay and moderately affects the rise delay of the LUT. Furthermore, we investigate a semi-empirical model of LUT timing degradation due to NBTI as a function of time and the duty cycle of the stress vector. This model can be used to predict the degradation of a complex circuit implemented in an FPGA, and especially the risk of timing violations due to NBTI aging.
ISBN: 9782839918442 (print)
Configuration scrubbing is a technique for repairing Single Event Upsets (SEUs) within the configuration memory of an FPGA. Scrubbing approaches have been developed using hardware external to the FPGA that communicates through a configuration port, and using hardware within the FPGA that communicates with an Internal Configuration Access Port (ICAP). More recent FPGAs such as the Xilinx Zynq 7-Series SoCs provide internal programmable processors that can configure the FPGA logic very rapidly using an internal Processor Configuration Access Port (PCAP). These SoC/FPGAs also provide automatic internal scrubbing through the use of high-speed readback and configuration error correction. This paper presents a novel form of FPGA configuration scrubbing for the Zynq-7000 SoC family that combines the high-speed PCAP configuration port with internal scrubbing. This novel scrubber corrects single-bit upsets in several microseconds and detects these upsets in 8 ms.
ISBN: 9782839918442 (print)
TCP/IP is widely used both on the Internet and in data centers. The protocol makes very few assumptions about the underlying network and provides useful guarantees such as reliable transmission, in-order delivery, and flow control. The price for this functionality is complexity, latency, and computational overhead, which is especially pronounced in software implementations. While this is acceptable for Internet communication, the overhead is too high in data centers. In this paper, we explore how to optimize a TCP/IP stack running on an FPGA for data center applications with an emphasis on data processing (e.g., key-value stores). Using a key-value store and a low-latency consensus protocol implemented on an FPGA as examples of the requirements that arise in data centers, we provide an extensive analysis of the overheads of TCP/IP and of the solutions that can be adopted to minimize them. The proposed optimized TCP/IP stack minimizes tail latencies (a key metric in distributed data processing) and is implemented efficiently enough to share the FPGA with application logic.
ISBN: 9782839918442 (print)
Recurrent neural networks (RNNs) provide state-of-the-art accuracy for analytics on sequential datasets (e.g., language modeling). This paper studies a state-of-the-art RNN variant, the Gated Recurrent Unit (GRU). We first propose a memoization optimization that avoids 3 of the 6 dense matrix-vector multiplications (SGEMVs) that make up the majority of the computation in a GRU. Then, we study the opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, a GPU, and a multi-core CPU. Results show that the FPGA provides superior performance per watt over the CPU and GPU because its on-chip BRAMs, hard DSPs, and reconfigurable fabric allow fine-grained parallelism to be extracted efficiently from the small and medium-sized matrices used by the GRU. Moreover, newer FPGAs with more DSPs, more on-chip BRAMs, and higher frequency have the potential to narrow the FPGA-ASIC efficiency gap.
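The abstract does not spell out how the memoization works. The sketch below shows one plausible reading, under the assumption that inputs come from a finite vocabulary, so the three input-side products (W_z x, W_r x, W_h x) can be cached per token and only the three hidden-state products are computed at every step. The class name `MemoGRU`, the weight layout, and the toy values are hypothetical, and biases are omitted for brevity.

```python
import math

def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class MemoGRU:
    """GRU cell that caches input-side products per token (assumed scheme)."""

    def __init__(self, W, U, embeddings):
        self.W, self.U, self.embeddings = W, U, embeddings
        self.cache = {}          # token -> {"z": W_z x, "r": W_r x, "h": W_h x}
        self.input_matvecs = 0   # input-side SGEMVs actually executed

    def _input_side(self, token):
        if token not in self.cache:
            x = self.embeddings[token]
            self.cache[token] = {g: matvec(self.W[g], x) for g in "zrh"}
            self.input_matvecs += 3
        return self.cache[token]

    def step(self, token, h):
        wx = self._input_side(token)
        # The three hidden-state products must be computed on every step.
        uz = matvec(self.U["z"], h)
        ur = matvec(self.U["r"], h)
        z = [sigmoid(a + b) for a, b in zip(wx["z"], uz)]
        r = [sigmoid(a + b) for a, b in zip(wx["r"], ur)]
        uh = matvec(self.U["h"], [ri * hi for ri, hi in zip(r, h)])
        cand = [math.tanh(a + b) for a, b in zip(wx["h"], uh)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

# Toy weights and a two-token vocabulary (illustrative values only).
W = {g: [[0.1, -0.2], [0.3, 0.4]] for g in "zrh"}
U = {g: [[0.5, 0.0], [0.0, 0.5]] for g in "zrh"}
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}

gru = MemoGRU(W, U, embeddings)
h = [0.0, 0.0]
for token in [0, 1, 0, 0, 1]:
    h = gru.step(token, h)
```

For this five-step sequence with two distinct tokens, the cell executes 6 input-side products instead of the 15 an unmemoized cell would need; the hidden-state products are unaffected.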
ISBN: 9782839918442 (print)
Convolutional neural networks (CNNs) are revolutionizing a variety of machine learning tasks, but they present significant computational challenges. Recently, FPGA-based accelerators have been proposed to improve the speed and efficiency of CNNs. Current approaches construct an accelerator optimized to maximize the overall throughput of iteratively computing the CNN layers. However, this approach leads to dynamic resource underutilization because the same accelerator is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator design that improves the dynamic resource utilization. Using the same FPGA resources, we build multiple accelerators, each specialized for specific CNN layers. Our design achieves 1.3x higher throughput than the state of the art when evaluating the convolutional layers of the popular AlexNet CNN on a Xilinx Virtex-7 FPGA.
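The underutilization argument can be illustrated with a simple tiling model. The sketch assumes an accelerator that processes `tn` input channels by `tm` output channels per cycle, so a layer whose channel counts are not multiples of those factors leaves multipliers idle. This model, the function name `utilization`, and the layer shapes are assumptions made for illustration; the abstract does not describe the design at this level.

```python
from math import ceil

def utilization(n_in, m_out, tn, tm):
    """Fraction of multiplier slots doing useful work for one layer.

    Assumed model: channels are padded up to multiples of the unroll
    factors (tn, tm), and the padded slots are wasted.
    """
    padded = ceil(n_in / tn) * tn * ceil(m_out / tm) * tm
    return (n_in * m_out) / padded

# Hypothetical (input channels, output channels) pairs for two layers.
layers = [(3, 48), (48, 128)]

# One accelerator shape shared by every layer (hypothetical unroll factors).
shared = [utilization(n, m, 8, 64) for n, m in layers]

# An accelerator shaped for the first layer alone.
specialized_first = utilization(3, 48, 3, 48)
```

Under these assumptions the shared shape keeps the second layer fully busy but uses only about 28% of its multipliers on the first layer, whereas a unit shaped for the first layer is fully used. This is the kind of mismatch that motivates building several specialized accelerators from the same resource budget.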
ISBN: 9782839918442 (print)
FPGAs are promising platforms for efficiently executing distributed graph algorithms. Unfortunately, they are notoriously hard to program, especially as problem size and system complexity increase. In this paper, we propose GraVF, a high-level design framework for distributed graph processing on FPGAs. It leverages the vertex-centric paradigm, which is naturally distributed and requires the user to define only very small kernels and their associated message semantics for the target application. The user design may subsequently be elaborated and compiled to the target system automatically by the framework. To demonstrate the flexibility and capabilities of the proposed framework, four graph algorithms with distinct requirements have been implemented: breadth-first search, PageRank, single-source shortest path, and connected components. Results show that the proposed framework is capable of producing FPGA designs with performance comparable to similar custom designs while requiring only minimal input from the user.
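To give a feel for the vertex-centric paradigm, here is a minimal software sketch of breadth-first search written as a small per-vertex kernel plus message passing. It is not GraVF's interface: the names `bfs_kernel` and `run_supersteps`, the synchronous superstep loop, and the example graph are all assumptions chosen for illustration.

```python
from collections import defaultdict

INF = float("inf")

def bfs_kernel(state, messages):
    """Per-vertex kernel (hypothetical): keep the smallest level received.

    Returns (new_state, outgoing_message), where the message is None
    when the vertex has nothing new to report.
    """
    best = min(messages)
    if best < state:
        return best, best + 1   # improved: offer neighbours the next level
    return state, None          # unchanged: stay silent

def run_supersteps(adjacency, source):
    """Apply the kernel in synchronous supersteps until no messages remain."""
    levels = {vertex: INF for vertex in adjacency}
    inbox = {source: [0]}       # seed message activates the source vertex
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            levels[vertex], out = bfs_kernel(levels[vertex], messages)
            if out is not None:
                for neighbour in adjacency[vertex]:
                    outbox[neighbour].append(out)
        inbox = outbox
    return levels

# Small directed example graph; vertex 4 is unreachable from the source.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: []}
levels = run_supersteps(graph, source=0)
```

The user-written part is only `bfs_kernel` and the meaning of its messages; the driver loop stands in for what a framework would generate. In this example, vertices 1 and 2 end at level 1, vertex 3 at level 2, and vertex 4 keeps its initial infinite level.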
ISBN: 9782839918442 (print)
Sharing multi-cycle hardware blocks like the DSP48E1 primitive in Xilinx FPGAs can result in significant resource savings, but it complicates scheduling. For high throughput, DSP blocks must be pipelined, which results in a high initiation interval (II) for resource-shared implementations. In this paper, we propose a resource reduction technique that minimises DSP block usage while also offering improved II over traditional approaches. This is integrated in a high-level tool that takes datapath descriptions in C and generates synthesisable Verilog RTL with different levels of resource sharing. We demonstrate significantly improved throughput compared to traditional resource sharing, while achieving resource reduction compared to resource-unconstrained and HLS implementations. The approach explores an otherwise infeasible design space between resource-unconstrained and traditional resource-sharing methods.
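The trade-off between sharing and initiation interval can be made concrete with the standard resource-constrained lower bound from modulo scheduling. If each pipelined DSP block accepts one new operation per cycle, a datapath needing a given number of DSP operations per iteration cannot start iterations more often than once every ceil(operations / blocks) cycles. This is a generic bound, not the technique proposed in the paper; the function name and the figure of eight operations are illustrative assumptions.

```python
from math import ceil

def min_initiation_interval(ops_per_iteration, dsp_blocks):
    """Resource-constrained lower bound on II.

    Assumes every DSP block is fully pipelined and can accept one
    operation per clock cycle.
    """
    if dsp_blocks < 1:
        raise ValueError("at least one DSP block is required")
    return ceil(ops_per_iteration / dsp_blocks)

# Sweep the sharing level for a hypothetical datapath with 8 DSP operations.
bounds = {blocks: min_initiation_interval(8, blocks) for blocks in (1, 2, 4, 8)}
```

With a single shared block the bound is 8 cycles per iteration, and it falls to 1 only when every operation has its own block. The design space between those two extremes is the one the abstract refers to.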