ISBN: (print) 9782839918442
Configuration scrubbing is a technique for repairing Single Event Upsets (SEUs) in the configuration memory of an FPGA. Scrubbing approaches have been developed using hardware external to the FPGA that communicates through a configuration port, and using hardware within the FPGA that communicates with an internal configuration access port (ICAP). More recent FPGAs such as the Xilinx Zynq 7-Series SoCs provide internal programmable processors that can configure the FPGA logic very rapidly using an internal Processor Configuration Access Port (PCAP). These SoC/FPGAs also provide automatic internal scrubbing through the use of high-speed readback and configuration error correction. This paper presents a novel form of FPGA configuration scrubbing for the Zynq-7000 SoC family that combines the high-speed PCAP configuration port with internal scrubbing. This novel scrubber corrects single-bit upsets in several microseconds and detects these upsets in 8 ms.
This tutorial describes the Why and How of the new 65-nm families of Virtex-5 FPGAs. It describes several aspects of the technology that affect speed, density, and power consumption. The basic device structure and pac...
ISBN: (print) 9798350341515
With the deployment of FPGAs in a data center, there is the opportunity to build large multi-FPGA applications. In this paper, we design a partitioner to address the problem of efficiently assigning the various tasks of a large multi-FPGA application to individual network-connected FPGAs according to constraints that consider resource usage, communication bandwidth and communication latency. By using simulated annealing, we can modify the cost function as new objectives and constraints are determined. We build on the Galapagos multi-FPGA platform by introducing a multi-die shell to extend Galapagos to more recent FPGA boards, and we design the partitioner to work on any collection of single- and multi-die FPGAs. Finally, we evaluate the new shell and partitioner using micro-benchmarks and analyze the partitioning of a real-world multi-FPGA application, a Transformer model.
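As a rough illustration of why simulated annealing makes it easy to swap in new objectives, the sketch below keeps the cost function as a plain callable passed to the annealer. It is not the paper's partitioner: the names `cost` and `anneal`, the overflow penalty, the cooling schedule, and the toy task graph are all assumptions made for this example, and only resource usage and cut traffic are modelled (latency is omitted).

```python
import math
import random


def cost(assign, tasks, edges, capacity, penalty=100.0):
    """Hypothetical cost: traffic crossing FPGAs plus a penalty per unit of resource overflow."""
    used = [0] * len(capacity)
    for task, fpga in enumerate(assign):
        used[fpga] += tasks[task]
    overflow = sum(max(0, u - c) for u, c in zip(used, capacity))
    cut = sum(w for (a, b), w in edges.items() if assign[a] != assign[b])
    return cut + penalty * overflow


def anneal(n_tasks, n_fpgas, cost_fn, steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Generic annealer: cost_fn can be replaced without touching the search loop."""
    rng = random.Random(seed)
    assign = [rng.randrange(n_fpgas) for _ in range(n_tasks)]
    cur = cost_fn(assign)
    best, best_cost = list(assign), cur
    temp = t0
    for _ in range(steps):
        # Propose moving one randomly chosen task to a randomly chosen FPGA.
        task = rng.randrange(n_tasks)
        old = assign[task]
        assign[task] = rng.randrange(n_fpgas)
        new = cost_fn(assign)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
            if cur < best_cost:
                best, best_cost = list(assign), cur
        else:
            assign[task] = old  # reject the move
        temp *= alpha  # geometric cooling
    return best, best_cost


# Toy instance (assumed): six unit-size tasks, two FPGAs that each hold three tasks.
tasks = [1, 1, 1, 1, 1, 1]
capacity = [3, 3]
edges = {(0, 1): 5, (1, 2): 5, (0, 2): 5, (3, 4): 5, (4, 5): 5, (3, 5): 5, (2, 3): 1}
best, best_cost = anneal(len(tasks), len(capacity),
                         lambda a: cost(a, tasks, edges, capacity))
```

Adding a latency term or a new constraint in this sketch means changing only `cost`; the search loop is untouched.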
ISBN: (digital) 9781538685174
ISBN: (print) 9781538685174
The power consumption of digital circuits, e.g., Field-Programmable Gate Arrays (FPGAs), is directly related to their operating supply voltages. On the other hand, chip vendors usually introduce a conservative voltage guardband below the standard nominal level to ensure the correct functionality of the design in worst-case process and environmental scenarios. For instance, this voltage guardband is empirically measured to be 12%, 20%, and 16% of the nominal level in commercial CPUs [1], Graphics Processing Units (GPUs) [2], and Dynamic RAMs (DRAMs) [3], respectively. However, in many real-world applications, this guardband is extremely conservative, and eliminating it can result in significant power savings without any overhead. Motivated by these studies, we aim to extend the undervolting technique to commercial FPGAs. Toward this goal, we will practically demonstrate the voltage guardband for a representative Xilinx FPGA, with a preliminary concentration on on-chip memories, or Block RAMs (BRAMs).
ISBN: (print) 9781424419609
Exploiting the underutilisation of variable-length DSP algorithms during normal operation is vital when seeking to maximise the achievable functionality of an application within a peak power budget. A system-level, low-power design methodology for FPGA-based, variable-length DSP IP cores is presented. Algorithmic commonality is identified and resources are mapped onto a configurable datapath to increase achievable functionality. The methodology is applied to a digital receiver application, where a 100% increase in operational capacity is achieved in certain modes without significant power or area budget increases. Measured results show that the resulting architectures require 19% less peak power, 33% fewer multipliers and 12% fewer slices than existing architectures.
ISBN: (print) 9781424438914
Many high-performance applications involve large data sets that are impossible to fit entirely within the on-chip memories of even the largest FPGAs. As a result, they must be stored in off-chip SDRAMs and loaded onto the FPGAs as computations progress. Because of the high latency and energy consumption associated with off-chip memory accesses, it is important to develop efficient operation schedules that minimize not only the latency of computations but also the amount of data I/O. We formulate this problem as a modified resource-constrained job scheduling problem. The problem is then solved using a list scheduling algorithm that takes advantage of the fast burst-mode access of SDRAMs. Results show that for large problem sizes, the performance of our algorithm is within 1% of a hand-optimized matrix-matrix multiplication implementation, with no memory overhead, and is within 0.03% of the theoretical minimum latency of an 8-by-8 cofactor matrix computation.
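For readers unfamiliar with list scheduling, the sketch below shows the generic resource-constrained form: each cycle, the ready operations are ranked by a priority and the top few are issued. It does not model SDRAM burst access or data I/O, which are the paper's actual contributions; the function name `list_schedule`, the unit-latency assumption, the longest-path priority, and the example graph are all assumptions chosen for illustration.

```python
def list_schedule(deps, units):
    """Schedule unit-latency operations on `units` identical resources.

    deps maps each operation to the list of operations it depends on.
    Returns a dict mapping each operation to its start cycle.
    """
    # Build successor lists so priorities can be computed from the sinks upward.
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    # Priority = length of the longest path from the operation to any sink.
    prio = {}

    def height(op):
        if op not in prio:
            prio[op] = 1 + max((height(s) for s in succs[op]), default=0)
        return prio[op]

    for op in deps:
        height(op)

    start = {}
    cycle = 0
    while len(start) < len(deps):
        # An operation is ready once all predecessors started in an earlier cycle.
        ready = [op for op in deps
                 if op not in start
                 and all(p in start and start[p] < cycle for p in deps[op])]
        # Highest priority first; ties broken by name for determinism.
        ready.sort(key=lambda op: (-prio[op], op))
        for op in ready[:units]:
            start[op] = cycle
        cycle += 1
    return start


# Assumed example: c needs a and b, d needs c, e is independent; two resources.
deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": []}
schedule = list_schedule(deps, units=2)
```

In this example the critical path a/b, c, d is issued first and the independent operation `e` fills the spare slot in the second cycle, so the whole graph finishes in three cycles.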
ISBN: (print) 9781424419609
Dividing an application between a conventional processor and an acceleration card with FPGA chips has proved to be a suitable way to accelerate computationally intensive tasks. In such applications, the designer usually has to implement an interconnection between components placed in the FPGA and the host system bus. This task is often complicated by the differing requirements of user components for throughput, latency of read operations, need for DMA transfers, etc. The objective of this work is to show a new approach to implementing interconnection systems and to enable the designer to focus on the development of the target application. The proposed interconnection system is based on a tree topology. The system eliminates the sensitivity of wide buses to distance, supports the connection of components with different throughput requirements, supports a split-transaction model, and provides many other features. The proposed system is implemented and evaluated on chips with Virtex-5 technology.
ISBN: (print) 9789090304281
Large number multiplication has always been an essential operation in cryptographic algorithms. In this paper, we propose Broken-Karatsuba multiplication, which applies the non-least-positive form to represent large numbers and exploits the parallelism hidden in conventional Karatsuba multiplication. Further, we modify the Montgomery modular multiplication algorithm with Broken-Karatsuba multiplication to make it suitable for pipeline implementation with fewer hardware resources. Based on this modified algorithm, a 256-bit two-stage modular multiplier is constructed. There is no stall in the pipeline when performing consecutive modular multiplications, and the delay of a modular multiplication is reduced significantly. Implemented on Virtex-6 FPGA platforms, our design outperforms most previous works in terms of modular multiplication latency and area-time product, which makes it suitable for server-side applications.
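For reference, the conventional Karatsuba recursion that the abstract takes as its starting point can be sketched in software as follows. This is only the textbook algorithm, not the Broken-Karatsuba variant or the Montgomery pipeline described above; the function name `karatsuba`, the 64-bit base-case threshold, and the test operands are assumptions made for this example.

```python
def karatsuba(x: int, y: int, threshold_bits: int = 64) -> int:
    """Multiply non-negative integers with the conventional Karatsuba recursion."""
    # Small operands: fall back to the built-in multiply (assumed base case).
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y

    # Split both operands at the same bit position.
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask

    # Three sub-multiplications instead of four.
    z2 = karatsuba(x_hi, y_hi, threshold_bits)
    z0 = karatsuba(x_lo, y_lo, threshold_bits)
    z1 = karatsuba(x_hi + x_lo, y_hi + y_lo, threshold_bits) - z2 - z0

    # Recombine: x*y = z2 * 2^(2*half) + z1 * 2^half + z0.
    return (z2 << (2 * half)) + (z1 << half) + z0


# Assumed 256-bit test operands.
a = (1 << 255) + 123456789
b = (1 << 256) - 189
product = karatsuba(a, b)
```

The three sub-products `z2`, `z0` and the middle term are independent of one another, which is the kind of parallelism a hardware implementation can draw on.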
ISBN: (print) 9781424403127
This paper presents a novel tool flow combining rewriting logic with hardware synthesis. It enables the automated generation of synthesizable VHDL code from mathematical equations and the quick generation of functionally equivalent alternative implementations. The simple but powerful semantics of rewriting logic provide a natural mechanism for manipulating algebraic expressions at a high level of abstraction, which is afterwards automatically converted into lower levels of abstraction. The design flow is validated by generating polynomial approximations for arbitrary continuous functions. The polynomial generation process is completely parameterized with respect to polynomial degree, number representation parameters, word width and polynomial evaluation approach. Different functionally equivalent implementations of the resulting polynomial approximations were generated and synthesized for a Virtex-4 device.
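To make "functionally equivalent alternative implementations" concrete, the sketch below evaluates the same polynomial in two algebraically equivalent ways: the direct sum of powers and the Horner form obtained by rewriting. The abstract does not say which evaluation approaches the tool generates, so Horner's rule is an assumption here, as are the function names and the example coefficients; exact rational arithmetic is used so the equivalence can be checked without rounding effects.

```python
from fractions import Fraction


def eval_direct(coeffs, x):
    """Evaluate c0 + c1*x + c2*x^2 + ... term by term."""
    return sum(c * x**i for i, c in enumerate(coeffs))


def eval_horner(coeffs, x):
    """Evaluate the same polynomial as (...(cn*x + c(n-1))*x + ...)*x + c0."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


# Assumed example: p(x) = 1 - 3x + 2x^3, evaluated at x = 1/2.
coeffs = [1, -3, 0, 2]
x = Fraction(1, 2)
direct = eval_direct(coeffs, x)
horner = eval_horner(coeffs, x)
```

Both forms compute the same value, but the Horner form needs one multiplication and one addition per coefficient, which is the kind of trade-off that matters when choosing between equivalent hardware implementations.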
ISBN: (print) 9781424419609
Rapid increases in transistor density, clock speeds and competition with custom ICs have escalated the demand for aggressive solutions to battle rising operating temperatures in programmable fabrics. In this work, we make several key contributions to temperature management in FPGAs. We develop a novel and robust simulation framework exploring adaptive techniques to reduce on-chip temperatures in the reconfigurable core. We implement a thermal-driven voltage scaling algorithm based on temperature and performance feedback. Our performance estimation model is an accurate empirical relation between delay, supply voltage and temperature with an average error of 9%. Our final results show significant temperature reductions of up to 13.37 degrees C, accompanied by the added benefit of power savings averaging 13.48%. Overheads are limited to an average reduction in worst-case operating frequency of 10.78% and a voltage swing of 0.61 V.