ISBN: 9781450311557 (print)
Good FPGA placement is crucial to obtaining the best Quality of Results (QoR) from FPGA hardware. Although many published global placement techniques place objects in a continuous ASIC-like environment, FPGAs are discrete in nature, and a continuous algorithm cannot always achieve superior QoR by itself. Therefore, discrete FPGA-specific detail placement algorithms are used to improve the global placement results. Unfortunately, most of these detail placement algorithms do not have a global view. This paper presents a discrete "middle" placer that fills the gap between the two placement steps. It works like simulated annealing but leverages various acceleration techniques, so it does not pay the runtime penalty typical of simulated annealing solutions. Experiments show that with this placer, final QoR is significantly better than with the global-detail placer approach.
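For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of plain simulated-annealing placement on a discrete grid. It is not the paper's "middle" placer: the abstract does not describe its acceleration techniques, so none are modeled here. The cost function (half-perimeter wirelength), the swap move, the cooling schedule, and all names such as `hpwl` and `anneal` are assumptions chosen for illustration.

```python
import math
import random


def hpwl(placement, nets):
    """Half-perimeter wirelength: sum of bounding-box half-perimeters over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(placement, nets, temp=5.0, cooling=0.95,
           moves_per_temp=50, min_temp=0.01, seed=0):
    """Swap-based simulated annealing; returns the best placement seen and its cost.

    `placement` maps block name -> (x, y) site. Swapping two blocks keeps the
    placement legal because every site still holds exactly one block.
    """
    rng = random.Random(seed)
    current = dict(placement)
    cost = hpwl(current, nets)
    best, best_cost = dict(current), cost
    blocks = list(current)
    while temp > min_temp:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]
            new_cost = hpwl(current, nets)
            delta = new_cost - cost
            # Metropolis criterion: always accept improvements, sometimes accept
            # worse moves so the search can escape local minima.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                # Rejected: undo the swap.
                current[a], current[b] = current[b], current[a]
        temp *= cooling
    return best, best_cost


# Hypothetical example: six blocks on a 3x2 grid, three two-pin nets.
initial = {"a": (0, 0), "b": (2, 1), "c": (1, 0),
           "d": (0, 1), "e": (2, 0), "f": (1, 1)}
nets = [["a", "b"], ["c", "d"], ["e", "f"]]
best, best_cost = anneal(initial, nets)
```

Because the sketch returns the best placement seen rather than the final one, the returned cost is never worse than the starting cost. The runtime penalty mentioned in the abstract comes from the large number of moves evaluated at each temperature.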
ISBN: 9781450311557 (print)
Overlay processor architectures allow FPGAs to be programmed by non-experts using software, but prior designs have mainly been based on the architecture of their ASIC predecessors. In this paper we develop a new processor architecture that from the beginning accounts for and exploits the predefined widths, depths, maximum operating frequencies, and other discretizations and limits of the underlying FPGA components. The result is Octavo, a ten-pipeline-stage, eight-threaded processor that operates at the block RAM maximum of 550 MHz on a Stratix IV FPGA. Octavo is highly parameterized, allowing us to explore trade-offs in datapath and memory width, memory depth, and number of supported thread contexts.
ISBN: 9781450311557 (print)
With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers are confronted with the problem of searching a huge design space that has been shown to have widely varying performance and energy metrics for different accelerators, different application domains, and different use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that FPGAs can achieve speedups of up to 11x and 57x compared to GPUs and multicores, respectively, while also using orders of magnitude less energy.
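As an illustration of what a sliding-window application looks like, the sketch below slides a small template over a 2D image and computes the sum of absolute differences (SAD) at every position. SAD is chosen here only as a representative kernel; the abstract does not name the kernels that were evaluated, and the function name `sliding_sad` is hypothetical.

```python
def sliding_sad(image, template):
    """Return a 2D list where entry [r][c] is the SAD between `template`
    and the image window whose top-left corner is at (r, c)."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    result = []
    for r in range(ih - th + 1):
        row = []
        for c in range(iw - tw + 1):
            # Every window position is independent of the others, which is
            # what makes this pattern attractive for parallel hardware.
            sad = sum(abs(image[r + i][c + j] - template[i][j])
                      for i in range(th) for j in range(tw))
            row.append(sad)
        result.append(row)
    return result


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
template = [[5, 6],
            [8, 9]]
scores = sliding_sad(image, template)  # [[16, 12], [4, 0]]
```

Neighbouring windows overlap heavily, so an implementation that buffers and reuses input pixels reads far less memory than this naive version, which refetches every pixel for every window.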
ISBN: 9781450356145 (print)
In this paper, we describe a systolic field-programmable gate array (FPGA) implementation of the Fastfood algorithm that is optimised to run at a high frequency. The Fastfood algorithm supports online learning for large-scale kernel methods. Empirical results show that 500 MHz clock rates can be sustained for an architecture that can solve problems with input dimensions that are 10^3 times larger than previously reported. Unlike many recent deep learning publications, this design implements both training and prediction. This enables the use of kernel methods in applications requiring a rare combination of capacity, adaptation and speed.
ISBN: 9781450361378 (print)
In this paper we describe Xilinx's Versal (TM) Adaptive Compute Acceleration Platform (ACAP). ACAP is a hybrid compute platform that tightly integrates traditional FPGA programmable fabric, software programmable processors and software programmable accelerator engines. ACAP improves over the programmability of traditional reconfigurable platforms by introducing newer compute models in the form of software programmable accelerators and by separating out the data movement architecture from the compute architecture. The Versal architecture includes a host of new capabilities, including a chip-pervasive programmable Network-on-Chip (NoC), Imux Registers, compute shell, more advanced SSIT, adaptive deskew of global clocks, faster configuration, and other new programmable elements as well as enhancements to the CLB and interconnect. We discuss these architectural developments and highlight their key motivations and differences in relation to traditional FPGA architectures.
ISBN: 9781605584102 (print)
This article presents the performance evaluation of two new diagonal routing tracks in FPGAs. We discuss the automatic detailed architecture generation issues and propose changes to conventional placement and routing to suit these architectures better. We conduct a series of experiments on these architectures with the MCNC benchmarks, where key parameters are varied over practical ranges, and we conclude that the results agree well with what the theory predicts. Copyright 2009 ACM.
ISBN: 9781450338561 (print)
Glitches are unnecessary transitions on logic signals that needlessly consume dynamic power. Glitches arise from imbalances in the combinational path delays to a signal, which may cause the signal to toggle multiple times in a given clock cycle before settling to its final value. In this paper, we propose a low-cost circuit structure that is able to eliminate a majority of glitches. The structure, which is incorporated into the output buffers of FPGA logic elements, suppresses pulses on buffer outputs whose duration is shorter than a configurable time window (set at the time of FPGA configuration). Glitches are thereby eliminated "at the source", ensuring they do not propagate into the high-capacitance FPGA interconnect, saving power. An experimental study, using Altera commercial tools for power analysis, demonstrates that the proposed technique eliminates 70% of glitches, at a cost of a 1% reduction in speed performance.
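The pulse-suppression behaviour described above can be modeled in software. The sketch below is a behavioural approximation only, not the paper's circuit: it assumes a signal sampled at uniform time steps and a window expressed as a number of samples, and the names `suppress_glitches` and `count_transitions` are hypothetical.

```python
def suppress_glitches(samples, window):
    """Filter a sampled binary signal so that the output changes only after
    the input has held the new value for `window` consecutive samples.
    Pulses shorter than `window` never reach the output."""
    if not samples:
        return []
    out = []
    stable = samples[0]  # value currently driven on the output
    run = 0              # consecutive samples that disagree with `stable`
    for s in samples:
        if s == stable:
            run = 0      # the pulse ended before the window elapsed
        else:
            run += 1
            if run >= window:
                stable = s
                run = 0
        out.append(stable)
    return out


def count_transitions(samples):
    """Number of value changes, a rough proxy for switching activity."""
    return sum(a != b for a, b in zip(samples, samples[1:]))


raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0]
filtered = suppress_glitches(raw, window=3)
# filtered == [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

In this model the one-sample pulse disappears and the real transition is delayed by the window length, so the output toggles once instead of four times. That delay is a property of this simple model and is consistent with the abstract reporting a small speed cost; the trailing two-sample low pulse is also dropped because it is shorter than the window.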
The size of configuration bitstreams of field-programmable gate arrays (FPGAs) is increasing rapidly. Compression techniques are used to decrease the size of bitstreams. In this paper, an appropriate bitstream format a...
ISBN: 9781595936004 (print)
Power consumption in field-programmable gate arrays (FPGAs) has become an important issue as the FPGA market has grown to include mobile platforms. In this work we present a power-aware logic optimization tool that is specialized to facilitate subsequent power-aware technology mapping. Our synthesis framework uses binary decision diagram (BDD) based collapsing and decomposition techniques in conjunction with signal switching estimates to achieve power-efficient circuit networks. The results of synthesis and subsequent power-aware technology mapping are evaluated using two distinct physical design platforms: academic VPR and Altera Quartus II. Our approach achieves an average energy reduction of 13% for Altera Cyclone II devices versus synthesis with SIS-based algebraic optimization, at the cost of 11% average circuit performance, if performance-optimal technology mapping is performed after synthesis. If technology mapping is tuned to achieve the same average delay for both SIS and BDD-based flows, a 3% average energy reduction is achieved by our new synthesis approach.