ISBN: 9781450370998 (print)
Since the inception of FPGAs over two decades ago, the micro-architectures and macro-architectures of FPGAs across all FPGA vendors have been converging strongly, to the point that comparable FPGAs from the main FPGA vendors had virtually the same use models and the same programming models. User designs were getting easier to port from one vendor to another with every generation. Recent developments from different FPGA vendors targeting the most advanced semiconductor technology nodes are an abrupt and disruptive break from this trend, especially at the macro-architectural level.
ISBN: 9781450305549 (print)
In this paper we propose new techniques for thermal and power characterization of field-programmable gate arrays (FPGAs) using infrared imaging. For thermal characterization, we capture the thermal emissions from the backside of an FPGA chip during operation. We analyze the captured emissions and quantify the extent of thermal gradients and hot spots in FPGAs. Given that FPGAs are fabricated with no knowledge of the designs that will later run on them in the field, we propose soft sensing techniques that combine the measurements of hard sensors to accurately estimate temperatures at locations where no sensors are embedded. For power characterization, we propose algorithmic techniques to invert the thermal emissions from FPGAs into spatial power estimates. We demonstrate how this technique can be used to produce spatial power maps of soft processors during operation.
ISBN: 9781605584102 (print)
This paper describes a bus mastering implementation of the PCI Express protocol using a Xilinx FPGA. While the theoretical peak performance of PCI Express is quite high, attaining that performance is a complex endeavor on top of an already complex protocol. The implementation is described and its performance is analyzed. Source code is offered for free download via the web. Copyright 2009 ACM.
The effect of logic block functionality on FPGA performance and density is investigated. In particular, in the context of lookup-table (LUT), cluster-based, island-style FPGAs, the effect of LUT size and cluster size on the speed and logic density of an FPGA is analyzed. A fully timing-driven experimental flow is used, in which a set of benchmark circuits is synthesized into different cluster-based logic block architectures, which contain groups of LUTs and flip-flops.
ISBN: 9781595939340 (print)
The Field Programmable Counter Array (FPCA) was introduced to improve FPGA performance for arithmetic circuits. An FPCA is a reconfigurable IP core that can be integrated into an FPGA. To exploit the FPCA, a circuit is transformed by merging disparate addition and multiplication operations into large multi-input addition operations, which are synthesized as compressor trees on the FPCA; the remaining portion of the circuit is synthesized on the FPGA. This paper presents a series of architectural improvements to the FPCA that reduce routing delay, increase flexibility and component utilization, and simplify the integration process. Using an FPGA containing six FPCAs, we observed average and maximum speedups of 1.60x and 2.40x on a set of arithmetic benchmarks.
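To make the idea of a compressor tree concrete, the following is a minimal software sketch of generic carry-save (3:2) compression, in which a multi-input addition is reduced to two words before a single carry-propagating add. It is not the FPCA architecture from the abstract; the function names `compress_3_to_2`, `compressor_tree_sum`, and `multiply_via_tree` are hypothetical and chosen only for illustration.

```python
def compress_3_to_2(a, b, c):
    """Carry-save step: reduce three operands to a sum word and a carry word.

    The identity a + b + c == sum_word + carry_word holds bit by bit,
    and no carry has to ripple across bit positions.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def compressor_tree_sum(operands):
    """Add non-negative integers with 3:2 compression and one final addition."""
    values = list(operands)
    while len(values) > 2:
        next_values = []
        # Compress every full group of three; leftovers pass through unchanged.
        full = len(values) - len(values) % 3
        for i in range(0, full, 3):
            next_values.extend(compress_3_to_2(*values[i:i + 3]))
        next_values.extend(values[full:])
        values = next_values
    # The only carry-propagating addition happens here.
    return sum(values)


def multiply_via_tree(a, b):
    """Treat a multiplication as a multi-input addition of shifted partial products."""
    partial_products = [a << i for i in range(b.bit_length()) if (b >> i) & 1]
    return compressor_tree_sum(partial_products)
```

In this sketch, `multiply_via_tree` shows why merging is attractive: a multiplication is just another source of addends, so its partial products can share one tree with neighbouring additions.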
ISBN: 9781595936004 (print)
We present an architecture for a synthesizable, datapath-oriented Field-Programmable Gate Array (FPGA) core which can be used to provide post-fabrication flexibility to a System-on-Chip (SoC). Our architecture is optimized for bus-based operations that are common in signal processing and computation-intensive applications. It employs a directional routing architecture, which allows it to be synthesized using standard ASIC design tools and flows. We also describe a proof-of-concept layout of our core. It is shown that the proposed architecture is significantly more area efficient than the best previously reported synthesizable programmable logic core.
The proceedings contain 26 papers from FPGA 2002, the Tenth ACM International Symposium on Field-Programmable Gate Arrays. Topics discussed include: interconnect enhancements for a high-speed PLD architecture; FPGA switch block layout and evaluation; a faster distributed arithmetic architecture for FPGAs; efficient circuit clustering for area and power reduction in FPGAs; and integrated retiming and placement for field-programmable gate arrays.
ISBN: 9781450311557 (print)
Good FPGA placement is crucial to obtain the best Quality of Results (QoR) from FPGA hardware. Although many published global placement techniques place objects in a continuous ASIC-like environment, FPGAs are discrete in nature, and a continuous algorithm cannot always achieve superior QoR by itself. Therefore, discrete FPGA-specific detail placement algorithms are used to improve the global placement results. Unfortunately, most of these detail placement algorithms do not have a global view. This paper presents a discrete "middle" placer that fills the gap between the two placement steps. It works like simulated annealing but leverages various acceleration techniques, so it does not pay the runtime penalty typical of simulated annealing solutions. Experiments show that with this placer, final QoR is significantly better than with the global-detail placer approach.
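For readers unfamiliar with the baseline the abstract refers to, here is a minimal sketch of plain simulated-annealing placement on a discrete grid of sites. It is textbook annealing with a half-perimeter wirelength cost, not the paper's "middle" placer, whose acceleration techniques are not described in the abstract. The function names, the cost function, and all parameter values (`start_temp`, `cooling`, `moves_per_temp`) are assumptions made for illustration.

```python
import math
import random


def wirelength(placement, nets):
    """Total half-perimeter wirelength (HPWL) over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_placement(blocks, nets, width, height, seed=0,
                     start_temp=5.0, cooling=0.95, moves_per_temp=200):
    """Place blocks on a width x height grid; return the best placement seen.

    Assumes len(blocks) <= width * height. Moves only swap two placed
    blocks, so every intermediate placement stays legal (one block per site).
    """
    rng = random.Random(seed)
    sites = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))  # random legal starting point
    cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    temp = start_temp
    while temp > 0.01:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = wirelength(placement, nets)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:
                placement[a], placement[b] = placement[b], placement[a]  # undo
        temp *= cooling
    return best, best_cost
```

The full cost recomputation after every move is the kind of overhead that makes naive annealing slow; a production placer would update the cost incrementally.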
ISBN: 9781450338561 (print)
Bitwidth optimization of FPGA datapaths can save hardware resources by choosing the fewest bits required for each datapath variable to achieve a desired quality of result. However, it is an NP-hard problem that requires unacceptably long runtimes when using sequential CPU-based heuristics. We show how to parallelize the key steps of bitwidth optimization on the GPU by performing a fast brute-force search over a carefully constrained search space. We develop a high-level synthesis methodology suitable for rapid prototyping of bitwidth-annotated RTL code generation using GCC's GIMPLE backend. For range analysis, we perform parallel evaluation of sub-intervals to provide tighter bounds compared to ordinary interval arithmetic. For bitwidth allocation, we enumerate the different bitwidth combinations in parallel by assigning each combination to a GPU thread. We demonstrate up to 10-1000x speedups for range analysis and 50-200x speedups for bitwidth allocation when comparing an NVIDIA K20 GPU implementation to an Intel Core i5-4570 CPU, while maintaining identical solution quality across various benchmarks. This allows us to generate tailor-made RTL with minimum bitwidths in hundreds of milliseconds instead of hundreds of minutes when starting from high-level C descriptions of dataflow computations.
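The claim that sub-interval evaluation gives tighter bounds than ordinary interval arithmetic can be illustrated with a small sketch. The example function f(x) = x * (1 - x), the helper names, and the sequential loop are all assumptions made for illustration; the paper evaluates sub-intervals in parallel on a GPU, which this sketch does not attempt.

```python
def interval_sub(a, b):
    """Interval subtraction: [a_lo, a_hi] - [b_lo, b_hi]."""
    return (a[0] - b[1], a[1] - b[0])


def interval_mul(a, b):
    """Interval multiplication: take the extremes of all endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def f_interval(x):
    """Interval evaluation of the illustrative function f(x) = x * (1 - x)."""
    return interval_mul(x, interval_sub((1.0, 1.0), x))


def range_by_subintervals(lo, hi, pieces):
    """Split [lo, hi] into equal pieces, bound f on each, and take the union.

    Each piece is independent of the others, which is what makes this
    step easy to parallelize.
    """
    step = (hi - lo) / pieces
    bounds = [f_interval((lo + i * step, lo + (i + 1) * step))
              for i in range(pieces)]
    return (min(b[0] for b in bounds), max(b[1] for b in bounds))


plain = f_interval((0.0, 1.0))                 # (0.0, 1.0)
refined = range_by_subintervals(0.0, 1.0, 4)   # (0.0, 0.375)
```

Ordinary interval arithmetic treats the two occurrences of x as unrelated and reports an upper bound of 1.0, while four sub-intervals already bring it down to 0.375; the true maximum of f on [0, 1] is 0.25. A tighter range means fewer integer bits have to be reserved for the variable.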