ISBN (print): 1595932925
Programmable logic devices such as FPGAs are useful for a wide range of applications. However, FPGAs are not commonly used in battery-powered applications because they consume more power than ASICs and lack power management features. In this paper, we describe the design and implementation of Pika, a low-power FPGA core targeting battery-powered applications such as those in consumer and automotive markets. Our design uses the Xilinx Spartan-3 low-cost FPGA as a baseline and achieves substantial power savings through a series of power optimizations. The resulting architecture is compatible with existing commercial design tools. The implementation is done in a 90nm triple-oxide CMOS process. Compared to the baseline design, Pika consumes 46% less active power and 99% less standby power. Furthermore, it retains circuit and configuration state during standby mode, and wakes up from standby mode in approximately 100ns. Copyright 2006 ACM.
ISBN (print): 9781581134520
Field-programmable core arrays (FPCAs) will include various computing cores for a wide variety of applications ranging from DSP to general-purpose computing. With the increasing gap between core computing speeds and memory access latency, managing and orchestrating the movement of data across multiple cores will become increasingly important. In this paper we propose data reorganization engines that support a wide variety of data reorganizations within as well as between memory modules for future FPCAs. We have experimented with a suite of data reorganizations pervasive in DSP applications. Our limited set of experiments reveals that the proposed designs for these engines are flexible and use little design area in current FPGA fabrics, making them easy to integrate into future FPCAs as either soft or hard macros.
The purpose of this paper is to detail the method and findings of an architectural exploration of mixed-granularity field-programmable gate arrays (FPGAs). The work carried out for this study involves the creation of an analytical framework within which a set of benchmark circuits can be studied. The idea is to maximise the performance over all benchmark circuits by choosing an optimal set of silicon cores to be placed within a given area constraint. When connected with flexible configurable routing, these cores should together be capable of implementing any one of the benchmark circuits. In this paper the problem is cast as a formal optimisation and solved using existing optimisation tools. Any multiplication or memory operation is allowed to be implemented either by configuring fine-grain resources or by using specialised functional units such as those found in a Xilinx Virtex 2 FPGA. The design space is explored by examining the tradeoffs between area, speed and flexibility. The architectures generated are contrasted with commercial architectures that have fixed ratios of functional units and, in addition, a sensitivity analysis is performed to see how the results are affected by the architectural parameters of the problem.
FPGAs are seeing a large increase in their applications, especially with the introduction of state-of-the-art FPGAs built in nanometer technologies. This has been accompanied by a large increase in power dissipation in FPGAs, which is a roadblock to the integration of FPGAs in several hand-held applications. Motivated by the growing share of leakage power in the total power dissipation of modern technologies, this work presents a complete CAD flow to mitigate leakage power dissipation in FPGAs. The algorithm is based on an FPGA architecture that employs multi-threshold CMOS technology. The flow is based on the VPR flow and aims to pack and place logic blocks that exhibit similar idleness close to each other so that they can be turned off during their idle time. The flow was tested with a CMOS 0.13μm dual-Vth technology and achieved an average power saving of 22%.
ISBN (print): 9781450343541
Increasing computational power enables various new applications that were previously runtime-prohibitive. The FPGA is one such source of computational power, offering both reconfigurability and energy efficiency. In this paper, we demonstrate the feasibility of eyeglasses-free displays through FPGA acceleration. Specifically, we propose several techniques to accelerate sparse matrix-vector multiplication and the L-BFGS iterative optimization algorithm, taking the characteristics of FPGAs into account. The experimental results show that we reach a 12.78X overall speedup of the eyeglasses-free display application.
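As a point of reference for the sparse matrix-vector multiplication kernel named in this abstract, here is a minimal software sketch. It assumes the common compressed sparse row (CSR) storage format; the abstract does not say which format or which FPGA-specific optimizations the authors use, and the function name `csr_matvec` and the sample matrix are illustrative only.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[r]..row_ptr[r + 1] delimits row r inside `values`
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # Only the stored nonzeros of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Assumed example matrix (not from the paper):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
values = [2, 1, 3, 4, 5]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
result = csr_matvec(values, col_idx, row_ptr, [1, 2, 3])
print(result)  # [5.0, 6.0, 19.0]
```

The inner loop length varies per row, which is one reason this kernel is usually memory-bound and irregular, and a plausible motivation for hardware-specific restructuring.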
Hardware-software co-design is the new trend for deep neural network and FPGA accelerator development, which iteratively revises and tunes the full system. The bottleneck of the approach lies in the time-consuming har...
In recent years the challenge of building high-performance, low-power, retargetable embedded systems has been addressed with different technological and architectural solutions. In this paper we present a new configurable unit explicitly designed to implement additional reconfigurable pipelined datapaths, suitable for the design of reconfigurable processors. A VLIW reconfigurable processor has been implemented on silicon in a standard 0.18 μm CMOS technology to prove the effectiveness of the proposed unit. Testing on a benchmark of signal processing algorithms showed speedups from 4.3x to 13.5x and energy consumption reductions of up to 92%.
The general computing world settled on radix 2 floating-point representations over three decades ago. The analyses which led to this choice were all based on the underlying premise that the goal of a floating-point representation is to maximize numerical accuracy per bit of data. However, the unique nature of FPGA-based computations makes numerical accuracy per unit of FPGA resources a more important measure by which to judge the usefulness of a given floating-point representation. Due to the high cost of shifters as implemented on FPGAs, higher radix floating-point representations are uniquely suited to FPGA-based computations, especially high-precision calculations which require the support of denormalized numbers. Higher radix representations use FPGA resources more efficiently. For example, a radix 16 adder requires 20% fewer LUTs than its radix 2 counterpart, while delivering equal worst-case and better average-case numerical accuracy.
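The link between radix and shifter cost can be illustrated with a rough count. In a radix 2^k format, alignment and normalization shift the mantissa by whole digits of k bits, so fewer distinct shift amounts are needed. The sketch below only counts shift amounts and 2:1 multiplexer levels of a logarithmic shifter; it is not the paper's resource model, and the 52-bit mantissa width and function names are assumptions chosen for illustration.

```python
def shift_amounts(mantissa_bits, radix):
    """Number of distinct shift amounts (including zero) for a given radix."""
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two")
    bits_per_digit = radix.bit_length() - 1  # log2(radix) for powers of two
    # Ceiling division: shifts occur in steps of one digit.
    return -(-mantissa_bits // bits_per_digit)


def mux_levels(amounts):
    """Levels of 2:1 multiplexers in a logarithmic shifter with `amounts` settings."""
    return (amounts - 1).bit_length()


for radix in (2, 16):
    n = shift_amounts(52, radix)  # 52-bit mantissa is an assumed example width
    print(radix, n, mux_levels(n))
# 2 52 6
# 16 13 4
```

Under these assumptions the radix 16 shifter needs two fewer multiplexer levels across the mantissa width, which is consistent in direction with the LUT savings the abstract reports, though the reported 20% figure comes from the authors' own implementation.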
ISBN (print): 1595932925
The performance benefits of a monolithically stacked 3D-FPGA, whereby the programming overhead of an FPGA is stacked on top of a standard CMOS layer containing the logic blocks and interconnects, are investigated. A Virtex-II style 2D-FPGA fabric is used as a baseline for quantifying the relative improvements in logic density, delay, and power consumption achieved by such a 3D-FPGA. It is assumed that only the pass-transistor switches and configuration memory cells can be moved to the top layers and that the 3D-FPGA employs the same logic block and programmable interconnect architecture as the baseline 2D-FPGA. Assuming a configuration memory cell with at most 0.7 times the area of an SRAM cell and pass-transistor switches with the same characteristics as nMOS devices in the CMOS layer, it is shown that a monolithically stacked 3D-FPGA can achieve 3.2 times higher logic density, 1.7 times lower critical path delay, and 1.7 times lower total dynamic power consumption than the baseline 2D-FPGA fabricated in the same 65nm technology node. Copyright 2006 ACM.
ISBN (print): 9781595939340
A design tool for routing channel segmentation in island-style FPGAs is presented. Given the FPGA architecture parameters and a set of benchmark designs, the tool optimizes routing channel segmentation using the average interconnect power-delay product as a performance metric, which is estimated from placed and routed designs. A simulated-annealing procedure is used, whereby segmentation is incrementally changed in each iteration, the benchmark designs are mapped using VPR, and the performance metric is computed to decide whether to accept or reject the new segmentation. Run time is significantly reduced by using incremental routing in each iteration and parallelizing the metric evaluation. Experimental results using the MCNC benchmark designs demonstrate an average of 22% and 15% reduction in delay and power, respectively, relative to a baseline segmentation. The results also show that average segment length should decrease with technology scaling. Finally, we demonstrate how the tool can be used to optimize other aspects of programmable routing in an FPGA. Copyright 2008 ACM.
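The accept/reject loop described in this abstract follows the general shape of simulated annealing, which can be sketched as below. This is a generic sketch, not the authors' tool: `toy_cost` is a hypothetical stand-in for mapping the benchmarks with VPR and measuring the power-delay product, the segmentation is modeled as an assumed tuple of track counts per segment length, and the cooling schedule parameters are arbitrary.

```python
import math
import random


def anneal(initial, cost, perturb, t_start=1.0, t_end=1e-3,
           alpha=0.95, moves_per_t=20, seed=0):
    """Generic simulated annealing; returns the best state seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            candidate = perturb(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha
    return best, best_cost


def toy_cost(seg):
    # Hypothetical metric: squared distance from an assumed "ideal" track mix.
    target = (4, 8, 4)
    return sum((a - b) ** 2 for a, b in zip(seg, target))


def perturb(seg, rng):
    # Incremental change: move one track from one segment length to another.
    i, j = rng.sample(range(len(seg)), 2)
    if seg[i] == 0:
        return seg
    new = list(seg)
    new[i] -= 1
    new[j] += 1
    return tuple(new)


start = (16, 0, 0)
best, best_cost = anneal(start, toy_cost, perturb)
print(best, best_cost)
```

In the tool the abstract describes, each cost evaluation involves placing and routing benchmark designs, which is why the authors emphasize incremental routing and parallel metric evaluation to keep run time manageable.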