ISBN: 9781595930293 (print)
Large, high-density FPGAs with high local distributed memory bandwidth surpass the peak floating-point performance of high-end, general-purpose processors. Microprocessors do not deliver near their peak floating-point performance on efficient algorithms that use the Sparse Matrix-Vector Multiply (SMVM) kernel. In fact, it is not uncommon for microprocessors to yield only 10-20% of their peak floating-point performance when computing SMVM. We develop and analyze a scalable SMVM implementation on modern FPGAs and show that it can sustain high-throughput, near-peak floating-point performance. For benchmark matrices from the Matrix Market Suite we project 1.5 double-precision Gflops/FPGA for a single Virtex II 6000-4 and 12 double-precision Gflops for 16 Virtex IIs (750 Mflops/FPGA). Copyright 2005 ACM.
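For readers unfamiliar with the SMVM kernel named above, the following is a minimal software sketch of y = A·x with A held in compressed sparse row (CSR) form. The CSR layout, the function name `csr_matvec`, and the small example matrix are assumptions chosen for illustration; the abstract does not say which storage format or dataflow the FPGA design uses.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # One multiply-add per stored nonzero; the indirect read of x
        # through col_idx is the irregular memory access of the kernel.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Illustrative 3x3 matrix (an assumption, not a Matrix Market benchmark):
# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = csr_matvec(values, col_idx, row_ptr, x)
```

Each stored nonzero costs two floating-point operations but needs a value, an index, and a gathered element of x, which is one common explanation for why memory bandwidth, not arithmetic, tends to limit this kernel.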
FPGAs are seeing a large increase in their applications, especially with the introduction of state-of-the-art FPGAs built in nanometer technologies. This has been accompanied by a large increase in FPGA power dissipation, which is a roadblock to integrating FPGAs into many hand-held applications. Motivated by the growing share of leakage power in total power dissipation in modern technologies, this work presents a complete CAD flow to mitigate leakage power dissipation in FPGAs. The algorithm is based on an FPGA architecture that employs multi-threshold CMOS technology. The flow is based on the VPR flow and aims to pack and place logic blocks that exhibit similar idleness close to each other so that they can be turned off during their idle time. The flow was tested with a 0.13 μm dual-Vth CMOS technology and achieved an average power saving of 22%.
FPGAs provide a speed advantage in processing for embedded systems, especially when processing is moved close to the sensors. Perhaps the ultimate embedded system is a neural prosthetic, where probes are inserted into the brain and recorded electrical activity is analyzed to determine which neurons have fired. In turn, this information can be used to manipulate an external device such as a robot arm or a computer mouse. To make the detection of these signals possible, some baseline data must be processed to correlate impulses to particular neurons. One method for processing this data uses a statistical clustering algorithm called Expectation Maximization, or EM. In this paper, we examine the EM clustering algorithm, determine the most computationally intensive portion, map it onto a reconfigurable device, and show several areas of performance gain.
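As a rough illustration of the EM clustering algorithm mentioned above, here is a minimal sketch for a one-dimensional mixture of two Gaussians. The one-dimensional data, the function names, the starting parameters, and the variance floor are all assumptions made for this example; the abstract does not describe the signal dimensionality or say which portion of EM was mapped to hardware.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(data, weights, means, variances):
    """Expectation: responsibility of each component for each sample."""
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp


def m_step(data, resp, var_floor=1e-6):
    """Maximization: re-estimate weights, means, and variances."""
    weights, means, variances = [], [], []
    for k in range(len(resp[0])):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        weights.append(nk / len(data))
        means.append(mean)
        variances.append(max(var, var_floor))  # guard against collapse
    return weights, means, variances


# Toy samples forming two well-separated groups (illustrative only).
data = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]
weights, means, variances = [0.5, 0.5], [-0.5, 0.5], [0.25, 0.25]
for _ in range(30):
    resp = e_step(data, weights, means, variances)
    weights, means, variances = m_step(data, resp)
```

In this sketch the E-step evaluates one exponential per sample per component on every iteration, which makes it the more arithmetic-heavy of the two steps here; whether that matches the paper's profiling is not stated in the abstract.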
This paper proposes an integrated framework for the high-level design of high-performance implementations of signal processing algorithms on FPGAs. The framework emerged from a constant need to rapidly implement increasingly complicated algorithms on FPGAs while maintaining the high performance needed in many real-time digital signal processing applications. This is particularly important for application developers, who often rely on iterative and interactive development methodologies. The central idea behind the proposed framework is to dynamically integrate high-performance structural hardware description languages with higher-level hardware languages in order to help satisfy the dual requirement of high-level design and high-performance implementation. The paper illustrates this by integrating two environments: Celoxica's Handel-C language, and HIDE, a structural hardware environment developed at the Queen's University of Belfast.
ISBN: 9781595930293 (print)
Even with the HiCuts algorithm, one of the most effective algorithms for packet classification, the on-line search for each input packet still consumes a large share of the main CPU's computation resources if it is performed in software. An effective alternative is to use a hardware co-processor to perform the on-line search. Based on the principle of the HiCuts algorithm, this paper presents the architecture of a hardware on-line search co-processor implemented on an FPGA. In particular, the mapping of the decision tree, and of the linear search in each leaf node, onto the memory data structure is described in detail. Thanks to a multiple-pipeline structure, a total of 12 search engines work in parallel to achieve very high search speed (8M packet headers/second). The simulation results provide a useful guide for optimizing the off-line pre-processing and the co-processor design.
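The on-line search described above (walk a decision tree, then search the rules in a leaf linearly) can be sketched in software as follows. The class names, the two-field headers, the equal-width cuts, and the example rule set are assumptions for illustration only; they do not reproduce the paper's memory layout or its pipelined search engines.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class Rule:
    rule_id: int
    ranges: Tuple[Tuple[int, int], ...]  # inclusive (low, high) per field

    def matches(self, header: Tuple[int, ...]) -> bool:
        return all(lo <= h <= hi for h, (lo, hi) in zip(header, self.ranges))


@dataclass
class Leaf:
    rules: List[Rule]  # kept in priority order, highest first


@dataclass
class Internal:
    dim: int    # header field this node cuts on
    low: int    # lower bound of the region covered by this node
    width: int  # width of each equal-sized cut
    children: List[Union["Internal", Leaf]]


def classify(node: Union[Internal, Leaf], header: Tuple[int, ...]) -> Optional[int]:
    """Return the id of the first matching rule, or None if none match."""
    # Tree walk: pick the child whose cut contains the header field value.
    while isinstance(node, Internal):
        index = (header[node.dim] - node.low) // node.width
        node = node.children[index]
    # Leaf: linear search over the few rules that overlap this region.
    for rule in node.rules:
        if rule.matches(header):
            return rule.rule_id
    return None


# Hypothetical rule set over two 4-bit fields (values 0..15).
r0 = Rule(0, ((0, 7), (0, 15)))
r1 = Rule(1, ((8, 15), (0, 7)))
r2 = Rule(2, ((0, 15), (0, 15)))  # default rule
root = Internal(dim=0, low=0, width=8,
                children=[Leaf([r0, r2]), Leaf([r1, r2])])

results = [classify(root, h) for h in [(3, 9), (12, 4), (12, 12)]]
```

Because each header follows one root-to-leaf path independently of the others, lookups of this kind are natural candidates for running on several engines at once.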
The purpose of this paper is to detail the method and findings of an architectural exploration of mixed-granularity field-programmable gate arrays (FPGAs). The work carried out for this study involves the creation of an analytical framework within which a set of benchmark circuits can be studied. The idea is to maximise the performance over all benchmark circuits by choosing an optimal set of silicon cores to be placed within a given area constraint. When connected with flexible configurable routing, these cores should together be capable of implementing any one of the benchmark circuits. In this paper the problem is cast as a formal optimisation and solved using existing optimisation tools. Any multiplication or memory operation may be implemented either by configuring fine-grain resources or by using specialised functional units such as those found in a Xilinx Virtex 2 FPGA. The design space is explored by examining the tradeoffs between area, speed and flexibility. The architectures generated are contrasted with commercial architectures that have fixed ratios of functional units and, in addition, a sensitivity analysis is performed to see how the results are affected by the architectural parameters of the problem.
ISBN: 9781595930293 (print)
As the logic capacity of field-programmable gate arrays (FPGAs) increases, they are increasingly used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bit-slices) that are connected by regularly structured signals (called buses), it is possible to exploit datapath regularity to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multi-bit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multi-bit routing architecture can achieve a 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area saving of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture, including the most area-efficient granularity values and the most area-efficient proportion of bus-based connections. Copyright 2005 ACM.
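The two percentages quoted above can be related with a line of arithmetic. The sketch below assumes that all of the saving comes from routing and that the non-routing area is unchanged; that assumption, and the resulting routing share, are inferences for illustration and are not figures stated in the abstract.

```python
# Figures quoted in the abstract.
routing_area_reduction = 0.14   # routing area shrinks by 14%
overall_area_reduction = 0.10   # total FPGA area shrinks by 10%

# ASSUMPTION: only routing area changes, so
#   overall_reduction = routing_share * routing_reduction
implied_routing_share = overall_area_reduction / routing_area_reduction

print(f"implied routing share of total area: {implied_routing_share:.1%}")
```

Under that assumption, routing would account for roughly 70% of the baseline FPGA area.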
The routing channels of today's FPGAs consist of wire segments of various types. This routing architecture makes it possible to exploit new techniques that enhance the routability of net segments in channels in order to support engineering change orders (ECO). In this paper we present an optimal greedy algorithm that switches the track to which each net segment is assigned, in order to enhance the routability of newly added nets and so enable ECO. We used the routing architecture of Xilinx Virtex II FPGAs as our target and integrated our algorithm into the VPR FPGA routing tool. The experimental results show that the algorithm reduces the number of tracks by 9% on average. It allows 28.4% more rerouting than the existing router of the VPR tool, which is based on Dijkstra's maze router algorithm.
Sequential control systems in industry have been used in applications based on Programmable Logic Controllers (PLCs). These systems are, in general, highly complex, with an operation cycle of around 1 ms or 10 ms. PLCs are generally expensive for these highly complex applications. In this work, a dynamically reconfigurable approach is presented, based on the Xilinx Virtex-II FPGA architecture, operating as a virtual hardware machine. In this context, the control process is specified in the industrial standard language SFC/Petri net (Sequential Function Chart). For large controllers, a partial dynamic reconfiguration mechanism is used: the controller is split into multiple contexts, which are executed sequentially within the same FPGA without violating the operation cycle of the system, despite the reconfiguration overhead. The solution is cost compatible with current PLCs for complex applications and can reach better performance by exploiting the potential parallelism of control descriptions.
ISBN: 9781595930293 (print)
Vdd-programmable FPGAs have been proposed recently to reduce FPGA power: Vdd levels can be customized for different circuit elements, and unused circuit elements can be power-gated. In this paper, we first develop an accurate FPGA power model and then design novel Vdd-programmable interconnect switches with a minimum number of configuration SRAM cells. Applying our power model to placed and routed benchmark circuits, we evaluate Vdd-programmable FPGA architectures using the new switches. The best architecture in our study uses Vdd-programmable logic blocks and Vdd-gateable interconnects. Compared to a baseline architecture similar to the leading commercial architecture, the best architecture reduces the minimal energy-delay product by 44.14% with 48% area overhead and a 3% increase in SRAM cells. Our evaluation results also show that LUT size 4 always gives the lowest energy consumption, while LUT size 7 always leads to the highest performance for all evaluated architectures. Copyright 2005 ACM.