The general computing world settled on radix-2 floating-point representations over three decades ago. The analyses that led to this choice all rested on the premise that the goal of a floating-point representation is to maximize numerical accuracy per bit of data. However, the unique nature of FPGA-based computation makes numerical accuracy per unit of FPGA resources a more important measure of a floating-point representation's usefulness. Because shifters are costly to implement on FPGAs, higher-radix floating-point representations are uniquely suited to FPGA-based computation, especially high-precision calculations that require support for denormalized numbers. Higher-radix representations use FPGA resources more efficiently. For example, a radix-16 adder requires 20% fewer LUTs than its radix-2 counterpart while delivering equal worst-case and better average-case numerical accuracy.
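The shifter argument above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the 24-bit significand width, and the cost model (one multiplexer stage per bit of shift amount in a logarithmic shifter) are all assumptions chosen for illustration, and the sketch does not attempt to reproduce the 20% LUT figure.

```python
import math

def normalize(significand, width, digit_bits):
    """Left-shift by whole digits until the most significant digit is non-zero.

    digit_bits = 1 models radix 2, digit_bits = 4 models radix 16.
    Returns (normalized significand, number of digit shifts performed).
    """
    if significand == 0:
        return 0, 0
    top_shift = width - digit_bits  # bit position of the leading digit
    shifts = 0
    while (significand >> top_shift) == 0:
        significand <<= digit_bits
        shifts += 1
    return significand, shifts

def shifter_stages(width, digit_bits):
    """Assumed cost model: a logarithmic shifter needs one mux stage per
    bit of the shift amount, i.e. ceil(log2(number of shift positions))."""
    positions = width // digit_bits
    return math.ceil(math.log2(positions))

if __name__ == "__main__":
    sig = 0x000160  # hypothetical 24-bit significand with leading zeros
    print(normalize(sig, 24, 1), shifter_stages(24, 1))  # radix 2
    print(normalize(sig, 24, 4), shifter_stages(24, 4))  # radix 16
```

Under this simplified model the radix-16 normalizer selects among 6 digit positions rather than 24 bit positions, so it needs fewer multiplexer stages, at the price of up to three unused leading zero bits in the significand.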
The routing channels of today's FPGAs consist of wire segments of various types. This routing architecture lets us exploit new techniques that improve the routability of net segments in channels in order to support engineering change orders (ECOs). In this paper we present an optimal greedy algorithm that switches the track to which each net segment is assigned, improving the routability of newly added nets to enable ECO. We used the routing architecture of Xilinx Virtex-II FPGAs as our target routing architecture and integrated our algorithm into the VPR FPGA routing tool. The experimental results show that the algorithm reduces the number of tracks by 9% on average. It allows 28.4% more rerouting than the existing router of the VPR tool, which is based on Dijkstra's maze router algorithm.
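The abstract does not spell out the algorithm, so the sketch below uses a stand-in: the classic left-edge heuristic, a well-known greedy method for assigning horizontal segments to channel tracks. It is not the authors' track-switching algorithm and ignores the differing wire-segment types of a real FPGA channel. The segment names and column spans are hypothetical.

```python
def left_edge_assign(segments):
    """Greedy left-edge track assignment (stand-in, not the paper's method).

    segments: list of (name, left, right) with inclusive column spans.
    Returns (dict mapping name -> track index, number of tracks used).
    """
    track_right_end = []  # rightmost occupied column on each track
    assignment = {}
    # Visit segments in order of their left edge.
    for name, left, right in sorted(segments, key=lambda s: (s[1], s[2])):
        for track, last_right in enumerate(track_right_end):
            if last_right < left:  # segment fits after the last one
                track_right_end[track] = right
                assignment[name] = track
                break
        else:
            # No existing track is free at this column: open a new one.
            track_right_end.append(right)
            assignment[name] = len(track_right_end) - 1
    return assignment, len(track_right_end)

if __name__ == "__main__":
    nets = [("a", 0, 3), ("b", 4, 7), ("c", 2, 5), ("d", 6, 9)]
    print(left_edge_assign(nets))
```

Packing segments onto as few tracks as possible is what leaves free tracks for nets added later by an ECO, which is the effect the abstract reports as a reduction in the number of tracks.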
Today's high-end video and multimedia processing applications require huge amounts of memory. For cost reasons, the use of conventional dynamic RAM (SDRAM) is preferred. However, SDRAM access optimization is a complex task, especially if multi-stream access with different QoS (Quality of Service) requirements is involved. At the SIPS 2003 conference, we presented a multi-stream DDR-SDRAM controller IP covering combinations of low-latency requirements for processor cache access, hard real-time constraints for periodic video signals, and hard real-time bursty accesses for video coprocessors. To handle these contradictory QoS requirements at high system performance, a combination of a two-stage scheduling algorithm and static priorities was used. This poster describes an additional flow control that greatly enhances the overall performance and controllability. The efficient but simple design makes the controller well suited for FPGA-based designs. Experiments with our FPGA-based high-end video platform demonstrate the superiority of this architecture.
Dynamically Reconfigurable Systems (DRS) offer a very interesting alternative for embedded digital systems design. Task scheduling within a reconfigurable environment allows the development of systems with better execution performance, smaller chip area, and lower power consumption. This paper describes a Petri net based methodology for the design of dynamically reconfigurable systems, in which task scheduling has as its prime objective the best temporal performance of the overall application. The methodology includes the generation of an embedded controller that supports the scheduling process in the target architecture.
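As a rough illustration of how a Petri net can model tasks competing for a reconfigurable region, the following Python sketch implements the standard enabling and firing rules. The place and transition names (`region_free`, `start_a`, and so on) are hypothetical, and the sketch is not the paper's methodology or its generated controller.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in transition["in"].items())

def fire(marking, transition):
    """Consume tokens from input places and produce tokens in output places."""
    if not enabled(marking, transition):
        raise ValueError("transition not enabled")
    new_marking = dict(marking)
    for place, n in transition["in"].items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in transition["out"].items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking

# Hypothetical net: two tasks share a single reconfigurable region.
START_A = {"in": {"a_ready": 1, "region_free": 1}, "out": {"a_running": 1}}
FINISH_A = {"in": {"a_running": 1}, "out": {"a_done": 1, "region_free": 1}}
START_B = {"in": {"b_ready": 1, "region_free": 1}, "out": {"b_running": 1}}

if __name__ == "__main__":
    m = {"a_ready": 1, "b_ready": 1, "region_free": 1}
    m = fire(m, START_A)
    print(enabled(m, START_B))  # task B must wait for the region
    m = fire(m, FINISH_A)
    print(enabled(m, START_B))  # region released, task B may start
```

The single token in `region_free` encodes mutual exclusion on the region; a scheduler would choose among the enabled transitions at each step.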
The purpose of this paper is to detail the method and findings of an architectural exploration of mixed-granularity field-programmable gate arrays (FPGAs). The work carried out for this study involves the creation of an analytical framework within which a set of benchmark circuits can be studied. The idea is to maximise the performance over all benchmark circuits by choosing an optimal set of silicon cores to be placed within a given area constraint. When connected with flexible configurable routing, these cores should together be capable of implementing any one of the benchmark circuits. In this paper the problem is cast as a formal optimisation and solved using existing optimisation tools. Any multiplication or memory operation may be implemented either by configuring fine-grain resources or by using specialised functional units such as those found in a Xilinx Virtex-II FPGA. The design space is explored by examining the tradeoffs between area, speed and flexibility. The architectures generated are contrasted with commercial architectures that have fixed ratios of functional units and, in addition, a sensitivity analysis is performed to see how the results are affected by the architectural parameters of the problem.
This paper proposes an integrated framework for the high-level design of high-performance implementations of signal processing algorithms on FPGAs. The framework emerged from a constant need to rapidly implement increasingly complicated algorithms on FPGAs while maintaining the high performance needed in many real-time digital signal processing applications. This is particularly important for application developers, who often rely on iterative and interactive development methodologies. The central idea behind the proposed framework is to dynamically integrate high-performance structural hardware description languages with higher-level hardware languages in order to help satisfy the dual requirement of high-level design and high-performance implementation. The paper illustrates this by integrating two environments: Celoxica's Handel-C language, and HIDE, a structural hardware environment developed at the Queen's University of Belfast.
FPGAs provide a speed advantage in processing for embedded systems, especially when processing is moved close to the sensors. Perhaps the ultimate embedded system is a neural prosthetic, in which probes are inserted into the brain and the recorded electrical activity is analyzed to determine which neurons have fired. In turn, this information can be used to manipulate an external device such as a robot arm or a computer mouse. To make the detection of these signals possible, some baseline data must be processed to correlate impulses with particular neurons. One method for processing this data uses a statistical clustering algorithm called Expectation Maximization, or EM. In this paper, we examine the EM clustering algorithm, determine the most computationally intensive portion, map it onto a reconfigurable device, and show several areas of performance gain.
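For readers unfamiliar with EM clustering, the following is a minimal sketch for a two-component, one-dimensional Gaussian mixture. It is an illustration only: the paper's data are neural recordings, and its dimensionality, initialisation, and hardware mapping are not described in the abstract. The function names, the min/max initialisation, the variance floor, and the sample data are assumptions.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_two_gaussians(data, iterations=50, var_floor=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    means = [min(data), max(data)]  # assumed deterministic initialisation
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate parameters from the weighted samples.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_floor, spread)  # guard against collapse
            weights[k] = nk / len(data)
    return means, variances, weights

if __name__ == "__main__":
    samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
    print(em_two_gaussians(samples))
```

Each iteration evaluates one density per sample per component, so the work scales with the product of the number of samples, the number of clusters, and the number of iterations.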
Leakage power has been overshadowed by dynamic power minimization techniques in FPGAs, yet it is a growing concern in programmable logic. This paper proposes a dual threshold voltage implementation of the FPGA architecture for leakage power reduction. A CAD flow is developed for assigning high threshold voltage to the logic elements within the logic blocks of the FPGA. The CAD flow ensures that all the logic blocks remain identical with respect to the number of high and low threshold voltage logic elements that each logic block contains. This CAD flow leads to a dual threshold voltage implementation of the FPGA architecture. Results indicate that over 95% of the logic elements in the FPGA can be assigned high threshold voltage. On average, leakage savings of 60%, and up to 70% for some benchmarks, can be achieved. The proposed CAD flow forms a basis on which other dual threshold voltage implementations of FPGAs can be evaluated. We investigate the design trade-offs between the ratio of high-Vt to low-Vt logic elements in a cluster and the leakage savings. We also investigate the impact of cluster size on leakage savings for the dual threshold voltage implementation.
ISBN: 9781595930293 (print)
FPGA technology has become widely used for real-time network intrusion detection. In this paper, a novel packet classification architecture called BV-TCAM is presented, which is implemented for an FPGA-based Network Intrusion Detection System (NIDS). The classifier can report multiple matches at gigabit-per-second network link rates. The BV-TCAM architecture combines the Ternary Content Addressable Memory (TCAM) and the Bit Vector (BV) algorithm to effectively compress the data representations and boost throughput. A tree-bitmap implementation of the BV algorithm is used for source and destination port lookup, while a TCAM performs the lookup of the other header fields, which can be represented as a prefix or an exact value. The architecture eliminates the requirement for prefix expansion of port ranges. With the aid of a small embedded TCAM, packet classification can be implemented in a relatively small part of the available logic of an FPGA. The design is prototyped and evaluated in a Xilinx XCV2000E FPGA on the FPX platform. Even with the most difficult set of rules and packet inputs, the circuit is fast enough to sustain OC48 traffic throughput. Using larger and faster FPGAs, the system can work at speeds greater than OC192. Copyright 2005 ACM.
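The division of labour described above (bit vectors for port ranges, ternary matching for the remaining fields, then an intersection) can be sketched in Python as follows. This is a software illustration under stated assumptions, not the BV-TCAM hardware: the linear scan over ranges stands in for the tree-bitmap lookup, the rule set is hypothetical, and only the protocol field is modelled as a TCAM entry.

```python
# Hypothetical rule set: (src port range, dst port range, (proto value, proto mask)).
RULES = [
    ((0, 65535), (80, 80), (6, 0xFF)),      # rule 0: TCP to port 80
    ((1024, 65535), (0, 1023), (0, 0x00)),  # rule 1: any protocol, high src to low dst
    ((0, 65535), (53, 53), (17, 0xFF)),     # rule 2: UDP to port 53
]

def range_bit_vector(value, ranges):
    """Bit i is set when value falls inside range i (stand-in for tree bitmap)."""
    bv = 0
    for i, (lo, hi) in enumerate(ranges):
        if lo <= value <= hi:
            bv |= 1 << i
    return bv

def tcam_bit_vector(value, entries):
    """Bit i is set when value matches ternary entry i; mask bits of 0 are don't-care."""
    bv = 0
    for i, (pattern, mask) in enumerate(entries):
        if (value & mask) == (pattern & mask):
            bv |= 1 << i
    return bv

def classify(src_port, dst_port, proto, rules=RULES):
    """Return the indices of every rule the packet header matches."""
    src_bv = range_bit_vector(src_port, [r[0] for r in rules])
    dst_bv = range_bit_vector(dst_port, [r[1] for r in rules])
    tcam_bv = tcam_bit_vector(proto, [r[2] for r in rules])
    matched = src_bv & dst_bv & tcam_bv  # a rule must match in every field
    return [i for i in range(len(rules)) if (matched >> i) & 1]

if __name__ == "__main__":
    print(classify(40000, 80, 6))
    print(classify(500, 53, 17))
```

Because the result is a full bit vector rather than a single highest-priority index, all matching rules are available at once, which is what multi-match reporting for intrusion detection requires. Port ranges are compared directly, so no prefix expansion is needed.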
This paper introduces a methodology for prototyping Globally Asynchronous Locally Synchronous (GALS) circuits on synchronous commercial FPGAs. A library of the elements required for implementing GALS circuits is proposed, and general design considerations for successfully implementing a GALS circuit on an FPGA are discussed. The library includes clock generators, arbiters, and different port controllers. Different implementations of these circuits and their advantages and disadvantages are explored. Finally, we present a GALS Reed-Solomon decoder as a practical example. The results show that the GALS approach improves the performance of the circuit by 11% and reduces the power consumption by 18.7% to 19.6%, depending on the error rate. On the other hand, the area of the circuit increases by 51%, which is acceptable considering that a purely synchronous circuit with a central controller is decomposed to generate the GALS system, and that 29% of this overhead comes from distributing the controller across different modules. Deploying better decomposition methods can reduce this overhead substantially.