ISBN: 9781581134520 (print)
Video signal processing requires complex algorithms that perform many basic operations on a video stream. To perform these calculations in real time in an FPGA, we must use innovative structures to meet speed requirements while managing complexity. As part of a project aimed at developing a video noise reducer, we developed an optimized processing stream that required some floating-point calculations. This paper presents the rationale for developing a floating-point unit, justifies the data representation used, describes its implementation in a Xilinx VirtexE FPGA, and reports the performance we obtained. A divider using this representation is also presented, along with its implementation and performance in the same FPGA.
Recent years have seen a tremendous increase in the capacities and capabilities of field-programmable gate arrays (FPGAs). Much of this dramatic improvement has been the result of changes to the FPGAs' internal architectures. New architectural proposals are routinely generated in both academia and industry. For FPGAs to continue to grow, it is important that these new architectural ideas are fairly and accurately evaluated, so that worthy ideas can be included in future chips. Typically, this evaluation is done using experimentation. However, the use of experimentation is dangerous, since it requires making assumptions regarding the tools and architecture of the device in question. If these assumptions are not accurate, the conclusions from the experiments may not be meaningful. In this paper, we investigate the sensitivity of FPGA architectural conclusions to experimental variations. To make our study concrete, we evaluate the sensitivity of four previously published and well-known FPGA architectural results: lookup-table size, switch block topology, cluster size, and memory size. It is shown that these experiments are significantly affected by the assumptions, tools, and techniques used in the experiments.
We present a routability-driven bottom-up clustering technique for area and power reduction in clustered FPGAs. This technique uses a cell connectivity metric to identify seeds for efficient clustering. Effective seed selection, coupled with interconnect-resource-aware clustering and placement, can have a favorable impact on circuit routability. It leads to better device utilization, savings in area, and a reduction in power consumption. A routing area reduction of 35% is achieved over previously published results. Power dissipation simulations using a buffered pass-transistor-based FPGA interconnect model are presented. They show that our clustering technique can reduce the overall device power dissipation by an average of 13%.
This paper examines the circuit design of buffered routing switches in symmetrical, island-style FPGAs. The effects of switch size, tile length, level-restoring, and slow input slew rates are examined. Two new fanin-based switch designs are used to eliminate nearly all of the increase in delay that arises from fanout with a previous switch design. Alternating between buffers and pass transistors is shown to improve connection delay without fanout by 25%. To take advantage of this, we propose schemes to replace some buffers with pass transistors to simultaneously reduce area and delay. Routing a suite of MCNC benchmark circuits shows that 14% in area-delay, or 7% in delay, can be saved using the new switch schemes. Alternatively, approximately 13% in area can be saved with no degradation in delay.
In reconfigurable computing, circuits implemented on multi-FPGA systems have to be incrementally modified. Since reconfiguring an FPGA is time-consuming, the time for reconfiguration depends on the number of FPGAs to be reconfigured. Our objective is to reduce the number of such FPGAs. In this paper, we consider the specific problem of incrementally reconfiguring a multi-FPGA system that uses the direct interconnection architecture, where routing connections between FPGAs run only to nearby neighbors. This problem can be divided into a net addition problem and a net deletion problem. We show that the net addition problem is a generalization of the NP-complete Steiner tree problem. Our algorithm for this problem is based on an adaptation of the Klein-Ravi approximation algorithm for the node-weighted Steiner tree problem. As for the net deletion problem, we prove that it is NP-complete in general but solvable in polynomial time for tree topologies. Based on the algorithm for trees, we design an effective heuristic algorithm for the general net deletion problem. Finally, we present an algorithm for solving the incremental reconfiguration problem that handles both placement of new gates and inter-FPGA routing.
This paper reports on a method for extending existing VHDL design and verification software available for the Xilinx Virtex series of FPGAs. It allows the designer to apply standard hardware design and verification tools to the design of dynamically reconfigurable logic (DRL). The technique involves converting a dynamic design into multiple static designs, suitable for input to standard synthesis and APR tools. For timing and functional verification after APR, the sections of the design can then be recombined into a single dynamic system. The technique has been automated by extending an existing DRL design tool named DCSTech, which is part of the Dynamic Circuit Switching (DCS) CAD framework. The principles behind the tools are generic and should be readily extensible to other architectures and CAD toolsets. Implementation of the dynamic system involves producing partial configuration bitstreams to load sections of circuitry. The process of creating such bitstreams, the final stage of our design flow, is summarized.
Continuing advances in electrical engineering, in areas such as cellular communications, fiber optics, and mobile and multi-gigahertz electronics, have necessitated a computer-assisted design approach to the complex electromagnetic interactions and problems that arise. Finite-Difference Time-Domain (FDTD) analysis is a very powerful tool for modeling electromagnetic phenomena. The algorithm is computationally intensive, and simulations can run for a few hours to several days. Increasing the computation speed and decreasing the run times of this algorithm would bring greater productivity and new avenues of research to many facets of electrical engineering. The algorithm is transferred to custom FPGA-based hardware using a pipelined bit-serial arithmetic architecture. A one-dimensional resonator is used to verify the implementation and explore the hardware speed and costs. The computational speed is extremely fast and does not depend on the number of computational cells in the simulation. Finally, a discussion of future research is presented.
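For readers unfamiliar with FDTD, the following is a minimal software sketch of the leapfrog field update for a one-dimensional cavity with perfectly conducting walls. It is an illustration only, not the paper's pipelined bit-serial hardware: the function name `fdtd_1d`, the normalized field units, the Courant number of 0.5, and the single-cell initial excitation are all assumptions made for this example.

```python
def fdtd_1d(num_cells, num_steps, courant=0.5):
    """Leapfrog update of normalized E and H fields in a 1-D cavity.

    Assumptions for illustration: normalized units, a Courant number
    <= 1 for stability, and an initial unit spike in the middle cell.
    """
    ez = [0.0] * (num_cells + 1)  # electric field at cell edges
    hy = [0.0] * num_cells        # magnetic field at cell centers
    ez[num_cells // 2] = 1.0      # hypothetical initial excitation
    for _ in range(num_steps):
        # Each H sample depends only on its two neighboring E samples.
        for i in range(num_cells):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Each interior E sample depends only on its two neighboring H samples.
        for i in range(1, num_cells):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # ez[0] and ez[num_cells] are never updated: conducting walls.
    return ez, hy
```

Because every cell update reads only its immediate neighbors, the inner loops have no dependence between cells within a half step. A hardware design could in principle dedicate one update unit per cell, which is one way to read the abstract's remark about speed not depending on the number of cells.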
Distributed Arithmetic (DA) is an important technique for implementing digital signal processing (DSP) functions in FPGAs. However, traditional lookup table (LUT) based DA architectures contain one or more carry-propagation chains in the critical path, which dictates the fastest rate at which an entire design can run. In this paper, we describe a novel technique that can reduce or eliminate the carry-propagate chain from the critical path in LUT-based DA architectures on FPGAs. In the proposed scheme, the individual bits of a word do not have to be processed as a unit. Instead, the current iteration can start as soon as the least significant bit (LSB) of the previous iteration is available, without waiting for the entire word from the previous iteration to be fully computed. This technique has great potential for speeding up DSP applications based on DA. Designs are described for serial and parallel DALUT and accumulator structures in which an n-bit carry chain, where n is the word length, is broken into smaller r-bit chains, 1 ≤ r
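As background, the sketch below models the traditional bit-serial, LUT-based DA inner product that the abstract takes as its starting point. It does not implement the paper's carry-chain-breaking scheme. The function names `build_da_lut` and `da_inner_product` and the two's-complement input convention are assumptions chosen for this example.

```python
def build_da_lut(coeffs):
    """Precompute, for every K-bit address, the sum of the selected coefficients."""
    k = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << k)]


def da_inner_product(coeffs, samples, width):
    """Compute sum(c * x) one bit plane at a time.

    Assumes each sample fits in `width`-bit two's complement.
    """
    lut = build_da_lut(coeffs)
    mask = (1 << width) - 1
    acc = 0
    for bit in range(width):
        # Gather bit `bit` of every sample into one LUT address.
        addr = 0
        for i, x in enumerate(samples):
            addr |= (((x & mask) >> bit) & 1) << i
        term = lut[addr] << bit
        # The sign bit carries negative weight in two's complement.
        acc += -term if bit == width - 1 else term
    return acc
```

For example, `da_inner_product([3, -5, 7, 2], [10, -4, 0, 127], 8)` agrees with the direct sum of products. In hardware, the `acc +=` step is the accumulator whose carry chain the abstract describes as the critical path.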
The SCORE compute model uses fixed-size, virtual compute and memory pages connected by stream links to capture the definition of a computation abstracted from the detailed size of the physical hardware. When the number of physical compute pages is smaller than the number of virtual compute pages in the abstract computation graph, the design is time-multiplexed onto the available physical hardware. A key component of this strategy is an automatic scheduler that selects the temporal sequencing of virtual resources onto the physical device. We describe a quasi-static scheduling strategy that retains the full semantic power of the dynamic SCORE flow graph while taking advantage of static scheduling techniques at program load time to hoist most of the computational work out of the inner scheduling loops. This strategy reduces online scheduling work per reconfiguration epoch by an order of magnitude. In addition, the more global perspective available from offline scheduling improves schedule quality, resulting in a net reduction of total execution time by 46-81%.
Medical image processing in general, and computerized tomography (CT) in particular, can benefit greatly from hardware acceleration. This application domain is marked by computationally intensive algorithms requiring the rapid processing of large amounts of data. To date, reconfigurable hardware has not been applied to this important area. For efficient implementation and maximum speedup, fixed-point implementations are required. The associated quantization errors must be carefully balanced against the requirements of the medical community. Specifically, care must be taken so that very little error is introduced compared to floating-point implementations and the visual quality of the images is not compromised. In this paper, we present an FPGA implementation of the parallel-beam backprojection algorithm used in CT for which all of these requirements are met. We explore a number of quantization issues arising in backprojection and concentrate on minimizing error while maximizing efficiency. Our implementation shows significant speedup over software versions of the same algorithm and is more flexible than an ASIC implementation. Our FPGA implementation can easily be adapted both to medical sensors with different dynamic ranges and to tomographic scanners employed in a wider range of application areas, including nondestructive evaluation and baggage inspection in airport terminals.
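To make the quantization trade-off concrete, here is a minimal sketch of unfiltered parallel-beam backprojection with nearest-neighbor detector lookup, plus a helper that rounds values to a fixed number of fractional bits. The abstract does not give the paper's bit widths or interpolation scheme, so the names `backproject` and `quantize`, the geometry conventions, and the `frac_bits` parameter are all assumptions for illustration.

```python
import math


def backproject(sinogram, angles_deg, size):
    """Smear each projection back across a size x size image.

    sinogram[k] holds the detector readings taken at angles_deg[k].
    Nearest-neighbor lookup is an assumed simplification.
    """
    num_det = len(sinogram[0])
    center = (size - 1) / 2.0
    det_center = (num_det - 1) / 2.0
    image = [[0.0] * size for _ in range(size)]
    for proj, angle in zip(sinogram, angles_deg):
        c = math.cos(math.radians(angle))
        s = math.sin(math.radians(angle))
        for row in range(size):
            for col in range(size):
                x, y = col - center, center - row
                # Detector position hit by the ray through this pixel.
                t = int(round(x * c + y * s + det_center))
                if 0 <= t < num_det:
                    image[row][col] += proj[t]
    return image


def quantize(values, frac_bits):
    """Round every value to a grid of 2**-frac_bits (fixed-point model)."""
    scale = 1 << frac_bits
    return [[round(v * scale) / scale for v in row] for row in values]
```

One way to explore the error budget under these assumptions is to backproject both `sinogram` and `quantize(sinogram, frac_bits)` and compare the two images pixel by pixel while varying `frac_bits`.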