Mutual information-based 3D image registration has been shown to be an accurate, robust, and more general registration method for medical image processing. However, its potential clinical application is jeopardized by its high computational cost. Like many other 3D imaging algorithms, 3D registration must process vast amounts of image data. The intrinsic computing characteristics of the mutual information-based registration algorithm make it even more difficult to manage and cache the input data. In this paper, we introduce our FPGA-based computing platform and detail its application in accelerating mutual information-based 3D image registration. This platform is designed to accelerate a series of 3D medical imaging algorithms dominated by local operations. It is a System-on-Chip (SoC) architecture with an intelligent data caching system for efficient data fetching and buffering. A reconfigurable multi-pipeline execution unit in the platform is designed to perform high-speed parallel computation. This unit can be customized according to different algorithm requirements to achieve high performance. Simulation results for accelerating mutual information-based registration on our platform show a speed-up of about 30x over a single-CPU computer. Moreover, the platform can be easily reconfigured to accelerate other 3D medical imaging algorithms.
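As a rough illustration of the similarity measure being accelerated, the sketch below estimates mutual information from a joint intensity histogram of two equally shaped volumes. It is a plain software reference under stated assumptions, not the platform's pipeline: the function name `mutual_information`, the bin count, and the use of NumPy are all choices made here for illustration.

```python
import numpy as np

def mutual_information(fixed, moving, bins=32):
    """Estimate mutual information (in nats) between two equally shaped volumes."""
    # Joint histogram of co-located voxel intensities (bin count is an assumption).
    joint, _, _ = np.histogram2d(fixed.ravel(), moving.ravel(), bins=bins)
    pxy = joint / joint.sum()               # joint probability estimate
    px = pxy.sum(axis=1, keepdims=True)     # marginal of the fixed volume
    py = pxy.sum(axis=0, keepdims=True)     # marginal of the moving volume
    nz = pxy > 0                            # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Usage: a volume is far more informative about itself than about a shuffled copy.
rng = np.random.default_rng(0)
vol = rng.random((16, 16, 16))
shuffled = rng.permutation(vol.ravel()).reshape(vol.shape)
mi_self = mutual_information(vol, vol, bins=8)
mi_shuffled = mutual_information(vol, shuffled, bins=8)
```

Every voxel pair updates one histogram cell chosen by its intensities, so the memory access pattern is data dependent, which is in line with the abstract's remark that the input data is difficult to manage and cache.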
ISBN: 0769523013 (print)
Based on an architectural analysis of island-style FPGAs, area and delay models of LUT-based FPGAs are proposed. The effect of LUT size on FPGA area and performance is studied. Results show that the optimal LUT size predicted by the computational models matches that found by experiment. A LUT size of 4 produces the best area results, while a LUT size of 5 provides better performance.
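The abstract does not give the area and delay models themselves, so the sketch below illustrates only the basic fact behind the trade-off: a K-input LUT stores 2^K configuration bits, so its storage doubles with each added input. The helper names `make_lut` and `lut_eval` are hypothetical and introduced here for illustration.

```python
def make_lut(k, func):
    """Build the 2**k configuration bits of a k-input LUT from a Boolean function."""
    bits = []
    for index in range(2 ** k):
        # Input b of the LUT is bit b of the table index.
        inputs = [(index >> b) & 1 for b in range(k)]
        bits.append(int(bool(func(*inputs))))
    return bits

def lut_eval(bits, inputs):
    """Look up the output for a list of input bits (input 0 first)."""
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return bits[index]

# A 4-input XOR needs 16 configuration bits; a 5-input XOR needs 32.
xor4 = make_lut(4, lambda a, b, c, d: a ^ b ^ c ^ d)
xor5 = make_lut(5, lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
```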
ISBN: 9781595930293 (print)
Creating a new FPGA is a challenging undertaking because of the significant effort that must be spent on circuit design, layout and verification. It currently takes approximately 50 to 200 person years from architecture definition to tape-out for a new FPGA family. Such a lengthy development time is necessary because the process is primarily done manually. Simplifying and shortening the design process would be advantageous since it could reduce the time to market for new FPGAs while also enhancing architecture explorations. One way to accomplish this is through automation and, in this paper, we describe our efforts to automate the entire process by making use of a previously developed set of tools that assist in the creation of the repeatable FPGA tile [25]. Our aim is to demonstrate the feasibility of a CAD flow that uses an input FPGA architecture description to generate a layout that can be sent for fabrication. We prove the feasibility of this proposition by actually designing and fabricating a complete FPGA. Initial functional testing of the FPGA appears promising but is inconclusive at this time. Through this architecture-to-layout process, we investigate the issues that are faced in the architecture selection, circuit design, layout and verification of such an automatically produced FPGA. We found that there are significant savings in design time. We also demonstrate that we can produce a layout using automated tools that is only 36% larger than a commercial FPGA device layout. Given the significant time savings and the relatively minor area penalty, we feel that this work demonstrates that automated layout of FPGAs is practical and advantageous. Copyright 2005 ACM.
Numerous applications in Digital Signal Processing (DSP), telecommunications, graphics, cryptography and control systems have computations that involve a large number of multiplications of one variable with one or several constants. In this paper, we present a constant array multiplier core generator using dynamic partial evaluation. The proposed generator combines a new partial evaluation method named Full Complement Recoding with Booth's recoding and the straightforward partial evaluation method. Based on the number of 0s, the number of runs that have more than two consecutive 1s, and the total number of 1s in all the runs in the constant operand, the generator selects one of the three partial evaluation methods to construct a partial evaluation architecture and generate efficient Hardware Description Language (HDL) code that can be used as a design component. The Xilinx CORE Generator™ system does not provide optimized constant multipliers in a large number of cases. When implemented on a Xilinx Virtex-II FPGA, the constant multipliers generated by the proposed core generator achieve an average area saving of 70% and an average delay improvement of 36%, compared to 55% and 15% for constant multipliers generated by the Xilinx CORE Generator™ system.
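To make the role of runs of 1s concrete, the sketch below applies textbook radix-2 Booth recoding to a constant and multiplies with one shift and one add or subtract per nonzero digit. It is not the paper's generator or its Full Complement Recoding, and the selection rule among the three methods is not reproduced; the function names `booth_digits` and `multiply_by_constant` are hypothetical.

```python
def booth_digits(c):
    """Radix-2 Booth recoding of a non-negative constant, least significant digit first."""
    digits, prev = [], 0
    for i in range(c.bit_length() + 1):
        bit = (c >> i) & 1
        digits.append(prev - bit)   # each digit is -1, 0 or +1
        prev = bit
    return digits

def multiply_by_constant(x, digits):
    """Multiply x by the recoded constant: one shift and add/subtract per nonzero digit."""
    acc = 0
    for i, d in enumerate(digits):
        if d:
            acc += d * (x << i)
    return acc

def nonzero_count(digits):
    """Number of add/subtract operations the recoded constant needs."""
    return sum(1 for d in digits if d)

run_of_ones = booth_digits(0b0111)   # 7 = 8 - 1: two operations instead of three
isolated_ones = booth_digits(0b0101) # 5: four operations instead of two
```

A long run of 1s collapses to two nonzero digits, while isolated 1s get worse under Booth recoding, which is consistent with the abstract's idea of choosing the recoding per constant from its pattern of 0s and runs.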
Just-in-time (JIT) compilation has been used in many applications to enable standard software binaries to execute on different underlying processor architectures, yielding software portability benefits. We previously introduced the concept of a standard hardware binary to achieve similar portability benefits for hardware, using a JIT compiler to compile the hardware binary to an FPGA. Our JIT compiler includes lean versions of technology mapping, placement, and routing algorithms that implement the standard hardware binary on a simple custom FPGA fabric designed specifically for JIT compilation. While directly implementing a custom FPGA fabric on silicon may be feasible for some applications, we investigated the option of implementing the simple FPGA fabric as a circuit mapped to a physical FPGA - a virtual FPGA. We described our simple fabric in structural VHDL, synthesized the fabric onto a Xilinx Spartan-IIE FPGA, and mapped 18 benchmark circuits onto the resulting virtual FPGA. Our results show a 6X decrease in performance and a 100X increase in hardware resource usage for the virtual FPGA approach compared to mapping the circuits directly to the physical FPGA. For applications in which hardware portability is essential, a designer could leverage the large capacity of current commercially available FPGAs to implement a virtual FPGA with tens of thousands of configurable gates, providing about the same amount of configurable logic as FPGAs produced in the mid 1990s. Nevertheless, the large overheads clearly indicate the need to develop a virtual FPGA approach tuned to physical fabrics in order to reduce the overhead.
In this work, the design of an energy-efficient FPGA interconnect architecture is investigated. It is a dual-supply solution in which the logic blocks are powered by a nominal supply voltage and the interconnect is powered by a reduced supply voltage. The behaviour of a fully buffered, a fully pass-transistor-based, and a hybrid buffer and pass-transistor architecture is investigated over a range of supply voltages. It is found that there exists an optimal ratio between the number of pass-transistor and tri-state buffer switches, depending on the load and supply voltage involved. By reducing the signal voltage swing on the interconnect, the need for a fully tri-state buffer-based interconnect is eliminated, thus saving valuable area and power. Benchmark studies confirm that an optimal composite of pass-transistor and tri-state buffer switches operating at a reduced supply voltage can match the speed of the full-swing scenario at much lower power consumption. An average reduction in power-delay of 4.4x for low-load critical paths and 2.7x for high-load critical paths is achieved using buffer receivers. Using level-shifter receivers, an average reduction in power-delay of 4.7x for low-load critical paths and 2.8x for high-load critical paths is obtained. It is also found that, by partially replacing tri-state buffers with pass-transistor switches, and in spite of using level shifters, up to a factor of 4x in interconnect area can be saved compared to fully buffered architectures. The results have been validated over various benchmarks in a 0.1 μm CMOS technology.
In this paper we evaluate the trade-offs between various low-leakage design techniques for field programmable gate arrays (FPGAs) in deep sub-micron technologies. Since multiplexers are widely used in FPGAs to implement look-up tables (LUTs) and connection and routing switches, several low-leakage implementations of pass-transistor-based multiplexers and routing switches are proposed, and their design trade-offs are presented based on transistor-level simulation, physical design, and impact on overall system performance. We find that gate biasing, the use of redundant SRAM cells, and integration of multi-Vt technology are ideal for FPGAs, and they can reduce leakage current by 2X-4X compared to an implementation without any leakage reduction technique. For some of the potential low-leakage design techniques evaluated in our study, the impact on chip area ranges from minimal to an increase of 15% to 30%.
Moore's Law states that the number of transistors on a device doubles every two years; however, it is often (mis)quoted in terms of its impact on CPU performance. This important corollary of Moore's Law states that improved clock frequency plus improved architecture yields a doubling of CPU performance every 18 months. This paper examines the impact of Moore's Law on the peak floating-point performance of FPGAs. Performance trends for individual operations are analyzed, as well as the performance trend of a common instruction mix (multiply accumulate). The important result is that peak FPGA floating-point performance is growing significantly faster than peak floating-point performance for a CPU.
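The two doubling periods quoted above compound quite differently, as a short worked calculation shows. This is plain arithmetic on the 24-month and 18-month figures from the abstract; the function name `growth_factor` and the six-year horizon are illustrative choices, and the sketch says nothing about the FPGA growth rate the paper measures.

```python
def growth_factor(years, doubling_months):
    """Multiplicative growth after `years` when a quantity doubles every `doubling_months`."""
    return 2.0 ** (12.0 * years / doubling_months)

# Over six years: three doublings at 24 months, four doublings at 18 months.
transistor_growth = growth_factor(6, 24)   # 8.0
cpu_perf_growth = growth_factor(6, 18)     # 16.0
```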