Just-in-time (JIT) compilation has been used in many applications to enable standard software binaries to execute on different underlying processor architectures, yielding software portability benefits. We previously introduced the concept of a standard hardware binary to achieve similar portability benefits for hardware, using a JIT compiler to compile the hardware binary to an FPGA. Our JIT compiler includes lean versions of technology mapping, placement, and routing algorithms that implement the standard hardware binary on a simple custom FPGA fabric designed specifically for JIT compilation. While directly implementing a custom FPGA fabric on silicon may be feasible for some applications, we investigated the option of implementing the simple FPGA fabric as a circuit mapped to a physical FPGA, that is, a virtual FPGA. We described our simple fabric in structural VHDL, synthesized the fabric onto a Xilinx Spartan-IIE FPGA, and mapped 18 benchmark circuits onto the resulting virtual FPGA. Our results show a 6X decrease in performance and a 100X increase in hardware resource usage for the virtual FPGA approach compared to mapping the circuits directly to the physical FPGA. For applications in which hardware portability is essential, a designer could leverage the large capacity of current commercially available FPGAs to implement a virtual FPGA with tens of thousands of configurable gates, providing about the same amount of configurable logic as FPGAs produced in the mid-1990s. Nevertheless, the large overheads clearly indicate the need to develop a virtual FPGA approach tuned to physical fabrics in order to reduce the overhead.
ISBN: (print) 9781595930293
Creating a new FPGA is a challenging undertaking because of the significant effort that must be spent on circuit design, layout, and verification. It currently takes approximately 50 to 200 person-years from architecture definition to tape-out for a new FPGA family. Such a lengthy development time is necessary because the process is primarily done manually. Simplifying and shortening the design process would be advantageous since it could reduce the time to market for new FPGAs while also enhancing architecture exploration. One way to accomplish this is through automation, and in this paper we describe our efforts to automate the entire process by making use of a previously developed set of tools that assist in the creation of the repeatable FPGA tile [25]. Our aim is to demonstrate the feasibility of a CAD flow that uses an input FPGA architecture description to generate a layout that can be sent for fabrication. We prove the feasibility of this proposition by actually designing and fabricating a complete FPGA. Initial functional testing of the FPGA appears promising but is inconclusive at this time. Through this architecture-to-layout process, we investigate the issues that are faced in the architecture selection, circuit design, layout, and verification of such an automatically produced FPGA. We found that there are significant savings in design time. We also demonstrate that we can produce a layout using automated tools that is only 36% larger than a commercial FPGA device layout. Given the significant time savings and the relatively minor area penalty, we feel that this work demonstrates that automated layout of FPGAs is practical and advantageous. Copyright 2005 ACM.
The mutual information-based 3D image registration algorithm has been shown to be an accurate, robust, and general registration method for medical image processing. However, its potential clinical application is jeopardized by its high computational cost. Like many other 3D imaging algorithms, 3D registration needs to process vast amounts of image data. The intrinsic computing characteristics of the mutual information-based registration algorithm make it even more difficult to manage and cache the input data. In this paper, we introduce our FPGA-based computing platform and detail its application in accelerating mutual information-based 3D image registration. This platform is designed to accelerate a series of 3D medical imaging algorithms dominated by local operations. It is a System-on-Chip (SoC) architecture with an intelligent data caching system for efficient data fetching and buffering. A reconfigurable multi-pipeline execution unit in the platform is designed to perform high-speed parallel computation. This unit can be customized according to different algorithm requirements to achieve high performance. The simulation results for accelerating mutual information-based registration with our platform show that a speed-up of about 30 times can be achieved compared to a single-CPU computer. Moreover, the platform can be easily reconfigured to accelerate other 3D medical imaging algorithms.
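For readers unfamiliar with the similarity measure, the following is a minimal software sketch of mutual information computed from the joint histogram of two equally sized intensity sequences. It is not the paper's hardware pipeline: the function name `mutual_information`, the flat-list input, and the use of exact intensity values as histogram bins are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in bits) between two equally sized sequences.

    Each distinct intensity value is treated as its own histogram bin,
    which is an illustrative simplification.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))   # joint histogram of (a_i, b_i) pairs
    hist_a = Counter(a)          # marginal histogram of a
    hist_b = Counter(b)          # marginal histogram of b
    mi = 0.0
    for (va, vb), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (hist_a * hist_b)
        ratio = count * n / (hist_a[va] * hist_b[vb])
        mi += p_ab * math.log2(ratio)
    return mi


# Identical images share all information; independent ones share none.
aligned = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A registration loop would re-evaluate this measure for each candidate transformation, so the joint histogram is rebuilt many times over the full volume. That repeated pass over the data is consistent with the high computational cost noted in the abstract.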
ISBN: (print) 9781595930293
Floating-point Sparse Matrix-Vector Multiplication (SpMXV) is a key computational kernel in scientific and engineering applications. The poor data locality of sparse matrices significantly reduces the performance of SpMXV on general-purpose processors, which rely heavily on the cache hierarchy to achieve high performance. The abundant hardware resources on current FPGAs provide new opportunities to improve the performance of SpMXV. In this paper, we propose an FPGA-based design for SpMXV. Our design accepts sparse matrices in Compressed Row Storage format and makes no assumptions about the sparsity structure of the input matrix. The design employs IEEE-754 double-precision floating-point multipliers and adders, and performs multiple floating-point operations as well as I/O operations in parallel. The performance of our design for SpMXV is evaluated using various sparse matrices from the scientific computing community, with the Xilinx Virtex-II Pro XC2VP70 as the target device. The MFLOPS performance increases with the hardware resources on the device as well as the available memory bandwidth. For example, when the memory bandwidth is 8 GB/s, our design achieves over 350 MFLOPS for all the test matrices. It demonstrates significant speedup over general-purpose processors, particularly for matrices with very irregular sparsity structure. Besides solving the SpMXV problem, our design provides a parameterized and flexible tree-based design for floating-point applications on FPGAs. Copyright 2005 ACM.
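As a reference point for the kernel itself, here is a minimal sequential sketch of SpMXV over a matrix in Compressed Row Storage. It illustrates the data format only and does not model the paper's parallel tree-based hardware. The array names `values`, `col_idx`, and `row_ptr` follow a common convention and are assumptions, as is the small example matrix.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A in Compressed Row Storage.

    values  : nonzero entries of A, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # The indirect access x[col_idx[k]] is the source of the
            # poor data locality on cache-based processors.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (hypothetical):
#   [[4, 0, 1],
#    [0, 0, 2],
#    [3, 5, 0]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 2, 0, 1]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = spmv_csr(values, col_idx, row_ptr, x)  # [7.0, 6.0, 13.0]
```

Each nonzero contributes one multiply and one add, so a matrix with nnz nonzeros costs roughly 2 * nnz floating-point operations. That is the usual basis for an MFLOPS figure of the kind quoted in the abstract.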
In this work, the design of an energy-efficient FPGA interconnect architecture has been investigated. It concerns a dual-supply solution, where the logic blocks are powered by a nominal voltage supply and the interconnect is powered by a reduced voltage supply. The behaviour of a fully buffered architecture, a fully pass-transistor-based architecture, and a hybrid buffer and pass-transistor architecture has been investigated over a range of power supply voltages. It is found that there exists an optimal ratio between the number of pass-transistor and tri-state buffer switches, depending on the load and power supply involved. By reducing the signal voltage swing on the interconnect, the need for a fully tri-state buffer-based interconnect is eliminated, thus saving valuable area and power. Through benchmark studies, it is confirmed that an optimal composite of pass-transistor and tri-state buffer switches operating at a reduced power supply can match the speed of the full-swing scenario at a much lower power consumption. An average reduction in power-delay of 4.4x for low-load critical paths and 2.7x for high-load critical paths is achieved using buffer receivers. Using level-shifter receivers, an average reduction in power-delay of 4.7x for low-load critical paths and 2.8x for high-load critical paths is obtained. It is also found that, by partially replacing tri-state buffers with pass-transistor switches and in spite of using level shifters, we can save up to a factor of 4x in interconnect area compared to fully buffered architectures. The results have been validated over various benchmarks in a 0.1 μm CMOS technology.
ISBN: (print) 0769523013
Based on an architectural analysis of island-style FPGAs, area and delay models for LUT-based FPGAs are proposed. The effect of LUT size on FPGA area and performance is studied. Results show that the optimal LUT size predicted by the computational models agrees with that found experimentally: a LUT size of 4 produces the best area results, and a LUT size of 5 provides better performance.
In this work, we parameterize and explore the interconnect structure of pipelined FPGAs. Specifically, we explore the effects of interconnect register population, length of registered routing track segments, registered I/O terminals of logic units, and the flexibility of the interconnect structure on the performance of a pipelined FPGA. Our experiments with the RaPiD architecture identify tradeoffs that must be made while designing the interconnect structure of a pipelined FPGA. The post-exploration architecture that we found shows a 19% improvement over RaPiD, while the area overhead incurred in placing and routing benchmark netlists on the post-exploration architecture is 18%.
Moore's Law states that the number of transistors on a device doubles every two years; however, it is often (mis)quoted based on its impact on CPU performance. This important corollary of Moore's Law states that improved clock frequency plus improved architecture yields a doubling of CPU performance every 18 months. This paper examines the impact of Moore's Law on the peak floating-point performance of FPGAs. Performance trends for individual operations are analyzed, as well as the performance trend of a common instruction mix (multiply-accumulate). The important result is that peak FPGA floating-point performance is growing significantly faster than peak floating-point performance for a CPU.
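To make the difference between the two doubling periods concrete, the short sketch below compounds each rate over the same time span. The function name `growth_factor` and the six-year horizon are illustrative assumptions and are not taken from the paper.

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years` given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)


# Transistor count doubling every two years, over six years.
transistors = growth_factor(6, 2.0)   # 8.0

# CPU performance doubling every 18 months, over the same six years.
cpu_perf = growth_factor(6, 1.5)      # 16.0
```

Over six years the 18-month corollary implies twice the growth of the two-year rule, which shows how much a modest change in the doubling period compounds.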
In this paper, we study the technology mapping problem for FPGA architectures with dual supply voltages (Vdds) for power optimization. This is done with the guarantee that the mapping depth of the circuit will not increase compared to the circuit with a single Vdd. We first design a single-Vdd mapping algorithm that achieves better power results than the latest published low-power mapping algorithms. We then show that our dual-Vdd mapping algorithm can further improve power savings by up to 11.6% over the single-Vdd mapper. In addition, we investigate the best low-Vdd/high-Vdd ratio for the largest power reduction among several dual-Vdd combinations. To our knowledge, this is the first work on dual-Vdd mapping for FPGA architectures.