ISBN: 9781424410590 (print)
In this paper we present an FPGA-based dataflow architecture that both efficiently computes parallel algorithms using dedicated FPGA resources and scales well to multi-FPGA chip designs while the overall communication bandwidth increases. The basic idea is based on reconfiguration. In contrast to partially reconfiguring FPGAs, our approach connects computational units via a dynamically variable topology. This topology consists of dedicated switches, each individually controlled by a simple shift register. Hence, the computational result is a function of the currently configured interconnection pattern, which can be updated within a single clock cycle. The scalability of this architecture is demonstrated on a high-performance parallel FFT.
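The idea that "the computational result is a function of the currently configured interconnection pattern" can be illustrated with a toy model. The sketch below is hypothetical, and every name in it is illustrative. It models one column of 2×2 switches, each driven by one bit of a serially loaded shift register, followed by simple subtractor units. The `commit` step represents the single-cycle update mentioned in the abstract. Modeling that update as a latch of the shifted bits is an assumption, not the paper's circuit.

```python
class SwitchColumn:
    """Toy column of 2x2 switches, each controlled by one shift-register bit.

    Hypothetical model: lane pairing and the commit step are assumptions.
    """

    def __init__(self, n_switches: int) -> None:
        self.shift = [0] * n_switches   # bits being loaded serially
        self.active = [0] * n_switches  # bits currently driving the switches

    def shift_in(self, bit: int) -> None:
        # One clock: each bit moves one stage along; the new bit enters stage 0.
        self.shift = [bit] + self.shift[:-1]

    def commit(self) -> None:
        # Assumption: the loaded pattern is applied to all switches at once.
        self.active = list(self.shift)

    def route(self, values: list) -> list:
        # Switch s joins lanes 2s and 2s+1: bit 0 = pass through, bit 1 = cross.
        out = list(values)
        for s, cross in enumerate(self.active):
            if cross:
                out[2 * s], out[2 * s + 1] = out[2 * s + 1], out[2 * s]
        return out


def subtract_pairs(values: list) -> list:
    # Hypothetical computational units: unit k computes lane 2k minus lane 2k+1.
    return [values[i] - values[i + 1] for i in range(0, len(values), 2)]


if __name__ == "__main__":
    col = SwitchColumn(2)
    print(subtract_pairs(col.route([5, 2, 7, 3])))  # pass-through pattern
    col.shift_in(1)
    col.shift_in(0)
    col.commit()                                    # switch 1 now crosses
    print(subtract_pairs(col.route([5, 2, 7, 3])))
```

The same units and inputs give different results once the pattern changes. Only the interconnect was reconfigured, not the units themselves.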
Molecular dynamics simulation based on discrete event simulation (DMD) is emerging as an alternative to time-step-driven molecular dynamics (MD). DMD uses simplified discretized models that allow simulations to be advanced from event to event, yielding a performance increase of several orders of magnitude. Even so, DMD is compute-bound. Moreover, unlike MD, DMD is difficult to scale because of causality issues; the best scaling achieved so far is O(√p). We find that FPGAs are extremely well suited to accelerating DMD. The chaotic execution, which leaves virtually no prediction window, is overcome with a long processing pipeline augmented with associative structures analogous to those used in CPU reorder buffers. Our primary result is a microarchitecture for DMD that processes events with a throughput equal to a small multiple of the FPGA's clock, resulting in a hundred-fold speed-up over serial implementations.
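The abstract describes DMD advancing a simulation from event to event rather than by fixed time steps. The sketch below is a serial, software-only illustration, not the paper's FPGA microarchitecture. It uses a hypothetical 1D system of equal-mass point particles between two walls. Predicted events sit in a priority queue, and per-particle counters discard predictions invalidated by earlier events. This is a common software approach to the stale-event problem that the abstract's associative structures presumably address in hardware.

```python
import heapq


def simulate_1d(x, v, length, t_end):
    """Event-driven simulation of point particles on [0, length] with hard walls.

    Equal masses and elastic collisions, so colliding neighbours swap velocities.
    Returns (positions, velocities, number_of_events_processed).
    """
    x, v = list(x), list(v)
    n = len(x)
    count = [0] * n  # bumped each time a particle's velocity changes
    events = []      # heap of (time, i, j, count_i, count_j); j == -1 marks a wall
    t = 0.0

    def predict_pair(k):
        # Particles k and k+1 approach each other only if v[k] > v[k+1].
        if v[k] > v[k + 1]:
            dt = max(0.0, x[k + 1] - x[k]) / (v[k] - v[k + 1])
            heapq.heappush(events, (t + dt, k, k + 1, count[k], count[k + 1]))

    def predict_wall(i):
        if i == 0 and v[i] < 0:
            heapq.heappush(events, (t + max(0.0, x[i]) / -v[i], i, -1, count[i], 0))
        if i == n - 1 and v[i] > 0:
            dt = max(0.0, length - x[i]) / v[i]
            heapq.heappush(events, (t + dt, i, -1, count[i], 0))

    for k in range(n - 1):
        predict_pair(k)
    for i in {0, n - 1}:
        predict_wall(i)

    processed = 0
    while events:
        te, i, j, ci, cj = heapq.heappop(events)
        if te > t_end:
            break
        if count[i] != ci or (j >= 0 and count[j] != cj):
            continue  # stale: a participant changed velocity after this prediction
        for k in range(n):  # advance every particle to the event time
            x[k] += v[k] * (te - t)
        t = te
        if j < 0:
            v[i] = -v[i]
            touched = [i]
        else:
            v[i], v[j] = v[j], v[i]
            touched = [i, j]
        for p in touched:
            count[p] += 1
        processed += 1
        # Re-predict only the events that involve a particle whose velocity changed.
        for k in {q for p in touched for q in (p - 1, p) if 0 <= q < n - 1}:
            predict_pair(k)
        for p in touched:
            predict_wall(p)

    for k in range(n):  # drift to the requested end time
        x[k] += v[k] * (t_end - t)
    return x, v, processed
```

For example, `simulate_1d([1.0, 3.0, 6.0, 8.0], [1.0, -0.5, 0.25, -2.0], 10.0, 50.0)` processes a sequence of wall and pair events. Because collisions only swap or negate velocities, the particles stay ordered and the set of speeds is unchanged.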
This paper describes an FPGA-based accelerator for maze routing applications such as integrated circuit detailed routing. The accelerator efficiently supports multiple layers, multi-terminal nets, and rip-up and reroute. By time-multiplexing multiple layers over a two-dimensional array of processing elements, this approach supports multi-layer grids large enough for detailed routing while providing a speedup of one to two orders of magnitude over software running on a modern desktop computer. The current implementation supports a 32 × 32 routing grid with up to 16 layers in a single Xilinx XC2V6000 FPGA. Routing grids of up to 64 × 64 are feasible in larger commercially available FPGAs. Performance measurements (including interface overhead) show a speedup of 29× to 93× over the classic Lee algorithm and 5× to 19× over the A* algorithm. An improved interface design could yield significantly larger speedups.
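The speedups above are measured against the classic Lee algorithm. As a reference point, here is a minimal software sketch of Lee-style breadth-first wave expansion on a multi-layer grid. It finds a two-terminal route only; the accelerator's multi-terminal nets and rip-up and reroute are not modeled. Treating a layer change (via) as costing the same as one in-layer step is an assumption made for simplicity.

```python
from collections import deque


def lee_route(blocked, src, dst):
    """Breadth-first (Lee) wave expansion on a multi-layer routing grid.

    blocked[layer][row][col] is True for cells that cannot be used.
    src and dst are (layer, row, col) tuples. Returns the cells on a shortest
    path, or None if dst is unreachable. Assumption: a via costs one step.
    """
    n_layers, n_rows, n_cols = len(blocked), len(blocked[0]), len(blocked[0][0])
    prev = {src: None}  # predecessor map, also used as the visited set
    frontier = deque([src])
    while frontier:
        cell = frontier.popleft()
        if cell == dst:
            break
        l, r, c = cell
        # Four in-layer neighbours plus the cells directly above and below.
        for nxt in ((l, r + 1, c), (l, r - 1, c), (l, r, c + 1), (l, r, c - 1),
                    (l + 1, r, c), (l - 1, r, c)):
            nl, nr, nc = nxt
            if (0 <= nl < n_layers and 0 <= nr < n_rows and 0 <= nc < n_cols
                    and not blocked[nl][nr][nc] and nxt not in prev):
                prev[nxt] = cell
                frontier.append(nxt)
    if dst not in prev:
        return None
    # Backtrace from the target to the source.
    path = [dst]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return path[::-1]
```

On a single 3 × 3 layer with an obstacle, the route detours around it. Adding a free second layer lets the wave hop over the obstacle through two vias instead.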
A biological organism's ability to sense and adapt to its environment is essential to its survival. Likewise, environmentally aware computing systems lend themselves to a longer operational life and a wider range of applications than traditional systems. In this paper, we propose a novel circuit design methodology that allows parameterizable hardware to self-regulate its temperature. We apply this methodology to an image recognition system on a Xilinx Virtex-4 FX100 field-programmable gate array (FPGA). The image recognition system maintains a safe operating temperature by automatically adjusting its frequency and output quality. The circuit sacrifices output performance and quality to lower its internal temperature as the ambient temperature increases, and can exploit cooler temperatures by increasing output performance and quality. Furthermore, the circuit shuts down if the ambient temperature becomes too high for the device to function properly. A performance evaluation of our adaptive circuit under various thermal conditions shows up to a 4× increase in performance and a 2× increase in quality over a system without dynamic thermal control.
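The control policy described above maps temperature to frequency and output quality, with a shutdown above a limit. It can be sketched as a simple lookup. Every threshold, frequency and quality level below is a made-up placeholder, not a value from the paper. A real controller would likely add hysteresis so it does not oscillate near a threshold.

```python
from typing import Optional, Tuple

# Hypothetical operating points, coolest first: (max temperature in degC,
# clock frequency in MHz, output-quality level). All values are placeholders.
OPERATING_POINTS = [
    (60.0, 200, 3),
    (75.0, 150, 2),
    (85.0, 100, 1),
]


def select_operating_point(temp_c: float) -> Optional[Tuple[int, int]]:
    """Return (frequency_mhz, quality) for a measured temperature.

    Hotter readings select slower, lower-quality settings. Above the last
    threshold the function returns None, meaning the circuit should shut down.
    """
    for max_temp, freq_mhz, quality in OPERATING_POINTS:
        if temp_c <= max_temp:
            return freq_mhz, quality
    return None
```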
This paper introduces a scalable FPGA implementation of a stochastic simulation algorithm (SSA) called the Next Reaction Method. Some existing hardware implementations of SSAs have achieved high throughput on reconfigurable devices such as FPGAs, but they lack scalability. The design presented in this work can accommodate increasingly large target biochemical models and can take advantage of the increasing capacity of FPGAs. An interconnection network between the arithmetic circuits and multiple simulation circuits performs data-driven multithreaded simulation. A speedup of approximately 8× was obtained compared with execution on a 2.80 GHz Xeon.
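For readers unfamiliar with the Next Reaction Method (Gibson and Bruck), the sketch below is a compact serial version for mass-action kinetics. It is a software illustration, not the paper's FPGA design. It keeps one absolute putative firing time per reaction and fires the earliest one. Pending times of affected reactions are rescaled by the ratio of old to new propensity, using a dependency graph to limit updates. Two simplifications are assumed. A linear scan replaces the indexed priority queue of the original method. A reaction re-enabled after having zero propensity draws a fresh exponential time, which is still statistically valid because the process is memoryless.

```python
import math
import random


def next_reaction_method(counts, reactions, t_end, seed=0):
    """Simulate mass-action reactions with the Gibson-Bruck Next Reaction Method.

    counts: dict species -> molecule count (every species must be present).
    reactions: list of (rate_constant, reactants, products), where reactants
    and products map species -> stoichiometric coefficient.
    Returns the species counts at time t_end.
    """
    rng = random.Random(seed)
    x = dict(counts)

    def propensity(k):
        rate, reactants, _ = reactions[k]
        a = rate
        for s, n in reactants.items():
            a *= math.comb(x[s], n)  # number of distinct reactant combinations
        return a

    # Dependency graph: firing k can change the propensity of j if k changes
    # the count of any species that j consumes.
    deps = []
    for _, rk, pk in reactions:
        changed = {s for s in set(rk) | set(pk) if rk.get(s, 0) != pk.get(s, 0)}
        deps.append({j for j, (_, rj, _) in enumerate(reactions) if changed & set(rj)})

    a = [propensity(k) for k in range(len(reactions))]
    # Absolute putative firing time of each reaction (inf if it cannot fire).
    tau = [rng.expovariate(ak) if ak > 0 else math.inf for ak in a]

    while True:
        # Linear scan; the original method uses an indexed priority queue here.
        mu = min(range(len(tau)), key=tau.__getitem__)
        if tau[mu] > t_end:
            break
        t = tau[mu]
        _, reactants, products = reactions[mu]
        for s, n in reactants.items():
            x[s] -= n
        for s, n in products.items():
            x[s] += n
        for j in deps[mu] | {mu}:
            a_old, a_new = a[j], propensity(j)
            a[j] = a_new
            if a_new == 0:
                tau[j] = math.inf
            elif j != mu and a_old > 0:
                # Reuse the pending random time by rescaling it.
                tau[j] = t + (a_old / a_new) * (tau[j] - t)
            else:
                # The reaction just fired, or was previously disabled: draw afresh.
                tau[j] = t + rng.expovariate(a_new)
    return x
```

A reversible binding model, A + B ⇌ C, is a convenient check: whatever the random trajectory, A + C and B + C stay constant.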
In this paper, we investigate three different realizations of the same block from different points of view. Two realizations use embedded processors (a custom 16-bit RISC processor and a general soft-core processor), and the third uses Handel-C as an example of a synthesisable high-level abstraction language. The results show that the development time of the complete solution (HW and SW) is approximately the same for the Handel-C design and the soft-core processor design; the development time for the custom 16-bit RISC processor is about five times longer. Moreover, the Handel-C design achieves the highest throughput, measured in bits processed per second. The achieved frequency and occupied area of the Handel-C design depend on the complexity of the program used; nevertheless, the results are comparable to, or even better than, those of the embedded processors.
Multiprocessor systems-on-chip (MPSoCs) are being developed in increasing numbers to support the large number of applications running on modern embedded systems. Designing and programming such systems is a major challenge. Most current design methodologies rely on creating the design by hand and are therefore error-prone and time-consuming. This also limits the number of design points that can be explored. While some efforts have been made to automate the flow and raise the abstraction level, these are still limited to single-application designs. In this paper, we present a design methodology to generate and program MPSoC designs in a systematic and automated way for multiple applications. The architecture is automatically inferred from the application specifications and customized for them. The flow is well suited to fast design space exploration (DSE) in MPSoC systems. We present results of a case study computing the buffer-throughput trade-offs in two real-life applications, the H.263 and JPEG decoders. Generating the entire project takes about 100 ms, and the whole DSE was completed in 45 minutes, including FPGA mapping and synthesis.
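To make "buffer-throughput trade-off" concrete, here is a toy model. The abstract does not name its application model, so the synchronous dataflow (SDF) framing below is an assumption, as are all rates and execution times. A producer writes p tokens per firing into a bounded FIFO, and a consumer reads c tokens per firing. Self-timed execution then shows how buffer capacity limits throughput, down to deadlock when the buffer is too small.

```python
def sdf_throughput(capacity, p=2, c=3, t_p=1.0, t_c=1.0, horizon=1000.0):
    """Self-timed execution of a producer -> FIFO -> consumer SDF edge.

    The producer needs p free slots to start a firing and writes p tokens when
    it finishes; the consumer needs c tokens to start and frees c slots when it
    finishes. Each actor runs at most one firing at a time. Returns consumer
    firings per time unit, or 0.0 if the bounded buffer deadlocks.
    """
    tokens, space = 0, capacity
    t = 0.0
    p_end = c_end = None  # finish time of the firing in progress, if any
    c_firings = 0
    while t <= horizon:
        # Start any firing that is enabled at time t.
        if p_end is None and space >= p:
            space -= p
            p_end = t + t_p
        if c_end is None and tokens >= c:
            tokens -= c
            c_end = t + t_c
        if p_end is None and c_end is None:
            return 0.0  # nothing running and nothing enabled: deadlock
        # Jump to the next firing completion.
        t = min(e for e in (p_end, c_end) if e is not None)
        if p_end == t:
            tokens += p
            p_end = None
        if c_end == t:
            space += c
            c_end = None
            c_firings += 1
    return c_firings / t


def tradeoff(capacities, **kwargs):
    # (capacity, throughput) pairs: the raw data of a buffer/throughput DSE.
    return [(b, sdf_throughput(b, **kwargs)) for b in capacities]
```

With the default rates, a capacity of 3 deadlocks and a capacity of 4 runs at about 0.4 consumer firings per time unit. Large capacities approach the producer-bound limit of 2/3. Sweeping capacities this way is, in miniature, the kind of exploration the case study automates.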
In this paper, we describe a hardware algorithm for the minimum p-quasi clique cover (MPQCC) problem and its implementation on an FPGA. The MPQCC problem is an NP-complete combinatorial optimization problem, and gene expression profile analysis is one of its application areas. We aim to develop an inexpensive FPGA-based acceleration system for gene expression profile analysis. To reduce calculation time, the proposed algorithm adopts a Hopfield neural network. The proposed architecture, which uses a ring network, executes the algorithm efficiently on FPGAs because each module runs independently in parallel, the system can be implemented with simple placement and routing of modules, and it scales well. We show that the proposed method outperforms the existing one in both solution search ability and required calculation time.
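To make the problem statement concrete, the sketch below checks the p-quasi-clique property and builds a feasible cover with a plain greedy heuristic. This is deliberately *not* the paper's Hopfield-network algorithm, and it yields an upper bound rather than a minimum cover. The density definition used is one common convention and is an assumption: a vertex set S is a p-quasi-clique if its induced edge count is at least p·|S|(|S|−1)/2. The paper may define it differently.

```python
from itertools import combinations


def is_quasi_clique(group, adj, p):
    """True if the subgraph induced by `group` has edge density >= p.

    Assumption: density = edges / (k choose 2), and groups of size < 2 count
    as quasi-cliques. adj maps each vertex to the set of its neighbours.
    """
    k = len(group)
    if k < 2:
        return True
    edges = sum(1 for u, w in combinations(group, 2) if w in adj[u])
    return edges >= p * k * (k - 1) / 2


def greedy_quasi_clique_cover(adj, p):
    """Partition the vertices into p-quasi-cliques with a greedy heuristic.

    Not the paper's Hopfield-network method: it returns a feasible cover,
    not necessarily a minimum one.
    """
    uncovered = set(adj)
    cover = []
    while uncovered:
        # Try vertices in order of decreasing degree (ties broken by vertex id).
        order = sorted(uncovered, key=lambda u: (-len(adj[u]), u))
        group = [order[0]]
        for u in order[1:]:
            if is_quasi_clique(group + [u], adj, p):
                group.append(u)
        cover.append(set(group))
        uncovered -= set(group)
    return cover
```

For two disjoint triangles and p = 1.0 (ordinary cliques), the greedy cover recovers the two triangles. Lowering p lets groups absorb sparser neighbourhoods.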
A novel field-programmable gate array (FPGA) logic synthesis technique that determines whether a logic function can be implemented in a given programmable circuit is presented, and it is described how this problem can be formalised and solved using quantified Boolean satisfiability. This technique is general enough to be applied to any type of logic function and programmable circuit; thus, it has many applications to FPGAs. The application demonstrated is the evaluation of FPGA programmable logic blocks, and the results show that this tool allows radical new features of FPGA logic blocks to be evaluated in a rigorous, scientific way.
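The question "can function f be implemented in programmable circuit C?" has the quantified form ∃config ∀x : C(config, x) = f(x). The sketch below decides this formula by brute force for a tiny hypothetical circuit: two cascaded 2-input LUTs computing out = LUT_B(LUT_A(x0, x1), x2). It only shows the ∃∀ structure. The paper uses a quantified Boolean satisfiability formulation and a solver, which is what makes realistic logic blocks tractable.

```python
from itertools import product


def lut2(table, a, b):
    # A 2-input lookup table: `table` holds the outputs for inputs 00, 01, 10, 11.
    return table[(a << 1) | b]


def cascade(config, x):
    # Hypothetical programmable circuit: out = LUT_B(LUT_A(x0, x1), x2).
    # config has 8 bits: 4 for LUT_A followed by 4 for LUT_B.
    return lut2(config[4:], lut2(config[:4], x[0], x[1]), x[2])


def find_configuration(f, circuit=cascade, n_config=8, n_inputs=3):
    """Brute-force  exists config . forall x . circuit(config, x) == f(x).

    Returns a satisfying configuration, or None if f cannot be implemented.
    Exponential in both quantifier blocks; only suitable for toy circuits.
    """
    for config in product((0, 1), repeat=n_config):              # exists config
        if all(circuit(config, x) == f(*x)
               for x in product((0, 1), repeat=n_inputs)):        # forall inputs
            return config
    return None
```

Three-input AND and XOR decompose through the cascade, so a configuration is found. Three-input majority does not. With x2 = 0 it equals x0 AND x1, and with x2 = 1 it equals x0 OR x1. That requires LUT_A to distinguish three classes of (x0, x1) through a single output bit, which is impossible, so the search returns None.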
The benchmark of pricing a European option via Monte Carlo simulation is commonly used in financial engineering to evaluate the performance of new computational techniques and to tune the parameters of the Monte Carlo simulation for improved convergence. This paper compares different FPGA implementations of the European option benchmark against implementations using GPUs, the Cell BE, and a traditional software implementation. Error against a closed-form solution is contrasted with relative acceleration for the different implementations. The FPGA approach gives significant performance advantages compared to the alternatives examined. An acceleration of x compared to a reference software implementation can be obtained using FPGAs, compared to only x for the best non-FPGA alternative. Better error performance than a double-precision floating-point software implementation may also be obtained. In addition, the reconfigurability of an FPGA solution allows trade-offs between acceleration and error that are not possible with the alternative approaches. The FPGA implementations were produced using 'HyperStreams', a high-level abstraction for designing arithmetic pipelines, built on the Handel-C programming language.
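For context, the benchmark itself is small. The sketch below is a plain double-precision software reference, not the HyperStreams pipeline. It prices a European call by sampling terminal prices under risk-neutral geometric Brownian motion, and it compares the result with the Black-Scholes closed form, the kind of closed-form solution the error comparison refers to. The parameter values in the usage note are illustrative.

```python
import math
import random


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form Black-Scholes price of a European call."""
    def norm_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s0 * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def mc_european_call(s0, k, r, sigma, t, n_paths, seed=1):
    """Monte Carlo estimate: discounted mean payoff over simulated terminal prices."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t  # risk-neutral log-price drift
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - k, 0.0)      # call payoff
    return math.exp(-r * t) * total / n_paths
```

With s0 = k = 100, r = 0.05, sigma = 0.2 and t = 1, the closed form gives about 10.4506. A 100,000-path estimate typically lands within a few standard errors of it, and the error shrinks as 1/√n_paths. That convergence rate is why path throughput matters so much for this benchmark.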