ISBN: 1601320647 (print)
FPGAs have been used as hardware accelerators for many scientific applications in recent years. This paper investigates the performance of an FPGA hardware accelerator with cell units capable of floating-point operation through a case study of the Dirichlet Boundary Problem (DBP). We concentrate on accelerator performance when real-time results are updated in PC memory. An FPGA architecture for the DBP application is designed and implemented on an FPGA computing card with a Xilinx XC4VLX100 chip. A performance model is established for the FPGA implementation based on the communication time for data sharing between the host PC and the FPGA and the execution time within the FPGA accelerator. Experiment environments and hardware resource utilization are discussed. Finally, the model is analyzed and verified to find the optimum performance.
ISBN: 1601320647 (print)
Today most research involving the Advanced Encryption Standard (AES) algorithm falls into one of three areas: ultra-high-speed encryption, very low power consumption, and algorithm integrity. The problem that this study addresses is how to lower the power consumption of an FPGA-based encryption core without dramatically affecting its performance. Three designs are proposed that use direct routing optimizations rather than depending on compilation software to find connection optimizations. Analysis of the designs shows that they reduce logic power by 55% and signal power by 112% on a Xilinx Spartan 3 FPGA operating at 25 MHz.
ISBN: 1601320647 (print)
In a preemptive multitasking environment, when a task is preempted, the necessary state information must be correctly preserved so that the task can be resumed later. For hardware tasks executing on a coarse-grained dynamically reconfigurable processing array (DRPA), a large amount of state data is usually distributed across many different storage elements. In addition, DRPAs have different architectures and use a variety of development tools. This paper addresses these problems and proposes a method for capturing the state data of hardware tasks. Based on resource usage analysis, algorithms are proposed for identifying preemption points and inserting preemption states subject to a user-specified preemption latency. The integration of the proposed steps into the system design flow is also discussed. The performance degradation caused by preemption is minimized by allowing preemption only at predefined points where the demanded resources are small. The evaluation result, using a model based on the NEC Electronics DRP-1, shows that the proposed method can allow preemption for a given task while satisfying a given preemption latency with reasonable hardware overhead (from 6% to 15%).
ISBN: 1932415424 (print)
Recently, a new reconfigurable device based on the concept of "self-reconfiguration" has been proposed, enabled by recent advances in semiconductor technology. In this paper we propose a distributed control model and its design for adaptive load distribution as an application of the concept. The circuits of the model are designed as objects for a prototype device named PCA-1 and verified by simulation.
ISBN: 1932415424 (print)
The timing analysis for reconfiguration and execution of an optical differential reconfigurable gate array (ODRGA) for a dynamically reconfigurable processor was discussed. The reconfiguration timing was analyzed using an ODRGA-VLSI chip fabricated in a 0.35 μm CMOS process. The reconfiguration time of the ODRGA-VLSI chip was separated into three parts: the refresh time, the detection time of the optical reconfiguration context, and the response time of the differential reconfiguration circuits. It was found that the total time period necessary for optical reconfiguration and subsequent execution of the implemented circuit was 14.4 ns at 69 MHz.
ISBN: 1601320647 (print)
There is a significant demand for embedding high performance reconfigurable cores within future system on chip (SoC) designs as such cores offer flexibility as well as superior performance advantages in terms of speed, power and run time reconfigurability. However, one obstacle to the adoption of such cores within SoC designs is lack of know-how on how to model such high performance cores such that they can be exploited by electronic system level (ESL) design tools and methodologies. This paper proposes a new multi-context representation in order to capture reconfigurable features, within different embedded reconfigurable cores, at different run time periods. A co-simulation framework is proposed as part of the design and modelling procedure. A hybrid SystemC-HDL co-simulation scenario is utilised in order to demonstrate interoperability of the designed reconfigurable hardware block. As a case study, we utilise our modelling strategy in order to design the crucial WiMAX standard compliant telecommunication receiver where two new reconfigurable fabrics are employed for the implementation of the performance bottlenecks within this standard, namely FFT and Viterbi processing tasks. We demonstrate how our modelling approach can be utilised to capture complete system functionality and provide performance and power consumption figures for system design.
ISBN: 1601320647 (print)
Triple modular redundancy (TMR) is a popular technique for mitigating single-event upsets (SEUs) in FPGAs. Traditional TMR, however, is only designed to protect against a single fault at a time. TMR with more frequent voting (also called partitioned TMR) can provide improved reliability by giving protection against more than a single upset at a time. This paper implements partitioned TMR within an FPGA and demonstrates the improvements in reliability provided by this technique through fault injection. The results of these experiments demonstrate significant improvements in reliability when large numbers of upsets occur. Arbitrarily increasing the number of partitions, however, provides diminishing returns, as the reliability of the voters becomes dominant with large numbers of partitions.
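The difference between traditional and partitioned TMR can be illustrated with a small software model. The sketch below is not the paper's FPGA implementation: the names `majority`, `run_tmr`, `add_one`, and `times_three` are hypothetical, each "stage" stands in for one logic partition, an upset is modeled as a single flipped output bit, and the voters are assumed to be fault-free (which the abstract notes is not true in practice).

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(stages, x, upsets, partitioned):
    """Run three replicas of a pipeline and return the voted result.

    stages:      list of int -> int functions (hypothetical partitions)
    upsets:      set of (stage_index, replica_index) pairs; the output of
                 that replica at that stage has bit 0 flipped
    partitioned: vote after every stage if True, only at the end otherwise
    """
    values = [x, x, x]
    for i, stage in enumerate(stages):
        values = [stage(v) for v in values]
        for r in range(3):
            if (i, r) in upsets:
                values[r] ^= 1  # inject a single-bit upset
        if partitioned:
            voted = majority(*values)
            values = [voted, voted, voted]  # resynchronize the replicas
    return majority(*values)


def add_one(v: int) -> int:
    return v + 1


def times_three(v: int) -> int:
    return v * 3


if __name__ == "__main__":
    pipeline = [add_one, times_three]
    # Two upsets, in different replicas and different partitions.
    faults = {(0, 0), (1, 1)}
    print("fault-free :", run_tmr(pipeline, 4, set(), partitioned=False))
    print("single vote:", run_tmr(pipeline, 4, faults, partitioned=False))
    print("partitioned:", run_tmr(pipeline, 4, faults, partitioned=True))
```

With a single final voter, two upsets in different replicas leave only one correct replica and the vote can fail; voting after each partition masks the first upset before the second one occurs.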
ISBN: 1601320647 (print)
In this paper, we propose an FPGA-based scalable architecture for DCT computation using dynamic partial reconfiguration. Our architecture can achieve quality scalability using dynamic partial reconfiguration. This is important for some critical applications that need continuous hardware servicing. Our scalable architecture has two features. First, the architecture can perform DCT computations for eight different zones, i.e., from 1×1 DCT to 8×8 DCT. Second, the architecture can change the configuration of processing elements to trade off the precision of DCT coefficients against computational complexity. Using dynamic partial reconfiguration with 2.1 MB bitstreams, 16 distinct hardware architectures can be implemented. We show experimental results and comparisons between different configurations using both partial and non-partial reconfiguration processes.
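One way to read the zone-based scalability is that only the top-left k×k low-frequency coefficients of an 8×8 block are computed, with k between 1 and 8. The sketch below follows that reading, which is an assumption: the abstract does not define its zones precisely, and `dct_zone` is a hypothetical software reference, not the paper's hardware architecture.

```python
import math

N = 8  # block size of the full transform


def dct_zone(block, k):
    """Return the top-left k x k zone of the orthonormal 2-D DCT-II
    of an N x N block (k = 1 gives only the DC term, k = N the full DCT)."""
    if not 1 <= k <= N:
        raise ValueError("zone size k must be between 1 and N")

    def alpha(u):
        # Orthonormal scaling factor for frequency index u.
        return math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)

    zone = [[0.0] * k for _ in range(k)]
    for u in range(k):
        for v in range(k):
            acc = 0.0
            for x in range(N):
                for y in range(N):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            zone[u][v] = alpha(u) * alpha(v) * acc
    return zone


if __name__ == "__main__":
    flat = [[16.0] * N for _ in range(N)]  # constant block: energy only in DC
    for row in dct_zone(flat, 2):
        print([round(c, 6) for c in row])
```

Smaller zones need fewer multiply-accumulate operations (k² coefficients instead of 64), which is the kind of quality-versus-complexity trade the abstract describes.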
ISBN: 1932415424 (print)
Real-time embedded systems are being built more and more as systems on a Programmable Chip. Better predictability and faster execution times can be achieved by moving functions traditionally implemented in software to specialized hardware. The Task Resource Matrix developed at Brigham Young University gives hardware more control in administration of tasks and communication primitives by providing mutual exclusion, synchronization, and task operating state. When coupled with a hardware scheduler and the Real-Time Processor, the Task Resource Matrix can be configured to support specific real-time applications.
ISBN: 1601320647 (print)
There is increasing demand for fast floating-point arithmetic support to make Field Programmable Gate Arrays (FPGAs) a practical option for scientific computing applications. In existing FPGA-based double-precision floating-point division approaches, the operational frequency of the design is bounded by the performance of the mantissa stage. In earlier work we introduced a new algorithm for mantissa multiplication to implement an IEEE-754 compliant multiplier. In this paper we apply the same algorithm to the division-by-convergence technique to implement the division operation. The division algorithm reaches a maximum operational frequency of 256 MHz on the Virtex-4 platform, which outperforms algorithm-based and IP-core-based solutions from academia as well as Xilinx LogiCORE solutions when no embedded resources are used. Operational frequency is improved by 153%, 48% and 20% compared to the algorithm-based and IP-core-based solutions from academia and the Xilinx LogiCORE solutions, respectively. The area overhead is larger than that of IP-core-based solutions, but we rely on the capacity trend of contemporary FPGAs.
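Division by convergence is commonly realized as a Goldschmidt-style iteration, in which numerator and denominator are repeatedly multiplied by the same factor until the denominator converges to 1. The sketch below shows that general idea in floating point. It is an assumption that the paper uses this formulation; `goldschmidt_divide` is a hypothetical name, and the paper's mantissa multiplier and fixed-point details are not modeled.

```python
def goldschmidt_divide(n: float, d: float, iterations: int = 5) -> float:
    """Approximate n / d using only multiplications and subtractions.

    d must already be scaled into [0.5, 1), as a normalized mantissa
    would be after a one-bit shift.
    """
    if not 0.5 <= d < 1.0:
        raise ValueError("d must be pre-scaled into [0.5, 1)")
    for _ in range(iterations):
        f = 2.0 - d  # correction factor
        n *= f       # numerator converges toward the quotient
        d *= f       # denominator converges quadratically toward 1
    return n


if __name__ == "__main__":
    print(goldschmidt_divide(0.6, 0.75))  # approximates 0.6 / 0.75
```

Because the error term 1 - d is squared on every pass, a handful of iterations suffices, and each iteration consists of two independent multiplications, which is why the speed of the mantissa multiplier dominates the achievable clock frequency.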