Combining multi-processing with the high level of configurability possible with fpga-based soft-processors, this paper presents a multiprocessing framework based on the MicroBlaze soft-processor that provides multicor...
详细信息
ISBN:
(纸本)9781450333153
Combining multi-processing with the high level of configurability possible with fpga-based soft-processors, this paper presents a multiprocessing framework based on the MicroBlaze soft-processor that provides multicore support and fully coherent, independently configurable Level 1 Caches with Linux multicore support. This architecture allows for finegrain configurability of the system, allowing for fpga resources to be better optimized for a specific embedded application. We use our framework to explore the L1 Data Cache configuration, developing a metric for efficiency based on resource usage and static application runtime. We find that a Pseudo-Random replacement policy is consistently the more efficient choice for fpga systems.
Vdd-programmablefpgas have been proposed recently to reduce fpga power, where Vdd levels can be customized for different circuit elements and unused circuit elements can be power-gated. In this paper, we first develo...
详细信息
ISBN:
(纸本)9781595930293
Vdd-programmablefpgas have been proposed recently to reduce fpga power, where Vdd levels can be customized for different circuit elements and unused circuit elements can be power-gated. In this paper, we first develop an accurate fpga power model and then design novel Vdd-programmable interconnect switches with minimum number of configuration SRAM cells. Applying our power model to placed and routed benchmark circuits, we evaluate Vddprogrammablefpga architecture using the new switches. The best architecture in our study uses Vdd-programmable logic blocks and Vdd-gateable interconnects. Compared to the baseline architecture similar to the leading commercial architecture, the best architecture reduces the minimal energy-delay product by 44.14% with 48% area overhead and 3% SRAM cell increase. Our evaluation results also show that LUT size 4 always gives the lowest energy consumption while LUT size 7 always leads to the highest performance for all evaluated architectures. Copyright 2005 acm.
This paper shows a method to verifying the thermal status of complex fpga-based circuits like microprocessors. Thus, the designer can evaluate if a particular block is working beyond specifications. The idea is to ext...
详细信息
This paper shows a method to verifying the thermal status of complex fpga-based circuits like microprocessors. Thus, the designer can evaluate if a particular block is working beyond specifications. The idea is to extract the output frequencies of an array of ring-oscillators previously distributed in the die, taking full advantage of the configuration port capabilities in Xilinx technology. As a result, it is shown that the fpga technology offers the designers of embedded systems the possibility of viewing a detailed thermal map of a circuit at a minimum cost. The verification can be done in actual working conditions;for example with heat sinks and fans attached to the chip, inside the system case, or even in an on-board satellite application. The main results of the work are unthinkable using other alternatives like IR cameras, external sensors, or embedded diodes.
Throughput oriented high level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offer...
详细信息
ISBN:
(纸本)9781450338561
Throughput oriented high level synthesis allows efficient design and optimization using parallel input languages. Parallel languages offer the benefit of parallelism extraction at multiple levels of granularity, offering effective design space exploration to select efficient single core implementations, and easy scaling of parallelism through multiple core instantiations. However, study of high level synthesis for parallel languages has concentrated on optimization of core and on-chip communications, while neglecting platform integration, which can have a significant impact on achieved performance. In this paper, we create an automated flow to perform efficient platform integration for an existing CUDAto-RTL throughput oriented HLS, and we open source the FCUDA tool, platform integration, and benchmark applications. We demonstrate platform integration of 16 benchmarks on two Zynq-based systems in bare-metal and OS mode. We study implementation optimization for platform integration, compare to an embedded GPU (Tegra TK1) and verify designs on a Zedboard Zynq 7020 (bare-metal) and Omnitek Zynq 7045 (OS).
Due to their generic and highly programmable nature, fpgas provide the ability to implement a wide range of applications. However, it is this nonspecific nature that has limited the use of fpgas in scientific applicat...
详细信息
ISBN:
(纸本)1595932925
Due to their generic and highly programmable nature, fpgas provide the ability to implement a wide range of applications. However, it is this nonspecific nature that has limited the use of fpgas in scientific applications that require floating-point arithmetic. Even simple floating-point operations consume a large amount of computational resources. In this paper, we introduce embedding floating-point multiply-add units in an island style fpga. This has shown to have an average area savings of 55.0% and an average increase of 40.7% in clock rate over existing architectures. Copyright 2006 acm.
In this paper we present an implementation of a Cholesky decomposition core, with IEEE754 single precision arithmetic. The datapaths are generated using fused datapath synthesis, created with an experimental floating ...
详细信息
ISBN:
(纸本)9781605584102
In this paper we present an implementation of a Cholesky decomposition core, with IEEE754 single precision arithmetic. The datapaths are generated using fused datapath synthesis, created with an experimental floating point compiler tool, capable of fitting hundreds of floating point operators into a single device. We present a scalable architecture for both real and complex matrixes, on which we will report results for up to 128128 real matrices. The concepts of fused datapath synthesis for fpga floating point designs will be reviewed, and the application to the Cholesky algorithm detailed. Experimental results will be given to show that the accuracy of this method is superior to those expected from a traditional IEEE754 core based design flow. Copyright 2009 acm.
fieldprogrammablegatearrays (fpgas) are being used to provide fast Internet Protocol (IP) packet routing and advanced queuing in a highly scalable network switch. A new module, called the field-programmable Port Ex...
详细信息
ISBN:
(纸本)9781581131932
fieldprogrammablegatearrays (fpgas) are being used to provide fast Internet Protocol (IP) packet routing and advanced queuing in a highly scalable network switch. A new module, called the field-programmable Port Extender (FPX), is being built to augment the Washington University Gigabit Switch (WUGS) with reprogrammable logic. FPX modules reside at the edge of the WUGS switching fabric. Physically, the module is inserted between an optical line card and the WUGS gigabit switch back-plane. The hardware used for this project allows ports of the switch populated with an FPX to operate at rates up to 2.4 Gigabits/second. The aggregate throughput of the system scales with the number of switch ports. Logic on the FPX module is implemented with two fpga devices. The first device is used to interface between the switch and the line card, while the second is used to prototype new networking functions and protocols. The logic on the second fpga can be re-programmed dynamically via control cells sent over the network.
Clock network power in field-programmablegatearrays (FP- ) is considered and two complementary approaches for power reduction in the Xilinx RVirtexTM-5 fpga are. The approaches are unique in that they lever- specifi...
详细信息
ISBN:
(纸本)9781605584102
Clock network power in field-programmablegatearrays (FP- ) is considered and two complementary approaches for power reduction in the Xilinx RVirtexTM-5 fpga are. The approaches are unique in that they lever- specific architectural aspects of Virtex-5 to achieve re- in dynamic power consumed by the clock network. first approach comprises a placement-based technique reduce interconnect resource usage on the clock network, reducing capacitance and power (up to 12%). The approach borrows the "clock gating" notion from the domain and applies it to fpgas. Clock enable sig- on flip-flops are selectively migrated to use the dedi- clock enable available on the fpga's built-in clock, leading to reduced toggling on the clock intercon- and lower power (up to 28%). Power reductions are achieved without any performance penalty, on average. Copyright 2009 acm.
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to FP...
详细信息
C-slow retiming is a process of automatically increasing the throughput of a design by enabling fine grained pipelining of problems with feedback loops. This transformation is especially appropriate when applied to fpga designs because of the large number of available registers. To demonstrate and evaluate the benefits of C-slow retiming, we constructed an automatic tool which modifies designs targeting the Xilinx Virtex family of fpgas. Applying our tool to three benchmarks: AES encryption. Smith/Waterman sequence matching, and the LEON 1 synthesized microprocessor core, we were able to substantially increase the total throughput. For some parameters, throughput is effectively doubled.
Reconfigurable computing can provide a significant speed-up factor to cryptographic and error correcting code algorithms. Finite field arithmetic is essential to both, but is difficult to implement efficiently. Finite...
详细信息
ISBN:
(纸本)9781595936004
Reconfigurable computing can provide a significant speed-up factor to cryptographic and error correcting code algorithms. Finite field arithmetic is essential to both, but is difficult to implement efficiently. Finite field instruction set extensions and a reconfiguration framework have been constructed to enable a finite field multiplier to be regenerated via software control. A performance evaluation has been created by generating a Finite field Extensions Unit with MicroBlaze processor in a Xilinx Virtex(2)Pro fpga. By utilizing the in-system partial reconfiguration capability, the finite field multiplier can be customized to a particular size and definition. With a customized GF(2(163)) multiplier, a speed-up factor of 1530x has been demonstrated versus execution of the same algorithm on the MicroBlaze processor alone.
暂无评论