ISBN: 9781595939340 (print)
To improve FPGA performance for arithmetic circuits, this paper proposes a new architecture for FPGA logic cells that includes a 6:2 compressor. The new cell features additional fast carry chains that concatenate adjacent compressors and can be routed locally without the global routing network. Unlike previous carry chains for binary and ternary addition, the carry chain used by the new cell spans only 2 logic blocks, which significantly improves the delay of multi-input addition operations mapped onto the FPGA. The delay and area overhead that arises from augmenting a traditional FPGA logic cell with the new compressor structure is minimal. Using this new cell, we observed an average speedup in combinational delay of 1.41x compared to adder trees synthesized using ternary adders. Copyright 2008 ACM.
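The idea behind compressor-based multi-input addition can be illustrated with a word-level sketch. The code below is not the 6:2 compressor cell proposed in the paper; it uses the simpler, textbook 3:2 carry-save step as a stand-in, and the function names are hypothetical. It shows how several operands are reduced to two without propagating carries, so that only one carry-propagate addition is needed at the end.

```python
def carry_save_step(a, b, c):
    """3:2 compression: three operands in, two operands out, same total."""
    partial_sum = a ^ b ^ c                       # bitwise sum without carries
    carries = ((a & b) | (a & c) | (b & c)) << 1  # majority bits, shifted to their weight
    return partial_sum, carries


def multi_operand_add(operands):
    """Reduce non-negative integers to two operands, then add them once."""
    values = list(operands)
    while len(values) > 2:
        a, b, c = values.pop(), values.pop(), values.pop()
        values.extend(carry_save_step(a, b, c))
    # The only carry-propagate addition happens here.
    return sum(values)


if __name__ == "__main__":
    print(multi_operand_add([3, 5, 7, 9, 11, 13]))
```

Each compression step preserves the total, which is why the final two-operand addition yields the same result as adding all operands directly.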
ISBN: 9781595939340 (print)
A design tool for routing channel segmentation in island-style FPGAs is presented. Given the FPGA architecture parameters and a set of benchmark designs, the tool optimizes routing channel segmentation using the average interconnect power-delay product as a performance metric, which is estimated from placed and routed designs. A simulated-annealing procedure is used, whereby segmentation is incrementally changed in each iteration, the benchmark designs are mapped using VPR, and the performance metric is computed to decide whether to accept or reject the new segmentation. Run time is significantly reduced by using incremental routing in each iteration and parallelizing the metric evaluation. Experimental results using the MCNC benchmark designs demonstrate average reductions of 22% in delay and 15% in power relative to a baseline segmentation. The results also show that average segment length should decrease with technology scaling. Finally, we demonstrate how the tool can be used to optimize other aspects of programmable routing in an FPGA. Copyright 2008 ACM.
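The accept-or-reject loop described in this abstract can be sketched as a generic simulated-annealing routine. This is a minimal illustration under stated assumptions, not the tool's implementation: the segmentation is modeled as a tuple of track counts per segment-length class, `toy_cost` is a made-up placeholder for the power-delay product that the real tool obtains from VPR place-and-route runs, and the Metropolis acceptance rule and cooling parameters are common defaults rather than values from the paper.

```python
import math
import random


def anneal(initial, propose, cost, t_start=1.0, t_end=1e-3,
           alpha=0.95, moves_per_t=20, seed=0):
    """Generic simulated annealing; returns the best state seen and its cost."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            candidate = propose(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t) (Metropolis rule).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha  # geometric cooling
    return best, best_cost


def propose_segmentation(seg, rng):
    """Hypothetical move: shift one track between segment-length classes."""
    seg = list(seg)
    src = rng.choice([i for i, n in enumerate(seg) if n > 0])
    dst = rng.choice([i for i in range(len(seg)) if i != src])
    seg[src] -= 1
    seg[dst] += 1
    return tuple(seg)


def toy_cost(seg):
    """Placeholder metric (assumption); the real tool measures routed designs."""
    target = (10, 6, 4)
    return sum((n - t) ** 2 for n, t in zip(seg, target))


if __name__ == "__main__":
    print(anneal((20, 0, 0), propose_segmentation, toy_cost))
```

In the actual flow the cost evaluation dominates run time, which is why the abstract emphasizes incremental routing and parallel metric evaluation.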
ISBN: 9781595939340 (print)
Stochastic simulations and other scientific applications that depend on random numbers are increasingly implemented in a parallelized manner in programmable logic. High-quality pseudo-random number generators (PRNGs), such as the Mersenne Twister, are often based on binary linear recurrences and have extremely long periods (more than 2^1024). Many software implementations of such PRNGs exist, but hardware implementations are rare. We have developed an optimized, resource-efficient parallel framework for this class of random number generators that exploits the underlying algorithm as well as FPGA-specific architectural features. The framework also incorporates fast "jump-ahead" capability for these PRNGs, allowing simultaneous, independent sub-streams to be generated in parallel by partitioning one long-period pseudorandom sequence. We demonstrate parallelized implementations of three types of PRNGs - the 32-, 64- and 128-bit SIMD Mersenne Twister - on Xilinx Virtex-II Pro FPGAs. Their area/throughput performance is impressive: for example, compared clock-for-clock with a previous FPGA implementation, a "two-parallelized" 32-bit Mersenne Twister uses 41% fewer resources. It can also scale to 350 MHz for a throughput of 22.4 Gbps, which is 5.5x faster than the older FPGA implementation and 7.1x faster than a dedicated software implementation. The quality of the generated random numbers is verified with the standard statistical test batteries Diehard and TestU01. We also present two real-world application studies with multiple RNG streams: the Ziggurat method for generating normal random variables and a Monte Carlo photon-transport simulation. The availability of fast long-period random number generators with multiple streams accelerates hardware-based scientific simulations and allows them to scale to greater complexities. Copyright 2008 ACM.
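Jump-ahead for a binary linear recurrence can be illustrated on a much smaller generator. The sketch below assumes a 32-bit xorshift generator as a stand-in for the Mersenne Twister (both are linear over GF(2)) and uses plain matrix exponentiation by repeated squaring; the paper's framework may use a different method, and all function names here are hypothetical. Because one step is a linear map, advancing k steps amounts to applying the k-th power of the step matrix.

```python
MASK = 0xFFFFFFFF


def step(x):
    """One xorshift32 step; every operation is linear over GF(2)."""
    x ^= (x << 13) & MASK
    x ^= x >> 17
    x ^= (x << 5) & MASK
    return x


def apply(matrix, v):
    """Multiply a 32x32 bit matrix (list of column words) by bit vector v."""
    out, j = 0, 0
    while v:
        if v & 1:
            out ^= matrix[j]
        v >>= 1
        j += 1
    return out


def compose(a, b):
    """Matrix of the map 'apply b, then a'."""
    return [apply(a, col) for col in b]


def power(matrix, k):
    """matrix**k over GF(2) by repeated squaring."""
    result = [1 << j for j in range(32)]  # identity matrix
    while k:
        if k & 1:
            result = compose(matrix, result)
        matrix = compose(matrix, matrix)
        k >>= 1
    return result


# Column j is the image of the j-th unit vector under one step.
STEP_MATRIX = [step(1 << j) for j in range(32)]


def jump(x, k):
    """State reached after k steps, without iterating k times."""
    return apply(power(STEP_MATRIX, k), x)


if __name__ == "__main__":
    seed = 2463534242
    # Starting states for four sub-streams spaced 2**20 steps apart.
    print([jump(seed, i << 20) for i in range(4)])
```

Spacing sub-streams by a fixed jump distance partitions one long sequence into non-overlapping segments, provided no stream consumes more values than the spacing.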
ISBN: 9781605580487 (print)
The placement problem for structured ASICs combines aspects of the standard cell ASIC placement problem and FPGA placement. Similarities with ASIC placement include the number and size of the placeable objects and the need to consider buffering within placement. Similarities with FPGA placement include the existence of discrete legal locations for all types of objects, the constraints caused by "intrinsic" connections such as clock, reset, or I/O signals, and fixed routing tracks. The research community has provided detailed analysis of various solutions for the standard cell placement problem over the last two decades. FPGA placement research has not focused on the legalization issues. Architecturally, FPGAs are changing to focus more on synthesis and clustering than on fine-grained placement to meet timing. In this paper we discuss the similarities and differences between FPGA, standard cell, and structured ASIC placement, and we present new representations and test cases for the structured ASIC problem.
ISBN: 9781605581095 (print)
The design process for chip multiprocessors (CMPs) requires extremely long simulation times to explore performance, power, and thermal issues, particularly when operating system (OS) effects are included. In response, our novel FPGA-based emulation methodology models a full CMP design including applications and an OS. Activity counters programmed into the cores feed per-component microarchitectural power models. These models achieve under 10% error compared to detailed gate-level simulations. Our method retains software flexibility but offers up to 35x speedup compared to full-system software simulations. We present our approach by emulating a 2-core Leon3 cache-coherent multiprocessor running Linux and parallel benchmarks. In an example case study, our emulated system uses activity counts (a proxy for temperature) to guide process migration between the CMP cores. Overall, this paper's methodology makes possible detailed power and thermal studies of CMPs and their operating systems. Copyright 2008 ACM.
ISBN: 9780769531175 (print)
Portable embedded SoC processor architects are constantly challenged by exponentially increasing demand for newer functionality, faster real-time communication, stronger security, and higher reliability, while the constraints on energy, feature size, NRE cost, and time-to-market (TTM) grow tighter than ever. Existing approaches that attempt to achieve these mutually conflicting design goals rely heavily on special-purpose accelerators (SPAs) to do the heavy lifting in the target embedded SoC designs. These SPAs, implemented in either ASIC or FPGA technology, are usually attached to the base processor as co-processors to execute the performance-critical regions of applications. ASIC-based SPAs achieve performance-energy efficiency at the expense of post-manufacturing programmability while incurring large NRE and TTM; FPGA-based SPAs retain programmability at the expense of a significant increase in energy and area. Furthermore, attaching these SPAs as co-processors adds considerable communication and synchronization overhead, severely compromising their initially promised benefits. This paper proposes an innovative design paradigm that moves away from the common scheme of adding co-processing ASIC/FPGA SPAs to an integrated and reconfigurable design. Specifically, we propose a new class of embedded processor by replacing the processor's conventional ALU with a more powerful and flexible Versatile Processing Unit (VPU). The VPU enables multiple interdependent instructions to be fused and processed together as a single atomic VPU instruction by exploiting dataflow dependencies of the application code. The instruction fusion is automatically performed by a VPU-aware compiler. The optimized VPU code reduces code size and amplifies the effective processor bandwidth and capacity by eliminating transient computation and register spill code. Experimental results show speedups of up to 400%, and 150% on average, for MediaBench with negligible area increase.
ISBN: 1595932925 (print)
The proceedings contain 22 papers. The topics discussed include: embedded floating-point units in FPGAs; measuring the gap between FPGAs and ASICs; optimality study of logic synthesis for LUT-based FPGAs; improvements to technology mapping for LUT-based FPGAs; improving performance and robustness of domain-specific CPLDs; design, implementation, and verification of active cache emulator (ACE); modeling and data-dependent performance of pattern-matching architectures; yield enhancements of design-specific FPGAs; FPGA clock network architecture: flexibility vs. area and power; a reconfigurable hardware based embedded scheduler for buffered crossbar switches; and combining module selection and resource sharing for efficient FPGA pipeline synthesis.
ISBN: 9781595936004 (print)
A reconfigurable logic element (LE) is developed for use in constructing a NULL Convention Logic (NCL) FPGA. It can be configured as any of the 27 fundamental NCL gates, including resettable and inverting variations, and can utilize embedded registration for gates with three or fewer inputs. The developed LE is compared with a previous NCL LE, showing that the one developed herein yields a more area-efficient NCL circuit implementation. The NCL FPGA logic element is simulated at the transistor level using the 1.8 V, 180 nm TSMC CMOS process.
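For readers unfamiliar with NCL gates, the behavior the logic element must reproduce can be sketched in a few lines. This is a behavioral model of a plain, unweighted m-of-n threshold gate with hysteresis, which is the basic NCL gate type; it does not model the weighted, resettable, or inverting variants mentioned above, nor the LE's circuit, and the class name is hypothetical.

```python
class ThresholdGate:
    """Behavioral model of an unweighted NCL THmn gate with hysteresis."""

    def __init__(self, threshold, n_inputs):
        if not 1 <= threshold <= n_inputs:
            raise ValueError("threshold must be between 1 and n_inputs")
        self.threshold = threshold
        self.n_inputs = n_inputs
        self.output = 0  # assume the gate starts de-asserted

    def update(self, inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        asserted = sum(1 for bit in inputs if bit)
        if asserted >= self.threshold:
            self.output = 1   # set once the threshold is met
        elif asserted == 0:
            self.output = 0   # reset only when every input is de-asserted
        # Otherwise hold the previous output (hysteresis).
        return self.output


if __name__ == "__main__":
    gate = ThresholdGate(threshold=2, n_inputs=3)  # a TH23 gate
    for pattern in ([1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0]):
        print(pattern, gate.update(pattern))
```

The hold state is what distinguishes these gates from ordinary combinational threshold functions and is why a configurable LE needs state-holding capability.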
ISBN: 9781595936004 (print)
This paper describes a technique that reduces dynamic power in FPGAs by reducing the number of glitches in the global routing resources. The technique involves adding programmable delay elements within the logic blocks of an FPGA to align the arrival times of early-arriving signals at the inputs of the lookup tables and to filter out glitches generated by earlier circuitry. On average, the proposed technique eliminates 91% of the glitching, which reduces overall FPGA power by 18%. The added circuitry increases overall area by 5% and critical-path delay by less than 1%. Furthermore, since it is applied after routing, the proposed technique requires no modifications to the existing FPGA routing architecture or CAD flow.