ISBN: 9781595939340 (print)
The proceedings contain 24 papers. The topics discussed include: designing with extreme parallelism; high-quality, deterministic parallel placement for FPGAs on commodity hardware; enforcing long-path timing closure for FPGA routing with path searches on clamped lexicographic spirals; mapping for better than worst-case delays in LUT-based FPGA designs; a complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs; efficient ASIP design for configurable processors with fine-grained resource sharing; pattern-based behavior synthesis for FPGA resource reduction; modeling routing demand for early-stage FPGA architecture development; and a trace-based framework for concurrent development of process and FPGA architecture considering process variation and reliability.
ISBN: 9781595939340 (print)
The Field Programmable Counter Array (FPCA) was introduced to improve FPGA performance for arithmetic circuits. An FPCA is a reconfigurable IP core that can be integrated into an FPGA. To exploit the FPCA, a circuit is transformed by merging disparate addition and multiplication operations into large multi-input addition operations, which are synthesized as compressor trees on the FPCA; the remaining portion of the circuit is synthesized on the FPGA. This paper presents a series of architectural improvements to the FPCA that reduce routing delay, increase flexibility and component utilization, and simplify the integration process. Using an FPGA containing six FPCAs, we observed average and maximum speedups of 1.60x and 2.40x on a set of arithmetic benchmarks.
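The idea of replacing many separate additions with one multi-input addition can be illustrated in software. The following is a minimal sketch of a carry-save compressor tree built from 3:2 compressors, assuming non-negative integer operands; the function names are hypothetical, and the FPCA's actual counters and their organization are not described in the abstract.

```python
def compress_3_to_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Carry-save step: reduce three operands to a sum word and a carry word.

    The two results always satisfy sum_word + carry_word == a + b + c.
    """
    sum_word = a ^ b ^ c                             # bitwise sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # carries, shifted to their weight
    return sum_word, carry_word


def compressor_tree_sum(operands: list[int]) -> int:
    """Add non-negative integers using 3:2 compression and one final addition."""
    words = list(operands)
    while len(words) > 2:
        next_words = []
        # Compress every complete group of three words into two words.
        for i in range(0, len(words) - 2, 3):
            next_words.extend(compress_3_to_2(words[i], words[i + 1], words[i + 2]))
        # Carry over the one or two words that did not form a full group.
        leftover = len(words) % 3
        if leftover:
            next_words.extend(words[-leftover:])
        words = next_words
    # A single carry-propagate addition finishes the job.
    return sum(words)
```

Only the last step needs a full carry-propagate adder, which is why merging additions into one multi-input operation tends to shorten the critical path.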
ISBN: 9781595939340 (print)
To improve FPGA performance for arithmetic circuits, this paper proposes a new architecture for FPGA logic cells that includes a 6:2 compressor. The new cell features additional fast carry chains that concatenate adjacent compressors and can be routed locally without the global routing network. Unlike previous carry chains for binary and ternary addition, the carry chain used by the new cell spans only two logic blocks, which significantly improves the delay of multi-input addition operations mapped onto the FPGA. The delay and area overhead that arises from augmenting a traditional FPGA logic cell with the new compressor structure is minimal. Using this new cell, we observed an average speedup in combinational delay of 1.41x compared to adder trees synthesized using ternary adders. Copyright 2008 ACM.
ISBN: 9781595939340 (print)
We consider packing in the commercial FPGA context and examine the speed, performance and power trade-offs associated with packing in a state-of-the-art FPGA, the Xilinx® Virtex™-5 FPGA. Two aspects of packing are discussed: 1) packing for general logic blocks, and 2) packing for large IP blocks. Virtex-5 logic blocks contain dual-output 6-input look-up tables (LUTs). Such LUTs can implement any single logic function requiring no more than 6 inputs, or any two logic functions requiring no more than 5 distinct inputs. The second LUT output is associated with slower speed and therefore must be used judiciously. We present placement-based techniques for dual-output LUT packing that lead to improved area efficiency and power, with minimal performance degradation. We then move on to address packing for large IP blocks, specifically block RAMs and DSPs. We present a packing optimization that is widely applicable in DSP designs and leads to significantly improved design performance.
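The input-count rule for dual-output LUTs stated above can be written as a small feasibility check. This is only a sketch of that one constraint, assuming each function is described by the set of its input signal names; the function name and signal names are hypothetical, and the placement-based packing techniques of the paper are not reproduced here.

```python
def can_share_dual_output_lut(inputs_a: set[str], inputs_b: set[str],
                              max_distinct_inputs: int = 5) -> bool:
    """Return True if two functions fit in one dual-output LUT.

    Per the abstract, two functions may share a LUT when together they
    require no more than 5 distinct inputs. Shared inputs count once.
    """
    return len(inputs_a | inputs_b) <= max_distinct_inputs


# Two 3-input functions sharing one signal need 5 distinct inputs: feasible.
print(can_share_dual_output_lut({"a", "b", "c"}, {"c", "d", "e"}))
# Two 3-input functions with no shared signal need 6 distinct inputs: not feasible.
print(can_share_dual_output_lut({"a", "b", "c"}, {"d", "e", "f"}))
```

Passing this check says only that a pairing is legal; whether it is worthwhile also depends on the slower second output noted in the abstract.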
ISBN: 9781450338561 (print)
The Second International Workshop on Overlay Architectures for FPGAs is held in Monterey, California, USA, on February 21, 2016, and is co-located with FPGA 2016: the 24th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. The main objective of the workshop is to examine how overlay architectures can help address the challenges and opportunities provided by FPGA-based reconfigurable computing. The workshop provides a venue for researchers to present and discuss the latest developments in FPGA overlay architecture and related areas. We have assembled a program of six refereed papers and a panel discussion with prominent experts in the field.
ISBN: 9781595939340 (print)
A design tool for routing channel segmentation in island-style FPGAs is presented. Given the FPGA architecture parameters and a set of benchmark designs, the tool optimizes routing channel segmentation using the average interconnect power-delay product as a performance metric, which is estimated from placed and routed designs. A simulated-annealing procedure is used, whereby segmentation is incrementally changed in each iteration, the benchmark designs are mapped using VPR, and the performance metric is computed to decide whether to accept or reject the new segmentation. Run time is significantly reduced by using incremental routing in each iteration and parallelizing the metric evaluation. Experimental results using the MCNC benchmark designs demonstrate average reductions of 22% in delay and 15% in power relative to a baseline segmentation. The results also show that average segment length should decrease with technology scaling. Finally, we demonstrate how the tool can be used to optimize other aspects of programmable routing in an FPGA. Copyright 2008 ACM.
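The accept-or-reject loop described above follows the usual simulated-annealing pattern. The sketch below shows that pattern under strong simplifying assumptions: the "segmentation" is a single integer segment length, and `toy_power_delay_product` is a made-up stand-in for the metric that the real tool obtains by placing and routing benchmarks with VPR. All names, the cost function, and the cooling schedule are hypothetical.

```python
import math
import random


def anneal(initial, neighbor, cost, temperature=1.0, cooling=0.95,
           iterations=200, seed=0):
    """Generic simulated annealing: perturb, evaluate, then accept or reject."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for _ in range(iterations):
        candidate = neighbor(current, rng)      # incremental change
        candidate_cost = cost(candidate)        # evaluate the metric
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling
    return best, best_cost


def toy_power_delay_product(segment_length: int) -> float:
    """Hypothetical cost; NOT derived from the paper or from VPR."""
    return (segment_length - 4) ** 2 + 10.0


def neighbor_length(segment_length: int, rng: random.Random) -> int:
    """Change the segment length by one, clamped to the range 1..16."""
    return min(16, max(1, segment_length + rng.choice((-1, 1))))


best_length, best_cost = anneal(12, neighbor_length, toy_power_delay_product)
```

In the tool described by the abstract, the expensive part is the cost evaluation, which is why incremental routing and parallel metric evaluation matter for run time.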
ISBN: 9781595939340 (print)
Stochastic simulations and other scientific applications that depend on random numbers are increasingly implemented in a parallelized manner in programmable logic. High-quality pseudo-random number generators (PRNGs), such as the Mersenne Twister, are often based on binary linear recurrences and have extremely long periods (more than 2^1024). Many software implementations of such PRNGs exist, but hardware implementations are rare. We have developed an optimized, resource-efficient parallel framework for this class of random number generators that exploits the underlying algorithm as well as FPGA-specific architectural features. The framework also incorporates fast "jump-ahead" capability for these PRNGs, allowing simultaneous, independent sub-streams to be generated in parallel by partitioning one long-period pseudorandom sequence. We demonstrate parallelized implementations of three types of PRNGs (the 32-, 64- and 128-bit SIMD Mersenne Twister) on Xilinx Virtex-II Pro FPGAs. Their area/throughput performance is impressive: for example, compared clock-for-clock with a previous FPGA implementation, a "two-parallelized" 32-bit Mersenne Twister uses 41% fewer resources. It can also scale to 350 MHz for a throughput of 22.4 Gbps, which is 5.5x faster than the older FPGA implementation and 7.1x faster than a dedicated software implementation. The quality of the generated random numbers is verified with the standard statistical test batteries Diehard and TestU01. We also present two real-world application studies with multiple RNG streams: the Ziggurat method for generating normal random variables and a Monte Carlo photon-transport simulation. The availability of fast long-period random number generators with multiple streams accelerates hardware-based scientific simulations and allows them to scale to greater complexities. Copyright 2008 ACM.
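Jump-ahead is possible for this class of generators because each step is a linear map over GF(2), so advancing k steps is the k-th power of that map. The sketch below demonstrates the principle on a toy 16-bit xorshift recurrence using dense matrix exponentiation. The recurrence, its shift constants, and all function names are illustrative assumptions; the abstract does not say how the framework computes jumps, and a dense matrix would be impractical for a state as large as the Mersenne Twister's.

```python
STATE_BITS = 16
MASK = (1 << STATE_BITS) - 1


def step(state: int) -> int:
    """One step of a toy 16-bit xorshift recurrence (linear over GF(2))."""
    state ^= (state << 7) & MASK
    state ^= state >> 9
    state ^= (state << 8) & MASK
    return state


def apply(matrix: list[int], state: int) -> int:
    """Multiply a GF(2) matrix, stored as column bitmasks, by a state vector."""
    result = 0
    for bit in range(STATE_BITS):
        if (state >> bit) & 1:
            result ^= matrix[bit]
    return result


def compose(outer: list[int], inner: list[int]) -> list[int]:
    """Matrix product outer * inner over GF(2)."""
    return [apply(outer, column) for column in inner]


def jump_matrix(steps: int) -> list[int]:
    """Matrix that advances the state by `steps`, via square-and-multiply."""
    result = [1 << bit for bit in range(STATE_BITS)]        # identity
    power = [step(1 << bit) for bit in range(STATE_BITS)]   # one-step matrix
    while steps:
        if steps & 1:
            result = compose(power, result)
        power = compose(power, power)
        steps >>= 1
    return result


def jump_ahead(state: int, steps: int) -> int:
    """Advance `state` by `steps` without iterating step by step."""
    return apply(jump_matrix(steps), state)


# Starting states for four sub-streams spaced 1000 steps apart.
substream_starts = [jump_ahead(0xACE1, i * 1000) for i in range(4)]
```

Each sub-stream can then be run by its own generator instance, which is the partitioning of one long sequence that the abstract describes.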
It has become clear that large embedded configurable memory arrays will be essential in future FPGAs. Embedded arrays provide high-density, high-speed implementations of the storage parts of circuits. Unfortunately, they require the FPGA vendor to partition the device into memory and logic resources at manufacture time. This leads to a waste of chip area for customers that do not use all of the storage provided. This chip area need not be wasted, and can in fact be used very efficiently, if the arrays are configured as large multi-output ROMs and used to implement logic. To use the embedded arrays efficiently in this way, a technology mapping algorithm is required that identifies parts of circuits that can be efficiently mapped to an embedded array. In this paper, we describe such an algorithm. The new tool, called SMAP, packs as much circuit information as possible into the available memory arrays, and maps the rest of the circuit into four-input lookup tables. On a set of 29 sequential and combinational benchmarks, the tool is able to map, on average, 60 4-LUTs into a single 2-Kbit memory array. If there are 16 arrays available, it can map, on average, 358 4-LUTs to the 16 arrays.
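Using a memory array as logic means storing a truth table: the function's inputs form the address and the stored word holds its outputs. The sketch below shows this for a hypothetical 4-bit plus 4-bit adder, which has 8 inputs and 5 outputs and therefore needs 256 x 5 = 1280 bits, within a 2-Kbit (2048-bit) array. The function names and the adder example are assumptions for illustration; SMAP's mapping algorithm itself is not shown.

```python
ARRAY_BITS = 2048  # 2-Kbit memory array, as in the abstract


def rom_bits(num_inputs: int, num_outputs: int) -> int:
    """Bits needed to store a truth table with the given input and output counts."""
    return (1 << num_inputs) * num_outputs


def build_rom(num_inputs: int, function) -> list[int]:
    """Precompute the function for every input combination (address)."""
    return [function(address) for address in range(1 << num_inputs)]


def add_nibbles(address: int) -> int:
    """Hypothetical logic to absorb: add the low and high 4-bit fields."""
    return (address & 0xF) + (address >> 4)


# 8 inputs and 5 outputs fit in one 2-Kbit array.
assert rom_bits(8, 5) <= ARRAY_BITS
adder_rom = build_rom(8, add_nibbles)

# Evaluating the logic is now a single memory read: 9 + 8.
result = adder_rom[(9 << 4) | 8]
```

The capacity check also shows the trade-off a mapper faces: every extra input doubles the number of words, so wide functions quickly exceed a fixed-size array.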
This paper describes the hardware implementation of a real-time, large-scale, multi-chip FPGA (Field Programmable Gate Array) based emulation engine with a capacity of 10 million ASIC (Application Specific Integrated Circuit) equivalent gates. Attainable system operation frequency can exceed 60 MHz, and the system throughput has been empirically verified to achieve 600 billion 16-bit additions per second. The emulator is custom designed to maximize the performance and resource utilization for a range of telecommunication and digital signal processing applications. With its high-speed interconnect architecture and large external I/O bandwidth, the emulator excels in prototyping real-time systems that have strict timing, logic capacity, and data rate requirements. Our development efforts are guided by such ongoing projects as ultra-wideband (UWB) and multi-channel-multi-antenna (MCMA) radio systems research.