ISBN: 9781450311557 (print)
Memory bandwidth is critical to achieving high performance in many FPGA applications. The bandwidth of SDRAM memories is, however, highly dependent upon the order in which addresses are presented on the SDRAM interface. We present an automated tool for constructing an application-specific on-chip memory address sequencer which presents requests to the external memory in an order that optimizes off-chip memory bandwidth for a fixed on-chip memory resource. Within a class of algorithms described by affine loop nests, this approach can be shown to reduce both the number of requests made to external memory and the overhead associated with those requests. The data presented show a trade-off between the use of on-chip resources and achievable off-chip memory bandwidth: efficiency gains of 3.6x to 4x on the external memory interface can be obtained at a cost of up to a 1.4x increase in the ALUTs dedicated to address generation circuits in an Altera Stratix III device.
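The dependence of SDRAM efficiency on address order can be illustrated with a small sketch. This is not the tool described in the abstract; it is a toy open-row model in which every name and parameter (`count_row_activations`, `row_size`, `num_banks`, the bank-interleaved address layout) is an assumption made for illustration. It counts how many row activations the same set of addresses costs when issued in two different orders.

```python
def count_row_activations(addresses, row_size, num_banks):
    """Count row activations under a simple open-row policy.

    Assumed (hypothetical) layout: consecutive chunks of `row_size` words
    are interleaved across `num_banks` banks, and each bank keeps exactly
    one row open at a time.
    """
    open_rows = {}  # bank -> row currently open in that bank
    activations = 0
    for addr in addresses:
        chunk = addr // row_size
        bank = chunk % num_banks
        row = chunk // num_banks
        if open_rows.get(bank) != row:
            # Row miss: the bank must close its row and activate a new one.
            activations += 1
            open_rows[bank] = row
    return activations


def row_major_order(rows, cols):
    """Addresses of a row-major array visited row by row."""
    return [r * cols + c for r in range(rows) for c in range(cols)]


def column_major_order(rows, cols):
    """Addresses of the same array visited column by column."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


if __name__ == "__main__":
    sequential = row_major_order(8, 16)
    strided = column_major_order(8, 16)
    print(count_row_activations(sequential, row_size=16, num_banks=2))
    print(count_row_activations(strided, row_size=16, num_banks=2))
```

In this toy setting both orders touch exactly the same words, yet the strided order pays a row activation on every access while the sequential order pays one per row. Reordering requests on chip, at the cost of some buffering and address generation logic, targets this kind of gap.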
Emerging high-level hardware description and synthesis technologies, in conjunction with field-programmable gate arrays (FPGAs), have significantly lowered the threshold for hardware development. Opportunities exist to integrate these technologies into a tool for exploring and evaluating microarchitectural designs. This paper presents a case study in developing the synthesizable high-level model of a superscalar processor and producing a working prototype in an FPGA. Using an experimental operation-centric hardware description language, we have created the synthesizable model of a superscalar speculative out-of-order core for the integer subset of SimpleScalar PISA. A prototype implementation is produced by synthesizing the high-level model for the Spyder FPGA prototyping board. In addition, we have modified the baseline processor model to create derivative processor designs that add newly proposed experimental mechanisms. The derivative models are useful both in testing the completeness and correctness of new mechanisms and in assessing the mechanisms' impact on implementation area and cycle time.
ISBN: 1595932925 (print)
The high unit cost of FPGA devices often deters their use beyond the prototyping stage. Efforts have been made to reduce the part cost of FPGA devices, resulting in the development of Design-Specific FPGAs. These parts offer cost reductions by limiting manufacturing tests and improving the number of working devices in a wafer. This paper addresses the issue of yield enhancement in Design-Specific FPGAs. An analytical model predicting the probability of mapping a specific design onto potentially defective FPGAs is developed. When combined with existing yield modelling techniques, a quantitative measure of the potential yield improvements of the Design-Specific FPGA approach is reported for current and future technology nodes. It is found that this approach, while beneficial with current manufacturing technology, may not be suitable for 22nm technology or beyond. Copyright 2006 ACM.
Novel applications have triggered significant changes at the system level of FPGA architecture design, such as the introduction of embedded VLIW processor arrays and hardened NoCs. However, the routing architecture of...
ISBN: 1595932925 (print)
Embedded memory blocks are important resources in contemporary FPGA devices. When targeting FPGAs, application designers often specify high-level memory functions which exhibit a range of sizes and control structures. These logical memories must be mapped to FPGA embedded memory resources such that physical design objectives are met. In this work, a set of power-aware logical-to-physical RAM mapping algorithms is described which convert user-defined memory specifications to on-chip FPGA memory block resources. These algorithms minimize RAM dynamic power by evaluating a range of possible embedded memory block mappings and selecting the most power-efficient choice. Our automated approach has been integrated into a commercial FPGA compiler and tested with 40 large FPGA benchmarks. Through experimentation, we show that, on average, embedded memory dynamic power can be reduced by 21% and overall core dynamic power can be reduced by 7% with a minimal loss (1%) in design performance. Copyright 2006 ACM.
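The idea of evaluating a range of candidate mappings and keeping the most power-efficient one can be sketched as follows. This is only an illustration, not the algorithms from the abstract: the block configurations in `BLOCK_CONFIGS`, the function names, and the cost model in `toy_power` (one row of blocks enabled per access, plus a small output-multiplexer penalty per extra row) are all assumptions.

```python
from math import ceil

# Hypothetical (depth, width) configurations of one 4 Kbit embedded block.
BLOCK_CONFIGS = [(4096, 1), (2048, 2), (1024, 4), (512, 8), (256, 16)]


def enumerate_mappings(depth, width, configs=BLOCK_CONFIGS):
    """Return one candidate physical mapping per block configuration."""
    mappings = []
    for block_depth, block_width in configs:
        rows = ceil(depth / block_depth)  # block rows chosen by upper address bits
        cols = ceil(width / block_width)  # blocks placed side by side to form a word
        mappings.append({
            "config": (block_depth, block_width),
            "rows": rows,
            "cols": cols,
            "blocks": rows * cols,
        })
    return mappings


def toy_power(mapping, block_energy=1.0, mux_energy=0.1):
    """Assumed cost model, not measured data.

    Only the addressed row of blocks is enabled on each access, so `cols`
    blocks are active; every additional row adds output-mux overhead.
    """
    return block_energy * mapping["cols"] + mux_energy * (mapping["rows"] - 1)


if __name__ == "__main__":
    candidates = enumerate_mappings(depth=4096, width=16)
    best = min(candidates, key=toy_power)
    print(best)
```

Under these assumptions every candidate for a 4096 x 16 logical memory uses the same number of blocks, but the mappings differ in how many blocks are active per access. A real flow would also have to weigh the extra decode and multiplexing logic against timing, which is consistent with the small performance loss the abstract reports.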
ISBN: 9781450305549 (print)
Long FPGA CAD runtime has emerged as a limitation to the future scaling of FPGA densities. Already, compile times on the order of a day are common, and the situation will only get worse as FPGAs get larger. Without a concerted effort to reduce compile times, further scaling of FPGAs will eventually become impractical. Previous works have presented fast CAD tools that trade off quality of result for compile time. In this paper, we take a different but complementary approach. We show that the architecture of the FPGA itself can be designed to be amenable to fast compilation. If not done carefully, this can lead to lower-quality mapping results, so a careful trade-off between area, delay, power, and compile runtime is essential. We investigate the extent to which runtime can be reduced by employing high-capacity logic blocks. We extend previous studies on logic block architectures by quantifying the area, delay, and CAD runtime trade-offs for large-capacity blocks, and also investigate some multi-level logic block architectures. In addition, we present an analytically derived equation to guide the design of logic block I/O requirements.
ISBN: 9781605584102 (print)
Carbon nanotubes (CNTs), with their unique electronic properties, are promising materials for building nanoscale circuits. In this paper, we present a new CNT-based FPGA architecture known as FPCNA. We define novel CNT- and nanoswitch-based components and characterize these components considering nano-specific process variations, including the variation caused by the random mixture of metallic and semiconducting CNTs. To evaluate the architecture, we develop a variation-aware physical design flow which can handle both Gaussian and non-Gaussian random variables using variation-aware placement and routing. When FPCNA is evaluated with this CAD flow, we see a 2.67x performance gain over a baseline CMOS FPGA at the same technology node (at a 95% performance yield). In addition, FPCNA offers a 4.5x footprint reduction compared to the baseline FPGA. These results demonstrate the potential of using CNTs and nanoswitches to build high-performance FPGA circuits. Copyright 2009 ACM.
We can design high-frequency soft-processors on FPGAs that exploit deep pipelining of DSP primitives, supported by selective data forwarding, to deliver up to 25% performance improvements across a range of benchmarks....
ISBN: 9781581134520 (print)
We present a routability-driven bottom-up clustering technique for area and power reduction in clustered FPGAs. This technique uses a cell connectivity metric to identify seeds for efficient clustering. Effective seed selection, coupled with interconnect-resource-aware clustering and placement, can have a favorable impact on circuit routability. It leads to better device utilization, savings in area, and reduction in power consumption. A routing area reduction of 35% is achieved over previously published results. Power dissipation simulations using a buffered pass-transistor-based FPGA interconnect model are presented. They show that our clustering technique can reduce the overall device power dissipation by an average of 13%.
ISBN: 9781450333153 (print)
Field-programmable gate arrays (FPGAs) implement logic functions using programmable cells, such as K-input lookup tables (K-LUTs). A K-LUT can implement any Boolean function with K inputs and one output. Methods for mapping into K-LUTs are extensively researched and widely used. Recently, cells other than K-LUTs have been explored, for example, those composed of several LUTs and those combining LUTs with several gates. Known methods for mapping into these cells are specialized and complicated, requiring a substantial effort to evaluate custom cell architectures. This paper presents a general approach to efficiently map into single-output K-input cells containing LUTs, MUXes, and other elementary gates. Cells with up to 16 inputs can be handled. The mapper is fully automated: it takes a logic network and a symbolic description of a programmable cell, and produces an optimized network composed of instances of the given cell. Past work on delay/area optimization during mapping is applicable and leads to good quality of results.
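The statement that a K-LUT can implement any K-input Boolean function follows from its structure: it stores one output bit for each of the 2^K input combinations. A minimal sketch of that idea, with hypothetical helper names `make_lut` and `eval_lut` and no connection to the mapper described in the abstract:

```python
def make_lut(k, func):
    """Build the 2**k-entry truth table of `func`, a Boolean function of k inputs.

    Entry `index` holds the output for the input assignment in which
    input i equals bit i of `index`.
    """
    table = []
    for index in range(2 ** k):
        bits = [(index >> i) & 1 for i in range(k)]
        table.append(int(bool(func(*bits))))
    return table


def eval_lut(table, inputs):
    """Look up the stored output bit for a list of 0/1 input values."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return table[index]


def majority(a, b, c):
    """Example 3-input function: true when at least two inputs are 1."""
    return a + b + c >= 2


if __name__ == "__main__":
    lut3 = make_lut(3, majority)
    print(lut3)                       # truth table contents
    print(eval_lut(lut3, [1, 0, 1]))  # output for a=1, b=0, c=1
```

Because the table has 2^K independent bits, a K-LUT covers all 2^(2^K) possible functions (65536 of them for K = 4). Cells that combine smaller LUTs with MUXes or gates cover only a subset of functions, which is one reason mapping into them is harder.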