This paper presents the implementation of Manticore: a manycore accelerator for parallel RTL simulation. Manticore packs up to 225 custom soft processors running at 475 MHz on a large FPGA. Implementing manycore accel...
This paper presents a flexible FPGA architecture evaluation framework, named fpgaEVA-LP, for power efficiency analysis of LUT-based FPGA architectures. Our work makes several contributions: (i) we develop a mixed-level FPGA power model that combines switch-level models for interconnects and macromodels for LUTs; (ii) we develop a tool that automatically generates a back-annotated gate-level netlist with post-layout extracted capacitances and delays; (iii) we develop a cycle-accurate power simulator based on our power model, which carries out gate-level simulation under a real delay model and is able to capture glitch power; (iv) using the fpgaEVA-LP framework, we study the power efficiency of FPGAs in 0.10 μm technology under various settings of architecture parameters, such as LUT sizes, cluster sizes, and wire segmentation schemes, and reach several important conclusions. We also present the detailed power consumption distribution among different FPGA components and shed light on potential opportunities for power optimization in future FPGA designs (e.g., ≤ 0.10 μm technology).
Leakage power in FPGAs has been overshadowed by dynamic power minimization techniques, yet it is a growing concern in programmable logic. This paper proposes a dual threshold voltage implementation of the FPGA architecture for leakage power reduction. A CAD flow is developed for assigning high threshold voltage to the logic elements within the logic blocks of the FPGA. The CAD flow ensures that all logic blocks remain identical with respect to the number of high and low threshold voltage logic elements that each contains, which leads to a dual threshold voltage implementation of the FPGA architecture. Results indicate that over 95% of the logic elements in the FPGA can be assigned high threshold voltage. On average, leakage savings of 60% can be achieved, rising to 70% for some benchmarks. The proposed CAD flow forms a basis on which other dual threshold voltage FPGA implementations can be evaluated. We investigate the design trade-off between the ratio of high-Vt to low-Vt logic elements in a cluster and the leakage savings. We also investigate the impact of cluster size on leakage savings for the dual threshold voltage implementation.
This paper introduces a modified packing and placement algorithm for FPGAs that uses logic duplication to improve performance. The modified packing algorithm is designed to leave unused basic logic elements (BLEs) in timing-critical clusters as potential targets for logic duplication. The modified placement algorithm adds a new stage after placement in which logic duplication is performed to shorten the critical path. We show that, in a representative FPGA architecture using 0.18 μm technology, the length of the final critical path can be reduced by an average of 14.1%. Approximately half of this gain comes directly from the changes to the packing algorithm, while the other half comes from the logic duplication performed during placement.
ISBN: 9781450333153 (print)
The personal computer market grew exponentially in the 1980s for vendors such as Apple, Microsoft, and Intel when there was a healthy mix of software, tools, and microprocessor devices. At the time, the killer applications that drove the market were spreadsheets, compilers, and games that ran on the personal computer. Thirty years later, we have a similar opportunity to grow a healthy ecosystem as developers and vendors bring killer applications, tools, and programmable logic devices to the market to accelerate datacenters for cloud computing.
The proceedings contain 24 papers. Topics discussed include logic design, field programmable gate arrays, pipelined routing and scheduling, logic synthesis, architecture of special purpose structures, field programmable gate array partitioning, applications, and bit-serial synthesis.
ISBN: 9781450305549 (print)
For high-end industrial image processing applications with real-time requirements, FPGAs are often used as custom accelerators. High-level synthesis tools, such as CatapultC, provide a compelling means of speeding up algorithmic hardware design. However, increasing image resolutions make it ever more difficult to obtain sufficient throughput from external SDRAM frame buffers while providing simple, low-latency memory resources for the data path. To address these issues, this paper proposes a platform-based design with a custom memory system of buffers, caches, and an optimized commercial memory controller that improves available SDRAM bandwidth by up to 4x. This greatly facilitates the high-level synthesis flow, which is demonstrated by implementing two memory-intensive algorithms using 47.0 Gbit/s and 5.7 Gbit/s of on-chip and off-chip memory bandwidth, respectively.
ISBN: 9781450305549 (print)
FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGA's lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but also frequently locks a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment, as seen by the hardware kernels, to simplify development and to improve an application's portability and scalability.
This paper introduces a methodology for prototyping Globally Asynchronous Locally Synchronous (GALS) circuits on synchronous commercial FPGAs. A library of the elements required to implement GALS circuits is proposed, and general design considerations for successfully implementing a GALS circuit on an FPGA are discussed. The library includes clock generators, arbiters, and different port controllers. Different implementations of these circuits are explored, along with their advantages and disadvantages. Finally, we present a GALS Reed-Solomon decoder as a practical example. The results show that the GALS approach improves the performance of the circuit by 11% and reduces power consumption by 18.7% to 19.6%, depending on the error rate. On the other hand, the area of the circuit increases by 51%, which is acceptable considering that a purely synchronous circuit with a central controller is decomposed to generate the GALS system; 29% of this overhead comes from distributing the controller across different modules. Deploying better decomposition methods can reduce this overhead substantially.
ISBN: 9781450311557 (print)
Software-based simulation tools are not keeping up with the demands of increasing chip and system design complexity. In this paper, we describe a cycle-accurate and cycle-reproducible large-scale FPGA platform that is designed from the ground up to accelerate logic verification of the Blue Gene/Q compute node ASIC, a multi-processor SoC implemented in IBM's 45 nm SOI CMOS technology. This paper discusses the challenges of constructing such large-scale FPGA platforms, including design partitioning, clocking and synchronization, and debugging support, as well as our approach to addressing these challenges without sacrificing cycle accuracy and cycle reproducibility. The resulting full-chip simulation of the Blue Gene/Q compute node ASIC runs at a simulated processor clock speed of 4 MHz, over 100,000 times faster than the logic-level software simulation of the same design. The vast increase in simulation speed provides a new capability in the design cycle that proved to be instrumental in logic verification as well as in early software development and performance validation for Blue Gene/Q.
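As a rough sanity check on the two figures quoted in this abstract, the sketch below derives the software simulation rate they imply. The function name is hypothetical, and reading "over 100,000 times faster" as a lower bound on the speedup is an assumption made for illustration; the abstract itself does not state the software simulation rate.

```python
def implied_software_rate_hz(fpga_rate_hz: float, min_speedup: float) -> float:
    """Upper bound on the software simulation rate implied by a minimum speedup.

    Hypothetical helper: if the FPGA platform is at least `min_speedup` times
    faster, the software simulator runs at no more than fpga_rate / min_speedup.
    """
    if min_speedup <= 0:
        raise ValueError("min_speedup must be positive")
    return fpga_rate_hz / min_speedup


FPGA_RATE_HZ = 4e6       # 4 MHz simulated processor clock, from the abstract
MIN_SPEEDUP = 100_000    # "over 100,000 times faster", from the abstract

bound = implied_software_rate_hz(FPGA_RATE_HZ, MIN_SPEEDUP)
print(f"Software simulation: at most about {bound:.0f} simulated cycles per second")
```

Under these assumptions, the software simulator advances at no more than roughly 40 simulated cycles per second, which illustrates why early software development on it would be impractical.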