ISBN: 9781605581095 (print)
The design process for chip multiprocessors (CMPs) requires extremely long simulation times to explore performance, power, and thermal issues, particularly when operating system (OS) effects are included. In response, our novel FPGA-based emulation methodology models a full CMP design including applications and an OS. Activity counters programmed into the cores feed per-component microarchitectural power models. These models achieve under 10% error compared to detailed gate-level simulations. Our method retains software flexibility, but offers up to 35x speedup compared to full-system software simulations. We present our approach by emulating a 2-core Leon3 cache-coherent multiprocessor running Linux and parallel benchmarks. In an example case study, our emulated system uses activity counts (a proxy for temperature) to guide process migration between the CMP cores. Overall, this paper's methodology makes possible detailed power and thermal studies of CMPs and their operating systems. Copyright 2008 ACM.
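As a rough illustration of how activity counters could feed a per-component power model, the sketch below multiplies each event count by an energy cost and divides by the sampling interval. The component names, the per-event energies, and the function names are all hypothetical; the abstract does not give the actual model or its coefficients.

```python
# Hypothetical per-event energy costs in nanojoules (NOT taken from the paper).
ENERGY_PER_EVENT_NJ = {
    "icache_access": 0.5,
    "dcache_access": 0.75,
    "alu_op": 0.25,
}


def component_power_mw(counts, interval_s, energy_per_event_nj=ENERGY_PER_EVENT_NJ):
    """Estimate average power per component (milliwatts) over one interval.

    counts maps a component name to the number of events its activity
    counter recorded during interval_s seconds.
    """
    power = {}
    for component, count in counts.items():
        energy_nj = count * energy_per_event_nj[component]
        # nJ per second is nW; 1 mW = 1e6 nW.
        power[component] = energy_nj / interval_s / 1e6
    return power


def least_active_core(activity_by_core):
    """Return the core id with the lowest activity count.

    An assumed migration policy: treat activity as a temperature proxy and
    pick the least active core as the migration target.
    """
    return min(activity_by_core, key=activity_by_core.get)
```

A linear count-times-energy model is the simplest choice that matches the description; a real model would presumably be calibrated against gate-level simulation, as the abstract's error figure suggests.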
ISBN: 9781595939999 (print)
This paper presents an improved Tree-based architecture that unifies two unidirectional programmable networks: a predictable downward network based on the Butterfly-FatTree topology, and an upward network using hierarchy. Studies based on Rent's Rule show that switch requirements in this architecture grow more slowly than in traditional Mesh topologies. New tools are developed to place and route several benchmark circuits on this architecture. Experimental results show that the Tree-based architecture can implement MCNC benchmark circuits with an average gain of 54% in total area compared with Mesh architecture. Copyright 2008 ACM.
ISBN: 9781595939999 (print)
Recent research into molecular-scale electronics has led to the realization of novel nanoscale devices that can be used to implement circuits such as what we dub programmable majority logic arrays (PMLAs). A PMLA leverages two characteristics found in molecular electronic devices, hysteretic switching and negative differential resistance (NDR), in the implementation of a PLA based on majority logic. This paper deals with the integration of several nanoscale PMLA units with microscale technologies to implement a high-density FPGA architecture. One of the key contributions of this work is the interface between the top nanoscale layer and a lower CMOS layer. Two approaches are considered for interfacing these two technologies: (1) direct connection and (2) connection utilizing tapered buffers between the layers for improved delay. The intermediate tapered buffers in the second approach ensure that the variation in feature size, and hence load capacitance, from one layer to the next is not too substantial. This paper also demonstrates the potential of the PMLA FPGA from a high-level perspective in terms of increased density and performance for a set of applications. Copyright 2008 ACM.
ISBN: 0769527957 (print)
Design-for-manufacture (DFM) for thick gate oxide layout in a dual gate oxide product is investigated. Careless placement and layout of thick gate oxide transistors in a multigate oxide chip can cause significant yield loss. The root cause of the yield loss is that the thick gate oxide can affect the thickness uniformity of the adjacent thin gate oxide. Further experimental results show that optimizing the thick gate oxide transistor layout for the same product can improve the yield. In addition to tweaking the gate oxide etching process to overcome the difficulty of manufacturing multioxide products, guidelines for good gate oxide layout practice are provided to facilitate manufacture.
ISBN: 9781605584690 (print)
Various commercial programmable compute platforms have their processor architecture enhanced with field-programmable gate arrays (FPGAs). In a common usage scenario, an application loads custom processors into the FPGA to speed up application execution compared to processor-only execution. Transient applications, changing application workloads, and limited FPGA capacity have led to a new problem of operating-system-controlled dynamic management of the loading of coprocessors into the FPGAs for best overall performance or energy. We define the Dynamic Coprocessor Management problem and provide a mapping to an online optimization problem known as Metrical Task Systems (MTS). We introduce a robust heuristic, called the fading cumulative benefit (FCBenefit) heuristic, that outperforms other heuristics, including a previously developed one for MTS. For two distinct application sets, we generate numerous workloads and show that the FCBenefit heuristic provides the best results across all considered workloads. In our simulations, the heuristic's results were within 9% of the offline optimal for performance, and within 3% for energy. The heuristic may be applicable to a wide variety of dynamic architecture management problems. Copyright 2008 ACM.
ISBN: 9781595939999 (print)
Hardware/software partitioning is an increasingly common technique that maps critical regions of a software application into custom hardware to achieve application speedup. Most previous partitioning approaches assume that each application region has only a single hardware implementation. However, code regions typically can be implemented as many different versions that trade off performance and area, as in the case of a loop that can be unrolled by different amounts. We introduce a new formulation of hardware/software partitioning that integrates multiple versions of region implementations, improving performance by more than 27% on average compared to partitioning with a single implementation. We present an optimal ILP solution, and introduce an efficient heuristic that achieves solutions within 0% to 8% of the optimal while running in less than one second for large problem sizes. Copyright 2008 ACM.
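Choosing at most one hardware version per region under an area budget can be viewed as a multiple-choice knapsack problem. The sketch below solves a toy instance exactly with dynamic programming; it is neither the paper's ILP formulation nor its heuristic, and the function name, the `(area, time_saved)` encoding, and the integer area units are assumptions made for illustration.

```python
def partition(regions, area_budget):
    """Pick at most one hardware version per region to maximize time saved.

    regions: list of regions, each a list of (area, time_saved) versions.
    area_budget: total available area, in the same integer units.
    Returns (total_time_saved, choices), where choices[i] is the index of
    the version chosen for region i, or None if it stays in software.
    """
    # best[a] = (time_saved, choices) for the regions seen so far,
    # using at most `a` units of area.
    best = [(0, ())] * (area_budget + 1)
    for versions in regions:
        nxt = []
        for a in range(area_budget + 1):
            saved, choice = best[a]
            cand = (saved, choice + (None,))  # leave this region in software
            for v, (area, gain) in enumerate(versions):
                if area <= a:
                    prev_saved, prev_choice = best[a - area]
                    if prev_saved + gain > cand[0]:
                        cand = (prev_saved + gain, prev_choice + (v,))
            nxt.append(cand)
        best = nxt
    return best[area_budget]


# Toy instance: region 0 has a small and a large version, region 1 has one.
regions = [[(2, 10), (4, 18)], [(3, 12)]]
result = partition(regions, area_budget=5)
```

In this toy instance, offering the smaller version of region 0 lets both regions fit in hardware, whereas offering only the large version would force one region to stay in software. That is the kind of gain the multi-version formulation targets.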
ISBN: 9781595939999 (print)
We present an efficient timing-driven placement algorithm for FPGAs. Our major contribution is a criticality history guided (CHG) approach that can simultaneously reduce the critical path delay and computation time. The proposed approach keeps track of the timing criticality history of each edge and utilizes this information to effectively guide the placer. We also present a cooling schedule that optimizes both timing and run time when combined with the CHG method. The proposed algorithm is applied to the 20 largest MCNC benchmark circuits. Experimental results show that, compared with VPR [1], our placement algorithm yields an average 21.7% reduction (maximum 45.8%) in the critical path delay and runs 2.2x faster. In addition, our approach outperforms other algorithms discussed in the literature in both delay and run time. Copyright 2008 ACM.
ISBN: 9781595939999 (print)
With current technology trends, FPGA routing is an important problem, since routing in FPGAs contributes significantly to delay and resource utilization, as compared to the logic portion of FPGAs. In this paper we improve the FPGA routing characteristics by applying the technique of network coding. This relatively new technique was developed in the context of communication networks, and has been shown to improve network throughput, reliability, etc. To the best of our knowledge, this paper is the first to apply network coding to improve FPGA routing. Our preliminary results are implemented in the VPR 4.30 tool suite. We demonstrate (on average) a 14% reduction in worst-case delay, a 3% reduction in wirelength and a healthy reduction in the routing track count on several MCNC benchmark circuits, over the current best known results. By using carefully generated cost models for applying the technique of network coding, we show that this routability improvement is accompanied by a zero percent CLB utilization overhead and
ISBN: 9781595939999 (print)
This paper presents an implementation of a multilayer perceptron neural network and the backpropagation learning algorithm in an FPGA. The resulting system, in contrast to others, is low-cost with effective resource utilization, and is capable of training the neural network for any given task. The system is based on a modular scheme conforming to a system-on-a-chip (SoC), where modules can be replaced or scaled for a specific application. The system uses fixed-point arithmetic and was implemented in a generic hardware description language. A pipeline architecture is used in order to build a time-efficient system. The efficacy of the system was tested in a pattern recognition application; tests were run on a low-cost Xilinx Spartan-3E FPGA. Copyright 2008 ACM.
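To make the fixed-point arithmetic concrete, the sketch below computes one neuron's weighted sum using integers with an implied binary point. The 8 fractional bits and all function names are assumptions; the abstract does not state the word length or number format used in the hardware.

```python
FRAC_BITS = 8  # assumed number of fractional bits; not stated in the abstract
SCALE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a float to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * SCALE))


def to_float(q):
    """Convert a fixed-point integer back to a float."""
    return q / SCALE


def fixed_mul(a, b):
    """Multiply two fixed-point values and rescale the product.

    The right shift rounds toward negative infinity, as an arithmetic
    shift would in hardware.
    """
    return (a * b) >> FRAC_BITS


def neuron_preactivation(inputs_q, weights_q, bias_q):
    """Multiply-accumulate for one neuron, entirely in fixed point."""
    acc = bias_q
    for x, w in zip(inputs_q, weights_q):
        acc += fixed_mul(x, w)
    return acc


inputs_q = [to_fixed(0.5), to_fixed(0.25)]
weights_q = [to_fixed(2.0), to_fixed(-4.0)]
out = to_float(neuron_preactivation(inputs_q, weights_q, to_fixed(1.0)))
```

Integer multiply-accumulate of this kind maps onto FPGA multipliers and adders without floating-point units, which is one plausible reason for the fixed-point choice on a low-cost device.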
ISBN: 9781595939999 (print)
High-level synthesis tools automatically generate custom hardware circuits from high-level languages, including popular programming languages like standard ANSI C, but are unable to handle recursive functions. The convenience of recursive algorithms has made recursion a widespread programming practice, which limits the applicability of high-level synthesis tools. We introduce a new synthesis technique, recursion flattening, that reduces the limitations caused by recursion for high-level synthesis. Recursion flattening can eliminate many instances of recursion by determining recursion depth, and then inlining recursive calls. Recursion flattening cannot eliminate all recursion, but we show that the technique succeeds for many common recursive algorithms. We applied the technique to seven recursive benchmarks that previously would not have been synthesizable, resulting in FPGA hardware circuits that run 75x faster on average than if the benchmarks were run as microprocessor software. Furthermore, we compared those hardware circuits to circuits synthesized from the same benchmarks coded using non-recursive algorithms, and show nearly identical performance and area for many examples, and significantly increased performance for several examples. Copyright 2008 ACM.
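The idea of bounding the recursion depth and then inlining the calls can be illustrated by hand on a small divide-and-conquer sum. This is only an illustration written in Python for readability: the paper's technique is an automatic transformation inside a synthesis flow for C, and the function names and the fixed input size of 4 are assumptions.

```python
def tree_sum(values, lo, hi):
    """Recursive divide-and-conquer sum over values[lo:hi]."""
    if hi - lo == 1:
        return values[lo]
    mid = (lo + hi) // 2
    return tree_sum(values, lo, mid) + tree_sum(values, mid, hi)


def tree_sum_flat4(values):
    """tree_sum(values, 0, 4) with every recursive call inlined.

    For exactly 4 inputs the recursion depth is known to be 2, so the
    calls can be expanded into straight-line code with no recursion left.
    """
    left = values[0] + values[1]    # inlined tree_sum(values, 0, 2)
    right = values[2] + values[3]   # inlined tree_sum(values, 2, 4)
    return left + right


data = [3, 1, 4, 1]
recursive_result = tree_sum(data, 0, len(data))
flat_result = tree_sum_flat4(data)
```

The flattened version is valid only because the input size, and hence the depth, is fixed. When the depth cannot be bounded, the calls cannot be fully inlined, which is consistent with the abstract's note that not all recursion can be eliminated.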