ISBN:
(Print) 9781605588001
GPU platforms are becoming increasingly attractive for implementing accelerators because they feature a larger number of cores with improved programmability. In this paper, we describe our implementation of a state-of-the-art academic multi-level analytical placer, mPL [8], on Nvidia's massively parallel GT200 series platforms. We detail our efforts on performance tuning and optimization. Compared to a software implementation on Intel's recent-generation Xeon CPU, the global placement part of mPL runs 15X faster on average on a Tesla C1060 card, with comparable wirelength (WL): less than 1% WL degradation on average. Copyright 2009 ACM.
An accurate and highly efficient performance analysis approach is extremely important for the early-stage design of networks-on-chip (NoCs). In this paper, novel M/G/1/N queuing models for generic routers are proposed to analyze various packet blockings, and a performance analysis algorithm is then presented to estimate key metrics such as packet latency and buffer utilization. For single-channel and multi-channel routers, comparisons between analytical and observed results validate that the proposed approach achieves speedups of 240 and 210 times, with mean errors of 6.9% and 7.8%, respectively. In our design methodology, this approach can not only effectively direct the NoC synthesis process but also be conveniently applied to multi-objective optimization to find the best mapping solutions. Copyright 2009 ACM.
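As a rough illustration of how a finite-buffer queuing model yields blocking probability, latency, and buffer utilization, the sketch below solves the much simpler M/M/1/N queue (exponential service), not the M/G/1/N router models of the paper. The function name `mm1n_metrics` and its parameters are hypothetical, and `capacity` is assumed to count all packets held in the router, including the one in service.

```python
def mm1n_metrics(lam, mu, capacity):
    """Steady-state metrics of an M/M/1/N queue (a simplification of M/G/1/N).

    lam: packet arrival rate, mu: service rate, capacity: N, the maximum
    number of packets in the system. All names are illustrative assumptions.
    """
    rho = lam / mu
    # Stationary distribution: p_n is proportional to rho**n for n = 0..N.
    weights = [rho ** n for n in range(capacity + 1)]
    total = sum(weights)
    probs = [w / total for w in weights]

    p_block = probs[-1]  # an arriving packet finds the buffer full
    mean_in_system = sum(n * p for n, p in enumerate(probs))
    accepted_rate = lam * (1.0 - p_block)
    latency = mean_in_system / accepted_rate  # Little's law on accepted packets
    utilization = mean_in_system / capacity   # average fraction of buffer in use
    return p_block, latency, utilization
```

Calling `mm1n_metrics(0.5, 1.0, 50)` approximates an unbounded M/M/1 queue, where the latency tends to 1 / (mu - lam).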
Present application-specific embedded systems tend to choose instruction set extensions (ISEs) based on limitations imposed by the available data bandwidth to custom functional units (CFUs). Adopting the optimal ISE for an application would, in many cases, impose a formidable cost increase in order to achieve the required data bandwidth. In this paper, we propose a novel methodology for laying out data in memories, generating high-bandwidth memory systems from existing low-bandwidth, low-cost ones, and designing custom functional units, all with the desired data bandwidth for only a fraction of the additional cost required by traditional techniques. Copyright 2009 ACM.
System-level power management must consider the uncertainty and variability that come from the environment, the application, and the hardware. A robust power management technique must be able to learn the optimal decision from past history and improve itself as the environment changes. This paper presents a novel online power management technique based on model-free constrained reinforcement learning (RL). It learns the power management policy that gives the minimum power consumption for a given performance constraint, without any prior information about the workload. Compared with existing machine-learning-based power management techniques, RL-based learning is capable of exploring the trade-off in the power-performance design space and converging to a better power management policy. Experimental results show that the proposed RL-based power management achieves 24% and 3% reductions in power and latency, respectively, compared to the existing expert-based power management. Copyright 2009 ACM.
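The idea of learning a power management policy without a workload model can be sketched with tabular Q-learning on a toy device. This is a minimal illustration, not the algorithm of the paper: the two-state environment, the power numbers, the arrival probability, and the function name `learn_policy` are all hypothetical, and the performance constraint is folded into the cost through a fixed weight `lam` rather than handled as a true constraint.

```python
import random

def learn_policy(lam=5.0, steps=20000, alpha=0.1, gamma=0.9,
                 eps=0.1, p_arrival=0.3, seed=0):
    """Tabular Q-learning on a toy power-managed device (all values hypothetical).

    States:  0 = no pending request, 1 = request pending.
    Actions: 0 = sleep, 1 = active.
    Cost per step = power + lam * latency penalty (lower is better).
    """
    rng = random.Random(seed)
    power = [0.1, 1.0]            # assumed power of sleep / active modes
    q = [[0.0, 0.0], [0.0, 0.0]]  # q[state][action]: estimated discounted cost
    s = 0
    for _ in range(steps):
        # Epsilon-greedy action selection: explore occasionally.
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = min((0, 1), key=lambda act: q[s][act])
        # A pending request that is not served incurs one unit of latency.
        latency = 1.0 if (s == 1 and a == 0) else 0.0
        cost = power[a] + lam * latency
        # Unserved requests stay pending; otherwise a new one may arrive.
        if s == 1 and a == 0:
            s_next = 1
        else:
            s_next = 1 if rng.random() < p_arrival else 0
        # Model-free update: only the observed transition is used.
        q[s][a] += alpha * (cost + gamma * min(q[s_next]) - q[s][a])
        s = s_next
    # Greedy policy: the cheapest action in each state.
    return [min((0, 1), key=lambda act: q[st][act]) for st in (0, 1)]
```

With the default settings, the learned policy is expected to sleep when idle and wake when a request is pending. Raising or lowering `lam` shifts the balance between power and latency, which is one simple way to explore the power-performance trade-off.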
This paper shows that a timing graph has a hierarchy of specially defined subgraphs, based on which we present a technique that captures topological correlation in arbitrary block-based statistical static timing analysis (SSTA). We interpret a timing graph as an algebraic expression made up of addition and maximum operators. We define the division operation on the expression and propose algorithms that modify factors in the expression without expansion. As a result, they produce an expression from which the latest arrival time is derived with better accuracy in SSTA. Existing techniques that handle reconvergent fanouts usually use dependency lists, which require quadratic space complexity. Instead, the proposed technique achieves linear space complexity by using a new directed acyclic graph search algorithm. Our results show that it outperforms an existing technique in speed and memory usage with comparable accuracy. Copyright 2009 ACM.
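To make the algebraic reading of a timing graph concrete, the sketch below evaluates the latest arrival time deterministically as nested maximum and addition operations over a small DAG with a reconvergent fanout. The graph, the delays, and the function name `latest_arrival` are hypothetical; the statistical delay distributions and the division operation of the paper are not modeled here.

```python
def latest_arrival(edges):
    """Latest arrival time at every node of an acyclic timing graph.

    edges: dict mapping (u, v) to the delay of edge u -> v (hypothetical data).
    arrival(v) = max over fan-in edges (u, v) of arrival(u) + delay(u, v);
    nodes without fan-in are treated as sources with arrival time 0.
    """
    preds = {}
    for (u, v), delay in edges.items():
        preds.setdefault(u, [])
        preds.setdefault(v, []).append((u, delay))

    arrival = {}

    def visit(v):
        # Memoized depth-first evaluation; assumes the graph has no cycles.
        if v not in arrival:
            arrival[v] = max((visit(u) + d for u, d in preds[v]), default=0.0)
        return arrival[v]

    for v in preds:
        visit(v)
    return arrival
```

For the edges `{("a", "b"): 2.0, ("a", "c"): 3.0, ("b", "d"): 4.0, ("c", "d"): 1.0}`, the arrival time at `d` corresponds to the expression max(2 + 4, 3 + 1). Both paths share the source `a`, which is the kind of reconvergence that introduces topological correlation once delays become random variables.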
We present a rigorous framework that defines a class of net weighting schemes in which unconstrained minimization is successively performed on a weighted objective. We show that, provided certain goals are met in the unconstrained minimization, these net weighting schemes are guaranteed to converge to the optimal solution of the original timing-constrained placement problem. These are the first results that provide conditions under which a net weighting scheme will converge to a timing-optimal placement. We then identify several weighting schemes that satisfy the given convergence properties and implement them, with promising results: a modification of the weighting scheme given in [11] results in consistently improved delay over the original, 4% on average, without an increase in computation time. Copyright 2009 ACM.
While Dynamic Voltage Scaling (DVS) and Dynamic Power Management (DPM) techniques are widely used in real-time embedded applications, their complex interaction is not fully understood. In this research effort, we consider the problem of minimizing the expected energy consumption in settings where the workload is known only probabilistically. By adopting a system-level power model, we formally show how the optimal processing frequency can be computed efficiently for a real-time embedded application that can use multiple devices during its execution, while still meeting the timing constraints. Our evaluations indicate that the new technique provides clear energy gains (up to 35%) over existing solutions proposed for deterministic workloads. Moreover, in a non-negligible part of the parameter spectrum, the algorithm's performance is shown to be close to that of a clairvoyant algorithm that can minimize the energy consumption with advance knowledge of the exact workload. Copyright 2009 ACM.
As the complexity of integrated circuits has increased, so has the need for improving testing efficiency. Unfortunately, the types of defects are also becoming more complex, which in turn makes simple testing approaches inadequate. Using n-detect testing can improve defect coverage; however, this approach can greatly increase the test set size. In this proof-of-concept paper, we investigate the use of logic implication checkers, inserted in hardware, as an aid in compacting n-detect test sets. We show that checker hardware with minimal area overhead can reduce test set size by up to 25%. In addition, this implication checker can serve a dual purpose by also providing online error detection. Copyright 2009 ACM.
The importance of within-die process variation and its impact on product yield have increased significantly with scaling. Within-die variation is typically monitored by embedding characterization circuits in product chips. In this work, we propose a minimally invasive, low-overhead technique for characterizing within-die variation. The proposed technique monitors within-die variation by measuring quiescent (IDDQ) currents at multiple power supply ports during wafer-probe test. We show that the spatially distributed nature of power ports enables spatial observation of process variation. We demonstrate our methodology on an experimental test chip fabricated in 65-nm technology. The measurement results show that the IDDQ currents drawn by multiple power supply ports correlate very well with the variation trends introduced by state-dependent leakage patterns. Copyright 2009 ACM.
Process variation is recognized as a major source of parametric yield loss, which occurs because a fraction of manufactured chips do not satisfy timing or power constraints. At the same time, both chip performance and chip leakage power depend on supply voltage. This dependence can be used to convert a fraction of the chips that are too slow or too leaky into good ones by adjusting their supply voltage. This technique is called voltage binning [4]. All the manufactured chips are divided into groups (bins), and each group is assigned its own supply voltage. This paper proposes a statistical technique for computing yield under different voltage binning schemes, using the results of statistical timing and variational power analysis. The paper also formulates and solves the problem of computing optimal supply voltages for a given binning scheme. Copyright 2009 ACM.
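The effect of voltage binning on yield can be illustrated with a small Monte Carlo experiment. This swaps in sampling for the analytical statistical technique of the paper, and every model in it is an assumption: the delay and leakage distributions, the independence between them, the voltage dependence, the limits, and the function name `binning_yield` are all hypothetical.

```python
import math
import random

def binning_yield(voltages, n_chips=10000, d_max=1.0, leak_max=1.0,
                  v_nom=1.0, seed=0):
    """Fraction of sampled chips that meet both limits at some bin voltage.

    voltages: supply voltages available in the binning scheme.
    Assumed toy models: delay scales as v_nom / v, leakage grows
    exponentially with (v - v_nom).
    """
    rng = random.Random(seed)
    good = 0
    for _ in range(n_chips):
        # Per-chip parameters at nominal voltage (hypothetical distributions).
        d0 = rng.gauss(0.95, 0.08)
        l0 = rng.lognormvariate(-0.5, 0.4)
        for v in voltages:
            delay = d0 * v_nom / v
            leak = l0 * math.exp(3.0 * (v - v_nom))
            if delay <= d_max and leak <= leak_max:
                good += 1  # the chip can be assigned to the bin with voltage v
                break
    return good / n_chips
```

Because the same seed produces the same chip population, `binning_yield([0.9, 1.0, 1.1])` can be compared directly with `binning_yield([1.0])`: slow chips may be recovered at the higher voltage and leaky chips at the lower one, so the three-bin yield is never below the single-voltage yield.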