ISBN: 9781509066070 (print)
Heterogeneous architectures offer many potential avenues for improving energy efficiency in today's low-power cores. Two common approaches are dynamic voltage/frequency scaling (DVFS) and heterogeneous microarchitectures (HMs). Traditionally, both approaches have incurred large switching overheads, which limit their applicability to coarse-grain program phases. However, recent research has demonstrated low-overhead mechanisms that enable switching at granularities as low as 1K instructions. The question remains: in this fine-grained switching regime, which form of heterogeneity offers better energy efficiency for a given level of performance? The effectiveness of these techniques depends critically on both efficient architectural implementation and accurate scheduling to maximize energy efficiency for a given level of performance. Therefore, we develop PaTH, an offline analysis tool, to compute (near-)optimal schedules, allowing us to determine Pareto-optimal energy savings for a given architecture. We leverage PaTH to study the potential energy efficiency of fine-grained DVFS and HMs, as well as a hybrid approach. We show that HMs achieve higher energy savings than DVFS for a given level of performance. While at a coarse granularity the combination of DVFS and HMs still proves beneficial, for fine-grained scheduling their combination makes little sense, as HMs alone provide the bulk of the energy efficiency.
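The notion of Pareto-optimal energy for a given level of performance can be illustrated with a small sketch. This is not PaTH itself, whose scheduling algorithm the abstract does not describe; the function name `pareto_front` and the sample (time, energy) points are hypothetical and chosen only for illustration.

```python
def pareto_front(points):
    """Return the (time, energy) points not dominated by any other point.

    A point is dominated if another point is at least as fast and uses
    no more energy. Input: iterable of (time, energy) tuples.
    """
    front = []
    best_energy = float("inf")
    # Visit candidates from fastest to slowest; a slower schedule is
    # only worth keeping if it saves energy over every faster one.
    for time, energy in sorted(points):
        if energy < best_energy:
            front.append((time, energy))
            best_energy = energy
    return front


# Hypothetical schedules: (normalized execution time, normalized energy).
schedules = [(1.0, 10.0), (1.2, 7.0), (1.1, 11.0), (1.5, 7.5), (2.0, 4.0)]
front = pareto_front(schedules)
print(front)
```

Comparing two forms of heterogeneity then amounts to comparing their fronts: for a fixed time budget, the approach whose front reaches lower energy is the more efficient one.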
ISBN: 9781939133120 (print)
We propose DC-Store, a storage framework that offers deterministic I/O performance for a multi-container execution environment. DC-Store's hardware-level design implements multiple NVM sets on a shared storage pool, each providing a deterministic SSD access time by removing internal resource conflicts. In parallel, the software support of DC-Store is aware of the NVM sets and enlightens the Linux kernel to isolate noisy-neighbor containers, which perform page frame reclaiming, from their peers. We prototype both the hardware and software counterparts of DC-Store and evaluate them in a real system. The evaluation results demonstrate that containerized data-intensive applications on DC-Store exhibit a 31% shorter execution time, on average, compared to those on a baseline system.
Reducing switching activity would significantly reduce the power consumption of a processor chip. The authors present two novel techniques, Gray code addressing and Cold scheduling, for reducing switching activity on high-performance processors. They use Gray code, in which consecutive numbers differ in only one bit, for addressing. Due to the locality of program execution, Gray code addressing can significantly reduce the number of bit switches. Experimental results show that for typical programs running on a RISC microprocessor, using Gray code addressing reduces the switching activity at the address lines by 30-50% compared to using normal binary code addressing. Cold scheduling is a software method that schedules instructions in a way that minimizes switching activity. The authors carried out experiments with cold scheduling on the VLSI-BAM. Preliminary results show that switching activity in the control path is reduced by 20-30%.
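The one-bit-per-step property of Gray code can be checked with a short sketch. It counts bit flips on an address bus for a purely sequential sweep, which is an idealized assumption; the 30-50% figure above comes from real program traces, which also contain branches. The helper names `to_gray` and `bit_switches` are illustrative.

```python
def to_gray(n: int) -> int:
    """Convert a binary-coded integer to its reflected Gray code."""
    return n ^ (n >> 1)


def bit_switches(sequence):
    """Count the bits that toggle between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


# Idealized trace: 16 consecutive addresses, as in straight-line code.
addresses = list(range(16))
binary_switches = bit_switches(addresses)
gray_switches = bit_switches([to_gray(a) for a in addresses])
print(binary_switches, gray_switches)  # 26 15
```

For this sweep the Gray-coded bus toggles 15 bits in total (one per step) against 26 for binary. The advantage shrinks on non-sequential accesses, where the two encodings behave similarly.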
We recently invented a true single-phase energy-recovering circuit family, called TSEL, that relies on a cross-coupled latch structure and two DC reference voltages to achieve low energy consumption for a broad range of operating frequencies. In this paper, we explore the application of TSEL to the design of low-energy DSP circuits. Specifically, we describe and evaluate a 6,768-transistor, pipelined TSEL module that performs the 8-point Hadamard transform. In layout simulations with a standard 0.5 μm CMOS technology, our TSEL module functions correctly for operating frequencies in excess of 280 MHz. Above 40 MHz, our TSEL design is more energy-efficient than any other energy-recovering alternative with a similar cross-coupled latch structure. At 280 MHz, it is at least 4 times more energy-efficient than a corresponding static CMOS design.
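For readers unfamiliar with the function the module computes, here is a software sketch of an 8-point Hadamard transform using the standard in-place butterfly. It assumes natural (Sylvester) ordering and no normalization; the abstract does not state which ordering or scaling the hardware uses, and the function name is illustrative.

```python
def hadamard_transform(x):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering).

    The input length must be a power of two; the input is not modified.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Each stage combines pairs that are h apart using only
        # additions and subtractions, so no multipliers are needed.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return out


samples = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = hadamard_transform(samples)
print(spectrum)
```

An 8-point transform takes three butterfly stages of eight add/subtract operations each, which is one reason the transform maps naturally onto a pipelined datapath. Applying the function twice returns the input scaled by 8.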
This paper focuses on the issues of designing an optimal superpipelined ISP (instruction set processor) driven by a set of benchmark programs. Most issues discussed in this paper also apply to VLIW and superscalar processors.
This paper presents a study of cache hashing functions for micro-parallel processors (e.g., superpipelined and superscalar processors). Several novel cache hashing functions are evaluated. Our simulation results show that an unconventional cache hashing function applied to a direct-mapped cache results in hit rates as good as those of a two-way set-associative cache with traditional mapping, while the cache hit times are as fast as those of a direct-mapped cache with traditional mapping.
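The abstract does not say which hashing functions were studied. As a sketch of the general idea only, the following compares traditional modulo indexing with an XOR-folded index, a commonly described alternative that is assumed here for illustration, on a tiny direct-mapped cache model. All names and the access trace are hypothetical.

```python
def modulo_index(block: int, sets: int) -> int:
    """Traditional mapping: the low-order block-address bits pick the set."""
    return block % sets


def xor_index(block: int, sets: int) -> int:
    """Assumed alternative: XOR the index bits with the next-higher bits."""
    bits = sets.bit_length() - 1  # sets must be a power of two
    return (block ^ (block >> bits)) % sets


def hit_rate(trace, sets, index_fn):
    """Simulate a direct-mapped cache; each set stores one block address."""
    cache = [None] * sets
    hits = 0
    for block in trace:
        s = index_fn(block, sets)
        if cache[s] == block:
            hits += 1
        else:
            cache[s] = block  # miss: the new block evicts the old one
    return hits / len(trace)


# Two blocks exactly one cache size apart, accessed alternately.
trace = [0, 64] * 100
modulo_rate = hit_rate(trace, 64, modulo_index)
xor_rate = hit_rate(trace, 64, xor_index)
print(modulo_rate, xor_rate)  # 0.0 0.99
```

With modulo indexing both blocks map to set 0 and evict each other on every access. The XOR index separates them, so only the two cold misses remain. Other access patterns can collide under XOR instead, which is why such functions have to be evaluated on benchmark traces.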
Caches usually consume a significant amount of energy in modern microprocessors (e.g. superpipelined or superscalar processors). In this paper, we examine contemporary cache design techniques and provide an analytical model for estimating cache energy consumption. We also present several novel techniques for designing energy-efficient caches, which include block buffering, cache sub-banking, and Gray code addressing. Experimental results suggest that both the block buffering and Gray code addressing techniques are ideal for instruction cache designs which tend to be accessed in a consecutive sequence. Cache sub-banking is ideal for both instruction and data caches. Overall, these techniques can achieve an order of magnitude energy reduction on caches.
ISBN: 9781605584690 (print)
In this paper, we introduce Optimus: an optimizing synthesis compiler for streaming applications. Optimus compiles programs written in a high-level streaming language to either software or hardware implementations. The compiler uses a hierarchical compilation strategy that separates concerns between macro- and micro-functional requirements. Macro-functional concerns address how components (modules) are assembled to implement larger, more complex applications. Micro-functional concerns deal with the synthesis of the module internals. Optimus thus allows software developers who lack deep hardware design expertise to transparently leverage the advantages of hardware customization without crossing the semantic gap between high-level languages and hardware description languages. Optimus generates streaming hardware that achieves on average a 40x speedup over our baseline embedded processor for a fraction of the energy. Additionally, our results show that streaming-specific optimizations can further improve performance by 255% and reduce the area requirements by 16% on average. These designs are competitive with Handel-C implementations for some of the same benchmarks. Copyright 2008 ACM.
This survey provides an overview of some recent developments in the testing and design validation of reversible logic circuits. Reversible circuits are of interest in ultra-low-power design and in quantum information p...
Technology scaling has delivered on its promises of increasing device density on a single chip. However, the voltage scaling trend has failed to keep up, introducing tight power constraints on manufactured parts. In s...