ISBN: 9781424497799 (print)
As one of the most popular accelerators, the graphics processing unit (GPU) has demonstrated high computing power in several application fields. On the other hand, the GPU also consumes a great deal of power and has become one of the largest power consumers in desktop and supercomputer systems. However, software power optimization methods targeting the GPU have not been well studied. In this work, we propose a kernel fusion method to reduce energy consumption and improve power efficiency on the GPU architecture. By fusing two or more independent kernels, kernel fusion achieves higher utilization and a more balanced demand for hardware resources, which provides more potential for power optimizations such as dynamic voltage and frequency scaling (DVFS). Based on the CUDA programming model, this paper also gives several fusion methods targeted at different situations. To choose a judicious fusion strategy, we formulate the process of fusing multiple independent kernels as a dynamic programming problem, which can be solved with many existing tools and easily embedded into a compiler or runtime system. To reduce the overhead introduced by kernel fusion, we also propose effective methods to reduce shared memory usage and coordinate the thread spaces of the kernels to be fused. Detailed experimental evaluation shows that the proposed kernel fusion method can reduce energy consumption without performance loss for several typical kernels.
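The abstract does not give the dynamic programming formulation itself, so the following is only a minimal sketch of how a fusion plan might be chosen. It assumes fusion is limited to pairs of kernels and that per-kernel and per-pair energy estimates are already available; the function name `plan_fusion` and both input tables are hypothetical.

```python
from functools import lru_cache

def plan_fusion(solo_energy, fused_energy):
    """Choose which kernels to fuse pairwise to minimize total estimated energy.

    solo_energy[i]       -- assumed energy estimate for kernel i run alone
    fused_energy[(i, j)] -- assumed estimate for kernels i < j fused together
    Returns (total_energy, groups), where each group is a tuple of kernel ids.
    """
    n = len(solo_energy)

    @lru_cache(maxsize=None)
    def best(mask):
        # Bit i of mask is set while kernel i has not been placed yet.
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1  # lowest unplaced kernel
        rest = mask & ~(1 << i)
        # Option 1: run kernel i on its own.
        cost, groups = best(rest)
        best_cost, best_groups = cost + solo_energy[i], ((i,),) + groups
        # Option 2: fuse kernel i with another unplaced kernel j.
        for j in range(i + 1, n):
            if rest & (1 << j) and (i, j) in fused_energy:
                cost, groups = best(rest & ~(1 << j))
                cost += fused_energy[(i, j)]
                if cost < best_cost:
                    best_cost, best_groups = cost, ((i, j),) + groups
        return best_cost, best_groups

    total, groups = best((1 << n) - 1)
    return total, list(groups)
```

The subset-indexed recursion visits each set of unplaced kernels once, so it is practical only for a small number of candidate kernels. Fusing more than two kernels per group would need a richer state than this sketch uses.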
As systems continue to scale up, the power consumption problem of high performance computing (HPC) systems becomes more severe. A heterogeneous system, which integrates two or more kinds of processors, can adapt better to heterogeneity in applications and in theory provide much higher energy efficiency. Many studies have shown that heterogeneous systems are preferable to homogeneous systems in energy consumption in a multi-programmed computing environment. However, how to exploit the energy efficiency (Flops/Watt) of a heterogeneous system for a single application, or even for a single phase of an application, has not been well studied. This paper proposes a power-efficient work distribution method for a single application on a CPU-GPU heterogeneous system. The proposed method coordinates inter-processor work distribution and per-processor frequency scaling to minimize energy consumption under a given scheduling length constraint. We conduct our experiments on a real system equipped with a multi-core CPU and a multi-threaded GPU. Experimental results show that, by distributing work reasonably over the CPU and GPU, the method achieves a 14% reduction in energy consumption compared with static mappings for several typical benchmarks. We also demonstrate that our method adapts to changes in the scheduling length constraint and hardware configurations.
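To make the coupling between work distribution and frequency scaling concrete, here is a minimal brute-force sketch. It is not the paper's method: it assumes each frequency level is described by a measured (throughput, power) pair, that the CPU and GPU run their shares concurrently, and that idle power is ignored. The function name and all parameters are hypothetical.

```python
from itertools import product

def best_distribution(total_work, cpu_levels, gpu_levels, deadline, steps=100):
    """Search work splits and frequency levels for the lowest-energy feasible plan.

    cpu_levels / gpu_levels -- lists of (throughput, power) per frequency level,
                               in work units per second and watts (assumed inputs)
    deadline                -- scheduling length constraint in seconds
    Returns (energy, cpu_work, cpu_level_index, gpu_level_index) or None.
    """
    best = None
    splits = range(steps + 1)
    cpu_ids = range(len(cpu_levels))
    gpu_ids = range(len(gpu_levels))
    for k, ci, gi in product(splits, cpu_ids, gpu_ids):
        cpu_work = total_work * k / steps
        gpu_work = total_work - cpu_work
        cpu_rate, cpu_power = cpu_levels[ci]
        gpu_rate, gpu_power = gpu_levels[gi]
        t_cpu = cpu_work / cpu_rate
        t_gpu = gpu_work / gpu_rate
        # Both processors run in parallel, so the longer one sets the length.
        if max(t_cpu, t_gpu) > deadline:
            continue
        # Simplification: each processor only draws power while it is busy.
        energy = t_cpu * cpu_power + t_gpu * gpu_power
        if best is None or energy < best[0]:
            best = (energy, cpu_work, ci, gi)
    return best
```

With a loose deadline this search tends to put all work on the most energy-efficient processor and level; tightening the deadline forces it to split the work or raise a frequency, which is the trade-off the abstract describes.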
This paper proposes a novel transactional memory design: conflict-graph-based hardware transactional memory. It allows two conflicting transactions to both commit as long as they do not violate serializability. Simulation results show that conflict-graph-based hardware transactional memory outperforms the state-of-the-art transactional memory system.
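The abstract does not describe the hardware mechanism, but the underlying condition can be illustrated with the textbook conflict-serializability test: a set of transactions is serializable as long as the graph of "must come before" dependencies stays acyclic. The sketch below is a software illustration only; the class and method names are hypothetical.

```python
from collections import defaultdict

class ConflictGraph:
    """Directed graph of ordering dependencies between transactions."""

    def __init__(self):
        self.succ = defaultdict(set)  # transaction -> transactions ordered after it

    def _reaches(self, src, dst):
        # Iterative depth-first search from src looking for dst.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.succ[node])
        return False

    def add_dependency(self, before, after):
        """Record that `before` must serialize ahead of `after`.

        Returns False, leaving the graph unchanged, if the new edge would
        close a cycle and therefore break serializability.
        """
        if self._reaches(after, before):
            return False
        self.succ[before].add(after)
        return True
```

In this model, two conflicting transactions can both commit when their conflict only adds an edge in one direction; a conventional design would abort one of them on the first conflict.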
Many applications need to distribute data with different contents efficiently in network environments with unreliable links and high node churn. Existing approaches mostly focus on optimizing either the efficiency or the robustness of data distribution, and fail to ensure both simultaneously. In this paper, we propose Semantic Cast, a content-based data distribution approach over self-organizing semantic overlay networks. Semantic Cast maintains a self-organizing semantic overlay based on view exchange (called Crowd). In Crowd, each node seeks neighbors with more similar interests by periodically exchanging its neighbor list (called its view) with a chosen neighbor. Through this self-organizing behavior, various interest communities emerge in the overlay. For data distribution over Crowd, Semantic Cast uses random walks to route data between interest communities and flooding to disseminate data inside the interested communities. The experimental results show that, compared to existing approaches, Semantic Cast supports efficient content-based data distribution in unreliable and highly dynamic network environments.
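A single view exchange step can be sketched as follows. The abstract does not specify the similarity metric or how the merged view is trimmed, so this sketch assumes interests are sets of keywords compared by Jaccard similarity and that each node keeps its `view_size` most similar candidates; all names here are hypothetical.

```python
def jaccard(x, y):
    """Jaccard similarity of two interest sets (assumed metric)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def exchange_views(a, b, interests, views, view_size):
    """Nodes a and b pool their views, then each keeps its most similar peers.

    interests -- dict: node -> set of interest keywords
    views     -- dict: node -> list of neighbor ids (updated in place)
    """
    pool = set(views[a]) | set(views[b]) | {a, b}
    for node in (a, b):
        candidates = pool - {node}
        # Most similar first; ties are broken by node id for determinism.
        ranked = sorted(
            candidates,
            key=lambda n: (-jaccard(interests[node], interests[n]), n),
        )
        views[node] = ranked[:view_size]
```

Repeating this step between randomly chosen neighbors is what lets nodes with similar interests drift into the same neighborhood of the overlay.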
This paper investigates the influence of on-chip metal interconnections, power grids, the heat sink together with packaging, and metal dummy fills on the transmission characteristics of a 2 mm-long integrated dipole antenna pair. These metal structures and placements are classified, and dedicated simulations are performed to explore the interference effects of various neighboring metal structures on the transmission gain, phase, impedance and radiation pattern of the on-chip dipole antenna pair. Based on the experimental results and analyses, several empirical linear expressions for antenna pair gain and phase under interference are obtained by numerical fitting. A set of design rules is derived accordingly to guide on-chip antenna layout and design for wireless interconnects.
According to Moore's law, the complexity of VLSI circuits has doubled approximately every two years, making simulation the major bottleneck in the circuit design process. Parallel and distributed simulation can be applied as a fast, cost-effective approach to the simulation of large, complex circuits. In this paper, a simple yet effective simulated annealing-based approach is proposed to optimize the choice of a time window for optimistic parallel simulation. We chose gate-level circuit simulation as our experimental vehicle. Our results show up to a 52% improvement in simulation time using our simulated annealing algorithm. To the best of our knowledge, this is the first time that simulated annealing (SA) has been applied to optimize the performance of Time Warp simulations.
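The abstract does not give the annealing schedule or the neighborhood used, so the following is a generic simulated annealing sketch over an integer window size. The `cost` callback stands in for a measured simulation time, and every parameter name and default value is an assumption.

```python
import math
import random

def anneal_window(cost, low, high, start, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Search integer window sizes in [low, high] for a low-cost setting.

    cost -- callable returning the cost of a window size, for example the
            measured run time of the simulation (assumed to be supplied)
    Returns (best_window, best_cost).
    """
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temp = t0
    max_step = max(1, (high - low) // 10)
    for _ in range(steps):
        # Propose a nearby window size, clamped to the allowed range.
        step = rng.choice((-1, 1)) * rng.randint(1, max_step)
        candidate = min(high, max(low, current + step))
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost
```

Because each cost evaluation would correspond to a real simulation run, the number of steps is the main practical limit, and the initial temperature should be scaled to the typical cost differences between neighboring window sizes.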
ISBN: 9783981080162 (print)
This paper presents reuse-aware modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops running on stream processors. The novelty lies in the development of a new representation for an unrolled and software-pipelined stream-level loop using a set of reuse equations, resulting in simultaneous optimization of two performance objectives for the loop, reuse and concurrency, in a unified framework. We have implemented this work in the compiler developed for our 64-bit FT64 stream processor. Our experimental results, obtained on FT64 and by simulation using nine representative stream applications, demonstrate the effectiveness of the proposed approach.
Multi-core architectures, which have multiple processing units on a single chip, are widely viewed as a way to achieve higher processor performance. Scheduling the running threads well on these processors yields higher performance. Modern multi-core systems are designed to allow clusters of cores to share various hardware structures, such as last-level caches, memory controllers, and interconnects, as well as prefetching hardware. Scheduling threads without considering these shared resources causes serious degradation in overall system performance. In this paper we propose a novel thread scheduling algorithm that takes these potential contentions into account in order to avoid them. The simulation results show that the proposed scheduler avoids much of the contention between threads for various resources, especially shared caches.
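The abstract does not describe the proposed algorithm, so the sketch below shows only a common greedy heuristic for the same goal: spread cache-hungry threads across shared caches so that no single cache carries most of the pressure. It assumes a per-thread miss rate is available (for example from performance counters) and ignores the number of cores per cache; the function name is hypothetical.

```python
def spread_threads(miss_rates, num_caches):
    """Assign threads to shared caches, balancing estimated cache pressure.

    miss_rates -- dict: thread id -> assumed cache misses per 1000 instructions
    Returns a list with one list of thread ids per shared cache.
    """
    groups = [[] for _ in range(num_caches)]
    load = [0.0] * num_caches
    # Place the most memory-intensive threads first.
    for thread in sorted(miss_rates, key=lambda t: (-miss_rates[t], t)):
        # Pick the cache with the least accumulated pressure so far.
        target = min(range(num_caches), key=lambda c: (load[c], c))
        groups[target].append(thread)
        load[target] += miss_rates[thread]
    return groups
```

A real scheduler would also have to respect the number of cores behind each cache and account for contention on memory controllers and prefetchers, which this sketch leaves out.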
ISBN: 9781424472611; 9780769540597 (print)
The networked application environment has motivated the development of multitasking operating systems for sensor networks and other low-power electronic devices, but their multitasking capability is severely limited because traditional stack management techniques perform poorly on small-memory systems. In this paper, we show that combining binary translation and a new kernel runtime can lead to efficient OS designs on resource-constrained platforms. We introduce SenSmart, a multitasking OS for sensor networks, and present new OS design techniques for supporting preemptive multi-task scheduling, memory isolation, and versatile stack management. We have implemented SenSmart on MICA2/MICAz motes. Evaluation shows that SenSmart performs efficient binary translation and demonstrates a significantly better capability in managing concurrent tasks than other sensornet operating systems.