On chip multiprocessor (CMP) platforms, multiple co-scheduled applications can severely degrade performance and quality of service (QoS) when they contend for last-level cache (LLC) resources. Whether an application will impose destructive interference on co-scheduled applications depends largely on its own inherent cache access behavior. In this work, we first present case studies that show how inter-application interference results in undesirable performance in both shared and private cache-based LLC designs. We then propose a new online approach for identifying application cache behavior, based on detailed simulation and analysis with the SPEC CPU2006 benchmarks. We demonstrate that our approach can identify application cache behaviors more concisely. Moreover, the proposed approach can be implemented directly in hardware to identify application cache behaviors dynamically at runtime. Finally, we show with two case studies how the proposed approach can be adopted by both shared and private cache sharing mechanisms, i.e. cache partitioning algorithms (CPAs) and cache spilling techniques, for more concise cache resource management.
We investigate, using theoretical methods, the effect of self and cross-coupling capacitances on the stability diagram of a metallic double-dot device. In the linear transport regime, cross-coupling capacitances affect the dimensions of the honeycomb cell and the distance between the two triple points, while self capacitances only slightly broaden the boundary of the cell and bring the two triple points closer together. In the nonlinear transport regime, cross-coupling capacitances stretch the current and charge regions along the mid-line direction, while self capacitances extend the current regions but do not change the shape of the stability cells. Cross-coupling capacitances have a stronger impact on the dimensions of the stability diagram than self capacitances. However, the self capacitance must be included in the current calculation if its value cannot be neglected relative to the device parameters.
ISBN: 9781424486670 (print)
Network calculus is a promising theory, based on min-plus algebra, for analyzing and modeling networks. Using network calculus, we propose formulas for the arrival curve and service curve of end-to-end communication, build the corresponding time model, and derive communication delay formulas for two scenarios of the model. We then take the fat-tree topology, which is widely used in InfiniBand interconnects, as an example and analyze the delay of one-to-all broadcast. As groundwork, this paper gives network researchers a new approach for studying communication delay in future research.
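The abstract does not reproduce the paper's own formulas, so the following is only a minimal sketch of the standard textbook case, not the authors' model. It assumes a token-bucket arrival curve α(t) = σ + ρt and rate-latency service curves β(t) = R·max(0, t − T). Under those assumptions, servers in sequence concatenate into one rate-latency curve and the worst-case delay is T + σ/R. The function names and the two-hop path values are hypothetical.

```python
def concatenate(servers):
    """Combine rate-latency servers, given as (rate, latency) pairs, in sequence.

    The min-plus convolution of rate-latency curves is again a rate-latency
    curve: the bottleneck (minimum) rate with the summed latencies.
    """
    rates = [rate for rate, _ in servers]
    latencies = [latency for _, latency in servers]
    return min(rates), sum(latencies)


def delay_bound(sigma, rho, rate, latency):
    """Worst-case delay of a token-bucket flow (burst sigma, rate rho)
    through a rate-latency server (rate, latency): latency + sigma / rate."""
    if rho > rate:
        raise ValueError("sustained arrival rate exceeds service rate; delay is unbounded")
    return latency + sigma / rate


# Hypothetical two-hop path: (rate, latency) per hop, in arbitrary units.
path = [(10.0, 2.0), (5.0, 1.0)]
end_to_end_rate, end_to_end_latency = concatenate(path)
bound = delay_bound(sigma=20.0, rho=4.0, rate=end_to_end_rate, latency=end_to_end_latency)
```

Computing the bound on the concatenated curve, rather than summing per-hop bounds, charges the burst term only once. This is the usual reason for working with end-to-end service curves.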
ISBN: 9781424465392 (print)
As one of the most popular many-core architectures, GPUs have demonstrated their power in many non-graphics applications. Traditional general-purpose computing systems tend to integrate a GPU as a co-processor to accelerate parallel computing tasks. However, GPUs also consume a great deal of power, which accounts for a large proportion of total system power consumption. In this paper, we focus on power analysis and optimization for GPU architectures. The main contributions of this paper are as follows. First, we establish a GPU power research platform, extended from an existing GPU simulator with several power models. Second, we validate that, as the gap between shader core and memory speed widens, integrating more shader cores or raising operating frequencies may not improve performance but does result in higher energy consumption. Third, we show that traditional power optimization methods for CPUs, such as dynamic frequency scaling and concurrency throttling, can be applied effectively to GPU architectures for better power efficiency, especially for memory-intensive applications.
Multithreading is a promising technique that is widely used in general-purpose processors to hide long-latency events such as cache misses. This paper proposes an embedded processor design with multithreading support based on the OR1200 processor. The multithreaded OR1200 processor supports interleaved execution of four threads in a round-robin fashion. The hardware design is evaluated through RTL simulation of the Verilog code. Results show that interleaved execution of multiple threads tolerates memory latency effectively, achieving an average speed-up of 1.16.
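To illustrate why round-robin interleaving hides memory latency, here is a toy cycle-count model. It is a sketch under loud assumptions, not the OR1200 design: one instruction issues per cycle, every `miss_every`-th instruction of a thread stalls that thread for `miss_latency` cycles, and the scheduler skips stalled threads. The abstract does not say whether the real hardware skips them. All names and parameters are hypothetical, and the model does not attempt to reproduce the reported 1.16 speed-up.

```python
def cycles_to_finish(num_threads, instrs_per_thread, miss_every, miss_latency):
    """Cycles needed to issue all instructions with round-robin thread selection."""
    remaining = [instrs_per_thread] * num_threads
    issued = [0] * num_threads
    ready_at = [0] * num_threads  # first cycle at which each thread may issue again
    cycle = 0
    next_thread = 0
    while any(remaining):
        # Scan threads in round-robin order, starting after the last issuer.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                issued[t] += 1
                if issued[t] % miss_every == 0:
                    # Assumed miss: this thread stalls for miss_latency cycles.
                    ready_at[t] = cycle + 1 + miss_latency
                next_thread = (t + 1) % num_threads
                break
        cycle += 1  # the cycle is spent whether or not any thread issued
    return cycle


# Four threads run back to back versus interleaved on the same pipeline.
sequential = 4 * cycles_to_finish(1, 100, miss_every=10, miss_latency=20)
interleaved = cycles_to_finish(4, 100, miss_every=10, miss_latency=20)
```

In this model the interleaved run finishes sooner because other threads fill the cycles that a single thread would spend waiting on a miss. The size of the gain depends entirely on the assumed miss rate and latency.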
As a fast on-chip SRAM managed by software (the application and/or compiler), scratchpad memory (SPM) is widely used in many fields. This paper presents Sim-spm, a SimpleScalar-based architecture simulator for multi-level SPM memory hierarchies. We simulate the hardware of the multi-level SPM memory hierarchy by extending Sim-outorder, the out-of-order simulator from SimpleScalar. Using a memory simulation method, the simulation framework for the multi-level SPM memory hierarchy is built under the existing ISA (instruction set architecture), which largely reduces the need to modify the existing compiler. The experimental results show that Sim-spm can accurately simulate the running state of a processor with a multi-level SPM memory hierarchy, and that it is a promising tool for research on multi-level SPM memory hierarchy architectures.
ISBN: 9781424465392; 9781424465422 (print)
As supercomputers grow in scale, communication time during execution increases, which has attracted the interest of architecture researchers. In this paper, based on the fat-tree topology, which is widely used in InfiniBand, we present a one-to-all broadcast communication time model. After classifying applications into two kinds, we establish an ideal model and a bandwidth-limited model on exponential-capacity binary fat-trees for the two kinds of applications. By analyzing the models, we obtain curves that describe the relationship between communication time and the number of processors. Our conclusions can help system designers make better design decisions.
ISBN: 9781424497799 (print)
As one of the most popular accelerators, the graphics processing unit (GPU) has demonstrated high computing power in several application fields. On the other hand, the GPU also consumes a great deal of power and has become one of the largest power consumers in desktop and supercomputer systems. However, software power optimization methods targeting GPUs have not been well studied. In this work, we propose a kernel fusion method to reduce energy consumption and improve power efficiency on GPU architectures. By fusing two or more independent kernels, kernel fusion achieves higher utilization and a much more balanced demand for hardware resources, which provides more potential for power optimizations such as dynamic voltage and frequency scaling (DVFS). Based on the CUDA programming model, this paper also gives several fusion methods targeted at different situations. To choose a judicious fusion strategy, we formulate the process of fusing multiple independent kernels as a dynamic programming problem, which can be solved with many existing tools and easily embedded into a compiler or runtime system. To reduce the overhead introduced by kernel fusion, we also propose an effective method to reduce shared memory usage and coordinate the thread space of the kernels to be fused. Detailed experimental evaluation validates that the proposed kernel fusion method can reduce energy consumption without performance loss for several typical kernels.
This paper proposes a novel transactional memory design: conflict graph based hardware transactional memory. It allows two conflicting transactions to both commit if they do not violate the condition of serializability. Simulation results show that conflict graph based hardware transactional memory outperforms the state-of-the-art transactional memory system.
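The abstract does not describe how the hardware tracks conflicts, so the following is only a sketch of the textbook conflict-serializability test, offered as background rather than as the paper's mechanism. It assumes each conflict is recorded as a directed edge `(a, b)`, meaning transaction `a` must be ordered before transaction `b`. The set of transactions is serializable exactly when that graph has no cycle. The function name and transaction labels are hypothetical.

```python
from collections import defaultdict


def is_serializable(edges):
    """Return True if the conflict graph given by (before, after) edges is acyclic."""
    graph = defaultdict(list)
    nodes = set()
    for before, after in edges:
        graph[before].append(after)
        nodes.update((before, after))

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in nodes}

    def has_cycle(node):
        # Depth-first search; reaching an in-progress node means a back edge.
        state[node] = IN_PROGRESS
        for successor in graph[node]:
            if state[successor] == IN_PROGRESS:
                return True
            if state[successor] == UNVISITED and has_cycle(successor):
                return True
        state[node] = DONE
        return False

    return not any(state[n] == UNVISITED and has_cycle(n) for n in sorted(nodes))


# T1 and T2 conflict in one direction only: both may commit, ordered T1 then T2.
one_way = is_serializable([("T1", "T2")])
# Conflicts in both directions form a cycle: no serial order exists.
two_way = is_serializable([("T1", "T2"), ("T2", "T1")])
```

A conflict alone therefore does not force an abort under this test. Only a cycle of ordering constraints does, which is the property the abstract's claim relies on.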
As systems continue to scale up, power consumption in high performance computing (HPC) systems becomes a more severe problem. A heterogeneous system, which integrates two or more kinds of processors, can adapt better to heterogeneity in applications and, in theory, provide much higher energy efficiency. Many studies have shown that heterogeneous systems are preferable to homogeneous systems in terms of energy consumption in a multi-programmed computing environment. However, how to exploit the energy efficiency (Flops/Watt) of a heterogeneous system for a single application, or even for a single phase of an application, has not been well studied. This paper proposes a power-efficient work distribution method for a single application on a CPU-GPU heterogeneous system. The proposed method coordinates inter-processor work distribution and per-processor frequency scaling to minimize energy consumption under a given scheduling length constraint. We conduct our experiments on a real system equipped with a multi-core CPU and a multi-threaded GPU. Experimental results show that, by distributing work reasonably between the CPU and GPU, the method achieves a 14% reduction in energy consumption compared with static mappings for several typical benchmarks. We also demonstrate that our method adapts to changes in the scheduling length constraint and hardware configurations.
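The abstract does not give the paper's power or performance models, so the sketch below uses a purely hypothetical one to show the shape of the optimization: choose a work split and a frequency for each processor so that energy is minimal while the finishing time stays within the scheduling length constraint. It assumes throughput scales linearly with frequency, active power is a static term plus a cubic dynamic term, and a processor that finishes early idles until the other is done. The search is a plain exhaustive scan, and every name and number is illustrative only.

```python
from itertools import product


def device_energy(work, freq, dev, makespan):
    """Energy of one device: active while busy, idle until the makespan."""
    busy = work / (dev["rate_per_ghz"] * freq)
    active_power = dev["static_w"] + dev["dyn_coeff"] * freq ** 3
    return active_power * busy + dev["idle_w"] * (makespan - busy)


def best_distribution(total_work, cpu, gpu, deadline, steps=20):
    """Scan work splits and frequency pairs; return the lowest-energy feasible
    choice as (energy, cpu_share, f_cpu, f_gpu, makespan), or None if infeasible."""
    best = None
    for i, f_cpu, f_gpu in product(range(steps + 1), cpu["freqs"], gpu["freqs"]):
        w_cpu = total_work * i / steps
        w_gpu = total_work - w_cpu
        t_cpu = w_cpu / (cpu["rate_per_ghz"] * f_cpu)
        t_gpu = w_gpu / (gpu["rate_per_ghz"] * f_gpu)
        makespan = max(t_cpu, t_gpu)
        if makespan > deadline:
            continue  # violates the scheduling length constraint
        energy = (device_energy(w_cpu, f_cpu, cpu, makespan)
                  + device_energy(w_gpu, f_gpu, gpu, makespan))
        if best is None or energy < best[0]:
            best = (energy, i / steps, f_cpu, f_gpu, makespan)
    return best


# Hypothetical device parameters (work units per second per GHz, watts).
cpu = {"rate_per_ghz": 10.0, "freqs": [1.0, 2.0, 3.0],
       "static_w": 20.0, "dyn_coeff": 2.0, "idle_w": 10.0}
gpu = {"rate_per_ghz": 50.0, "freqs": [0.5, 1.0],
       "static_w": 40.0, "dyn_coeff": 60.0, "idle_w": 15.0}
choice = best_distribution(1000.0, cpu, gpu, deadline=40.0)
```

Relaxing the deadline can only enlarge the feasible set, so the minimum energy found never increases as the constraint loosens. This gives a simple sanity check for any model plugged into the sketch.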