This paper addresses the optimization of parallel simulators for large-scale parallel systems and applications. Such simulators are often based on parallel discrete event simulation with conservative or optimistic protocols to synchronize the simulating processes. The paper considers how available future information about events and application behaviors can be efficiently extracted and exploited to improve the performance of adaptive optimistic protocols. First, we extract information about future events and their dependencies from application traces to guide adaptive adjustment of the time window in trace-driven parallel simulation. Second, we use information about application behaviors, specifically the iterative behavior found in many applications, to avoid unnecessary adjustments of the time window. These techniques are implemented in the BigSim simulator and tested with real-world and standard benchmark applications, including Jacobi3D and HPL. The results show that our optimizations reduce simulation execution times by 11% to 32%. Moreover, our methods are easy to implement and require neither compiler augmentation nor modification of the core code of parallel simulators.
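As a purely illustrative sketch of the idea (not BigSim's implementation), the following Python fragment shows how lookahead mined from a trace might drive adaptive time-window adjustment, and how a repeated iteration signature could suppress unnecessary readjustment; all names and the sizing heuristic are assumptions.

```python
# Hypothetical sketch: adaptive time window for optimistic trace-driven
# PDES, sized from lookahead found in the application trace.

def min_lookahead(trace_events, now, horizon):
    """Smallest gap between an event and its earliest dependent event
    within [now, now + horizon) -- a proxy for how far it is safe to
    simulate optimistically before rollbacks become likely."""
    gaps = [e.dep_time - e.time for e in trace_events
            if now <= e.time < now + horizon and e.dep_time is not None]
    return min(gaps, default=horizon)

class AdaptiveWindow:
    def __init__(self, initial, lo, hi):
        self.window, self.lo, self.hi = initial, lo, hi
        self.last_signature = None

    def adjust(self, trace_events, now, iteration_signature):
        # Iterative applications repeat the same communication pattern;
        # if the signature is unchanged, skip the readjustment entirely.
        if iteration_signature == self.last_signature:
            return self.window
        self.last_signature = iteration_signature
        # Otherwise size the window from the lookahead in the trace.
        la = min_lookahead(trace_events, now, self.window)
        self.window = max(self.lo, min(self.hi, la))
        return self.window
```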
By combining virtual machine technology, virtual computing can effectively aggregate widely distributed resources to provide services to users. We view the federation of multiple data centers and voluntary resources on the Internet as a very large-scale resource pool. Based on the tree structure of this pool, this paper proposes a virtual machine deployment algorithm, called iVDA, that considers users' requests, the capabilities of the physical resources, and the dynamic load. It implements an adaptive mechanism for scheduling servers to host virtual machines, forming virtual execution environments for various applications and supporting on-demand computing.
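The abstract leaves iVDA's internals unspecified; as a minimal sketch under assumed data structures, a placement pass over a tree-structured pool might weigh a request against node capability and current load like this (all names are illustrative):

```python
# Illustrative sketch only -- not the published iVDA algorithm.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cpu: float = 0.0                 # capability; 0 for internal nodes
    load: float = 0.0                # current utilization in [0, 1]
    children: list = field(default_factory=list)

def place_vm(root, demand):
    """Walk the resource tree and pick the server with the most spare
    capacity that still satisfies the request; None if the pool is full."""
    best = None
    stack = [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        spare = n.cpu * (1.0 - n.load)
        if n.cpu > 0 and spare >= demand:
            if best is None or spare > best[0]:
                best = (spare, n)
    if best is None:
        return None
    host = best[1]
    host.load += demand / host.cpu   # placement feeds back into the load
    return host
```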
On chip multiprocessor (CMP) platforms, multiple co-scheduled applications can severely degrade performance and quality of service (QoS) when they contend for last-level cache (LLC) resources. Whether an application will impose destructive interference on co-scheduled applications depends largely on its own inherent cache access behavior. In this work, we first present case studies showing how inter-application interference leads to undesirable performance in both shared and private LLC designs. We then propose a new online approach for identifying application cache behavior, based on detailed simulation and analysis with the SPEC CPU2006 benchmarks. We demonstrate that our approach identifies application cache behaviors more concisely. Moreover, the proposed approach can be implemented directly in hardware to identify application cache behaviors dynamically at runtime. Finally, we show with two case studies how the proposed approach can be adopted by both shared- and private-cache sharing mechanisms, i.e., cache partitioning algorithms (CPAs) and cache spilling techniques, for more concise cache resource management.
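The abstract does not spell out the classifier; as a hedged illustration of the kind of runtime heuristic such hardware could implement (thresholds and category names are assumptions, not the paper's):

```python
# Hypothetical heuristic, not the paper's classifier: bucket an
# application by LLC counters a hardware monitor could collect online.

def classify_cache_behavior(apki, mpki, mpki_with_extra_way):
    """apki: LLC accesses per kilo-instruction
       mpki: LLC misses per kilo-instruction
       mpki_with_extra_way: miss rate if granted one more cache way"""
    if apki < 1.0:
        return "core-bound"           # rarely touches the LLC
    if mpki > 10.0 and mpki_with_extra_way > 0.95 * mpki:
        return "streaming/thrashing"  # extra capacity barely helps
    if mpki_with_extra_way < 0.8 * mpki:
        return "capacity-sensitive"   # would benefit from more ways
    return "cache-friendly"           # fits well, low interference
```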
We investigate theoretically the effect of self and cross-coupling capacitances on the stability diagram of a metallic double-dot device. In the linear transport regime, cross-coupling capacitances affect the dimensions of the honeycomb cell and the distance between the two triple points, while self-capacitances only slightly broaden the cell boundary and bring the two triple points closer together. In the nonlinear transport regime, cross-coupling capacitances stretch the current and charge regions along the mid-line direction, while self-capacitances enlarge the current regions without changing the shape of the stability cells. Cross-coupling capacitances have a stronger impact on the dimensions of the stability diagram than self-capacitances, but the self-capacitance must be included in the current calculation when its value is not negligible relative to the other device parameters.
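For orientation, the constant-interaction electrostatics commonly used to derive such honeycomb stability diagrams expresses the charging energies through the self and mutual capacitances; the notation below is the standard textbook convention and an assumption on my part, not necessarily this paper's:

```latex
% Standard constant-interaction model for a metallic double dot
% (common convention; not necessarily the paper's notation).
U(N_1,N_2) = \tfrac{1}{2}N_1^2 E_{C1} + \tfrac{1}{2}N_2^2 E_{C2}
           + N_1 N_2 E_{Cm} + f(V_{g1},V_{g2}),
\qquad
E_{C1} = \frac{e^2}{C_1}\cdot\frac{1}{1 - C_m^2/(C_1 C_2)},
\qquad
E_{Cm} = \frac{e^2}{C_m}\cdot\frac{1}{C_1 C_2/C_m^2 - 1},
```

where $C_1$, $C_2$ are the total capacitances of the two dots, $C_m$ is the inter-dot (cross) capacitance, and $E_{C2}$ follows from $E_{C1}$ by exchanging indices. As $C_m \to 0$ the two triple points merge, consistent with the trends described above.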
ISBN (print): 9781424486670
Network calculus is a promising theory, based on min-plus algebra, for analyzing and modeling networks. Using network calculus, we propose arrival-curve and service-curve formulas for end-to-end communication, build the corresponding time model, and derive communication delay formulas for two scenarios of the model. We then take the fat-tree topology, widely used in InfiniBand interconnects, as an example to analyze the delay of one-to-all broadcast. As groundwork, this paper provides a new approach for network researchers to investigate communication delay in future research.
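The derived formulas are not reproduced in the abstract; for context, the textbook network-calculus delay bound for a flow with arrival curve $\alpha$ crossing a node offering service curve $\beta$ is the maximal horizontal deviation between the two curves, which has a closed form for the common token-bucket/rate-latency pair:

```latex
% Classical network-calculus delay bound (textbook form, for context):
D \le h(\alpha,\beta) = \sup_{t \ge 0}\,
      \inf\{\, d \ge 0 : \alpha(t) \le \beta(t+d) \,\}.
% With a token-bucket arrival curve and a rate-latency service curve,
\alpha(t) = b + r t, \qquad \beta(t) = R\,[t - T]^{+}, \qquad r \le R
\;\;\Longrightarrow\;\; D \le T + \frac{b}{R}.
```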
ISBN (print): 9781424465392
As one of the most popular many-core architectures, GPUs have demonstrated their power in many non-graphics applications. Traditional general-purpose computing systems tend to integrate the GPU as a co-processor to accelerate parallel computing tasks. Meanwhile, GPUs also incur high power consumption, which accounts for a large proportion of total system power. In this paper, we focus on power analysis and optimization for the GPU architecture. The main contributions are: first, we establish a GPU power research platform, extended from an existing GPU simulator with several power models; second, we validate that, as the gap between shader-core and memory speed grows, integrating more shader cores or raising clock frequencies may not bring better performance but does result in higher energy consumption; third, we show that traditional CPU power optimization methods, such as dynamic frequency scaling and concurrency throttling, can be effectively applied to GPU architectures for better power efficiency, especially for memory-intensive applications.
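A minimal first-order model of the kind often layered onto simulators illustrates the memory-bound argument: below the memory-bandwidth limit, lowering core frequency (and voltage with it) barely changes runtime but cuts dynamic power. The constants and the min() performance model here are assumptions for illustration, not the paper's power models.

```python
# First-order GPU energy sketch (illustrative assumptions throughout).

def gpu_energy(work, f_core, v_core, mem_rate,
               c_eff=100e-9, p_static=30.0):
    """Energy (J) to retire `work` core-cycles when throughput is
    limited by min(core frequency, effective memory service rate)."""
    rate = min(f_core, mem_rate)        # memory-bound once f > mem_rate
    t = work / rate                     # execution time (s)
    p_dyn = c_eff * v_core**2 * f_core  # dynamic power ~ C * V^2 * f
    return (p_dyn + p_static) * t

# Memory-bound kernel: same runtime at both operating points, less power.
base   = gpu_energy(1e9, f_core=1.5e9, v_core=1.1, mem_rate=1.0e9)
scaled = gpu_energy(1e9, f_core=1.0e9, v_core=0.9, mem_rate=1.0e9)
print(f"1.5 GHz: {base:.0f} J   1.0 GHz: {scaled:.0f} J")
```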
Multithreading is a promising technique, widely used in general-purpose processors, for hiding long-latency events such as cache misses. This paper proposes an embedded processor design with multithreading support based on the OR1200 processor. The multithreaded OR1200 supports interleaved execution of four threads in a round-robin fashion. The hardware design is evaluated through RTL simulation of the Verilog code. Results show that interleaved execution of multiple threads tolerates memory latency effectively, achieving an average speedup of 1.16.
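As a behavioral sketch of the scheme, not the actual OR1200 RTL, a four-thread barrel pipeline can be modeled by rotating the issue slot round-robin and skipping threads stalled on memory; the latency encoding below is an assumption:

```python
# Behavioral model of interleaved (barrel) multithreading: one issue
# slot per cycle rotates round-robin over four threads, skipping any
# thread still stalled on a long-latency event such as a cache miss.

NUM_THREADS = 4

def simulate(threads, cycles):
    """threads[i] is an iterator of per-instruction latencies:
       1 = single-cycle op, >1 = a miss stalling the thread that long."""
    stall_until = [0] * NUM_THREADS
    last, issued = -1, 0
    for cycle in range(cycles):
        for k in range(1, NUM_THREADS + 1):  # next ready thread in RR order
            t = (last + k) % NUM_THREADS
            if stall_until[t] <= cycle:
                stall_until[t] = cycle + next(threads[t], 1)
                last, issued = t, issued + 1
                break                         # only one issue slot per cycle
    return issued / cycles                    # pipeline utilization
```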
As a fast on-chip SRAM managed by software (the application and/or compiler), Scratchpad Memory (SPM) is widely used in many fields. This paper presents Sim-spm, a SimpleScalar-based architecture simulator for multi-level SPM memory hierarchies. We simulate the hardware of the multi-level SPM hierarchy by extending Sim-outorder, the out-of-order simulator from SimpleScalar. By simulating the memory system, the framework builds the multi-level SPM hierarchy on top of the existing ISA (Instruction Set Architecture), which largely removes the need to modify the existing compiler. Experimental results show that Sim-spm accurately simulates the running state of a processor with a multi-level SPM memory hierarchy, making it a promising tool for research on such architectures.
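One plausible reading of why no compiler change is needed (an assumption on my part, not a statement of Sim-spm's actual design) is that each SPM level occupies a fixed region of the existing address space, so data placement alone targets it:

```python
# Illustrative address-range model of a two-level SPM hierarchy; the
# base addresses, sizes, and latencies are invented for the sketch.

SPM_LEVELS = [
    # (base address, size in bytes, access latency in cycles)
    (0x1000_0000,  16 * 1024, 1),   # L1 SPM: small and fast
    (0x2000_0000, 256 * 1024, 4),   # L2 SPM: larger, slower
]
DRAM_LATENCY = 100

def access_latency(addr):
    """Latency of a load/store, decided purely by address decoding,
    so the ISA and the compiler stay unchanged."""
    for base, size, lat in SPM_LEVELS:
        if base <= addr < base + size:
            return lat
    return DRAM_LATENCY  # everything else falls through to main memory
```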
ISBN (print): 9781424465392; 9781424465422
As supercomputers grow in scale, the communication time during execution increases, drawing the interest of architecture researchers. In this paper, based on the fat-tree topology widely used in InfiniBand, we present a one-to-all broadcast communication time model. After classifying applications into two kinds, we establish an ideal model and a bandwidth-limited model on exponential-capacity binary fat-trees for the two kinds of applications. By analyzing the models, we obtain curves describing the relationship between communication time and processor count. The conclusions of this paper can help system designers make better design decisions.
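The paper's two models are not reproduced in the abstract; as a point of reference only, a classical tree-broadcast estimate on $P$ processors (an assumption, not the paper's derived formulas) takes the form:

```latex
% Classical log-depth broadcast estimate, for orientation only.
% \alpha = per-hop latency, m = message size, B = link bandwidth.
T_{\mathrm{ideal}}(P) \approx \lceil \log_2 P \rceil \,\alpha,
\qquad
T_{\mathrm{bw}}(P) \approx \lceil \log_2 P \rceil
      \left( \alpha + \frac{m}{B} \right).
```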
ISBN (print): 9781424497799
As one of the most popular accelerators, the Graphics Processing Unit (GPU) has demonstrated high computing power in several application fields. On the other hand, the GPU also has high power consumption and has become one of the largest power consumers in desktop and supercomputer systems. However, software power optimization methods targeted at GPUs have not been well studied. In this work, we propose a kernel fusion method to reduce energy consumption and improve power efficiency on the GPU architecture. By fusing two or more independent kernels, kernel fusion achieves higher utilization and a much more balanced demand for hardware resources, which opens more room for power optimizations such as dynamic voltage and frequency scaling (DVFS). Based on the CUDA programming model, this paper also gives several fusion methods targeted at different situations. To make judicious fusion decisions, we formulate the fusion of multiple independent kernels as a dynamic programming problem, which can be solved with many existing tools and easily embedded into a compiler or runtime system. To reduce the overhead introduced by kernel fusion, we also propose an effective method to reduce shared memory usage and coordinate the thread spaces of the kernels to be fused. Detailed experimental evaluation validates that the proposed kernel fusion method can reduce energy consumption without performance loss for several typical kernels.
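The abstract does not give the paper's exact dynamic program; as a toy formulation under assumed inputs (kernels pre-sorted by compute/memory ratio, a fixed per-launch saving, and a balance-based score), pairing decisions can be made like this:

```python
# Toy DP over which adjacent kernels (in sorted order) to fuse; the
# scoring and constants are assumptions, not the paper's formulation.
from functools import lru_cache

LAUNCH_SAVING = 1.0  # assumed fixed gain per avoided kernel launch

def fusion_plan(kernels):
    """kernels: list of (compute_intensity, memory_intensity) tuples,
    pre-sorted by compute/memory ratio. Returns (score, grouping)."""
    def benefit(a, b):
        ca, ma = kernels[a]
        cb, mb = kernels[b]
        # Fusing saves a launch, but pays off mainly when the pair's
        # compute and memory demands balance each other out.
        return LAUNCH_SAVING - abs((ca + cb) - (ma + mb))

    @lru_cache(maxsize=None)
    def dp(i):
        if i >= len(kernels):
            return 0.0, ()
        score_alone, plan_alone = dp(i + 1)           # kernel i runs alone
        best = (score_alone, ((i,),) + plan_alone)
        if i + 1 < len(kernels):                      # or fuse i with i+1
            score_pair, plan_pair = dp(i + 2)
            cand = (score_pair + benefit(i, i + 1),
                    ((i, i + 1),) + plan_pair)
            if cand[0] > best[0]:
                best = cand
        return best

    return dp(0)
```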