Outsourcing jobs to a public cloud is a cost-effective way to satisfy peak resource demand when the local cloud has insufficient resources. In this paper, we study managing deadline-co...
The Java Virtual Machine (JVM) is the cornerstone of Java technology, and its efficiency in executing the portable Java bytecodes is crucial to the success of this technology. Interpretation, Just-In-Time (JIT) compilation, and hardware realization are well-known solutions for a JVM, and previous research has proposed optimizations for each of these techniques. However, each technique has its pros and cons and may not be uniformly attractive for all hardware platforms. Instead, an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms (from resource-rich servers to resource-constrained hand-held/embedded systems). Towards this goal, this paper examines architectural issues from both the hardware and JVM implementation perspectives. It specifically explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs, investigates the CPU and cache architectural support that would benefit JVM implementations, and examines synchronization support for enhancing performance, using applications from the SpecJVM98 benchmarks.
ISBN: (Print) 1595936831
Data cache prefetching in the L2 is at the forefront of prefetching research. In this paper we analyze the impact of virtual page boundaries on these prefetchers. Conservative measurements on real hardware show that 30-50% of consecutive virtual pages are mapped to pages which are not consecutive in physical memory. Advanced hardware prefetching techniques that detect access patterns which span virtual page boundaries often end up prefetching data from the wrong physical page. Meanwhile, current simulation techniques for evaluating prefetching algorithms assume that all virtual pages are mapped consecutively. We show that not accounting for virtual page boundaries in simulation can lead to overestimates of as much as 29% (9% on average). We also show that a simple prefetch filter can improve performance up to 32% (7% on average) and recover the overestimated performance. This leads to the conclusion that although previous simulations may not have accounted for virtual page boundaries, the results they demonstrate are still attainable, and that it is not necessary to simulate virtual page boundaries to get accurate results. However, actual hardware designers should take care to implement a simple filter, or else their hardware may not show the same gains in performance as it did in simulation. Copyright 2007 ACM.
ISBN: (Print) 3981080114
In many computer systems with large data computations, the delay of memory access is one of the major performance bottlenecks. In this paper, we propose an enhanced field remapping scheme for dynamically allocated structures in order to provide better locality than conventional field layouts. Our proposed scheme reduces cache miss rates drastically by aggregating and grouping fields from multiple instances of the same structure, which yields both performance improvement and power reduction. Our methodology will become more important in design space exploration, especially as embedded systems for data-oriented applications become prevalent. Experimental results show that average L1 and L2 data cache misses are reduced by 23% and 17%, respectively. Due to the enhanced locality, our remapping achieves 13% faster execution time on average than the original programs. It also reduces data cache power consumption by 18%.
ISBN: (Print) 9781581134100
A large logical register file is important to allow effective compiler transformations or to provide a windowed space of registers to allow fast function calls. Unfortunately, a large logical register file can be slow, particularly in the context of a wide-issue processor, which requires an even larger physical register file and many read and write ports. Previous work has suggested that a register cache can be used to address this problem. This paper proposes a new register caching mechanism in which a number of good features from previous approaches are combined with existing out-of-order processor hardware to implement a register cache for a large logical register file. It does so by separating the logical register file from the physical register file and using a modified form of register renaming to make the cache easy to implement. The physical register file in this configuration contains fewer entries than the logical register file and is designed so that the physical register file acts as a cache for the logical register file, which is the backing store. The tag information in this caching technique is kept in the register alias table and the physical register file. It is found that the caching mechanism improves IPC up to 20% over an un-cached large logical register file and has performance near to that of a logical register file that is both large and fast.
This paper describes experiences gained during the process of implementing a standard serial benchmark (SLALOM) on a distributed computing system (Pleiades running ESP). The purpose of our experiments has been to maxi...
A fast fault-tolerant controller structure is presented which is capable of recovering from transient faults by performing a rollback operation in hardware. The proposed fault-tolerant controller structure also utilizes the rollback hardware in normal system mode and in this way achieves performance improvements of more than 50% compared with controller structures made fault tolerant by conventional techniques, while the hardware overhead is often negligible. The proposed approach is compatible with state-of-the-art methods for FSM decomposition, state encoding, and logic synthesis.
Author: Maples, Creve
Lawrence Berkeley Lab, Advanced Computer Architecture Lab, Berkeley, CA, USA
The author discusses various approaches that have been taken in developing high-speed computers for scientific problems. He then discusses the MIDAS system, under development at Lawrence Berkeley Laboratory.
Cache memories are essential components in all existing commercial microprocessors. In order to attain the best performance, cache memories have to be managed either with hardware support or compiler support. The compiler approach makes use of specialized cache memory management instructions to generate an optimal management policy. This is done by generating an optimal schedule of these specialized instructions for the program being compiled. Up to now, conservative approaches have been used to tackle this issue, despite the occurrence of unpredictable real-time events and the fact that many variables are imprecise. This explains the unstable performance of these algorithms, which varies from excellent to very poor. We propose to make use of a fuzzy scheduling approach to deal with this problem.