Details
ISBN:
(Print) 9780769535210
SPECjvm2008 is a new benchmark suite for measuring client-side Java runtime environments. It replaces SPECjvm98, which has been used for the same purpose for more than ten years. It consists of 38 benchmark programs grouped into eleven categories and covers a wide variety of workloads, from computation-intensive kernels to XML file processors. In this paper, we present the results of running SPECjvm2008 on three machines whose CPUs share the same microarchitecture but differ in cache size and clock speed. The measurements include instruction and data cache reference and miss rates, and the effect of multi-threading. We relate these workload parameters to the SPECjvm2008 performance metrics. Roughly speaking, an L2 cache of 1MB is sufficient to lower the cache miss rates of SPECjvm2008, and compared to single-core execution, dual-core execution achieves speed-ups of 1.5 to 2 times.
Authors:
Li, Zhe; Wu, Mingyu
Shanghai Jiao Tong Univ, Inst Parallel & Distributed Syst, Engn Res Ctr Domain-specific Operating Syst, Minist Educ, Shanghai, Peoples R China
Shanghai Jiao Tong Univ, Engn Res Ctr Domain-specific Operating Syst, Inst Parallel & Distributed Syst, Minist Educ; Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Details
ISBN:
(Print) 9781450392518
Managed workloads show strong demand for large memory capacity, which can be satisfied by a hybrid memory sub-system composed of traditional DRAM and the emerging non-volatile memory (NVM) technology. Nevertheless, NVM devices are limited by deficiencies like write endurance and asymmetric bandwidth, which threaten managed applications' performance and reliability. Prior work has proposed different object placement mechanisms to mitigate the problems introduced by NVM, but they require domain-specific knowledge of applications or significant changes to the managed runtime. By analyzing the performance of representative data-intensive workloads atop NVM, this paper finds that reducing write operations is key for performance and wear-leveling. To this end, this paper proposes GCMove, a transparent and efficient object placement mechanism for hybrid memories. GCMove embraces a lightweight write barrier for write detection and relies on garbage collection (GC) to copy objects onto different devices according to their write-related behaviors. Compared with prior work, GCMove does not require significant changes to the heap layout and thus can be easily integrated with mainstream copy-based garbage collectors. The evaluation on various managed workloads shows that GCMove can eliminate 99.8% of NVM write operations on average and improve performance by up to 19.81x compared with the NVM-only version.
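The write-barrier-plus-GC placement idea can be sketched in plain Java. This is a conceptual illustration only: GCMove itself lives inside the JVM runtime, and the threshold and bookkeeping below are hypothetical, not the paper's actual mechanism.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Conceptual sketch: a write barrier records per-object write counts;
// at GC time, objects are placed in DRAM (write-hot) or NVM (write-cold)
// based on the observed behavior.
public class WriteBarrierSketch {
    enum Tier { DRAM, NVM }

    static final Map<Object, Integer> writeCounts = new IdentityHashMap<>();
    static final int HOT_THRESHOLD = 4; // hypothetical tuning knob

    // Every reference/field store would pass through this barrier.
    static void writeBarrier(Object target) {
        writeCounts.merge(target, 1, Integer::sum);
    }

    // At collection time, decide placement from observed behavior.
    static Tier placeDuringGC(Object obj) {
        return writeCounts.getOrDefault(obj, 0) >= HOT_THRESHOLD
                ? Tier.DRAM   // write-hot: keep in DRAM, spare NVM endurance
                : Tier.NVM;   // write-cold: move to NVM for capacity
    }

    public static void main(String[] args) {
        Object hot = new Object(), cold = new Object();
        for (int i = 0; i < 10; i++) writeBarrier(hot);
        writeBarrier(cold);
        System.out.println(placeDuringGC(hot));  // DRAM
        System.out.println(placeDuringGC(cold)); // NVM
    }
}
```

The real system piggybacks this decision on a copying collector's evacuation phase, so no extra heap traversal is needed.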
Details
ISBN:
(Print) 9781605583921
Decentralized information flow control (DIFC) is a promising model for writing programs with powerful, end-to-end security guarantees. Current DIFC systems that run on commodity hardware can be broadly categorized into two types: language-level and operating-system-level DIFC. Language-level solutions provide no guarantees against security violations on system resources, like files and sockets. Operating-system solutions can mediate access to system resources, but are inefficient at monitoring the flow of information through fine-grained program data structures. This paper describes Laminar, the first system to implement decentralized information flow control using a single set of abstractions for OS resources and heap-allocated objects. Programmers express security policies by labeling data with secrecy and integrity labels, and then access the labeled data in lexically scoped security regions. Laminar enforces the security policies specified by the labels at runtime. Laminar is implemented using a modified Java virtual machine and a new Linux security module. This paper shows that security regions ease incremental deployment and limit dynamic security checks, allowing us to retrofit DIFC policies onto four application case studies. Replacing the applications' ad hoc security policies changes less than 10% of the code and incurs performance overheads from 1% to 56%. Whereas prior DIFC systems only support limited types of multithreaded programs, Laminar supports a more general class of multithreaded DIFC programs that can access heterogeneously labeled data.
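The labeling model can be illustrated with a minimal sketch. The API below is hypothetical (Laminar's real enforcement sits in a modified JVM and a Linux security module): data carries a secrecy label, and a read succeeds only inside a region entered with capabilities covering that label.

```java
import java.util.Set;

// Conceptual DIFC sketch (not Laminar's real API): secrecy labels are
// tag sets; a security region may read labeled data only if its
// capability set covers the data's label.
public class DifcSketch {
    static final class Labeled<T> {
        final T value;
        final Set<String> secrecy;
        Labeled(T value, Set<String> secrecy) {
            this.value = value;
            this.secrecy = Set.copyOf(secrecy);
        }
    }

    // Models a read inside a lexically scoped security region that was
    // entered with the given capabilities.
    static <T> T readInRegion(Labeled<T> data, Set<String> capabilities) {
        if (!capabilities.containsAll(data.secrecy))
            throw new SecurityException("label " + data.secrecy + " not covered");
        return data.value;
    }

    public static void main(String[] args) {
        Labeled<String> secret = new Labeled<>("salary", Set.of("alice"));
        System.out.println(readInRegion(secret, Set.of("alice"))); // allowed
        try {
            readInRegion(secret, Set.of()); // no capability for "alice"
        } catch (SecurityException e) {
            System.out.println("denied");
        }
    }
}
```

Integrity labels work dually (the writer's capabilities must cover the data's integrity tags); they are omitted here for brevity.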
Details
ISBN:
(Print) 9781605585772
While managed languages such as C# and Java have become quite popular in enterprise computing, they are still considered unsuitable for hard real-time systems. In particular, the presence of garbage collection has been a sore point for their acceptance for low-level system programming tasks. Real-time extensions to these languages have the dubious distinction of, at the same time, eschewing the benefits of high-level programming and failing to offer competitive performance. The goal of our research is to explore the limitations of high-level managed languages for real-time systems programming. To this end we target a real-world embedded platform, the LEON3 architecture running the RTEMS real-time operating system, and demonstrate the feasibility of writing garbage-collected code in critical parts of embedded systems. We show that Java with a concurrent, real-time garbage collector can achieve throughput close to that of C programs, coming within 10% in the worst observed case on realistic benchmarks. We provide a detailed breakdown of the costs of Java features and their execution times, and compare against real-time and throughput-optimized commercial Java virtual machines.
Details
ISBN:
(Print) 9798331505356; 9798331505349
The Java Vector API (JVA) is a novel feature of the Java virtual machine (JVM), allowing developers to express vector computations that are automatically translated to vector hardware instructions at runtime. This paper focuses on the vectorization capability of the API, which has not yet been studied in the literature. We investigate how effective the JVA is at automatically vectorizing scalar instructions, comparing it with the auto-vectorization capability of the HotSpot C2 compiler. Our results show that using the JVA results in far fewer instructions executed (i.e., 79.62% on average on processors supporting AVX-512) than using C2 auto-vectorization for carrying out the same work.
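For readers unfamiliar with the API, a typical JVA kernel looks roughly like the following. The fused multiply-add loop is our own illustrative example, not one of the paper's benchmarks, and on JDKs where the API is still incubating it must be compiled and run with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Element-wise a[i] = a[i] * b[i] + c[i], expressed as explicit lanes
// that the JIT maps to SIMD instructions (e.g., AVX-512) at runtime.
public class VectorApiSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static void fma(float[] a, float[] b, float[] c) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length); // largest multiple of lane count
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            FloatVector vc = FloatVector.fromArray(SPECIES, c, i);
            va.mul(vb).add(vc).intoArray(a, i);
        }
        for (; i < a.length; i++) a[i] = a[i] * b[i] + c[i]; // scalar tail
    }
}
```

Because the lane shape is stated explicitly, the JIT does not have to prove vectorizability from a scalar loop, which is the contrast with C2 auto-vectorization that the paper measures.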
Details
ISBN:
(Print) 9781479940875
A 64-bit RISC processor is designed for large applications that need a large memory address space. Due to the fixed instruction length, loading a 64-bit address requires a number of instructions, incurring a penalty in both memory performance and memory consumption. This paper describes an address computation method based on hardware/software co-design. In our extended MIPS processor, which supports register + register addressing, we achieve memory access behavior approximating that of the 32-bit counterparts; we also propose a software load-address method, which simplifies the calculation of 64-bit addresses. We implement our methods in the 64-bit OpenJDK 6 on MIPS, and give both performance and consumption comparisons for SPECjvm2008 and DaCapo. The experimental results show that the performance of SPECjvm2008 is improved by 5.1% and that of DaCapo by 7.3%, approaching 24% for some benchmarks. The size of the methods generated by the JVM compiler is reduced by an average of 13%.
Details
ISBN:
(Print) 9781450312127
The ability of tiny embedded devices to run large, feature-rich programs is typically constrained by the amount of memory installed on such devices. Furthermore, the useful operation of these devices in wireless sensor applications is limited by their battery life. This paper presents a call stack redesign targeted at efficient use of RAM storage and CPU cycles by a Java program running on a wireless sensor mote. Without compromising the application programs, our call stack redesign saves 30% of RAM on average, evaluated over a large number of benchmarks. On the same set of benchmarks, our design also avoids frequent RAM allocations and deallocations, resulting on average in 80% fewer memory operations and 23% faster program execution. These may be critical improvements for tiny embedded devices that are equipped with a small amount of RAM and limited battery life. However, our call stack redesign is equally effective for any complex multi-threaded object-oriented program developed for desktop computers. We describe the redesign, measure its performance, and report the resulting savings in RAM and execution time for a wide variety of programs.
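One way to avoid frequent allocations and deallocations on call/return, in the spirit of (though not necessarily identical to) the redesign above, is to recycle call frames through a free pool. The sketch below is a hypothetical plain-Java illustration, not the paper's actual mote runtime.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: an interpreter call stack that recycles frames
// through a free pool, so a call/return pair reuses memory instead of
// allocating and deallocating a frame each time.
public class FramePoolSketch {
    static final class Frame {
        Object[] locals;
        Frame(int size) { locals = new Object[size]; }
    }

    private final Deque<Frame> stack = new ArrayDeque<>();
    private final Deque<Frame> pool = new ArrayDeque<>();
    int allocations = 0; // counts real allocations, for illustration

    Frame pushFrame(int localCount) {
        Frame f = pool.pollFirst();
        if (f == null || f.locals.length < localCount) {
            f = new Frame(localCount); // pool miss: allocate for real
            allocations++;
        }
        stack.push(f);
        return f;
    }

    void popFrame() {
        pool.push(stack.pop()); // recycle instead of freeing
    }
}
```

After a warm-up call, a steady call/return pattern performs no further allocations, which is the kind of memory-operation saving the paper reports.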
Details
ISBN:
(Print) 9781450369381
Garbage collection (GC) is a standard feature for high-productivity programming, saving the programmer from many nasty memory-related bugs. However, these productivity benefits come at a cost in application throughput, worst-case latency, and energy consumption. Since the first introduction of GC by the Lisp programming language in the 1950s, a myriad of hardware and software techniques have been proposed to reduce this cost. While the idea of accelerating GC in hardware is appealing, its impact has been very limited due to narrow coverage, lack of flexibility, intrusive system changes, and significant hardware cost. Even with specialized hardware, GC performance is eventually limited by the memory bandwidth bottleneck. Fortunately, emerging 3D stacked DRAM technologies shed new light on this decades-old problem by enabling efficient near-memory processing with ample memory bandwidth. Thus, we propose Charon, the first 3D stacked memory-based GC accelerator. Through a detailed performance analysis of the HotSpot JVM, we derive a set of key algorithmic primitives based on their GC time coverage and implementation complexity in hardware. Then we devise a specialized processing unit to substantially improve their memory-level parallelism and throughput at a low hardware cost. Our evaluation of Charon with the full-production HotSpot JVM running two big data analytics frameworks, Spark and GraphChi, demonstrates a 3.29x geomean speedup and 60.7% energy savings for GC over the baseline 8-core out-of-order processor.
Details
ISBN:
(Print) 9781450340809
It is noticeably hard to predict the effect of optimization strategies in Java without implementing them. "Maximal sharing" (a.k.a. "hash-consing") is one of these strategies: it may bring great benefit in terms of time and space, or it may impose detrimental overhead. It all depends on the redundancy of the data and the use of equality. We used a combination of new techniques to predict the impact of maximal sharing on existing code: Object Redundancy Profiling (ORP) to model the effect on memory of sharing all immutable objects, and Equals-Call Profiling (ECP) to reason about how removing redundancy impacts runtime performance. With comparatively low effort, using the MAximal SHaring Oracle (MASHO), a prototype profiler based on ORP and ECP, we can uncover optimization opportunities that would otherwise remain hidden. This is an experience report on applying MASHO to a real and complex case: we conclude that ORP and ECP combined can accurately predict the gains and losses of maximal sharing, and also that (by isolating variables) a cheap predictive model can sometimes provide more accurate information than an expensive experiment can.
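Maximal sharing itself is straightforward to sketch in Java: intern every immutable value in a table, so structurally equal objects become the same object and equals() collapses to a reference comparison. This is a minimal illustration of the technique being profiled, not MASHO's profiling machinery.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal hash-consing ("maximal sharing") sketch for an immutable
// value type: intern() returns one canonical instance per value.
public class HashConsSketch {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof Point p && p.x == x && p.y == y;
        }
        @Override public int hashCode() { return 31 * x + y; }
    }

    static final Map<Point, Point> table = new HashMap<>();

    // Return the canonical instance for this value.
    static Point intern(Point p) {
        return table.computeIfAbsent(p, k -> k);
    }

    public static void main(String[] args) {
        Point a = intern(new Point(1, 2));
        Point b = intern(new Point(1, 2));
        System.out.println(a == b); // true: redundancy eliminated
    }
}
```

The trade-off MASHO quantifies is visible even here: interning pays a table lookup on every construction, and only wins when the data is redundant enough and equals() is called often enough.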
Details
ISBN:
(Print) 9780769537870
As a successful programming language, Java has extended its application from general-purpose systems to embedded, real-time systems. However, some of Java's excellent features, like automatic memory management and dynamic compilation, bring indeterminism to the execution time of real-time Java programs. In this paper, we propose several multi-core approaches to remove, or at least reduce, the unpredictability inside the Java virtual machine (JVM). Our goal is to retain high performance competitive with dynamic compilation and, at the same time, obtain better time predictability for the JVM. We also study pre-compilation techniques to utilize the second core more efficiently. Furthermore, we develop a Pre-optimization on Another Core (PoAC) scheme to replace the Adaptive Optimization System (AOS) in the Jikes JVM, which is very sensitive to execution time variation and greatly impacts time predictability. The experimental results indicate that our approaches are able to reach high performance while greatly reducing the execution time variation of Java applications.
No comments yet.