This paper discusses several techniques used in developing a parallel, production quality data mining application in Java. We started by developing three sequential versions of a product recommendation data mining app...
详细信息
This paper discusses several techniques used in developing a parallel, production quality data mining application in Java. We started by developing three sequential versions of a product recommendation data mining application: (i) a Fortran 90 version used as a performance reference, (ii) a plain Java implementation that only uses the primitive array structures from the language, and (iii) a baseline Java implementation that uses our Array package for Java. This Array package provides parallelism at the level of individual Array and BLAS operations. Using this Array package, we also developed two parallel Java versions of the data mining application: one that relies entirely on the implicit parallelism provided by the Array package, and another that is explicitly parallel at the application level. We discuss the design of the Array package, as well as the design of the data mining application. We compare the trade-offs between performance and the abstraction level the different Java versions present to the application programmer. Our studies show that, although a plain Java implementation performs poorly, the Java implementation with the Array package is quite competitive in performance with Fortran. We achieve a single processor performance of 109 Mflops, or 91% of Fortran performance, on a 332 MHz PowerPC 604e processor. Both the implicitly and explicitly parallel forms of our Java implementations also parallelize well. On an SMP with four of those PowerPC processors, the implicitly parallel form achieves 290 Mflops with no effort from the application programmer, while the explicitly parallel form achieves 340 Mflops.
Load-reuse analysis finds instructions that repeatedly access the same memory location. This location can be promoted to a register, eliminating redundant loads by reusing the results of prior memory accesses. This pa...
详细信息
ISBN:
(纸本)9781581130942
Load-reuse analysis finds instructions that repeatedly access the same memory location. This location can be promoted to a register, eliminating redundant loads by reusing the results of prior memory accesses. This paper develops a load-reuse analysis and designs a method for evaluating its *** designing the analysis, we aspire for completeness---the goal of exposing all reuse that can be harvested by a subsequent program transformation. For register promotion, a suitable transformation is partial redundancy elimination (PRE). To approach the ideal goal of PRE-completeness, the load-reuse analysis is phrased as a data-flow problem on a program representation that is path-sensitive, as it detects reuse even when it originates in a different instruction along each control flow path. Furthermore, the analysis is comprehensive, as it treats scalar, array and pointer-based loads *** evaluating the analysis, we compare it with an ideal analysis. By observing the run-time stream of memory references, we collect all PRE-exploitable reuse and treat it as the ideal analysis performance. To compare the (static) load-reuse analysis with the (dynamic) ideal reuse, we use an estimator algorithm that computes, given a data-flow solution and a program profile, the dynamic amount of reuse detected by the analysis. We developed a family of estimators that differ in how well they bound the profiling error inherent in the edge profile. By bounding the error, the estimators offer a precise and practical method for determining the run-time optimization *** experiments show that about 55% of loads executed in Spec95 exhibit reuse. Of those, our analysis exposes about 80%.
The design and implementation of a compiler that translates programs written in a type-safe subset of the C programminglanguage into highly optimized DEC Alpha assembly language programs and a certifier that automati...
详细信息
The design and implementation of a compiler that translates programs written in a type-safe subset of the C programminglanguage into highly optimized DEC Alpha assembly language programs and a certifier that automatically checks the type safety and memory safety of any assembly language program produced by the compiler is presented. The result of the certifier is either a formal proof of type safety or a counterexample pointing to a potential violation of the type system by the target program. The ensemble of the compiler and the certifier is called a certifying compiler.
Two techniques for improving garbage collection were presented, namely, generational stack collection and profile-driven pretenuring. The first is applicable to stack based implementations of functional languages whil...
详细信息
Two techniques for improving garbage collection were presented, namely, generational stack collection and profile-driven pretenuring. The first is applicable to stack based implementations of functional languages while the second is useful for any generational collector. Both techniques were implemented in a generational collector used by the TIL compiler and have observed decreases in garbage collection times of as much as 70% and 30%, respectively.
Many program analyses are naturally formulated and implemented using inclusion constraints. We present new results on the scalable implementation of such analyses based on two insights: first, that online elimination ...
详细信息
Many program analyses are naturally formulated and implemented using inclusion constraints. We present new results on the scalable implementation of such analyses based on two insights: first, that online elimination of cyclic constraints yields orders-of-magnitude improvements in analysis time for large problems;second, that the choice of constraint representation affects the quality and efficiency of online cycle elimination. We present an analytical model that explains our design choices and show that the model's predictions match well with results from a substantial experiment.
Partial redundancy elimination (PRE), the most important component of global optimizers, generalizes the removal of common subexpressions and loop-invariant computations. Achieving a complete PRE while incurring an ac...
详细信息
ISBN:
(纸本)9780897919876
Partial redundancy elimination (PRE), the most important component of global optimizers, generalizes the removal of common subexpressions and loop-invariant computations. Achieving a complete PRE while incurring an acceptable code growth is the main focus of the study. An algorithm for complete removal of partial redundancies, based on the integration of code motion and control flow restructuring is presented. In contrast to existing complete techniques, resort to restructuring merely to remove obstacles to code motion, rather than to carry out the actual optimization.
Most existing reference-based distributed object systems include some kind of acyclic garbage collection, but fail to provide acceptable collection of cyclic garbage. Those that do provide such GC currently suffer fro...
详细信息
Most existing reference-based distributed object systems include some kind of acyclic garbage collection, but fail to provide acceptable collection of cyclic garbage. Those that do provide such GC currently suffer from one or more problems: synchronous operation, the need for expensive global consensus or termination algorithms, susceptibility to communication problems, or an algorithm that does not scale. We present a simple, complete, fault-tolerant, asynchronous extension to the (acyclic) cleanup protocol of the SSP Chains system. This extension is scalable, consumes few resources, and could easily be adapted to work in other reference-based distributed object systems - rendering them usable for very large-scale applications.
language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thr...
详细信息
language-supported synchronization is a source of serious performance problems in many Java programs. Even single-threaded applications may spend up to half their time performing useless synchronization due to the thread-safe nature of the Java libraries. We solve this performance problem with a new algorithm that allows lock and unlock operations to be performed with only a few machine instructions in the most common cases. Our locks only require a partial word per object, and were implemented without increasing object size. We present measurements from our implementation in the JDK 1.1.2 for AIX, demonstrating speedups of up to a factor of 5 in micro-benchmarks and up to a factor of 1.7 in real programs.
A `Just-In-Time' (JIT) Java compiler produces native code from Java byte code instructions during program execution. As such, compilation speed is more important in a Java JIT compiler than in a traditional compil...
详细信息
ISBN:
(纸本)9780897919876
A `Just-In-Time' (JIT) Java compiler produces native code from Java byte code instructions during program execution. As such, compilation speed is more important in a Java JIT compiler than in a traditional compiler, requiring optimization algorithms to be lightweight and effective. We present the structure of a Java JIT compiler for the Intel Architecture, describe the lightweight implementation of JIT compiler optimizations (e.g., common subexpression elimination, register allocation, and elimination of array bounds checking), and evaluate the performance benefits and tradeoffs of the optimizations. This JIT compiler has been shipped with version 2.5 of Intel's VTune for Java product.
Much research has been devoted to studies of and algorithms for memory management based on garbage collection or explicit allocation and deallocation. An alternative approach, region-based memory management, has been ...
详细信息
Much research has been devoted to studies of and algorithms for memory management based on garbage collection or explicit allocation and deallocation. An alternative approach, region-based memory management, has been known for decades, but has not been well-studied. In a region-based system each allocation specifies a region, and memory is reclaimed by destroying a region, freeing all the storage allocated therein. We show that on a suite of allocation-intensive C programs, regions are competitive with malloc/free and sometimes substantially faster. We also show that regions support safe memory management with low overhead. Experience with our benchmarks suggests that modifying many existing programs to use regions is not difficult.
暂无评论