This paper presents the results of the Cedar Hand-parallelization Experiment, conducted from 1989 through 1992 within the Center for Supercomputing Research and Development (CSRD) at the University of Illinois. In this experiment, we manually transformed the Perfect Benchmarks(R) into parallel program versions, using techniques that could be automated in an optimizing compiler. We then ran these programs on the Cedar multiprocessor (built at CSRD during the 1980s) and measured the speed improvement due to each technique. The results presented here extend the findings previously reported in [11]. The techniques credited most for the performance gains are array privatization, parallelization of reduction operations, and the substitution of generalized induction variables. All of these techniques can be considered extensions of transformations that were available in vectorizers and commercial restructuring compilers of the late 1980s. We applied these transformations by hand to the given programs in a mechanical manner, similar to that of a parallelizing compiler. Because of our success with these transformations, we believed it would be possible to implement many of them in a new parallelizing compiler. Such a compiler has since been completed, and we show preliminary results.
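To make the three headline transformations concrete, the sketch below shows each on a toy loop, written in C with OpenMP pragmas rather than the Fortran of the Perfect Benchmarks; the loop bodies and array names are our own invention for illustration, not code from the benchmarks.

    #define N 1000
    double a[N], b[N];

    double example(void) {
        double sum = 0.0;
        double tmp[4];

        /* Reduction parallelization: the += accumulation would normally
           serialize the loop; the reduction clause gives each thread a
           private partial sum that is combined at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        /* Array privatization: tmp is written before it is read in every
           iteration, so each thread can safely keep its own copy. */
        #pragma omp parallel for private(tmp)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < 4; j++) tmp[j] = a[i] * j;
            b[i] = tmp[0] + tmp[3];
        }

        /* Generalized induction variable substitution: the original loop
           advanced an index by a loop-variant amount (idx += i), a
           cross-iteration dependence; the closed form i*(i+1)/2 removes it. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            int k = (i * (i + 1) / 2) % N;  /* closed form of idx, wrapped for bounds */
            b[i] += a[k];
        }
        return sum;
    }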
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations, and it optimizes across procedures using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were originally hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, use known, implemented dependence and interprocedural analyses to find parallelism, and then apply our new optimization algorithm to the resulting program. Comparing the original parallel programs to the hand-optimized programs on a shared-memory, bus-based parallel machine with local caches, our algorithm improves three programs in our test suite, matches four, and degrades one. This experiment suggests that existing dependence and interprocedural array analyses can automatically detect user parallelism. It also demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that both parallel algorithms and compiler optimizations are needed to use parallel machines effectively.
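As a hedged illustration of the kind of trade-off such an algorithm weighs (the kernel below is ours, not one of the paper's test programs): parallelizing the outer loop of a dense-matrix nest yields coarse granularity, one fork/join for the whole nest, and good locality, since each thread sweeps contiguous rows; parallelizing the inner loop instead would fork once per row and split rows across threads' caches.

    #define N 512
    double a[N][N], b[N][N];

    void smooth(void) {
        /* Outer-loop parallelism: coarse granularity plus per-thread
           row-wise traversal, so cache lines are reused within a thread
           rather than bounced between threads. */
        #pragma omp parallel for
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j]
                                + a[i][j-1] + a[i][j+1]);
    }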
One of the most crucial qualities of an optimizing compiler is its ability to detect when different data references access the same storage location. Such references are said to be data-dependent, and they constrain the program modifications the compiler can apply to improve the program's performance. For parallelizing compilers, the most important program constructs to investigate are loops and the array references they contain. In previous work, we found a serious limitation of current data dependence tests: they cannot handle loop bounds or array subscripts that are symbolic, nonlinear expressions. In this paper, we describe a dependence test, called the Range Test, that can handle such expressions. Briefly, the Range Test proves independence by determining whether certain symbolic inequalities hold for a permutation of the loop nest. Powerful symbolic analyses and constraint propagation techniques were developed to prove such inequalities. The Range Test has been implemented in Polaris, a parallelizing compiler developed at the University of Illinois. We present measurements of the Range Test's performance and compare it with state-of-the-art tests.
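The flavor of the inequalities involved can be seen on a loop with a symbolic, nonlinear subscript; the example below is ours, not drawn from the paper, and the comments spell out the independence argument a Range-Test-style analysis would make.

    /* Linear tests give up on the subscript i*m + j because the stride m
       is symbolic. Range reasoning compares per-iteration access ranges
       instead: outer iteration i touches a[i*m .. i*m + m-1], and since
       max(i*m + m - 1) < min((i+1)*m) whenever m >= 1, the ranges of
       distinct iterations never overlap, so the outer loop is parallel. */
    void scale_blocks(double *a, const double *b, int n, int m) {
        for (int i = 0; i < n; i++)          /* provably independent iterations */
            for (int j = 0; j < m; j++)
                a[i * m + j] = 2.0 * b[i * m + j];
    }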
We systematically derive a systolic algorithm for the algebraic path problem (APP). We start with an algorithm with (parallel) running time 3n, but whose data dependencies are not local. The first step in obtaining a systolic algorithm is to localize these dependencies. The standard localization (used by all authors until now) slows the algorithm down to the point that it takes 5n - 4 time steps. We show that much of this slow-down can be avoided, and give a new localization scheme that reduces the time to 4n - 2 (or 4n - 3 if n is odd). We also introduce a new approach to the problem of scheduling such parallel algorithms: we treat the equations defining the algorithm as an executable specification, but execute it on a nonstandard value domain. Such an abstract interpretation of the program then computes the optimal schedule of the original algorithm; a closed form can be inferred by observing a trace of this execution. We use this approach to obtain optimal schedules for all the algorithms in the paper, local as well as non-local.
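The scheduling idea, executing the defining equations over a nonstandard value domain, can be sketched in a few lines: reinterpret every operation as completing one step after its latest operand is ready, and run the recurrence on timestamps. The program below applies this reading to a generic Floyd-Warshall-style APP recurrence; it is our illustration of the technique, not the paper's systolic algorithm.

    #include <stdio.h>

    #define N 8
    static int t[N][N];   /* timestamp domain: step at which a value is ready;
                             static storage starts at 0, i.e. all inputs
                             available at step 0 */

    static int max3(int x, int y, int z) {
        int m = x > y ? x : y;
        return m > z ? m : z;
    }

    int main(void) {
        /* Abstract interpretation of a[i][j] = a[i][j] (+) a[i][k] (*) a[k][j]
           over timestamps: each update is ready one step after its latest
           operand. The final t[i][j] values form a valid schedule for the
           original computation; their maximum is the schedule length. */
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    t[i][j] = 1 + max3(t[i][j], t[i][k], t[k][j]);

        int makespan = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (t[i][j] > makespan) makespan = t[i][j];
        printf("schedule length: %d steps\n", makespan);
        return 0;
    }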
ISBN (print): 9783540554370
Programming shared memory multiprocessors seems to be easier than developing programs for distributed memory systems. The reason is the existence of a global name space for parallel threads, providing uniform access to all global data. This programming model seems inadequate for systems with a larger number of processors, since such systems integrate memory hierarchies to eliminate the bottleneck of global memory access. Programming these systems therefore has to take the distribution of data into account, so that parallel processes exploit locality of reference. This paper describes an ongoing project in which we investigate the applicability of the distributed memory parallelization strategy to shared memory systems with memory hierarchies.
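A minimal sketch of this discipline on a shared-memory machine, assuming OpenMP as the threading layer (the kernel is ours, for illustration): distribute the arrays block-wise over threads and let each thread update only its own block, so references stay local even though the address space is shared.

    #define N 1024
    double x[N], y[N];

    void axpy(double alpha) {
        /* schedule(static) gives each thread one contiguous chunk of
           iterations, the shared-memory analogue of a block data
           distribution: thread p reads and writes only "its" blocks of
           x and y, which stay in its local level of the memory hierarchy. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            y[i] = alpha * x[i] + y[i];
    }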