A run-time technique based on the inspector-executor scheme is proposed in this paper to find available parallelism in loops. Our inspector determines the wavefronts by building a DEF-USE table for each loop of a program. Additionally, the process the inspector uses to find the wavefronts can be fully parallelized without any synchronization. Our executor executes loop iterations concurrently. For each wavefront, the auto-adapted function is used to obtain a tailored thread count instead of using a fixed number of threads for execution. Experimental results show that our new parallel inspector can handle complex data dependency patterns and significantly reduce the execution time.
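As a rough illustration of the wavefront idea in the abstract above, the following sequential sketch assigns each loop iteration to the earliest wavefront that respects flow, anti, and output dependences. It is not the paper's parallel inspector or its DEF-USE table layout: the function names `compute_wavefronts` and `group_by_wavefront`, and the input format (one written element and a list of read elements per iteration), are assumptions made for this example.

```python
from collections import defaultdict


def compute_wavefronts(writes, reads):
    """Return the wavefront number of each iteration (hypothetical helper).

    writes[i] is the array element written by iteration i and
    reads[i] is the list of elements read by iteration i.
    """
    last_write = {}  # element -> wavefront of the latest iteration writing it
    last_read = {}   # element -> highest wavefront of any iteration reading it
    wavefront = []
    for w, rs in zip(writes, reads):
        level = 0
        for r in rs:
            # Flow dependence: a read must follow the earlier write.
            level = max(level, last_write.get(r, -1) + 1)
        # Output dependence: a write must follow the earlier write.
        level = max(level, last_write.get(w, -1) + 1)
        # Anti dependence: a write must follow earlier reads.
        level = max(level, last_read.get(w, -1) + 1)
        wavefront.append(level)
        for r in rs:
            last_read[r] = max(last_read.get(r, -1), level)
        last_write[w] = level
    return wavefront


def group_by_wavefront(wavefront):
    """Group iteration indices by wavefront; each group can run concurrently."""
    groups = defaultdict(list)
    for iteration, level in enumerate(wavefront):
        groups[level].append(iteration)
    return dict(groups)


# Loop body assumed to be: a[writes[i]] = a[reads[i][0]] + 1
levels = compute_wavefronts([0, 1, 2, 3], [[5], [0], [1], [6]])
schedule = group_by_wavefront(levels)
```

In this example, iterations 0 and 3 are independent and share the first wavefront, while iterations 1 and 2 form a dependence chain and are placed in later wavefronts. An executor would run the iterations of one wavefront concurrently before moving on to the next.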
ISBN: 9780897919845 (print)
With current compilers for High Performance Fortran (HPF), substantial restructuring and hand-optimization may be required to obtain acceptable performance from an HPF port of an existing Fortran application. A key goal of the Rice dHPF compiler project is to develop optimization techniques that can provide consistently high performance for a broad spectrum of scientific applications with minimal restructuring of existing Fortran 77 or Fortran 90 applications. This paper presents four new optimization techniques we developed to support efficient parallelization of codes with minimal restructuring. These optimizations include computation partition selection for loop nests that use privatizable arrays, along with partial replication of boundary computations to reduce communication overhead; communication-sensitive loop distribution to eliminate inner-loop communications; interprocedural selection of computation partitions; and data availability analysis to eliminate redundant communications. We studied the effectiveness of the dHPF compiler, which incorporates these optimizations, in parallelizing serial versions of the NAS SP and BT application benchmarks. We present experimental results comparing the performance of hand-written MPI code for the benchmarks against code generated from HPF using the dHPF compiler and the Portland Group's pghpf compiler. Using the compilation techniques described in this paper we achieve performance within 15% of hand-written MPI code on 25 processors for BT and within 33% for SP. Furthermore, these results are obtained with HPF versions of the benchmarks that were created with minimal restructuring of the serial code (modifying only approximately 5% of the code).
In this paper we propose a knowledge-based approach for solving data dependence testing and loop scheduling problems. A rule-based system, called the K-Test, is developed, using a repertory grid and an attribute ordering table to construct the knowledge base. The K-Test uses knowledge-based techniques to choose an appropriate testing algorithm according to features of the input program, and then applies the chosen test to detect data dependences for loop parallelization. Another rule-based system, called the KPLS, is also proposed; it chooses an appropriate scheduling strategy by inferring features of the loops and assigns parallel loops to multiprocessors to achieve high speedup. The experimental results show that our compiler obtains a clear speedup.
Currently, dataflow architectures are programmed using applicative languages to ease the task of deriving the dataflow graph during compilation. We summarise our experience gained in prototyping a FORTRAN nested loop kernel compiler for a pipeline-ring dataflow architecture. We present the status of the current implementation and future directions which the development of the compiler will take. Current evidence suggests that it is possible to efficiently compile FORTRAN nested loop kernels directly onto dataflow architectures without the need for additional run-time support mechanisms. We present a scheme for deriving the dataflow graph from the analysis of "carried" array variable subscript expressions, and a scheme to map the actors in the dataflow graph onto a pipeline-ring of Field Programmable Gate Array (FPGA) devices.
Data distribution has been one of the most important research topics in parallelizing compilers for distributed memory parallel computers. A good data distribution scheme should consider both the computation load balance and the communication overhead. In this paper, we show that data redistribution is necessary for executing a sequence of Do-loops if the communication cost due to performing this sequence of Do-loops is larger than a threshold value. Based on this observation, we can prune the search space and derive efficient dynamic programming algorithms for determining effective data distribution schemes to execute a sequence of Do-loops with a general structure. Experimental studies on a 32-node nCUBE-2 computer are also presented.
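The trade-off described in the abstract above, between keeping one distribution and paying to redistribute between loops, can be illustrated with a generic dynamic program. This is only a sketch under stated assumptions, not the paper's algorithm: it omits the threshold-based pruning, handles only a linear sequence of loops, and the names `plan_distributions`, `exec_cost`, and `redist_cost` as well as the cost values are hypothetical.

```python
def plan_distributions(exec_cost, redist_cost):
    """Choose one distribution per loop to minimise total cost (hypothetical).

    exec_cost[i][d]   : cost of running loop i under distribution d
    redist_cost[p][d] : cost of switching from distribution p to d
    Returns (total_cost, plan) where plan[i] is the distribution of loop i.
    """
    num_dists = len(exec_cost[0])
    best = list(exec_cost[0])  # best[d]: cheapest cost ending in distribution d
    back = []                  # back[i-1][d]: predecessor chosen for loop i
    for costs in exec_cost[1:]:
        new_best, choices = [], []
        for d in range(num_dists):
            # Cheapest predecessor distribution, including the switch cost.
            prev = min(range(num_dists),
                       key=lambda p: best[p] + redist_cost[p][d])
            new_best.append(best[prev] + redist_cost[prev][d] + costs[d])
            choices.append(prev)
        best = new_best
        back.append(choices)
    # Recover the plan by walking the predecessor table backwards.
    d = min(range(num_dists), key=lambda x: best[x])
    total = best[d]
    plan = [d]
    for choices in reversed(back):
        d = choices[d]
        plan.append(d)
    plan.reverse()
    return total, plan


# Three loops, two candidate distributions (made-up costs).
exec_cost = [[1, 10], [10, 1], [10, 1]]
cheap_switch = plan_distributions(exec_cost, [[0, 3], [3, 0]])
costly_switch = plan_distributions(exec_cost, [[0, 100], [100, 0]])
```

With a cheap switch the plan redistributes once after the first loop; with an expensive switch it keeps a single distribution throughout, which mirrors the observation that redistribution pays off only when the communication cost it avoids is large enough.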
ISBN: 0818679581 (print)
This paper presents a technique to automatically map a complete digital signal processing (DSP) application onto a parallel machine with distributed memory. Unlike other applications, where coarse- or medium-grain scheduling techniques can be used, DSP applications integrate several thousand tasks and hence require fine-grain considerations. Moreover, finding an effective mapping requires taking into account both architectural resource constraints and real-time constraints. The main contribution of this paper is to show how it is possible to handle and solve data partitioning and fine-grain scheduling under the above operational constraints using Concurrent Constraint Logic Programming (CCLP) languages. Our concurrent resolution technique, which handles linear and non-linear constraints, takes advantage of the special features of signal processing applications and provides a solution equivalent to a manual solution for the representative Panoramic Analysis (PA) application.
Due to the complexity of programming scalable multiprocessors with physically distributed memories, it is onerous to manually generate parallel code for these machines. As a consequence, there has been much research o...
It is well known that extracting parallel loops plays a significant role in designing parallelizing compilers. The execution efficiency of a loop is enhanced when the loop can be executed fully or partially in parallel, as a DOALL or DOACROSS loop. This paper reports on the practical parallelism detector (PPD), implemented in PFPC (a portable FORTRAN parallelizing compiler running on OSF/1) at NCTU, which concentrates on finding the parallelism available in loops. The PPD can extract the potential DOALL and DOACROSS loops in a program by invoking a combination of the ZIV test and the I test to verify array subscripts. Furthermore, if DOACROSS loops are available, the synchronization statements are optimized. Experimental results show that PPD is more reliable and accurate than previous approaches.
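For readers unfamiliar with the ZIV (zero index variable) test named in the abstract above, here is a minimal sketch of the standard idea: when neither subscript depends on the loop index, the two references touch the same element exactly when the constants are equal. The representation of a subscript as a `(coefficient, constant)` pair for `coefficient * i + constant` and the function name `ziv_test` are assumptions for this example; the I test and the PPD's actual implementation are not shown.

```python
def ziv_test(sub1, sub2):
    """Apply the ZIV test to two affine subscripts (hypothetical helper).

    Each subscript is a (coefficient, constant) pair meaning
    coefficient * i + constant for loop index i.
    Returns True if a dependence exists, False if the references are
    independent, and None if the ZIV test does not apply.
    """
    coeff1, const1 = sub1
    coeff2, const2 = sub2
    if coeff1 != 0 or coeff2 != 0:
        # A subscript uses the loop index, so another test is needed.
        return None
    # Both subscripts are loop-invariant: compare the constants.
    return const1 == const2


independent = ziv_test((0, 3), (0, 5))     # a[3] vs a[5]
dependent = ziv_test((0, 4), (0, 4))       # a[4] vs a[4]
not_applicable = ziv_test((1, 0), (0, 4))  # a[i] vs a[4]
```

When the test does not apply, a dependence analyser would fall back to a more general test, such as the I test mentioned in the abstract.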
In this paper an optimum algorithm to translate control flow graphs to dataflow graphs is proposed for dataflow execution of sequential programs. Some of the existing analysis methods restrict the specification of a program to be processed, while others require a very high analysis cost. The algorithm proposed in this paper (the CD translation algorithm) can, (1) at a very low cost and (2) for any control structure that can be described by a control flow graph, (3) generate dataflow programs that give an optimum dataflow execution. Furthermore, the proposed analysis algorithm is designed to handle task-level control flow graphs as well as the instruction-level control flow graphs accepted by the existing methods, so that optimum control is possible for task-level dataflow execution.
One of the most intellectually demanding steps in compiling for distributed memory parallel machines is to determine a suitable data partitioning scheme for a particular program. Most of the parallelizing compilers for these machines provide little or no support to the user in this difficult task. We have developed DPART, an automatic data partitioning system for Fortran 77 procedures. This paper describes the partitioning strategies of alignment, distribution, and processor layout in DPART. Finally, we present experimental results for the TRED2, DGEFA, and JACOBI procedures to demonstrate the effectiveness of this system.