Most scientific applications rely on parallel multiprocessor computing to enhance performance. However, the irregular loops within these applications obstruct parallelism analysis at compile time. Rauchwerger et al. presented a run-time method to extract the hidden parallelism in a program using dependence chains. The associated overhead degrades this approach's performance because of its large storage requirement and the large number of array references it must process. In this study, a new predecessor/successor approach is developed in which high-level predecessor/successor information is recorded and processed efficiently. A predecessor/successor table is constructed first in the inspector phase so that, during wavefront scheduling, only the successor iterations of the current wavefront need to be examined instead of all loop iterations. The performance of the dependence chain approach usually degrades dramatically for a hot-spot access pattern, but our scheme works very efficiently in this case. Experimental results using synthetic code and real programs are presented to demonstrate the superiority of the proposed approach. (c) 2005 Elsevier Inc. All rights reserved.
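The inspector and wavefront-scheduling idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_successor_table` and `schedule_wavefronts`, the per-iteration `reads`/`writes` lists, and the choice to track flow, anti, and output dependences are all assumptions made for illustration. The inspector records, for every iteration, which later iterations depend on it; the scheduler then decrements predecessor counts only for the successors of the current wavefront rather than rescanning the whole loop.

```python
from collections import defaultdict

def build_successor_table(reads, writes):
    """Inspector (hypothetical sketch): reads[i] and writes[i] list the
    array elements that iteration i reads and writes."""
    n = len(writes)
    succ = [set() for _ in range(n)]
    last_write = {}               # element -> last iteration that wrote it
    readers = defaultdict(list)   # element -> iterations reading it since then
    for i in range(n):
        for x in reads[i]:
            if x in last_write and last_write[x] != i:
                succ[last_write[x]].add(i)        # flow dependence
            readers[x].append(i)
        for x in writes[i]:
            if x in last_write and last_write[x] != i:
                succ[last_write[x]].add(i)        # output dependence
            for r in readers[x]:
                if r != i:
                    succ[r].add(i)                # anti dependence
            readers[x] = []
            last_write[x] = i
    return succ

def schedule_wavefronts(succ):
    """Group iterations into wavefronts, touching only the successors
    of the iterations in the current wavefront."""
    n = len(succ)
    pred_count = [0] * n
    for s in succ:
        for j in s:
            pred_count[j] += 1
    current = [i for i in range(n) if pred_count[i] == 0]
    fronts = []
    while current:
        fronts.append(current)
        ready = []
        for i in current:
            for j in succ[i]:
                pred_count[j] -= 1
                if pred_count[j] == 0:
                    ready.append(j)
        current = sorted(ready)
    return fronts

# Illustrative loop body a[w[i]] = f(a[r[i]]) with made-up subscripts.
reads = [[5], [0], [0], [1]]
writes = [[0], [1], [2], [3]]
fronts = schedule_wavefronts(build_successor_table(reads, writes))
```

In this example, iterations 1 and 2 both read the element written by iteration 0, so they land in the same wavefront and could run in parallel, while iteration 3 waits for iteration 1.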
The mathematical model for the parallelization, or "space-time mapping," of loop nests is the polyhedron model. The presence of while loops in the nest complicates matters for two reasons: (1) the parallelized loop nest does not correspond to a polyhedron but instead to a subset that resembles a (multidimensional) comb, and (2) it is not clear when the entire loop nest has terminated. We describe a communication scheme that can deal with both problems and that a compiler can add to the parallel target loop nest.
There are many methods for nested loop partitioning. However, most of them perform poorly when partitioning loops with non-uniform dependences. This paper proposes a generalized and optimized loop partitioning mechanism to exploit parallelism from nested loops with non-uniform dependences. Our approach, based on dependence convex theory, divides the loop into variable-size partitions. Furthermore, the proposed algorithm partitions a nested loop using copy-renaming and optimized partitioning techniques to minimize the number of parallel regions of the iteration space. Consequently, it outperforms previous partitioning mechanisms for nested loops with non-uniform dependences. Many optimization techniques are used to reduce the complexity of the algorithm. Compared with other popular techniques, our scheme shows a dramatic improvement in the preliminary performance results. (C) 2001 Elsevier Science Inc. All rights reserved.
One central problem in the execution of parallel nested loops with non-affine bounds is the precise scanning (i.e., enumeration) of the points in their iteration space and the detection of their termination. Scanning schemes have been proposed for both shared-memory and distributed-memory implementations. However, these schemes work only for perfectly nested while loops. We propose a scheme that also works for imperfectly nested while loops on shared memory. This scheme has been incorporated in our loop parallelizer LooPo. (C) 1999 Elsevier Science B.V. All rights reserved.
Mobile vehicular cloud has become popular with the rapid development of cloud computing and mobile computing. Nested loops are usually the most critical part of multimedia and high-performance Digital Signal Processing (DSP) systems, which are widely used in vehicular applications and systems. To further explore the parallelism in nested loops, we study how to maximize system performance while taking energy reduction into account for applications on Chip Multiprocessor (CMP) architectures. We propose an algorithm, Energy-Aware Loop Parallelism Maximization (EALPM), to maximize system performance with consideration of energy reduction for applications with multidimensional nested loops. Our experiments show that, on average, the EALPM algorithm significantly improves both performance and energy consumption in comparison to other algorithms.
Two-dimensional arrays with linear subscripts occur quite frequently in real programs. In general, for multi-dimensional linear arrays under constant bounds, the Lambda test is an efficient data dependence method to check whether there exist real solutions. In this paper, we propose a multi-dimensional version of the I test, the multi-dimensional I test, that can be applied to testing whether there are integer solutions for multi-dimensional linear arrays under constant limits. Experiments with benchmarks showing the effects of the multi-dimensional I test on testing precision and testing efficiency are also presented. (C) 2001 Elsevier Science B.V. All rights reserved.
Our work investigates how to map loops efficiently onto Coarse-Grained Reconfigurable Architecture (CGRA). This paper examines the properties of CGRA and builds MapReduce-inspired models for the loop parallelization *** proposed model has a more detailed performance metric and a more flexible unrolling scheme that can unroll different loop levels with different factors. A Geometric Programming based approach is proposed to resolve the optimization problem of loop parallelization *** proposed approach can find the optimal unrolling factor for each level loop, resulting in better parallelization of *** results show that the proposed approach achieved up to 44% performance gain compared to the state-of-the-art loop mapping scheme.
In the process of automatically parallelizing/vectorizing constant-bound loops with multi-dimensional arrays under a specific dependence direction, the Lambda test is claimed to be an efficient and precise data dependence analysis method that can check whether there exist generally inexact 'real-valued' solutions to the derived dependence equations. In this paper, we propose a precise data dependence analysis method - the multi-dimensional direction vector I test. The multi-dimensional direction vector I test can be applied to testing whether there exist generally accurate 'integer-valued' solutions to the dependence equations derived from multi-dimensional arrays under a specific dependence direction in constant-bound loops. Experiments with benchmarks showed that the accuracy rate and the improvement rate for the proposed method are approximately 33.3% and 21.6%, respectively. (C) 2001 Elsevier Science Inc. All rights reserved.
Loop tiling is an efficient loop transformation, mainly applied to detect coarse-grained parallelism in loops. Applying n-dimensional non-rectangular tiles to generate parallel loops is a difficult task. This paper offers an efficient scheme for applying non-rectangular n-dimensional tiles in non-rectangular iteration spaces to generate parallel loops. To exploit wavefront parallelism efficiently, all tiles with an equal sum of coordinates are assumed to reside on the same wavefront. In addition, to assign the parallelepiped tiles on each wavefront to different processors, an improved block scheduling strategy is offered in this paper.
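The rule that tiles with an equal coordinate sum share a wavefront can be sketched in a few lines of Python. This is a simplified illustration only: it assumes a rectangular grid of tile coordinates (the paper addresses non-rectangular tiles and iteration spaces), and `block_schedule` is plain contiguous block scheduling, not the improved strategy the abstract mentions. The function names and the `tiles_per_dim` and `num_procs` parameters are hypothetical.

```python
from itertools import product

def tile_wavefronts(tiles_per_dim):
    """Group tile coordinates by the sum of their coordinates.

    Assumes a rectangular grid with tiles_per_dim[k] tiles along dimension k.
    """
    fronts = {}
    for tile in product(*(range(n) for n in tiles_per_dim)):
        fronts.setdefault(sum(tile), []).append(tile)
    return [fronts[s] for s in sorted(fronts)]

def block_schedule(front, num_procs):
    """Split one wavefront into contiguous blocks, one per processor."""
    size = -(-len(front) // num_procs)   # ceiling division
    return [front[p * size:(p + 1) * size] for p in range(num_procs)]

fronts = tile_wavefronts((2, 2))
assignment = block_schedule(fronts[1], 2)
```

For the 2x2 grid, the middle wavefront holds tiles (0, 1) and (1, 0), and each of the two hypothetical processors receives one of them. Wavefronts run one after another; tiles inside one wavefront are assumed independent.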
One-dimensional arrays with subscripts formed by induction variables appear quite frequently in real programs. For most well-known data dependence testing methods, checking whether integer-valued solutions exist for one-dimensional arrays with references created by induction variables is very difficult. The I test, which is a refined combination of the GCD and Banerjee tests, is an efficient and precise data dependence testing technique for determining whether integer-valued solutions exist for one-dimensional arrays with constant bounds and single increments. In this paper, the non-continuous I test, an extension of the I test, is proposed to determine whether there are integer-valued solutions for one-dimensional arrays with constant bounds and non-singular increments. Experiments with benchmarks drawn from Livermore and Vector Loop reveal that there are definitive results for 67 pairs of one-dimensional arrays that were tested.
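The two classical ingredients named in this abstract, the GCD test and the Banerjee bounds test, can be sketched in Python for a single linear dependence equation a1*x1 + ... + an*xn = c with constant bounds on each variable. This is not the I test or its non-continuous extension, which refine the combination of the two; the function names and the `(lower, upper)` bounds encoding are assumptions for illustration.

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs, c):
    """An integer solution can exist only if gcd(coeffs) divides c."""
    g = reduce(gcd, (abs(a) for a in coeffs), 0)
    return c == 0 if g == 0 else c % g == 0

def banerjee_test(coeffs, bounds, c):
    """A real solution within the bounds exists only if c lies between
    the minimum and maximum of the left-hand side over the bounds."""
    lo = sum(a * (l if a >= 0 else u) for a, (l, u) in zip(coeffs, bounds))
    hi = sum(a * (u if a >= 0 else l) for a, (l, u) in zip(coeffs, bounds))
    return lo <= c <= hi

def may_depend(coeffs, bounds, c):
    """False proves independence; True means a dependence is not ruled out."""
    return gcd_test(coeffs, c) and banerjee_test(coeffs, bounds, c)

# A[2*i] versus A[2*j + 1]: 2*i - 2*j = 1, with 1 <= i, j <= 10.
even_odd = may_depend([2, -2], [(1, 10), (1, 10)], 1)
# A[i] versus A[j + 10]: i - j = 10, with 1 <= i, j <= 5.
far_apart = may_depend([1, -1], [(1, 5), (1, 5)], 10)
# A[i] versus A[j + 3]: i - j = 3, with 1 <= i, j <= 5.
close = may_depend([1, -1], [(1, 5), (1, 5)], 3)
```

The first pair is ruled out by the GCD test (2 does not divide 1), the second by the bounds test (i - j can be at most 4), and the third passes both, so a dependence cannot be excluded.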