Authors: Lee, P.; Kedem, Z. M.
Acad Sinica, Inst Informat Sci, Taipei 11529, Taiwan; Acad Sinica, Ctr Appl Sci & Engn Res, Taipei 11529, Taiwan; NYU, Courant Inst Math Sci, Dept Comp Sci, New York, NY 10003, USA
To exploit parallelism on shared memory parallel computers (SMPCs), it is natural to focus on decomposing the computation (mainly by distributing the iterations of nested Do-loops). In contrast, on distributed memory parallel computers (DMPCs), the decomposition of computation and the distribution of data must both be handled in order to balance the computation load and to minimize the migration of data. We propose and validate experimentally a method for handling computations and data synergistically to minimize the overall execution time on DMPCs. The method is based on a number of novel techniques, also presented in this article. The core idea is to rank the "importance" of data arrays in a program and designate some of them as dominant. The intuition is that the dominant arrays are the ones whose migration would be the most expensive. The correspondence between iteration-space mapping vectors and the distributed dimensions of the dominant data array in each nested Do-loop allows us to design algorithms for determining data and computation decompositions at the same time. Given the data distribution, the computation decomposition for each nested Do-loop is determined by either the "owner computes" rule or the "owner stores" rule with respect to the dominant data array. If all temporal dependence relations across iteration partitions are regular, we use tiling to allow pipelining and the overlapping of computation and communication. However, in order to use tiling on DMPCs, we needed to extend the existing techniques for determining tiling vectors and tile sizes, as they were originally suited for SMPCs only. The overall method is illustrated on programs for the 2D heat equation, for Gaussian elimination with pivoting, and for the 2D fast Fourier transform, on a linear processor array and on a 2D processor grid.
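As a rough illustration of the "owner computes" rule mentioned in the abstract, the following sketch (hypothetical function names; a 1-D block distribution is assumed, not the paper's full mapping-vector machinery) assigns each iteration that writes an element of the dominant array to the processor that owns that element:

```python
# Hedged sketch, not the paper's algorithm: "owner computes" under a
# 1-D block distribution of a dominant array A over p processors.
def block_owner(i, n, p):
    """Processor that owns row i of an n-row array block-distributed over p processors."""
    block = (n + p - 1) // p  # ceiling division: rows per processor
    return i // block

def owner_computes(n, p):
    """Assign each iteration i (which writes A[i]) to the processor owning A[i]."""
    schedule = {q: [] for q in range(p)}
    for i in range(n):
        schedule[block_owner(i, n, p)].append(i)
    return schedule

print(owner_computes(8, 4))
# → {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```

Under this rule no write ever targets remote memory; only reads of remote data generate communication, which is why choosing the distribution of the dominant array first is attractive.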
ISBN:
(Print) 9780769538594
An automatic parallelization method for tightly nested loops running on multi-core systems has been proposed. First, according to the physical characteristics of multi-core processors, a way is presented to solve the data-locality problem during data decomposition. Second, to increase the parallel granularity of tightly nested loops, the method studies computation decomposition based on workload: it shows how to compute the workload of the loop iterations that can be run in parallel and then, according to the size of that workload, determines the granularity of the parallel loops, reducing the parallel overhead incurred by parallelizing iterations with small workloads. Using this method, an automatic parallelization model based on workload can be constructed.
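The workload-based granularity idea can be sketched as follows (a hedged illustration with hypothetical names, not the authors' implementation): consecutive loop iterations are grouped into chunks until each chunk's accumulated workload reaches a threshold, so iterations with small workloads are not dispatched to threads individually:

```python
def chunk_by_workload(workloads, min_chunk_work):
    """Group consecutive iterations into chunks whose accumulated workload
    reaches min_chunk_work, so tiny iterations are not parallelized alone."""
    chunks, current, acc = [], [], 0
    for i, w in enumerate(workloads):
        current.append(i)
        acc += w
        if acc >= min_chunk_work:   # chunk is "heavy" enough to amortize overhead
            chunks.append(current)
            current, acc = [], 0
    if current:                     # leftover light iterations form a final chunk
        chunks.append(current)
    return chunks

print(chunk_by_workload([1, 1, 1, 4, 1, 1], 3))
# → [[0, 1, 2], [3], [4, 5]]
```

Each resulting chunk would then become one unit of parallel work, trading scheduling overhead against load balance.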
ISBN:
(Print) 9780769548791
Automatic decomposition is a compile-time technique that maps computation and data onto different processors, and arrays are among its main targets. Arrays whose migration is the most expensive are termed dominant arrays. Since every computing node has its own memory on distributed memory parallel computers (DMPCs), the decomposition of dominant arrays has a direct impact on the performance of a parallel program. To avoid remote data accesses, each definition and use of an array needs to be distributed consistently, which imposes many partition constraints and restricts the decomposition choices for dominant arrays. We propose an automatic computation and data decomposition algorithm that prioritizes dominant arrays. Our algorithm ranks arrays according to their potential communication costs and then finds data decompositions for the arrays in decreasing order of rank. We serialize low-rank arrays to enhance the decomposition priority of high-rank ones. Finally, serialized arrays are re-partitioned when doing so preserves the parallel benefits of the earlier decomposition. The experimental results show that this algorithm can improve the performance of parallel programs.
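A minimal sketch of the ranking-and-prioritizing step (hypothetical names and a toy conflict model; the paper's actual constraint handling is more elaborate): arrays are visited in decreasing order of estimated communication cost, and an array whose distribution would conflict with an already-distributed higher-rank array is serialized instead:

```python
def prioritize_decomposition(comm_cost, conflicts):
    """Greedy sketch: distribute arrays in decreasing cost order; an array
    whose partition constraints conflict with an already-distributed
    higher-rank array is serialized (kept undistributed) instead."""
    order = sorted(comm_cost, key=comm_cost.get, reverse=True)
    distributed, serialized = [], []
    for a in order:
        if any(b in distributed for b in conflicts.get(a, [])):
            serialized.append(a)   # yield priority to the higher-rank array
        else:
            distributed.append(a)
    return distributed, serialized

print(prioritize_decomposition({'A': 100, 'B': 10, 'C': 50}, {'B': ['A']}))
# → (['A', 'C'], ['B'])
```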
Data locality is critical to achieving high performance on high-performance parallel computers, so how to find a good data decomposition is becoming a key issue in parallelizing compilers. We have developed a compiler system that fully automatically parallelizes sequential programs and optimizes data decomposition to improve data locality. The data decomposition algorithm consists of two steps. The first step chooses the basic data and computation decomposition without considering read-only data. The second step then adjusts the data decomposition taking read-only data into account. We ran our compiler on a set of application programs; the results show that the algorithm can effectively discover parallelism.
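The two-step structure might be sketched like this (an assumed simplification with hypothetical names, in which step 1 block-distributes ordinary arrays and step 2 replicates read-only arrays so every processor can read them without communication; the actual algorithm chooses among many decompositions):

```python
def decompose(arrays, readonly):
    """Two-step sketch: step 1 picks a block distribution for writable
    arrays, ignoring read-only ones; step 2 handles read-only arrays by
    replicating them, removing any communication for their reads."""
    decomp = {a: 'block' for a in arrays if a not in readonly}       # step 1
    decomp.update({a: 'replicated' for a in arrays if a in readonly})  # step 2
    return decomp

print(decompose(['A', 'B', 'C'], {'C'}))
# → {'A': 'block', 'B': 'block', 'C': 'replicated'}
```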
Social network analysis is used to extract features of human communities and proves to be very instrumental in a variety of scientific domains. The dataset of a social network is often so large that a cloud data analysis service, in which the computation is performed on a parallel platform in the cloud, becomes a good choice for researchers not experienced in parallel programming. In the cloud, a primary challenge to efficient data analysis is the computation and communication skew (i.e., load imbalance) among computers caused by humanity's group behavior (e.g., the bandwagon effect). Traditional load-balancing techniques either require significant effort to re-balance loads on the nodes or cannot cope well with stragglers. In this paper, we propose a general straggler-aware execution approach, SAE, to support the analysis service in the cloud. It offers a novel computational decomposition method that factors straggling feature-extraction processes into more fine-grained sub-processes, which are then distributed over clusters of computers for parallel execution. Experimental results show that SAE can speed up the analysis by up to 1.77 times compared with state-of-the-art solutions.
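The core decomposition idea, factoring one straggling task into finer-grained sub-tasks, can be illustrated with a toy sketch (hypothetical names; SAE itself involves scheduling and aggregation logic far beyond this):

```python
def split_straggler(work_items, factor):
    """Factor one straggling extraction task (a large list of work items,
    e.g. the neighbors of a high-degree vertex) into `factor` finer
    sub-tasks that can run on different machines in parallel."""
    return [work_items[k::factor] for k in range(factor)]

print(split_straggler(list(range(10)), 3))
# → [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Splitting the skewed task rather than migrating it whole is what lets the approach cope with stragglers that classic re-balancing cannot move cheaply.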
ISBN:
(Print) 9780769529097
Minimizing communication by increasing the locality of data references is an important optimization for achieving high performance on distributed memory machines. In the process of decomposition, however, data reorganization is inevitable, and so is the communication it produces. In this paper, the authors present a linear decomposition algorithm that automatically finds computation and data decompositions, including decompositions that incur data-reorganization communication. The authors further improve the method and reduce the communication cost by merging parallel regions that share the same data decomposition.
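Merging adjacent parallel regions that share a data decomposition can be sketched as follows (a hedged illustration with a hypothetical representation of regions as (decomposition, statements) pairs): consecutive regions with the same decomposition are fused, so no reorganization communication is inserted between them:

```python
def merge_regions(regions):
    """Fuse consecutive parallel regions that share a data decomposition,
    eliminating the reorganization communication between them."""
    merged = []
    for decomp, body in regions:
        if merged and merged[-1][0] == decomp:
            # Same decomposition as the previous region: fuse the bodies.
            merged[-1] = (decomp, merged[-1][1] + body)
        else:
            merged.append((decomp, body))
    return merged

print(merge_regions([('row', ['s1']), ('row', ['s2']), ('col', ['s3'])]))
# → [('row', ['s1', 's2']), ('col', ['s3'])]
```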