Irregular loop nests in which the loop bounds are determined dynamically by indexed arrays are difficult to compile into expressive parallel constructs, such as segmented scans and reductions. In this paper, we descri...
详细信息
ISBN:
(纸本)9780897917001
Irregular loop nests in which the loop bounds are determined dynamically by indexed arrays are difficult to compile into expressive parallel constructs, such as segmented scans and reductions. In this paper, we describe a suite of transformations to automatically parallelize such irregular loop nests, even in the presence of recurrences. We describe a simple, general loop flattening transformation, along with new optimizations which make it a viable compiler transformation. A robust recurrence parallelization technique is coupled to the loop flattening transformation, allowing parallelization of segmented reductions, scans, and combining-sends over arbitrary associative operators. We discuss the implementation and performance results of the transformations in a parallelizing Fortran 77 compiler for the Cray C90 supercomputer. In particular, we focus on important sparse matrix-vector multiplication kernels, for one of which we are able to automatically derive an algorithm used by one of the fastest library routines available.
An increasing number of programming languages, such as Fortran 90 and APL, are providing a rich set of intrinsic array functions and array expressions. These constructs which constitute an important part of data paral...
详细信息
An increasing number of programming languages, such as Fortran 90 and APL, are providing a rich set of intrinsic array functions and array expressions. These constructs which constitute an important part of data parallel languages provide excellent opportunities for compiler optimizations. In this paper, we present a new approach to combine consecutive data access patterns of array constructs into a composite access function to the source arrays. Our scheme is based on the composition of access functions, which is similar to a composition of mathematic functions. Our new scheme can handle not only data movements of arrays of different numbers of dimensions and segmented array operations but also masked array expressions and multiple sources array operations. As a result, our proposed scheme is the first synthesis scheme which can synthesize Fortran 90 RESHAPE, EOSHIFT, MERGE, and WHERE constructs together. Experimental results show speedups from 1.21 to 2.95 for code fragments from real applications on a Sequent multiprocessor machine by incorporating the proposed optimizations.
The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and ...
详细信息
ISBN:
(纸本)9780897917001
The goal of the Olden project is to build a system that provides parallelism for general purpose C programs with minimal programmer annotations. We focus on programs using dynamic structures such as trees, lists, and DAGs. We demonstrate that providing both software caching and computation migration can improve the performance of these programs, and provide a compile-time heuristic that selects between them for each pointer dereference. We have implemented a prototype system on the Thinking Machines CM-5. We describe our implementation and report on experiments with ten benchmarks.
We have developed compiler algorithms that analyze explicitly parallel programs and restructure their shared data to reduce the number of false sharing misses. The algorithms analyze per-process shared data accesses, ...
详细信息
ISBN:
(纸本)9780897917001
We have developed compiler algorithms that analyze explicitly parallel programs and restructure their shared data to reduce the number of false sharing misses. The algorithms analyze per-process shared data accesses, pinpoint the data structures that are susceptible to false sharing and choose an appropriate transformation to reduce it. The transformations either group data that is accessed by the same processor or separate individual data items that are shared. This paper evaluates that technique. We show through simulation that our analysis successfully identifies the data structures that are responsible for most false sharing misses, and then transforms them without unduly decreasing spatial locality. The reduction in false sharing positively impacts both execution time and program scalability when executed on a KSR2. Both factors combine to increase the maximum achievable speedup for all programs, more than doubling it for several. Despite being able to only approximate actual inter-processor memory accesses, the compiler-directed transformations always outperform programmer efforts to eliminate false sharing.
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address space, distributed ...
详细信息
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address space, distributed memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. Our connected components algorithm uses a novel approach for parallel merging which performs drastically limited updating during iterative steps, and concludes with a total consistency update at the final step. The algorithms have been coded in Split-C and run on a variety of platforms. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for these two primitives, even when compared with machine-specific implementations. More efficient implementations of Split-C will likely result in even faster execution times.
It is widely acknowledged in high-performance computing circles that parallel input/output needs substantial improvement in order to make scalable computers truly usable. We present a data storage model that allows pr...
ISBN:
(纸本)9780897917001
It is widely acknowledged in high-performance computing circles that parallel input/output needs substantial improvement in order to make scalable computers truly usable. We present a data storage model that allows processors independent access to their own data and a corresponding compilation strategy that integrates data-parallel computation with data distribution for out-of-core problems. Our results compare several communication methods and I/O optimizations using two out-of-core problems, Jacobi iteration and LU factorization.
In this paper, we propose a straightforward solution to the problems of compositional parallelprogramming by using skeletons as the uniform mechanism for structured composition. In our approach parallel programs are ...
ISBN:
(纸本)9780897917001
In this paper, we propose a straightforward solution to the problems of compositional parallelprogramming by using skeletons as the uniform mechanism for structured composition. In our approach parallel programs are constructed by composing procedures in a conventional base language using a set of high-level, pre-defined, functional, parallel computational forms known as skeletons. The ability to compose skeletons provides us with the essential tools for building further and more complex application-oriented skeletons specifying important aspects of parallel computation. Compared with the process network based composition approach, such as PCN, the skeleton approach abstracts away the fine details of connecting communication ports to the higher level mechanism of making data distributions conform, thus avoiding the complexity of using lower level ports as the means of interaction. Thus, the framework provides a natural integration of the compositional programming approach with the data parallelprogramming paradigm.
Data-parallel languages, such as High Performance Fortran, are widely regarded as a promising means for writing portable programs for distributed-memory machines. Novel features of these languages call for the develop...
ISBN:
(纸本)9780897917001
Data-parallel languages, such as High Performance Fortran, are widely regarded as a promising means for writing portable programs for distributed-memory machines. Novel features of these languages call for the development of new techniques in both compilers and run-time systems. In this paper, we present an improved algorithm for finding the local memory access sequence in computations involving regular sections of arrays with cyclic(k) distributions. After establishing the fact that regular section indices correspond to elements of an integer lattice, we show how to find a lattice basis that allows for simple and fast enumeration of memory accesses. The complexity of our algorithm is shown to be lower than that of the previous solution for the same problem. In addition, the experimental results demonstrate the efficiency of our method in practice.
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address space, distributed ...
详细信息
ISBN:
(纸本)9780897917001
This paper presents efficient and portable implementations of two useful primitives in image processing algorithms, histogramming and connected components. Our general framework is a single-address space, distributed memory programming model. We use efficient techniques for distributing and coalescing data as well as efficient combinations of task and data parallelism. Our connected components algorithm uses a novel approach for parallel merging which performs drastically limited updating during iterative steps, and concludes with a total consistency update at the final step. The algorithms have been coded in Split-C and run on a variety of platforms. Our experimental results are consistent with the theoretical analysis and provide the best known execution times for these two primitives, even when compared with machine-specific implementations. More efficient implementations of Split-C will likely result in even faster execution times.
For a long time efficient use of parallel computers has been hindered by dependencies introduced in software through low-level implementation practice. In this paper we present a programming environment and language c...
ISBN:
(纸本)9780897917001
For a long time efficient use of parallel computers has been hindered by dependencies introduced in software through low-level implementation practice. In this paper we present a programming environment and language called Object-Math (Object oriented Mathematical language for scientific computing), which aims at eliminating this problem by allowing the user to represent mathematical equation-based models directly in the system. The system performs analysis of mathematical models to extract parallelism and automatically generates parallel code for numerical *** the context of industrial applications in mechanical analysis, we have so far primarily explored generation of parallel code for solving systems of ordinary differential equations (ODEs), in addition to preliminary work on generating code for solving partial differential equations. Two approaches to extracting parallelism have been implemented and evaluated: extracting parallelism at the equation system level and at the single equation level, respectively. We found that for several applications the corresponding systems of equations do not partition well into subsystems. This means that the equation system level approach is of restricted general applicability. Thus, we focused on the equation-level approach which yielded significant parallelism for ODE systems solution. For the bearing simulation applications we present here, the achieved speedup is however critically dependent on low communication latency of the parallel computer.
暂无评论