Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned t...
详细信息
ISBN:
(纸本)0897915895
Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(κ) distribution, and a computation involving the regular section A(: h: s), the local memory access sequence for any processor is characterized by a finite state machine of at most κ;states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.
Data-flow analysis of scalar variables and data dependence analysis on array elements are two important program analyses used in optimizing and parallelizing compilers. Traditional data-flow analysis models accesses o...
详细信息
ISBN:
(纸本)0897915607
Data-flow analysis of scalar variables and data dependence analysis on array elements are two important program analyses used in optimizing and parallelizing compilers. Traditional data-flow analysis models accesses of array elements simply as accesses to the entire array, and is inadequate for parallelizing loops in array-based programs. On the other hand, data dependence analysis differentiates between different array elements but is flow-insensitive. this paper studies the combination of these two analyses - data-flow analysis of accesses to individual array elements. the problem of finding precise array data-flow information in the domain of loop nests where the loop bounds and array indices are affine functions of loop indices was first formulated by Feautrier. Feautrier's algorithm, based on parametric integer programming techniques, is general but inefficient. this paper presents an efficient algorithm that can find the same precise information for many of the programs found in practice. In this paper, we argue that data-flow analysis of individual array elements is necessary for effective automatic parallelization. In particular, we demonstrate the use of array data-flow analysis in an important optimization known as array privatization. By demonstrating that array data-flow analysis can be computed efficiently and by showing the importance of the optimizations enabled by the analysis, this paper suggests that array data-flow analysis may become just as important in future optimizing and parallelizing compilers as data-flow and data dependence analysis are in today's compilers.
Several novel techniques for efficient implementation of concurrent object-oriented languages on general purpose, stock multicomputers are presented. these techniques have been developed in implementing our concurrent...
详细信息
ISBN:
(纸本)0897915895
Several novel techniques for efficient implementation of concurrent object-oriented languages on general purpose, stock multicomputers are presented. these techniques have been developed in implementing our concurrent object-oriented language ABCL on a Fujitsu Laboratory's experimental multicomputer AP1000 consisting of 512 SPARC chips. the proposed intra-node scheduling mechanism reduces the cost of local message passing. the cost of intra-node asynchronous message passing is about 20 SPARC instructions in the best case, including locality checking, dynamic method lookup, and scheduling. the minimum latency of asynchronous internode message passing is about 9 μs, or about 120 instructions, employing the self-dispatching mechanism independently proposed by Eicken et al. A large scale benchmark which involves 9,000,000 message passings shows 440 times speedup on the 512 nodes system compared to the sequential version of the same algorithm. We rely on simple hardware support for message passing and use no specialized architectural supports for object-oriented computing. thus, we are able to enjoy the benefits of future progress in standard processor technology. Our result shows that concurrent object-oriented languages can be implemented efficiently on conventional multicomputers.
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of mo...
详细信息
ISBN:
(纸本)0897915895
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. this paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. the model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. the utility of the model is demonstrated through examples that are implemented on the CM-5.
Understanding synchronization is important for a parallelprogramming tool that uses dependence analysis as the basis for advising programmers on the correctness of parallel constructs. this paper discusses static ana...
详细信息
A technique for implementing algorithms on a multiprocessor computer system is data partitioning, in which input data for a problem is partitioned among many processors that cooperate to solve the problem. this paper ...
详细信息
the parallelization of sequential programs is an adequate approach for the easy use of architectural features provided on parallel computers. We propose to enhance this technique withthe notion of semantic paralleliz...
详细信息
ISBN:
(纸本)0897912152
the parallelization of sequential programs is an adequate approach for the easy use of architectural features provided on parallel computers. We propose to enhance this technique withthe notion of semantic parallelization. the principle is to consider the transformations of programs induced by parallelization as non-standard denotational semantics of the source programming language. We show how to apply this concept to detect, in a toy language ALL, parallelizable complex commands and to deal with some type of programs using indirections. thanks to results in domain theory and abstract interpretation, we give correctness proofs for these transformations with respect to ALL standard semantic. A byproduct of our approach is that a denotational specification is also an executable prototype;hence, we were able to implement a semantic parallelizer in the ML functional language. Running examples are supplied.
One of the most important pragmatic advantages of functional languages is that concurrency in a program is implicit-there is no need for special constructs to express parallelism as is required in most conventional la...
详细信息
ISBN:
(纸本)089791175X
One of the most important pragmatic advantages of functional languages is that concurrency in a program is implicit-there is no need for special constructs to express parallelism as is required in most conventional languages. Furthermore, it is fairly easy for compilers to automatically determine the concurrency as a step toward decomposing a program for execution on a suitable parallel architecture. Yet it is often the case that one knows precisely the optimal decomposition for execution on a particular machine, and one can never expect a compiler to determine such optimal mappings in all cases. this paper is concerned with ways to allow the programmer to explicitly express this mapping of program to machine, by using annotations that, given a few minor constraints, cannot alter the functional semantics of the program. We show through several deeded examples the expressiveness and conciseness of the resulting " para-functional programming methodology, using an experimental language called ParAlfl based on our ideas. We also give a formal denotational description of the mapping semantics using a notion of execution trees.
Maple is a statically typed language system, with very general generic facilities. It incorporates its own programming environment and provides parallel processing with a new method of process synchronization. the syn...
暂无评论