Many loop nests in scientific codes contain a parallelizable outer loop but have an inner loop for which the number of iterations varies between different iterations of the outer loop. When running this kind of loop n...
详细信息
ISBN:
(纸本)0897914759
Many loop nests in scientific codes contain a parallelizable outer loop but have an inner loop for which the number of iterations varies between different iterations of the outer loop. When running this kind of loop nest on a SIMD machine, the SIMD-inherent restriction to single program counter common to all processors will cause a performance degradation relative to comparable MIMD implementations. This problem is not due to limited parallelism or bad load balance, it is merely a problem of control flow. This paper presents a loop transformation, which we call loop flattening, that overcomes this limitation by letting each processor advance to the next loop iteration containing useful computation, if there is such an iteration for the given processor. We study a concrete example derived from a molecular dynamics code and compare performance results for flattened and unflattened versions of this kernel on two SIMD machines, the CM-2 and the DECmpp 12000. We then evaluate loop flattening from the compiler's perspective in terms of applicability, cost, profitability, and safety. We conclude with arguing that loop flattening, whether performed by the programmer or by the compiler, introduces negligible overhead and can significantly improve the performance of scientific codes for solving irregular problems.
Higher order functional programs constantly allocate objects dynamically. These objects are typically cons cells, closures, and records and are generally allocated in the heap and reclaimed later by some garbage colle...
详细信息
ISBN:
(纸本)0897914759
Higher order functional programs constantly allocate objects dynamically. These objects are typically cons cells, closures, and records and are generally allocated in the heap and reclaimed later by some garbage collection process. This paper describes a compile time analysis, called escape analysis, for determining the lifetime of dynamically created objects in higher order functional programs, and describes optimizations that can be performed, based on the analysis, to improve storage allocation and reclamation of such objects. In particular, our analysis can be applied to programs manipulating lists, in which case optimizations can be performed to allow cons cells in spines of lists to be either reclaimed immediately or reused without incurring any garbage collection overhead. In a previous paper on escape analysis, we had left open the problem of performing escape analysis on lists. Escape analysis simply determines when the argument (or some part of the argument) to a function call is returned by that call. This simple piece of information turns out to be sufficiently powerful to allow stack allocation of objects, compile-time garbage collection, reduction of run-time storage reclamation overhead, and other optimizations that are possible when the lifetimes of objects can be computed statically. Our approach is to define a high-level non-standard semantics that, in many ways, is similar to the standard semantics and captures the escape behavior caused by the constructs in a functional language. The advantage of our analysis lies in its conceptual simplicity and portability (i.e. no assumption is made about an underlying abstract machine).
This paper discusses the requirement for intermachine communication and how this requirement is met with APL2's cross-system shared variables. One problem encountered in the design is the mapping of APL2's coo...
详细信息
ISBN:
(纸本)0897914775
This paper discusses the requirement for intermachine communication and how this requirement is met with APL2's cross-system shared variables. One problem encountered in the design is the mapping of APL2's cooperative peer-to-peer communication model onto a client server data transport mechanism. Another is the ability of APL2's sharing mechanism to address applications on other machines. The goal of the facility is to provide a high level of communication between applications running on different machines while introducing the minimum language extension.
We present the design and implementation of APL Intrinsic Functions for a Finite State Machine (also known as a Finite State Automaton) which recognizes regular languages, and a Parser which recognizes a subset of con...
详细信息
ISBN:
(纸本)0897914775
We present the design and implementation of APL Intrinsic Functions for a Finite State Machine (also known as a Finite State Automaton) which recognizes regular languages, and a Parser which recognizes a subset of context free languages, including SLR(1), LALR(1), and LR(1). These are currently being used on a commercial APL mainframe system as part of a large real-time financial trading system, where they are part of a compiler which translates dealer specification statements into APL functions. Use of these Intrinsic Functions more than doubled the performance of the compiler. In addition to making a significant performance improvement for a large production system, we show these functions to have value in effectively solving many common programming problems, especially those which are inherently or apparently iterative.
A wide variety of object-oriented (OO) methodologies and tools are currently available for software development Each methodology emphasizes various phases and activities of the software life cycle using different term...
详细信息
This paper develops a methodology for compiling and executing irregular parallel programs. Such programs implement parallel operations whose size and work distribution depend on input data. We show a fundamental relat...
详细信息
ISBN:
(纸本)0897914759
This paper develops a methodology for compiling and executing irregular parallel programs. Such programs implement parallel operations whose size and work distribution depend on input data. We show a fundamental relationship between three quantities that characterize an irregular parallel computation: the total available parallelism, the optimal grain size, and the statistical variance of execution times for individual tasks. This relationship yields a dynamic scheduling algorithm that substantially reduces the overhead of executing irregular parallel operations. We incorporated this algorithm into an extended Fortran compiler. The compiler accepts as input a subset of Fortran D which includes blocked and cyclic decompositions and perfect alignment;it outputs Fortran 77 augmented with calls to library routines written in C. For irregular parallel operations, the compiled code gathers information about available parallelism and task execution time variance and uses this information to schedule the operation. On distributed memory architectures, the compiler encodes information about data access patterns for the runtime scheduling system so that it can preserve communication locality. We evaluated these compilation techniques using a set of application programs including climate modeling, circuit simulation, and x-ray tomography, that contain irregular parallel operations. The results demonstrate that, for these applications, the dynamic techniques described here achieve near-optimal efficiency on large numbers of processors. In addition, they perform significantly better, on these problems, than any previously proposed static or dynamic scheduling algorithm.
We consider the problem of supporting compacting garbage collection in the presence of modern compiler optimizations. Since our collector may move any heap object, it must accurately locate, follow, and update all poi...
Alphonse is a program transformation system that uses dynamic dependency analysis and incremental computation techniques to automatically generate efficient dynamic implementations from simple exhaustive imperative pr...
ISBN:
(纸本)9780897914758
Alphonse is a program transformation system that uses dynamic dependency analysis and incremental computation techniques to automatically generate efficient dynamic implementations from simple exhaustive imperative program specifications.
The proceedings contain 28 papers. The topics discussed include: efficient and exact data dependence analysis;practical dependence testing;a data locality optimizing algorithm;CCG: a prototype coagulating code generat...
ISBN:
(纸本)0897914287
The proceedings contain 28 papers. The topics discussed include: efficient and exact data dependence analysis;practical dependence testing;a data locality optimizing algorithm;CCG: a prototype coagulating code generator;predicting program behavior using real or estimated profiles;procedure merging with instruction caches;strictness and binding-time analyses: two for the price of one;parameterized partial evaluation;the semantic approach to program slicing;automatic generation of global optimizers;size and access inference for data-parallel programs;and Fortran at ten gigaflops: the connection machine convolution compiler.
HoME is a version of Smalltalk which can be efficiently executed on a multiprocessor and can be executed in parallel by combining a Smalltalk process with a Mach thread and executing the process on the thread. HoME is...
ISBN:
(纸本)9780897914758
HoME is a version of Smalltalk which can be efficiently executed on a multiprocessor and can be executed in parallel by combining a Smalltalk process with a Mach thread and executing the process on the thread. HoME is nearly the same as ordinary Smalltalk except that multiple processes may execute in parallel. Thus, almost all applications running on ordinary Smalltalk can be executed on HoME without changes in their *** was designed and implemented based on the following fundamental policies: (1) theoretically, an infinite number of processes can become active; (2) the moment a process is scheduled, it becomes active; (3) no process switching occurs; (4) HoME is equivalent to ordinary Smalltalk except for the previous three *** performance of the current implementation of HoME running on OMRON LUNA-88K, which had four processors, was measured by benchmarks which execute in parallel with multiple processes. In all benchmarks, the results showed that HoME's performance is much better than HPS on the same workstation.
暂无评论