ISBN:
(print) 0818656026
This paper presents MUSTANG, a system for translating Fortran to single-assignment form in an effort to automatically extract parallelism. Specifically, a sequential Fortran source program is translated into IF1, a machine-independent dataflow graph description language that is the intermediate form for the SISAL language. During this translation, Parafrase 2 is used to detect opportunities for parallelization, which are then explicitly introduced into the IF1 program. The resulting IF1 program is then processed by the Optimizing SISAL Compiler, which produces parallel executables on multiple target platforms. Execution results for several Livermore Loops are presented and compared against Fortran and SISAL implementations on two different platforms. The results show that the translation is an efficient method for exploiting parallelism from sequential Fortran source code.
An important problem in graph embeddings and parallel computing is to embed a rectangular grid into other graphs. We present a novel, general combinatorial approach to (one-to-one) embedding rectangular grids into their ideal rectangular grids and optimal hypercubes. In contrast to earlier approaches of Aleliunas and Rosenberg, and Ellis, our approach is based on a special kind of doubly stochastic matrix. We prove that any rectangular grid can be embedded into its ideal rectangular grid with dilation equal to the ceiling of the compression ratio, which is both optimal up to a multiplicative constant and a substantial generalization of previous work. We also show that any rectangular grid can be embedded into its nearly ideal square grid with dilation at most 3.
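The dilation bound above can be illustrated with a small sketch. Here the compression ratio is taken as the factor by which the rows of the source grid must be folded to fit the target grid; this definition, and the function name, are illustrative assumptions rather than the paper's exact formulation.

```python
import math

def dilation_bound(src_rows, src_cols, dst_rows, dst_cols):
    """Illustrative upper bound on dilation for embedding a
    src_rows x src_cols grid into a dst_rows x dst_cols grid of at
    least the same size. The compression ratio is assumed to be the
    row-folding factor src_rows / dst_rows (an assumption for
    illustration; the paper's precise definition may differ)."""
    assert dst_rows * dst_cols >= src_rows * src_cols
    compression = src_rows / dst_rows
    return math.ceil(compression)

# Folding a 9x4 grid into a 6x6 grid compresses rows by 9/6 = 1.5,
# so the claimed dilation bound is ceil(1.5) = 2.
print(dilation_bound(9, 4, 6, 6))  # 2
```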
A parallel transputer-based emulator has been developed to evaluate the DDM, a highly parallel virtual shared-memory architecture. The emulator provides performance results for a hardware implementation of the DDM using a calibrated virtual clock. Unlike the virtual clock of a simulator, the emulator clock is bound to a fixed fraction of real time, so individual processors may time actions independently without the need for a global clock value. Each component of the emulator is artificially slowed down so that the balance of the speeds of all components reflects the balance of the expected hardware implementation. The calibrated emulator runs an order of magnitude faster than a simulator (the application program is executed directly and there is no overhead for the maintenance of event lists) and, more importantly, the emulator is inherently parallel. This results in a peak emulation speed of 27 million instructions per second when simulating a machine with 81 leaf nodes on a 121-node transputer system.
In this paper, we present a new scalable algorithm, called the Regular Schedule, for parallel evaluation of band linear recurrences (BLRs, i.e., mth-order linear recurrences for m ≥ 1). Its scalability and simplicity make it well suited for vector supercomputers and massively parallel computers. We describe our implementation of the Regular Schedule on two types of machines: the Convex C240 and the MasPar MP-2. The scalability of our scheduling techniques is demonstrated on both machines. On the Convex C240, programs containing BLRs implemented in C using the Regular Schedule show significant CPU-performance improvements over the same programs implemented using the highly optimized, coded-in-assembly BLAS routines [17].
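For reference, a band linear recurrence of order m is the sequential computation sketched below; this baseline is what the Regular Schedule parallelizes. The function and coefficient layout are illustrative, not the paper's notation.

```python
def eval_blr(c, a, m):
    """Sequentially evaluate an order-m band linear recurrence:
        x[i] = c[i] + sum_{j=1..m} a[i][j-1] * x[i-j]
    with terms x[i-j] for i-j < 0 dropped. Coefficient layout is
    an illustrative assumption."""
    n = len(c)
    x = []
    for i in range(n):
        s = c[i]
        for j in range(1, m + 1):
            if i - j >= 0:
                s += a[i][j - 1] * x[i - j]
        x.append(s)
    return x

# First-order example (m = 1): x[i] = 1 + 2 * x[i-1]
print(eval_blr([1, 1, 1, 1], [[2]] * 4, 1))  # [1, 3, 7, 15]
```

The loop-carried dependence on x[i-1..i-m] is exactly what makes naive vectorization fail and motivates a restructured schedule.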
Run-time data redistribution can affect algorithm performance in distributed-memory machines. Redistribution of data can be performed between algorithm phases when a different data decomposition is expected to deliver increased performance for a subsequent phase of computation. Additionally, data redistribution can occur at subprogram boundaries. Redistribution, however, represents increased program overhead, as algorithm computation is necessarily suspended while data are exchanged among processor memories. In this paper, we present a technique for data-processor mapping, applicable to data redistribution, that minimizes the total amount of data that must be communicated among processors. The mapping technique is architecture-independent and represents our initial work toward achieving efficient redistribution in distributed-memory machines.
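The quantity being minimized can be made concrete with a small sketch: given two owner mappings for a 1-D array, the communication volume is the number of elements whose owner changes. The mappings and function names below are illustrative assumptions, not the paper's technique.

```python
def block_owner(i, n, p):
    """Owner of element i under a block distribution of n elements
    over p processors (illustrative mapping)."""
    b = -(-n // p)  # ceil(n / p), the block size
    return i // b

def cyclic_owner(i, n, p):
    """Owner of element i under a cyclic distribution."""
    return i % p

def redistribution_volume(n, p, old_owner, new_owner):
    """Count elements that must cross processor memories when the
    decomposition changes; minimizing this count over candidate
    data-processor mappings is the stated goal (sketch only)."""
    return sum(1 for i in range(n)
               if old_owner(i, n, p) != new_owner(i, n, p))

# Block owners for n=8, p=2: 0 0 0 0 1 1 1 1; cyclic: 0 1 0 1 0 1 0 1.
# Owners differ at i = 1, 3, 4, 6, so 4 elements must move.
print(redistribution_volume(8, 2, block_owner, cyclic_owner))  # 4
```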
We present several techniques that we have used to optimize the performance of a message-passing C code for molecular dynamics on the CM-5. We describe our use of the CM-5 vector units and a parallel memory caching scheme that we have developed to speed up the code by more than 50%. A modification that decreases our communication time by 35% is also presented, along with a discussion of how we have been able to take advantage of the CM-5 hardware without significantly compromising code portability. We have been able to speed up our original code by a factor of ten, and we feel that our modifications may be useful in optimizing the performance of other message-passing C applications on the CM-5.
The exploitation of the inherent parallelism in applications written for shared-memory systems depends critically on the efficiency of the synchronization and data-exchange primitives provided by the hardware. This paper discusses and analyzes such primitives as they are implemented in the Scalable Coherent Interface (SCI). The SCI synchronization primitives are based on QOLB, a hardware primitive that shows much promise for reducing or eliminating the synchronization and access latencies of shared data. Introducing finer-grained programs in the absence of such latency reduction will have little or no benefit. In particular, we discuss how QOLB fits the underlying linked-list cache-coherence protocol of SCI. We also show how, for some important scenarios (critical sections and pairwise sharing), the QOLB primitives in SCI can greatly reduce data-communication latencies.
The KSR1 has a shared address space that spreads over physically distributed memory modules with various latencies. Performance therefore depends considerably on the program's locality of reference and on the effectiveness of the prefetch and post-store instructions. This paper analyzes the various memory-latency factors that stall the processor during program execution on a 32-processor system. A suitable model for evaluating these factors is developed for the execution of tiled do-loops with the slice strategy. The benchmark used is a sparse matrix solver. The limited size of the prefetch queue is shown to stall the processor for long periods, which reduces the benefit of prefetch considerably. The post-store operation is shown to have a high overhead; however, delaying the post-store operation improved performance considerably.
We are developing a massively parallel special-purpose computer system for astrophysical N-body simulations, GRAPE-4 (GRAvity-PipE 4). The GRAPE-4 system is designed to simulate the dynamics of classical particles that interact with each other gravitationally, using predictor-corrector methods. We have developed two application-specific LSIs for the GRAPE-4 system: the HARP (Hermite AcceleratoR Pipe) chip and the PROMEthEUS chip. The HARP chip calculates gravitational forces, and its performance exceeds 600 megaflops. The PROMEthEUS chip calculates predictors for the time integration. Using multi-chip module technology, we can integrate 1920 HARP chips into a single system. The GRAPE-4 system consists of 4 clusters, which are connected to a single host workstation. The peak speed of GRAPE-4 will exceed 1 teraflops even in the worst case, and will reach around 1.8 teraflops in the typical case.
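The per-pair computation that a GRAPE-style force pipeline evaluates in hardware is ordinary direct-summation gravity; a minimal software sketch is below. The softening parameter and function name are illustrative assumptions, and the HARP chip additionally pipelines the force derivative needed by the Hermite scheme, which this sketch omits.

```python
def accelerations(pos, mass, eps=1e-4):
    """Direct-summation gravitational accelerations (G = 1) with a
    softening length eps. Each (i, j) pair corresponds to one pass
    through a GRAPE-style hardware pipeline (illustrative sketch)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc

# Two unit masses one unit apart: |a| is close to 1 (up to softening)
a = accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 1.0])
print(round(a[0][0], 3))  # 1.0
```

The O(N^2) pair loop is what the 1920 replicated pipelines parallelize in hardware.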
Present-day parallel computers often face the problems of large software overheads for process switching and interprocessor communication. These problems are addressed by the Multi-threaded Architecture (MTA), a multiprocessor model designed for efficient parallel execution of both numerical and non-numerical programs. We begin with a conventional processor and add what we believe to be the minimal external hardware necessary for efficient support of multithreaded programs. The presentation begins with the top-level architecture and the program execution model; the latter includes a description of activation frames and thread synchronization. This is followed by a detailed presentation of the processor. Major features of the MTA include the Register-Use Cache for exploiting temporal locality in multiple-register-set microprocessors, support for programs requiring non-determinism and speculation, and local function invocations that can utilize registers for parameter passing.