ISBN: 0818656026 (print)
Communication using blocking receives is the most commonly used mechanism in parallel programming today. Message-driven execution is an alternative mechanism that does not use receive-style statements at all. The message-driven execution style promotes the overlap of computation and communication. Programs written in this style exhibit increased latency tolerance: their performance does not degrade significantly with latency. It also induces compositionality: multiple independently developed modules can be combined correctly without loss of efficiency. However, as the flow of control is not explicit in such programs, they are often difficult to develop and debug. We present a coordination language called Dagger to alleviate this problem. The language has been implemented in the Charm parallel programming system, and runs programs portably on a variety of parallel machines.
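The contrast between the two styles can be made concrete with a minimal sketch (illustrative only, not Dagger or Charm syntax): instead of blocking in a receive, the program registers handlers that the runtime invokes as messages arrive.

```python
# Minimal message-driven sketch: handlers fire when messages arrive,
# replacing blocking receive statements. All names here are hypothetical.
from collections import deque

class MessageDrivenRuntime:
    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def on(self, tag, handler):
        self.handlers[tag] = handler        # register a handler per message tag

    def send(self, tag, payload):
        self.queue.append((tag, payload))   # enqueue; sender never blocks

    def run(self):
        while self.queue:                   # scheduler dispatches any ready message
            tag, payload = self.queue.popleft()
            self.handlers[tag](payload)

rt = MessageDrivenRuntime()
results = []
rt.on("partial_sum", results.append)
for part in ([1, 2], [3, 4]):
    rt.send("partial_sum", sum(part))       # computation proceeds between sends
rt.run()
print(sum(results))  # 10
```

Because no handler ever waits on a specific sender, independently written modules can share the same scheduler without serializing each other, which is the compositionality property the abstract describes.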
Superscalar microprocessors obtain high performance by exploiting parallelism at the instruction level. To effectively use the instruction-level parallelism found in general-purpose, non-numeric code, future processors will need to speculatively execute far beyond the conditional branches that limit instruction fetch. One result of this deep speculation is an increase in the number of instruction and data memory references due to the execution of mispredicted paths. Using a tool we developed to generate speculative traces from Intel-architecture Unix binaries, we examine the differences in cache performance between speculative and non-speculative execution models. The results pertaining to increased memory traffic, mispredicted-path reference effects, allocation strategies, and speculative write buffers are discussed.
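The extra traffic the abstract mentions can be illustrated with a toy direct-mapped cache model (an assumption-laden sketch, not the authors' simulator): appending wrong-path references to a trace adds misses and can evict correct-path blocks.

```python
# Toy direct-mapped cache: count misses for a reference trace.
# Parameters (4 sets, 16-byte blocks) are arbitrary illustration choices.
def simulate(addresses, num_sets=4, block=16):
    cache = [None] * num_sets
    misses = 0
    for addr in addresses:
        blk = addr // block
        index, tag = blk % num_sets, blk // num_sets
        if cache[index] != tag:       # miss: fetch the block, evict old tag
            cache[index] = tag
            misses += 1
    return misses

correct_path = [0, 16, 32, 0, 16]
wrong_path = [256, 272]               # fetched speculatively, later squashed
base = simulate(correct_path)
spec = simulate(correct_path + wrong_path)
print(base, spec - base)              # 3 baseline misses, 2 extra from speculation
```

In a real speculative trace the wrong-path references are interleaved with the correct path, so the pollution effect (wrong-path blocks evicting useful ones) can cost additional misses beyond the direct count shown here.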
Synchronization primitives for large shared-memory multiprocessors need to minimize latency and contention. Software queue-based locks address these goals, but suffer if a process near the end of the queue waits for a preempted process ahead of it. To solve this problem, we present two queue-based locks that recover from in-queue preemption. The simpler, faster lock employs an extended kernel interface that shares information in both directions across the user-kernel boundary. Results from experiments with real and synthetic applications on SGI and KSR multiprocessors demonstrate that high-performance software spin locks are compatible with multiprogramming on both large-scale and bus-based machines.
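The queue structure these locks build on can be sketched as a classic MCS-style lock (a simplified illustration, not the paper's preemption-recovering variants): each waiter enqueues a node and spins on its own flag, so contention stays local.

```python
# MCS-style queue lock sketch. A Python Lock stands in for the atomic
# swap instruction; time.sleep(0) yields the GIL while "spinning".
import threading
import time

class Node:
    """Per-acquisition queue node; each waiter spins on its own flag."""
    __slots__ = ("locked", "next")
    def __init__(self):
        self.locked = False
        self.next = None

class MCSLock:
    def __init__(self):
        self.tail = None
        self._swap = threading.Lock()

    def acquire(self, node):
        node.locked = True
        node.next = None
        with self._swap:                       # atomic swap: enqueue at tail
            pred, self.tail = self.tail, node
        if pred is not None:
            pred.next = node
            while node.locked:                 # local spinning on our own node
                time.sleep(0)

    def release(self, node):
        with self._swap:                       # atomic check: are we last?
            if self.tail is node:
                self.tail = None
                return
        while node.next is None:               # successor swapped but not yet linked
            time.sleep(0)
        node.next.locked = False               # hand the lock to the successor

lock, counter = MCSLock(), 0

def worker():
    global counter
    for _ in range(500):
        n = Node()
        lock.acquire(n)
        counter += 1                           # critical section
        lock.release(n)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 2000
```

The preemption problem the abstract targets is visible in `release`: if the process holding `node` is descheduled before clearing its successor's flag, everyone behind it in the queue spins uselessly, which is what the authors' kernel-interface extensions are designed to detect and recover from.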
The problem of efficiently storing a d-dimensional array in multiple memory modules of a shared-memory machine is an important problem in parallel processing. In this paper, we consider the problem for three-dimensional arrays. More specifically, given an array A of size n×n×n and a shared-memory machine with n memory modules, we show how to store A so that no two elements within any row, any column, any diagonal of a face of A, or any main sub-array of A are stored in the same memory module. Our scheme thus achieves no memory conflicts when the processors of the shared-memory machine simultaneously access elements within a row, column, sub-array, etc. We also show how to store A efficiently if the diagonals of A are required to be accessed conflict-free in addition to rows and columns. All of our schemes use Latin cubes.
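The flavor of a Latin-cube skew scheme can be shown with the simplest possible example (an illustration, not the paper's construction): map element (i, j, k) to module (i + j + k) mod n. Every row, column, and pillar then touches each module exactly once, and for odd n the face diagonals are conflict-free as well, since 2 is invertible mod n.

```python
# Simplest Latin-cube-style skew: element (i, j, k) -> module (i + j + k) mod n.
def module(i, j, k, n):
    return (i + j + k) % n

n = 5  # odd n: face diagonals are also conflict-free

# A row (vary i; fix j = 2, k = 3) hits each of the n modules exactly once:
row = {module(i, 2, 3, n) for i in range(n)}
# A diagonal of the face k = 3 (vary i = j) also hits each module once:
diag = {module(i, i, 3, n) for i in range(n)}
print(sorted(row), sorted(diag))  # both are permutations of 0..4
```

Conflict-freedom for the paper's full access-pattern set (main sub-arrays in addition to rows, columns, and diagonals) requires the more careful Latin-cube constructions the authors develop.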
This paper gives an O(lg^3 n)-time, n/lg n-processor algorithm for solving the matrix chain ordering problem and for finding optimal triangulations of a convex polygon on the Common CRCW PRAM model. The algorithm works by finding shortest paths in special digraphs that model dynamic programming tables. A key part of the algorithm is further improved by computing row minima of a totally monotone matrix; this lets the algorithm run in O(lg^2 n) time with n processors on the EREW PRAM, or even in O(lg^2 n lg lg n) time with n/lg lg n processors on the CRCW PRAM.
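For reference, the sequential dynamic program that this parallel shortest-path formulation accelerates is the standard O(n^3) matrix-chain recurrence:

```python
# Sequential matrix chain ordering DP: minimum scalar multiplications for
# a chain whose i-th matrix has dimensions dims[i] x dims[i+1].
def matrix_chain(dims):
    n = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):         # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)       # try every split point
            )
    return cost[0][n - 1]

# Chain 10x30, 30x5, 5x60: best order is (A1 A2) A3 = 1500 + 3000 operations.
print(matrix_chain([10, 30, 5, 60]))  # 4500
```

Each table entry depends only on entries for shorter chains, which is what lets the paper recast the table as shortest paths in a digraph and fill it in polylogarithmic parallel time.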
This paper presents an efficient parallel logic-circuit simulation scheme based on the Time Warp optimistic algorithm. The Time Warp algorithm is integrated with a new Global Virtual Time (GVT) computation scheme for fossil collection. The new GVT computation is based on a token-ring passing method, so that global synchronization is not required in a shared-memory multiprocessor system. This allows us to process large logic-simulation problems, where the GVT computation is executed frequently for fossil collection due to limited memory space. We also present how to reduce the frequency of the GVT computation and the rollback ratio by scheduling the process with the smallest timestamp first. We implemented the parallel logic-circuit simulator using Time Warp on BBN Butterfly machines, and the experimental results show that our algorithm provides a significant speedup in processing time even for very large circuits.
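The core of any token-ring GVT pass can be sketched in a few lines (a heavily simplified illustration; the paper's scheme runs asynchronously and must handle messages that are in transit during the pass): the token visits each processor once, carrying the running minimum.

```python
# GVT sketch: the token accumulates the minimum over all local virtual
# times and over timestamps of messages still in transit.
def gvt_token_ring(local_virtual_times, transit_timestamps):
    token = float("inf")
    for lvt in local_virtual_times:     # one token hop per processor
        token = min(token, lvt)
    for ts in transit_timestamps:       # an unreceived message bounds GVT too
        token = min(token, ts)
    return token

print(gvt_token_ring([120, 95, 140], [110]))  # 95
# Events with timestamp below GVT can never be rolled back,
# so their state and messages can be fossil-collected.
```

Because the token only ever moves between neighbors on the ring, no barrier or other global synchronization is needed, which is the property the abstract highlights for shared-memory machines.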
When mapping a parallel program onto a parallel architecture, the number of available processors is usually less than the number of tasks in the program. This gives rise to clustering and intra-processor scheduling problems. In this paper, we address these two problems for distributed-memory systems where programs are explicitly parallel in nature. We show that existing models of program representation are insufficient to capture the temporal behavior of such programs. We use a new Temporal Communication Graph model, which allows identification of the overlap of communication with computation and of inter-task parallelism. Clustering and intra-processor scheduling heuristics that attempt to minimize program completion time are proposed using this model. Simulation results on random task graphs show a 10-25% improvement in completion time over existing heuristics.
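A baseline for this problem class is greedy list scheduling (shown here as a generic sketch, not the paper's TCG-based heuristics): place each ready task on the processor that can finish it earliest.

```python
# Greedy list scheduling of a task DAG onto num_procs processors.
# tasks: (name, duration, [dependencies]) given in topological order.
def list_schedule(tasks, num_procs=2):
    proc_free = [0] * num_procs            # time each processor becomes idle
    finish = {}                            # finish time of each scheduled task
    for name, dur, deps in tasks:
        ready = max((finish[d] for d in deps), default=0)
        p = min(range(num_procs), key=lambda i: max(proc_free[i], ready))
        start = max(proc_free[p], ready)   # wait for processor and predecessors
        finish[name] = start + dur
        proc_free[p] = finish[name]
    return max(finish.values())            # makespan

tasks = [("a", 2, []), ("b", 3, []), ("c", 1, ["a", "b"])]
print(list_schedule(tasks))  # 4: a and b run in parallel, then c
```

What this baseline cannot see, and what the Temporal Communication Graph model adds, is when a task's communication can overlap another task's computation on the same processor, so the real cost of co-locating two tasks is lower than their summed durations.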
We propose an optimized array-handling scheme called the Direct Array Injection Technique (DAIT), which can be used in conjunction with I-structures to provide non-strict and parallel array accesses in a fine/medium-grain multithreaded computing environment. By overlapping array production and consumption, better processor utilization can be achieved, resulting in higher performance. The Direct Array Injection Technique differs from the I-structure approach in that array elements are forwarded directly from the producer activation to the consumer activation without being buffered in a structure store. The advantages of this technique over I-structures are that (1) network traffic is reduced and (2) dynamic memory allocation from the global heap space to store arrays is avoided.
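The I-structure baseline that DAIT improves on can be sketched as a single-assignment cell with deferred reads (a minimal illustration of the general mechanism, not the paper's implementation): a read arriving before the write is queued, and the write satisfies it.

```python
# Minimal I-structure cell: split-phase, single-assignment storage.
class IStructureCell:
    EMPTY = object()

    def __init__(self):
        self.value = self.EMPTY
        self.deferred = []               # continuations waiting for the value

    def read(self, continuation):
        if self.value is self.EMPTY:
            self.deferred.append(continuation)   # defer instead of blocking
        else:
            continuation(self.value)

    def write(self, value):
        assert self.value is self.EMPTY, "I-structures are single-assignment"
        self.value = value
        for k in self.deferred:          # satisfy every deferred read
            k(value)
        self.deferred.clear()

seen = []
cell = IStructureCell()
cell.read(seen.append)   # consumer asks first...
cell.write(42)           # ...producer writes later; the deferred read fires
print(seen)  # [42]
```

Every element here makes a round trip through the structure store (producer to cell, cell to consumer). DAIT's point is to skip the cell entirely when the consumer activation is known, forwarding the element directly and saving both the network hop and the heap allocation.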
This paper presents MUSTANG, a system for translating Fortran to single-assignment form in an effort to automatically extract parallelism. Specifically, a sequential Fortran source program is translated into IF1, a machine-independent dataflow graph description language that is the intermediate form for the SISAL language. During this translation, Parafrase 2 is used to detect opportunities for parallelization, which are then explicitly introduced into the IF1 program. The resulting IF1 program is then processed by the Optimizing SISAL Compiler, which produces parallel executables on multiple target platforms. The execution results of several Livermore Loops are presented and compared against Fortran and SISAL implementations on two different platforms. The results show that the translation is an efficient method for exploiting parallelism from sequential Fortran source code.
An important problem in graph embeddings and parallel computing is embedding a rectangular grid into other graphs. We present a novel, general combinatorial approach to (one-to-one) embedding of rectangular grids into their ideal rectangular grids and optimal hypercubes. In contrast to earlier approaches of Aleliunas and Rosenberg, and of Ellis, our approach is based on a special kind of doubly stochastic matrix. We prove that any rectangular grid can be embedded into its ideal rectangular grid with dilation equal to the ceiling of the compression ratio, which is both optimal up to a multiplicative constant and a substantial generalization of previous work. We also show that any rectangular grid can be embedded into its nearly ideal square grid with dilation at most 3.
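As a rough numeric illustration of the dilation bound, taking the compression ratio of embedding an a×b grid into a c×d grid to be the larger of the two side-length ratios (one common definition; the paper's precise definition may differ), the theorem gives dilation equal to its ceiling:

```python
# Illustrative dilation bound, assuming compression ratio = max side ratio.
from math import ceil

def dilation_bound(a, b, c, d):
    ratio = max(a / c, b / d)   # how much the longer side must be squeezed
    return ceil(ratio)

# Squeezing a long 3 x 20 grid into a squarer 10 x 6 grid of the same area:
print(dilation_bound(3, 20, 10, 6))  # 4, i.e. ceil(20 / 6)
```

Intuitively, when a long side of the guest grid must fold into a shorter side of the host, some adjacent guest nodes land ceil(ratio) apart, and the theorem shows this is also achievable.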