This paper addresses the problem of partitioning data for distributed memory machines (multicomputers). In current-day multicomputers, interprocessor communication is more time-consuming than instruction execution. If insufficient attention is paid to the data allocation problem, the time spent in interprocessor communication can be high enough to seriously undermine the benefits of parallelism. It is therefore worthwhile for a compiler to analyze patterns of data usage to determine allocation, in order to minimize interprocessor communication. We present a machine-independent analysis of communication-free partitions. We present a matrix notation for describing array accesses in fully parallel loops, which lets us derive sufficient conditions for communication-free partitioning (decomposition) of arrays. For a commonly occurring class of accesses, we present a problem formulation that minimizes communication costs when communication-free partitioning of arrays is not possible.
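To make the idea of a communication-free partition concrete, here is a minimal sketch in Python. It is not the paper's analysis: instead of deriving sufficient conditions from the access matrices, it checks a candidate partition by brute force. Each array reference is written as an assumed pair `(F, f)` meaning "iteration `it` touches element `F*it + f`", and the helper names (`owners`, `communication_free`, `by_row`, `by_col`) are hypothetical.

```python
from itertools import product

def owners(access, partition, iterations):
    """Map each array element touched through `access` to the processors touching it."""
    F, f = access  # element index = F * iteration + f (affine access, assumed form)
    touched = {}
    for it in iterations:
        elem = tuple(sum(a * x for a, x in zip(row, it)) + c for row, c in zip(F, f))
        touched.setdefault(elem, set()).add(partition(it))
    return touched

def communication_free(accesses, partition, iterations):
    """True if no array element is touched by more than one processor."""
    for refs in accesses.values():
        merged = {}
        for ref in refs:
            for elem, procs in owners(ref, partition, iterations).items():
                merged.setdefault(elem, set()).update(procs)
        if any(len(procs) > 1 for procs in merged.values()):
            return False
    return True

# Illustrative loop (an assumption, not taken from the paper):
#   for i, j: A[i][j] = B[i][j-1] + B[i][j+1]
identity = ((1, 0), (0, 1))
accesses = {
    "A": [(identity, (0, 0))],
    "B": [(identity, (0, -1)), (identity, (0, 1))],
}
iterations = list(product(range(4), range(4)))

by_row = lambda it: it[0]  # processor p executes all iterations with i == p
by_col = lambda it: it[1]  # processor p executes all iterations with j == p

print(communication_free(accesses, by_row, iterations))  # True
print(communication_free(accesses, by_col, iterations))  # False
```

Partitioning by rows keeps every element of `A` and `B` on a single processor, while partitioning by columns makes neighbouring processors share elements of `B`.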
Scheduling of data-flow graphs onto parallel processors consists of assigning actors to processors, ordering the execution of actors within each processor, and firing the actors at particular times. Many scheduling strategies do at least one of these operations at compile time to reduce run-time cost. In this paper, we classify four scheduling strategies: 1) fully dynamic, 2) static-assignment, 3) self-timed, and 4) fully static. These are ordered by decreasing run-time cost. Optimal or near-optimal compile-time decisions require deterministic, data-independent program behavior known to the compiler. Thus, moving from strategy 1) toward 4) either sacrifices optimality, decreases generality by excluding certain program constructs, or both. This paper proposes scheduling techniques valid for strategies 2), 3), and 4). In particular, we focus on data-flow graphs representing data-dependent iteration; for such graphs, although it is impossible to deterministically optimize the schedule at compile time, reasonable decisions can be made. For many applications, good compile-time decisions remove the need for dynamic scheduling or load balancing. We assume a known probability mass function for the number of cycles in the data-dependent iteration and show how a compile-time decision about assignment and/or ordering as well as timing can be made. The criterion we use is to minimize the expected total idle time caused by the iteration; in certain cases, this will also minimize the expected makespan of the schedule. We also show how to determine the number of processors that should be assigned to the data-dependent iteration. The method is illustrated with a practical programming example, yielding preliminary results that are very promising.
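The expected-idle-time criterion can be illustrated with a small Python sketch. The cost model below is a simplifying assumption made for illustration, not the paper's formulation: the compile-time schedule reserves room for `x` cycles of the iteration; if the actual count exceeds `x`, the `n_other` processors waiting on the result sit idle, and if it falls short, the `n_iter` processors assigned to the iteration sit idle. All names and the example probability mass function are hypothetical.

```python
def expected_idle(x, pmf, n_iter, n_other, t=1):
    """Expected total idle time when the static schedule assumes x cycles.

    pmf maps a cycle count to its probability; t is the time per cycle.
    """
    total = 0.0
    for cycles, prob in pmf.items():
        if cycles > x:
            # Iteration overruns its slot: the other processors wait.
            total += prob * n_other * (cycles - x) * t
        else:
            # Iteration finishes early: its own processors wait.
            total += prob * n_iter * (x - cycles) * t
    return total

def best_assumed_cycles(pmf, n_iter, n_other, t=1):
    """Pick the assumed cycle count with the smallest expected idle time."""
    candidates = range(min(pmf), max(pmf) + 1)
    return min(candidates, key=lambda x: expected_idle(x, pmf, n_iter, n_other, t))

# Illustrative probability mass function (an assumption, not data from the paper).
pmf = {1: 0.25, 2: 0.5, 3: 0.25}
print(best_assumed_cycles(pmf, n_iter=2, n_other=2))   # 2
print(expected_idle(2, pmf, n_iter=2, n_other=2))      # 1.0
```

Under this model, changing the ratio of `n_iter` to `n_other` shifts the best assumed cycle count toward a lower or higher quantile of the distribution, which hints at why the number of processors given to the iteration matters.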
Intensive scientific algorithms can usually be formulated as nested loops, which are the main source of parallelism. When a nested loop is executed in parallel, the total execution time is composed of two parts: the computation time and the communication time. For a message-passing multiprocessor system, performance declines rapidly when the communication overhead exceeds the corresponding computation. In this paper, we present a method for executing nested loops with constant loop-carried dependencies in parallel on message-passing multiprocessor systems while reducing the communication overhead. First, we partition the nested loop into blocks that result in little communication, without concern for the topology of the machine. For a given linear time transformation found by the hyperplane method, the iterations of a nested loop are partitioned into blocks such that the communication among the blocks is reduced while the execution ordering defined by the time transformation is not perturbed. Then, the blocks generated by the partitioning method are mapped onto multiprocessor systems according to the specific properties of the various machines. We propose a heuristic mapping algorithm for hypercube machines.
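As a rough illustration of the linear time transformation mentioned above, the following Python sketch searches for a time vector `pi` such that `pi . d > 0` for every constant dependence vector `d`, which is the standard validity condition of the hyperplane method. The brute-force search, the coefficient bound, and the "smallest coefficients first" preference are assumptions made for this sketch; the abstract does not say how the transformation is chosen, and the partitioning and mapping steps are not shown.

```python
from itertools import product

def is_valid_time_vector(pi, deps):
    """A time vector is valid if every dependence moves strictly forward in time."""
    return all(sum(p * d for p, d in zip(pi, dep)) > 0 for dep in deps)

def find_time_vector(deps, bound=3):
    """Search small integer vectors for a valid one (hypothetical heuristic)."""
    dim = len(deps[0])
    candidates = [
        pi for pi in product(range(-bound, bound + 1), repeat=dim)
        if is_valid_time_vector(pi, deps)
    ]
    if not candidates:
        return None
    # Prefer small coefficients; break ties lexicographically.
    return min(candidates, key=lambda pi: (sum(abs(c) for c in pi), pi))

# Illustrative dependence vectors for a doubly nested loop (an assumption).
deps = [(1, 0), (0, 1), (1, 1)]
pi = find_time_vector(deps)
print(pi)  # (1, 1)

# Iteration (i, j) executes at step pi . (i, j); here that is i + j.
step = lambda it: sum(p * x for p, x in zip(pi, it))
print(step((2, 3)))  # 5
```

Iterations with the same step value lie on one hyperplane and carry no dependences among themselves, so they can run in parallel.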
ISBN: 3540539514 (print)
A virtual shared memory architecture (VSMA) is a distributed memory architecture that looks to the application software as if it were a shared memory system. The major problem with such a system is to maintain the coherence of the distributed data entities. Shared virtual memory means that the shared data entities are pages of local virtual memories with demand paging. Memory coherence may be strong or weak. Strong coherence is a scheme where all the shared data entities look from the outside as if they were stored in one coherent memory. This simplifies programming of a distributed memory system at the cost of the high message traffic needed to maintain the strong coherence. The efficiency of the system can be increased by adding a weak coherence scheme that allows for multiple writes by different threads of control into the same page. The price of the weak coherence scheme is the need for explicit program synchronizations to reestablish the strong coherence of the result at the end. For the computer architect, the challenging question is how to implement a VSMA most efficiently and, specifically, by what architectural means to support the implementation. In the paper, a new solution to this question is presented, based upon an innovative distributed memory architecture in which communication is conducted by a dedicated communication processor in each node rather than by the node CPU. This makes the exchange of short, fixed-size messages, e.g., invalidation notices, very efficient. Therefore, it becomes more appropriate to minimize the overall administrative overhead, even at the cost of more message traffic. On that rationale, a novel, capability-based mechanism for both strong and weak coherence of shared virtual memory is presented. The weak coherence scheme is built on top of the strong coherence, utilizing its mechanisms. The proposed implementation is totally distributed and based on a strict need-to-know philosophy. Consequently, the e
Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. First, the execution of the barrier may be slow, since it requires the execution of several instructions. Second, processors that are stalled waiting for other processors to reach the barrier cannot do any useful work. In this paper, the notion of the fuzzy barrier is presented, which avoids these drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that can be executed by a processor while it awaits synchronization. The barrier regions are constructed by a compiler and consist of instructions such that a processor is ready to synchronize upon reaching the first instruction and must synchronize before exiting the region. When synchronization does occur, the processors could be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Hardware fuzzy barriers have been implemented as part of a RISC-based multiprocessor system. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the synchronization overhead can be greatly reduced using the mechanism.
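The barrier-region idea can be emulated in software with a split-phase barrier, sketched below in Python. This is only an illustration of the concept under stated assumptions: the paper's mechanism is implemented in hardware with compiler-constructed regions, whereas here the class `FuzzyBarrier` and its `enter`/`leave` methods are hypothetical names, and the region is simply the code a thread runs between the two calls.

```python
import threading

class FuzzyBarrier:
    """Split-phase barrier: enter() signals readiness, leave() waits if needed."""

    def __init__(self, parties):
        self._parties = parties
        self._arrived = 0
        self._generation = 0
        self._cond = threading.Condition()

    def enter(self):
        """Mark this thread as ready to synchronize; never blocks."""
        with self._cond:
            ticket = self._generation
            self._arrived += 1
            if self._arrived == self._parties:
                # Last arrival completes the synchronization for everyone.
                self._arrived = 0
                self._generation += 1
                self._cond.notify_all()
            return ticket

    def leave(self, ticket):
        """Block only if some thread has not yet entered its barrier region."""
        with self._cond:
            while self._generation == ticket:
                self._cond.wait()

def run_demo(n=4):
    barrier = FuzzyBarrier(n)
    shared = [0] * n
    results = [0] * n

    def worker(k):
        shared[k] = k + 1           # work the other threads depend on
        ticket = barrier.enter()    # first instruction of the barrier region
        local = k * k               # independent work done while others catch up
        barrier.leave(ticket)       # must be synchronized before exiting the region
        results[k] = sum(shared) + local

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_demo(4))  # [10, 11, 14, 19]
```

The more independent work placed between `enter` and `leave`, the more likely it is that `leave` returns without waiting, which mirrors the observation about larger barrier regions.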
ISBN: 9780897914130 (print)
This paper discusses the use of shared register channels as a data exchange mechanism among processors in a fine-grained MIMD system with a load/store architecture. A register channel is provided with a synchronization bit that ensures that a processor succeeds in reading a channel only after a value has been written to the channel. The instructions supported by this load/store architecture allow both registers and register channels to be used as operand sources and result destinations. Conditional load, store, and move instructions are provided to allow processors to exchange values through channels in the presence of aliasing caused by array references. The compiler support required to take proper advantage of channels is briefly discussed. In contrast to a VLIW machine, a system with channels does not require strict lockstep operation of its processors. This reduces the delays caused by unpredictable events such as memory bank conflicts.
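The role of the synchronization bit can be modelled with a few lines of Python. This is a behavioural sketch, not the paper's hardware design: the class `RegisterChannel`, the non-blocking `try_read`/`try_write` methods, and the choices that a read clears the bit and that a write to a full channel fails are all assumptions made for illustration. The abstract states only that a read succeeds after a value has been written.

```python
class RegisterChannel:
    """Single-value channel guarded by a synchronization (full/empty) bit."""

    def __init__(self):
        self.full = False   # synchronization bit: True once a value is written
        self.value = None

    def try_write(self, value):
        """Write succeeds only if the channel is empty (assumed behaviour)."""
        if self.full:
            return False
        self.value = value
        self.full = True
        return True

    def try_read(self):
        """Read succeeds only after a write; it then clears the bit (assumed)."""
        if not self.full:
            return (False, None)
        self.full = False
        return (True, self.value)

ch = RegisterChannel()
print(ch.try_read())    # (False, None): nothing written yet
print(ch.try_write(7))  # True
print(ch.try_read())    # (True, 7)
print(ch.try_read())    # (False, None): value already consumed
```

In a real processor a failed read would presumably stall or retry the instruction; the non-blocking return value here just makes that condition visible without needing threads.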