We investigate synchronization activities in application executing on distributed-memory MIMD architectures. Three applications are used to quantify the performance impact of synchronization as the number of processor...
详细信息
We investigate synchronization activities in application executing on distributed-memory MIMD architectures. Three applications are used to quantify the performance impact of synchronization as the number of processors is increased. We also investigate the performance improvement possible when synchronization is supported in hardware. The results show that significant performance improvement can be achieved. The hardware support should include barrier synchronization, operate-and-broadcast, and operations over subsets of processors.< >
Some parallel applications do not require a precise imitation of the behavior of the physically shared memory programming model. Consequently, certain parallel machine architectures have elected to emphasize different...
详细信息
Some parallel applications do not require a precise imitation of the behavior of the physically shared memory programming model. Consequently, certain parallel machine architectures have elected to emphasize different required coherency properties because of possible efficiency gains. This has led to various definitions of models of store coherency. These definitions have not been amenable to detailed analysis and, consequently, inconsistencies have resulted. In this paper a unified framework is proposed in which different models of store coherency are developed systematically by progressively relaxing the constraints that they have to satisfy. A demonstration is given of how formal reasoning can be carried out to compare different models. Some real-life systems are considered and a definition of a version of weak coherency is found to be incomplete.< >
A dynamic storage allocator manages space for objects whose lifetimes are not known by the system at the time for their creation. This paper examines parallel dynamic storage allocation algorithms. Three new algorithm...
详细信息
A dynamic storage allocator manages space for objects whose lifetimes are not known by the system at the time for their creation. This paper examines parallel dynamic storage allocation algorithms. Three new algorithms, multiple free list fit I, modified quick fit, and multiple free list fit II, are presented which outperform all previous algorithms we tested. The performances of the three algorithms differ when the percentage of requests for large blocks is high. Multiple free list fit I is the fastest algorithm when large blocks of many different sizes are requested. Multiple free list fit II is the fastest algorithm when large blocks of only a few different sizes are requested.< >
The authors are concerned with dynamic programming (DP) algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem. Guibas, Kung and Thompson (1979), presented ...
详细信息
The authors are concerned with dynamic programming (DP) algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem. Guibas, Kung and Thompson (1979), presented a systolic array algorithm for this problem that uses O(n/sup 2/) processing cells and solves the problem in O(n) time. The authors present three different mappings of this systolic algorithm on a mesh connected parallel computer. The first two mappings use commonly known techniques for mapping systolic arrays to mesh computers. Both of them are able to obtain only a fraction of maximum possible performance. The primary reason for the poor performance of these formulations is that different nodes at different levels in the multistage graph in the DP formulation require different amounts of computation. Any adaptation has to take this into consideration and evenly distribute the work among the processors. The third mapping balances the work load among processors and thus is capable of providing efficiency approximately equal to 1 (i.e., speedup approximately equal to the number of processors) for any number of processors and sufficiently large problem. They experimentally evaluate these mappings on a mesh embedded onto a 256 processor nCUBE/2.< >
A partitionable hypercube allows simultaneous execution of multiple tasks, where each task can be executed on a choice of subcubes. This paper considers the problem of static nonpreemptive scheduling of w independent ...
详细信息
A partitionable hypercube allows simultaneous execution of multiple tasks, where each task can be executed on a choice of subcubes. This paper considers the problem of static nonpreemptive scheduling of w independent tasks on a n processor partitionable hypercube system to minimize the overall finishing time of the w tasks. Each task can be executed on subcubes of different sizes, with smaller execution times on larger subcubes. A schedule determines the size of the subcube to be assigned to each task and schedules these tasks on the processors in the hypercube system. The problem of finding the optimal schedule, with minimum finishing time, is known to be NP-hard. This paper presents a fast polynomial time approximation algorithm for the problem, and derives a tight worst-case performance bound of 2 for the algorithm.< >
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen'...
详细信息
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7/sup n/) for multiplying 2/sup n/*2/sup n/ matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of working arrays can be reduced to O(4/sup n/). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.< >
Total exchange is the densest parallel communication primitive and poses a severe test for the capability of any parallel architecture. This operation is very important and arises in many applications. We show that a ...
详细信息
Total exchange is the densest parallel communication primitive and poses a severe test for the capability of any parallel architecture. This operation is very important and arises in many applications. We show that a reconfigurable parallel architecture called MICA can perform total exchange efficiently with simple algorithms. The mechanisms and structures of the reconfigurable switches, supporting simple total exchange algorithms, are presented. We introduce the basic operations the network supports as they are issued by the host to efficiently manage the network reconfiguration. Applications in linear algebra are briefly mentioned showing the way they can benefit from our architecture.< >
Efficient partitioning and scheduling of parallel programs and the distribution of data among pmessing elements are very important issues in parallel and distributed systems. Existing tools fdl short in addressing the...
详细信息
Efficient partitioning and scheduling of parallel programs and the distribution of data among pmessing elements are very important issues in parallel and distributed systems. Existing tools fdl short in addressing the issues satisfactorily. On one hand, it is believed to k unreasonable to leave the burden of these complex tasks to the programmers. On the other hand, fully automated schedulers have ken shown to be d little practical significance, or suitable only for restricted cases. In this paper we address the issues and algorithms for efficient partitioning and scheduling d parallel programs, including the distribution of data, in dislributed-memory multiprocessor systems, using the PARSA parallel software development environment. PARSA consists of a set of visual, interactive, compiletime tools that will provide automated program partitions and schedules whenever possible, while permitting the user to exert conml over these operations for a htter performance. The support program assessment tool provides the users the opportunity b fine-tune the program and achieve their performance objectives.
In programming massively parallel computers, it is often necessary to have sets of processes cooperate in performing certain computations and communications. Most run-time libraries require that such sets of processes...
详细信息
In programming massively parallel computers, it is often necessary to have sets of processes cooperate in performing certain computations and communications. Most run-time libraries require that such sets of processes be explicitly specified in the program. In the Venus run-time communication library however, a Process Group abstraction is used to enable implicit coordination of and communication over dynamically determined sets of processes. The Process Groups mechanism in Venus offers an object-oriented approach for handling sets of processes and enhances the debugging and monitoring of programs. The authors describe the Process Groups mechanism in Venus, illustrate its use on the class of N-body problems, and outline some of the data structures and algorithms used to implement this mechanism in Venus.< >
暂无评论