A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen'...
详细信息
A programming methodology based on tensor products has been used for designing and implementing block recursive algorithms for parallel and vector multiprocessors. A previous tensor product formulation of Strassen's matrix multiplication algorithm requires working arrays of size O(7/sup n/) for multiplying 2/sup n/*2/sup n/ matrices. The authors present a modified tensor product formulation of Strassen's algorithm in which the size of working arrays can be reduced to O(4/sup n/). The modified formulation exhibits sufficient parallel and vector operations for efficient implementation. Performance results on the Cray Y-MP are presented.< >
Total exchange is the densest parallel communication primitive and poses a severe test for the capability of any parallel architecture. This operation is very important and arises in many applications. We show that a ...
详细信息
Total exchange is the densest parallel communication primitive and poses a severe test for the capability of any parallel architecture. This operation is very important and arises in many applications. We show that a reconfigurable parallel architecture called MICA can perform total exchange efficiently with simple algorithms. The mechanisms and structures of the reconfigurable switches, supporting simple total exchange algorithms, are presented. We introduce the basic operations the network supports as they are issued by the host to efficiently manage the network reconfiguration. Applications in linear algebra are briefly mentioned showing the way they can benefit from our architecture.< >
Efficient partitioning and scheduling of parallel programs and the distribution of data among pmessing elements are very important issues in parallel and distributed systems. Existing tools fdl short in addressing the...
详细信息
Efficient partitioning and scheduling of parallel programs and the distribution of data among pmessing elements are very important issues in parallel and distributed systems. Existing tools fdl short in addressing the issues satisfactorily. On one hand, it is believed to k unreasonable to leave the burden of these complex tasks to the programmers. On the other hand, fully automated schedulers have ken shown to be d little practical significance, or suitable only for restricted cases. In this paper we address the issues and algorithms for efficient partitioning and scheduling d parallel programs, including the distribution of data, in dislributed-memory multiprocessor systems, using the PARSA parallel software development environment. PARSA consists of a set of visual, interactive, compiletime tools that will provide automated program partitions and schedules whenever possible, while permitting the user to exert conml over these operations for a htter performance. The support program assessment tool provides the users the opportunity b fine-tune the program and achieve their performance objectives.
This paper discusses the architecture and performance of a prototype switch for interconnecting IBM RISC System/6000 workstations. The paper describes the interconnection architecture and performance on a cluster of f...
详细信息
This paper discusses the architecture and performance of a prototype switch for interconnecting IBM RISC System/6000 workstations. The paper describes the interconnection architecture and performance on a cluster of four IBM RISC System 6000 model 340 workstations. It also describes the driver level software interface to the switch and the features incorporated to minimize communication overhead. The performance measurements cover communication latency and bandwidth. In addition, performance measurements of Express, a popular parallel-programming interface, are provided.< >
In programming massively parallel computers, it is often necessary to have sets of processes cooperate in performing certain computations and communications. Most run-time libraries require that such sets of processes...
详细信息
In programming massively parallel computers, it is often necessary to have sets of processes cooperate in performing certain computations and communications. Most run-time libraries require that such sets of processes be explicitly specified in the program. In the Venus run-time communication library however, a Process Group abstraction is used to enable implicit coordination of and communication over dynamically determined sets of processes. The Process Groups mechanism in Venus offers an object-oriented approach for handling sets of processes and enhances the debugging and monitoring of programs. The authors describe the Process Groups mechanism in Venus, illustrate its use on the class of N-body problems, and outline some of the data structures and algorithms used to implement this mechanism in Venus.< >
The author has implemented a set of computational physics codes on a network of IBM RS/6000 workstations used as a distributed parallel computer. He compares the performance of the codes on this network, using both st...
详细信息
The author has implemented a set of computational physics codes on a network of IBM RS/6000 workstations used as a distributed parallel computer. He compares the performance of the codes on this network, using both standard Ethernet connections and a fast prototype switch, and also on the nCUBE/2, a MIMD parallel computer. The algorithms used range from simple, local, and regular to complex, non-local, and irregular. He describes his experiences with the hardware, software and parallel languages used, and discusses ideas for making distributed parallel computing on workstation networks more easily usable for computational physicists.< >
The maximum clique problem is to find the maximum complete subgraph of a given graph G. A computation model for large-scale maximum clique problems is proposed and was tested. A parallel algorithm based on the maximum...
详细信息
ISBN:
(纸本)0780312813
The maximum clique problem is to find the maximum complete subgraph of a given graph G. A computation model for large-scale maximum clique problems is proposed and was tested. A parallel algorithm based on the maximum neural network which resembles the winner-takes-all circuit is introduced which solves large-scale problems in reasonable computation time that the best known algorithms cannot solve. The maximum clique problem is first formulated as an unconstrained quadratic zero-one programming problem and is solved by minimizing the weight summation over the same partition in a newly constructed graph.< >
The parallel implementation of the revised simplex algorithm (RSA) using eta-factorization holds the promise of significant improvement in the execution time by virtue of the existence of a high degree of parallelism ...
详细信息
The parallel implementation of the revised simplex algorithm (RSA) using eta-factorization holds the promise of significant improvement in the execution time by virtue of the existence of a high degree of parallelism in the computation within an iteration of the algorithm. However, the scheme employed to partition key data structures in a distributed memory parallel processor has a great impact on the achievable performance. The paper explores the trade-offs between block-row and block-column partitioning schemes for the matrix of constraint coefficients vis-a-vis the communication overheads and granularity of parallel computations. The results of an approximate analysis of the compute-communication balance are compared with measurements from practical implementation of the partitioning schemes on C-DAC's PARAM 8000 distributed memory parallel processor.< >
The local-network-based computer system, in which some workstations are connected is coming into practical use. Software development is, however, very difficult for end-users because the system has complicated problem...
详细信息
The local-network-based computer system, in which some workstations are connected is coming into practical use. Software development is, however, very difficult for end-users because the system has complicated problems such as load balancing, communication among processes on different workstations and so on. The authors propose a C-specific parallelizing compiler to solve these problems. The compiler adopts function call parallelization, and takes less communication overhead than DO loop parallelization.< >
We present a parallel algorithm for solving the dual transshipment problem. The traditional dual simplex method does not offer much scope for parallelization, because it moves from one basic feasible solution to anoth...
详细信息
We present a parallel algorithm for solving the dual transshipment problem. The traditional dual simplex method does not offer much scope for parallelization, because it moves from one basic feasible solution to another, performing one pivot operation at a time. We present a new method called modified network dual simplex method which uses concurrent pivots. This departure from the traditional LP approach raises several issues such as the need to convert a non-basic feasible solution to a basic feasible solution. We present our strategies to handle these issues as well as the corresponding parallel algorithms. We also present results of testing this algorithm on large graphs to solve the integrated layout compaction and wire balancing problem.< >
暂无评论