We present an embedding of the complete binary tree with n leaves in the √n × √n mesh, for any n = 22m where m is a positive integer. The embedding has the following properties: at most two tree nodes (one of w...
详细信息
ISBN:
(纸本)089791483X
We present an embedding of the complete binary tree with n leaves in the √n × √n mesh, for any n = 22m where m is a positive integer. The embedding has the following properties: at most two tree nodes (one of which is a leaf) are mapped onto each mesh node, paths of the tree are mapped onto edge-disjoint paths in the mesh (each mesh edge being considered as two anti-parallel directed edges) and the maximum distance from a leaf to the root of the tree is √n + O(log n) mesh steps. This embedding facilitates efficient implementation of many P-RAM algorithms on the mesh, particularly those using the balanced binary tree technique. Such an embedding offers greater flexibility of use and improves the time complexity of these implementations by a constant factor compared with previously described embeddings.
We describe an efficient parallel implementation of Goldberg's maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows a 'global relabeling' heurist...
详细信息
ISBN:
(纸本)089791483X
We describe an efficient parallel implementation of Goldberg's maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows a 'global relabeling' heuristic to be executed concurrently with the main algorithm. This heuristic is essential for good performance in practice. We present performance results from a Sequent Symmetry for a variety of input distributions. We achieve speed-ups of up to 8.8 with 16 processors, relative to the parallel program with 1 processor (5.8 when compared to our best sequential program). We consider these speed-ups very good and we provide evidence that hardware effects and insufficient parallelism in certain inputs are the main obstacles to achieving better performance.
When a leveled hypercube algorithm (one dimension used at a time) is mapped in the straightforward way into a hypercube in which all edges are useable at once, most of the host machine's bandwidth is unused. This ...
详细信息
ISBN:
(纸本)089791483X
When a leveled hypercube algorithm (one dimension used at a time) is mapped in the straightforward way into a hypercube in which all edges are useable at once, most of the host machine's bandwidth is unused. This paper shows how to construct embeddings which better utilize the host's edges. In particular, I show how to map an n-dimensional leveled hypercube algorithm into an n-dimensional host hypercube so that the communication throughput of every guest edge is Θ ((n/log n)log(6)2) = ω (n0.386) times the communication throughput of a host edge. Furthermore, the routing can be done on edge-disjoint paths of length at most n. This result can be applied to other algorithms that are run on hypercubes. For example, if an algorithm runs on a mesh with a axes each of length 2l, but uses only one axis at a time, then it can be embedded in an la-dimensional hypercube so that each mesh edge has throughput Θ (l(a/log a)log(6)2).
Nested dissection is a method for solving large sparse systems of linear equations. The method is inherently parallelizable and efficient algorithms for parallel nested dissection (PND) exist for various parallel arch...
详细信息
ISBN:
(纸本)089791483X
Nested dissection is a method for solving large sparse systems of linear equations. The method is inherently parallelizable and efficient algorithms for parallel nested dissection (PND) exist for various parallelarchitectures. Birkhoff and George showed that for a sparse grid graph, a parallel algorithm with n processors takes O(√n) parallel time. The algorithm in works on a PRAM model, using O(n3/2) processors and taking O((log n)3) time for a general class of matrices which have have O(√n) size separators. Both these algorithms use a logarithmic factor of space larger than the input matrix. Our results are as follows: (1) We describe a FAST PND algorithm for the PRAM model, which reduces the time bound from O((log n)3) to O((log n)2), without significantly increasing the processor bound. (2) We implement a PND algorithm that works on a mesh-connected processor array, using O(n) processors and taking O(√n) time for the important and more general case where the graph of the matrix is of constant degree and separator size √n (as required for many 2D PDE applications). (3) We describe a COMPACT PND algorithm which reduces the space bounds to a constant factor of the input matrix, without significant increase in time or processor bounds. In addition to these practical results, we show that it is theoretically possible to achieve tighter bounds on the amount of work performed (and thus on time and processor bounds), using known theoretical results about parallel matrix multiplication. Finally, we show that our algorithms generalize to solve all pairs minimal cost path problems within the same complexity.
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distribut...
详细信息
ISBN:
(纸本)089791483X
Flashsort [RV83,86] and Samplesort [HC83] are related parallel sorting algorithms proposed in the literature. Both utilize a sophisticated randomized sampling technique to form a splitter set, but Samplesort distributes the splitter set to each processor while Flashsort uses splitter-directed routing. In this paper we present B-Flashsort, a new batched-routing variant of Flashsort designed to sort N > P values using P processors connected in a d-dimensional mesh and using constant space in addition to the input and output. The key advantage of the Flashsort approach over Samplesort is a decrease in memory requirements, by avoiding the broadcast of the splitter set to all processors. The practical advantage of B-Flashsort over Flashsort is that it replaces pipelined splitter-directed routing with a set of synchronous local communications and bounds recursion, while still being demonstrably efficient. The performance of B-Flashsort and Samplesort is compared using a parameterized analytic model in the style of [BLM+91] to show that on a d-dimensional toroidal mesh B-Flashsort improves on Samplesort when (N/P) 1log P + c2dP1/d + c3), for machine-dependent parameters c1, c2, and c3. Empirical confirmation of the analytical model is obtained through implementations on a MasPar MP-1 of Samplesort and two B-Flashsort variants.
In this paper, we discuss the use of multipath multistage interconnection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. I...
详细信息
ISBN:
(纸本)089791483X
In this paper, we discuss the use of multipath multistage interconnection networks (MINs) in the design of a fault-tolerant parallel computer. Multipath networks have multiple paths between any input and any output. In particular, we examine networks with either the property of expansion or maximal-fanout. We present an Ω(n 1/2 ) lower time bound for a worst-case permutation on deterministic maximal-fanout networks. We further show how a randomized approach to maximal-fanout avoids the regularity from which this worst case arises. Unlike most previous work, we examine systems which can tolerate node failure and isolation. We describe mechanisms for fault identification and system reconfiguration. In reconfiguring a faulty system, a naive approach is to preserve processing power by maximizing the number of processing nodes left in operation. However, our results show that the synchronization requirements of applications make it critical to eliminate nodes with poor network connections. We find that a conservative fault-propagation algorithm for reconfiguration, adapted from work by Leighton and Maggs [LM92], performs well for all of our multipath networks. We also address some practical issues of network construction and present performance simulations based upon the MIT Transit architecture [DcH90]. Simulation results for 1024 node systems demonstrate that multipath networks, reconfigured with our fault-propagation algorithm, perform well not only in theory, but also in practice. In fact, our systems suffer only a small decrease in performance from network faults;the degradation is linear in the percentage of network failure.
Given a pattern string, we describe a way to preprocess it. We design a constant-time optimal parallel algorithm for finding all occurrences of the (prepro-cessed) pattern in any given text.
ISBN:
(纸本)0897915119
Given a pattern string, we describe a way to preprocess it. We design a constant-time optimal parallel algorithm for finding all occurrences of the (prepro-cessed) pattern in any given text.
The strong link between matroids and matching is used to extend the ideas that resulted in the design of Random NC algorithms for matching to obtain RNC algorithms for the matroid union, intersection and matching prob...
详细信息
ISBN:
(纸本)089791466X
The strong link between matroids and matching is used to extend the ideas that resulted in the design of Random NC algorithms for matching to obtain RNC algorithms for the matroid union, intersection and matching problems, for linearly representable matroids. As a consequence, we obtain RNC algorithms for the well-known problems of finding an arboresence and a maximum cardinality set of edge-disjoint spanning trees in a graph. The key tools used are linear algebra and randomization.
The question of what the architecture of each node in a general purpose, massively parallel architecture (MPA) should be is framed in concrete terms by describing two fundamental problems that must be solved well in a...
详细信息
ISBN:
(纸本)0897915097
The question of what the architecture of each node in a general purpose, massively parallel architecture (MPA) should be is framed in concrete terms by describing two fundamental problems that must be solved well in any general-purpose MPA. From this, the required logical organization of an MPA node is systematically developed, and some details of *T (pronounced start), a concrete architecture designed to these requirements, are presented. *T is a direct descendant of dynamic dataflow architectures, and unifies them with von Neumann architectures. A hand-complied example and some compilation issues are discussed.
暂无评论