We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups over the only published optimal (linear-time) serial implementation of a ...
详细信息
ISBN:
(纸本)9781450312134
We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups over the only published optimal (linear-time) serial implementation of a triconnected components algorithm running on a modern CPU. This is accomplished on the PRAM-inspired XMT many-core architecture. To our knowledge, no other parallel implementation of a triconnected components algorithm has been published for any platform. Copyright is held by the author/owner(s).
As parallel machines scale to one million nodes and beyond, it becomes increasingly difficult to build a reliable network that is able to guarantee packet delivery. Eventually large systems will need to employ fault-t...
详细信息
ISBN:
(纸本)9781581135299
As parallel machines scale to one million nodes and beyond, it becomes increasingly difficult to build a reliable network that is able to guarantee packet delivery. Eventually large systems will need to employ fault-tolerant messaging protocols that afford correct execution in the presence of a lossy network. In this paper we present a lightweight protocol that preserves message idempotence and is easy to implement in hardware. We identify the requirements for a correct implementation of the protocol. Experiments are performed in simulation to determine implementation parameters that optimize performance. We find that an aggressive implementation on a fat tree network results in a slowdown of less than 2x compared to buffered wormhole routing on a fault-free network.
We study integrated prefetching and caching in single and parallel disk systems. There exist two very popular approximation algorithms called Aggressive and Conservative for minimizing the total elapsed time in the si...
详细信息
ISBN:
(纸本)9781581136616
We study integrated prefetching and caching in single and parallel disk systems. There exist two very popular approximation algorithms called Aggressive and Conservative for minimizing the total elapsed time in the single disk problem. For D parallel disks, approximation algorithms are known for both the elapsed time and stall time performance measures. In particular, there exists a D-approximation algorithm for the stall time measure that uses D-1 additional memory locations in cache. In the first part of the paper we investigate approximation algorithms for the single disk problem. We give a refined analysis of the Aggressive algorithm, showing that the original analysis was too pessimistic. We prove that our new bound is tight. Additionally we present a new family of prefetching and caching strategies and give algorithms that perform better than Aggressive and Conservative. In the second part of the paper we investigate the problem of minimizing stall time in parallel disk systems. We present a polynomial time algorithm for computing a prefetching/caching schedule whose stall time is bounded by that of an optimal solution. The schedule uses at most 3(D - 1) extra memory locations in cache. This is the first polynomial time algorithm for computing schedules with a minimum stall time. Our algorithm is based on the linear programming approach of [1]. However, in order to achieve minimum stall times, we introduce the new concept of synchronized schedules in which fetches on the D disks are performed completely in parallel.
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implement...
详细信息
ISBN:
(纸本)9780897917179
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implementations. Since large data parallel programs often have several levels of nested barriers, these schemes provide significant speedups in the execution of such programs on MIMD computers. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. However, this scheme increases the code size. The second scheme uses a more expensive integer-tree to support an exponential number of nested barriers without increasing the code size. Using hardware already available on commercial MIMD computers, this scheme can support more than four billion levels of nesting.
We describe a simple algorithm for spectral graph sparsification, based on iterative computations of weighted spanners and uniform sampling. Leveraging the algorithms of Baswana and Sen for computing spanners, we obta...
详细信息
ISBN:
(纸本)9781450328210
We describe a simple algorithm for spectral graph sparsification, based on iterative computations of weighted spanners and uniform sampling. Leveraging the algorithms of Baswana and Sen for computing spanners, we obtain the first distributed spectral sparsification algorithm. We also obtain a parallel algorithm with improved work and time guarantees. Combining this algorithm with the parallel framework of Peng and Spielman for solving symmetric diagonally dominant linear systems, we get a parallel solver which is much closer to being practical and significantly more efficient in terms of the total work.
We describe an efficient parallel implementation of Goldberg's maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows a 'global relabeling' heurist...
详细信息
ISBN:
(纸本)089791483X
We describe an efficient parallel implementation of Goldberg's maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows a 'global relabeling' heuristic to be executed concurrently with the main algorithm. This heuristic is essential for good performance in practice. We present performance results from a Sequent Symmetry for a variety of input distributions. We achieve speed-ups of up to 8.8 with 16 processors, relative to the parallel program with 1 processor (5.8 when compared to our best sequential program). We consider these speed-ups very good and we provide evidence that hardware effects and insufficient parallelism in certain inputs are the main obstacles to achieving better performance.
A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset ...
详细信息
ISBN:
(纸本)089791483X
A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straight forward;a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for private-memory parallel systems: the bounded buffer broadcast model and the anonymous destination message passing model and presents algorithms for barrier synchronization in the terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture and can be supported efficiently by a modern private memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a uni-directional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a private-memory parallel system and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.
The PRAM model of parallel computation is examined with respect to wordsize, the number of bits which can be held in each global memory cell. First, adversary arguments are used to show the incomparability of certain ...
详细信息
In this paper we give efficient parallelalgorithms for a number of problems from computational geometry by using generalized versions of parallel plane sweeping. We illustrate our approach with a number of applicatio...
详细信息
ISBN:
(纸本)0897913701
In this paper we give efficient parallelalgorithms for a number of problems from computational geometry by using generalized versions of parallel plane sweeping. We illustrate our approach with a number of applications, which include general hidden-surface elimination (even if the overlap relation contains cycles), CSG boundary evaluation, computing the contour of a collection of rectangles, and hidden-surface elimination for rectangles. Our algorithms are for the CREW PRAM.
A unified framework for finding efficient permutation routes on parallel networks in an off-line setting is presented. If the underlying graph of a parallel network contains an appropriate 'approximate' produc...
详细信息
ISBN:
(纸本)0897913701
A unified framework for finding efficient permutation routes on parallel networks in an off-line setting is presented. If the underlying graph of a parallel network contains an appropriate 'approximate' product structure then our method guarantees the existence of non-blocking near-optimal permutation routes. The routes in question can be determined in polynomial time. Furthermore, our results are extended to finding permutation routes among the remaining 'live' nodes in a faulty network.
暂无评论