We present parallelalgorithms for computing all pair shortest paths in directed graphs. Our algorithm has time complexity O(f(n)/p + I(n)log n) on the PRAM using p processors, where I(n) is log n on the EREW PRAM, lo...
详细信息
ISBN:
(纸本)089791483X
We present parallelalgorithms for computing all pair shortest paths in directed graphs. Our algorithm has time complexity O(f(n)/p + I(n)log n) on the PRAM using p processors, where I(n) is log n on the EREW PRAM, log log n on the CRCW PRAM, f(n) is o(n3). On the randomized CRCW PRAM we are able to achieve time complexity O(n3/p + log n) using p processors.
This paper presents a method for mapping binary hypercube-algorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria use...
详细信息
ISBN:
(纸本)089791483X
This paper presents a method for mapping binary hypercube-algorithms onto lower dimensional meshes and analyzes this method in a model derived from the architecture of modern mesh machines. We outline the criteria used to evaluate graph embeddings for mapping supercomputer communication networks. The measured sorting rate of more than 2x106 keys per second on an iWarp torus with just 64 processors shows the excellent absolute performance of our approach.
We present an algorithm that computers a minimum spanning tree (MST) of an undirected weighted graph G = (V, E) of n = |V| vertices and m = |E| edges on an EREW PRAM in O(log3/2 n) time using n + m processors. This re...
详细信息
ISBN:
(纸本)089791483X
We present an algorithm that computers a minimum spanning tree (MST) of an undirected weighted graph G = (V, E) of n = |V| vertices and m = |E| edges on an EREW PRAM in O(log3/2 n) time using n + m processors. This represents a substantial improvement in the running time over the previous results for this problem using at the same time the weakest of the PRAM models. It also implies the existence of algorithms having the same complexity bounds for the EREW PRAM, for connectivity, ear decomposition, biconnectivity, strong orientation, st-numbering and Euler tours problems.
This paper addresses fast parallel methods for the computation of the Radon (Hough) Transform. The Radon Transform (RT) of an image is a set of projections of the image taken at different angles. Its computation is ex...
详细信息
ISBN:
(纸本)089791483X
This paper addresses fast parallel methods for the computation of the Radon (Hough) Transform. The Radon Transform (RT) of an image is a set of projections of the image taken at different angles. Its computation is extremely important in image processing and computer vision, for problems such as pattern recognition and reconstruction of CAT scan images. We present a unique new method for combining partial results which allows us to construct a parallel algorithm which computes an approximation to the RT in O(logN) time for an N by N input array, using O(N2) processors. This is a lower processor-time product than all previous techniques. The method appears to be quite simple and practical. In addition, this algorithm appears to be quite well-suited to implementation on strict SIMD array-type computers. We discuss SIMD-parallel mappings of the algorithm for mesh, butterfly, and hypercube architectures.
Using the ideas from the NC algorithm of Ben-Or and Tiwari [Journal of Complexity 6, 417-442, 1990], we develop a practical parallel algorithm that approximates the roots of a polynomial whose roots are all real. A ne...
详细信息
ISBN:
(纸本)089791483X
Using the ideas from the NC algorithm of Ben-Or and Tiwari [Journal of Complexity 6, 417-442, 1990], we develop a practical parallel algorithm that approximates the roots of a polynomial whose roots are all real. A new elementary proof of correctness is provided and the complexity of the algorithm is analyzed. A particular implementation of the algorithm that performs well in practice is described and its run-time behaviour is compared with the analytical predictions. Its performance is also compared with that of the root-finding algorithm in the PARI package.
Brent's scheduling principle provides a general simulation scheme when fewer processors are available than specified by the fastest parallel algorithm. Such a scheme preserves the actual number of executed operati...
详细信息
ISBN:
(纸本)089791483X
Brent's scheduling principle provides a general simulation scheme when fewer processors are available than specified by the fastest parallel algorithm. Such a scheme preserves the actual number of executed operations, and when applicable, it provides a processor balancing technique that significantly reduces the work, expressed as the number of executable operators. In this paper we discuss a new technique, called supereffective slow-down, that yields quite fast an algorithm with work significantly smaller than that of the fastest algorithm for the same problem. This technique can be viewed as a work-preserving acceleration of an existing recursive sequential algorithm for the considered problem. The presented examples include the computation of path algebras in graphs and digraphs and various computations in linear albegra. Some of the new algorithms may have practical value;for instance, we substantially improve the performance of the known parallelalgorithms for triangular linear systems of equations.
A straight-line grid embedding of a planar graph is a drawing of the graph on a plane where the vertices are located at grid points and the edges are represented by nonintersecting segments of straight lines joining t...
详细信息
ISBN:
(纸本)089791483X
A straight-line grid embedding of a planar graph is a drawing of the graph on a plane where the vertices are located at grid points and the edges are represented by nonintersecting segments of straight lines joining their incident vertices. Given an n-vertex planar graph with n ≥ 3, a straight-line embedding on a grid of size (n - 2) × (n - 2) can be computed deterministically in O(log n log log n) time with O(n log log n) work on a parallel random access machine. If randomization is used, the complexity is improved to O(log n) expected time with the same work bound. The parallel random access machine used by these algorithms allows concurrent reads and concurrent writes of the shared memory;in case of a write conflict, an arbitrary processor succeeds.
The scheduling of a set of interdependent tasks on a specific family of parallel/distributed architectures is a basic problem arising in various areas. We consider tasks whose interdependency relationship has the form...
详细信息
ISBN:
(纸本)089791483X
The scheduling of a set of interdependent tasks on a specific family of parallel/distributed architectures is a basic problem arising in various areas. We consider tasks whose interdependency relationship has the form of a tree and architectures that have constant-dimensional grid-like structures. The total size of the task tree is known beforehand but no knowledge is assumed about the shape of the tree, implying that the scheduling must be dynamic. Important problems such as divide and conquer (D&C) satisfy this condition. On the other hand, many practical parallel/distributed architectures, such as linear array and mesh, fall into the class of constant-dimensional architectures. Here we give an efficient deterministic distributed scheme for mapping tasks to processors and for routing results in the architecture. Our scheme finishes the tasks in optimal time. If the maximum degree and the height of the tree can be estimated beforehand, the optimal time processor product can also be achieved. This paper focuses on optimizing execution time on the smallest possible number of processors in a model that considers both task dependency and communication configuration, in addition to computational cost.
We present two asymptotically optimal Θ(n) time algorithms for labeling the connected components of a gray-scale image on a mesh-connected computer. We assume that the input is an n × n gray-scale image mapped o...
详细信息
ISBN:
(纸本)089791483X
We present two asymptotically optimal Θ(n) time algorithms for labeling the connected components of a gray-scale image on a mesh-connected computer. We assume that the input is an n × n gray-scale image mapped one pixel per processor onto an n × n mesh-connected computer. Our algorithms label the components so that every component is connected, the maximum difference in the gray-scale values of the pixels within any component does not exceed a given value, and no component can be merged with a neighboring component. The first algorithm is based on a divide-and-conquer approach. Although it is simple, this algorithm has the potential drawback of possibly assigning two adjacent pixels with the same gray-scale value to different components. The second algorithm avoids this potential drawback, and exploits the ability of a mesh-connected computer to efficiently determine a maximal independent set of a planar graph.
A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset ...
详细信息
ISBN:
(纸本)089791483X
A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in parallel. The user model of a subset barrier is straight forward;a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. This paper identifies two general communication models for private-memory parallel systems: the bounded buffer broadcast model and the anonymous destination message passing model and presents algorithms for barrier synchronization in the terms of these models. The models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture and can be supported efficiently by a modern private memory parallel system. The anonymous destination message passing model is the most attractive. The time complexity to synchronize over a uni-directional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. The algorithms have been implemented on iWarp, a private-memory parallel system and are now in daily use. The paper concludes with timing measurements obtained on a 64-node system.
暂无评论