This paper provides a new approach to labeling the connected components of an n × n image on a scan line array processor (comprised of n processing elements). Variations of this approach yield an algorithm guaran...
详细信息
ISBN:
(纸本)9780897917179
This paper provides a new approach to labeling the connected components of an n × n image on a scan line array processor (comprised of n processing elements). Variations of this approach yield an algorithm guaranteed to complete in o(n lg n) time as well as algorithms likely to approach O(n) time for all or most images. The best previous solutions require using a more complicated architecture or require Ω(n lg n) time. We also show that on a restricted version of the architecture, any algorithm requires Ω(n lg n) time in the worst case.
This paper experimentally validates performance related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 processors). Our work c...
详细信息
This paper experimentally validates performance related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 processors). Our work consists of three parts. First, there is an evaluation part in which we investigate whether the models correctly predict the execution time of an algorithm implementation. Unlike previous work, which mostly demonstrated a close match between the measured and predicted running times, this paper shows that there are situations in which the models do not precisely predict the actual execution time of an algorithm implementation. Second, there is a comparison part in which the models are contrasted with each other in order to determine which model induces the fastest algorithms. Finally, there is an efficiency validation part in which the performance of the model derived algorithms are compared with the performance of highly optimized library routines to show the effectiveness of deriving fast algorithms through the formalisms of the models.
We present a parallel solution to the Maximum-Flow (Max-Flow) problem, suitable for a modern many-core architecture. We show that by starting from a PRAM algorithm, following an established "programmer's work...
详细信息
ISBN:
(纸本)9781450307437
We present a parallel solution to the Maximum-Flow (Max-Flow) problem, suitable for a modern many-core architecture. We show that by starting from a PRAM algorithm, following an established "programmer's workflow" and targeting XMT, a PRAM-inspired many-core architecture, we achieve significantly higher speed-ups than previous approaches. Comparison with the fastest known serial max-flow implementation on a modern CPU demonstrates for the first time potential for orders-of-magnitude performance improvement for Max-Flow. Using XMT, the PRAM Max-Flow algorithm is also much easier to program than for other parallel platforms, contributing a powerful example toward dual validation of both PRAM algorithmics and XMT.
The Bulk-Synchronous parallel (BSP) model was proposed by Valiant as a model for general-purpose parallel computation. The objective of the model is to allow the design of parallel programs that can be executed effici...
详细信息
The Bulk-Synchronous parallel (BSP) model was proposed by Valiant as a model for general-purpose parallel computation. The objective of the model is to allow the design of parallel programs that can be executed efficiently on a variety of architectures. While many theoretical arguments in support of the BSP model have been presented, the degree to which the model can be efficiently utilized on existing parallel machines remains unclear. To explore this question, we implemented s small library of BSP functions, called the Green BSP library, on several parallel platforms. We also created a number of parallel applications based on this library. Here, we report on the performance of six of these applications on three different parallel platforms. Our preliminary results suggest that the BSP model can be used to develop efficient and portable programs for a range of machines and applications.
We give an efficient parallel algorithm for constructing the arrangement of n line segments in the plane, i.e., the planar graph 'determined by the segment endpoints and intersections. Our algorithm is efficient r...
详细信息
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree), the SLD of) is a binary dendrogram that summarizes the n =...
详细信息
ISBN:
(纸本)9798400704161
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree), the SLD of) is a binary dendrogram that summarizes the n = 1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Omega(n log n) work where n = vertical bar T vertical bar. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallelalgorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n log h) work and O(log(2) n log(2) h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(n log h) work and O(h log n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly-efficient Union-Find algorithm typically used to compute SLDs in practice.
We design a parallel algorithm for the Constrained Shortest Path (CSP) problem. The CSP problem is known to be NP-hard and there exists a pseudo-polynomial time sequential algorithm that solves it. To design the paral...
详细信息
In this paper we explore a simple and general approach for developing parallelalgorithms that lead to good cache complexity on parallel machines with private or shared caches The approach is to design nested-parallel...
详细信息
ISBN:
(纸本)9781450300797
In this paper we explore a simple and general approach for developing parallelalgorithms that lead to good cache complexity on parallel machines with private or shared caches The approach is to design nested-parallelalgorithms that have low depth (span. critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators Using known mappings. our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
A parallel real-time complexity theory was presented to study infinity hierarchy for parallel real-time systems. Set of timed ω-languages was closed under intersection, union, complement, concatenation and Kleene clo...
详细信息
A parallel real-time complexity theory was presented to study infinity hierarchy for parallel real-time systems. Set of timed ω-languages was closed under intersection, union, complement, concatenation and Kleene closure. The proposed real-time algorithm consisted of an input tape containing timed ω-word and an output tape containing symbols from alphabet δ in finite control. The analysis suggested that the hierarchy of real-time algorithms with respect to number of processors was infinite.
An earlier parallel list ranking algorithm performs well for problem sizes N that are extremely large in comparison to the number of PUs P. However, no existing algorithm gives good performance for reasonable loads. W...
详细信息
ISBN:
(纸本)9780897918909
An earlier parallel list ranking algorithm performs well for problem sizes N that are extremely large in comparison to the number of PUs P. However, no existing algorithm gives good performance for reasonable loads. We present a novel family of algorithms, that achieve a better trade-off between the number of start-ups and the routing volume. We have implemented them on an Intel Paragon, and they turn out to considerably outperform all earlier algorithms: with P = 2 the sequential algorithm is already beaten for N = 25,000;for P = 100 and N = 107, the speed-up is 21, and for N = 108 it even reaches 30. A modification of one of our algorithms solves a theoretical question: we show that on one-dimensional processor arrays, list ranking can be solved with a number of steps equal to the diameter of the network.
暂无评论