We present work-optimal PRAM algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet. For a string of length n, the depth of the compression algorithm is O(log^2 n), and the depth of the corresponding decompression algorithm is O(log n). These appear to be the first polylogarithmic-time work-optimal parallel algorithms for any standard lossless compression scheme. The algorithms for the individual stages of compression and decompression may also be of independent interest: (1) a novel O(log n)-time, O(n)-work PRAM algorithm for Huffman decoding; (2) original insights into the stages of the BW compression and decompression problems, bringing out parallelism that was not readily apparent, allowing them to be mapped to elementary parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i) prefix-sums problems with an appropriately defined associative binary operator for several stages, and (ii) list ranking for the final stage of decompression. Follow-up empirical work suggests potential for considerable practical speedups on a PRAM-driven many-core architecture, against a backdrop of negative contemporary results on common commercial platforms. (C) 2013 Elsevier B.V. All rights reserved.
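As a rough illustration of where prefix sums and list ranking could enter Burrows-Wheeler decompression, the following is a minimal sequential sketch, not the PRAM algorithm of the abstract above. The function names `bwt_encode` and `bwt_decode`, the `$` sentinel, and the naive rotation sort are assumptions made for this example only. The exclusive prefix sum over symbol counts and the per-symbol running ranks are the steps a parallel prefix-sums routine would presumably replace, and the final pointer-following loop is the step that list ranking would presumably replace.

```python
from itertools import accumulate


def bwt_encode(s, sentinel="$"):
    """Naive forward transform via sorted rotations (illustration only).

    Assumes the sentinel does not occur in s and sorts before every symbol of s.
    """
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def bwt_decode(last):
    """Invert the transform using the last-to-first (LF) mapping."""
    alphabet = sorted(set(last))
    # Histogram followed by an exclusive prefix sum:
    # first[c] is the index of the first sorted row that starts with c.
    counts = [last.count(c) for c in alphabet]
    starts = [0] + list(accumulate(counts))[:-1]
    first = dict(zip(alphabet, starts))
    # Running rank of each symbol occurrence (a per-symbol prefix sum).
    seen = {c: 0 for c in alphabet}
    lf = []
    for c in last:
        lf.append(first[c] + seen[c])
        seen[c] += 1
    # Follow the LF links starting at row 0, the row that begins with the
    # sentinel. This yields the text back to front; a parallel version
    # would rank this linked list instead of walking it.
    out = []
    row = 0
    for _ in range(len(last) - 1):
        out.append(last[row])
        row = lf[row]
    return "".join(reversed(out))


if __name__ == "__main__":
    encoded = bwt_encode("banana")
    print(encoded, bwt_decode(encoded))
```

The constant alphabet assumed in the abstract is what keeps the histogram and per-symbol ranks linear in n in this sketch.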
We consider two problems pertaining to P4-comparability graphs, namely, the problem of recognizing whether a simple undirected graph is a P4-comparability graph and the problem of producing an acyclic P4-transitive orientation of such a graph. Sequential algorithms for these problems have been presented by Hoang and Reed and very recently by Raschle and Simon, and by Nikolopoulos and Palios. In this paper, we establish properties of P4-comparability graphs which allow us to describe parallel algorithms for the recognition and orientation problems on this class of graphs; for a graph on n vertices and m edges, our algorithms run in O(nm) time and require O(nm/log n) processors on the CREW PRAM model. Since the currently fastest sequential algorithms for these problems run in O(nm) time, our algorithms are cost-efficient; moreover, to the best of our knowledge, this is the first attempt to introduce parallelization in problems involving P4-comparability graphs. Our approach relies on the parallel computation and proper orientation of the P4-components of the input graph. (C) 2003 Elsevier Inc. All rights reserved.
The nearest neighbor search problem in general dimensions finds application in computational geometry, computational statistics, pattern recognition, and machine learning. Although there is a significant body of work on theory and algorithms, surprisingly little work has been done on algorithms for high-end computing platforms, and no open source library exists that can scale efficiently to thousands of cores. In this paper, we present algorithms and a library built on top of the message passing interface (MPI) and OpenMP that scale nearest neighbor searches to hundreds of thousands of cores for arbitrary-dimensional datasets. The library supports both exact and approximate nearest neighbor searches. The latter is based on iterative, randomized, and greedy KD-tree (k-dimensional tree) searches. We describe novel algorithms for the construction of the KD-tree, give complexity analysis, and provide experimental evidence for the scalability of the method. In our largest runs, we were able to perform an all-neighbors query search on a 13 TB synthetic dataset of 0.8 billion points in 2,048 dimensions on the 131K cores of Oak Ridge's XK6 "Jaguar" system. These results represent an improvement of several orders of magnitude over current state-of-the-art methods. We also apply our method to nonsynthetic data from machine learning data repositories. For example, we perform an all-nearest-neighbors search on a variant of the "MNIST" handwritten digit dataset with 8 million points in 784 dimensions on 16,384 cores of the "Stampede" system at the Texas Advanced Computing Center, achieving less than one second per RKDT iteration.
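The quoted dataset size can be sanity-checked with a short calculation. The sketch below assumes decimal terabytes and tries two storage widths per coordinate; neither assumption is stated in the abstract, and the helper name `dataset_size_tb` is made up for this example. Under those assumptions, 8-byte values give about 13.1 TB, which is consistent with the quoted 13 TB figure.

```python
def dataset_size_tb(num_points, dims, bytes_per_value):
    """Raw size of a dense point set in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_points * dims * bytes_per_value / 1e12


# 0.8 billion points in 2,048 dimensions, as quoted in the abstract.
# The bytes-per-value figures are assumptions, not taken from the source.
size_double = dataset_size_tb(800_000_000, 2048, 8)
size_single = dataset_size_tb(800_000_000, 2048, 4)

print(f"{size_double:.1f} TB with 8-byte values")
print(f"{size_single:.1f} TB with 4-byte values")
```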
We present parallel algorithms for computation of time-slot assignments in time-division multiplex (TDM) switching systems. The algorithms apply to a general class of TDM switching systems called hierarchical switching systems (HSS), which have a three-stage switching structure. The algorithms are based on modeling the time-slot assignment problem as a network-flow problem. Previous algorithms for finding an optimal time-slot assignment in these switching systems are inherently sequential and no parallel algorithms are known for this problem. If M is the number of users of the switching system, N is the switch size, and L is the length of an optimal time-slot assignment, the best-known sequential TSA algorithm runs in O(M^2 · min(N, sqrt(M)) · min(L, M^2)) time. We first describe an algorithm using L/2 processors with running time O(M^3 log L) on a PRAM model of computation. We then generalize it to P ≤ L/2 processors, with running time O(M^3 log P + M^2 · min(N, sqrt(M)) · min(L/P, M^2)). An efficient implementation of the algorithm on a hypercube multiprocessor with P processors has the same time complexity. A massively parallel version of the algorithm runs in O(M^2 log M log L) time on ML/2 processors. Finally, we discuss how the above algorithms can be applied to the class of SS/TDMA switching systems.
Relational Coarsest Partition Problems (RCPPs) play a vital role in verifying concurrent systems. It is known that RCPPs are P-complete and hence it may not be possible to design polylog time parallel algorithms for these problems. In this paper, we present two efficient parallel algorithms for RCPP, where the associated labeled transition system is assumed to have m transitions and n states. The first algorithm runs in O(n^(1+epsilon)) time using m/n^epsilon CREW PRAM processors, for any fixed epsilon < 1. This algorithm is analogous to and optimal with respect to the sequential algorithm of Kanellakis and Smolka. The second algorithm runs in O(n log n) time using m/n CREW PRAM processors. This algorithm is analogous to and nearly optimal with respect to the sequential algorithm of Paige and Tarjan.
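To make the problem statement concrete, here is a minimal sequential sketch of partition refinement on a labeled transition system. It is a naive signature-based refinement written for illustration, not the Kanellakis-Smolka or Paige-Tarjan algorithm and not either of the parallel algorithms in the abstract above. The function name `coarsest_partition` and the input encoding (a set of `(source, label, target)` triples) are assumptions of this example.

```python
from collections import defaultdict


def coarsest_partition(states, transitions, initial_blocks):
    """Refine initial_blocks until states in a block are indistinguishable.

    Two states stay together only if, for every label, they can reach
    the same set of blocks. Returns a sorted list of sorted blocks.
    """
    outgoing = defaultdict(list)
    for source, label, target in transitions:
        outgoing[source].append((label, target))

    block_of = {}
    for index, block in enumerate(initial_blocks):
        for s in block:
            block_of[s] = index

    while True:
        # A state's signature is its current block plus the set of
        # (label, target block) pairs it can reach in one step.
        ids = {}
        refined = {}
        for s in states:
            signature = (
                block_of[s],
                frozenset((a, block_of[t]) for a, t in outgoing[s]),
            )
            refined[s] = ids.setdefault(signature, len(ids))
        # Each round only splits blocks, so an unchanged block count
        # means the partition is stable.
        if len(ids) == len(set(block_of.values())):
            break
        block_of = refined

    blocks = defaultdict(list)
    for s in states:
        blocks[block_of[s]].append(s)
    return sorted(sorted(b) for b in blocks.values())


if __name__ == "__main__":
    states = [0, 1, 2, 3]
    transitions = {(0, "a", 2), (1, "a", 3)}
    print(coarsest_partition(states, transitions, [states]))
```

Each round here touches every transition, so the sketch makes no attempt to match the time bounds quoted in the abstract.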
This paper surveys recent progress in the development of parallel algorithms for solving sparse linear systems on computer architectures having multiple processors. Attention is focused on direct methods for solving sparse symmetric positive definite systems, specifically by Cholesky factorization. Recent progress on parallel algorithms is surveyed for all phases of the solution process, including ordering, symbolic factorization, numeric factorization, and triangular solution.
We describe the first parallel algorithm with optimal speedup for constructing minimum-width tree decompositions of graphs of bounded treewidth. On n-vertex input graphs, the algorithm works in O((log n)^2) time using O(n) operations on the EREW PRAM. We also give faster parallel algorithms with optimal speedup for the problem of deciding whether the treewidth of an input graph is bounded by a given constant and for a variety of problems on graphs of bounded treewidth, including all decision problems expressible in monadic second-order logic. On n-vertex input graphs, the algorithms use O(n) operations together with O(log n log* n) time on the EREW PRAM, or O(log n) time on the CRCW PRAM.
Parallel algorithms for planar graph isomorphism and several related problems are presented. Two models of parallel computation are considered: the CREW-PRAM model and the two-dimensional array of processors. The results include O(sqrt(n))-time mesh algorithms for finding a good separating cycle and the triconnected components of a planar graph, and for solving the single-function coarsest partitioning problem.
This paper presents parallel algorithms for priority queue operations on a p-processor EREW-PRAM. The algorithms are based on a new data structure, the Min-path Heap (MH), which is obtained as an extension of the traditional binary-heap organization. Using an MH, it is shown that insertion of a new item or deletion of the smallest item from a priority queue of n elements can be performed in O(log n / p + log log n) parallel time, while construction of an MH from a set of n items takes O(n/p + log n) time. The given algorithms for insertion and deletion achieve the best possible running time for any number of processors p, with p in O(log n / log log n), while the MH construction algorithm employs up to Theta(n / log n) processors optimally. The paper ends with a brief discussion of the applicability of MHs to the development of efficient parallel algorithms for some important combinatorial problems.
Perceptual grouping is a key intermediate-level vision problem. Parallel solutions to this problem are characterized by uneven distribution of symbolic features among the processors, unbalanced workload, and irregular interprocessor data dependency caused by the input image. In this paper, we propose two load-balancing techniques for parallelizing perceptual grouping on distributed-memory machines. By using an initial workload estimate, we first partition the computations to distribute the workload across the processors. In addition, we asynchronously perform ongoing task migrations to adapt to the unbalanced workload, which may evolve differently from the initial estimate. We also discuss two strategies to manage the irregular interprocessor data dependency. To illustrate our ideas, perceptual grouping steps used in an integrated vision system for building detection are used as examples. Our experimental results show that, given 8K extracted line segments from a 1K x 1K image, both the line and junction grouping steps can be completed in 0.644 s on a 32-node SP2 and in 0.585 s on a 32-node T3D. For the same grouping steps, a serial implementation requires 10.550 s and 10.023 s on a single node of SP2 and T3D, respectively. The implementations were performed using the message passing interface standard and are portable to other high performance computing platforms. (C) 1998 Academic Press.
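The timings quoted above imply speedup and parallel efficiency figures that the abstract does not state explicitly. The short calculation below derives them, assuming the usual definitions (speedup = serial time / parallel time, efficiency = speedup / node count); the helper name `speedup_and_efficiency` is made up for this example.

```python
def speedup_and_efficiency(serial_s, parallel_s, nodes):
    """Return (speedup, efficiency) from wall-clock times in seconds."""
    speedup = serial_s / parallel_s
    return speedup, speedup / nodes


# Timings quoted in the abstract for 32-node runs.
sp2 = speedup_and_efficiency(10.550, 0.644, 32)
t3d = speedup_and_efficiency(10.023, 0.585, 32)

for name, (speedup, efficiency) in (("SP2", sp2), ("T3D", t3d)):
    print(f"{name}: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Under these definitions both machines come out at a speedup of roughly 16 to 17 on 32 nodes, that is, a little over 50% efficiency.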