We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache ...
详细信息
ISBN:
(纸本)9781450312134
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries in the optimal O(sortP (N) + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and sortP (N) is the parallel I/O complexity of sorting N elements using P processors. Copyright 2012 acm.
Currently, the best known tradeoff between approximation ratio and complexity for the Sparsest Cut problem is achieved by the algorithm in [Sherman, FOCS 2009]: it computes O(root(log n)/epsilon)-approximation using O...
详细信息
ISBN:
(纸本)9798400704161
Currently, the best known tradeoff between approximation ratio and complexity for the Sparsest Cut problem is achieved by the algorithm in [Sherman, FOCS 2009]: it computes O(root(log n)/epsilon)-approximation using O(n(epsilon) log(O(1))n) maxflows for any epsilon is an element of[Theta(1/log n), Theta(1)]. It works by solving the SDP relaxation of [Arora-Rao-Vazirani, STOC 2004] using the Multiplicative Weights Update algorithm (MW) of [Arora-Kale, Jacm 2016]. To implement one MW step, Sherman approximately solves a multicommodity flow problem using another application of MW. Nested MW steps are solved via a certain "chaining" algorithm that combines results of multiple calls to the maxflow algorithm. We present an alternative approach that avoids solving the multicommodity flow problem and instead computes "violating paths". This simplifies Sherman's algorithm by removing a need for a nested application of MW, and also allows parallelization: we show how to compute O(root(log n)/epsilon)-approximation via O(log(O(1)) n) maxflows using O(n(epsilon)) processors. We also revisit Sherman's chaining algorithm, and present a simpler version together with a new analysis.
In this paper we present a simple parallel sorting algorithm and illustrate two applications. The algorithm (called the (l, m)-merge sort (LMM)) is an extension of the bitonic and odd-even merge sorts. Literature on p...
详细信息
In this paper we present a simple parallel sorting algorithm and illustrate two applications. The algorithm (called the (l, m)-merge sort (LMM)) is an extension of the bitonic and odd-even merge sorts. Literature on parallel sorting is abundant. Many of the algorithms proposed, though being theoretically important, may not perform satisfactorily in practice owing to large constants in their time bounds. The algorithm to be presented in this paper, due partly to its simplicity, results in small constants. We present an implementation for the parallel disk sorting problem. The algorithm is asymptotically optimal (assuming that N is a polynomial in M, where N is the number of records to be sorted and M is the internal memory size). The underlying constant is very small. This algorithm has a better performance than the disk-striped mergesort (DSM) algorithm when the number of disks is large. Our implementation is as simple as that of DSM (requiring no fancy data structures or prefetch techniques). As a second application, we prove that we can get a sparse enumeration sort on the hypercube that is simpler than that of the classical algorithm of Nassimi and Sahni. We also show that Leighton's columnsort algorithm is a special case of LMM.
The primary purpose of Cray Research computer systems is the timely solution of complex problems in science and engineering. A few examples illustrate that the CRAY C90 is currently the world's most powerful tool ...
详细信息
An acyclic set in an undirected graph G = (V, E) is a set A ⊂ V (G) such that the induced graph on A is acyclic. A maximal acyclic set (MAS) has the further property that no vertex can be added to it without creating ...
详细信息
ISBN:
(纸本)9781581138405
An acyclic set in an undirected graph G = (V, E) is a set A ⊂ V (G) such that the induced graph on A is acyclic. A maximal acyclic set (MAS) has the further property that no vertex can be added to it without creating a cycle in the induced graph. We present the first NC algorithm that finds an MAS in any graph, running in time O(log4 n) using O((m2+n)/log n) processors on a EREW PRAM, answering questions raised by Chen and He. Our NC algorithm for finding an MAS in a graph relies on a novel NC algorithm for finding a maximal forest in a hypergraph which may be of independent interest.
As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for i...
详细信息
ISBN:
(纸本)9780897919890
As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for improving the performance of similarity searches through parallel I/O. In this paper, we propose a new scheme which provides good declustering for similarity searching. In particular, it does global declustering as opposed to local declustering, exploits the availability of extra disks and does not limit the partitioning of the data space. Our technique is based upon the Cyclic declustering schemes which were developed for range and partial match queries. We establish, in general, that Cyclic declustering techniques outperform previously proposed techniques.
The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinatorial element has bound...
详细信息
ISBN:
(纸本)9780897917179
The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinatorial element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W + D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit). In this paper, we show how to solve the circuit value update problem efficiently on a P-processor parallel computer. We give a straightforward synchronous, parallel algorithm that runs in O(W/P + D lg P) expected time. Our main contribution, however, is an optimistic, asynchronous, parallel algorithm that runs in O(W/P + D + lg W + lg P) expected time, where W and D are the size and embedded depth, respectively, of the 'volatile' subcircuit, the subcircuit of elements that have inputs which either change or glitch as a result of the update. To our knowledge, our analysis provides the first analytical bounds on the running time of an optimistic algorithm.
The proceedings contain 42 papers. The topics discussed include: on dynamic bin packing for resource allocation in the cloud;on the online fault-tolerant server consolidation problem;on computing maximal independent s...
ISBN:
(纸本)9781450328210
The proceedings contain 42 papers. The topics discussed include: on dynamic bin packing for resource allocation in the cloud;on the online fault-tolerant server consolidation problem;on computing maximal independent sets of hypergraphs in parallel;simple parallel and distributed algorithms for spectral graph sparsification;concurrent data structures for efficient streaming aggregation;provably good scheduling for parallel programs that use data structures through implicit batching;scheduling selfish jobs on multidimensional parallel machines;LP rounding and combinatorial algorithms for minimizing active and busy time;a note on multiprocessor speed scaling with precedence constraints;executing dynamic data-graph computations deterministically using chromatic scheduling;the PCL theorem. transactions cannot be parallel, consistent and live;adaptive integration of hardware and software lock elision techniques;and transaction-friendly condition variables.
parallelalgorithms and dynamic algorithms possess an interesting duality property: compared to sequential algorithms, parallelalgorithms improve run-time while preserving work, while dynamic algorithms improve work ...
详细信息
ISBN:
(纸本)9781450307437
parallelalgorithms and dynamic algorithms possess an interesting duality property: compared to sequential algorithms, parallelalgorithms improve run-time while preserving work, while dynamic algorithms improve work but typically offer no parallelism.. Although they are often considered separately, parallel and dynamic algorithms employ similar design techniques. They both identify parts of the computation that are independent of each other. This suggests that dynamic algorithms could be parallelized to improve work efficiency while preserving fast parallel run-time. In this paper, we parallelize a dynamic algorithm for well-spaced point sets, an important problem related to mesh refinement in computational geometry. Our parallel dynamic algorithm computes a well-spaced superset of a dynamically changing set of points, allowing arbitrary dynamic modifications to the input set. On an EREW PRAM, our algorithm processes batches of k modifications such as insertions and deletions in O(k log Delta) total work and in O(log Delta) parallel time using k processors, where Delta is the geometric spread of the data, while ensuring that the output is always within a constant factor of the optimal size. EREW PRAM model is quite different from actual hardware such as modern multiprocessors. We therefore describe techniques for implementing our algorithm on modern multi-core computers and provide a prototype implementation. Our empirical evaluation shows that our algorithm can be practical, yielding a large degree of parallelism and good speedups.
In the many-core era, the performance of MPI collectives is more dependent on the intra-node communication component. However, the communication algorithms generally inherit from the inter-node version and ignore the ...
详细信息
ISBN:
(纸本)9781450344937
In the many-core era, the performance of MPI collectives is more dependent on the intra-node communication component. However, the communication algorithms generally inherit from the inter-node version and ignore the cache complexity. We propose cache-oblivious algorithms for MPI all to-all operations, in which data blocks are copied into the receive buffers in Morton order to exploit data locality. Experimental results on different many-core architectures show that our cache-oblivious implementations significantly outperform the naive implementations based on shared heap and the highly optimized MPI libraries.
暂无评论