ISBN: 9798400704161 (Print)
Dynamic trees are a well-studied and fundamental building block of dynamic graph algorithms, dating back to the seminal work of Sleator and Tarjan [STOC'81, (1981), pp. 114-122]. The problem is to maintain a tree subject to online edge insertions and deletions while answering queries about the tree, such as the heaviest weight on a path. In the parallel batch-dynamic setting, the goal is to process batches of edge updates work-efficiently in low (polylog n) span. Two work-efficient algorithms are known: batch-parallel Euler Tour Trees by Tseng et al. [ALENEX'19, (2019), pp. 92-106] and parallel Rake-Compress (RC) Trees by Acar et al. [ESA'20, (2020), pp. 2:1-2:23]. Both, however, are randomized and work-efficient only in expectation. Several downstream results that use these data structures (and indeed, to the best of our knowledge, all known work-efficient parallel batch-dynamic graph algorithms) are therefore also randomized. In this work, we give the first deterministic work-efficient solution to the problem. Our algorithm maintains a parallel RC-Tree on n vertices subject to batches of k edge updates deterministically in worst-case O(k log(1 + n/k)) work and O(log n log log k) span on the Common-CRCW PRAM. We also show how to improve the span of the randomized algorithm from O(log n log* n) to O(log n). Lastly, as a result of our new deterministic algorithm, we also derandomize several downstream results that make use of parallel batch-dynamic trees, for which previously the only efficient solutions were randomized.
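To make the batch-dynamic interface concrete, here is a naive sequential baseline (all names illustrative, not from the paper): it supports batch link, batch cut, and connectivity queries by simply rebuilding a union-find structure after every batch, costing O(n + k) work per batch rather than the paper's O(k log(1 + n/k)). It shows what the data structure must support, not how the RC-Tree algorithm achieves it.

```python
# Naive baseline for the batch-dynamic forest interface. A real
# batch-parallel RC-Tree or Euler Tour Tree avoids the full rebuild;
# this sketch only fixes the semantics of batch_link/batch_cut/connected.

class NaiveBatchForest:
    def __init__(self, n):
        self.n = n
        self.edges = set()
        self._rebuild()

    def _rebuild(self):
        # recompute connected components from scratch via union-find
        self.parent = list(range(self.n))
        for u, v in self.edges:
            self._union(u, v)

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv

    def batch_link(self, batch):
        self.edges.update(tuple(sorted(e)) for e in batch)
        self._rebuild()

    def batch_cut(self, batch):
        self.edges.difference_update(tuple(sorted(e)) for e in batch)
        self._rebuild()

    def connected(self, u, v):
        return self._find(u) == self._find(v)
```

For example, after `batch_link([(0, 1), (1, 2)])` the query `connected(0, 2)` is true, and it becomes false again after `batch_cut([(1, 2)])`.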
ISBN: 9780897918091 (Print)
We study the well-known problem of throwing m balls into n bins. If each ball in the sequential game is allowed to select more than one bin, the maximum load of the bins can be exponentially reduced compared to the classical 'balls into bins' game. We consider a static and a dynamic variant of a randomized parallel allocation in which each ball can choose a constant number of bins. All results hold with high probability. In the static case, all m balls arrive at the same time. For m = n we analyze a very simple optimal class of protocols achieving maximum load O((log n / log log n)^{1/r}) if r rounds of communication are allowed. This matches the lower bound of [ACMR95]. Furthermore, we generalize the protocols to the case of m > n balls. An optimal load of O(m/n) can be achieved using log log n / log(m/n) rounds of communication. Hence, for m = n log log n / log log log n balls this slack allows us to hide the amount of communication. In the classical 'balls into bins' game this optimal distribution can only be achieved for m = n log n. In the dynamic variant, n of the m balls arrive at the same time and have to be allocated. Each of these initial n balls has a list of m/n successor balls. As soon as a ball is allocated, its successor is processed. We present an optimal parallel process that allocates all m = n log n balls in O(m/n) rounds. Hence, the expected allocation time per ball is constant. The main contribution of this process is that the maximum allocation time is additionally bounded by O(log log n).
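The load reduction from multiple choices is easy to see in a sequential simulation (a sketch of the phenomenon only; the paper's contribution is coordinating the choices in r parallel communication rounds, which this toy ignores):

```python
import random

def throw_balls(m, n, d, rng):
    """Throw m balls into n bins; each ball samples d bins uniformly
    and joins the currently least-loaded of them (d = 1 is the
    classical game)."""
    load = [0] * n
    for _ in range(m):
        choices = [rng.randrange(n) for _ in range(d)]
        best = min(choices, key=lambda b: load[b])
        load[best] += 1
    return load

rng = random.Random(0)  # fixed seed for a reproducible illustration
one = max(throw_balls(10_000, 10_000, 1, rng))  # max load, one choice
two = max(throw_balls(10_000, 10_000, 2, rng))  # max load, two choices
```

With m = n, a single choice gives a maximum load of about log n / log log n, while two choices drop it to about log log n; the run above with a fixed seed exhibits the gap.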
A new block algorithm for the triangularization of regular or singular matrices of dimension m × n is proposed. Taking advantage of fast block multiplication algorithms, it achieves the best known sequential complexity O(m^(ω−1) n) for any sizes and any rank. Moreover, the block strategy improves locality with respect to previous algorithms, as exhibited by its practical performance.
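As a rough point of reference (not the block algorithm of the paper, whose setting is exact block arithmetic reduced to fast matrix multiplication), here is the classical unblocked elimination with partial pivoting that such block methods refine; the numerical zero tolerance is an assumption of this sketch:

```python
def triangularize(A):
    """Row-reduce an m x n matrix (list of lists of floats) to
    upper-triangular form in place and return its rank. Handles
    singular / rectangular input by skipping dead columns."""
    m, n = len(A), len(A[0])
    rank = 0
    for col in range(n):
        if rank == m:
            break
        # partial pivoting: largest entry in this column, at or below 'rank'
        piv = max(range(rank, m), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-12:
            continue  # column is (numerically) zero below the pivot row
        A[rank], A[piv] = A[piv], A[rank]
        for r in range(rank + 1, m):
            f = A[r][col] / A[rank][col]
            for c in range(col, n):
                A[r][c] -= f * A[rank][c]
        rank += 1
    return rank
```

This does O(m n^2) arithmetic; the point of the blocked variant is to cast the bulk of that work as matrix multiplications, bringing the exponent down to ω.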
ISBN: 1595934529 (Print)
Hydra PPS is a collection of annotations, classes, a run-time, and a compiler designed to provide Java programmers with a fairly simple method of producing programs for Symmetric Multiprocessing (SMP) architectures. This paper introduces the basics of this new system, including the basic constructs of the new programming language and the relationship between the Java VM, the compiler, the runtime, and the parallel program. Hydra exploits parallelism when the underlying architecture supports it and runs as a normal sequential Java program when the architecture has no support for parallelism. Parallelism is expressed through events in Hydra; it is easy to use, and programs run efficiently on parallel architectures.
ISBN: 089791483X (Print)
A straight-line grid embedding of a planar graph is a drawing of the graph on a plane where the vertices are located at grid points and the edges are represented by nonintersecting segments of straight lines joining their incident vertices. Given an n-vertex planar graph with n ≥ 3, a straight-line embedding on a grid of size (n − 2) × (n − 2) can be computed deterministically in O(log n log log n) time with O(n log log n) work on a parallel random access machine. If randomization is used, the complexity is improved to O(log n) expected time with the same work bound. The parallel random access machine used by these algorithms allows concurrent reads and concurrent writes of the shared memory; in case of a write conflict, an arbitrary processor succeeds.
We implemented and measured several methods to perform BMMC permutations on the MasPar MP-2. Our results indicate that, except for certain types of permutations or very high virtual-processor ratios, the best method overall is the naive method, but with virtual-processor numbers computed in Gray-code order. For some permutations, however, the naive method performs very poorly; the best method in these cases is an adaptation of the block BMMC algorithm for parallel disk systems, in which the processor elements are treated as independent devices.
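The Gray-code ordering referred to above is the standard binary-reflected Gray code, under which consecutive virtual-processor numbers differ in exactly one bit; nothing in this sketch is specific to the MP-2:

```python
def gray(i):
    """Binary-reflected Gray code: consecutive values of i map to
    codes that differ in exactly one bit."""
    return i ^ (i >> 1)

# First eight codes: 0, 1, 3, 2, 6, 7, 5, 4 -- each adjacent pair
# differs in a single bit position.
codes = [gray(i) for i in range(8)]
```

Enumerating virtual processors in this order means each step flips a single address bit, which keeps successive memory accesses close in the machine's address mapping.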
ISBN: 9781595939739 (Print)
In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks. In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs.
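Prefix sums, one of the building blocks named above, admit a work-efficient two-phase scan (up-sweep then down-sweep, in the style of Blelloch's scan): O(n) total work in O(log n) parallel steps. The sequential simulation below shows the shape of such a routine while ignoring the PEM model's cache-block accounting:

```python
def exclusive_scan(xs):
    """Work-efficient exclusive prefix sum: up-sweep builds partial
    sums at power-of-two strides, down-sweep pushes prefixes back
    down. Each inner for-loop is a parallel-for in the PEM model."""
    n = len(xs)
    assert n and n & (n - 1) == 0, "power-of-two length for simplicity"
    a = list(xs)
    # up-sweep (reduce phase)
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # down-sweep (distribute phase)
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a
```

For instance, `exclusive_scan([1, 2, 3, 4])` yields `[0, 1, 3, 6]`: each position holds the sum of all earlier elements.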
ISBN: 9798400704161 (Print)
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree T, the SLD of T is a binary dendrogram that summarizes the n − 1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Ω(n log n) work, where n = |T|. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallel algorithms for computing SLDs both in theory and in practice, based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n log h) work and O(log^2 n log^2 h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(n log h) work and O(h log n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly efficient Union-Find algorithm typically used to compute SLDs in practice.
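The Union-Find baseline mentioned at the end can be sketched in a few lines (a simplified illustration, not the paper's algorithm): sort the tree's edges by weight and union endpoints in that order; each union creates one internal dendrogram node, producing the n − 1 merges of the SLD in O(n log n) work.

```python
def single_linkage_dendrogram(n, edges):
    """edges: iterable of (weight, u, v) over a tree on vertices
    0..n-1. Returns the SLD merges in weight order as tuples
    (left_node, right_node, weight, new_node); ids 0..n-1 are
    leaves, ids n..2n-2 are internal merge nodes."""
    # union-find whose roots double as current dendrogram node ids
    parent = list(range(2 * n - 1))
    merges = []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        new = n + len(merges)          # next internal node id
        parent[ru] = parent[rv] = new  # merge both clusters under it
        merges.append((ru, rv, w, new))
    return merges
```

On the path 0 −(2)− 1 −(1)− 2, the lighter edge merges leaves 1 and 2 into node 3 first, and the heavier edge then merges leaf 0 with node 3 into node 4.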
ISBN: 9798400704161 (Print)
Partitioning a graph into blocks of roughly equal weight while cutting only a few edges is a fundamental problem in computer science with numerous practical applications. While shared-memory parallel partitioners have recently matured to achieve the same quality as widely used sequential partitioners, there is still a pronounced quality gap between distributed partitioners and their sequential counterparts. In this work, we shrink this gap considerably by describing the engineering of an unconstrained local search algorithm suitable for distributed partitioners. We integrate the proposed algorithm in a distributed multilevel partitioner. Our extensive experiments show that the resulting algorithm scales to thousands of PEs while computing cuts that are, on average, only 3.5% larger than those of a state-of-the-art high-quality shared-memory partitioner. Compared to previous distributed partitioners, we obtain on average 6.8% smaller cuts than the best-performing competitor while being more than 9 times faster.
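The core move/gain logic of local search on a 2-way partition can be sketched as follows (a minimal sequential greedy pass, all names illustrative): move a vertex to the other block whenever that reduces the cut, subject to a balance constraint. The unconstrained search engineered in the paper instead relaxes the balance constraint during the search and repairs it afterwards; this toy keeps the constraint throughout.

```python
def refine(adj, side, max_block):
    """adj: {v: [neighbors]}; side: {v: 0 or 1}. Repeatedly apply
    improving single-vertex moves that keep both blocks within
    max_block vertices; return the final cut size."""
    improved = True
    while improved:
        improved = False
        for v in adj:
            same = sum(1 for u in adj[v] if side[u] == side[v])
            other = len(adj[v]) - same
            gain = other - same  # cut edges removed by moving v
            target = 1 - side[v]
            size_target = sum(1 for u in side if side[u] == target)
            if gain > 0 and size_target + 1 <= max_block:
                side[v] = target
                improved = True
    # count remaining cut edges (each undirected edge once)
    return sum(1 for v in adj for u in adj[v]
               if u > v and side[u] != side[v])
```

On a triangle {0, 1, 2} with a pendant vertex 3 attached to 0, starting from side = {0: 0, 1: 1, 2: 1, 3: 0}, the search moves vertex 0 across (gain 1) and the balance bound then blocks vertex 3, leaving a cut of one edge.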