ISBN (print): 9781611977967
In this work, we design, analyze, and optimize sequential and shared-memory parallel algorithms for partitioned local depths (PaLD). Given a set of data points and pairwise distances, PaLD is a method for identifying the strength of pairwise relationships based on relative distances, enabling the identification of strong ties within dense and sparse communities even if their sizes and within-community absolute distances vary greatly. We design two algorithmic variants that perform community structure analysis through triplet comparisons of pairwise distances. We present theoretical analyses of computation and communication costs and prove that the sequential algorithms are communication optimal, up to constant factors. We introduce performance optimization strategies that yield sequential speedups of up to 29x over a baseline sequential implementation and parallel speedups of up to 26.2x over optimized sequential implementations using up to 32 threads on an Intel multicore CPU.
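To make the triplet-comparison structure concrete, the following is a minimal, unoptimized sequential sketch of PaLD cohesion in Python. The function name, the exact conflict-focus definition, and the tie-splitting convention are illustrative assumptions; the paper's optimized variants restructure these loops for cache efficiency and parallelism.

```python
import numpy as np

def pald_cohesion(D):
    """Cohesion via triplet comparisons: a minimal O(n^3) sequential sketch.

    D is a symmetric (n, n) array of pairwise distances. For every
    conflict (x, y), each point z in the conflict focus lends support
    to whichever of x, y it is closer to, weighted by the focus size.
    Returns an (n, n) cohesion matrix C.
    """
    n = D.shape[0]
    C = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y == x:
                continue
            # Conflict focus: points at least as close to x or to y
            # as x and y are to each other (always contains x and y).
            focus = [z for z in range(n)
                     if D[z, x] <= D[x, y] or D[z, y] <= D[x, y]]
            size = len(focus)
            for z in focus:
                if D[z, x] < D[z, y]:       # z sides with x
                    C[x, z] += 1.0 / size
                elif D[z, x] == D[z, y]:    # tie: split the support
                    C[x, z] += 0.5 / size
    return C / (n - 1)
```

In the PaLD literature, strong ties are then typically read off by comparing entries of C against a threshold derived from its diagonal.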
ISBN (print): 9798400704161
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree T, the SLD of T is a binary dendrogram that summarizes the n − 1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Ω(n log n) work, where n = |T|. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallel algorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n log h) work and O(log² n log² h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(n log h) work and O(h log n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly efficient Union-Find algorithm typically used to compute SLDs in practice.
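For reference, the Union-Find baseline mentioned above can be sketched in a few lines; its Ω(n log n) cost comes from sorting the edges by weight. The names and the flat merge-list output format below are illustrative choices, not the paper's interface.

```python
def single_linkage_dendrogram(n, edges):
    """Union-Find baseline sketch: contract tree edges in weight order.

    n: number of vertices (dendrogram leaves 0..n-1).
    edges: list of (weight, u, v) edges of the input tree.
    Returns the n - 1 merges (weight, left_node, right_node, new_node)
    defining the binary dendrogram; internal nodes get ids n, n+1, ...
    """
    parent = list(range(n))              # union-find forest over vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    node_of = list(range(n))             # dendrogram root of each component
    merges, next_id = [], n
    for w, u, v in sorted(edges):        # the O(n log n) bottleneck
        ru, rv = find(u), find(v)
        merges.append((w, node_of[ru], node_of[rv], next_id))
        parent[ru] = rv                  # contract: merge the two components
        node_of[rv] = next_id
        next_id += 1
    return merges
```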
ISBN (print): 9798350364613; 9798350364606
In this paper, we share our experience in teaching parallel algorithms with the binary-forking model. With hardware advances, multicore computers are now ubiquitous. This has created a substantial demand in both research and industry to harness the capabilities of parallel computing. It is thus important to incorporate parallelism in computer science education, especially in the early stages of the curriculum. However, it is commonly believed that understanding and using parallelism requires a deep understanding of computer systems and architecture, which complicates introducing parallelism to young students and non-experts. We propose to use the binary-forking model, introduced in our previous research, for teaching parallel algorithms. The model is meant to capture the performance of algorithms on modern multicore shared-memory machines: it is a simple abstraction that isolates algorithm design ideas from system-level details. The abstraction allows for simple analysis based on the work-span model in theory, and can be directly implemented as parallel programs in practice. In this paper, we briefly overview some basic primitives in this model, and provide a list of algorithms that we believe are well-suited to parallel algorithms courses.
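As a small illustration of the model's appeal for teaching, here is the canonical array-sum example written in the binary-forking style. Plain Python runs the two recursive calls sequentially, so this sketch conveys only the structure and the work-span analysis, not actual speedup.

```python
def reduce_sum(a, lo, hi):
    """Array sum in the binary-forking model: O(n) work, O(log n) span.

    In the model, the two recursive calls are forked to run in parallel
    and joined before the addition; the analysis counts total operations
    (work) and the longest chain of dependencies (span).
    """
    if hi - lo == 1:
        return a[lo]
    mid = (lo + hi) // 2
    left = reduce_sum(a, lo, mid)    # fork
    right = reduce_sum(a, mid, hi)   # fork
    return left + right              # join

assert reduce_sum(list(range(8)), 0, 8) == 28
```

The same fork/join skeleton underlies the scan, filter, and sorting primitives typically covered in such a course.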
Nucleus decompositions have been shown to be a useful tool for finding dense subgraphs. The coreness value of a clique represents its density based on the number of other cliques it is adjacent to. One useful output of nucleus decomposition is a hierarchy among dense subgraphs at different resolutions. However, existing parallel algorithms for nucleus decomposition do not generate this hierarchy, and only compute the coreness values. This paper presents a scalable parallel algorithm for hierarchy construction, with practical optimizations, such as interleaving the coreness computation with hierarchy construction and using a concurrent union-find data structure in an innovative way to generate the hierarchy. We also introduce a parallel approximation algorithm for nucleus decomposition, which achieves much lower span in theory and better performance in practice. We prove strong theoretical bounds on the work and span (parallel time) of our algorithms. On a 30-core machine with two-way hyper-threading, our parallel hierarchy construction algorithm achieves up to a 58.84x speedup over the state-of-the-art sequential hierarchy construction algorithm by Sariyuce et al. and up to a 30.96x self-relative parallel speedup. On the same machine, our approximation algorithm achieves a 3.3x speedup over our exact algorithm, while generating coreness estimates with a multiplicative error of 1.33x on average.
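As background, coreness values in the simplest setting (k-cores, where the "cliques" are just vertices and edges) are computed by iterated peeling. The sketch below shows this sequential special case only; it is not the paper's parallel algorithm, and it omits hierarchy construction entirely.

```python
from collections import defaultdict

def core_numbers(adj):
    """Sequential peeling sketch for k-core coreness values.

    adj maps each vertex to its set of neighbors. Nucleus decomposition
    generalizes this idea by replacing vertex degrees with counts of
    adjacent cliques; the peel-lowest-first structure is the same.
    """
    degree = {v: len(adj[v]) for v in adj}
    buckets = defaultdict(set)           # vertices bucketed by degree
    for v, d in degree.items():
        buckets[d].add(v)
    core, k = {}, 0
    for _ in range(len(adj)):
        while not buckets[k]:            # advance to lowest non-empty bucket
            k += 1
        v = buckets[k].pop()
        core[v] = k                      # v is peeled at level k
        for u in adj[v]:                 # lower surviving neighbors' degrees
            if u not in core and degree[u] > k:
                buckets[degree[u]].discard(u)
                degree[u] -= 1
                buckets[degree[u]].add(u)
    return core

# A triangle with a pendant vertex: the triangle is the 2-core.
print(core_numbers({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}))
```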
The goal of this paper is to develop parallel algorithms that, given a learning sample, identify a regular language by means of a nondeterministic finite automaton (NFA). A sample is a pair of finite sets containing positive and negative examples. Given a sample, a minimal NFA that represents the target regular language is sought. We define the task of finding an NFA that accepts all positive examples and rejects all negative ones as a constraint satisfaction problem, and then propose parallel algorithms to solve the problem. The results of comprehensive computational experiments on a variety of inference tasks are reported. Note that the question of minimizing an NFA consistent with a learning sample is computationally hard.
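To see what the constraint satisfaction formulation must encode, the following brute-force sketch enumerates k-state NFAs directly, with the start state fixed to 0 (an assumption of this sketch). It is feasible only for toy instances, which is precisely why a symbolic CSP encoding and parallel search are needed.

```python
from itertools import combinations, product

def all_subsets(states):
    # All 2^k candidate state sets, used both as transition images
    # and as the set of accepting states.
    return [frozenset(c) for r in range(len(states) + 1)
            for c in combinations(states, r)]

def nfa_accepts(delta, finals, word):
    current = {0}                        # start state fixed to 0
    for a in word:
        current = set().union(*(delta[(q, a)] for q in current))
    return bool(current & finals)

def find_consistent_nfa(positives, negatives, alphabet, k):
    """Try every k-state NFA until one accepts all positive and rejects
    all negative examples -- the search space the CSP encodes symbolically.
    """
    states = list(range(k))
    keys = [(q, a) for q in states for a in alphabet]
    cells = all_subsets(states)
    for finals in cells:
        for images in product(cells, repeat=len(keys)):
            delta = dict(zip(keys, images))
            if (all(nfa_accepts(delta, finals, w) for w in positives)
                    and not any(nfa_accepts(delta, finals, w) for w in negatives)):
                return delta, finals
    return None

# Toy sample over {'a'}: even-length words positive, odd-length negative.
print(find_consistent_nfa(["", "aa"], ["a", "aaa"], ["a"], 2))
```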
ISBN (print): 9783031498145; 9783031498152
Maximizing a non-negative, monotone, submodular function f over n elements under a cardinality constraint k (SMCC) is a well-studied NP-hard problem. It has important applications in, e.g., machine learning and influence maximization. Though the theoretical problem admits polynomial-time approximation algorithms, solving it in practice often involves frequently querying submodular functions that are expensive to compute. This has motivated significant research into designing parallel approximation algorithms in the adaptive complexity model; adaptive complexity (adaptivity) measures the number of sequential rounds of poly(n) function queries an algorithm requires. The state-of-the-art algorithms can achieve (1 − 1/e − ε)-approximate solutions with O(1/ε² log n) adaptivity, which approaches the known adaptivity lower bounds. However, the O(1/ε² log n) adaptivity only applies to maximizing worst-case functions that are unlikely to appear in practice. Thus, in this paper, we consider the special class of p-superseparable submodular functions, which places a reasonable constraint on f, based on the parameter p, and is more amenable to maximization, while also having real-world applicability. Our main contribution is the algorithm LS+GS, a finer-grained version of the existing LS+PGB algorithm, designed for instances of SMCC when f is p-superseparable; it achieves an expected (1 − 1/e − ε)-approximate solution with O(1/ε² log(pk)) adaptivity independent of n. Additionally, unrelated to p-superseparability, our LS+GS algorithm uses only O(ε⁻¹n + ε⁻² log n) oracle queries, which has an improved dependence on ε⁻¹ over the state-of-the-art LS+PGB; this is achieved through the design of a novel thresholding subroutine.
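For contrast with the low-adaptivity algorithms discussed above, here is the classic sequential greedy for SMCC. It attains the well-known (1 − 1/e) guarantee but requires k fully sequential rounds of queries, which is exactly what adaptive-complexity algorithms such as LS+PGB and LS+GS avoid. The coverage function in the usage example is our own illustration.

```python
def greedy_smcc(f, ground_set, k):
    """Classic greedy for SMCC: a (1 - 1/e)-approximation for monotone
    submodular f, using O(nk) oracle queries and k adaptive rounds
    (each round's queries depend on the previous round's choice).
    """
    S = set()
    for _ in range(k):
        gains = {e: f(S | {e}) - f(S) for e in ground_set if e not in S}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break                        # no remaining marginal gain
        S.add(best)
    return S

# Coverage functions are monotone and submodular.
sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
cover = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0
print(greedy_smcc(cover, sets.keys(), 2))   # e.g. {'A', 'C'}
```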
ISBN (print): 9781450395458
Semisort is a fundamental algorithmic primitive widely used in the design and analysis of efficient parallel algorithms. It takes as input an array of records and a function extracting a key per record, and reorders the records so that those with equal keys are contiguous. Since many applications only require collecting equal values, but not fully sorting the input, semisort is broadly applicable, e.g., in string algorithms, graph analytics, and geometry processing, among many other domains. However, despite dozens of recent papers that use semisort in their theoretical analysis and the existence of an asymptotically optimal parallel semisort algorithm, most implementations of these parallel algorithms choose to implement semisort by using comparison or integer sorting in practice, due to potential performance issues in existing semisort implementations. In this paper, we revisit the semisort problem, with the goal of achieving a high-performance parallel semisort implementation with a flexible interface. Our approach can easily be extended to two related problems, histogram and collect-reduce. Our algorithms achieve strong speedups in practice, and importantly, outperform state-of-the-art parallel sorting and semisorting methods for almost all settings we tested, with varying input sizes, distributions, and key types. On average (geometric mean), our semisort implementation is at least 1.27x faster than the best of the tested baselines. We also test two important applications with real-world data, and show that our algorithms improve performance (by up to 2.13x) over existing approaches. We believe that many other parallel algorithm implementations can be accelerated using our results.
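The semisort contract is easy to pin down with a sequential sketch: records with equal keys become contiguous, but no order is promised across distinct keys. The hash-grouping version below captures only these semantics; the paper's parallel algorithm is sampling- and bucket-based rather than this simple loop.

```python
from collections import defaultdict

def semisort(records, key):
    """Sequential semisort sketch: group records by key, concatenate
    the groups. Equal keys end up contiguous; the groups themselves
    appear in first-seen order, which semisort does not guarantee.
    """
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return [r for bucket in groups.values() for r in bucket]

# Group edges by source vertex, a typical graph-analytics use:
edges = [(2, 5), (1, 3), (2, 7), (1, 9)]
print(semisort(edges, key=lambda e: e[0]))  # [(2, 5), (2, 7), (1, 3), (1, 9)]
```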
Recent advances in the design of efficient parallel algorithms have largely focused on the now-classical model of parallel computing called Massively Parallel Computation (MPC), which follows the framework of...
Neural algorithmic reasoners are parallel processors. Teaching them sequential algorithms contradicts this nature, rendering a significant share of their computations redundant. Parallel algorithms, however, may exploit...