Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree), the SLD of) is a binary dendrogram that summarizes the n =...
详细信息
ISBN:
(纸本)9798400704161
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree), the SLD of) is a binary dendrogram that summarizes the n = 1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Omega(n log n) work where n = vertical bar T vertical bar. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallelalgorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n log h) work and O(log(2) n log(2) h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(n log h) work and O(h log n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly-efficient Union-Find algorithm typically used to compute SLDs in practice.
The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires signific...
详细信息
The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view each time the computational mesh is adapted. JOVE has been implemented on an SP2 in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. Furthermore, JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
The proceedings contain 68 papers. The topics discussed include: faster deterministic all pairs shortest paths in congest model;contention resolution with message deadlines;memory tagging: minimalist synchronization f...
ISBN:
(纸本)9781450369350
The proceedings contain 68 papers. The topics discussed include: faster deterministic all pairs shortest paths in congest model;contention resolution with message deadlines;memory tagging: minimalist synchronization for scalable concurrent data structures;work-efficient batch-incremental minimum spanning trees with applications to the sliding-window model;closing the gap between cache-oblivious and cache-adaptive analysis;optimal resource allocation for elastic and inelastic jobs;optimal parallelalgorithms in the binary-forking model;randomized incremental convex hull is highly parallel;almost universal anonymous rendezvous in the plane;the online multi-commodity facility location problem;unconditional lower bounds for adaptive massively parallel computation;on the hardness of massively parallel computation;and graph sparsification for derandomizing massively parallel computation with low space.
This paper discusses a parallel implementation of the e-relaxation algorithm for the min-cost flow problem on the CM-5. There is considerable loss in efficiency in going from one processor sequential implementation to...
详细信息
As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for i...
详细信息
ISBN:
(纸本)9780897919890
As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for improving the performance of similarity searches through parallel I/O. In this paper, we propose a new scheme which provides good declustering for similarity searching. In particular, it does global declustering as opposed to local declustering, exploits the availability of extra disks and does not limit the partitioning of the data space. Our technique is based upon the Cyclic declustering schemes which were developed for range and partial match queries. We establish, in general, that Cyclic declustering techniques outperform previously proposed techniques.
The time required to compute any function of a collection of circular doubly linked lists on a CROW PRAM is shown to be at most a constant factor more than on a CREW PRAM, but this is not true for singly linked lists....
详细信息
ISBN:
(纸本)0897913701
The time required to compute any function of a collection of circular doubly linked lists on a CROW PRAM is shown to be at most a constant factor more than on a CREW PRAM, but this is not true for singly linked lists. A tight lower bound of Ω(log log n) for colouring an n node doubly linked list on a CROW PRAM using a constant number of colours is also obtained.
This paper concerns the problem of how to exploit parallelism during the phases of compilation involving syntax-directed analysis and translation. In particular, we address the problem of how to exploit parallelism du...
详细信息
parallel Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources, allowing for faster generation of samples, estimation of values, and policy improvement. Thes...
详细信息
ISBN:
(纸本)9798400704161
parallel Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources, allowing for faster generation of samples, estimation of values, and policy improvement. These computational paradigms require a seamless integration of training, serving, and simulation workloads. Existing frameworks, such as Ray, are not managing this orchestration efficiently, especially in RL tasks that demand intensive input/output and synchronization between actors on a single node. In this study, we have proposed a solution implementing the reactor model, which enforces a set of actors to have a fixed communication pattern. This allows the scheduler to eliminate work needed for synchronization, such as acquiring and releasing locks for each actor or sending and processing coordination-related messages. Our framework, Lingua Franca (LF), a coordination language based on the reactor model, also supports true parallelism in Python and provides a unified interface that allows users to automatically generate dataflow graphs for RL tasks. In comparison to Ray on a single-node multi-core compute platform, LF achieves 1.21x and 11.62x higher simulation throughput in OpenAI Gym and Atari environments, reduces the average training time of synchronized parallel Q-learning by 31.2%, and accelerates multi-agent RL inference by 5.12x.
A unified framework for finding efficient permutation routes on parallel networks in an off-line setting is presented. If the underlying graph of a parallel network contains an appropriate 'approximate' produc...
详细信息
ISBN:
(纸本)0897913701
A unified framework for finding efficient permutation routes on parallel networks in an off-line setting is presented. If the underlying graph of a parallel network contains an appropriate 'approximate' product structure then our method guarantees the existence of non-blocking near-optimal permutation routes. The routes in question can be determined in polynomial time. Furthermore, our results are extended to finding permutation routes among the remaining 'live' nodes in a faulty network.
Cache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifies (cache size and cache line size) of each leve...
详细信息
ISBN:
(纸本)9781605586069
Cache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifies (cache size and cache line size) of each level. In this paper, we describe cache-oblivious sorting algorithms with optimal work, optimal cache complexity and polylogarithmic depth. Using known mappings, these lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. Moreover, the low cache complexities extend to shared-memory multiprocessors with common configurations of multi-level caches. The key factor in the low cache complexity on multiprocessors is the low depth of the algorithms we propose.
暂无评论