This paper analyzes job scheduling for parallel computers using theoretical and experimental means. Based on existing architectures, we first present a machine model and a job model. Then, we propose a simple on-line algorithm employing job preemption without migration and derive theoretical bounds on the performance of the algorithm. The algorithm is experimentally evaluated with trace data from a large computing facility. These experiments show that the algorithm is highly sensitive to parameter selection and that substantial performance improvements over existing (non-preemptive) scheduling methods are possible.
ISBN (Print): 9798400704161
Dynamic trees are a well-studied and fundamental building block of dynamic graph algorithms dating back to the seminal work of Sleator and Tarjan [STOC'81, (1981), pp. 114-122]. The problem is to maintain a tree subject to online edge insertions and deletions while answering queries about the tree, such as the heaviest weight on a path. In the parallel batch-dynamic setting, the goal is to process batches of edge updates work-efficiently in low (polylog n) span. Two work-efficient algorithms are known: batch-parallel Euler Tour Trees by Tseng et al. [ALENEX'19, (2019), pp. 92-106] and parallel Rake-Compress (RC) Trees by Acar et al. [ESA'20, (2020), pp. 2:1-2:23]. Both, however, are randomized and only work-efficient in expectation. Several downstream results that use these data structures (and indeed, to the best of our knowledge, all known work-efficient parallel batch-dynamic graph algorithms) are therefore also randomized. In this work, we give the first deterministic work-efficient solution to the problem. Our algorithm maintains a parallel RC-Tree on n vertices subject to batches of k edge updates deterministically in worst-case O(k log(1 + n/k)) work and O(log n log log k) span on the Common-CRCW PRAM. We also show how to improve the span of the randomized algorithm from O(log n log* n) to O(log n). Lastly, as a result of our new deterministic algorithm, we also derandomize several downstream results that make use of parallel batch-dynamic trees, for which previously the only efficient solutions were randomized.
ISBN (Print): 9780897918909
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new analyses for several such dynamic randomized load balancing schemes. Unlike previous analyses, we do not assume that in equilibrium each server is stochastically independent of the other servers. Our work extends a previous analysis of the supermarket model, a model that abstracts a simple, efficient load balancing scheme in the setting where jobs arrive at a large system of parallel processors. In this model, customers arrive at a system of n servers as a Poisson stream of rate λn with λ < 1.
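A minimal discrete-event sketch of the supermarket model described above. The "each customer samples d servers uniformly at random and joins the shortest of those queues" rule, the exponential(1) service times, and all parameter values are assumptions taken from the standard formulation of the model, not details quoted from this abstract:

```python
import heapq
import random

def supermarket_sim(n=100, lam=0.9, d=2, horizon=10_000.0, seed=1):
    """Simulate the supermarket model: Poisson arrivals of rate lam*n to n
    exponential(1) servers; each arrival joins the shortest of d random queues.
    Returns the average queue length (including the new customer) seen at
    arrival instants."""
    random.seed(seed)
    queues = [0] * n                 # current number of customers at each server
    events = []                      # (time, kind, server) min-heap
    heapq.heappush(events, (random.expovariate(lam * n), "arrival", None))
    total, samples = 0, 0
    while events:
        t, kind, s = heapq.heappop(events)
        if t > horizon:
            break
        if kind == "arrival":
            # Power of d choices: sample d servers, join the least loaded one.
            choice = min(random.sample(range(n), d), key=lambda i: queues[i])
            queues[choice] += 1
            if queues[choice] == 1:  # server was idle: start service immediately
                heapq.heappush(events, (t + random.expovariate(1.0), "departure", choice))
            total += queues[choice]
            samples += 1
            # Schedule the next arrival of the Poisson stream.
            heapq.heappush(events, (t + random.expovariate(lam * n), "arrival", None))
        else:  # departure
            queues[s] -= 1
            if queues[s] > 0:        # begin serving the next waiting customer
                heapq.heappush(events, (t + random.expovariate(1.0), "departure", s))
    return total / max(samples, 1)

if __name__ == "__main__":
    print("d=1:", supermarket_sim(d=1))   # plain random assignment
    print("d=2:", supermarket_sim(d=2))   # two choices: queues shrink drastically
```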
The original LogP model for parallel computation uses four parameters: the communication latency (L), overhead (o), bandwidth (g), and the number of processors (P). This basic model is extended with a linear model for long messages. The resulting combination, called the LogGP model for parallel computation, has one additional parameter, G, which captures the bandwidth obtained for long messages. In this paper, algorithm design and analysis under the new model are discussed, and the all-to-all remap, FFT, and radix sort are examined. In addition, the single-node scatter problem is examined. Finally, solutions are derived for the problem, and their optimality under the LogGP model is proven.
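As a rough worked example, the sketch below compares the predicted time to deliver a k-byte message under the two models. The parameter values are hypothetical and the cost expressions follow the usual LogP/LogGP accounting (send overhead, per-byte or per-message gap, wire latency, receive overhead); they are a sketch of that accounting, not formulas quoted from this paper:

```python
def loggp_message_time(k, L, o, G):
    """Predicted time for one k-byte message under LogGP: sender overhead,
    (k-1) further bytes injected at the per-byte gap G, wire latency L,
    and receiver overhead."""
    return o + (k - 1) * G + L + o

def logp_message_time(k, L, o, g):
    """Rough LogP prediction when the same k bytes are sent as k one-byte
    messages: a message can be injected roughly every max(g, o) cycles,
    and the last one still pays L plus the receive overhead."""
    return o + (k - 1) * max(g, o) + L + o

# Hypothetical machine parameters (microseconds, and microseconds per byte for G).
L, o, g, G = 10.0, 2.0, 4.0, 0.05
for k in (1, 64, 4096):
    print(k, "bytes:",
          "LogGP =", loggp_message_time(k, L, o, G),
          "LogP as k small messages =", logp_message_time(k, L, o, g))
```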
The proceedings contain 39 papers. The topics discussed include: randomized queue management for DiffServ; randomization does not reduce the average delay in parallel packet switches; dynamic circular work-stealing deque; coloring unstructured radio networks; name-independent routing for growth-bounded networks; windows scheduling of arbitrary length jobs on parallel machines; parallel scheduling of complex dags under uncertainty; on distributed smooth scheduling; scheduling malleable tasks with precedence constraints; an adaptive power conservation scheme for heterogeneous wireless sensor networks with node redeployment; irrigating ad hoc networks in constant time; and constant density spanners for wireless ad-hoc networks.
ISBN (Print): 9798400704161
In a breakthrough result, Spielman and Teng (2004) developed a nearly-linear time solver for Laplacian linear equations, i.e. equations where the coefficient matrix is symmetric with non-negative diagonals and zero row sums. Since the development of the Spielman-Teng solver, there has been substantial progress, simplifying and improving their result, but obtaining a fast, practical, parallel Laplacian solver remains an open problem. We present a framework for obtaining extremely simple, parallel Laplacian linear equation solvers with nearly-linear work and sub-linear depth. Our framework allows us to parallelize any Laplacian solver based on repeated single-vertex approximate Gaussian elimination. We demonstrate this by parallelizing both the algorithm of Kyng and Sachdeva (2016) and the practical variant by Gao, Kyng, and Spielman (2023). Our framework is work-efficient in the sense of matching the sequential work of these algorithms. Our parallelization framework is very simple: we sample a subset of the current low-degree vertices (sparse columns), and in parallel we eliminate all vertices that are isolated in the resulting induced subgraph. This approach can be combined with any parallelizable approximate single-vertex elimination subroutine with sparse output. Given the simplicity of the approach, we believe that using it to parallelize the solver of Gao, Kyng, and Spielman (2023) is the most promising direction for obtaining practical parallel Laplacian solvers. If we additionally use a parallel spectral sparsification routine, our approach can be modified to work in polylogarithmic depth and nearly-linear work.
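A minimal sketch of the vertex-selection step just described. The degree threshold, sampling probability, and adjacency-dictionary representation are illustrative assumptions, and the approximate single-vertex elimination that would then be applied to the returned set in parallel is omitted:

```python
import random

def select_elimination_round(adj, degree_bound, p=0.5, seed=0):
    """One sampling round: among the currently low-degree vertices, sample
    each independently with probability p, then keep only sampled vertices
    with no sampled neighbour, so the kept set is isolated in the induced
    subgraph and can be eliminated in parallel without conflicts.

    adj: dict mapping each vertex to the set of its neighbours."""
    rng = random.Random(seed)
    low_degree = [v for v, nbrs in adj.items() if len(nbrs) <= degree_bound]
    sampled = {v for v in low_degree if rng.random() < p}
    # Keep a sampled vertex iff none of its neighbours were sampled.
    return {v for v in sampled if not (adj[v] & sampled)}

# Toy usage: a path graph 0-1-2-3-4; every vertex has degree <= 2.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(select_elimination_round(adj, degree_bound=2))
```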
ISBN (Print): 9780897918909
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. For certain large-scale applications this is necessarily true. Typically, the cost models proposed for external memory algorithms have measured only the number of I/O operations, and the algorithms have been specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. In this paper we provide simulation techniques which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. In addition to the main simulation result we obtain a more comprehensive cost model for EM algorithms, which considers the total costs incurred by the algorithm including computation, I/O and communication costs.
ISBN (Print): 9780897917179
Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. When executing such programs, the major concern is to dynamically schedule tasks to processors in order to minimize execution time and the amount of memory needed. In this paper, a class of parallel schedules that are provably efficient in both time and space is identified, even for programs whose task structure is revealed only during execution. An efficient dynamic scheduling algorithm that generates schedules in this class for languages with nested fine-grained parallelism is then described.
ISBN (Print): 9780897918091
We present new BSP algorithms for deterministic sorting and randomized median finding. We sort n general keys by using a partitioning scheme that achieves the requirements of efficiency (one-optimality) and insensitivity to data skew (the accuracy of the splitting keys depends solely on the step distance, which can be adapted to meet the worst-case requirements of our application). Although we employ sampling in order to realize efficiency, we can give a precise worst-case estimation of the maximum imbalance that might occur. We also investigate optimal randomized BSP algorithms for the problem of finding the median of n elements that require, with high probability, 3n/(2p) + o(n/p) comparisons, for a wide range of values of n and p. Experimental results for the two algorithms are also presented.
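To make the sampling-based partitioning idea concrete, here is a generic regular-sampling splitter-selection sketch. It illustrates the general technique with made-up data and parameter names; it is not this paper's exact splitting scheme and carries none of its worst-case imbalance guarantee:

```python
import bisect

def regular_sample_splitters(local_sorted_blocks, p):
    """Each of the p processors contributes every step-th key of its locally
    sorted block (the "step distance"); the gathered samples are sorted and
    p-1 evenly spaced sample keys become the global splitters."""
    samples = []
    for block in local_sorted_blocks:
        step = max(len(block) // p, 1)
        samples.extend(block[step - 1 :: step][:p])
    samples.sort()
    return [samples[(i + 1) * len(samples) // p - 1] for i in range(p - 1)]

def partition_by_splitters(block, splitters):
    """Assign each key of a locally sorted block to the bucket (processor)
    whose splitter range contains it."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for x in block:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    return buckets

# Toy usage with p = 3 processors.
blocks = [sorted([5, 1, 9, 12, 7, 3]),
          sorted([2, 8, 4, 11, 6, 10]),
          sorted([0, 13, 14, 15, 16, 17])]
splitters = regular_sample_splitters(blocks, p=3)
print("splitters:", splitters)
print("buckets of block 0:", partition_by_splitters(blocks[0], splitters))
```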
ISBN (Print): 9798400704161
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree T, the SLD of T is a binary dendrogram that summarizes the n − 1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Ω(n log n) work, where n = |T|. Furthermore, to the best of our knowledge no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallel algorithms for computing SLDs both in theory and in practice based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n log h) work and O(log^2 n log^2 h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(n log h) work and O(h log n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly efficient Union-Find algorithm typically used to compute SLDs in practice.
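For reference, a minimal sketch of the sequential Union-Find baseline mentioned above (the input format and dendrogram-node representation are assumptions for illustration): the tree edges are processed in increasing weight order, and each union creates a dendrogram node merging the two current clusters.

```python
class UnionFindSLD:
    """Sequential Union-Find baseline for building a single-linkage dendrogram
    from a weighted tree on n vertices (a sketch of the standard approach)."""

    def __init__(self, n):
        self.parent = list(range(n))      # union-find parents over vertex ids
        self.root_node = list(range(n))   # dendrogram node currently representing each cluster
        # Leaves: (id, left, right, merge_weight); internal nodes are appended later.
        self.nodes = [(i, None, None, None) for i in range(n)]

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def build(self, edges):
        """edges: list of (weight, u, v) forming a tree; returns all dendrogram nodes."""
        for w, u, v in sorted(edges):
            ru, rv = self.find(u), self.find(v)
            new_id = len(self.nodes)
            # Internal node merging the two current clusters at weight w.
            self.nodes.append((new_id, self.root_node[ru], self.root_node[rv], w))
            self.parent[ru] = rv
            self.root_node[rv] = new_id
        return self.nodes

# Toy usage: a path 0-1-2-3 with edge weights 2, 1, 3.
sld = UnionFindSLD(4).build([(2, 0, 1), (1, 1, 2), (3, 2, 3)])
print(sld[4:])  # the n-1 internal nodes, in order of increasing merge weight
```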