The proceedings contain 54 papers. The topics discussed include: expediting hazard pointers with bounded RCU critical sections;Alock: asymmetric lock primitive for RDMA systems;when is parallelism fearless and zero-co...
ISBN:
(纸本)9798400704161
The proceedings contain 54 papers. The topics discussed include: expediting hazard pointers with bounded RCU critical sections;Alock: asymmetric lock primitive for RDMA systems;when is parallelism fearless and zero-cost with rust?;efficient parallel reinforcement learning framework using the reactor model;parallel best arm identification in heterogeneous environments;brief announcement: lock-free learned search data structure;brief announcement: LIT: lookup interlocked table for range queries;brief announcement: a fast scalable detectable unrolled lock-based linked list;scheduling out-trees online to optimize maximum flow;optimizing dynamic data center provisioning through speed scaling: a primal-dual perspective;scheduling jobs with work-inefficient parallel solutions;and multi bucket queues: efficient concurrent priority scheduling.
We present a randomized O(m log(2) n) work, O( polylogn) depth parallel algorithm for minimum cut. This algorithm matches thework bounds of a recent sequential algorithm by Gawrychowski, Mozes, andWeimann [ICALP'2...
详细信息
We present a randomized O(m log(2) n) work, O( polylogn) depth parallel algorithm for minimum cut. This algorithm matches thework bounds of a recent sequential algorithm by Gawrychowski, Mozes, andWeimann [ICALP'20], and improves on the previously best parallel algorithm by Geissmann and Gianinazzi [SPAA'18], which performs O(m log(4) n) work in O(polylogn) depth. Our algorithm makes use of three components that might be of independent interest. First, we design a parallel data structure that efficiently supports batched mixed queries and updates on trees. It generalizes and improves thework bounds of a previous data structure of Geissmann and Gianinazzi and iswork efficient with respect to the best sequential algorithm. Second, we design a parallel algorithm for approximate minimum cut that improves on previous results by Karger and Motwani. We use this algorithm to give a work-efficient procedure to produce a tree packing, as in Karger's sequential algorithm for minimum cuts. Last, we design an efficient parallel algorithm for solving the minimum 2-respecting cut problem.
Previous parallel sorting algorithms do not scale to the largest available machines, since they either have prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viabl...
详细信息
ISBN:
(纸本)9781450335881
Previous parallel sorting algorithms do not scale to the largest available machines, since they either have prohibitive communication volume or prohibitive critical path length. We describe algorithms that are a viable compromise and overcome this gap both in theory and practice. The algorithms are multi-level generalizations of the known algorithms sample sort and multiway mergesort. In particular, our sample sort variant turns out to be very scalable both in theory and practice where it scales up to 2(15) MPI processes with outstanding performance in particular for medium sized inputs. Some tools we develop may be of independent interest - a simple, practical, and flexible sorting algorithm for very small inputs, a near linear time optimal algorithm for solving a constrained bin packing problem, and an algorithm for data delivery, that guarantees a small number of message startups on each processor.
We develop a novel parallel decomposition strategy for un-weighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect...
详细信息
ISBN:
(纸本)9781450335881
We develop a novel parallel decomposition strategy for un-weighted, undirected graphs, based on growing disjoint connected clusters from batches of centers progressively selected from yet uncovered nodes. With respect to similar previous decompositions, our strategy exercises a tighter control on both the number of clusters and their maximum radius. We present two important applications of our parallel graph decomposition: (1) k-center clustering approximation;and (2) diameter approximation. In both cases, we obtain algorithms which feature a polylogarithmic approximation factor and are amenable to a distributed implementation that is geared for massive (long-diameter) graphs. The total space needed for the computation is linear in the problem size, and the parallel depth is substantially sublinear in the diameter for graphs with low doubling dimension. To the best of our knowledge, ours are the first parallel approximations for these problems which achieve sub-diameter parallel time, for a relevant class of graphs, using only linear space. Besides the theoretical guarantees, our algorithms allow for a very simple implementation on clustered architectures: we report on extensive experiments which demonstrate their effectiveness and efficiency on large graphs as compared to alternative known approaches.
In many applications of parallel computing, distribution of the data unambiguously implies distribution of work among processors. But there are exceptions where some tasks can be assigned to one of several processors ...
详细信息
ISBN:
(纸本)9781581135299
In many applications of parallel computing, distribution of the data unambiguously implies distribution of work among processors. But there are exceptions where some tasks can be assigned to one of several processors without altering the total volume of communication. In this paper, we study the problem of exploiting this flexibility in assignment of tasks to improve load balance. We first model the problem in terms of network flow and use combinatorial techniques for its solution. Our parametric search algorithms use maximum flow algorithms for probing on a candidate optimal solution value. We describe two algorithms to solve the assignment problem with log WT and |P| probe calls, where WT and |P|, respectively, denote the total workload and number of processors. We also define augmenting paths and cuts for this problem, and show that any algorithm based on augmenting paths can be used to find an optimal solution for the task assignment problem. We then consider a continuous version of the problem, and formulate it as a linearly constrained optimization problem, i.e., min Ax∞, s.t. Bx = d. To avoid solving an intractable ∞-norm optimization problem, we show that in this case minimizing the 2-norm is sufficient to minimize the ∞-norm, which reduces the problem to the well-studied linearly-constrained least squares problem. The continuous version of the problem has the advantage of being easily amenable to parallelization.
parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of appl...
详细信息
ISBN:
(纸本)9780897919890
parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of applications has become a significant challenge. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects optimal I/O plans automatically without human intervention based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses a rule-based and randomized search-based algorithms to select optimal parameter settings in Panda, a parallel I/O library for multidimensional arrays. Our performance results obtained from two IBM SPs with significantly different configurations show that the Panda optimizer is able to select high-quality I/O plans and deliver high performance under a variety of system configurations.
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B con...
详细信息
ISBN:
(纸本)9780897918091
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the forecasting technique of [Knu73], our algorithm is able to read in, at any time, the `right' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the `right' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. Our analysis technique involves a novel reduction to various maximum occupancy problems. We prove that the expected I/O performance of SRM is efficient under varying sizes of memory and that it compares favorably in practice to disk-striped mergesort (DSM). Our studies indicate that SRM outperforms DSM even when the number D of parallel disks is fairly small.
Discovery of association rules is an important database mining problem. Mining for association rules involves extracting patterns from large databases and inferring useful rules from them. Several parallel and sequent...
详细信息
Discovery of association rules is an important database mining problem. Mining for association rules involves extracting patterns from large databases and inferring useful rules from them. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the commonly occurring patterns or itemsets (set of items), thus incurring high I/O overhead. In the parallel case, these algorithms do a reduction at the end of each pass to construct the global patterns, thus incurring high synchronization cost. In this paper we describe a new parallel association mining algorithm. Our algorithm is a result of detailed study of the available parallelism and the properties of associations. The algorithm uses a scheme to cluster related frequent itemsets together, and to partition them among the processors. At the same time it also uses a different database layout which clusters related transactions together, and selectively replicates the database so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithm eliminates the need for further communication or synchronization. The algorithm further scans the local database partition only three times, thus minimizing I/O overheads. Unlike previous approaches, the algorithms uses simple intersection operations to compute frequent itemsets and doesn't have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster inter-connected by the Memory Channel network. We present results on the performance of our algorithm on various databases, and compare it against a well known parallel algorithm. Our algorithm outperforms it by an more than an order of magnitude.
The proceedings contain 40 papers. The topics discussed include: time vs. space trade-offs for rendezvous in trees;allowing each node to communicate only once in a distributed system: shared whiteboard models;optimal ...
ISBN:
(纸本)9781450312134
The proceedings contain 40 papers. The topics discussed include: time vs. space trade-offs for rendezvous in trees;allowing each node to communicate only once in a distributed system: shared whiteboard models;optimal and competitive runtime bounds for continuous, local gathering of mobile robots;online multi-robot exploration of grid graphs with rectangular obstacles;in search of parallel dimensions;delegation and nesting in best-effort hardware transactional memory;design, verification and applications of a new read-write lock algorithm;a lock-free B+tree;brief announcement: the problem based benchmark suite;brief announcement: subgraph isomorphism on a multithreaded shared memory architecture;efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model;a scalable framework for heterogeneous GPU-based clusters;and faster and simpler width-independent parallelalgorithms for positive semidefinite programming.
The queue-read, queue-write (QRQW) parallel random access machine ( PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one...
详细信息
The queue-read, queue-write (QRQW) parallel random access machine ( PRAM) model permits concurrent reading and writing to shared memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. The QRQW PRAM model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW PRAM or EREW PRAM models, and can be efficiently emulated with only logarithmic slowdown on hypercube-type noncombining networks. This paper describes fast, low-contention, work-optimal, randomized QRQW PRAM algorithms for the fundamental problems of load balancing, multiple compaction, generating a random permutation, parallel hashing, and distributive sorting, These logarithmic or sublogarithmic time algorithms considerably improve upon the best known EREW PRAM algorithms for these problems, while avoiding the high-contention steps typical of CRCW PRAM algorithms. An illustrative experiment demonstrates the performance advantage of a new QRQW random permutation algorithm when compared with the popular EREW algorithm. Finally, this paper presents new randomized algorithms for integer sorting and general sorting. (C) 1996 Academic Press, Inc.
暂无评论