Operations on basic. data structures such as, queues, priority queues, stacks, and counters c an dominate the execution time of a parallel program due to both their frequency and their coordination and contention over...
详细信息
Operations on basic. data structures such as, queues, priority queues, stacks, and counters c an dominate the execution time of a parallel program due to both their frequency and their coordination and contention overheads. There are considerable performance payoffs in developing highly optimized, asynchronous, distributed, cache-conscious, parallel implementations of such data structures. Such implementations may employ a variety of tricks to reduce latencies and avoid serial bottlenecks, as long as the semantics of the data structure are preserved The complexity of the implementation and the difficulty in reasoning about asynchronous systems increases concerns regarding possible bugs in the implementation. In this paper we consider postmortem, black-box procedures for testing whether a, parallel data structure:behaved correctly. We present the first systematic study of algorithms and hardness results for such testing procedures, focusing on queues, priority queues, stacks, and counters, under various important scenarios. Our results demonstrate the importance of selecting test data such that distinct values are inserted into them data structure (as appropriate). In such cases we present an O (n) time algorithm for testing linearizable queues, an O (n log n) time algorithm for testing linearizable priority queues, and an O (np(2)) time algorithm for testing sequentially consistent queues, where n is the number of data structure operations and p is the number of processors. In contrast, we show that testing such data structures for executions with arbitrary input values is NP-complete. Our results also help clarify the thresholds between scenarios that admit polynomial time solutions and those that are NP-complete. Our algorithms are the first nontrivial algorithms for these problems.
Operations on basic. data structures such as, queues, priority queues, stacks, and counters c an dominate the execution time of a parallel program due to both their frequency and their coordination and contention over...
Operations on basic. data structures such as, queues, priority queues, stacks, and counters c an dominate the execution time of a parallel program due to both their frequency and their coordination and contention overheads. There are considerable performance payoffs in developing highly optimized, asynchronous, distributed, cache-conscious, parallel implementations of such data structures. Such implementations may employ a variety of tricks to reduce latencies and avoid serial bottlenecks, as long as the semantics of the data structure are preserved The complexity of the implementation and the difficulty in reasoning about asynchronous systems increases concerns regarding possible bugs in the implementation. In this paper we consider postmortem, black-box procedures for testing whether a, parallel data structure:behaved correctly. We present the first systematic study of algorithms and hardness results for such testing procedures, focusing on queues, priority queues, stacks, and counters, under various important scenarios. Our results demonstrate the importance of selecting test data such that distinct values are inserted into them data structure (as appropriate). In such cases we present an O (n) time algorithm for testing linearizable queues, an O (n log n) time algorithm for testing linearizable priority queues, and an O (np(2)) time algorithm for testing sequentially consistent queues, where n is the number of data structure operations and p is the number of processors. In contrast, we show that testing such data structures for executions with arbitrary input values is NP-complete. Our results also help clarify the thresholds between scenarios that admit polynomial time solutions and those that are NP-complete. Our algorithms are the first nontrivial algorithms for these problems.
External sorting-the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks-is a fundamental subroutine in database systems [G], [IBM]. Of pri...
External sorting-the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks-is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI], whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM
The utility of algorithm parallelism for coping with increased processor to memory latencies using "latency hiding" is part of the folklore of parallel computing. Latency hiding techniques increase the traff...
详细信息
ISBN:
(纸本)9781581135299
The utility of algorithm parallelism for coping with increased processor to memory latencies using "latency hiding" is part of the folklore of parallel computing. Latency hiding techniques increase the traffic to memory and therefore may "hit another wall": limited bandwidth to memory. The current paper attempts to stimulate research in the following general direction: show that algorithm parallelism need not conflict with limited bandwidth.A general technique for using parallelalgorithms to enhance serial implementation in the face of processor-memory latency problems is revisited. Two techniques for alleviating memory bandwidth constraints are presented. Both techniques can be incorporated in a *** is often considerable parallelism in many of the algorithms which are known as useful serial algorithms. Interestingly enough, all the examples provided for the use of the two techniques come from such serial algorithms.
External sorting-the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks-is a fundamental subroutine in database systems [G], [IBM]. Of pri...
详细信息
External sorting-the process of sorting a file that is too large to fit into the computer's internal memory and must be stored externally on disks-is a fundamental subroutine in database systems [G], [IBM]. Of prime importance are techniques that use multiple disks in parallel in order to speed up the performance of external sorting. The simple randomized merging (SRM) mergesort algorithm proposed by Barve et al. [BGV] is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth [K, Section 5.4.9] recently identified SRM (which he calls "randomized striping") as the method of choice for sorting with parallel disks. In this paper we present an efficient implementation of SRM, based upon novel and elegant data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to read blocks of input runs efficiently during external merging. Our implementation is based on synchronous parallel I/O primitives provided by the TPIE programming environment [TPI], whenever our program issues an I/O read (write) operation, one block of data is synchronously read from (written to) each disk in parallel. We compare the performance of SRM over a wide range of input sizes with that of disk-striped mergesort (DSM), which is widely used in practice. DSM consists of a standard mergesort in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is often significantly faster than with DSM
In this paper we present a coarse-grained parallel algorithm for solving the string edit distance problem for a string A and all substrings of a string C. Our method is based on a novel CGM/BSP parallel dynamic progra...
详细信息
ISBN:
(纸本)9781581135299
In this paper we present a coarse-grained parallel algorithm for solving the string edit distance problem for a string A and all substrings of a string C. Our method is based on a novel CGM/BSP parallel dynamic programming technique for computing all highest scoring paths in a weighted grid graph. The algorithm requires \log p rounds/supersteps and O(\fracn^2p\log m) local computation, where $p$ is the number of processors, p^2 \leq m \leq n. To our knowledge, this is the first efficient CGM/BSP algorithm for the alignment of all substrings of C with A. Furthermore, the CGM/BSP parallel dynamic programming technique presented is of interest in its own right and we expect it to lead to other parallel dynamic programming methods for the CGM/BSP.
The proceedings contains 48 papers from Thirteen annualacmsymposium on parallelalgorithms and architectures. Topics discussed include: compact routing schemes;simple on-line algorithms for the maximum disjoint path...
详细信息
The proceedings contains 48 papers from Thirteen annualacmsymposium on parallelalgorithms and architectures. Topics discussed include: compact routing schemes;simple on-line algorithms for the maximum disjoint paths problem;competitive buffer management for shared-memory switches;attack propagation in networks;computational power of pipelined memory hierarchies;and a data tracking scheme for general networks.
A family of deterministic asynchronous Write-All algorithms were studied to analyze the properties of the set of permutations proposed by Kanellakis and Shvartsman. The efficiency of the algorithms was measured in ter...
详细信息
A family of deterministic asynchronous Write-All algorithms were studied to analyze the properties of the set of permutations proposed by Kanellakis and Shvartsman. The efficiency of the algorithms was measured in terms of work acounted for all machine instructions executed by processors. It was found that the analytical results covered only a subset of the possible adversarial patterns of asynchrony. The analysis suggested that the proposed method yielded a faster construction of the Write-All algorithms compared to other methods.
A new block algorithm for triangularization of regular or singular matrices with dimension m × n, is proposed. Taking benefit of fast block multiplication algorithms, it achieves the best known sequential complex...
详细信息
A new block algorithm for triangularization of regular or singular matrices with dimension m × n, is proposed. Taking benefit of fast block multiplication algorithms, it achieves the best known sequential complexity O(mω-1n) for any sizes and any rank. Moreover, the block strategy enables to improve locality with respect to previous algorithms as exhibited by practical performances.
暂无评论