ISBN: (Print) 9780897917179
We study the problem of sorting on a parallel computer with limited communication bandwidth. Using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the trade-off between the amount of local computation and the amount of inter-processor communication required for parallel sorting algorithms. We prove a lower bound of Ω(n log m/m) on the time to sort n numbers in an exclusive-read variant of the PRAM(m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that both the upper and the lower bound can be adapted to bridging models that address the issue of limited communication bandwidth: the LogP model and the BSP model. The lower bounds provide convincing evidence that efficient parallel algorithms for sorting rely strongly on high communication bandwidth.
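For intuition about the matching upper bound, the following is a minimal serial sketch of Leighton's Columnsort, the eight-step algorithm the abstract builds on. The r × s layout and the divisibility requirements are the standard Columnsort assumptions, not parameters taken from the paper, and the paper's actual contribution, mapping the column sorts onto the m-bit shared memory of the PRAM(m) model, is not reproduced here.

```python
# A minimal serial sketch of Leighton's Columnsort (eight steps), assuming
# r divisible by s, r even, and r >= 2 * (s - 1) ** 2. The PRAM(m) mapping
# of column sorts onto the m-bit shared memory is NOT shown.
import math

def columnsort(values, r, s):
    assert len(values) == r * s and r % s == 0 and r % 2 == 0 and r >= 2 * (s - 1) ** 2
    # Keep the matrix as s columns of length r; the goal is a sorted
    # column-major order (concatenating the columns yields the sorted list).
    cols = [sorted(values[j * r:(j + 1) * r]) for j in range(s)]     # step 1: sort columns

    def transpose():
        # Pick the entries up in column-major order, lay them down row-major.
        flat = [x for c in cols for x in c]
        return [[flat[i * s + j] for i in range(r)] for j in range(s)]

    def untranspose():
        # Inverse of transpose: pick up row-major, lay down column-major.
        flat = [cols[k % s][k // s] for k in range(r * s)]
        return [flat[j * r:(j + 1) * r] for j in range(s)]

    cols = [sorted(c) for c in transpose()]                          # steps 2-3
    cols = [sorted(c) for c in untranspose()]                        # steps 4-5

    # Steps 6-8: shift every column down by r/2, padding with -inf above and
    # +inf below, sort the s + 1 shifted columns, then drop the padding.
    half = r // 2
    flat = [-math.inf] * half + [x for c in cols for x in c] + [math.inf] * half
    shifted = [sorted(flat[j * r:(j + 1) * r]) for j in range(s + 1)]
    return [x for c in shifted for x in c][half:half + r * s]

# Example: 54 values in an 18 x 3 layout (18 >= 2 * (3 - 1) ** 2).
import random
data = random.sample(range(1000), 54)
assert columnsort(data, r=18, s=3) == sorted(data)
```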
ISBN: (Print) 9780897917179
An efficient design and implementation of the collective communication part of a Message Passing Interface (MPI) that is optimized for clusters of workstations is described. The system, which consists of two main components, the MPI-CCL layer and a User-level Reliable Transport Protocol (URTP), is integrated with the operating system via an efficient kernel extension mechanism. The system is then implemented on a collection of IBM RS/6000 workstations connected via a 10 Mbit Ethernet LAN. Results indicate that the performance of the MPI Broadcast (on top of Ethernet) is about twice as fast as a recently published software implementation of broadcast on top of ATM.
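The collective operation being benchmarked is the MPI broadcast. The sketch below shows a typical way to invoke and time a broadcast from Python with mpi4py; mpi4py, the message size, and the timing harness are illustrative assumptions, not part of the MPI-CCL/URTP system described above, which implements the collective layer itself on its own reliable transport.

```python
# A sketch of invoking and timing the MPI broadcast collective from Python via
# mpi4py (an assumption for illustration; the system above targets the C-level
# MPI interface on RS/6000 workstations).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# The root prepares the payload; every other rank passes None and receives a copy.
payload = list(range(1024)) if rank == 0 else None
payload = comm.bcast(payload, root=0)

# Time one broadcast with all ranks starting together.
comm.Barrier()
start = MPI.Wtime()
comm.bcast(payload, root=0)
elapsed = MPI.Wtime() - start
if rank == 0:
    print(f"broadcast of {len(payload)} ints took {elapsed * 1e6:.1f} microseconds")
```

Launched with something like `mpiexec -n 8 python bcast_demo.py` (a hypothetical file name), every rank participates in the same collective call; the paper's measurements compare exactly this operation over Ethernet against a published ATM-based implementation.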
ISBN: (Print) 9780897917179
An n × m (0,1)-matrix is said to satisfy the consecutive-ones property if there is a permutation of the rows of the matrix such that in each column all non-zero entries are adjacent. The problem of determining such a permutation, if one exists, is the consecutive-ones property problem. Previously, Klein and Reif [13] gave a parallel solution for the consecutive-ones property problem with an algorithm based on complicated parallel PQ-tree manipulations. The work complexity of this algorithm was improved in [14] to run in time O(log² n) with a linear number of CRCW processors. We present a new algorithm for this problem, based on a less sophisticated data structure, that improves upon the processor bounds of the previous algorithms by a factor of log n/log log n in general, and by a factor of log n for sufficiently dense problem instances. Our algorithm uses a novel divide-and-conquer approach, and relies on the decomposition of graphs into triconnected components as its fundamental data structure. Solutions to the consecutive-ones problem have important applications to a variety of problems in computational molecular biology, databases, distributed computing, VLSI placement and routing, and graph and network theory.
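To make the definition concrete, the sketch below checks whether a given row permutation witnesses the consecutive-ones property for a small (0,1)-matrix. It is only an illustrative verifier of the property; the parallel algorithm for finding such a permutation, which is the subject of the paper, is not reproduced here.

```python
# Illustrative verifier of the definition only: given a (0,1)-matrix (a list of
# rows) and a candidate row permutation, check that in every column the ones
# are adjacent. Finding such a permutation is the actual problem solved by the paper.

def has_consecutive_ones(matrix, row_order):
    permuted = [matrix[i] for i in row_order]
    n_cols = len(permuted[0]) if permuted else 0
    for j in range(n_cols):
        ones = [i for i, row in enumerate(permuted) if row[j] == 1]
        # The ones in column j are consecutive iff they fill the whole span
        # between the first and last row containing a one.
        if ones and max(ones) - min(ones) + 1 != len(ones):
            return False
    return True

m = [[1, 0],
     [0, 1],
     [1, 0]]
print(has_consecutive_ones(m, [0, 1, 2]))  # False: column 0 has ones in rows 0 and 2
print(has_consecutive_ones(m, [0, 2, 1]))  # True: placing row 2 second makes them adjacent
```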
ISBN: (Print) 9780897917179
In this paper we study the question: How useful is randomization in speeding up Exclusive Write PRAM computations? Our results give further evidence that randomization is of limited use in these types of computations. First we examine a compaction problem on both the CREW and EREW PRAM models, and we present randomized lower bounds which match the best deterministic lower bounds known. (For the CREW PRAM model, the lower bound is asymptotically optimal). These are the first non-trivial randomized lower bounds known for the compaction problem on these models. We show that our lower bounds also apply to the problem of approximate compaction. Next we examine the problem of computing boolean functions on the CREW PRAM model, and we present a randomized lower bound which improves on the previous best randomized lower bound for many boolean functions, including the OR function. (The previous lower bounds for these functions were asymptotically optimal, but we improve the constant multiplicative factor). We also give an alternate proof for the randomized lower bound on PARITY, which was already optimal to within a constant additive factor. Lastly, we give a randomized lower bound for integer merging on an EREW PRAM which matches the best deterministic lower bound known. In all our proofs, we use the Random Adversary method, which has previously only been used for proving lower bounds on models with Concurrent Write capabilities. Thus this paper also serves to illustrate the power and generality of this method for proving parallel randomized lower bounds.
ISBN: (Print) 9780897917742
This paper presents a new parallel volume rendering algorithm that can render 256³-voxel medical data sets at over 10 Hz and 128³-voxel data sets at over 30 Hz on a 16-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently reported shear-warp serial volume rendering algorithm, which is over five times faster than previous serial algorithms. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.
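As a rough illustration of the dynamic load-balancing idea mentioned above, the sketch below hands per-scanline rendering tasks to worker threads through a shared work queue. The task granularity, the stand-in per-scanline computation, and the image size are placeholders rather than details of the shear-warp implementation, and CPython's global interpreter lock means the sketch only demonstrates the scheduling structure, not actual parallel speedup.

```python
# Dynamic load balancing over image-space tasks: idle workers pull the next
# scanline from a shared queue. The per-scanline work and sizes are placeholders,
# not the paper's shear-warp renderer.
import queue
import threading

def render_scanline(y, width):
    # Stand-in for compositing one intermediate-image scanline.
    return [(x * y) % 256 for x in range(width)]

def render_image(height=256, width=256, num_workers=16):
    tasks = queue.Queue()
    for y in range(height):
        tasks.put(y)
    image = [None] * height

    def worker():
        while True:
            try:
                y = tasks.get_nowait()   # whichever worker is free takes the next scanline
            except queue.Empty:
                return
            image[y] = render_scanline(y, width)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image

frame = render_image()
```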
ISBN: (Print) 0897916719
The proceedings contain 38 papers. The topics discussed include: efficient compilation of high-level data parallel algorithms; improved abstract parity-declustered layouts for disk arrays; experiences with parallel n-body simulation; a comparison of parallel algorithms for connected components; studying overheads in massively parallel min/max-tree evaluation; scheduling trees using FIFO queues: a control-memory tradeoff; SIMD instruction cache; dynamic parallel tree contraction; improved bounds for routing and sorting on multi-dimensional meshes; parallel sorting by overpartitioning; an optimal randomized logarithmic time connectivity algorithm for the EREW PRAM; list ranking and list scan on the CRAY C-90; modeling communication in parallel algorithms: a fruitful interaction between theory and systems?; on testing cache-coherent shared memories; diffracting trees; programming abstract DEC-alpha based multiprocessors the easy way; scheduling parallelizable tasks to minimize average response time; and an analysis of diffusive load-balancing.