A parallel sorting algorithm using cooperating heaps in a linear array of processors is presented. It can sort a sequence whose length is much larger than the number of processors. Because the output begins one step after all the items have been input, sorting n items requires 2n + 1 steps. Two independent modifications of the algorithm are possible: one tries to reduce the number of processors used, and the other can sort more items on the same array.
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the PRAM(m) model, where p processors communicate through a globally shared memory which can service m requests per unit time, we focus on the trade-off between the amount of local computation and the amount of interprocessor communication required for parallel sorting algorithms. Our main result is a lower bound of Omega((n log m)/(m log n)) on the time required to sort n numbers on the exclusive-read and queued-read variants of the PRAM(m). We also show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that both the lower and the upper bounds can be adapted to bridging models that address the issue of limited communication bandwidth: the LogP model and the bulk-synchronous parallel (BSP) model. The lower bounds provide further convincing evidence that efficient parallel algorithms for sorting rely strongly on high communication bandwidth.
This correspondence examines the problem of sorting on a network of processors, where each processor consists of a single storage register and a small control unit capable of comparing two numbers and has a single serial memory attached to it. We show how to sort optimally on one- or two-dimensional arrays of p processors in time Θ(n + (n^2/p^2)) and Θ((n/√p) + (n^2/p^2)), respectively. Because of the implementational advantages of serial memories, we feel that our architecture will be attractive for several applications.
We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n/log(h+1)) for h = Theta(n/p). The internal computation bound is optimal for any comparison-based sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p <= n^(1-1/c) for a constant c >= 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Omega(log n/log(h+1)) communication rounds.
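The constant-round claim in this abstract follows from a short calculation. The sketch below assumes, for simplicity, that h equals n/p exactly (the abstract only states h = Theta(n/p), so the hidden constant is ignored here):

```latex
% Assumption: h = n/p exactly; the abstract states only h = \Theta(n/p).
\[
p \le n^{1-1/c}
\;\Longrightarrow\;
h = \frac{n}{p} \ge n^{1/c}
\;\Longrightarrow\;
\log(h+1) \ge \frac{1}{c}\log n
\;\Longrightarrow\;
\frac{\log n}{\log(h+1)} \le c .
\]
```

Under that assumption the number of communication rounds, O(log n/log(h+1)), is O(c), which is a constant whenever c is.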
The aim of the paper is to introduce techniques in order to tune sequential in-core sorting algorithms in the frameworks of two applications. The first application is parallel sorting when the processor speeds are not identical in the parallel system. The second application is the Zeta-Data Project [M. Koskas, A hierarchical database management algorithm, in: Annales 67 du Lamsade, vol. 2, 2004, pp. 277-317. [9]] whose aim is to develop novel algorithms for database issues. About 50% of the work done in building indexes is devoted to sorting sets of integers. We develop and compare algorithms built to sort with equal keys. The algorithms are variations of the 3Way-Quicksort of Sedgewick. In order to observe performance, to fully exploit the functional units in processors, and to optimize the use of the memory system, we use hardware performance counters that are available on most modern microprocessors. We also develop analytical results for one of our algorithms and compare expected results with the measures. For the two applications, we show, through fine experiments on an Athlon processor (a three-way superscalar x86 processor), that L1 data cache misses are not the central problem, but a subtle proportion of independent retired instructions should be advised to get performance for in-core sorting. (C) 2006 Elsevier B.V. All rights reserved.
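For readers unfamiliar with three-way quicksort, the following is a minimal sketch of the textbook three-way partitioning scheme for inputs with many equal keys. It is not one of the tuned variants studied in the paper; the function name `quicksort_3way` and the random pivot choice are assumptions made for illustration.

```python
import random


def quicksort_3way(a, lo=0, hi=None):
    """Sort list `a` in place with three-way partitioning and return it.

    Illustrative sketch only: the name and the random pivot rule are
    assumptions, not taken from the paper.
    """
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        lt, i, gt = lo, lo, hi
        # Invariant: a[lo:lt] < pivot, a[lt:i] == pivot, a[gt+1:hi+1] > pivot.
        while i <= gt:
            if a[i] < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif a[i] > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        # Keys equal to the pivot are already in place and never revisited.
        # Recurse into the smaller side, loop on the larger one, so the
        # recursion depth stays logarithmic.
        if lt - lo < hi - gt:
            quicksort_3way(a, lo, lt - 1)
            lo = gt + 1
        else:
            quicksort_3way(a, gt + 1, hi)
            hi = lt - 1
    return a
```

Because every key equal to the pivot is excluded from both recursive calls, an input consisting of a single repeated value is finished after one partitioning pass.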
The problem of sorting n elements using p processors in a parallel comparison model is considered. Lower and upper bounds which imply that for p >= n, the time complexity of this problem is ...
High-speed electronic sorting networks are difficult to implement with VLSI technology because of the dense and global connectivity required. Optics eliminates this bottleneck by offering global interconnections, massive parallelism, and noninterfering communications. We present a parallel sorting algorithm and its efficient optical implementation using currently available optical hardware. The algorithm sorts n data elements in a few steps, independent of the number of elements to be sorted. Thus, it is a constant-time sorting algorithm, that is, O(1) time.
We present a parallel sorting algorithm and its proof which sorts a sequence of n elements in time O(log^2 n) with n/2 processors on an EREW-PRAM computational model. A sorting network directly implements the algorithm using O(*** n) PEs. The algorithm is based on the elementary Compare-Exchange operation and has the advantage that it does not require a powerful computational model, uses the least amount of space for the sorting problem, has small constants and can be implemented directly on a sorting network. Furthermore, the architecture of the network is simple and makes no unrealistic technological assumptions.
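The abstract does not describe the algorithm itself. As a point of comparison, the sketch below uses Batcher's bitonic sort, a different but well-known compare-exchange network that has the same shape: O(log^2 n) parallel stages, each made of n/2 independent compare-exchange operations. The function names are hypothetical, and the input length is assumed to be a power of two.

```python
def compare_exchange(a, i, j, ascending):
    """Put a[i], a[j] in ascending order (or descending if not `ascending`)."""
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]


def bitonic_sort(a):
    """Sort `a` in place with a bitonic network; return the stage count.

    Stand-in technique, not the paper's algorithm. Each pass of the
    `for` loop is one parallel stage: its n/2 compare-exchanges touch
    disjoint pairs and could run simultaneously on n/2 processors.
    """
    n = len(a)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    stages = 0
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    compare_exchange(a, i, partner, (i & k) == 0)
            stages += 1
            j //= 2
        k *= 2
    return stages
```

For n = 8 the network runs log n (log n + 1)/2 = 6 stages, which is where the O(log^2 n) time bound with n/2 processors comes from in this stand-in.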
This paper presents a detailed analysis of a sampling approach used in the partitioning of a data file for the parallel balanced tree sort in a local area network or a multiprocessor environment. The average overall time complexity for sorting N data on a k processor system is derived. The performance of the parallel sorting rests upon how evenly the file can be partitioned into k ordered subfiles. A data partition scheme by sampling is proposed and analyzed. Formulas for computing the optimal sampling size are obtained. The results also show the computational improvement of the sorting as a function of k and the sampling overhead. The performance of the sampling method is studied and found to be approaching the absolute optimal in some cases.
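The general idea of partitioning by sampling can be sketched as follows. This is a generic illustration, not the paper's scheme: `sample_size` is a free parameter here rather than the optimal size the paper derives, Python's built-in `sorted` stands in for the balanced tree sort run on each processor, and all function names are hypothetical.

```python
import bisect
import random


def partition_by_sampling(data, k, sample_size, rng=random):
    """Split list `data` into k ordered buckets using splitters from a sample.

    Illustrative sketch: every key in bucket j is <= every key in bucket j+1.
    """
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    # Choose k-1 splitters at evenly spaced positions in the sorted sample.
    splitters = [sample[(i * len(sample)) // k] for i in range(1, k)] if sample else []
    buckets = [[] for _ in range(k)]
    for x in data:
        buckets[bisect.bisect_right(splitters, x)].append(x)
    return buckets


def sample_sort(data, k, sample_size):
    """Sort each bucket independently (one per processor) and concatenate."""
    result = []
    for bucket in partition_by_sampling(data, k, sample_size):
        result.extend(sorted(bucket))  # stand-in for the per-processor sort
    return result
```

The largest bucket bounds the parallel running time, so a larger sample tends to give more even buckets at the cost of more sampling overhead, which is the trade-off the abstract refers to.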