This paper introduces the queue-read queue-write (QRQW) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of...
详细信息
This paper introduces the queue-read queue-write (QRQW) parallel random access machine (pram) model, which permits concurrent reading and writing to shared-memory locations, but at a cost proportional to the number of readers/writers to any one memory location in a given step. Prior to this work there were no formal complexity models that accounted for the contention to memory locations, despite its large impact on the performance of parallel programs. The QRQW pram model reflects the contention properties of most commercially available parallel machines more accurately than either the well-studied CRCW pram or EREW pram models: the CRCW model does not adequately penalize algorithms with high contention to shared-memory locations, while the EREW model is too strict in its insistence on zero contention at each step. The QRQW PRAM is strictly more powerful than the EREW pram. This paper shows a separation of root log n between the two models, and presents faster and more efficient QRQW algorithms for several basic problems, such as linear compaction, leader election, and processor allocation. Furthermore, we present a work-preserving emulation of the QRQW pram with only logarithmic slowdown on Valiant's bsp model, and hence on hypercube-type noncombining networks, even when latency, synchronization, and memory granularity overheads are taken into account. This matches the best-known emulation result for the EREW pram, and considerably improves upon the best-known efficient emulation for the CRCW pram on such networks. Finally, the paper presents several lower bound results for this model, including lower bounds on the time required for broadcasting and for leader election.
A promising new massively parallel technique for rigorous simulation of topography scattering in optical lithography has been developed and tested. The method is equivalent to the time-domain finite-difference method ...
详细信息
A promising new massively parallel technique for rigorous simulation of topography scattering in optical lithography has been developed and tested. The method is equivalent to the time-domain finite-difference method (TDFDM) used in electromagnetic scattering simulations, but exploits the parallel nature of wave propagation and the power of recent massively parallel architectures such as the Connection Machine. A working code called TEMPEST has been implemented on a Connection Machine CM-2 having 1 to 32 K processors with up to 1 M virtual processors. Numerical accuracy comparable with that of other fully rigorous methods was achieved. The convergence in the time domain for lithographic problems was found to be dominated by the physical process of multiple scattering of incident radiation. A very significant finding was that the solution required constant time per iteration for problems ranging from a few thousand unknowns up to one million, providing the ratio between the problem size and the number of processors is kept constant. The suitability of TEMPEST for solving a large class of topography structures important in alignment, metrology, and lithography is illustrated by examples from linewidth measurement with thin-film interference effects and projection printing in the presence of specular reflections from curved substrates.
作者:
ALNUWEIRI, HMKUMAR, VKP1. EEB-244
Department of Electrical Engineering Systems University of Southern California 90089-2562 Los Angeles CA USA
We present processor-time optimal parallel algorithms for several problems on n x n digitized image arrays, on a mesh-connected array having p processors and a memory of size O(n2) words. The number of processors p ca...
详细信息
We present processor-time optimal parallel algorithms for several problems on n x n digitized image arrays, on a mesh-connected array having p processors and a memory of size O(n2) words. The number of processors p can vary over the range [1, n3/2] while providing optimal speedup for these problems. The class of image problems considered here includes labeling the connected components of an image;computing the convex hull, the diameter, and a smallest enclosing box of each component;and computing all closest neighbors. Such problems arise in medium-level vision and require global operations on image pixels. To achieve optimal performance, several efficient data-movement and reduction techniques are developed for the proposed organization.
This paper presents very fast parallel algorithms for approximate edge coloring. Let log((1)) n = log n, log((k)) n = log(log((k-1)) n), and log*(n) = min{k \ log((k)) n /log(c/4) log*(n)])2 colors in O(log log*(n)) t...
详细信息
This paper presents very fast parallel algorithms for approximate edge coloring. Let log((1)) n = log n, log((k)) n = log(log((k-1)) n), and log*(n) = min{k \ log((k)) n < 1}. It is shown that a graph with n vertices and in edges can he edge colored with (2[log(1/4)log*(n)])(c) ([/log(c/4) log*(n)])2 colors in O(log log*(n)) time using O(m + n) processors on the EREW PRAM, where Delta is the maximum vertex degree of the graph and c is an arbitrarily large constant. It is also shown that the graph can he edge colored using at most [4 Delta (1+4/log log log*(Delta)) log(1/2) log*(Delta )1 colors in O(log Delta log log*(Delta)/log log log* (Delta) + log log*(n)) time using O(m + n) processors on the same model. O 2001 Elsevier Science B.V. All rights reserved.
In this paper, a partitioned-parallel strategy for the solution of the Toeplitz equations appearing in the linear prediction case is introduced. From the one extreme, i.e., the use of the Schur recursions for an order...
详细信息
In this paper, a partitioned-parallel strategy for the solution of the Toeplitz equations appearing in the linear prediction case is introduced. From the one extreme, i.e., the use of the Schur recursions for an order recursive implementation with O(1) processors which perform O(p2) MADs (multiplications and divisions), to the other extreme, i.e., the use of the same recursions for a fully parallel implementation with O(p) processors which perform O(p) MADs, there exists a compromise: The hardware designer can 'cut' the computational scheme into suitable partitions, which are executed one after the other, with all the computations of each partition organized in a parallel manner. This way he can achieve increased flexibility, especially in relation to the model order, which can become totally independent of the available number of processors. Moreover, in this paper an abatement methodology is introduced which significantly reduces the number of multiplications of the above computational schemes, as well as the overall algorithm complexity in the case of the parallel design.
A property of algorithms called synergy is introduced, and a quantity s , of synergy is defined. When synergized, both parallel and serial algorithms run faster, the parallel algorithms benefiting from a cooperation b...
详细信息
A property of algorithms called synergy is introduced, and a quantity s , of synergy is defined. When synergized, both parallel and serial algorithms run faster, the parallel algorithms benefiting from a cooperation between processors. Examples show that synergy is a useful concept in the design of parallel algorithms.
A new alternating group explicit method for solving convection dominated diffusion problem of variable coefficient is presented, and the energy method is used to prove the stability of the method.
A new alternating group explicit method for solving convection dominated diffusion problem of variable coefficient is presented, and the energy method is used to prove the stability of the method.
We present an efficient technique for parallel manipulation of data structures that avoids memory access conflicts. That is, this technique works on the Exclusive Read/Exclusive Write (EREW) model of computation, whic...
详细信息
We present an efficient technique for parallel manipulation of data structures that avoids memory access conflicts. That is, this technique works on the Exclusive Read/Exclusive Write (EREW) model of computation, which is the weakest shared memory, MIMD machine model. It is used in a new parallel radix sort algorithm that is optimal for keys whose values are over a small range. Using the radix sort and known results for parallel prefix on linked lists, we develop parallel algorithms that efficiently solve various computations on trees and “unicycular graphs.” Finally, we develop parallel algorithms for connected components, spanning trees, minimum spanning trees, and other graph problems. All of the graph algorithms achieve linear speedup for all but the sparsest graphs.
Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty predicting memory access patterns or execution paths. Within an irregular application, fine-grained s...
详细信息
Irregular parallel algorithms pose a significant challenge for achieving high performance because of the difficulty predicting memory access patterns or execution paths. Within an irregular application, fine-grained synchronization is one technique for managing the coordination of work;but in practice the actual performance for irregular problems depends on the input, the access pattern to shared data structures, the relative speed of processors, and the hardware support of synchronization primitives. In this paper, we focus on lock-free and mutual exclusion protocols for handling fine-grained synchronization. Mutual exclusion and lock-free protocols have received a fair amount of attention in coordinating accesses to shared data structures from concurrent processes. Mutual exclusion offers a simple programming abstraction, while lock-free data structures provide better fault tolerance and eliminate problems associated with critical sections such as priority inversion and deadlock. These synchronization protocols, however, are seldom used in parallel algorithm designs, especially for algorithms under the SPMD paradigm, as their implementations are highly hardware dependent and their costs are hard to characterize. Using graph-theoretic algorithms for illustrative purposes, we show experimental results on two shared-memory multiprocessors, the IBM pSeries 570 and the Sun Enterprise 4500, that irregular parallel algorithms with efficient fine-grained synchronization may yield good performance. (C) 2006 Elsevier Inc. All rights reserved.
A class of parallel and parallel/pipeline algorithms for computation of the manipulator inertial matrix is presented. An algorithm based on the composite rigid-body spatial inertia method, which results in less data d...
详细信息
A class of parallel and parallel/pipeline algorithms for computation of the manipulator inertial matrix is presented. An algorithm based on the composite rigid-body spatial inertia method, which results in less data dependency and hence better parallelization efficiency, is used for computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound of O((log/sub 2/ n))+O(1) in the computation with O(n/sup 2/) processors. The architectural features required for perfect mapping of these algorithms and their communication complexity are analyzed. The performance of the algorithms when mapped on two- and one-dimensional (linear) processor arrays with nearest-neighbor connection is investigated. Mapping on the linear array results in new algorithms with a computational complexity of k/sub 1/n(log/sub 2/n)+k/sub 2/(log/sub 2/n)+k/sub 3/. A parallel/pipeline algorithm is also presented which achieves the computation time of k/sub 1/n+k/sub 2/(log/sub 2/ n)+k/sub 3/ on the linear array. An architecture-oriented approach is used in the design of the algorithms.< >
暂无评论