List ranking and list scan are two primitive operations used in many parallelalgorithms that use list, trees, and graph data structures. But vectorizing and parallelizing list ranking is a challenge because it is hig...
详细信息
This volume of the proceedings contains 42 articles devoted to parallel processing algorithms and architectures used in computers with multiple processors. algorithms are presented for routing the processing steps, ma...
详细信息
ISBN:
(纸本)089791483X
This volume of the proceedings contains 42 articles devoted to parallel processing algorithms and architectures used in computers with multiple processors. algorithms are presented for routing the processing steps, mapping the data paths, sorting and interconnecting networks. Operating system program algorithms are discussed which optimize performance for hypercube, mesh connected arrays and to order sets, to select and label gray scale images and to schedule tasks.
In this paper we study the overheads arising in our algorithm for distributed evaluation of Min/Max trees. The overheads are classified into search overhead, performance loss, and decrease of work load. Several mechan...
详细信息
We consider the problem of finding the row maxima of a totally monotone n×n matrix. Many geometric and combinatorial problems can be solved efficiently by reductions to this problem. It is known that the row maxi...
详细信息
ISBN:
(纸本)0898713293
We consider the problem of finding the row maxima of a totally monotone n×n matrix. Many geometric and combinatorial problems can be solved efficiently by reductions to this problem. It is known that the row maxima of a totally monotone n×n matrix can be computed sequentially in O(n) time. We give randomized parallelalgorithms for this problem which run in O(log n) time and O(log log n) time with high probability on the EREW and CRCW PRAMs respectively, and have a time-processor product of O(n). The run-times are the best possible, even for randomized parallelalgorithms. Previous polylogarithmic time parallel solutions to this problem had a time-processor product of Θ(n log n/log log n) or worse. Our result directly improves the parallel complexity of several problems.
The fundamental problems in dynamic load balancing and job scheduling in parallel and distributed computers involve moving load between processors. In this paper, we consider a new model for load movement in synchrono...
详细信息
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) ar...
详细信息
ISBN:
(纸本)0818656026
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) architectures, but less work has been done for Cache Only Memory Access (COMA) or attraction memory [1] architectures such as the KSR-1. In this paper, we presented two new barrier algorithms that offer the best performance we have recorded on the KSR-1 distributed cache multiprocessor. We discuss the trade-offs and the performance of seven algorithms on two architectures. The new barrier algorithms adapt well to a hierarchical caching memory model and take advantage of parallel communication offered by most multiprocessor interconnection networks,. Performance results are shown for a 256-processor KSR-1 and a 20-processor Sequent Symmetry.
The queue-read, queue-write (QRQW) PRAM model [GMR94] permits concurrent reading and writing, but at a cost proportional to the number of readers/writers to a memory location in a given step. The QRQW model reflects t...
详细信息
ISBN:
(纸本)9780897916714
The queue-read, queue-write (QRQW) PRAM model [GMR94] permits concurrent reading and writing, but at a cost proportional to the number of readers/writers to a memory location in a given step. The QRQW model reflects the contention properties of most parallel machines more accurately than either the well-studied CRCW or EREW models: the CRCW model does not adequately penalize algorithms with high contention to shared memory locations, while the EREW model is too strict in its insistence on zero contention at each step. Of primary practical and theoretical interest, then, is the design of fast and efficient QRQW algorithms for problems for which all previous algorithms either suffer from high contention, fail to be fast, or fail to be *** paper describes low-contention, fast, work-optimal QRQW PRAM algorithms for the fundamental problems of finding a random permutation, parallel hashing, load balancing, and sorting. There is no known fast, work-optimal EREW algorithm known for finding a random permutation or for parallel hashing. For load balancing, we improve upon the EREW result whenever the ratio of the maximum to the average load is not too large. We show that the logarithmic dependence of the QRQW running time on this ratio is inherent by providing a matching lower *** demonstrate the performance advantage of a QRQW random permutation algorithm, compared with the popular EREW algorithm, by implementing and running both algorithms on the MasPar ***, we extend the work-time framework for the design of parallelalgorithms to account for contention, and relate it to the QRQW PRAM model. We use our QRQW load balancing algorithm, as well as the QRQW linear compaction algorithm in [GMR94], to provide automatic tools for processor allocation—an issue that needs to be handled when translating an algorithm from its work-time presentation into the explicit PRAM description.
parallelalgorithms for various versions of the stable matching problem are presented. The algorithms are based on the primal-dual interior path-following method for linear programming. The main result is that a stabl...
详细信息
ISBN:
(纸本)0898713293
parallelalgorithms for various versions of the stable matching problem are presented. The algorithms are based on the primal-dual interior path-following method for linear programming. The main result is that a stable matching can be found in O*(√m) time by a polynomial number of processors, where m is the total length of preference lists of individuals.
We devise optimum parallelalgorithms for solving a banded linear system of equations and for computing the determinant of a banded matrix, substantially improving the previous record computational complexity estimate...
详细信息
ISBN:
(纸本)0898713293
We devise optimum parallelalgorithms for solving a banded linear system of equations and for computing the determinant of a banded matrix, substantially improving the previous record computational complexity estimates of [E]. Our algorithms are in NC or RNC and support the optimum bounds on the potential work (the product of time and processor bounds). Moreover, these algorithms are in NC1 or RNC1 if the bandwidth is a constant.
This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across p...
详细信息
ISBN:
(纸本)9780897916714
This paper describes our experiences developing high-performance code for astrophysical N-body simulations. Recent N-body methods are based on an adaptive tree structure. The tree must be built and maintained across physically distributed memory; moreover, the communication requirements are irregular and adaptive. Together with the need to balance the computational work-load among processors, these issues pose interesting challenges and tradeoffs for high-performance *** implementation was guided by the need to keep solutions simple and general. We use a technique for implicitly representing a dynamic global tree across multiple processors which substantially reduces the programming complexity as well as the performance overheads of distributed memory architectures. The contributions include methods to vectorize the computation and minimize communication time which are theoretically and experimentally *** code has been tested by varying the number and distribution of bodies on different configurations of the Connection Machine CM-5. The overall performance on instances with 10 million bodies is typically over 30% of the peak machine rate. Preliminary timings compare favorably with other approaches.
暂无评论