An optimal lower bound on the average time required by any algorithm that merges two sorted lists on Valiant's parallel computation tree model is proven.
This survey considers the design, composition, and capabilities of several programming systems for vector-pipeline supercomputers. We consider both industrial and research products. A prognosis is given for future development of this area of system programming.
This paper presents, in compressed form, fundamental notions, methods, and results from the theory of parallel computation. Results are given on the parallel complexity of problems in algebra, combinatorics, integer arithmetic, graph theory, computational geometry, and other fields.
In this article we describe a parallel algorithm which applies Givens rotations to selectively annihilate k(k + 1)/2 nonzero elements from two k × n (k ≤ n) upper trapezoidal submatrices. The new algorithm we propose is suitable for implementation on either a pair of directly connected local-memory processors or two clusters of multiple tightly-coupled processors. Our analyses show that in both cases the proposed algorithms achieve optimal speed-up by balancing the workload distribution and masking inter-processor or inter-cluster communication by computation if k ≪ n. In the context of solving large-scale least squares problems, this submatrix merging step is needed repeatedly throughout the computation and, furthermore, there are usually many pairs of such submatrices to be merged, with each submatrix stored in the memory of a processor or a cluster of processors. The proposed algorithm can be applied to each pair of submatrices concurrently and thus parallelizes an important step in solving the least squares problems.
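The following sequential NumPy sketch is only meant to illustrate the basic operation being parallelized; the elimination order, helper names, and the single loop nest are my own assumptions for illustration, not the paper's two-processor scheme.

import numpy as np

def givens(a, b):
    # Rotation (c, s) with  c*a + s*b = r  and  -s*a + c*b = 0.
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def merge_trapezoidal(R1, R2):
    # R1, R2: k x n upper trapezoidal (k <= n).  Annihilates the k(k+1)/2
    # leading-triangle nonzeros of R2 against the corresponding rows of R1.
    R1, R2 = R1.copy(), R2.copy()
    k, _ = R1.shape
    for i in range(k):            # row i of R2 ...
        for j in range(i, k):     # ... loses its entries in columns i..k-1
            c, s = givens(R1[j, j], R2[i, j])
            top = c * R1[j, j:] + s * R2[i, j:]
            bot = -s * R1[j, j:] + c * R2[i, j:]
            R1[j, j:], R2[i, j:] = top, bot
    return R1, R2

Counting the annihilated entries, row i of R2 loses k − i of them; summing over i gives the k(k + 1)/2 figure quoted above.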
This paper describes an implementation and performance evaluation of file replication in a locally distributed system. Different mechanisms are used to update the copies of a replicated file, using both sequential and concurrent update methods. When updating the files sequentially, the algorithm is executed on the host on which the file has been written. Concurrent update is carried out either by one machine or by all the remote machines on which the file is replicated. These algorithms are implemented in a locally distributed system, and their performance is compared for different loads and for various read and write accesses to the files.
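As a rough, hypothetical illustration of the difference between the two update styles (not the paper's actual protocol), the Python sketch below propagates a write to a set of replicas either one after another or concurrently via threads; the replica names and the send_update stand-in are invented for the example.

import threading
import time

REPLICAS = ["hostA", "hostB", "hostC"]  # hypothetical replica hosts

def send_update(host, data):
    # Stand-in for a remote write; a real system would use RPC or sockets.
    time.sleep(0.05)
    print(f"updated copy on {host}")

def update_sequentially(data):
    # The writing host updates each remote copy one after another.
    for host in REPLICAS:
        send_update(host, data)

def update_concurrently(data):
    # All remote copies are updated in parallel; wait for every one to finish.
    threads = [threading.Thread(target=send_update, args=(h, data)) for h in REPLICAS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()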
The authors derive optimal lower bounds for this problem, with the following area complexity:

A = Θ(m(log n − log m + 1)) for k ≥ log n,
A = Θ(min{2^k, m}(|k − log m| + 1)) for k ≤ log n,

provided that m ≤ n. From this result it follows that merging is in general easier than sorting of (m + n)-element arrays. On the other hand, if m = n and k ≤ log n, then these problems are of the same area complexity. Finally, the paper completes the investigation of the area complexity of the problems of ordering.
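As a quick worked substitution (not an additional claim from the paper), setting m = n in the first bound collapses the logarithmic factor:

A = \Theta\bigl(m(\log n - \log m + 1)\bigr)\big|_{m=n} = \Theta\bigl(n(\log n - \log n + 1)\bigr) = \Theta(n) \quad \text{for } k \ge \log n.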
A software system is described for the compression of a large look-up table to a smaller one, consistent with a worst-case error predefined by the user. The tables and a suitable source code for accessing them are automatically generated, with very little user intervention. The techniques of linear interpolation and the partitioning of one table into several are shown to be particularly attractive for reducing the table size, especially when the considerable effort of manual generation to a known accuracy is removed. The use of linear interpolation incurs only a small speed penalty when executed on a digital signal processor and the large reductions in table size thus achieved can make the method a faster and more reliable alternative to either the exact or approximate evaluation of many functions.
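A minimal sketch of the table-with-linear-interpolation idea, assuming a uniformly spaced table of sine values; the table size, function names, and error check are illustrative choices, not taken from the described system.

import numpy as np

# Hypothetical compressed table: N+1 uniformly spaced samples of sin on [0, pi/2].
N = 64
STEP = (np.pi / 2) / N
TABLE = np.sin(np.linspace(0.0, np.pi / 2, N + 1))

def sin_lut(x):
    # Approximate sin(x) on [0, pi/2] by linear interpolation between table entries.
    i = min(int(x / STEP), N - 1)       # clamp so index i+1 stays in range
    frac = (x - i * STEP) / STEP        # position between samples i and i+1
    return TABLE[i] + frac * (TABLE[i + 1] - TABLE[i])

# Worst-case error over a dense grid, mirroring the user-specified error bound idea.
grid = np.linspace(0.0, np.pi / 2, 10001)
err = max(abs(sin_lut(v) - np.sin(v)) for v in grid)
print(f"max error with {N + 1} entries: {err:.2e}")

Halving N roughly quadruples the worst-case error of linear interpolation, which is the trade-off such a table generator lets the user steer with a predefined error bound.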
Author:
LIN, A.
NASA Lewis Research Center, Institute for Computational Mechanics in Propulsion, Cleveland, OH 44135
In the present paper we discuss a general approach to solving boundary value problems numerically in a parallel environment. The basic algorithm consists of two steps: the local step, where all P available processors work in parallel, and the global step, where one processor solves a tridiagonal linear system of order P. The main advantages of this approach are twofold. First, the approach is very flexible, especially in the local step, so the algorithm can be used with any number of processors and on either SIMD or MIMD machines. Second, the communication complexity is very small, so the algorithm can be used just as easily on shared-memory machines. Several examples using this strategy are discussed.
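A minimal, hypothetical sketch of the two-step structure (not the paper's actual discretization): each of P workers handles its own subdomain independently, after which a single processor solves a small tridiagonal system of order P that couples the subdomains, here via the Thomas algorithm. The coupling coefficients below are made up for illustration.

import numpy as np

def thomas(a, b, c, d):
    # Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal, d = rhs.
    n = len(b)
    b, d = b.astype(float).copy(), d.astype(float).copy()
    for i in range(1, n):
        w = a[i - 1] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    x = np.empty(n)
    x[-1] = d[-1] / b[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i]
    return x

def local_step(p):
    # Placeholder: processor p's independent work on its subdomain (runs in parallel).
    pass

P = 8
# Global step: one processor solves a tridiagonal system of order P.
a = np.full(P - 1, -1.0)
b = np.full(P, 2.0)
c = np.full(P - 1, -1.0)
d = np.ones(P)
interface_values = thomas(a, b, c, d)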
Parallel algorithms that use shared resources are notoriously difficult to write and verify, especially when properties such as fairness and efficiency are desired. This paper introduces a new synchronization method called the group lock. This method has some of the advantages of strict synchronization methods such as monitors; algorithms written using group lock are quite clear and easy to verify. At the same time, these algorithms generally make better use of parallelism than those written using stricter synchronization methods. In many cases we can obtain worst case time bounds that are constant or logarithmic in the number of processes, thus also ensuring bounded fairness. The paper begins with a discussion of related synchronization methods and an introduction to the group lock. This is followed by some examples of applications in algorithms for a readers/writers problem, parallel stacks and queues, implementation of fetch-and-Φ routines, and others. Finally, two implementations of group lock are described. A detailed implementation is given for the paracomputer, an idealized MIMD multiprocessor that supports the fetch-and-add synchronization instruction. An implementation is also outlined for general CREW multiprocessors using only reads and writes to shared memory.
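The toy Python sketch below mimics only the flavor of a group lock (threads that arrive while the door is open run the guarded section together; later arrivals wait for the next group); it uses an ordinary condition variable rather than the paper's fetch-and-add paracomputer implementation, and the class name and door metaphor are my own.

import threading

class GroupLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._door_open = True
        self._inside = 0

    def enter(self):
        # Join the current group if the door is open, else wait for the next group.
        with self._cond:
            while not self._door_open:
                self._cond.wait()
            self._inside += 1

    def leave(self):
        with self._cond:
            self._inside -= 1
            self._door_open = False       # close the door once the group starts draining
            if self._inside == 0:
                self._door_open = True    # last member out admits the next group
                self._cond.notify_all()

Each thread brackets its shared-resource phase with lock.enter() and lock.leave(); a real group lock would additionally coordinate work within the group, which this sketch does not attempt.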
We consider the problem of deterministic sorting of integers on a parallel RAM (PRAM). The best previous result (T. Hagerup, 1987, Inform. and Comput. 75, 39–51) states that n integers of size polynomial in n can be sorted in time O(log n) on a Priority CRCW PRAM with O(n log log n/log n) processors. We prove that n integers drawn from a set {0, …, m−1} can be sorted on an Arbitrary CRCW PRAM in time O(log n/log log n + log log m) with a time-processor product of O(n log log m). In particular, if m = n^((log n)^O(1)), the time and number of processors used are O(log n/log log n) and O(n(log log n)^2/log n), respectively. This improves the previous result in several respects: The new algorithm is faster, it works on a weaker PRAM model, and it is closer to optimality for input numbers of superpolynomial size. If log log m = O(log n/log log n), the new algorithm is optimally fast, for any polynomial number of processors, and if log log m = (1 + Ω(1)) log log n and log log m = o(log n), it has optimal speedup relative to the fastest known sequential algorithm. The space needed is O(nm^ε), for arbitrary but fixed ε > 0. The sorting algorithm derives its speed from a fast solution to a special list ranking problem of possible independent interest, the monotonic list ranking problem. In monotonic list ranking, each list element has an associated key, and the keys are known to increase monotonically along the list. We show that monotonic list ranking problems of size n can be solved optimally in time O(log n/log log n). We also discuss and attempt to solve some of the problems arising in the precise description and implementation of parallel recursive algorithms. As part of this effort, we introduce a new PRAM variant, the allocated PRAM.
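For readers unfamiliar with list ranking, the tiny sequential Python sketch below shows only what is being computed (the rank, i.e. distance from the head, of every element of a linked list given as successor pointers); the paper's parallel algorithm exploits the monotone keys to do this in O(log n/log log n) time, which this sketch makes no attempt to match.

def list_ranks(succ, head):
    # succ[i] is the index of the element after i, or None at the tail.
    # Returns rank[i] = number of links from the head to element i.
    rank = [0] * len(succ)
    i, r = head, 0
    while i is not None:
        rank[i] = r
        i = succ[i]
        r += 1
    return rank

# Example: list 2 -> 0 -> 3 -> 1 (keys increase monotonically along the list).
succ = [3, None, 0, 1]
print(list_ranks(succ, head=2))   # [1, 3, 0, 2]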