This paper addresses the question of the future applicability of bus-based multiprocessors, by studying the execution characteristics of a number of data parallel numerical and scientific applications, as the system p...
详细信息
ISBN:
(纸本)9780897917179
This paper addresses the question of the future applicability of bus-based multiprocessors, by studying the execution characteristics of a number of data parallel numerical and scientific applications, as the system parameters of processor speed, bus bandwidth, and cache size are scaled (holding the number of processors fixed). This study focuses on 'asymptotic' results. Specifically, it addresses whether the application eventually become communication bound as the machine and the application are scaled. Essentially, it addresses the question of whether the machine becomes unsuitable as a parallel computing platform for the application. A transient analysis on the rate which the asymptotic results take hold is also performed.
Techniques for quickly executing lengthy computations by the use of molecular parallelism are described. It is demonstrated that molecular computations can be done using short DNA strands by more or less conventional ...
详细信息
ISBN:
(纸本)9780897917179
Techniques for quickly executing lengthy computations by the use of molecular parallelism are described. It is demonstrated that molecular computations can be done using short DNA strands by more or less conventional biotechnology engineering techniques within a small number of laboratory steps. Two abstract models of molecular computation are proposed. The first, the parallel Associative Memory Model, is a very high level model which includes a parallel Associative Matching operation, that appears to improve the power of molecular parallelism beyond the operations previously considered by Lipton (1994). A Recombinant DNA Model is also proposed which is a low level model that allows operations that are abstractions of very well understood recombinant DNA operations and provides a representation, herein called complex, for the relevant structural properties of DNA.
The original LogP model for parallel computation use four parameters;the communication latency (L), overhead (n), bandwidth (g), and the number of processors (P). This basic model is extended with a linear model for l...
详细信息
The original LogP model for parallel computation use four parameters;the communication latency (L), overhead (n), bandwidth (g), and the number of processors (P). This basic model is extended with a linear model for long messages. The resulting combination which is called the LogGP model for parallel computation has one additional parameter, G, which captures the bandwidth obtained for long messages. In this paper, the algorithm design and analysis under the new model are discussed, and the all-to-all remap, FFT, and radix sort are examined. In addition, the single node scatter problem is, examined. Finally, solutions are derived for the problem, and the their optimality under the LogGP model is proven.
Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to ...
详细信息
ISBN:
(纸本)9780897917179
Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. When executing such programs, the major concern is to dynamically schedule tasks to processors in order to minimize execution time and the amount of memory needed. In this paper, a class of parallel schedules that are provably efficient in both time and space, even for programs whose task structure is revealed only during execution are identified. Following this, an efficient dynamic scheduling algorithm that generates schedules in this class, for languages with nested fine-grained parallelism is described.
We present a randomized parallel algorithm for constructing the 3D convex hull on a generic p-processor coarse grained multicomputer with arbitrary interconnection network and n/p local memory per processor, where n/p...
详细信息
We present a randomized parallel algorithm for constructing the 3D convex hull on a generic p-processor coarse grained multicomputer with arbitrary interconnection network and n/p local memory per processor, where n/p ≥ p2+Ε (for some arbitrarily small Ε > 0). For any given set of n points in 3-space, the algorithm computes the 3D convex hull, with high probability, in O(n log n÷p) local computation time and O(1) communication phases with at most O(n÷p) data sent/received by each processor. That is, with high probability, the algorithm computes the 3D convex hull of an arbitrary point set in time O(n log n÷p + Γn,p), where Γn,p denotes the time complexity of one communication phase. In the terminology of the BSP model, our algorithm requires, with high probability, O(1) supersteps and a synchronization period Θ(n log n÷p). In the LogP model, the execution time of our algorithm is asymptotically optimal for several architectures.
The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinatorial element has bound...
详细信息
ISBN:
(纸本)9780897917179
The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinatorial element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W + D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit). In this paper, we show how to solve the circuit value update problem efficiently on a P-processor parallel computer. We give a straightforward synchronous, parallel algorithm that runs in O(W/P + D lg P) expected time. Our main contribution, however, is an optimistic, asynchronous, parallel algorithm that runs in O(W/P + D + lg W + lg P) expected time, where W and D are the size and embedded depth, respectively, of the 'volatile' subcircuit, the subcircuit of elements that have inputs which either change or glitch as a result of the update. To our knowledge, our analysis provides the first analytical bounds on the running time of an optimistic algorithm.
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implement...
详细信息
ISBN:
(纸本)9780897917179
Two hardware barrier synchronization schemes are presented which can support deep levels of control nesting in data parallel programs. Hardware barriers are usually an order of magnitude faster than software implementations. Since large data parallel programs often have several levels of nested barriers, these schemes provide significant speedups in the execution of such programs on MIMD computers. The first scheme performs code transformations and uses two single-bit-trees to implement unlimited levels of nested barriers. However, this scheme increases the code size. The second scheme uses a more expensive integer-tree to support an exponential number of nested barriers without increasing the code size. Using hardware already available on commercial MIMD computers, this scheme can support more than four billion levels of nesting.
In this paper, we address the question how efficiently a single constant-degree processor network can simulate the computation of any constant-degree processor network. We show the following lower bound trade-off: If ...
详细信息
In this paper, we address the question how efficiently a single constant-degree processor network can simulate the computation of any constant-degree processor network. We show the following lower bound trade-off: If M is an arbitrary constant-degree processor network of size m that can simulate all constant-degree processor networks of size n with slowdown s, then m·s = Ω(n log m). Our trade-off holds for a very general model of simulations. It covers all previously considered models and all known techniques for simulations among networks. For m ≥ n, this improves a previous lower bound by a factor of log log n, proved for a weaker simulation model. For m < n, this is the first non-trivial lower bound for this problem. In this case, this lower bound is asymptotically tight.
The proceedings contain 47 papers. The topics discussed include: Fault-Tolerant Meshes with Small Degree;the verification of cache coherence protocols;fault diagnosis in a small constant number of parallel testing rou...
ISBN:
(纸本)0897915992
The proceedings contain 47 papers. The topics discussed include: Fault-Tolerant Meshes with Small Degree;the verification of cache coherence protocols;fault diagnosis in a small constant number of parallel testing rounds;tight bounds on the size of fault-tolerant merging and sorting networks with destructive faults;on Gazit and Miller�s parallel algorithm for planar separators: achieving greater efficiency through random sampling;parallel and output sensitive algorithms for combinatorial and linear algebra problems;efficient parallel shortest-paths in digraphs with a separator decomposition;components for computing and communications;highly efficient dictionary matching in parallel;optimal parallel two dimensional pattern matching;and parallel construction and query of suffix trees for two-dimensional matrices.
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consist...
详细信息
ISBN:
(纸本)9780897917179
We study the problem of sorting on a parallel computer with limited communication bandwidth. By using the recently proposed PRAM(m) model, where p processors communicate through a small, globally shared memory consisting of m bits, we focus on the trade-off between the amount of local computation and the amount of inter-processor communication required for parallel sorting algorithms. We prove a lower bound of Ω(n log m/m) on the time to sort n numbers in an exclusive-read variant of the PRAM (m) model. We show that Leighton's Columnsort can be used to give an asymptotically matching upper bound in the case where m grows as a fractional power of n. The bounds are of a surprising form, in that they have little dependence on the parameter p. This implies that attempting to distribute the workload across more processors while holding the problem size and the size of the shared memory fixed will not improve the optimal running time of sorting in this model. We also show that both the upper and the lower bound can be adapted to bridging models that address the issue of limited communication bandwidth: the LogP model and the BSP model. The lower bounds provide convincing evidence that efficient parallelalgorithms for sorting rely strongly on high communication bandwidth.
暂无评论