Novel architectures for P2P applications were discussed. A distributed hash table (DHT) approach is studied. The probability that a processor participates in a random lookup should be small. The memory requirement of each server should be low. An overlay network should be built that performs lookup operations efficiently.
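As a rough illustration of why such an overlay keeps per-server memory low while lookups stay efficient, the sketch below simulates Chord-style finger-table routing on a hash ring. All parameters (16-bit identifiers, 256 nodes) and helper names are illustrative, not taken from the paper: each node stores only O(log N) finger entries, yet lookups finish in O(log N) hops.

```python
import hashlib
import random
from bisect import bisect_left

M = 16                          # identifier bits (illustrative)
RING = 1 << M

def hash_id(key: str) -> int:
    """Hash a key onto the identifier ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def successor(ids, point):
    """First node id clockwise from `point` (ids sorted), wrapping around."""
    i = bisect_left(ids, point)
    return ids[i] if i < len(ids) else ids[0]

def dist(a, b):
    """Clockwise ring distance from a to b."""
    return (b - a) % RING

def lookup_hops(ids, start, key):
    """Greedy Chord-style routing: each hop follows the farthest of the
    node's O(log N) fingers that does not overshoot the target."""
    target = hash_id(key)
    owner = successor(ids, target)
    cur, hops = start, 0
    while cur != owner:
        # a real node precomputes this finger table once, not per lookup
        fingers = [successor(ids, (cur + (1 << i)) % RING) for i in range(M)]
        best = max((f for f in fingers if 0 < dist(cur, f) <= dist(cur, target)),
                   key=lambda f: dist(cur, f),
                   default=owner)   # no usable finger: last hop is the successor
        cur, hops = best, hops + 1
    return hops

random.seed(0)
ids = sorted(random.sample(range(RING), 256))
hops = [lookup_hops(ids, random.choice(ids), f"key{i}") for i in range(200)]
print(max(hops))   # stays near log2(256) = 8, far below the 256 nodes
```

The point of the simulation is the scaling: participation in a random lookup touches O(log N) of the N nodes, and each node's routing state is O(log N) entries.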
ISBN: (Print) 9780897917179
Mixed task and data parallelism exists naturally in many applications, but utilizing it may require sophisticated scheduling algorithms and software support. Recently, significant research effort has been applied to exploiting mixed parallelism in both theory and systems communities. In this paper, we ask how much mixed parallelism will improve performance in practice, and how architectural evolution impacts these estimates. First, we build and validate a performance model for a class of mixed task and data parallel problems based on machine and problem parameters. Second, we use this model to estimate the gains from mixed parallelism for some scientific applications on current machines. This quantifies our intuition that mixed parallelism is best when either communication is slow or the number of processors is large. Third, we show that, for balanced divide and conquer trees, a simple one-time switch between data and task parallelism gets most of the benefit of general mixed parallelism. Fourth, we establish upper bounds to the benefits of mixed parallelism for irregular task graphs. Apart from these detailed analyses, we provide a framework in which other applications and machines can be evaluated.
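To make the intuition concrete, here is a toy cost model (our own illustration, not the paper's validated model) for a balanced binary divide-and-conquer tree of the given depth: `alpha` stands for a per-step communication overhead, and the one-time-switch schedule divides the processor group along with the recursion until each subtree owns one processor, then runs subtrees as independent tasks.

```python
def t_data(w, depth, P, alpha):
    """Pure data parallelism: every node runs on all P processors, one node
    at a time; each parallel step pays the communication overhead alpha."""
    return sum(2**d * (w / 2**d / P + alpha) for d in range(depth + 1))

def t_task(w, depth, P):
    """Pure task parallelism: each node runs on one processor, so the top of
    the tree has too few nodes to keep P processors busy."""
    return sum((w / 2**d) * max(1.0, 2**d / P) for d in range(depth + 1))

def t_switch(w, depth, P, alpha):
    """One-time switch: split the processor group with the recursion until
    each subtree owns one processor, then run subtrees as serial tasks."""
    return sum(w / P + (alpha if 2**d <= P else 0.0) for d in range(depth + 1))

w, depth, P, alpha = 1e6, 20, 256, 50.0
print(t_data(w, depth, P, alpha))    # overhead alpha paid at every node
print(t_task(w, depth, P))           # serial bottleneck near the root
print(t_switch(w, depth, P, alpha))  # close to w*(depth+1)/P
```

Consistent with the abstract's intuition, the switched schedule beats both pure strategies, and its advantage grows when `alpha` (slow communication) or `P` is large.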
ISBN: (Print) 9781450312134
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms. Copyright is held by the author/owner(s).
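As a back-of-the-envelope illustration (our numbers, not the paper's): for classical matmul, the memory-dependent per-processor bandwidth bound of order n^3/(P*sqrt(M)) shrinks linearly in 1/P only until it meets a memory-independent bound of order n^2/P^(2/3), which happens near P = n^3/M^(3/2).

```python
def comm_lower_bounds(n, P, M):
    """Per-processor bandwidth lower bounds for classical matrix
    multiplication (Omega-bounds with constants dropped). The memory-
    dependent term scales like 1/P only until the memory-independent
    term takes over."""
    mem_dep = n**3 / (P * M**0.5)   # Omega(n^3 / (P * sqrt(M)))
    mem_indep = n**2 / P**(2/3)     # Omega(n^2 / P^(2/3))
    return max(mem_dep, mem_indep)

n, M = 4096, 2**20          # matrix dimension, words of memory per processor
p_star = n**3 / M**1.5      # beyond this P, perfect strong scaling must stop
print(p_star)               # 64.0 for these illustrative numbers
```

Past `p_star`, adding processors still reduces the per-processor bound, but only as P^(-2/3) rather than P^(-1), which is exactly the "perfect strong scaling range" the abstract describes.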
A novel comprehensive and coherent approach for the purpose of increasing instruction-level parallelism (ILP) is devised. The key new tool in our envisioned system update is the addition of a parallel prefix-sum (PS) instruction, which will have efficient implementation in hardware, to the instruction-set architecture. This addition gives for the first time a concrete way for recruiting the whole knowledge base of parallel algorithms for that purpose. The potential increase in ILP is demonstrated by experimental results for a test application. The main technical contribution is in the form of a 'completeness theorem'. Perhaps surprisingly, this paper proves that in an envisioned system which employs parallel PS functional units, a proper use of a serial programming language suffices for the following. With a moderate effort, one can program a parallel algorithm (in a serial language), so that a parallelizing compiler (even without run-time methods!) will be able to extract the same (i.e., 'complete') ILP from such serial code as from code written in a parallel language. Alternatively, rather than have the programmer produce the serial code, a precompiler could derive it from a parallel language. The most interesting idea in the proof is the reliance on the new parallel PS for circumventing collision-ambiguity in references to memory. Other new ideas in the paper include hardware-design of a prefix-sum unit and an on-line algorithm for high-bandwidth register-files. An informal upshot of this paper is the following general insight: to accommodate parallelism in uniprocessor systems (from algorithms to ILP), it is sufficient to only add (and, of course, incorporate) parallel prefix-sum functional units to standard serial system designs.
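For intuition, the classic use of prefix-sum to remove write collisions is array compaction: prefix-summing the keep-flags hands every surviving element a unique destination index, so all writes are independent and could issue simultaneously. A minimal serial emulation (illustrative only, not the paper's hardware design):

```python
from itertools import accumulate

def exclusive_prefix_sum(flags):
    """What a hardware PS unit would return: the exclusive running sum."""
    return list(accumulate([0] + flags[:-1]))

def parallel_compact(data, keep):
    """Each element that passes `keep` gets a unique output slot computed
    from the prefix sum of the keep-flags, so no two writes ever share an
    index (the collision-ambiguity the PS instruction circumvents)."""
    flags = [1 if keep(x) else 0 for x in data]
    idx = exclusive_prefix_sum(flags)
    out = [None] * sum(flags)
    for i, x in enumerate(data):   # iterations are independent: a compiler
        if flags[i]:               # could schedule them all in parallel
            out[idx[i]] = x
    return out

print(parallel_compact(range(10), lambda x: x % 3 == 0))  # [0, 3, 6, 9]
```

Because the destination indices are computed before any write happens, the compiler can prove the writes disjoint, which is the essence of how a PS instruction exposes ILP from serial-looking code.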
ISBN: (Print) 0897913701
We present a parallel randomized algorithm for deciding whether two planar graphs are isomorphic. Assuming that we have a tree of separators for each planar graph, our algorithm takes O(log(n)) time with P = O(n^1.5·√log(n)) processors, with failure probability at most 1/n, where n is the number of vertices. The algorithm needs 2·log(m)·log(n) + O(log(n)) random bits. The number of random bits can be decreased to O(log(n)) by increasing the number of processors to n^(3/2+ε). This algorithm significantly improves the previous bound of n^4 processors.
ISBN: (Print) 9781450345934
Dynamic algorithms are used to compute a property of some data while the data undergoes changes over time. Many dynamic algorithms have been proposed but nearly all are sequential. In this paper, we present our ongoing work on designing a parallel algorithm for the dynamic trees problem, which requires computing a property of a forest as the forest undergoes changes. Our algorithm allows insertion and/or deletion of both vertices and edges anywhere in the input and performs updates in parallel. We obtain our algorithm by applying a dynamization technique called self-adjusting computation to the classic algorithm of Miller and Reif for tree contraction.
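For reference, here is a sequential simulation of randomized rake-and-compress tree contraction in the spirit of Miller and Reif's algorithm, computing a sum over a rooted tree. The coin-flip independent-set rule and the concrete representation are our simplification, not the paper's dynamized version.

```python
import random
from collections import defaultdict

def contract_sum(parent, value):
    """Each round rakes every leaf into its parent, then splices out an
    independent set of unary chain nodes chosen by coin flips; the parallel
    version finishes in O(log n) rounds with high probability."""
    value, parent = dict(value), dict(parent)
    root = next(v for v, p in parent.items() if p is None)
    children = defaultdict(set)
    for v, p in parent.items():
        if p is not None:
            children[p].add(v)
    alive, rounds = set(parent), 0
    while len(alive) > 1:
        rounds += 1
        # rake: fold every current leaf into its parent ("in parallel")
        for v in [u for u in alive if u != root and not children[u]]:
            value[parent[v]] += value[v]
            children[parent[v]].discard(v)
            alive.discard(v)
        # compress: splice a unary node only if it wins its coin flip while
        # its single child loses, so no two adjacent nodes are spliced
        unary = [(u, next(iter(children[u])), parent[u])
                 for u in alive if u != root and len(children[u]) == 1]
        flips = {u: random.random() < 0.5 for u, _, _ in unary}
        for v, c, p in unary:
            if flips[v] and not flips.get(c, False):
                value[c] += value[v]
                parent[c] = p
                children[p].discard(v)
                children[p].add(c)
                alive.discard(v)
    return value[root], rounds

# a path 0-1-2-3-4 rooted at 0, plus a leaf 5 hanging off node 2
parent = {0: None, 1: 0, 2: 1, 3: 2, 4: 3, 5: 2}
total, rounds = contract_sum(parent, {v: v for v in parent})
print(total)   # 0+1+2+3+4+5 = 15
```

The dynamic-trees work sketched in the abstract takes a static algorithm of this shape and, via self-adjusting computation, replays only the contraction steps that an edge or vertex update actually invalidates.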
ISBN: (Print) 9781450307437
This paper presents the design and analysis of a near linear-work parallel algorithm for solving symmetric diagonally dominant (SDD) linear systems. On input of an SDD n-by-n matrix A with m non-zero entries and a vector b, our algorithm computes a vector x̃ such that ||x̃ − A⁺b||_A ≤ ε·||A⁺b||_A in O(m·log^O(1)(n)·log(1/ε)) work and O(m^(1/3+θ)·log(1/ε)) depth for any fixed θ > 0. The algorithm relies on a parallel algorithm for generating low-stretch spanning trees or spanning subgraphs. To this end, we first develop a parallel decomposition algorithm that, in polylogarithmic depth and Õ(|E|) work, partitions a graph into components with polylogarithmic diameter such that only a small fraction of the original edges are between the components. This can be used to generate low-stretch spanning trees with average stretch O(n^α) in O(n^(1+α)) work and O(n^α) depth. Alternatively, it can be used to generate spanning subgraphs with polylogarithmic average stretch in O(|E|) work and polylogarithmic depth. We apply this subgraph construction to derive our solver. By using the linear system solver in known applications, our results imply improved parallel randomized algorithms for several problems, including single-source shortest paths, maximum flow, min-cost flow, and approximate max-flow.
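The flavor of the decomposition can be seen in the classic sequential ball-growing procedure that such parallel algorithms emulate (a simplified sketch with an illustrative `beta`; the paper's contribution is achieving this in polylogarithmic depth, which this serial loop does not attempt):

```python
def low_diameter_decomposition(adj, beta=0.5):
    """Grow a BFS ball from an unassigned vertex until the number of edges
    leaving the ball is at most beta times the edges inside it (plus one),
    cut the ball off as a component, and repeat. Radii stay small and only
    a modest fraction of edges end up between components."""
    assigned, comp = {}, 0
    for s in adj:
        if s in assigned:
            continue
        ball = {s}
        while True:
            internal = sum(1 for u in ball for v in adj[u] if v in ball) // 2
            boundary = {v for u in ball for v in adj[u]
                        if v not in ball and v not in assigned}
            cut = sum(1 for u in ball for v in adj[u] if v in boundary)
            if cut <= beta * (internal + 1):
                break                 # ball is cheap to cut off: stop growing
            ball |= boundary          # otherwise absorb the whole next layer
        for v in ball:
            assigned[v] = comp
        comp += 1
    return assigned

# an 8-cycle splits into a few short arcs
cycle = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
labels = low_diameter_decomposition(cycle)
print(len(set(labels.values())))   # 3 components for this cycle
```

Each component here has small diameter by construction, and the cut condition keeps inter-component edges rare, which is what low-stretch spanning-subgraph constructions need.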
We present parallel algorithms for evaluating game trees. These algorithms parallelize the "left-to-right" sequential algorithm for evaluating AND/OR trees and the α-β pruning procedure for evaluating MIN/...
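For orientation, the serial left-to-right α-β procedure being parallelized looks like the following textbook sketch (our illustration, not the paper's parallel version), on a game tree given as nested lists with numeric leaves:

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Left-to-right alpha-beta evaluation of a MIN/MAX game tree: a subtree
    is pruned as soon as the window [alpha, beta] closes."""
    if isinstance(node, (int, float)):
        return node
    best = float("-inf") if maximizing else float("inf")
    for child in node:
        v = alphabeta(child, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, v)
            alpha = max(alpha, best)
        else:
            best = min(best, v)
            beta = min(beta, best)
        if beta <= alpha:   # remaining children cannot change the result
            break
    return best

tree = [[3, 5], [2, [9, 1]], [4, 6]]   # MAX over three MIN subtrees
print(alphabeta(tree))                  # 4
```

The strict left-to-right dependence of `alpha` and `beta` on earlier siblings is exactly what makes parallelizing this procedure nontrivial.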
ISBN: (Print) 0897913701
The authors address the apparently difficult problem of doing parallel transitive closure when the (directed) graph is sparse and/or only single-source information is desired. O(e) work is their target for the single-source problem. When the graph is sparse, the all-pairs transitive closure problem can be solved by performing a depth-first search from each node, taking O(ne) time; that is their target for the all-pairs problem. The authors do not reach either target, except for the all-pairs case when e is fairly large. However, they make significant progress, in the sense that they give the first algorithms that simultaneously use time much less than linear and work less than M(n), the cost of multiplying two n-by-n matrices.
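The sequential target is easy to state: single-source reachability by BFS (or DFS) does O(n + e) work. A minimal sketch (the open question in the abstract is matching this work bound with low parallel time, which this serial loop does not address):

```python
from collections import deque

def reachable(adj, s):
    """Single-source reachability in O(n + e) work via BFS."""
    seen = {s}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

adj = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [0]}
print(sorted(reachable(adj, 0)))   # [0, 1, 2, 3]  (4 is not reached)
```

Running this from every source gives the O(ne) all-pairs baseline the abstract mentions; the difficulty is that neither traversal parallelizes to sublinear time without blowing up the work.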
Andrews et al. [Automatic method for hiding latency in high bandwidth networks, in: Proceedings of the ACM Symposium on Theory of Computing, 1996, pp. 257-265; Improved methods for hiding latency in high bandwidth networks, in: Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, 1996, pp. 52-61] introduced a number of techniques for automatically hiding latency when performing simulations of networks with unit delay links on networks with arbitrary unequal delay links. In their work, they assume that processors of the host network are identical in computational power to those of the guest network being simulated. They further assume that the links of the host are able to pipeline messages, i.e., they are able to deliver P packets in time O(P + d) where d is the delay on the link. In this paper we examine the effect of eliminating one or both of these assumptions. In particular, we provide an efficient simulation of a linear array of homogeneous processors connected by unit-delay links on a linear array of heterogeneous processors connected by links with arbitrary delay. We show that the slowdown achieved by our simulation is optimal. We then consider the case of simulating cliques by cliques; i.e., a clique of heterogeneous processors with arbitrary delay links is used to simulate a clique of homogeneous processors with unit delay links. We reduce the slowdown from the obvious bound of the maximum delay link to the average of the link delays. In the case of the linear array we consider both links with and without pipelining. For the clique simulation the links are not assumed to support pipelining. The main motivation of our results (as was the case with Andrews et al.) is to mitigate the degradation of performance when executing parallel programs designed for different architectures on a network of workstations (NOW). In such a setting it is unlikely that the links provided by the NOW will support pipelining and it is quite probab
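The headline improvement is easy to quantify on toy numbers (illustrative delays, not from the paper): a step-by-step simulation stalls every round on the slowest link, whereas the abstract's clique schedule pays only the average delay.

```python
def slowdown_bounds(delays):
    """Slowdown of simulating a unit-delay clique on a clique with the given
    link delays: the obvious bound stalls each step on the slowest link; the
    improved bound amortizes to the average of the link delays."""
    return max(delays), sum(delays) / len(delays)

worst, avg = slowdown_bounds([1, 1, 2, 4, 32])   # one very slow link
print(worst, avg)   # 32 vs 8.0
```

The gap widens exactly when delays are skewed, which is the expected situation on a NOW where one congested or distant link should not throttle the whole simulation.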