This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting impl...
详细信息
ISBN:
(纸本)9780897918909
This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting implementation is not limited to datasets with a uniform distribution of points, achieves significantly better speedups over good serial code, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for a loosely-coupled cluster of workstations, a distributed-memory multicomputer, and a shared-memory multiprocessor. The Machiavelli toolkit used to transform the nested data parallelism inherent in the divide-and-conquer algorithm into achievable task and data parallelism is also described and compared to previous techniques.
An earlier parallel list ranking algorithm performs well for problem sizes N that are extremely large in comparison to the number of PUs P. However, no existing algorithm gives good performance for reasonable loads. W...
详细信息
ISBN:
(纸本)9780897918909
An earlier parallel list ranking algorithm performs well for problem sizes N that are extremely large in comparison to the number of PUs P. However, no existing algorithm gives good performance for reasonable loads. We present a novel family of algorithms, that achieve a better trade-off between the number of start-ups and the routing volume. We have implemented them on an Intel Paragon, and they turn out to considerably outperform all earlier algorithms: with P = 2 the sequential algorithm is already beaten for N = 25,000;for P = 100 and N = 107, the speed-up is 21, and for N = 108 it even reaches 30. A modification of one of our algorithms solves a theoretical question: we show that on one-dimensional processor arrays, list ranking can be solved with a number of steps equal to the diameter of the network.
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new ana...
详细信息
ISBN:
(纸本)9780897918909
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new analyses for several such dynamic randomized load balancing schemes. Unlike previous analyses, we do not assume that in equilibrium each server is stochastically independent from other servers. Our work extends a previous analysis of the supermarket model, a model that abstracts a simple, efficient load balancing scheme in the setting where jobs arrive at a large system of parallel processors. In this model, customers arrive at a system of n servers as a Poisson stream of rate λn, λ
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. For certain large scale applications this ...
详细信息
ISBN:
(纸本)9780897918909
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. For certain large scale applications this is necessarily true. Typically, the cost models proposed for external memory algorithms have measured only the number of I/O operations, and the algorithms have been specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work based on parallelalgorithms to EM, but with limited success. In this paper we provide simulation techniques which produce efficient EM algorithms from efficient algorithms developed under BSP-like parallel computing models. Our techniques can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. In addition to the main simulation result we obtain a more comprehensive cost model for EM algorithms, which considers the total costs incurred by the algorithm including computation, I/O and communication costs.
We present a parallel algorithm on a CRCW PRAM for three-dimensional pattern matching, i.e., finding all occurrences of a pattern of size m3 in a text of size n3. The text search of the algorithm is alphabet-independe...
详细信息
ISBN:
(纸本)9780897918909
We present a parallel algorithm on a CRCW PRAM for three-dimensional pattern matching, i.e., finding all occurrences of a pattern of size m3 in a text of size n3. The text search of the algorithm is alphabet-independent and runs in optimal constant time. The preprocessing computes a witness array for the pattern, and it runs in O(log m) time using m3 processors. The witness computation uses multilayer suffix trees and it works for any dimensions. Our result is based on a new characterization of three-dimensional periodicity. The parallel algorithm can be translated into a sequential algorithm for three-dimensional pattern matching whose text search is alphabet-independent and runs in O(n3) time and whose preprocessing runs in O(m3 log σ) time, where σ = min(|Σ|, m3) and Σ is the alphabet of symbols.
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. We present a novel method to dynamically balance the processor workloads ...
详细信息
ISBN:
(纸本)9780897918909
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. We present a novel method to dynamically balance the processor workloads with a global view. This paper presents, for the first time, the implementation and integration of all major components within our dynamic load balancing strategy for adaptive grid calculations. Mesh adaption, repartitioning, processor assignment, and remapping are critical components of the framework that must be accomplished rapidly and efficiently so as not to cause a significant overhead to the numerical simulation. Previous results indicated that mesh repartitioning and data remapping are potential bottlenecks for performing large-scale scientific calculations. We resolve these issues and demonstrate that our framework remains viable on a large number of processors.
We consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation along the lines of a distributed-memory model such as the BSP. As a candidate for a shared-me...
详细信息
We consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation along the lines of a distributed-memory model such as the BSP. As a candidate for a shared-memory bridging model, we introduce the Queuing Shared Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP, and on a related model, the (d,x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios in which the high-level QSM is more suited for analyzing algorithms on certain machines than the more detailed BSP and LogP models. Finally, we present algorithmic results for the QSM, as well as general strategies for mapping algorithms designed for the BSP or PRAM models onto the QSM model. Our main conclusion is that shared-memory models can potentially serve as viable alternatives to existing message-passing, distributed-memory bridging models.
A parallel variant of breadth-first search for distributed computing is presented. The variant allows exhaustive enumeration of elements of a search space (implicitly defined graph) in which the representation of all ...
详细信息
ISBN:
(纸本)9780897918909
A parallel variant of breadth-first search for distributed computing is presented. The variant allows exhaustive enumeration of elements of a search space (implicitly defined graph) in which the representation of all graph nodes would otherwise require more than the total available memory. This algorithm requires the use of a tadpole data structure to partition the search space into connected subgraphs, with each subgraph stored within the memory of a single processor. Thus, the graph of the nodes are stored in a way that adds greater spatial locality, thereby reducing communication among processors. The algorithm enumerates the tadpoles in a breadth-first manner while executing depth-first search within each tadpole. The search within each tadpole is reduced to tree search. A parameter is defined that allows a linear tradeoff in which memory and communication are each reduced linearly as CPU time grows. The result appears to fill a gap in the literature, which has concentrated on parallel depth-first search, but has been relatively sparse in the case of parallel breadth-first search.
A novel comprehensive and coherent approach for the purpose of increasing instruction-level parallelism (ILP) is devised. The key new tool in our envisioned system update is the addition of a parallel prefix-sum (PS) ...
详细信息
A novel comprehensive and coherent approach for the purpose of increasing instruction-level parallelism (ILP) is devised. The key new tool in our envisioned system update is the addition of a parallel prefix-sum (PS) instruction, which will have efficient implementation in hardware, to the instruction-set architecture. This addition gives for the first time a concrete way for recruiting the whole knowledge base of parallelalgorithms for that purpose. The potential increase in ILP is demonstrated by experimental results for a test application. The main technical contribution is in the form of a `completeness theorem'. Perhaps surprisingly, the current abstract proves that in an envisioned system which employs parallel PS functional units, a proper use of a serial programming language suffices for the following. With a moderate effort, one can program a parallel algorithm (in a serial language), so that a parallelizing compiler (even without run-time methods!) will be able to extract the same (i.e., `complete') ILP from such serial code as from code written in a parallel language. Alternatively, rather than have the programmer produce the serial code, a precompiler could derive it from a parallel language. The most interesting idea in the proof is the reliance on the new parallel PS for circumventing collision-ambiguity in references to memory. Other new ideas in the paper include hardware-design of a prefix-sum unit and an on-line algorithm for high-bandwidth register-files. An informal upshot of this paper is the following general insight: to accommodate parallelism in uniprocessor systems (from algorithms to ILP), it is sufficient to only add (and, of course, incorporate) parallel prefix-sum functional units to standard serial system designs.
Given a network of processors, a structured communication problem consists to route a communication pattern known in advance. Structured communication problems appear frequently in parallel computing. Hence, communica...
详细信息
ISBN:
(纸本)9780897918909
Given a network of processors, a structured communication problem consists to route a communication pattern known in advance. Structured communication problems appear frequently in parallel computing. Hence, communication libraries (e.g, PVM or MPI) generally include a specific access to procedures solving the most common problems of this type. A standard communication model assumes that information proceeds by a sequence of calls between neighboring nodes of the network, and that each node is allowed to call at most one neighbor at a time. In this context, most of the decision problems corresponding to the usual structured communication problems have been shown to be NP-complete. Therefore, several approximation algorithms have been proposed to solve specific problems. Each of these algorithms is dedicated to a particular problem. In this paper, we present a high level method which can be used to derive approximation algorithms for many different structured communication problems on arbitrary networks. This method is shown to be quite efficient. For instance, in the case of broadcasting (one-to-all) and gossiping (all-to-all), it allows to obtain algorithms which approach, and even sometime achieve, the performances of the best known algorithms which were designed using specific techniques defined in purpose for each problem. Moreover, in the case of scattering (personalized one-to-all) and multi-scattering (personalized all-to-all), our method allows to obtain optimal results. Our method is based on the simulation on arbitrary networks of movements of data induced by communication algorithms derived for specific architectures and/or specific communication models. This simulation uses sophisticated embedding and scheduling techniques.
暂无评论