Efficient use of resources in a parallel computer often requires the redistribution of tasks during the execution of applications. This problem of load balancing is a formidable one because of the unpredictable and dy...
详细信息
ISBN:
(纸本)0897916344
Efficient use of resources in a parallel computer often requires the redistribution of tasks during the execution of applications. This problem of load balancing is a formidable one because of the unpredictable and dynamic nature of many user programs. We present a Genetic Algorithm load balancer that optimizes the placement of tasks within the system. Our method incorporates the genetic algorithm into a more traditional load balancing framework. Using an event-driven simulator we evaluate the effectiveness of our algorithm and compare it to an existing approach. While genetic algorithms have been used in a variety of optimization problems, this represents the first attempt to use them in this domain. It also represents a departure from traditional heuristic-based load balancing schemes in that the genetic algorithm is the primary decision-making authority. Early results indicate that the use of genetic algorithms in load balancing can lead to a load balancing system that offers stability and adaptability to a variety of architectures and applications.
Consider a system of independent tasks which are to be scheduled without preemption on a parallel computer. For each task both the number of processors required and the corresponding execution time are known. The prob...
详细信息
ISBN:
(纸本)0898713293
Consider a system of independent tasks which are to be scheduled without preemption on a parallel computer. For each task both the number of processors required and the corresponding execution time are known. The problem of finding a schedule with minimum makespan has been extensively studied in the literature. In this paper we tackle the corresponding problem of finding a schedule with minimum average response time. The results are analogous: The average response time problem is also NP-hard, and we construct a polynomial time algorithm whose solution is within a fixed multiplicative constant of optimal. Moreover, we show that none of the classic algorithms for the makespan problem have this property when viewed as solutions to the average response time problem.
Automatic parallelization of real FORTRAN programs does not live up to users expectations yet, and dependence analysis algorithms which either produce too many false dependences or are too slow contribute significantl...
详细信息
ISBN:
(纸本)0897916360
Automatic parallelization of real FORTRAN programs does not live up to users expectations yet, and dependence analysis algorithms which either produce too many false dependences or are too slow contribute significantly to this. In this paper we introduce a data-flow dependence analysis algorithm which exactly computes value-based dependence relations for program fragments in which all subscripts, loop bounds and IF conditions are affine. Our algorithm also computes good affine approximations of dependence relations for non-affine program fragments. Actually, we know of no other algorithm which can compute better approximations.
We study fundamental comparison problems on strings of characters, equipped with the usual lexicographical ordering. For each problem studied, we give a parallel algorithm that is optimal with respect to at least one ...
详细信息
ISBN:
(纸本)0897916638
We study fundamental comparison problems on strings of characters, equipped with the usual lexicographical ordering. For each problem studied, we give a parallel algorithm that is optimal with respect to at least one criterion for which no optimal algorithm was previously known. Specifically, our main results are: qq Two sorted sequences of strings, containing altogether n characters, can be merged in O(log n) time using O(n) operations on an EREW PRAM. This is optimal as regards both the running time and the number of operations. qq A sequence of strings, containing altogether n characters represented by integers of size polynomial in n, can be sorted in O(log n/log log n) time using O(n log log n) operations on a CRCW PRAM. The running time is optimal for any polynomial number of processors. qq The minimum string in a sequence of strings containing altogether n characters can be found using (expected) O(n) operations in (expected) time ranging from O(1) on a randomized CRCW PRAM to O(log n log log n) on a deterministic EREW PRAM.
We provide an o(m)-work EREW PRAM algorithm to maintain the minimum spanning forest of an undirected graph under edge insertions and deletions. Then, using the sparsification data structure, we improve this result obt...
详细信息
Multithreaded architectures context switch between instruction streams to hide memory access latency. Although this improves processor utilization, it can increase cache interference and degrade overall performance. O...
详细信息
Multithreaded architectures context switch between instruction streams to hide memory access latency. Although this improves processor utilization, it can increase cache interference and degrade overall performance. One technique to reduce the interconnect traffic is to co-locate threads that share data on the same processor. Multiple threads sharing in the cache should reduce compulsory and invalidation misses, thereby improving execution time. To test this hypothesis, we compared a variety of thread placement algorithms via trace-driven simulation of fourteen coarse- and medium-grain parallel applications on several multithreaded architectures. Our results contradict the hypothesis. Rather than decreasing, compulsory and invalidation misses remained nearly constant across all placement algorithms, for all processor configurations, even with an infinite cache. That is, sharing-based placement had no (positive) effect on execution time. Instead, load balancing was the critical factor that affected performance. Our results were due to one or both of the following reasons: (1) the sequential and uniform access of shared data by the application's threads and (2) the insignificant number of data references that require interconnect access, relative to the total number of instructions.
We introduce the queue-read, queue-write (QRQW) PRAM model, which permits concurrent reading and writing, but at a cost proportional to the number of readers/writers to a memory location in a given step. Previously, t...
详细信息
ISBN:
(纸本)0898713293
We introduce the queue-read, queue-write (QRQW) PRAM model, which permits concurrent reading and writing, but at a cost proportional to the number of readers/writers to a memory location in a given step. Previously, there were no formal complexity models that accounted for the contention to memory locations, despite the large impact of such contentions on the performance of parallel programs. The QRQW PRAM is strictly more powerful than the EREW PRAM. We show a separation of √lg n between the two models, and present faster and more efficient QRQW algorithms for many basic problems. Nevertheless, we show that the QRQW can be efficiently emulated with only logarithmic slowdown on Valiant's BSP model, and hence on hypercube-type non-combining networks, even when latency, synchronization, and memory granularity overheads are taken into account. In contrast, efficient emulations for the CRCW PRAM on such networks are only known with polynomial slowdown. Most existing machines obey the queue-read, queue-write rule. We study the impact of the QRQW rule on algorithm design, devising new techniques for low-contention algorithms. Our results include fast and efficient algorithms for computing the OR, leader election, linear compaction, multiple compaction, integer sorting, and CRCW simulation, as well as several lower bounds. Some of the results presented are quite involved and use several novel ideas and techniques that are interesting on their own.
We introduce the subtree max gap problem. Consider a tree T with n leaves whose internal nodes have degree at least two. Each leaf is associated with a real number. For each internal node v, let Av be the set of numbe...
详细信息
ISBN:
(纸本)0898713293
We introduce the subtree max gap problem. Consider a tree T with n leaves whose internal nodes have degree at least two. Each leaf is associated with a real number. For each internal node v, let Av be the set of numbers associated with the leaves in the subtree rooted at v, regarded as points on the x-axis. The subtree max gap problem is to compute the maximum distance (gap) between any two consecutive points of Av for every internal node v of T. Our algorithm for the subtree max gap problem follows a series of reductions to other combinatorial problems which are interesting on their own merit. The algorithm runs in O(log n) time using n processors on the CREW PRAM. The subtree max gap problem plays a central role in the parallel solution of the string covering problem. Recently, Iliopoulos, Moore and Park gave an O(n log n) time sequential algorithm for the string covering problem. Neither parallelizing the above sequential algorithm nor using known techniques in strings seems to yield an efficient parallel algorithm for string covering. Our parallel algorithm thus follows a new approach, introducing the use of suffix trees and reducing the string covering problem to the subtree max gap problem. The algorithm runs in O(log n) time using n processors on the CRCW PRAM, thereby matching the number of operations in [15].
In this interdisciplinary research we apply the tools of algorithmic complexity theory to three important non-equilibrium growth models that are used in statistical physics. Much of the insight into these models has b...
详细信息
parallel prefix computation is perhaps the most frequently used subroutine in parallelalgorithms today. Its time complexity on the CRCW PRAM is Θ(lg n/lg lg n) using a polynomial number of processors, even in a rand...
详细信息
ISBN:
(纸本)0898713293
parallel prefix computation is perhaps the most frequently used subroutine in parallelalgorithms today. Its time complexity on the CRCW PRAM is Θ(lg n/lg lg n) using a polynomial number of processors, even in a randomized setting. Nevertheless, there are a number of non-trivial applications that have been shown to be solvable using only an approximate version of the prefix sums problem. In this paper we resolve the issue of approximating parallel prefix by introducing an algorithm that runs in O(lg* n) time with very high probability, using n/lg* n processors, which is optimal in terms of both work and running time. Our approximate prefix sums are guaranteed to come within a factor of (1+Ε) of the values of the true sums in a `consistent fashion', where Ε is o(1). We achieve this result through the use of a number of interesting new techniques, such as overcertification and estimate-focusing, as well as through new adaptations of known techniques, such as failure-sweeping and bit-thinning. We give a number of non-trivial applications of our approximate parallel prefix routine. Perhaps the most interesting application is for padded integer sorting, an approximation version of another fundamental problem in parallel algorithm design - integer sorting - where one wishes to sort n integers into an array of size O(n), allowing for gaps between consecutive elements. We show that this problem can also be solved in O(lg* n) time, with very high probability, using a linear amount of work, which is also optimal in both time and work. Finally, we show several applications to integer-coordinate (non-approximate) problems in computational geometry, such as convex hulls and hidden-line elimination, as well as for approximate selection.
暂无评论