Several different diffusion schemes have previously been developed for load balancing on homogeneous processor networks. We generalize existing schemes, in order to deal with heterogeneous networks. Generalized scheme...
详细信息
Several different diffusion schemes have previously been developed for load balancing on homogeneous processor networks. We generalize existing schemes, in order to deal with heterogeneous networks. Generalized schemes may operate efficiently on networks where each processor can have arbitrary computing power, i.e., the load will be balanced proportionally to these powers. the balancing flow that is calculated by schemes for homogeneous networks is minimal with regard to the l(2)-norm and we prove this to hold true for generalized schemes, too. We demonstrate the usability of generalized schemes by a number of experiments on several heterogeneous networks.
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, ...
详细信息
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. this problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new more accurate analysis of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilk platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. this scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context. A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, ...
详细信息
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. this problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new more accurate analysis of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilk platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. this scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context. A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
this paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We prese...
详细信息
this paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental validation. As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G(n) whose members perform theta(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ohm(n). On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large, important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected O(C [m/s] PTinfinity) more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T-infinity is the number of nodes on the longest chain of dependencies. Based on this we give strong execution time bounds for nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative dataparallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality guided guided work stealing
A deterministic, oblivious, permutation routing algorithm on the 2D meshes of constant queue size is presented. An optimal, linear-time bound for oblivious permutation routing could be obtained for these meshes. A new...
详细信息
A deterministic, oblivious, permutation routing algorithm on the 2D meshes of constant queue size is presented. An optimal, linear-time bound for oblivious permutation routing could be obtained for these meshes. A new algorithm based on bit-reversal permutation for oblivious routing is proposed. A (2.954+Ε)n oblivious routing is proposed where Ε is any positive constant and the queue-size is a function of Ε. the running time of the algorithm is less than 1.5 times as large as the network diameter of the mesh. the n×n mesh is considered and model on these meshes are presented. the problems in controlling the pipeline of packet flows is discussed.
Suppose that a parallel algorithm can include any number of parallelthreads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads c...
详细信息
Suppose that a parallel algorithm can include any number of parallelthreads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive in the total number of operations they perform withthe fastest serial algorithm for the same problem? Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. the main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallelalgorithms;applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage);the latter is known to be applicable to evaluating a family of general arithmetic expressions. For putting things in context, we also discuss our `PRAM-on-chip' vision (actually a small update to it), presented at SPAA98.
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages...
详细信息
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances which we call the minimum phase remapping problem. We first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, we show how the problem can be phrased as an instance of multi-commodity flow. Next, we study a continuous approximation to the problem. We show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, we devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.
this paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. the parallel Vec...
详细信息
this paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. the parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. the PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word interleaved and cache-line inter-leaved memory systems.
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, ...
详细信息
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. this problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new more accurate analysis of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilk platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. this scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context.' A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs, obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunity. Altho...
详细信息
We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs, obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunity. Although computing optimal schedules usually requires the use of general (often inefficient) function-optimizing methods, we show how to compute optimal schedules efficiently for the broad class of opportunities whose durations come from a concave probability distribution. Even when no such efficient computation of an optimal schedule is available, our refined model always suggests a natural notion of approximately optimal schedule, which may be efficiently computable. We illustrate such efficient approximability via the important class of cycle-stealing opportunities whose durations come from a heavy-tailed distribution. Such opportunities do not admit any optimal schedule - nor even a natural notion of approximately optimal schedule - within the model of [5, 16]. Within our refined model, though, we derive computationally simple schedules for heavy-tailed opportunities, which can be `tuned' to have expected work-output that is arbitrarily close to optimal.
暂无评论