We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs [5, 16], obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunit...
ISBN:
(纸本)9781581131857
We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs [5, 16], obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunity. Although computing optimal schedules usually requires the use of general (often inefficient) function-optimizing methods, we show how to compute optimal schedules efficiently for the broad class of opportunities whose durations come from a concave probability distribution. Even when no such efficient computation of an optimal schedule is available, our refined model always suggests a natural notion of approximately optimal schedule, which may be efficiently computable. We illustrate such efficient approximability via the important class of cycle-stealing opportunities whose durations come from a heavy-tailed distribution. Such opportunities do not admit any optimal schedule—nor even a natural notion of approximately optimal schedule—within the model of [5, 16]. Within our refined model, though, we derive computationally simple schedules for heavy-tailed opportunities, which can be “tuned” to have expected work-output that is arbitrarily close to optimal.
In recent years, the task of allocating jobs to servers has been studied with the “balls and bins” abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each j...
ISBN:
(纸本)9781581131857
In recent years, the task of allocating jobs to servers has been studied with the “balls and bins” abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each job (ball) a little freedom in choosing its destination server (bin).In this paper we examine an infinite and parallel allocation process (see [ABS98]) which is related to the “balls and bins” abstraction. The simple process can be used to model many problems arising in applications like load balancing, data accesses for parallel data servers, hashing, and PRAM ***, the parallel allocation process behaves in a highly non-uniform manner which makes its analysis challenging. Even the typically simple question of for which arrival rates the process is stable, is highly non-trivial. In order to cope with this non-uniform behavior we introduce a new sequential process and show (via simulations) that the sequential process models the behavior of the parallel one very accurately. We develop a system of ordinary differential equations in order to describe the behavior of our sequential process and present a thorough analysis of the performance this process. For example, we show that the queue length distribution decreases double-exponentially. Finally, we present simulation results indicating that the solutions to the differential equations very well predict the queue length distribution of our sequential process and the largest injection rate for which it is ***, we can conclude that in all the performance characteristics we have measured experimentally, the parallel and the sequential process are closely related. This indicates that the obtained solution of the differential equations and the results presented above are applicable to the parallel process, too.
The problem of obtaining efficient solutions for parallel evaluation of linear algebraic circuits (LAC) is considered. The parallelism is obtained through matrix multiplications. A CREW PRAM algorithm is used, in whic...
详细信息
The problem of obtaining efficient solutions for parallel evaluation of linear algebraic circuits (LAC) is considered. The parallelism is obtained through matrix multiplications. A CREW PRAM algorithm is used, in which, during execution, propagates computed values into future matrix products, and uses a special scheduling of the matrix multiplications to reduce constant factors of the execution times.
An efficient implementation of simple randomized merging (SRM) algorithm was developed based upon novel data structures. The SRM's lookahead forecasting technique and forecast and flush technique were used for par...
详细信息
An efficient implementation of simple randomized merging (SRM) algorithm was developed based upon novel data structures. The SRM's lookahead forecasting technique and forecast and flush technique were used for parallel prefetching and buffer management, respectively. The techniques resulted in significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to efficiently read blocks of input runs during external merging.
The black box procedures for testing whether a parallel data structure behaved correctly are considered. The first systematic study of algorithms and hardness results for such testing procedures is presented, focusing...
详细信息
The black box procedures for testing whether a parallel data structure behaved correctly are considered. The first systematic study of algorithms and hardness results for such testing procedures is presented, focusing on queues, priority queues, stacks, and counters. The importance of selecting test data, such that distinct values are inserted into the data structures is shown.
Efficient scheduling of processes on processors of a Network of Workstations (NOW) is essential for good system performance. The design of such schedulers is a complex interaction between several system and workload p...
详细信息
Efficient scheduling of processes on processors of a Network of Workstations (NOW) is essential for good system performance. The design of such schedulers is a complex interaction between several system and workload parameters. Two operations, waiting for a message and arrival of a message, can be used to take remedial actions that can guide the behavior of the system towards coscheduling using local information. An intensive implementation and evaluation exercise in studying these system are presented.
A key issue for parallel systems is the development of useful programming abstractions that can coexist with good performance. We describe a communication library that supports an object-based abstraction with a bulk-...
详细信息
A key issue for parallel systems is the development of useful programming abstractions that can coexist with good performance. We describe a communication library that supports an object-based abstraction with a bulk-synchronous communication style;this is the first time such a library has been proposed and implemented. By restricting the library to the exclusive use of barrier synchronization, we are able to design a simple and easy-to-use object system. By exploiting established techniques based on the bulk-synchronous parallel (BSP) model, we are able to design algorithms and library implementations that work well across platforms.
A simple, asynchronous, space-efficient scheduling algorithm for shared memory machines was developed. The algorithm combined the low scheduling overheads and good locality of work stealing with the low space requirem...
详细信息
A simple, asynchronous, space-efficient scheduling algorithm for shared memory machines was developed. The algorithm combined the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. The algorithm was applied in the context of a native, user-level implementation of Posix standard threads or Pthreads. Its performance was evaluated using a set of C-based benchmarks and was compared with two other schedulers. The new algorithm covered a range of scheduling granularities and space requirements, and allowed the user to trade the space requirement of a program with the scheduling granularity.
A malleable task is a computational unit which may be executed on any arbitrary number of processors, its execution time depending on the amount of resources allotted to it. According to the standard behavior of paral...
详细信息
A malleable task is a computational unit which may be executed on any arbitrary number of processors, its execution time depending on the amount of resources allotted to it. According to the standard behavior of parallel applications, we assume that the malleable tasks are monotonic, i.e. that the execution tune is decreasing with the number of processors while the computational work increases. This paper presents a new approach for scheduling a set of independent malleable tasks which leads to a worst case guarantee of √3 for the minimization of the parallel execution time, or makespan. It improves all other existing practical results including the two-phases method introduced by Turek et al. The main idea is to transfer the difficulty of a two phases method from the scheduling part to the allotment selection. We show how to formulate this last problem as a knapsack optimization problem. Then, the scheduling problem is solved by a dual-approximation which leads to a simple structure of two consecutive shelves.
We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. In...
详细信息
We show how to compute multidimensional Fast Fourier Transforms (FFTs) on a multiprocessor system with distributed memory when problem sizes are so large that the data do not fit in the memory of the entire system. Instead, data reside on a parallel disk system and are brought into memory in sections. We use the parallel Disk Model for implementation and analysis. Our method is a straightforward out-of-core variant of a well-known method for in-core, multidimensional FFTs. It performs 1-dimensional FFT computations on each dimension in turn. This method is easy to generalize to any number of dimensions, and it also readily permits the individual dimensions to be of any sizes that are integer powers of 2. The key step is an out-of-core transpose operation that places the data along each dimension into contiguous positions on the parallel disk system so that the data for the 1-dimensional FFTs are contiguous. We present an I/O complexity analysis for this method as well as empirical results for a DEC 2100 server, an SGI Origin 2000, and a Beowulf cluster, each of which has a parallel disk system.
暂无评论