Using idle times of the processors is a well-known approach to run coarse grained parallelalgorithms for extremely complex problems. We present on-line algorithms for scheduling the processes of a parallel applicatio...
详细信息
ISBN:
(纸本)9781581138405
Using idle times of the processors is a well-known approach to run coarse grained parallelalgorithms for extremely complex problems. We present on-line algorithms for scheduling the processes of a parallel application that is known off-line on a dynamic network in which the idle times of the processors are dictated by an adversary. We also take communication and synchronization costs into account. Our first contribution consists of a formal model to restrict the adversary in a reasonable way. We then show a constant factor approximation for the off-line scheduling problem. As this problem has to take communication cost into account, it can be seen as a generalization of many NP-hard parallel machine scheduling problems. Finally, we present on-line algorithms for different models with constant or with "nearly constant" competitive ratio.
An acyclic set in an undirected graph G = (V, E) is a set A ⊂ V (G) such that the induced graph on A is acyclic. A maximal acyclic set (MAS) has the further property that no vertex can be added to it without creating ...
详细信息
ISBN:
(纸本)9781581138405
An acyclic set in an undirected graph G = (V, E) is a set A ⊂ V (G) such that the induced graph on A is acyclic. A maximal acyclic set (MAS) has the further property that no vertex can be added to it without creating a cycle in the induced graph. We present the first NC algorithm that finds an MAS in any graph, running in time O(log4 n) using O((m2+n)/log n) processors on a EREW PRAM, answering questions raised by Chen and He. Our NC algorithm for finding an MAS in a graph relies on a novel NC algorithm for finding a maximal forest in a hypergraph which may be of independent interest.
We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feat...
详细信息
ISBN:
(纸本)9781581138405
We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together. These criteria are the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to be able to represent both user-oriented objectives and system administrator objectives. We propose an algorithm based on a batch policy with increasing batch sizes, with a smart selection of jobs in each batch. This algorithm is assessed by intensive simulation results, compared to a new lower bound (obtained by a relaxation of ILP) of the optimal schedules for both criteria separately. It is currently implemented in an actual real-size cluster platform.
The parallel packet switch (PPS) is extensively used as the core of contemporary commercial switches. This paper investigates the inherent queuing delay and delay jitter introduced by the PPS's demultiplexing algo...
详细信息
ISBN:
(纸本)9781581138405
The parallel packet switch (PPS) is extensively used as the core of contemporary commercial switches. This paper investigates the inherent queuing delay and delay jitter introduced by the PPS's demultiplexing algorithm, relative to an optimal work-conserving switch. We show that the inherent queuing delay and delay jitter of a symmetric and fault-tolerant N × N PPS, where every demultiplexing algorithm dispatches cells to all the middle-stage switches is Ω(N), if there are no buffers in the PPS input-ports. If the demultiplexing algorithms dispatch cells only to part of the middle-stage switches, the queuing delay and delay jitter are Ω(N/S), where 5 is the PPS speedup. These lower bounds hold unless the demultiplexing algorithm has full and immediate knowledge of the switch status. When the PPS has buffers in its input-ports, an Ω(N/S) lower bound holds if the demultiplexing algorithm uses only local information, or the input buffers are small relative to the time an input-port needs to learn the switch global information.
A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one port memories) in order to minimize contention;h...
详细信息
ISBN:
(纸本)9781581138405
A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one port memories) in order to minimize contention;however, this minimizes memory sharing. Idealized sharing occurs by using a single shared memory for all processors but this maximizes contention. Instead, in this paper we show that perfect memory sharing of shared memory can be achieved with a collection of *two*-port memories, as long as the number of processors is less than the number of memories. We show that the problem of allocation is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2. The proof utilizes a new bin packing model, which is interesting in its own right. Further, for important special cases that arise in practice the approximation algorithm is indeed optimal. We also describe an incremental memory allocation algorithm that provides good memory utilization while allowing fast updates.
parallel disks provide a cost effective way of speeding up I/Os in applications that work with large amounts of data. The main challenge is to achieve as much parallelism as possible, using prefetching to avoid bottle...
详细信息
ISBN:
(纸本)9781581138405
parallel disks provide a cost effective way of speeding up I/Os in applications that work with large amounts of data. The main challenge is to achieve as much parallelism as possible, using prefetching to avoid bottlenecks in disk access. Efficient algorithms have been developed for some particular patterns of accessing the disk blocks, In this paper, we consider general request sequences. When the request sequence consists of unique block requests, the problem is called prefetching and is a well-solved problem for arbitrary request sequences. When the reference sequence can have repeated references to the same block, we need to devise an effective caching policy as well. While optimum offline algorithms have been recently designed for the problem, in the online case, no effective algorithm was previously known. Our main contribution is a deterministic online algorithm threshold-LRU which achieves O((MD/L) 2/3) competitive ratio and a randomized online algorithm threshold-MARK which achieves O(√(MD/L) log(MD/L)) competitive ratio for the caching/prefetching problem on the parallel disk model (PDM), where D is the number of disks, M is the size of fast memory buffer, and M + L is the amount of lookahead available in the request sequence. The best-known lower bound on the competitive ratio is Ω(√MD/L) for lookahead L ≥ M in both models. We also show that if the deterministic online algorithm is allowed to have twice the memory of the offline then a tight competitive ratio of Θ(√MD/L) can be achieved. This problem generalizes the well-known paging problem on a single disk to the parallel disk model.
A key capability of data-race detectors is to determine whether one thread executes logically in parallel with another or whether the threads must operate in series. This paper provides two algorithms, one serial and ...
详细信息
ISBN:
(纸本)9781581138405
A key capability of data-race detectors is to determine whether one thread executes logically in parallel with another or whether the threads must operate in series. This paper provides two algorithms, one serial and one parallel, to maintain series-parallel (SP) relationships "on the fly" for fork-join multithreaded programs. The serial SP-order algorithm runs in O(1) amortized time per operation. In contrast, the previously best algorithm requires a time per operation that is proportional to Tarjan's functional inverse of Ackermann's function. SP-order employs an order-maintenance data structure that allows us to implement a more efficient "English-Hebrew" labeling scheme than was used in earlier race detectors, which immediately yields an improved determinacy-race detector. In particular, any fork-join program running in T1 time on a single processor can be checked on the fly for determinacy races in O(T1) time. Corresponding improved bounds can also be obtained for more sophisticated data-race detectors, for example, those that use locks. By combining SP-order with Feng and Leiserson's serial SP-bags algorithm, we obtain a parallel SP-maintenance algorithm, called SP-hybrid. Suppose that a fork-join program has n threads, T1 work, and a critical-path length of T∞. When executed on P processors, we prove that SP-hybrid runs in O((T1/P + PT∞) lg n) expected time. To understand this bound, consider that the original program obtains linear speed-up over a 1-processor execution when P = O(T 1/T∞). In contrast, SP-hybrid obtains linear speed-up when P = O(√T1/T∞), but the work is increased by a factor of O (lg n).
作者:
Gustedt, JensLORIA
INRIA Lorraine Campus Scientifique BP 239 F-54506 Vandoeuvre-les-Nancy France
We show how to uniformly distribute data at random (not to be confounded with permutation routing) in a coarse grained parallel environment with p processors. In contrast to previously known work, our method is able t...
详细信息
ISBN:
(纸本)9781581136616
We show how to uniformly distribute data at random (not to be confounded with permutation routing) in a coarse grained parallel environment with p processors. In contrast to previously known work, our method is able to fulfill the three goals of uniformity, work-optimality and balance among the processors simultaneously. To guarantee the uniformity we investigate the matrix of communication requests between the processors. We show that its distribution is a generalization of the multivariate hypergeometric distribution and we give algorithms to compute it efficiently.
We develop an algorithm for parallel disk sorting, whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms have either suboptimal I/O volum...
详细信息
ISBN:
(纸本)9781581136616
We develop an algorithm for parallel disk sorting, whose I/O cost approaches the lower bound and that guarantees almost perfect overlap between I/O and computation. Previous algorithms have either suboptimal I/O volume or cannot guarantee that I/O and computations can always be overlapped. We give an efficient implementation that can (at least) compete with the best practical implementations but gives additional performance guarantees. For the experiments we have configured a state of the art machine that can sustain full bandwidth I/O with eight disks and is very cost effective.
暂无评论