The proceedings contain 46 papers from SPAA 2003, the Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures. The topics discussed include: optimal sharing of bags of tasks in heterogeneous clusters; minimizing total flow time and total completion time with immediate dispatching; a practical algorithm for constructing oblivious routing schemes; a polynomial-time tree decomposition to minimize congestion; and online oblivious routing.
The proceedings contain 7 papers from SPAA 2003, the 15th Annual Symposium on Parallelism in Algorithms and Architectures. The topics discussed include: a practical algorithm for constructing oblivious routing schemes; novel architectures for P2P applications: the continuous-discrete approach; quantifying instruction criticality for shared memory multiprocessors; relaxing the problem-size bound for out-of-core columnsort; the complexity of verifying memory coherence; a near-optimal scheduler for switch-memory-switch routers; and on local algorithms for topology control and routing in ad hoc networks.
ISBN: (Print) 9781581136616
We study integrated prefetching and caching in single and parallel disk systems. There exist two very popular approximation algorithms called Aggressive and Conservative for minimizing the total elapsed time in the single disk problem. For D parallel disks, approximation algorithms are known for both the elapsed time and stall time performance measures. In particular, there exists a D-approximation algorithm for the stall time measure that uses D-1 additional memory locations in cache. In the first part of the paper we investigate approximation algorithms for the single disk problem. We give a refined analysis of the Aggressive algorithm, showing that the original analysis was too pessimistic. We prove that our new bound is tight. Additionally we present a new family of prefetching and caching strategies and give algorithms that perform better than Aggressive and Conservative. In the second part of the paper we investigate the problem of minimizing stall time in parallel disk systems. We present a polynomial time algorithm for computing a prefetching/caching schedule whose stall time is bounded by that of an optimal solution. The schedule uses at most 3(D - 1) extra memory locations in cache. This is the first polynomial time algorithm for computing schedules with a minimum stall time. Our algorithm is based on the linear programming approach of [1]. However, in order to achieve minimum stall times, we introduce the new concept of synchronized schedules in which fetches on the D disks are performed completely in parallel.
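To make the single-disk setting concrete, the following is a toy simulation of the Aggressive rule under a simplified cost model (one time unit per served request, one fetch in flight at a time, eager fetching with furthest-next-use eviction). The model and function names are illustrative assumptions, not the paper's exact formulation:

```python
def next_use(refs, block, start):
    """Index of the next reference to `block` at or after `start` (inf if none)."""
    for j in range(start, len(refs)):
        if refs[j] == block:
            return j
    return float("inf")

def aggressive(refs, cache, fetch_time):
    """Toy single-disk Aggressive prefetching: whenever the disk is idle,
    fetch the first missing block in the remaining reference sequence,
    evicting the cached block whose next use is furthest away -- but only
    if that victim is next needed after the fetched block.  Serving a
    cached request costs one time unit.  Returns the total elapsed time."""
    cache, t, i, fetch = set(cache), 0, 0, None
    while i < len(refs):
        if fetch and fetch[1] <= t:             # fetch finished: install block
            cache.add(fetch[0])
            fetch = None
        if fetch is None:
            missing = next((b for b in refs[i:] if b not in cache), None)
            if missing is not None:
                if not cache:                    # room available: fetch directly
                    fetch = (missing, t + fetch_time)
                else:
                    victim = max(cache, key=lambda b: next_use(refs, b, i))
                    if next_use(refs, victim, i) > next_use(refs, missing, i):
                        cache.discard(victim)
                        fetch = (missing, t + fetch_time)
        if refs[i] in cache:                     # hit: serve in one unit
            t += 1
            i += 1
        elif fetch and fetch[0] == refs[i]:      # miss on in-flight block: stall
            t = fetch[1]
        else:                                    # miss with no useful fetch: wait
            t += 1
    return t
```

With no misses the elapsed time is just the number of requests; a miss adds stall time only for the part of the fetch not hidden behind earlier hits.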
Diffusive schemes have been widely analyzed for parallel and distributed load balancing. It is well known that their convergence rates depend on the eigenvalues of some associated matrices and on the expansion properties of the underlying graphs. In the first part of this paper we make use of these relationships in order to obtain new spectral bounds on the edge and node expansion of graphs. We show that these new bounds are better than the classical bounds for several graph classes. In the second part of the paper, we consider the load balancing problem for indivisible unit size tokens. Since known diffusion schemes do not completely balance the load for such settings, we propose a randomized distributed algorithm based on Markov chains to reduce the load imbalance. We prove that this approach provides the best asymptotic result that can be achieved in l1- or l2-norm concerning the final load situation.
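For readers unfamiliar with diffusive schemes, the standard first-order diffusion iteration (whose convergence rate is governed by the eigenvalues the abstract mentions) can be sketched as follows; this is the classical scheme for divisible load, not the paper's randomized algorithm for indivisible tokens:

```python
def diffusion_step(load, nbrs, alpha):
    """One round of first-order diffusion: every node exchanges an
    alpha-fraction of the load difference with each neighbour.  The total
    load is conserved while differences shrink at a rate set by the
    spectrum of the diffusion matrix."""
    return [load[i] + alpha * sum(load[j] - load[i] for j in nbrs[i])
            for i in range(len(load))]

# 4-node cycle; alpha = 1/4 is a safe choice for maximum degree 2
nbrs = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
load = [8.0, 0.0, 0.0, 0.0]
for _ in range(100):
    load = diffusion_step(load, nbrs, 0.25)
# load is now numerically flat at the average, 2.0 on every node
```

On this cycle the second-largest eigenvalue of the iteration matrix is 1/2, so the imbalance halves every round; for indivisible unit-size tokens the iteration cannot reach the exact average, which is the gap the paper's randomized approach addresses.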
ISBN: (Print) 9781581136616
We propose a new approach for constructing P2P networks based on a dynamic decomposition of a continuous space into cells corresponding to processors. We demonstrate the power of these design rules by suggesting two new architectures, one for DHT (Distributed Hash Table) and the other for dynamic expander networks. The DHT network, which we call Distance Halving, allows logarithmic routing and load while preserving constant degrees. It offers an optimal tradeoff between the degree and the dilation in the sense that degree d guarantees a dilation of O(log_d n). Another advantage over previous constructions is its relative simplicity. A major new contribution of this construction is a dynamic caching technique that maintains low load and storage even under the occurrence of hot spots. Our second construction builds a network that is guaranteed to be an expander. The resulting topologies are simple to maintain and implement. Their simplicity makes it easy to modify and add protocols. A small variation yields a DHT which is robust against random faults. Finally, we show that, using our approach, it is possible to construct any family of constant-degree graphs in a dynamic environment, though with worse parameters. Therefore we expect that more distributed data structures could be designed and implemented in a dynamic environment.
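The logarithmic-dilation routing can be illustrated on a fixed-size discrete skeleton of such a continuous-discrete network. The sketch below uses a de Bruijn-style graph (node v linked to 2v mod n and 2v+1 mod n); the function name and the exact discretisation are illustrative assumptions, not the paper's construction verbatim:

```python
def distance_halving_route(u, w, k):
    """Route from u to w on the 2**k-node de Bruijn-style skeleton of a
    continuous-discrete network.  Each hop shifts one bit of the
    destination into the current address, so any route takes at most
    k = log2(n) hops -- the logarithmic dilation claimed above."""
    n = 1 << k
    path = [u]
    for b in range(k - 1, -1, -1):        # inject w's bits, high bit first
        u = ((u << 1) | ((w >> b) & 1)) % n
        path.append(u)
    return path
```

After k hops the source's bits have been shifted out entirely and only the destination's bits remain, so the walk always ends at w, and every step follows an edge of the graph.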
Accommodating the uncertain latency of load instructions is one of the most vexing problems in in-order microarchitecture design and compiler development. Compilers can generate schedules with a high degree of instruction-level parallelism but cannot effectively accommodate unanticipated latencies; incorporating traditional out-of-order execution into the microarchitecture hides some of this latency but redundantly performs work done by the compiler and adds additional pipeline stages. Although effective techniques, such as prefetching and threading, have been proposed to deal with anticipable, long latency misses, the shorter, more diffuse stalls due to difficult-to-anticipate, first- or second-level misses are less easily hidden on in-order architectures. This paper addresses this problem by proposing a microarchitectural technique, referred to as two-pass pipelining, wherein the program executes on two in-order back-end pipelines coupled by a queue. The "advance" pipeline executes instructions greedily, without stalling on unanticipated latency dependences (executing independent instructions while otherwise blocking instructions are deferred). The "backup" pipeline allows concurrent resolution of instructions that were deferred in the other pipeline, resulting in the absorption of shorter misses and the overlap of longer ones. This paper argues that this design is both achievable and a good use of transistor resources and shows results indicating that it can deliver significant speedups for in-order processor designs.
We discuss parallel sorting algorithms and their implementations suitable for cluster architectures in order to optimize cluster resources. We focus on the time spent in computation and the load balancing properties when processors are running at different speeds, i.e. correlated by a multiplicative constant factor (our weak definition of a heterogeneous platform). One scheme is under study: parallel sorting by sampling (either the regular sampling technique introduced by Shi and Schaeffer [J. Parallel Distrib. Comput. 14 (4) (1992) 361] or the over-partitioning scheme introduced by Li and Sevcik [Parallel sorting by over-partitioning, in: Proceedings of the Sixth Annual Symposium on Parallel Algorithms and Architectures, ACM Press, New York, June 1994]). What matters in this paper is mainly the load balance factor and not necessarily the execution time; it is clear that improved load balance leads to improved execution time. The results presented in the paper demonstrate that load balancing for the case of computers with heterogeneous processing capacity is more challenging than for the homogeneous case. The survey, through the sorting case study, allows us to identify some algorithmic issues and software challenges in mastering heterogeneous cluster platforms in order to better utilize them: data decomposition techniques, scheduling, and load balancing methods. (C) 2002 Elsevier Science B.V. All rights reserved.
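The regular-sampling scheme can be sketched sequentially; each part and each bucket would live on its own processor in a real cluster. This is a minimal illustration of the technique, not the cited papers' implementations:

```python
import bisect

def psrs(data, p):
    """Sequential sketch of Parallel Sorting by Regular Sampling: sort p
    parts independently, pick p-1 splitters from regularly spaced samples
    of the sorted parts, then route each key to the bucket selected by its
    splitter interval and sort the buckets."""
    n = len(data)
    parts = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    step = max(1, n // (p * p))
    sample = sorted(x for part in parts for x in part[::step][:p])
    splitters = [sample[i * len(sample) // p] for i in range(1, p)]
    buckets = [[] for _ in range(p)]
    for part in parts:
        cuts = [0] + [bisect.bisect_right(part, s) for s in splitters] + [len(part)]
        for i in range(p):
            buckets[i].extend(part[cuts[i]:cuts[i + 1]])
    return [x for b in buckets for x in sorted(b)]
```

Regular sampling bounds the largest bucket (classically by about 2n/p on homogeneous processors); the abstract's point is that such guarantees degrade when processor speeds differ, which is where over-partitioning and rebalancing come in.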
ISBN: (Print) 0898714346
We present a significant improvement for parallel integer sorting. On the EREW (exclusive read exclusive write) PRAM our algorithm sorts n integers in the range {0, 1, ..., m − 1} in time O(log n) with O(n √(log n)/k) operations using word length k log(m + n), where 1 ≤ k ≤ log n. In this paper we present the following four variants of our algorithm. (1) The first variant sorts integers in {0, 1, ..., m − 1} in time O(log n) and in linear space with O(n) operations using word length log m log n. (2) The second variant sorts integers in {0, 1, ..., n − 1} in time O(log n) and in linear space with O(n √(log n)) operations using word length log n. (3) The third variant sorts integers in {0, 1, ..., m − 1} in time O(log^(3/2) n) and in linear space with O(n √(log n)) operations using word length log(m + n). (4) The fourth variant sorts integers in {0, 1, ..., m − 1} in time O(log n) and space O(n m^ε) with O(n √(log n)) operations using word length log(m + n). Our algorithms can then be generalized to the situation where the word length is k log(m + n), 1 ≤ k ≤ log n.
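The word-length parameter k reflects word-level parallelism: with k keys packed into one word, a single machine operation acts on all of them, which is where the /k factor in the operation count comes from. A minimal sketch of the packing idea (illustrative only, not the paper's sorting algorithm):

```python
def pack(vals, field):
    """Pack integers into a single word, `field` bits per lane.  If every
    value fits in field-1 bits, lane-wise sums cannot carry across lanes."""
    w = 0
    for v in reversed(vals):
        w = (w << field) | v
    return w

def unpack(w, count, field):
    """Split a packed word back into its `count` lanes of `field` bits."""
    mask = (1 << field) - 1
    return [(w >> (i * field)) & mask for i in range(count)]

# One machine addition adds all three lanes at once:
a = pack([1, 2, 3], 8)
b = pack([10, 20, 30], 8)
sums = unpack(a + b, 3, 8)   # [11, 22, 33]
```

Packed comparisons and merges work similarly by reserving a spare test bit per lane, so sorting k-tuples of short keys costs roughly one word operation per tuple instead of k.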
ISBN: (Print) 9781581135299
The utility of algorithm parallelism for coping with increased processor-to-memory latencies using "latency hiding" is part of the folklore of parallel computing. Latency hiding techniques increase the traffic to memory and therefore may "hit another wall": limited bandwidth to memory. The current paper attempts to stimulate research in the following general direction: show that algorithm parallelism need not conflict with limited bandwidth. A general technique for using parallel algorithms to enhance serial implementation in the face of processor-memory latency problems is revisited. Two techniques for alleviating memory bandwidth constraints are presented. Both techniques can be incorporated in a […]. There is often considerable parallelism in many of the algorithms which are known as useful serial algorithms. Interestingly enough, all the examples provided for the use of the two techniques come from such serial algorithms.