The proceedings contain 48 papers from the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures. Topics discussed include: compact routing schemes; simple on-line algorithms for the maximum disjoint paths problem; competitive buffer management for shared-memory switches; attack propagation in networks; computational power of pipelined memory hierarchies; and a data tracking scheme for general networks.
ISBN: 9781581134094 (print)
We address the problem of prefetching and caching in a parallel I/O system and present a new algorithm for optimal parallel-disk scheduling. Traditional buffer management algorithms that minimize the number of disk I/O accesses are substantially suboptimal in a parallel I/O system, where multiple I/Os can proceed simultaneously. We present a new algorithm, SUPERVISOR, for parallel-disk I/O scheduling. We show that in the off-line case, where a priori knowledge of all the requests is available, SUPERVISOR performs the minimum number of I/Os to service the given I/O requests. This is the first parallel I/O scheduling algorithm that is provably off-line optimal. In the on-line case, we study SUPERVISOR in the context of global L-block lookahead, which gives the buffer management algorithm a lookahead consisting of L distinct requests. We show that the competitive ratio of SUPERVISOR with global L-block lookahead is Θ(M - L + D) when L ≤ M, and Θ(MD/L) when L > M, where D is the number of disks and M is the buffer size.
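The gap this abstract points to is between counting block accesses and counting parallel I/O steps: with D independent disks, one step can fetch up to D blocks, one per disk, so a schedule that minimizes total accesses may still serialize on a single disk. The sketch below is a minimal model of that cost measure, not the SUPERVISOR algorithm; the greedy prefetcher and all names are illustrative.

```python
# Toy parallel-disk model: one parallel I/O step fetches at most one block
# per disk, and at most M blocks are buffered. Consuming a block frees its
# slot immediately (a simplification; cached blocks are not re-used).
def parallel_io_steps(requests, disk_of, M):
    buffer, steps, i = set(), 0, 0
    while i < len(requests):
        if requests[i] in buffer:
            buffer.discard(requests[i])      # consume, free the slot
            i += 1
            continue
        # One parallel I/O step: greedily prefetch upcoming blocks,
        # at most one per distinct disk, until the buffer is full.
        used_disks, j = set(), i
        while j < len(requests) and len(buffer) < M:
            b = requests[j]
            if b not in buffer and disk_of[b] not in used_disks:
                buffer.add(b)
                used_disks.add(disk_of[b])
            j += 1
        steps += 1
    return steps

# Four blocks alternating over two disks, buffer of two: two steps suffice,
# while the same four blocks on one disk would cost four steps.
print(parallel_io_steps([0, 1, 2, 3], {0: 0, 1: 1, 2: 0, 3: 1}, M=2))  # 2
```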
ISBN: 9781581131857 (print)
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy-wait for another thread. A thread can proceed until its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve while remaining competitive, in the total number of operations they perform, with the fastest serial algorithm for the same problem? Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallel algorithms; applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic expressions. To put things in context, we also discuss our "PRAM-on-chip" vision (actually a small update to it), presented at SPAA98.
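As a concrete instance of the paradigm's flavor, the sketch below (ours, not the paper's construction) builds a heap level by level from the bottom up: the sift-downs within one tree level act on disjoint subtrees, so each could run as a thread to completion without ever busy-waiting on another.

```python
def sift_down(a, i, n):
    """Standard min-heap sift-down of node i in a[0:n]."""
    while 2 * i + 1 < n:
        c = 2 * i + 1                        # left child
        if c + 1 < n and a[c + 1] < a[c]:
            c += 1                           # right child is smaller
        if a[i] <= a[c]:
            break
        a[i], a[c] = a[c], a[i]
        i = c

def build_min_heap(a):
    n = len(a)
    top = n.bit_length() - 1                 # deepest tree level
    for level in range(top, -1, -1):         # process levels bottom-up
        lo, hi = 2**level - 1, min(2**(level + 1) - 1, n // 2)
        for i in range(lo, hi):              # independent within a level:
            sift_down(a, i, n)               # each could be its own thread

a = [9, 4, 7, 1, 2, 6, 3]
build_min_heap(a)
assert a[0] == min(a)
```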
ISBN: 9781581131857 (print)
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The Parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word-interleaved and cache-line-interleaved memory systems.
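For the word-interleaved case, the per-bank question reduces to a linear congruence: with B banks, bank b holds exactly the elements i of a stride-s vector based at word address a that satisfy a + i·s ≡ b (mod B). A hedged sketch of that computation follows; the function name and interface are ours, not the paper's hardware design.

```python
from math import gcd

def elements_in_bank(base, stride, length, B, b):
    """Indices i in [0, length) with (base + i*stride) mod B == b."""
    g = gcd(stride, B)
    if (b - base) % g != 0:
        return []                          # the congruence has no solution
    s, m = stride // g, B // g
    target = ((b - base) % B) // g
    i0 = (target * pow(s, -1, m)) % m      # inverse exists since gcd(s, m) == 1
    return list(range(i0, length, m))      # solutions repeat with period B/g

# Stride-3 vector of 8 words at base address 5, across 4 banks:
for bank in range(4):
    print(bank, elements_in_bank(5, 3, 8, 4, bank))
```

Each bank can evaluate this independently and in parallel, which is what lets the broadcast vector operation be resolved without a central serializing unit.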
ISBN: 9781581131857 (print)
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances, which we call the minimum phase remapping problem. We first show that the problem is NP-complete, and then analyze several methodologies for addressing it. First, we show how the problem can be phrased as an instance of multi-commodity flow. Next, we study a continuous approximation to the problem. We show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, we devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.
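A minimal greedy sketch makes the phase structure concrete (an illustration of the problem, not the paper's 1.5-approximation; it also assumes receiver memory is simply reset between phases).

```python
def greedy_phases(messages, mem):
    """messages: (src, dst, size) triples; mem[dst]: free memory at dst."""
    pending, phases = list(messages), []
    while pending:
        used, phase, rest = {}, [], []
        for src, dst, size in pending:
            if used.get(dst, 0) + size <= mem[dst]:
                used[dst] = used.get(dst, 0) + size    # fits in this phase
                phase.append((src, dst, size))
            else:
                rest.append((src, dst, size))          # postpone
        if not phase:
            raise ValueError("a single message exceeds receiver memory")
        phases.append(phase)
        pending = rest
    return phases

# Receiver 2 can hold 8 units per phase, so its two 6-unit messages
# must be split across two phases:
print(greedy_phases([(0, 2, 6), (1, 2, 6), (0, 1, 4)], mem={1: 4, 2: 8}))
```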
ISBN: 9781581131857 (print)
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a dataflow model based on the producer-consumer style of programming and can be adapted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both algorithms, we analyze the ratio of remote vs. local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on a fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For a large number of processors, both algorithms perform well, yielding execution times of only 10 msec for an input of 16K data points on a 64-processor machine with each processor running at a 140 MHz clock speed.
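The parent-child structure of the receiver-initiated formulation is essentially that of a recursive Cooley-Tukey FFT: a parent spawns two half-size child subproblems and combines their results. A minimal sketch, with EARTH-specific concerns such as remote versus local thread placement and block sizes omitted:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])                  # child 1: even-indexed subproblem
    odd = fft(x[1::2])                   # child 2: odd-indexed subproblem
    out = [0j] * n                       # parent combines with twiddle factors
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

print([round(abs(v), 3) for v in fft([1, 1, 1, 1, 0, 0, 0, 0])])
```

In a multithreaded setting the two child calls are the independent units of work; the sender-initiated variant would instead have producers push butterfly operands to consumers as they become ready.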
ISBN: 9781581131857 (print)
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. This problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new, more accurate analysis of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilk platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. This scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context. A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
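A toy, time-stepped simulation (assumptions ours; this is not the paper's scheduler) conveys the utilization idea on heterogeneous processors: an idle processor steals half the remaining work of the processor currently expected to finish last, i.e. the one maximizing remaining work divided by speed.

```python
def steal_simulation(total_work, speeds, dt=0.001):
    remaining = [0.0] * len(speeds)
    remaining[0] = total_work           # all work starts on processor 0
    t = 0.0
    while any(r > 0 for r in remaining):
        # The processor with the largest work/speed estimate finishes last;
        # idle processors steal half of its remaining work.
        victim = max(range(len(speeds)), key=lambda p: remaining[p] / speeds[p])
        for p in range(len(speeds)):
            if remaining[p] == 0 and remaining[victim] > dt:
                remaining[p] = remaining[victim] / 2
                remaining[victim] /= 2
        for p in range(len(speeds)):    # everyone works for dt at its speed
            remaining[p] = max(0.0, remaining[p] - speeds[p] * dt)
        t += dt
    return t

# Three processors with speeds 4, 2, 1: near-full utilization would finish
# 100 units of work in about 100 / 7 ~= 14.3 time units.
print(round(steal_simulation(100.0, [4.0, 2.0, 1.0]), 2))
```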
ISBN: 9781581131857 (print)
We refine the model underlying our prior work on scheduling cycle-stealing opportunities in NOWs [5, 16], obtaining a model wherein the scheduling guidelines of [16] produce optimal schedules for every such opportunity. Although computing optimal schedules usually requires the use of general (often inefficient) function-optimizing methods, we show how to compute optimal schedules efficiently for the broad class of opportunities whose durations come from a concave probability distribution. Even when no such efficient computation of an optimal schedule is available, our refined model always suggests a natural notion of approximately optimal schedule, which may be efficiently computable. We illustrate such efficient approximability via the important class of cycle-stealing opportunities whose durations come from a heavy-tailed distribution. Such opportunities do not admit any optimal schedule—nor even a natural notion of approximately optimal schedule—within the model of [5, 16]. Within our refined model, though, we derive computationally simple schedules for heavy-tailed opportunities, which can be “tuned” to have expected work-output that is arbitrarily close to optimal.
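To make the objective concrete, consider a toy checkpointing model (assumptions ours, not the paper's refined model): checkpoints are taken at times t_1 < t_2 < ..., each costs c, and the work done in an interval is banked only if the owner has not reclaimed the workstation before that interval's checkpoint, with the opportunity's duration X heavy-tailed, say Pareto with P(X > t) = (t0/t)^alpha for t >= t0.

```python
def expected_work(checkpoints, c, t0, alpha):
    """E[banked work] = sum_i (t_i - t_{i-1} - c) * P(X > t_i) for Pareto X."""
    def survival(t):                       # P(X > t) for Pareto(t0, alpha)
        return 1.0 if t <= t0 else (t0 / t) ** alpha
    total, prev = 0.0, 0.0
    for t in checkpoints:
        total += max(0.0, t - prev - c) * survival(t)
        prev = t
    return total

# Geometric schedules can be "tuned" via their growth ratio:
for ratio in (1.5, 2.0, 3.0):
    ts = [ratio**k for k in range(1, 25)]
    print(ratio, round(expected_work(ts, c=0.05, t0=1.0, alpha=1.5), 3))
```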
ISBN: 9781581131857 (print)
In recent years, the task of allocating jobs to servers has been studied with the "balls and bins" abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each job (ball) a little freedom in choosing its destination server (bin). In this paper we examine an infinite and parallel allocation process (see [ABS98]) which is related to the "balls and bins" abstraction. The simple process can be used to model many problems arising in applications like load balancing, data accesses for parallel data servers, hashing, and PRAM simulation. However, the parallel allocation process behaves in a highly non-uniform manner, which makes its analysis challenging. Even the typically simple question of for which arrival rates the process is stable is highly non-trivial. In order to cope with this non-uniform behavior, we introduce a new sequential process and show (via simulations) that the sequential process models the behavior of the parallel one very accurately. We develop a system of ordinary differential equations to describe the behavior of our sequential process and present a thorough analysis of its performance. For example, we show that the queue length distribution decreases double-exponentially. Finally, we present simulation results indicating that the solutions to the differential equations predict very well the queue length distribution of our sequential process and the largest injection rate for which it is stable. Thus, we can conclude that in all the performance characteristics we have measured experimentally, the parallel and the sequential process are closely related. This indicates that the obtained solutions of the differential equations and the results presented above are applicable to the parallel process, too.
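For flavor, the classic mean-field ODE system for a d-choice ("supermarket") queueing process, which may differ in detail from the system developed in the paper but exhibits the same double-exponentially decreasing queue-length tail, can be integrated numerically. Here s_i(t) is the fraction of queues with length at least i under injection rate lam:

```python
def supermarket_tail(lam, d, max_len=12, dt=0.01, steps=50000):
    """Euler integration of ds_i/dt = lam*(s_{i-1}^d - s_i^d) - (s_i - s_{i+1})."""
    s = [1.0] + [0.0] * max_len            # s[0] = 1 by definition
    for _ in range(steps):
        new = s[:]
        for i in range(1, max_len):
            arrivals = lam * (s[i - 1] ** d - s[i] ** d)
            departures = s[i] - s[i + 1]
            new[i] = s[i] + dt * (arrivals - departures)
        s = new
    return s

s = supermarket_tail(lam=0.9, d=2)
# Fixed point is s_i = lam ** ((d**i - 1) / (d - 1)): double-exponential decay.
print([f"{v:.2e}" for v in s[1:6]])
```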