High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on eac...
详细信息
ISBN:
(纸本)0898714532
High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is perhaps the main inhibitor for adapting single-disk external memory algorithms to multiple disks. This paper shows that this problem can be solved efficiently using a combination of randomized placement, redundancy and an optimal scheduling algorithm. A buffer of O(D) blocks suffices to support efficient writing of arbitrary blocks if blocks are distributed uniformly at random to the disks (e.g., by hashing). If two randomly allocated copies of each block exist, N arbitrary blocks can be read within [N/D] + I/O steps with high probability. In addition, the redundancy can be reduced from 2 to if 1/r for any integer r. These results can be used to emulate the simple and powerful "single-disk multi-head" model of external computing [1] on the physically more realistic independent disk model [33] with small constant overhead. This is faster than a lower bound for deterministic emulation [3].
We study the problem of scheduling parallel machines online, allowing preemptions while disallowing migration of jobs that have been scheduled on one machine to another. For a given job, we measure the quality of serv...
详细信息
ISBN:
(纸本)0898714532
We study the problem of scheduling parallel machines online, allowing preemptions while disallowing migration of jobs that have been scheduled on one machine to another. For a given job, we measure the quality of service provided by an algorithm by the stretch of the job, defined as the ratio of the amount of time the job spends in the system (also known as the response time) to its processing time. For a sequence of jobs, we measure the performance of an algorithm by the average stretch achieved over all the jobs. The scheduling goal is to minimize the average stretch. This problem is of relevance in many applications, e.g., wireless data servers and distributed server systems In wired networks. We prove an O(1) competitive ratio for this problem. The algorithm for which we prove this result is the one proposed in [2] that has (tight) logarithmic competitive ratio for minimizing the average response time. Thus the algorithm in [2] simultaneously performs well for average response time as well as average stretch. We prove the O(1) competitive ratio against an adversary who not only knows the entire input ahead of time, but is also allowed to migrate jobs. Thus our result shows that migration is not necessary to be competitive for minimizing average stretch;in contrast, we prove that preemption is essential, even if randomization is allowed. We also establish a constant-factor lower bound on the competitive ratio of any online algorithm that minimizes average stretch without migration.
We study an optimization problem that arises in the context of data placement in multimedia storage systems. We are given a collection of M multimedia data objects that need to be assigned to a storage system consisti...
详细信息
ISBN:
(纸本)0898714532
We study an optimization problem that arises in the context of data placement in multimedia storage systems. We are given a collection of M multimedia data objects that need to be assigned to a storage system consisting of N disks d(1),d(2)...,d(N). We are also given sets U-1,U-2,...,U-M such that U-i is the set of clients requesting the ith data object. Each disk d(j) is characterized by two parameters, namely, its storage capacity C-j which indicates the maximum number of data objects that may be assigned to it, and a load capacity L-j which indicates the maximum number of clients that it can serve. The goal is to find a placement of data objects on disks and an assignment of clients to disks so as to maximize the total number of clients served, subject to the capacity constraints of the storage system. We study this data placement problem for two natural classes of storage systems, namely, homogeneous and uniform ratio. Our first main result is a tight upper and lower bound on the number of items that can always be packed for any input instance to homogeneous as well as uniform ratio storage systems. We show that an algorithm given in [11] for data placement, achieves this bound. Our second main result is a polynomial time approximation scheme for the data placement problem in homogeneous and uniform ratio storage systems, answering an open question of [11]. Finally, we also study the problem from an empirical perspective.
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads c...
ISBN:
(纸本)9781581131857
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive in the total number of operations they perform with the fastest serial algorithm for the same problem?Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallelalgorithms; applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic *** putting things in context, we also discuss our “PRAM-on-chip” vision (actually a small update to it), presented at SPAA98.
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The parallel Vec...
ISBN:
(纸本)9781581131857
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word interleaved and cache-line inter-leaved memory systems.
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages...
ISBN:
(纸本)9781581131857
Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. We propose a model for optimizing the exchange of messages under such circumstances which we call the minimum phase remapping problem. We first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, we show how the problem can be phrased as an instance of multi-commodity flow. Next, we study a continuous approximation to the problem. We show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, we devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow ...
详细信息
ISBN:
(纸本)9781581131857
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on the dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFAT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a data-flow model based on the producer-consumer style of programming and can be adopted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both the algorithms, we analyze the ratio of remote vs local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes on fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For large number of processors, both the algorithms perform well, yielding execution times of only 10 msec for an input of 16 K data points on a 64 processor machine, assuming each processor running at 140 MHz clock speed.
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, ...
ISBN:
(纸本)9781581131857
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. This problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new more accurate analysis of of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilck platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. This scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context. A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
暂无评论