ISBN (print): 9780897918909
It is well known that after placing m ≥ n balls independently and uniformly at random (i.u.r.) into n bins, the fullest bin contains Θ(log n/log log n + m/n) balls, with high probability. It is also known (see [Ste96]) that a maximum load of O(m/n) can be obtained for all m ≥ n if each ball is allocated to one (suitably chosen) of two bins selected i.u.r. Stemann [Ste96] shows that r communication rounds suffice to guarantee a maximum load of max{(log n)^(1/r), O(m/n)}, with high probability. In particular, O(log log n) communication rounds suffice to guarantee optimal load O(m/n) for m ≥ n, with high probability. Adler et al. have shown in [acmR95] that Stemann's protocol is optimal for constant r. In this paper we extend the above results in two directions. We generalize the lower bound to arbitrary r ≤ log log n; this implies that Stemann's protocol is optimal for all r. Our main result is a generalization of Stemann's upper bound to weighted jobs: let W_A (W_M) denote the average (maximum) weight of the balls, and let Δ = W_A/W_M. Note that the maximum load is at least Ω((m/n)·W_A + W_M). We present a protocol that achieves a maximum load of γ·((m/n)·W_A + W_M) using O(log log n / log(γ·((m/n)·Δ + 1))) rounds of communication. For uniform weights this matches the results of Stemann; in particular, for log log n rounds we achieve the optimal load O((m/n)·W_A + W_M). Using the lower bound, it is also shown that our algorithm is optimal in the case of weighted balls for various degrees of uniformity. All these balls-into-bins games model load-balancing problems: the balls are jobs, the bins are resources, and the task is to allocate the jobs to the resources so that they are evenly distributed. Under this interpretation, weighted balls are important because the weights can model, e.g., the running times of jobs. Such load-balancing problems arise, e.g., in client-server networks and in multimedia servers using disk arrays.
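The gap between one random choice and two random choices is easy to see experimentally. The following sketch simulates both processes; the function `max_load` and its parameters are illustrative assumptions, not part of the protocols of [Ste96] or [acmR95]:

```python
import random

def max_load(m, n, choices):
    """Throw m balls into n bins; each ball goes to the least-loaded of
    `choices` bins drawn independently and uniformly at random."""
    load = [0] * n
    for _ in range(m):
        picks = [random.randrange(n) for _ in range(choices)]
        load[min(picks, key=load.__getitem__)] += 1
    return max(load)

random.seed(0)
n = m = 10_000
print(max_load(m, n, 1))  # one choice:  Theta(log n / log log n) w.h.p.
print(max_load(m, n, 2))  # two choices: O(log log n) w.h.p. -- far smaller
```

For m = n = 10,000 the one-choice maximum load is typically around 7-8, while two choices bring it down to about 4, reflecting the exponential improvement discussed above.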
ISBN (print): 0898714532
We study an optimization problem that arises in the context of data placement in multimedia storage systems. We are given a collection of M multimedia data objects that need to be assigned to a storage system consisting of N disks d(1), d(2), ..., d(N). We are also given sets U_1, U_2, ..., U_M such that U_i is the set of clients requesting the i-th data object. Each disk d(j) is characterized by two parameters, namely its storage capacity C_j, which indicates the maximum number of data objects that may be assigned to it, and a load capacity L_j, which indicates the maximum number of clients that it can serve. The goal is to find a placement of data objects on disks and an assignment of clients to disks so as to maximize the total number of clients served, subject to the capacity constraints of the storage system. We study this data placement problem for two natural classes of storage systems, namely homogeneous and uniform-ratio systems. Our first main result is a tight upper and lower bound on the number of items that can always be packed for any input instance of homogeneous as well as uniform-ratio storage systems. We show that an algorithm given in [11] for data placement achieves this bound. Our second main result is a polynomial-time approximation scheme for the data placement problem in homogeneous and uniform-ratio storage systems, answering an open question of [11]. Finally, we also study the problem from an empirical perspective.
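To make the constraint structure concrete, here is a minimal sketch that validates a candidate placement and assignment against the capacities C_j and L_j and counts the clients served. The function name and the data encoding are assumptions of this sketch, not the formulation used in the paper:

```python
def clients_served(placement, assignment, C, L):
    """Validate a candidate solution and count clients served.

    placement[j]  -- set of objects stored on disk d(j)    (size <= C[j])
    assignment[j] -- {object: clients served} on disk d(j) (sum  <= L[j])
    """
    total = 0
    for j, objs in enumerate(placement):
        if len(objs) > C[j]:
            raise ValueError(f"disk {j}: storage capacity exceeded")
        if sum(assignment[j].values()) > L[j]:
            raise ValueError(f"disk {j}: load capacity exceeded")
        if set(assignment[j]) - objs:
            raise ValueError(f"disk {j}: serves an object it does not store")
        total += sum(assignment[j].values())
    return total

# Two disks, three objects with |U_1| = 5, |U_2| = 3, |U_3| = 4:
print(clients_served([{0, 1}, {2}], [{0: 5, 1: 3}, {2: 4}],
                     C=[2, 1], L=[8, 4]))  # -> 12
```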
ISBN (print): 9780897916714
Parallel tree contraction has been found to be a useful and quite powerful tool for the design of a wide class of efficient graph algorithms. We propose a corresponding technique for the parallel solution of incremental problems. As our computational model, we assume a variant of the CRCW PRAM in which processors can be activated dynamically by a forking operation. We consider a dynamic binary tree T of at most n nodes and unbounded depth. We describe a procedure, which we call the dynamic parallel tree contraction algorithm, that incrementally processes various parallel modification requests and queries: (1) parallel requests to add or delete leaves of T, or to modify labels of internal nodes or leaves of T, and (2) parallel tree contraction queries which require recomputing values at specified nodes. Each modification or query is with respect to a set of nodes U in T. The dynamic parallel tree contraction algorithm is a randomized algorithm that takes O(log(|U| log n)) expected parallel time using O(|U| log n / log(|U| log n)) processors. We give a large number of applications (with the same bounds), including: (a) maintaining the usual tree properties (such as number of ancestors, preorder, etc.), (b) Eulerian tour, (c) expression evaluation, (d) least common ancestor, and (e) canonical forms. Previously, there were no known parallel algorithms for incrementally maintaining and solving such problems in parallel time less than Θ(log n). In deriving our incremental algorithms, we solve a key subproblem, namely a processor activation problem, within the same asymptotic bounds, which may be useful in the design of other parallel incremental algorithms. This algorithm uses an interesting persistent parallel data structure involving a non-trivial construction. In a subsequent paper, we apply our dynamic parallel tree contraction technique to various incremental graph problems, maintaining properties such as colorings, minimum covering sets, and maximum matchings.
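As a hedged illustration of the underlying contraction idea (application (c), expression evaluation), the following sequential sketch repeatedly "rakes" internal nodes whose children are both leaves. The PRAM version additionally compresses chains so that O(log n) rounds always suffice; this sketch shows only the rake rounds and is not the paper's dynamic algorithm:

```python
import operator

OPS = {'+': operator.add, '*': operator.mul, '-': operator.sub}

def contract_eval(tree):
    """Evaluate an expression tree by repeated rake rounds.
    A tree is either a number (leaf) or a tuple (op, left, right)."""
    def is_leaf(t):
        return not isinstance(t, tuple)

    def rake(t):
        if is_leaf(t):
            return t
        op, left, right = t
        if is_leaf(left) and is_leaf(right):
            return OPS[op](left, right)   # rake both leaf children into parent
        return (op, rake(left), rake(right))

    rounds = 0
    while not is_leaf(tree):
        tree, rounds = rake(tree), rounds + 1
    return tree, rounds

# (2 * 3) + (10 - (1 + 1)) contracts to 14 in three rake rounds
print(contract_eval(('+', ('*', 2, 3), ('-', 10, ('+', 1, 1)))))  # -> (14, 3)
```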
Efficient scheduling of processes on the processors of a Network of Workstations (NOW) is essential for good system performance. The design of such schedulers involves a complex interaction between several system and workload parameters. Two events, a process waiting for a message and the arrival of a message, can be used to trigger remedial actions that guide the behavior of the system towards coscheduling using only local information. An intensive implementation and evaluation exercise studying these systems is presented.
ISBN (print): 9781450304726
Optimizing the performance of a multi-core microprocessor within a power budget has recently received a lot of attention. However, most existing solutions are centralized and cannot scale well with the rapidly increasing level of core integration. While a few recent studies propose power control algorithms for many-core architectures, those solutions assume that the workload of every core is independent and therefore cannot effectively allocate power based on thread criticality to accelerate multi-threaded parallel applications, which are expected to be the primary workloads of many-core architectures. This paper presents a scalable power control solution for many-core microprocessors that is specifically designed to handle realistic workloads, i.e., a mixed group of single-threaded and multi-threaded applications. Our solution features a three-layer design. First, we adopt control theory to precisely control the power of the entire chip to its chip-level budget by adjusting the aggregated frequency of all the cores on the chip. Second, we dynamically group cores running the same applications and then partition the chip-level aggregated frequency quota among the groups for optimized overall microprocessor performance. Finally, we partition the group-level frequency quota among the cores in each group based on measured thread criticality for shorter application completion time. As a result, our solution can optimize microprocessor performance while precisely limiting chip-level power consumption below the desired budget. Empirical results on a 12-core hardware testbed show that our control solution can provide precise power control, as well as 17% and 11% better application performance than two state-of-the-art solutions, on average, for mixed PARSEC and SPEC benchmarks. Furthermore, our extensive simulation results for 32, 64, and 128 cores, as well as overhead analysis for up to 4,096 cores, demonstrate that our solution is highly scalable to many-core architectures.
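The first layer can be sketched as a simple feedback loop: measure chip power, compare to the budget, and nudge the aggregated frequency by the error. This is a minimal integral-control sketch under a toy linear power model; `measure_power`, `set_freq`, and the gain are assumptions of the sketch, not the paper's controller design or hardware API:

```python
def chip_power_controller(budget, measure_power, set_freq, f_min, f_max,
                          steps=50, gain=0.05):
    """First-layer sketch: steer total chip power toward `budget` by
    adjusting the aggregated core frequency with an integral controller."""
    freq = f_max
    for _ in range(steps):
        err = budget - measure_power(freq)            # watts of headroom left
        freq = min(f_max, max(f_min, freq + gain * err))
        set_freq(freq)                                # apply the new setpoint
    return freq

# Toy linear power model (assumption): P = 2*f + 10 watts.
model = lambda f: 2.0 * f + 10.0
final = chip_power_controller(budget=50.0, measure_power=model,
                              set_freq=lambda f: None, f_min=0.5, f_max=30.0)
print(round(final, 2))  # settles near f = 20, where the model meets 50 W
```

With the linear model the loop is a contraction (f ← 0.9·f + 2), so it converges geometrically to the budget-meeting frequency; a real controller would also need to handle measurement noise and model drift.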
A key issue for parallel systems is the development of useful programming abstractions that can coexist with good performance. We describe a communication library that supports an object-based abstraction with a bulk-synchronous communication style; this is the first time such a library has been proposed and implemented. By restricting the library to the exclusive use of barrier synchronization, we are able to design a simple and easy-to-use object system. By exploiting established techniques based on the bulk-synchronous parallel (BSP) model, we are able to design algorithms and library implementations that work well across platforms.
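The bulk-synchronous style described above can be sketched in a few lines: each superstep is a local computation phase and a communication phase, separated only by barriers. This thread-based sketch is an illustration of the BSP model, not the library's actual API (it leans on CPython's GIL for atomic list appends, as noted in the comments):

```python
import threading

def bsp_run(p, supersteps):
    """Minimal BSP-style sketch: p threads alternate a local computation
    phase and a communication phase, separated only by barriers."""
    barrier = threading.Barrier(p)
    inbox = [[] for _ in range(p)]        # one message list per process
    results = [0] * p

    def worker(pid):
        for step in supersteps:
            outgoing = step(pid, list(inbox[pid]))  # computation phase
            barrier.wait()                          # computation done everywhere
            for dest, msg in outgoing:              # communication phase
                inbox[dest].append(msg)             # append is atomic under the GIL
            barrier.wait()                          # superstep boundary
        results[pid] = sum(inbox[pid])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

p = 4
def scatter(pid, msgs):
    # single superstep: every process sends pid + 1 to all others
    return [(dest, pid + 1) for dest in range(p) if dest != pid]

print(bsp_run(p, [scatter]))  # -> [9, 8, 7, 6]
```

Because all communication is delivered only at the barrier, the result is deterministic regardless of thread interleaving, which is exactly the property that makes BSP-style libraries easy to reason about.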
A simple, asynchronous, space-efficient scheduling algorithm for shared-memory machines was developed. The algorithm combined the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. It was applied in the context of a native, user-level implementation of POSIX standard threads (Pthreads), and its performance was evaluated using a set of C-based benchmarks and compared with that of two other schedulers. The new algorithm covered a range of scheduling granularities and space requirements, allowing the user to trade off the space requirement of a program against the scheduling granularity.
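The work-stealing half of this combination rests on a per-worker deque: the owner pushes and pops at the bottom (LIFO, good locality), while idle workers steal the oldest, typically largest, tasks from the top. The following coarse-locked sketch shows that structure only; the depth-first space bound of the algorithm above comes from extra ordering rules not shown here:

```python
import collections
import threading

class WorkStealingDeque:
    """Per-worker task deque sketch: owner at the bottom, thieves at the top."""
    def __init__(self):
        self._dq = collections.deque()
        self._lock = threading.Lock()   # coarse lock; real deques are lock-free

    def push(self, task):               # owner only
        with self._lock:
            self._dq.append(task)

    def pop(self):                      # owner only: newest task first (locality)
        with self._lock:
            return self._dq.pop() if self._dq else None

    def steal(self):                    # any thief: oldest task first
        with self._lock:
            return self._dq.popleft() if self._dq else None

dq = WorkStealingDeque()
for task in ("root", "child", "grandchild"):
    dq.push(task)
print(dq.pop(), dq.steal())  # -> grandchild root
```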
A malleable task is a computational unit that may be executed on any number of processors, its execution time depending on the amount of resources allotted to it. In accordance with the standard behavior of parallel applications, we assume that the malleable tasks are monotonic, i.e. that the execution time decreases with the number of processors while the computational work increases. This paper presents a new approach for scheduling a set of independent malleable tasks that leads to a worst-case guarantee of √3 for the minimization of the parallel execution time, or makespan. It improves on all other existing practical results, including the two-phase method introduced by Turek et al. The main idea is to transfer the difficulty of a two-phase method from the scheduling part to the allotment selection, and we show how to formulate this latter problem as a knapsack optimization problem. The scheduling problem is then solved by a dual approximation, which leads to a simple structure of two consecutive shelves.
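One ingredient such a dual-approximation approach needs is, for a guessed makespan, the smallest allotment that finishes each task in time. Monotonicity makes this a binary search. The function `min_procs` and the perfect-speedup time model below are illustrative assumptions, not the paper's exact formulation:

```python
def min_procs(t, i, deadline, m):
    """Smallest p in 1..m with t(i, p) <= deadline, or None if even m
    processors are too slow. Monotonicity (t non-increasing in p) makes
    the binary search valid."""
    if t(i, m) > deadline:
        return None
    lo, hi = 1, m
    while lo < hi:
        mid = (lo + hi) // 2
        if t(i, mid) <= deadline:
            hi = mid           # mid processors already meet the deadline
        else:
            lo = mid + 1
    return lo

# Toy monotonic model (assumption): perfect speedup, t(i, p) = w_i / p,
# so the time decreases with p while the work p * t(i, p) stays constant.
work = [8.0, 6.0, 4.0]
t = lambda i, p: work[i] / p
print([min_procs(t, i, 4.0, m=8) for i in range(3)])  # -> [2, 2, 1]
```

In the full method, these minimal allotments feed the knapsack-based selection, and the resulting tasks are packed onto the two shelves mentioned above.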