ISBN: 9780897918909 (print)
Recent work on scheduling algorithms has resulted in provable bounds on the space taken by parallel computations in relation to the space taken by sequential computations. The results for online versions of these algorithms, however, have been limited to computations in which threads can only synchronize with ancestor or sibling threads. Such computations do not include languages with futures or user-specified synchronization constraints. Here we extend the results to languages with synchronization variables. Such languages include languages with futures, such as Multilisp and Cool, as well as other languages such as ID. The main result is an online scheduling algorithm which, given a computation with w work (total operations), σ synchronizations, d depth (critical path), and s1 sequential space, will run in O(w/p + σ log(pd)/p + d log(pd)) time and s1 + O(pd log(pd)) space on a p-processor CRCW PRAM with a fetch-and-add primitive. This includes all time and space costs for both the computation and the scheduler. The scheduler is non-preemptive in the sense that it will only move a thread if the thread suspends on a synchronization, forks a new thread, or exceeds a threshold when allocating space. For the special case where the computation is a planar graph with left-to-right synchronization edges, the scheduling algorithm can be implemented in O(w/p + d log p) time and s1 + O(pd log p) space. These are the first non-trivial space bounds described for such languages.
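A thread "suspending on a synchronization" is the key event the scheduler reacts to. As a minimal illustration of the synchronization-variable style the abstract refers to (futures/write-once variables), the hypothetical `SyncVar` class below blocks readers until a producer writes; it is a toy sketch using Python threads, not the paper's scheduler.

```python
import threading

class SyncVar:
    """Write-once synchronization variable: readers suspend until a
    value is written, mirroring the futures style the paper targets.
    (Illustrative sketch only; names are invented for this example.)"""
    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def put(self, value):
        self._value = value
        self._event.set()   # wake every suspended reader

    def get(self):
        self._event.wait()  # suspend until the producer writes
        return self._value

# A consumer thread suspends on the variable until the producer fills it.
result = SyncVar()
consumer_out = []

def consumer():
    consumer_out.append(result.get() * 2)

t = threading.Thread(target=consumer)
t.start()
result.put(21)
t.join()
# consumer_out[0] is now 42
```

A non-preemptive scheduler in the paper's sense would only deschedule the consumer at the `get()` call, at a fork, or at a large allocation.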
ISBN: 9780897918909 (print)
In this paper, general simulations of algorithms designed for fully operational BSP machines are developed for BSP machines with faulty or unavailable processors. The fail-stop model is considered for the fault occurrences; that is, if a processor fails or becomes unavailable, it remains so until the end of the computation. The faults are random, in the sense that each processor may fail independently with probability at most a, where a is a constant. Two possible settings for fault occurrences are considered: the static case, where the faulty or unavailable processors are already known at the beginning of the computation, and the dynamic case, where processors may become faulty or unavailable during the computation. In the case of static faults, a simulation of an n-processor fault-free BSP machine on a faulty n-processor BSP machine is presented with constant slowdown per local computation step and O(log n·max{L, g}) slowdown per communication step, given that a preprocessing step has been done that needs O(n/log n·max{L, g}) time (L and g are the parameters of the simulating BSP machine). In the case of dynamic faults, a simulation of an n-processor fault-free BSP machine on a cn·log n-processor faulty BSP machine is presented. No dynamic faults may occur during certain periods of the simulation. The simulations are randomized and Monte Carlo: they are guaranteed to be correct with high probability, and the time bounds always hold. To our knowledge, no previous work on the fault tolerance of the BSP model exists.
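The static-fault setting can be pictured with a toy remapping (this is an illustration of the setting only, not the paper's simulation): each logical processor of the fault-free machine is assigned to a surviving physical processor so that the simulation work is balanced.

```python
def remap(n_logical, alive):
    """Assign each logical processor of a fault-free BSP machine to a
    live physical processor, round-robin, so each live processor
    simulates at most ceil(n_logical / len(alive)) logical ones.
    Toy illustration of the static fail-stop setting; the paper's
    actual simulation is more involved."""
    assert alive, "need at least one live processor"
    return {lp: alive[lp % len(alive)] for lp in range(n_logical)}

# 8 logical processors; physical processors 2 and 5 have failed (fail-stop).
alive = [p for p in range(8) if p not in {2, 5}]
assignment = remap(8, alive)
```

With 8 logical processors spread over 6 survivors, no survivor simulates more than 2 logical processors, which is the source of the constant per-step slowdown in the static case.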
Partitioning unstructured graphs is central to the parallel solution of computational science and engineering problems. Spectral partitioners, such as recursive spectral bisection (RSB), have proven effective in generating high-quality partitions of realistically-sized meshes; the major problem that has hindered their widespread use is their long execution times. This paper presents a new inertial spectral partitioner, called HARP. The main objective of the proposed approach is to quickly partition meshes at runtime in a manner that works efficiently for real applications in the context of distributed-memory machines. The underlying principle of HARP is to find the eigenvectors of the unpartitioned vertices and then project them onto the eigenvectors of the original mesh. Results for various meshes ranging in size from 1,000 to 100,000 vertices indicate that HARP can indeed partition meshes rapidly at runtime. Experimental results show that our largest mesh can be partitioned sequentially in only a few seconds on an SP2, which is several times faster than other spectral partitioners, while maintaining the solution quality of the proven RSB method. A parallel MPI version of HARP has also been implemented on the IBM SP2 and Cray T3E. Parallel HARP, running on 64 processors of the SP2 and T3E, can partition a mesh containing more than 100,000 vertices into 64 subgrids in about half a second. These results indicate that graph partitioning can now be truly embedded in dynamically-changing real-world applications.
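The "solution quality" that spectral partitioners such as RSB and HARP optimize is usually measured as edge cut: the number of mesh edges crossing between subdomains, which drives inter-processor communication. A minimal helper (hypothetical, not part of HARP) makes the metric concrete:

```python
def edge_cut(edges, part):
    """Count edges whose endpoints land in different subdomains.
    `part` maps vertex -> partition id. This is the standard
    partition-quality metric that partitioners aim to minimize."""
    return sum(1 for u, v in edges if part[u] != part[v])

# 4-vertex square mesh: edges 0-1, 1-2, 2-3, 3-0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
good = {0: 0, 1: 0, 2: 1, 3: 1}   # cut = 2 (edges 1-2 and 3-0 cross)
bad  = {0: 0, 1: 1, 2: 0, 3: 1}   # cut = 4 (every edge crosses)
```

A partitioner is judged on how small a cut it achieves while keeping the subdomains balanced in size.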
ISBN: 9780897918909 (print)
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-known program transformation that has been shown to be effective in improving data locality in parallel programs by reducing inter-processor communication and improving register and cache locality. Weighted loop fusion is the problem of finding a legal partition of loop nests into fusible clusters so as to minimize the total inter-cluster weights. The loop nests may contain parallel or sequential loops; care is taken to ensure that a parallel loop does not get serialized after fusion. It has been shown in past work that the weighted loop fusion problem is NP-hard. Despite the NP-hardness property, we show how optimal solutions can be found efficiently (i.e., within the compile-time constraints of a product-quality optimizing compiler) for weighted loop fusion problem sizes that occur in practice. In this paper, we present an integer programming formulation for weighted loop fusion whose size (number of variables and constraints) is linearly proportional to the size of the input weighted loop fusion problem. The linear-sized formulation is key to making the execution time small enough for use in a product-quality optimizing compiler, since the natural integer programming formulation for this problem has cubic size, for which the execution time would be too large to be practical. The linear-sized integer programming formulation can be solved efficiently using any standard optimization package, but we also present a custom branch-and-bound algorithm that can be used if greater efficiency is desired. A prototype implementation of this approach has been completed, and preliminary compile-time measurements are included in the paper as validation of the practicality of this approach.
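The transformation itself is simple to picture (this sketch shows plain loop fusion, not the paper's weighted ILP formulation): two loops over the same index range are merged so the intermediate value stays in a register instead of a full temporary array.

```python
import array

n = 1_000
a = array.array('d', range(n))

# Unfused: two loops; the intermediate array `b` is written out and
# read back, costing an extra pass over memory.
b = array.array('d', (x * 2.0 for x in a))
c_unfused = array.array('d', (x + 1.0 for x in b))

# Fused: one loop; each element of `a` is touched once and the
# intermediate value 2*x never materializes as an array.
c_fused = array.array('d', (x * 2.0 + 1.0 for x in a))
```

Legality is the catch: fusion must not reverse any cross-loop dependence, and, as the abstract notes, must not serialize a loop that was parallel.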
ISBN: 0818679778 (print)
Over the last decade, significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
ISBN: 9780897918909 (print)
It is well known that after placing m ≥ n balls independently and uniformly at random (i.u.r.) into n bins, the fullest bin contains Θ(log n/log log n + m/n) balls, with high probability. It is also known (see [Ste96]) that a maximum load of O(m/n) can be obtained for all m ≥ n if a ball is allocated in one (suitably chosen) of two (i.u.r.) bins. Stemann ([Ste96]) shows that r communication rounds suffice to guarantee a maximum load of max{r√(log n), O(m/n)}, with high probability. In particular, O(log log n) communication rounds suffice to guarantee optimal load O(m/n) for m ≥ n, with high probability. Adler et al. have shown in [ACMR95] that Stemann's protocol is optimal for constant r. In this paper we extend the above results in two directions. We generalize the lower bound to arbitrary r ≤ log log n; this implies that Stemann's protocol is optimal for all r. Our main result is a generalization of Stemann's upper bound to weighted jobs: let WA (WM) denote the average (maximum) weight of the balls, and let Δ = WA/WM. Note that the maximum load is at least Ω(m/n·WA + WM). We present a protocol that achieves maximum load γ·(m/n·WA + WM) using O(log log n/log(γ·((m/n)·Δ + 1))) rounds of communication. For uniform weights this matches the results of Stemann; in particular, for log log n rounds we achieve optimal load of O(m/n·WA + WM). Using the lower bound it is also shown that our algorithm is optimal in the case of weighted balls for various degrees of uniformity. All these balls-into-bins games model load balancing problems: the balls are jobs, the bins are resources, and the task is to allocate the jobs to the resources so that they are evenly distributed. Under this interpretation, considering weighted balls is very important because the weights may, e.g., model the runtimes of jobs. Applications of such load balancing problems occur, e.g., in client-server networks and in multimedia servers using disk arrays.
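The gap between the single-choice Θ(log n/log log n) load and the two-choice O(m/n)-style load is easy to observe empirically. The sketch below simulates the basic sequential greedy two-choice game (the multi-round collective protocols of Stemann and of this paper are more elaborate; this only illustrates the underlying effect):

```python
import random

def max_load(m, n, choices, rng):
    """Throw m balls into n bins; each ball samples `choices` bins
    i.u.r. and goes to the currently least loaded of them."""
    load = [0] * n
    for _ in range(m):
        picks = [rng.randrange(n) for _ in range(choices)]
        best = min(picks, key=lambda b: load[b])
        load[best] += 1
    return max(load)

rng = random.Random(0)
n = m = 10_000
single = max_load(m, n, 1, rng)  # classical single-choice game
double = max_load(m, n, 2, rng)  # greedy two-choice game
# `single` typically lands near log n / log log n, while `double`
# stays near log log n, echoing the bounds quoted above.
```

For weighted balls the same game would add each ball's weight instead of 1, which is why the natural lower bound becomes Ω(m/n·WA + WM): the average term from total weight, the WM term because the heaviest ball must land somewhere.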