Over the last decade, significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the cont...
详细信息
ISBN:
(纸本)0818679778
Over the last decade, significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-know...
详细信息
ISBN:
(纸本)9780897918909
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-known program transformation that has shown to be effective in improving data locality in parallel programs by reducing inter-processor communication and improving register and cache locality. Weighted loop fusion is the problem of finding a legal partition of loop nests into fusible clusters so as to minimize the total inter-cluster weights. The loop nests may contain parallel or sequential loops;care is taken to ensure that a parallel loop does not get serialized after fusion. It has been shown in past work that the weighted loop fusion problem is NP-hard. Despite the NP-hardness property, we show how optimal solutions can be found efficiently (i.e., within the compile-time constraints of a product-quality optimizing compiler) for weighted loop fusion problem sizes that occur in practice. In this paper, we present an integer programming formulation for weighted loop fusion with size (number of variables and constraints) that is linearly proportional to the size of the input weighted loop fusion problem. The linear-sized formulation is key to making the execution time small enough for use in a product-quality optimizing compiler, since the natural integer programming formulation for this problem has cubic size for which the execution time would be too large to be practical. The linear-sized integer programming formulation can be solved efficiently using any standard optimization package but we also present a custom branch-and-bound algorithm that can be used if greater efficiency is desired. A prototype implementation of this approach has been completed, and preliminary compile-time measurements are included in the paper as validation of the practicality of this approach.
We present new approximation algorithms for the problem of scheduling precedence-constrained jobs on parallel machines that are uniformly related. That is, there are n jobs and m machines;each job j requires pj units ...
详细信息
We present new approximation algorithms for the problem of scheduling precedence-constrained jobs on parallel machines that are uniformly related. That is, there are n jobs and m machines;each job j requires pj units of processing, and is to be processed on one machine without interruption;if it is assigned to machine i, which runs at a given speed si, it takes pj/si time units. There also is a partial order max = maxjCj, where Cj denotes the completion time of job j, and ∑j wjCj, where wj is a weight that is given for each job j. For the first objective, the best previously known result is an O(√m)-approximation algorithm, which was shown by Jaffe more than 15 years ago. We shall give an O(log m)-approximation algorithm. We shall also show how to extend this result to obtain an O(log m)-approximation algorithm for the second objective, albeit with a somewhat larger constant. These results also extend to settings in which each job j has a release date rj before which the job may not begin processing. In addition, we obtain stronger performance guarantees if there are a limited number of distinct speeds. Our results are based on a new linear programming-based technique for estimating the speed at which each job should be run, and a variant of the list scheduling algorithm of Graham that can exploit this additional information.
In this paper a parallel and fault-tolerant LAN (P_FTLAN) with dual communication subnetworks is presented to improve LANs' reliability. Its function modes, technical characters, hardware and software architecture...
详细信息
ISBN:
(纸本)0818678704
In this paper a parallel and fault-tolerant LAN (P_FTLAN) with dual communication subnetworks is presented to improve LANs' reliability. Its function modes, technical characters, hardware and software architectures and some key, implementation techniques, such as logical addresses and parallel mechanisms, are described in details. Our prototype system and analyzing results suggest that the scheme presented in the paper not only provides an effective approach to high reliable LANs, but also can improve their performance greatly.
parallelism suffers from a lack of programming languages both simple to handle and able to take advantage of the power of present parallel computers. If parallelism expression is too high level, compilers have to perf...
详细信息
ISBN:
(纸本)0818678836
parallelism suffers from a lack of programming languages both simple to handle and able to take advantage of the power of present parallel computers. If parallelism expression is too high level, compilers have to perform complex optimizations leading often to poor performances. One the other hand, too low level parallelism transfers difficulties reward the programmer. In this paper, we propose a new programming language that integrates both a synchronous data-parallel progamming model and an asynchronous execution model. The synchronous data-parallel programming model allows a safe program designing. The asynchronous execution model yields an efficient execution on present MIMD architectures without any program transformation. Our language relies on a logical instruction ordering exploited by specific send/receive communications. It allows to express only the effective data dependences between processors. This ability is enforced by a possible send/receive unmatching useful for irregular algorithms. A sparse vector computation exemplifies our language potentialities.
It is well known that after placing m≥n balls independently and uniformly at random (i.u.r.) into n bins, the fullest bin contains Θ(log n/log log n+m/n) balls, with high probability. It is also known (see [Ste96]) ...
详细信息
ISBN:
(纸本)9780897918909
It is well known that after placing m≥n balls independently and uniformly at random (i.u.r.) into n bins, the fullest bin contains Θ(log n/log log n+m/n) balls, with high probability. It is also known (see [Ste96]) that a maximum load of O (m/n) can be obtained for all m≥n if a ball is allocated in one (suitably chosen) of two (i.u.r.) bins. Stemann ([Ste96]) shows that r communication rounds suffice to guarantee a maximum load of max {r√log n, O (m/n)}, with high probability. In particular, O(log log n) communication rounds suffice to guarantee optimal load O(m/n) for m≥n, with high probability. Adler et al. have shown in [acmR95] that Stemanns protocol is optimal for constant r. In this paper we extend the above results in two directions: We generalize the lower bound to arbitrary r≤log log n. This implies that the result of Stemanns protocol is optimal for all r. Our main result is a generalization of Stemanns upper bound to weighted jobs: Let WA (WM) denote the average (maximum) weight of the balls. Further let Δ = WA/WM. Note that the maximum load is at least Ω(m/n·WA+WM). We present a protocol that achieves maximum load of γ·(m/n·WA+WM) using O (log log n/log(γ·((m/n)·Δ+1))) rounds of communication. For uniform weights this matches the results of Stemann. In particular, for log log n rounds we achieve optimal load of O (m/n·WA+WM). Using this lower bound it is also shown that our algorithm is optimal in the case of weighted balls for various degrees of uniformity. All the balls into bins games model Load Balancing problems: The balls are jobs, the bins are resources, the task is to allocate the jobs to the resources so that they are evenly distributed. From this interpretation, considering weighted balls is very important because the weights may e.g. model the runtime of jobs. Applications of such Load Balancing problems occur e.g. for client-server networks and for Multimedia-Servers using disk arrays.
We consider the problem of optimizing the total flow time a stream of jobs that are released over time in a multi-processor setting. This problem is NP-hard even when we allow preemption, and have only two machines. A...
详细信息
ISBN:
(纸本)9780897918886
We consider the problem of optimizing the total flow time a stream of jobs that are released over time in a multi-processor setting. This problem is NP-hard even when we allow preemption, and have only two machines. Although the total (or average) flow time is widely accepted as a good measurement of the overall quality of service, no approximation algorithms were known for this basic scheduling problem. This paper contains two main results. We first prove that when preemption is allowed, Shortest Remaining Processing Time (SRPT) is an O(log(min{n/m, P})) approximation algorithm for the total flow time, where n is the number of jobs, m is the number of machines, and P is the ratio between the maximum and the minimum processing time of a job. We also provide an Ω(log n/m) and an Ω(log P) lower bounds on the (worst case) competitive ratio of any randomized algorithm for the on-line problem in which jobs are known at their release times. Thus, we show that up to a constant factor SRPT is an optimal on-line algorithm. Our second main result addresses the non-preemptive case. We present a general technique that allows to transform any preemptive solution into a non-preemptive solution at the expense of an O(√n/m) factor in the approximation ratio of the total flow time. Combining this technique with our previous result yields an O(√n/m log n/m) approximation algorithm for this case. We also show an Ω(n1/3-Ε) lower bound on the approximability of this problem (assuming P≠NP).
暂无评论