Partitioning unstructured graphs is central to the parallel solution of computational science and engineering problems. Spectral partitioners, such as recursive spectral bisection (RSB), have proven effective in generating high-quality partitions of realistically sized meshes. The major problem that hindered their widespread use was their long execution times. This paper presents a new inertial spectral partitioner called HARP. The main objective of the proposed approach is to quickly partition meshes at runtime in a manner that works efficiently for real applications in the context of distributed-memory machines. The underlying principle of HARP is to find the eigenvectors of the unpartitioned vertices and then project them onto the eigenvectors of the original mesh. Results for various meshes ranging in size from 1,000 to 100,000 vertices indicate that HARP can indeed partition meshes rapidly at runtime. Experimental results show that our largest mesh can be partitioned sequentially in only a few seconds on an SP2, which is several times faster than other spectral partitioners, while maintaining the solution quality of the proven RSB method. A parallel MPI version of HARP has also been implemented on the IBM SP2 and Cray T3E. Running on 64 processors of the SP2 and T3E, parallel HARP can partition a mesh containing more than 100,000 vertices into 64 subgrids in about half a second. These results indicate that graph partitioning can now be truly embedded in dynamically changing real-world applications.
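The baseline the abstract measures against, recursive spectral bisection, splits a graph by the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). The sketch below is a minimal illustration of one RSB bisection step, not of HARP itself (HARP additionally projects vertices onto precomputed eigenvectors); the function name and the toy path graph are assumptions for illustration.

```python
import numpy as np

def spectral_bisect(adj):
    """One RSB step: split vertices into two halves via the Fiedler vector.

    adj: symmetric 0/1 adjacency matrix (numpy array).
    Returns a boolean array marking one half of the bisection.
    """
    laplacian = np.diag(adj.sum(axis=1)) - adj
    _, vecs = np.linalg.eigh(laplacian)   # eigenvectors, ascending eigenvalues
    fiedler = vecs[:, 1]                  # second smallest: the Fiedler vector
    return fiedler >= np.median(fiedler)  # split at the median entry

# A path on 6 vertices: the bisection cuts it into two contiguous halves.
n = 6
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1
part = spectral_bisect(adj)
assert part.sum() == 3  # balanced: 3 vertices on each side
```

Recursing on each half yields 2^k subgrids after k levels, which is where the long execution times the abstract mentions come from: each level requires an eigenvector computation.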
ISBN:
(Print) 0818679778
Over the last decade, significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
ISBN:
(Print) 9780897918909
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-known program transformation that has been shown to be effective in improving data locality in parallel programs by reducing inter-processor communication and improving register and cache locality. Weighted loop fusion is the problem of finding a legal partition of loop nests into fusible clusters so as to minimize the total inter-cluster weights. The loop nests may contain parallel or sequential loops; care is taken to ensure that a parallel loop does not get serialized after fusion. It has been shown in past work that the weighted loop fusion problem is NP-hard. Despite the NP-hardness property, we show how optimal solutions can be found efficiently (i.e., within the compile-time constraints of a product-quality optimizing compiler) for weighted loop fusion problem sizes that occur in practice. In this paper, we present an integer programming formulation for weighted loop fusion whose size (number of variables and constraints) is linearly proportional to the size of the input weighted loop fusion problem. The linear-sized formulation is key to making the execution time small enough for use in a product-quality optimizing compiler, since the natural integer programming formulation for this problem has cubic size, for which the execution time would be too large to be practical. The linear-sized integer programming formulation can be solved efficiently using any standard optimization package, but we also present a custom branch-and-bound algorithm that can be used if greater efficiency is desired. A prototype implementation of this approach has been completed, and preliminary compile-time measurements are included in the paper as validation of the practicality of this approach.
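The objective the abstract describes can be made concrete on a tiny instance: partition loop nests into clusters, keep fusion-preventing pairs apart, and minimize the total weight of edges whose endpoints land in different clusters. The sketch below is a hedged toy brute force over all labelings, not the paper's linear-sized integer program; the function name and input encoding are assumptions for illustration.

```python
from itertools import product

def min_weight_fusion(n, weights, preventing):
    """Brute-force optimum of a toy weighted loop fusion instance.

    n: number of loop nests.
    weights: {(i, j): w} -- weight paid if nests i and j end up in
             different clusters (i.e., are not fused together).
    preventing: pairs of nests that must NOT share a cluster
                (e.g., fusing them would serialize a parallel loop).
    Returns the minimum total inter-cluster weight.
    """
    best = float("inf")
    # Enumerate every assignment of nests to cluster labels 0..n-1.
    for labels in product(range(n), repeat=n):
        if any(labels[i] == labels[j] for i, j in preventing):
            continue  # illegal partition: forbidden pair fused
        cut = sum(w for (i, j), w in weights.items()
                  if labels[i] != labels[j])
        best = min(best, cut)
    return best

# Three nests: fusing 0-1 avoids weight 5, fusing 1-2 avoids weight 3,
# but nests 0 and 2 may not be fused. Best: fuse {0,1}, pay the 3.
w = {(0, 1): 5, (1, 2): 3}
assert min_weight_fusion(3, w, preventing=[(0, 2)]) == 3
```

The enumeration is exponential, which is exactly why the paper's contribution, an integer program whose size grows only linearly with the input, matters for a product-quality compiler.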
ISBN:
(Print) 9780897918909
It is well known that after placing m ≥ n balls independently and uniformly at random (i.u.r.) into n bins, the fullest bin contains Θ(log n/log log n + m/n) balls, with high probability. It is also known (see [Ste96]) that a maximum load of O(m/n) can be obtained for all m ≥ n if a ball is allocated in one (suitably chosen) of two (i.u.r.) bins. Stemann [Ste96] shows that r communication rounds suffice to guarantee a maximum load of max{r√log n, O(m/n)}, with high probability. In particular, O(log log n) communication rounds suffice to guarantee optimal load O(m/n) for m ≥ n, with high probability. Adler et al. have shown in [ACMR95] that Stemann's protocol is optimal for constant r. In this paper we extend the above results in two directions. We generalize the lower bound to arbitrary r ≤ log log n; this implies that Stemann's protocol is optimal for all r. Our main result is a generalization of Stemann's upper bound to weighted jobs: let WA (WM) denote the average (maximum) weight of the balls, and further let Δ = WA/WM. Note that the maximum load is at least Ω(m/n·WA + WM). We present a protocol that achieves a maximum load of γ·(m/n·WA + WM) using O(log log n/log(γ·((m/n)·Δ + 1))) rounds of communication. For uniform weights this matches the results of Stemann; in particular, for log log n rounds we achieve an optimal load of O(m/n·WA + WM). Using this lower bound it is also shown that our algorithm is optimal in the case of weighted balls for various degrees of uniformity. All balls-into-bins games model load-balancing problems: the balls are jobs, the bins are resources, and the task is to allocate the jobs to the resources so that they are evenly distributed. Under this interpretation, considering weighted balls is important because the weights may model, e.g., the runtimes of jobs. Applications of such load-balancing problems occur, e.g., in client-server networks and in multimedia servers using disk arrays.
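The single-round version of the allocation the abstract builds on is easy to simulate: placing each ball in the lesser loaded of two random bins sharply reduces the maximum load compared with one random choice. The sketch below is an illustrative simulation of that effect under assumed names; it models neither the multi-round protocol nor the weighted generalization that are the paper's contributions.

```python
import random

def allocate(m, n, choices, rng):
    """Place m balls into n bins; each ball goes to the least loaded
    of `choices` bins drawn independently and uniformly at random.
    choices=1 is the classical single-choice game."""
    bins = [0] * n
    for _ in range(m):
        picks = [rng.randrange(n) for _ in range(choices)]
        bins[min(picks, key=lambda b: bins[b])] += 1
    return bins

rng = random.Random(0)
n = m = 1000
one = max(allocate(m, n, 1, rng))
two = max(allocate(m, n, 2, rng))
# Theory: with m = n, one choice gives max load Theta(log n/log log n),
# while two choices give roughly log log n -- exponentially smaller.
print("one choice:", one, "| two choices:", two)
```

In the abstract's terms this corresponds to uniform weights (Δ = 1); the weighted protocol additionally has to account for WA and WM when bounding the load.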
ISBN:
(Print) 9780818679773
Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner-loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded applications developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
The proceedings contain 39 papers from the 8th annual ACM Symposium on Parallel Algorithms and Architectures. Topics discussed include: parallel random access memory; optical parallel processing; release consistency; scope consistency; entry consistency; cache coherence protocols; thread management; weight factoring; rooted tree networks; butterfly networks; memory mapping; network routing tables; all-to-all personalized communication; sample sort; radix sort; blockwise sampling; minimum spanning forests; and virtual channels.
Two novel variations on sample sort, one using only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead and another using regular sampling for choosing splitters, were studied. The two were coded in Split-C and were run on a variety of platforms. Results were consistent with theoretical analysis and illustrated the scalability and efficiency of the algorithms.
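The regular-sampling variant can be sketched sequentially: each of p "processors" sorts its block, contributes p evenly spaced samples, and p − 1 global splitters are drawn at regular positions from the sorted sample; elements are then routed to buckets by splitter. This is a hedged single-process illustration under assumed names, not the paper's Split-C implementation, which would exchange buckets via all-to-all personalized communication.

```python
import random

def sample_sort(data, p):
    """Sequential sketch of sample sort with regular sampling."""
    n = len(data)
    # Each "processor" sorts its contiguous block locally.
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Regular sampling: up to p evenly spaced samples from each block.
    sample = sorted(s for b in blocks
                    for s in b[::max(1, len(b) // p)][:p])
    splitters = sample[p::p][:p - 1]  # p-1 global splitters
    # Route every element to the bucket determined by the splitters.
    buckets = [[] for _ in range(p)]
    for b in blocks:
        for x in b:
            buckets[sum(x > s for s in splitters)].append(x)
    # Concatenating the sorted buckets yields the globally sorted list.
    return [x for bucket in buckets for x in sorted(bucket)]

rng = random.Random(1)
data = [rng.randrange(1000) for _ in range(200)]
assert sample_sort(data, 4) == sorted(data)
```

Correctness does not depend on how well the splitters are chosen; their quality only affects load balance, i.e., how evenly the p buckets are filled, which is precisely what the regular-sampling scheme is designed to keep even with low overhead.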