Since their introduction, the four-point explicit group (EG) and explicit decoupled group (EDG) methods in solving elliptic PDE's have been implemented on various parallel computing architectures such as shared me...
详细信息
Since their introduction, the four-point explicit group (EG) and explicit decoupled group (EDG) methods in solving elliptic PDE's have been implemented on various parallel computing architectures such as shared memory parallel computer and distributed computer systems. However, no detailed study on the performance analysis of these algorithms was done in any of these implementations. In this paper we developed performance models for these explicit group methods and present detailed study of their hypothetical implementation on two distributed memory multicomputers with different computation speed and communication bandwidth. Detailed performance analysis based on these models predicted different theoretical performance if the methods were implemented on the clusters. This was confirmed by the experimental results performed on the two distinct clusters. Theoretical analysis and experimental results indicated that both explicit group methods are scalable with respect to number of processors and the problem size. (C) 2007 Published by Elsevier B.V.
We have studied various implementations of iterative polynomial root finding methods on a distributed memory multicomputer. These methods are based on the construction of a sequence of approximations that converge to ...
详细信息
We have studied various implementations of iterative polynomial root finding methods on a distributed memory multicomputer. These methods are based on the construction of a sequence of approximations that converge to the set of zeros. The synchronous version consists in sharing the computation of the next iterate among the processors and updating their data through a total exchange of their results. In order to decrease the communication cost, we introduce asynchronous versions. The computation of the next iterate is still shared among the processor, but the updating is done by using only nearest neighbor communications. We prove that under weak conditions, these asynchronous versions are still locally convergent, even if their convergence orders are reduced. We analyze the behavior of the asynchronous methods in function of their delay, the topology of the interconnection network, and the elementary computation and communication times. We have implemented and compared these methods on a hypercube multicomputer.
3-D data visualization is very useful for medical imaging and computational fluid dynamics. Volume rendering can be used to exhibit the shape and volumetric properties of 3-D objects. However, volume rendering require...
详细信息
3-D data visualization is very useful for medical imaging and computational fluid dynamics. Volume rendering can be used to exhibit the shape and volumetric properties of 3-D objects. However, volume rendering requires a considerable amount of time to process the large volume of data. To deliver the necessary rendering rates, parallel hardware architectures such as distributed memory multicomputers offer viable solutions. The challenge is to design efficient parallel algorithms that utilize the hardware parallelism effectively. In this paper, we present two efficient parallel volume rendering algorithms, the 1D-partition and 2D-partition methods, based on the shear-warp factorization for distributed memory multicomputers. The 1D-partition method has a performance bound on the size of the volume data. If the number of processors is less than a threshold, the 1D-partition method can deliver a good rendering rate. If the number of processors is over a threshold, the 2D-partition method can be used. To evaluate the performance of these two algorithms, we implemented the proposed methods along with the slice data partitioning, volume data partitioning, and sheared volume data partitioning methods on an IBM SP2 parallel machine. Six volume data sets were used as the test samples. The experimental results show that the proposed methods outperform other compatible algorithms for all test samples. When the number of processors is over a threshold, the experimental results also demonstrate that the 2D-partition method is better than the 1D-partition method.
We present a new scalable algorithm for short-range molecular dynamics simulations on distributedmemory MIMD multicomputers based on a message-passing multi-cell approach. We have implemented the algorithm on the Con...
详细信息
We present a new scalable algorithm for short-range molecular dynamics simulations on distributedmemory MIMD multicomputers based on a message-passing multi-cell approach. We have implemented the algorithm on the Connection Machine 5 (CM-5) and demonstrate that meso-scale molecular dynamics with more than 10(8) particles is now possible on massively parallel MIMD computers. Typical runs show single particle update-times of 0.15 mus in 2 dimensions (2D) and approximately 1 mus in 3 dimensions (3D) on a 1024 node CM-5 without vector units, corresponding to more than 1.8 Gflops overall performance. We also present a scaling equation which agrees well with actually observed timings.
With the objective of minimizing the total execution time of a parallel program on a distributedmemory parallel computer, this paper discusses how to find an optimal supernode size and optimal supernode relative side...
详细信息
With the objective of minimizing the total execution time of a parallel program on a distributedmemory parallel computer, this paper discusses how to find an optimal supernode size and optimal supernode relative side lengths of a supernode transformation (also known as tiling). We identify three parameters of supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For algorithms with perfectly nested loops and uniform dependencies, for sufficiently large supernodes and number of processors, and for the case where multiple supernodes are mapped to a single processor, we give an order n polynomial whose real positive roots include the optimal supernode size. For two special cases, 1) two-dimensional algorithm problems and 2) n-dimensional algorithm problems, where the communication cost is dominated by the startup penalty and, therefore, can be approximated by a constant, we give a closed form expression for the optimal supernode size, which is independent of the supernode relative side lengths and cutting hyperplanes. For the case where the algorithm iteration index space and the supernodes are hyperrectangular, we give closed form expressions for the optimal supernode relative side lengths. Our experiment shows a good match of the closed form expressions with experimental data.
This paper presents a simple and effective method for the concurrent manipulation of linearly ordered data structures on hypercube systems. The method is based on the existence of an augmented binomial search tree, ca...
详细信息
This paper presents a simple and effective method for the concurrent manipulation of linearly ordered data structures on hypercube systems. The method is based on the existence of an augmented binomial search tree, called the pruned binomial tree, rooted at any arbitrary processor node of the hypercube such that 1) every edge of the tree corresponds to a direct link between a pair of hypercube nodes, and 2) the tree spans any arbitrary sequence of n consecutive nodes containing the root, using a fan-out of at most inverted right perpendicular log2 n inverted left perpendicular and a depth of at most inverted right perpendicular log2 n inverted left perpendicular + 1. Search trees spanning nonoverlapping processor lists are formed using only local information, and can be used concurrently without contention problems. Thus, they can be used for performing operations such as broadcast and merge simultaneously on sets with nonuniform sizes. Extensions of the tree to k-ary n-cubes and faulty hypercubes are presented. Applications of this concurrent data structure to low- and intermediate-level image processing algorithms, and for dictionary operations involving multiple keys, are also outlined.
This paper formalizes a general technique to combine different methods in the solution of large systems of nonlinear equations using parallel asynchronous implementations on distributed-memory multiprocessor systems. ...
详细信息
This paper formalizes a general technique to combine different methods in the solution of large systems of nonlinear equations using parallel asynchronous implementations on distributed-memory multiprocessor systems. Such combinations of methods, referred to as Team Algorithms, are evaluated as a way of obtaining desirable properties of different methods and a sufficient condition for their convergence is derived. The load flow problem of electrical power networks is presented as an example problem that, under certain conditions, has the characteristics io make a Tearri Algorithm an appealing choice for its solution. Experimental results of an implementation on an Intel iPSC/860 Hypercube are reported, showing that considerable speedup and robustness can be obtained using team algorithms.
With the objective of minimizing the total execution time of a parallel program on a distributedmemory parallel computer, this paper discusses the selection of an optimal supernode shape of a supernode transformation...
详细信息
With the objective of minimizing the total execution time of a parallel program on a distributedmemory parallel computer, this paper discusses the selection of an optimal supernode shape of a supernode transformation (also known as tiling). We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For supernode transformations on algorithms with perfectly nested loops and uniform dependencies, we prove the optimality of a constant linear schedule vector and give a necessary and sufficient condition for optimal relative side lengths. We also prove that the total running time is minimized by a cutting hyperplane direction matrix from a particular subset of all valid directions and we discuss the cases where this subset is unique. The results are derived in continuous space and should be considered approximate. Our model does not include cache effects and assumes an unbounded number of available processors, the communication cost approximated by a constant, uniform dependences, and loop bounds known at compile time. A comprehensive example is discussed with an application of the results to the Jacobi algorithm.
暂无评论