parallel computing is a useful technology for scientific and engineering algorithms/applications. LU-SGS (lower-upper Symmetric-Gauss-Seidel method) is an efficient and robust scheme for CFD (Computational fluid dynam...
详细信息
parallel computing is a useful technology for scientific and engineering algorithms/applications. LU-SGS (lower-upper Symmetric-Gauss-Seidel method) is an efficient and robust scheme for CFD (Computational fluid dynamics) and has strong data dependence in its computation. In this paper, we present an efficient wavefront parallel algorithm for 3D (three dimensional) LU-SGS with structured meshes. The corresponding data structure and memory access method with better data locality and communication optimization is designed. The performances of the presented parallel algorithm are reported with different problem sizes. Some discussion and performance issues are also reported. The results show that the overall performance speedup of one Intel E5540 CPU (4 CPU cores) ranges from 2.23 to 2.95 compared with one E5540 core. The parallel efficiency of 1024, 128 processes are up to 35.68%, 72.69% compared with 32 processes on a distributed memory cluster system. The CFD simulation of M6 wing model shows the effect of the presented parallel algorithm. (C) 2016 Elsevier Ltd. All rights reserved.
We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (ERE...
详细信息
We present a randomized algorithm to find a minimum spanning forest (MSF) in an undirected graph. With high probability, the algorithm runs in logarithmic time and linear work on an exclusive read exclusive write (EREW) PRAM. This result is optimal w.r.t. both work and parallel time, and is the first provably optimal parallel algorithm for this problem under both measures. We also give a simple, general processor allocation scheme for tree-like computations.
We present an O((log log N)(2)) -time algorithm for computing the distance transform of an N x N binary image. Our algorithm is designed for the common concurrent read concurrent write parallel random access machine (...
详细信息
We present an O((log log N)(2)) -time algorithm for computing the distance transform of an N x N binary image. Our algorithm is designed for the common concurrent read concurrent write parallel random access machine (CRCW PRAM) and requires O(N2+epsilon / log log N) processors, for any epsilon such that 0 < E < 1. Our algorithm is based on a novel deterministic sampling scheme and can be used for computing distance transforms for a very general class of distance functions. We also present a scalable version of our algorithm when the number of processors is available p(2+epsilon) / log log p for some p < N. In this case, our algorithm runs in O((N-2/p(2)) + (N/p) log log p + (log log p)(2)) time. This scalable algorithm is more practical since usually the number of available processors is much less than the size of the image.
In this paper, a novel parallel algorithm is proposed to solve the problems of heavy computation and long simulation time in the field of compressible flows. In this algorithm, a third-order upwind scheme and a fourth...
详细信息
In this paper, a novel parallel algorithm is proposed to solve the problems of heavy computation and long simulation time in the field of compressible flows. In this algorithm, a third-order upwind scheme and a fourth-order central difference scheme are employed, with a third-order Runge-Kutta method for time stepping. Considering the powerful floating-point computing ability of the Graphics Processing Unit (GPU), this paper establishes the algorithm on the basis of GPU. Moreover, the direct numerical simulation method is adopted in this algorithm to improve the solution accuracy of the simulation results. To further enhance the efficiency of the algorithm, several optimization strategies are explored in the design of the algorithm as well. Both accuracy and feasibility of the algorithm are verified by a classical two-dimensional example. Compared with solving this example on the Central Processing Unit platform, the experimental results demonstrate that the maximum speedup ratio achieved by our approach is 18.03 times.
Basing on overlapping domain decomposition, we construct a new parallel algorithm combined the method of subspace correction with least-squares procedure for solving time-dependent convection-diffusion problem. This a...
详细信息
Basing on overlapping domain decomposition, we construct a new parallel algorithm combined the method of subspace correction with least-squares procedure for solving time-dependent convection-diffusion problem. This algorithm is fully parallel. We analyze the convergence of approximate solution, and study the dependence of the convergent rate on the spacial mesh size, time increment, iteration number and sub-domains overlapping degree. Both theoretical analysis and numerical results suggest that only one or two iterations are needed to reach to given accuracy at each time step.
The work presented in this paper focuses on parallel iterative algorithm for solving band linear systems. With suitable decomposition of coefficient matrix and using the form of the iterative formula of the BSOR seria...
详细信息
The work presented in this paper focuses on parallel iterative algorithm for solving band linear systems. With suitable decomposition of coefficient matrix and using the form of the iterative formula of the BSOR serial method, a parallel algorithm on distributed-memory multi-computer is established. Then Eq. (7) and (8) are provided for carrying out computation required by our parallel algorithm. Furthermore, convergence is proved when the coefficient matrix A is Hermite positive definite matrix or M-matrix, and the sufficient condition is given. According to theoretical analysis, this algorithm has the same convergent velocity as BSOR method, and has the same parallelism as BJ method. In the end, two illustrative examples implemented on HP rx2600 cluster show that our algorithm's parallel acceleration rates and efficiency are higher. (c) 2006 Elsevier Inc. All rights reserved.
The P-4-tidy graphs were introduced by I. Rusu to generalize some already known classes of graphs with few induced P-4 (cographs, P-4-sparse graphs, P-4-lite graphs). Here, we propose an extension of R. Lin and S. Ola...
详细信息
The P-4-tidy graphs were introduced by I. Rusu to generalize some already known classes of graphs with few induced P-4 (cographs, P-4-sparse graphs, P-4-lite graphs). Here, we propose an extension of R. Lin and S. Olariu's work (1994. J. parallel Distributed Computing 22, 26-36.) on cographs, using the modular decomposition. As an application, we show how to obtain a maximum matching parallel algorithm for the family of P-4-tidy graphs (represented by a parse tree) in O(log n) time with O(n/log n) processors in the EREW-PRAM model with n-vertex graphs. (C) 1998 Academic Press, Inc.
The algebraic problem $B\gamma (E) + A\beta (E) = \eta $, which may evolve from elliptic or parabolic partial differential equations, is studied. Conditions are given so that a parallel algorithm of a nonlinear Gauss...
详细信息
The algebraic problem $B\gamma (E) + A\beta (E) = \eta $, which may evolve from elliptic or parabolic partial differential equations, is studied. Conditions are given so that a parallel algorithm of a nonlinear Gauss–Seidel type converges to the unique solution. An application is given to the solidification of a channel in which the continuum problem is discretized by the finite element method. These computations show a significant decrease in computing times as compared to serial algorithms.
Particle swarm optimization (PSO) algorithm is a population-based algorithm for finding the optimal solution. Because of its simplicity in implementation and fewer adjustable parameters compared to the other global op...
详细信息
Particle swarm optimization (PSO) algorithm is a population-based algorithm for finding the optimal solution. Because of its simplicity in implementation and fewer adjustable parameters compared to the other global optimization algorithms, PSO is gaining attention in solving complex and large scale problems. However, PSO often requires long execution time to solve those problems. This paper proposes a parallel PSO algorithm, called delayed exchange parallelization, which improves performance of PSO on distributed environment by hiding communication latency efficiently. By overlapping communication with computation, the proposed algorithm extracts parallelism inherent in PSO. The performance of our proposed parallel PSO algorithm was evaluated using several applications. The results of evaluation showed that the proposed parallel algorithm drastically improved the performance of PSO, especially in high-latency network environment. (C) 2010 Elsevier B.V. All rights reserved.
A planar monotone circuit (PMC) is a Boolean circuit that can be embedded in the plane and that contains only AND and OR gates. Goldschlager, Dymond, Cook, and others have developed NC2 algorithms to evaluate a specia...
详细信息
A planar monotone circuit (PMC) is a Boolean circuit that can be embedded in the plane and that contains only AND and OR gates. Goldschlager, Dymond, Cook, and others have developed NC2 algorithms to evaluate a special layered form of a PMC. These algorithms require a large number of processors (Omega(n(6)), where n is the size of the input circuit). Yang and, more recently, Delcher and Kosaraju have given NC algorithms for the general planar monotone circuit value problem. These algorithms use at least as many processors as the algorithms for the layered case. This paper gives an efficient parallel algorithm that evaluates a general PMC of size n in polylog time using only a linear number of processors on an exclusive read exclusive write parameter random-access machine (EREW PRAM). This parallel algorithm is the best possible to within a polylog factor and is a substantial improvement over the earlier algorithms for the problem. The algorithm uses several novel techniques to perform the evaluation, including the use of the dual of the plane embedding of the circuit to determine the propagation of values within the circuit.
暂无评论