This paper presents the development of a spatial decomposition parallel algorithm and its implementation into a concurrent atomistic-continuum (CAC) method simulator for multiscale modeling of dislocations in metallic...
详细信息
This paper presents the development of a spatial decomposition parallel algorithm and its implementation into a concurrent atomistic-continuum (CAC) method simulator for multiscale modeling of dislocations in metallic materials. The scalability and parallel efficiency of the parallelized CAC are tested using up to 512 processors. With a modest computational resource, a single crystalline f.c.c. sample containing 10: 6 billion atoms is modeled using only 4;809;108 finite elements in a CAC model at a fraction of the cost of full molecular dynamics (MD). The simulation demonstrates a nearly ideal scalability of the newly parallelized CAC simulator. The parallel efficiency of the newly parallelized CAC is shown to be higher than 90% when using 512 processors in the high performance computing cluster at Iowa State University. This parallel efficiency is comparable to the state-of-the-art atomistic simulator. Moreover, the newly parallelized CAC simulator employing a uniform coarse mesh is capable of capturing important atomistic features of dislocations, including dislocation nucleation, migration, stacking faults as well as the formation of Lomer-Cottrell locks, in a billion-atom system. The spatial decomposition-based parallelization algorithm developed in this work is general and can be transferable to many other existing concurrent multiscale simulation tools. Published by Elsevier B.V.
Merging is the process of constructing a sorted array C from two sorted arrays A and B of lengths nA and nB, respectively, such that array C contains the elements of arrays A and B. The problem is a fundamental subrou...
详细信息
Merging is the process of constructing a sorted array C from two sorted arrays A and B of lengths nA and nB, respectively, such that array C contains the elements of arrays A and B. The problem is a fundamental subroutine in many applications such as tree reconstruction, database design, and sorting. In this paper, we present a constant-time integer merging algorithm on a shared memory model with a concurrent read operation. No constant-time algorithm for integer merging on this model was designed previously. The algorithm, which is based on cross-rank and address strategy, works when the elements of the inputs belong to the integer domain. The presented algorithm has the following advantages:- (1) running in constant time;(2) stable;(3) simple;(4) optimal when the domain of integers is less than or equal to the size of the inputs and the number of processors is equal to size of the inputs;and (5) deterministic.
A simple parallel randomized algorithm to find a maximal matching is developed. It is randomized in that it uses randomly chosen numbers. Its characteristics and efficiency do not depend upon the assumption that the...
详细信息
A simple parallel randomized algorithm to find a maximal matching is developed. It is randomized in that it uses randomly chosen numbers. Its characteristics and efficiency do not depend upon the assumption that the input graph is random. A basic operation of the algorithm is a random choice operation (RCO) involving a nonzero boolean vector. The expected time complexity of the algorithm is proven, leading to a trivial implementation of the algorithm on an EREW-PRAM. It is then shown how to perform RCO in the expected time on a CRCW-PRAM, which yields the promised complexity. The expected time is independent of the input graph's structure; this improves the best known deterministic algorithm.
In this paper we present an O(1/alpha log n)-time parallel algorithm for computing the convex hull of n points in R(3). This algorithm uses O(n(1+alpha)) processors on a CREW PRAM, for any constant 0<alpha less tha...
详细信息
In this paper we present an O(1/alpha log n)-time parallel algorithm for computing the convex hull of n points in R(3). This algorithm uses O(n(1+alpha)) processors on a CREW PRAM, for any constant 0parallel algorithms proposed for this problem use time at least O(log(2) n). In addition, the algorithm presented here is the first parallel algorithm for the three-dimensional convex hull problem that is not based on the serial divide-and-conquer algorithm of Preparata and Hong, whose crucial operation is the merging of the convex hulls of two linearly separated point sets. The contributions of this paper are therefore (i) an O(log n)-time parallel algorithm for the three-dimensional convex hull problem, and (ii) a parallel algorithm for this problem that does not follow the traditional paradigm.
In order to solve poor fine searching capacity of artificial fish swarm algorithm and artificial bee colony swarm algorithm in late state to result in insufficient local optimization, hybrid swarm intelligent parallel...
详细信息
In order to solve poor fine searching capacity of artificial fish swarm algorithm and artificial bee colony swarm algorithm in late state to result in insufficient local optimization, hybrid swarm intelligent parallel algorithm research based on multi-core clusters is proposed;Then, reverse learning mechanism is introduced in early stage of algorithm, initialized swarms are evenly distributed, and swarms are randomly divided into two groups to make interactive learning strategy accelerates rate of convergence, and basic artificial fish swarm algorithm and artificial bee colony swarm algorithm are used to make global searching. In late stage of algorithm, niches artificial fish swarm algorithm and Random Perturbation Artificial Bee Colony are used to make local fine searching to the solution obtained in early stage;On this basis, MPI+OpenMP+STM parallel programming model based on multi-core clusters is established for parallel design and analysis. Finally, stimulation experiment indicates optimizing efficiency of this algorithm is higher than single artificial fish swarm algorithm and artificial bee colony swarm algorithm. (C) 2016 Elsevier B.V. All rights reserved.
It has been shown in [Nuclear Science and Engineering 93 (1986) 6799] that the finite difference discretization of Navier-Stoke's equation leads to the solution of N x N system written in the matrix form as My = B...
详细信息
It has been shown in [Nuclear Science and Engineering 93 (1986) 6799] that the finite difference discretization of Navier-Stoke's equation leads to the solution of N x N system written in the matrix form as My = B, where M is a quasi-tridiagonal having non-zero elements at the top right and bottom left corners. We present an efficient parallel algorithm on a p-processor hypercube implemented in two phases. In phase I a generalization of an algorithm due to Kowalik [High Speed Computation, Springer, New York] is developed which decomposes the above matrix system into smaller quasi-tridiagonal (p + 1) x (p + 1) subsystem, which is then solved in Phase II using an odd-even reduction method. (C) 2002 Elsevier Science Inc. All rights reserved.
High-performance computing for climate models has always been an interesting research area. It is valuable to nest a regional climate model within a global climate model, but large-scale simulation of the nesting or c...
详细信息
High-performance computing for climate models has always been an interesting research area. It is valuable to nest a regional climate model within a global climate model, but large-scale simulation of the nesting or coupling severely challenges to the development of efficient parallel algorithms that fit well into multi-core clusters. This paper first presents research on the coupling of the Institute of Atmospheric Physics of Chinese Academy of Sciences Atmospheric General Circulation Model version 4.0 and the Weather Research and Forecasting model, then proposes an efficient parallel algorithm of the coupling. The algorithm includes initialization of input data, decomposition of computing grid and processes, parallel computing of component models, and data exchange by a coupler. By calling some subroutines of the Model Coupling Toolkit, the parallelization of the proposed algorithm is implemented. Experiments show that the parallel algorithm is very effective and scalable. The parallel efficiency of the algorithm on 1,024 CPU cores can reach up to 70%. Moreover, its parallel efficiency with respect to weak scalability is 72.56% on a multi-core cluster.
The image will be contaminated by noise during the imaging process, which severely degrades the image quality. It is necessary to filter the collected image. With the increasing amount of image data, the traditional s...
详细信息
The image will be contaminated by noise during the imaging process, which severely degrades the image quality. It is necessary to filter the collected image. With the increasing amount of image data, the traditional single-processor or multiprocessor computing equipment has been unable to meet the requirements of real-time data processing. In this paper, the computational model of weighted mean filtering and the characteristics of high performance computer architecture are studied. An efficient hierarchical image weighted mean filtering parallel algorithm for Open Computing Language (OpenCL) is designed and implemented, which can fully express the parallelism of the computing model. The parallel algorithm takes full account of the characteristics of image discrete convolution computing and the multi-layer logic architecture of high performance computer, deeply excavates the parallelism of the computing platform and computing model, and realizes the efficient task mapping from computing model to computing resources. The model is implemented in parallel with the two levels of work-group and work-item. The experimental results show that compared with the serial algorithm based on CPU, the parallel algorithm based on Open Multi-Processing (OpenMP) and the parallel algorithm based on Compute Unified Device Architecture (CUDA), the parallel algorithm of weighted mean filtering achieves 20.88 times, 18.52 times and 1.26 times acceleration ratio on the NVIDIA GPU computing platform based on OpenCL architecture, respectively. It realizes better computing performance and runs on different Graphic Processing Unit (GPU) computing platforms, and has good portability and scalability.
A parallel overlapping preconditioner is applied to ICCG method and the effect of the parallel preconditioning on the convergence of the method is investigated by solving large scale block tridiagonal linear systems a...
详细信息
A parallel overlapping preconditioner is applied to ICCG method and the effect of the parallel preconditioning on the convergence of the method is investigated by solving large scale block tridiagonal linear systems arising from the discretization of Poisson's equation. Compared with the original ICCG method, the parallel preconditioned ICCG method can solve the problems in high parallelism with slight increasing the number of iterations. Furthermore, the speedup and the efficiency are evaluated for the parallel preconditioned ICCG method by substituting the experimental results into formulae of complexity. For example, when a domain of simulation is discretized on a 250 x 250 rectangular grid and the preconditioner is divided into 249 smaller ones, its speedup is 146.3 with the efficiency 0.59.
This paper deals with the convergence and stability of a new parallel algorithm and the error estimates for a particular case of the new parallel algorithm, which is used to solve the incompressible nonstationary Navi...
详细信息
This paper deals with the convergence and stability of a new parallel algorithm and the error estimates for a particular case of the new parallel algorithm, which is used to solve the incompressible nonstationary Navier-Stokes equations. The theoretical results show that the scheme is (at least) conditionally stable and convergent. (c) 2007 Elsevier Ltd. All rights reserved.
暂无评论