The development of smart grids and the increasing scale of power systems put considerable pressure on the electromagnetic transient simulation of a power system. The graphics processing unit (GPU), which features massive concurrent threads and excellent floating-point performance, brings new opportunities to the area of power system simulation. This study introduces a parallel lower-triangular and upper-triangular (LU) decomposition algorithm and a calculation strategy for electromagnetic transient simulation based on the GPU. In this scheme, the GPU performs the computationally intensive part of the simulation in parallel on its many built-in processing cores, while the CPU is assigned to updating history terms and controlling the flow of the simulation. Comparison with the results of CPU-only implementations verifies the validity and efficiency of the proposed method.
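The CPU/GPU division of labour described above can be illustrated with a toy sketch (not the paper's implementation): a single-node RC circuit discretized with the trapezoidal rule, where the "GPU" role reduces to solving the nodal equation each step and the "CPU" role updates the history term.

```python
# Illustrative sketch of the CPU/GPU split (assumed single-node RC example,
# not the paper's code). For large grids, the solve step below becomes the
# parallel triangular (LU) solves the abstract assigns to the GPU.
R, C, h = 1.0, 1e-3, 1e-4        # resistance, capacitance, time step
g_C = 2.0 * C / h                # trapezoidal companion conductance
G = 1.0 / R + g_C                # 1x1 nodal conductance "matrix"

v, i_hist = 0.0, 0.0
for _ in range(2000):
    rhs = 1.0 / R + i_hist       # CPU: assemble RHS (1 V source + history)
    v = rhs / G                  # GPU (conceptually): solve G * v = rhs
    i_hist = 2.0 * g_C * v - i_hist   # CPU: trapezoidal history update

# the capacitor charges toward the 1 V source, so v -> 1.0
```

At each step only the solve is data-parallel; the history update is cheap and sequential per branch, which is why the paper leaves it on the CPU.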
Presented is a parallel algorithm based on the fast multipole method (FMM) for the Helmholtz equation. This variant of the FMM is useful for computing radar cross sections and antenna radiation patterns. The FMM decomposes the impedance matrix into sparse components, reducing the operation count of the matrix-vector multiplication in iterative solvers to O(N^(3/2)) (where N is the number of unknowns). The parallel algorithm divides the problem into groups and assigns the computation involved with each group to a processor node. Careful consideration is given to the communication costs. A time-complexity analysis of the algorithm is presented and compared with empirical results from a Paragon XP/S running the lightweight Sandia/University of New Mexico operating system (SUNMOS). For a 90,000-unknown problem running on 60 nodes, the sparse representation fits in memory and the algorithm computes the matrix-vector product in 1.26 seconds, sustaining an aggregate rate of 1.4 Gflop/s. The corresponding dense matrix would occupy over 100 Gbytes and, assuming that I/O is free, would require on the order of 50 seconds to form the matrix-vector product.
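The headline figures can be sanity-checked with quick arithmetic (assuming 16-byte double-precision complex impedance-matrix entries, which the abstract does not state explicitly):

```python
# Back-of-envelope check of the abstract's figures. The 16-byte entry size
# (double-precision complex) is an assumption, not stated in the abstract.
N = 90_000
dense_gbytes = 16 * N * N / 1e9      # full matrix: ~129.6 GB ("over 100 Gbytes")

# FMM cuts the matvec operation count from O(N^2) to O(N^(3/2)),
# a factor of sqrt(N):
savings = N**2 / N**1.5              # = sqrt(90,000) = 300x fewer operations
```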
This paper presents the results of a study conducted to evaluate the inherent memory reference behavior of several engineering/scientific applications executing on shared-memory, MIN-based parallel systems. In this study, system sizes of two to 64 processors were evaluated. A trace-driven simulation model was used to obtain dynamic reference characteristics of the code, which included explicit declarations of shared variables. Our results indicate that a significant amount of explicitly declared shared data is accessed either read-only by several processors or read-write by a single processor. Furthermore, lines containing synchronization variables tend to see small ownership times at a processor and are accessed by several processors in the system. We also note that, as expected, relatively more references are to data with smaller ownership times as the number of processors increases. Finally, the application data set size can have an impact on ownership time as the number of processors increases.
Computationally efficient serial and parallel algorithms for estimating the general linear model are proposed. The sequential block-recursive algorithm is an adaptation of a known Givens strategy that has as a main component the generalized QR decomposition. The proposed algorithm is based on orthogonal transformations and exploits the triangular structure of the Cholesky QRD factor of the variance-covariance matrix. Specifically, it computes the estimator of the general linear model by recursively solving a series of smaller and smaller generalized linear least squares problems. The new algorithm is found to outperform significantly the corresponding LAPACK routine. A parallel version of the new sequential algorithm, which utilizes an efficient distribution of the matrices over the processors and has low inter-processor communication, is developed. The theoretical computational complexity of the parallel algorithms is derived and analyzed. Experimental results are presented which confirm the theoretical analysis. The parallel strategy is found to be scalable and highly efficient for solving large-scale general linear estimation problems. (c) 2005 Elsevier B.V. All rights reserved.
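For context, the underlying generalized least-squares problem can be sketched with the textbook whitening-plus-QR approach (illustrative only; the paper's contribution is a faster block-recursive and parallel algorithm for this computation, and all data below are synthetic):

```python
import numpy as np

# Model: y = X @ beta + e, with cov(e) = Omega (symmetric positive definite).
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
L = np.eye(n) + np.tril(0.1 * rng.standard_normal((n, n)), k=-1)
Omega = L @ L.T                          # variance-covariance matrix
y = X @ beta_true + L @ rng.standard_normal(n)   # correlated noise

# Whiten with the Cholesky factor Cf (Omega = Cf @ Cf.T), then solve the
# resulting ordinary least-squares problem by QR factorization.
Cf = np.linalg.cholesky(Omega)
Xw = np.linalg.solve(Cf, X)              # triangular system Cf @ Xw = X
yw = np.linalg.solve(Cf, y)
Q, Rq = np.linalg.qr(Xw)
beta_hat = np.linalg.solve(Rq, Q.T @ yw)
```

The paper avoids forming this problem monolithically, instead solving a recursion of progressively smaller generalized linear least squares problems.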
We present a parallel algorithm for finding the convex hull of a sorted set of points in the plane. Our algorithm runs in O(log n / log log n) time using O(n log log n / log n) processors in the Common CRCW PRAM computational model, which is shown to be time- and cost-optimal. The algorithm is based on n^(1/3) divide-and-conquer and uses a simple pointer-based data structure.
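The sequential baseline for this setting is simple: once the points are sorted by x-coordinate, Andrew's monotone chain computes the hull in linear time. A sketch (illustrative only; the paper's algorithm is an n^(1/3) divide-and-conquer parallelization, not this code):

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_of_sorted(pts):
    """Convex hull (counter-clockwise) of points pre-sorted by x, then y."""
    lower, upper = [], []
    for p in pts:                         # build lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):               # build upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]        # chain endpoints appear once each

# Interior points such as (3, 0.2) are discarded:
# hull_of_sorted([(0, 0), (1, 1), (2, -1), (3, 0.2), (4, 0)])
#   -> [(0, 0), (2, -1), (4, 0), (1, 1)]
```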
We present scalable algorithms to simulate large-scale stochastic particle systems amenable for modeling dense colloidal suspensions, glasses and gels. To handle the large number of particles and consequent many-body interactions present in such systems, we leverage an Accelerated Stokesian Dynamics (ASD) approach, for which we developed parallel algorithms for a distributed-memory architecture. We present the parallelization of the sparse near-field (including singular lubrication) interactions and of the matrix-free many-body far-field interactions, along with a strategy for communicating and mapping the distributed data structures between the near and far field. Scaling up to tens of thousands of processors for a million particles is demonstrated. In addition, we propose a novel algorithm to efficiently simulate correlated Brownian motion with hydrodynamic interactions. The original Accelerated Stokesian Dynamics approach requires the separate computation of far-field and near-field Brownian forces. Recent advancements propose computation of a far-field velocity using positive spectral Ewald decomposition. We present an alternative approach for calculating the far-field Brownian velocity by implementing the fluctuating force coupling method and embedding it using a nested scheme into ASD. This straightforward and flexible approach reduces the computational time of the Brownian far-field force construction from O((N log N)^(1+|α|)) to O(N log N). (C) 2021 Elsevier Inc. All rights reserved.
Metaheuristics, which provide high-level guidelines for heuristic optimisation, have successfully been applied to many complex problems over the past decades. However, their performance often varies depending on the choice of the initial settings for their parameters and operators, along with the characteristics of the given problem instance. Hence, there is growing interest in designing adaptive search methods that automate the selection of efficient operators and the setting of their parameters during the search process. In this study, an adaptive binary parallel evolutionary algorithm, referred to as ABPEA, is introduced for solving the uncapacitated facility location problem, which is proven to be an NP-hard optimisation problem. The approach uses one unary operator and two binary operators. A reinforcement learning mechanism assigns credits to operators based on their recent impact on generating improved solutions to the problem instance in hand. An operator is selected adaptively with a greedy policy for perturbing a solution. The performance of the proposed approach is evaluated on a set of well-known benchmark instances from ORLib and M*, and its scaling capacity is assessed by running it with different starting points on an increasing number of threads. Parameters are adjusted to derive the best configuration of three different rewarding schemes: instant, average and extreme. A performance comparison with other state-of-the-art algorithms illustrates the superiority of ABPEA. Moreover, ABPEA provides an acceleration of up to a factor of 3.9 when compared to the sequential algorithm based on a single operator.
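The core adaptive mechanism, credit assignment combined with a greedy operator-selection policy, can be sketched generically. The operators, rewarding rule, and toy objective below are illustrative stand-ins for the idea, not ABPEA's actual components:

```python
import random

def flip_one(s):                      # unary operator: flip one random bit
    s = s[:]; s[random.randrange(len(s))] ^= 1; return s

def flip_two(s):                      # flip two random bits
    return flip_one(flip_one(s))

def flip_block(s):                    # flip a short block of bits
    s = s[:]; i = random.randrange(len(s))
    for j in range(i, min(i + 3, len(s))):
        s[j] ^= 1
    return s

def cost(s):                          # toy objective: minimize the ones
    return sum(s)

random.seed(42)
operators = [flip_one, flip_two, flip_block]
credit = [1.0] * len(operators)       # one credit score per operator
sol = [1] * 20
for _ in range(500):
    if random.random() < 0.1:                     # occasional exploration
        k = random.randrange(len(operators))
    else:                                         # greedy on current credit
        k = max(range(len(operators)), key=lambda i: credit[i])
    cand = operators[k](sol)
    gain = cost(sol) - cost(cand)
    credit[k] = 0.5 * credit[k] + 0.5 * max(gain, 0.0)  # "instant"-style reward
    if gain >= 0:                                 # accept non-worsening moves
        sol = cand
```

Operators that recently produced improvements accumulate credit and are chosen more often; decayed credit lets the selection adapt as the search progresses.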
In an earlier paper, an approximate SVD updating scheme was derived as an interlacing of a QR updating on the one hand and a Jacobi-type SVD procedure on the other hand, possibly supplemented with a certain re-orthogonalization scheme. This paper maps this updating algorithm onto a systolic array with O(n^2) parallelism for O(n^2) complexity, resulting in O(n^0) throughput. Furthermore, it is shown how a square-root-free implementation is obtained by combining modified Givens rotations with approximate SVD schemes.
The theme of this paper is that the primary computational bottleneck in the solution of stiff ordinary differential equations (ODEs) and the parallel solution of nonstiff ODEs is the implicitness of the ODE rather than the approximation of the integration process (in conventional terminology, numerical stability rather than accuracy). It may therefore be fruitful to apply, at least conceptually, the iterative techniques needed to overcome implicitness in continuous time, before discretization, that is, to waveforms rather than to values at a point in time. Several classical iterations, based on splitting, are discussed, but the emphasis is on those not based on a partitioning of the ODE system. The shifted Picard iteration is proposed as a compromise between the cheap but slow Picard iteration and the fast but expensive Newton iteration. By varying the shift parameter from one iteration to the next, a good rate of convergence seems possible. As an alternative, the author also examines the more classical acceleration technique applied to the Picard iteration. Some experimental results are given. However, the practical aspects of discretization are beyond the scope of this paper.
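A minimal sketch of the Picard iteration on waveforms (the cheap-but-slow baseline the paper starts from), applied to the scalar example y' = y on a fixed grid with trapezoidal quadrature; the grid and sweep count are illustrative choices:

```python
import math

# Picard iteration on waveforms for y' = f(t, y), y(0) = y0:
#     y_{k+1}(t) = y0 + integral_0^t f(s, y_k(s)) ds
# Each sweep updates the entire waveform at once, which is what makes the
# iteration attractive for parallelism across time.
def picard_waveform(f, y0, ts, sweeps):
    ys = [y0] * len(ts)                      # initial guess: flat waveform
    for _ in range(sweeps):
        fs = [f(t, y) for t, y in zip(ts, ys)]
        new = [y0]
        for i in range(1, len(ts)):          # cumulative trapezoidal rule
            h = ts[i] - ts[i - 1]
            new.append(new[-1] + 0.5 * h * (fs[i - 1] + fs[i]))
        ys = new
    return ys

n = 200
ts = [i / n for i in range(n + 1)]           # time grid on [0, 1]
ys = picard_waveform(lambda t, y: y, 1.0, ts, sweeps=25)
# ys[-1] approximates e = 2.71828... (quadrature error of order h^2)
```

Each sweep is nonstiff and explicit; the slow convergence of exactly this iteration is what motivates the paper's shifted and accelerated variants.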