3-D magnetotelluric (MT) forward modeling has always been faced with the problems of high memory requirements and long computing times. In this article, we design a scalable parallel algorithm for 3-D MT finite element modeling in anisotropic media. The parallel algorithm is based on distributed mesh storage, supports multiple levels of parallel granularity, and is implemented with multiple tools. The Message Passing Interface (MPI) is used to exploit process-level parallelism across subdomains, frequencies, and equation solving. Thread-level parallelism for merge sorting, element analysis, matrix assembly, and imposing Dirichlet boundary conditions is developed with Open Multi-Processing (OpenMP). We validate the algorithm through several model simulations and study the effects of topography and conductivity anisotropy on apparent resistivities and phase responses. Scalability tests are performed on the Tianhe-2 supercomputer to analyze the parallel performance of the different parallel granularities. Three parallel direct solvers, Supernodal LU (SUPERLU), the MUltifrontal Massively parallel sparse direct Solver (MUMPS), and the parallel Sparse matriX package (PASTIX), are compared for solving the sparse systems of equations. As a result, reasonable parallel parameters are suggested for practical applications. The developed parallel algorithm is shown to be efficient and scalable.
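The abstract describes the thread-level parallelism only at a high level; as a minimal sketch of the OpenMP-based element analysis and matrix assembly step it mentions (not the authors' 3-D MT code), the following toy example assembles a 1-D finite element stiffness matrix in parallel, with atomic updates guarding shared global entries.

```cpp
// Minimal sketch: OpenMP-parallel element analysis and assembly for a
// toy 1-D Laplace problem (NOT the authors' 3-D MT formulation).
#include <cstdio>
#include <vector>

int main() {
    const int num_elements = 1000;          // toy mesh: a 1-D line of elements
    const int num_nodes = num_elements + 1;
    const double h = 1.0 / num_elements;    // uniform element length

    // Dense global matrix only for illustration; real codes use sparse storage.
    std::vector<double> K(static_cast<size_t>(num_nodes) * num_nodes, 0.0);

    // Element analysis + assembly: each thread processes a block of elements.
    #pragma omp parallel for
    for (int e = 0; e < num_elements; ++e) {
        // Local 2x2 stiffness matrix of a linear 1-D element.
        const double ke[2][2] = { {  1.0 / h, -1.0 / h },
                                  { -1.0 / h,  1.0 / h } };
        const int nodes[2] = { e, e + 1 };

        // Scatter into the global matrix; atomics guard shared entries.
        for (int a = 0; a < 2; ++a)
            for (int b = 0; b < 2; ++b) {
                #pragma omp atomic
                K[static_cast<size_t>(nodes[a]) * num_nodes + nodes[b]] += ke[a][b];
            }
    }

    std::printf("K[0][0] = %g (expected %g)\n", K[0], 1.0 / h);
    return 0;
}
```

In a production code the global matrix would use a distributed sparse format, and the per-frequency and per-subdomain loops would be mapped to MPI processes, as the abstract describes.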
A series of Vlasov-type high power microwave launchers were investigated with several slant-cut angles. Finite element analysis using parallel computation was performed on a cluster of workstations and compared with low-power measurements made on a variety of such antennas. Good agreement between the main features of the radiation patterns was observed. However, not all details were reproduced.
Microprocessor clock rates, which for three decades doubled about every 18 months, have essentially stopped increasing. Instead, the number of processor cores (identical processing units capable of all usual microprocessor functions) in a microprocessor is increasing exponentially with time. In order to increase performance as the number of cores increases, measurement analysis software will have to take advantage of this parallelism. The objectives of this paper are to study one example of a measurement analysis with serial dependencies among the input data and to show that there is a practical parallel algorithm despite the data dependencies within the measured time series. The measurement analysis studied is transition localization in digital signals. A parallel scan-type algorithm is presented. The results of applying the parallel algorithm to both synthetic data and actual measured data are presented, and the speedup obtained on a twenty-four-core computer is analyzed. The parallel method produces exactly the same measurement results, bit for bit, as the original serial method. It is argued that what is desired for this and many other measurement processing algorithms is scalability in throughput with the number of cores. Such scalability is achieved by the proposed algorithm, with throughput scaling up to about a dozen cores.
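The abstract does not spell out the scan-type algorithm; the sketch below only illustrates the general pattern such algorithms follow for a stateful detector (a hysteresis comparator is assumed here): each chunk is processed in parallel for every possible incoming state, and a cheap sequential pass then selects the correct variant per chunk, reproducing the serial result bit for bit.

```cpp
// Sketch of a scan-style parallelization of transition localization in a
// digital signal with hysteresis (a serial dependency on the logic state).
// Each chunk is processed for BOTH possible incoming states; a short
// sequential pass over chunks then selects the correct variant.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct ChunkResult {
    int final_state;                 // state at end of chunk for the assumed incoming state
    std::vector<int> transitions;    // sample indices where the state flips
};

// Process samples [begin, end) assuming the signal enters in `state` (0 or 1).
static ChunkResult process_chunk(const std::vector<double>& x, int begin, int end,
                                 int state, double lo, double hi) {
    ChunkResult r;
    r.final_state = state;
    for (int i = begin; i < end; ++i) {
        if (r.final_state == 0 && x[i] > hi) { r.final_state = 1; r.transitions.push_back(i); }
        else if (r.final_state == 1 && x[i] < lo) { r.final_state = 0; r.transitions.push_back(i); }
    }
    return r;
}

int main() {
    // Synthetic noisy square wave.
    const int n = 1 << 20;
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i)
        x[i] = ((i / 1000) % 2 ? 1.0 : 0.0) + 0.05 * std::sin(0.37 * i);

    const double lo = 0.3, hi = 0.7;   // hysteresis thresholds
    const int chunks = 64;
    const int chunk_len = (n + chunks - 1) / chunks;

    // Phase 1 (parallel): each chunk computed for both possible incoming states.
    std::vector<ChunkResult> variant[2];
    variant[0].resize(chunks);
    variant[1].resize(chunks);
    #pragma omp parallel for
    for (int c = 0; c < chunks; ++c) {
        const int b = c * chunk_len, e = std::min(n, b + chunk_len);
        variant[0][c] = process_chunk(x, b, e, 0, lo, hi);
        variant[1][c] = process_chunk(x, b, e, 1, lo, hi);
    }

    // Phase 2 (sequential, O(chunks)): resolve the true incoming state per chunk.
    std::vector<int> transitions;
    int state = 0;                     // assume the signal starts LOW
    for (int c = 0; c < chunks; ++c) {
        const ChunkResult& r = variant[state][c];
        transitions.insert(transitions.end(), r.transitions.begin(), r.transitions.end());
        state = r.final_state;
    }

    std::printf("found %zu transitions\n", transitions.size());
    return 0;
}
```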
This study proposes a novel factorization method for the DCT IV algorithm that allows it to be broken into four or eight sections that can be run in parallel. Moreover, the arithmetic complexity has been significantly reduced. Based on the proposed new algorithm for DCT IV, the speed performance has been improved substantially. The performance of this algorithm was verified on two different GPU systems produced by NVIDIA. The experimental results show that the novel proposed DCT algorithm achieves an impressive reduction in the total processing time. The proposed method is very efficient, improving the algorithm speed by more than the four-fold gain expected from splitting the DCT algorithm into four sections running in parallel: compared with the classical, sequential implementation of DCT IV, the speedups are at least 5.41-times on Jetson AGX Xavier and 10.11-times on Jetson Orin Nano. Using a parallel formulation with eight sections running in parallel, the improvement in speed performance is even higher, at least 8.08-times on Jetson AGX Xavier and 11.81-times on Jetson Orin Nano.
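The abstract does not reproduce the new factorization; purely as a baseline illustration of splitting DCT IV into independent sections, the sketch below evaluates the textbook O(N^2) DCT IV definition with its output range divided into parallel sections (the paper's fast factorized algorithm is not shown).

```cpp
// Baseline illustration only: the textbook O(N^2) DCT-IV with its output
// range split into independent sections that can run in parallel.
#include <cmath>
#include <cstdio>
#include <vector>

// X[k] = sum_{n=0}^{N-1} x[n] * cos( (pi/N) * (n + 0.5) * (k + 0.5) )
std::vector<double> dct_iv(const std::vector<double>& x, int sections) {
    const int N = static_cast<int>(x.size());
    std::vector<double> X(N, 0.0);
    const double pi = 3.14159265358979323846;

    // Each section owns a contiguous block of output indices, so the
    // sections are fully independent and can run on separate cores or GPUs.
    #pragma omp parallel for num_threads(sections)
    for (int s = 0; s < sections; ++s) {
        const int begin = s * N / sections;
        const int end = (s + 1) * N / sections;
        for (int k = begin; k < end; ++k) {
            double acc = 0.0;
            for (int n = 0; n < N; ++n)
                acc += x[n] * std::cos(pi / N * (n + 0.5) * (k + 0.5));
            X[k] = acc;
        }
    }
    return X;
}

int main() {
    std::vector<double> x(256);
    for (size_t n = 0; n < x.size(); ++n) x[n] = std::sin(0.1 * n);
    std::vector<double> X = dct_iv(x, 4);   // four parallel sections
    std::printf("X[0] = %f\n", X[0]);
    return 0;
}
```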
A nodal-based finite element formulation coupled with absorbing boundary conditions has been developed to solve open boundary microwave problems. Only parallel computation makes it possible to model large devices. We show in this paper how the code has been implemented on a parallel shared memory computer. Each step of the code is analyzed. Two types of storage for the matrix and two preconditioning methods for the conjugate gradient algorithm are compared in particular.
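The abstract names neither the two storage schemes nor the two preconditioners; as one common pairing of the kind being compared, the sketch below couples compressed sparse row (CSR) storage with a Jacobi-preconditioned conjugate gradient solver, with the matrix-vector product parallelized across rows.

```cpp
// Illustrative sketch (not the paper's code): CSR storage paired with a
// Jacobi-preconditioned conjugate gradient solver for a symmetric positive
// definite system, with the sparse matrix-vector product parallelized.
#include <cmath>
#include <cstdio>
#include <vector>

struct Csr {                       // compressed sparse row storage
    int n;
    std::vector<int> row_ptr, col;
    std::vector<double> val;
};

// y = A * x, rows processed in parallel.
static void spmv(const Csr& A, const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for
    for (int i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}

// Jacobi-preconditioned CG.
static void pcg(const Csr& A, const std::vector<double>& b, std::vector<double>& x,
                int max_it = 1000, double tol = 1e-10) {
    const int n = A.n;
    std::vector<double> r(n), z(n), p(n), Ap(n), diag(n, 1.0);
    spmv(A, x, Ap);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];          // initial residual
    for (int i = 0; i < n; ++i)
        for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            if (A.col[k] == i) diag[i] = A.val[k];            // Jacobi preconditioner

    for (int i = 0; i < n; ++i) { z[i] = r[i] / diag[i]; p[i] = z[i]; }
    double rz = 0.0;
    for (int i = 0; i < n; ++i) rz += r[i] * z[i];

    for (int it = 0; it < max_it; ++it) {
        spmv(A, p, Ap);
        double pAp = 0.0;
        for (int i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        const double alpha = rz / pAp;
        double rnorm = 0.0;
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rnorm += r[i] * r[i];
        }
        if (std::sqrt(rnorm) < tol) { std::printf("converged in %d iterations\n", it + 1); return; }
        double rz_new = 0.0;
        for (int i = 0; i < n; ++i) { z[i] = r[i] / diag[i]; rz_new += r[i] * z[i]; }
        const double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
}

int main() {
    // Tiny 1-D Laplacian test matrix in CSR form.
    const int n = 100;
    Csr A; A.n = n; A.row_ptr.push_back(0);
    for (int i = 0; i < n; ++i) {
        if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
        A.col.push_back(i); A.val.push_back(2.0);
        if (i < n - 1) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
        A.row_ptr.push_back(static_cast<int>(A.col.size()));
    }
    std::vector<double> b(n, 1.0), x(n, 0.0);
    pcg(A, b, x);
    return 0;
}
```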
Plasma particle simulations are used extensively for the study of nonlinear phenomena in both space and laboratory plasmas. Here, a well-benchmarked plasma simulation code has been implemented on the 32-node JPL Mark III hypercube to study the applicability of parallel architecture to particle simulation models. In the sequential version of the code, about 90% of the computation time is spent updating the particle positions and velocities. When implemented in parallel on the Mark III Hypercube, this part of the code was sped up by a factor of about 27 (83% efficiency). Computation times on the Mark III have also been compared with times on a variety of other computers.
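As a shared-memory illustration of the particle-push phase that dominates the sequential runtime (the original work instead distributed particles across hypercube nodes via message passing), the following sketch updates particle velocities and positions in a single parallel loop; the field values and physical constants are placeholders.

```cpp
// Minimal illustration of the particle-push phase: positions and velocities
// are updated independently per particle, so the loop parallelizes directly.
#include <cstdio>
#include <vector>

int main() {
    const int num_particles = 1000000;
    const double dt = 0.01, qm = -1.0;           // time step, charge/mass (placeholders)
    std::vector<double> x(num_particles, 0.0), v(num_particles, 1.0);
    std::vector<double> E(num_particles, 0.5);   // field interpolated to particles

    #pragma omp parallel for
    for (int i = 0; i < num_particles; ++i) {
        v[i] += qm * E[i] * dt;                  // velocity update
        x[i] += v[i] * dt;                       // position update
    }
    std::printf("x[0] = %f, v[0] = %f\n", x[0], v[0]);
    return 0;
}
```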
In tree-based adaptive mesh refinement, elements are partitioned between processes using a space-filling curve. The curve establishes an ordering between all elements that derive from the same root element, the tree. When representing complex geometries by connecting several trees, the roots of these trees form an unstructured coarse mesh. We present an algorithm to partition the elements of the coarse mesh such that (a) the fine mesh can be load-balanced to equal element counts per process regardless of the element-to-tree map, and (b) each process that holds fine mesh elements has access to the meta data of all relevant trees. As an additional feature, the algorithm partitions the meta data of relevant ghost (halo) trees as well. We develop in detail how each process computes the communication pattern for the partition routine without handshaking and with minimal data movement. We demonstrate the scalability of this approach on up to 917e3 MPI ranks and 371e9 coarse mesh elements, measuring run times of one second or less.
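The coarse-mesh partitioning algorithm itself is beyond a short sketch; the example below only illustrates the fine-mesh building block it relies on, assuming a 2-D Morton (space-filling) curve: elements are ordered along the curve and then cut into contiguous ranges of equal size per process.

```cpp
// Illustration of the basic building block the paper extends: elements are
// ordered along a space-filling curve (a 2-D Morton curve, for brevity) and
// cut into equal contiguous ranges per process. The paper's contribution
// (partitioning the unstructured coarse mesh and its ghost-tree meta data
// consistently with this fine-mesh partition) is not reproduced here.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Interleave the low 16 bits of x and y into a 32-bit Morton key.
static std::uint32_t morton2d(std::uint16_t x, std::uint16_t y) {
    auto spread = [](std::uint32_t v) {
        v = (v | (v << 8)) & 0x00FF00FFu;
        v = (v | (v << 4)) & 0x0F0F0F0Fu;
        v = (v | (v << 2)) & 0x33333333u;
        v = (v | (v << 1)) & 0x55555555u;
        return v;
    };
    return (spread(y) << 1) | spread(x);
}

int main() {
    const int nx = 64, ny = 64, num_procs = 7;
    std::vector<std::uint32_t> keys;
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x)
            keys.push_back(morton2d(static_cast<std::uint16_t>(x), static_cast<std::uint16_t>(y)));

    // Order elements along the curve, then assign equal contiguous ranges.
    std::sort(keys.begin(), keys.end());
    const std::size_t n = keys.size();
    for (int p = 0; p < num_procs; ++p) {
        const std::size_t begin = n * p / num_procs, end = n * (p + 1) / num_procs;
        std::printf("rank %d owns elements [%zu, %zu)\n", p, begin, end);
    }
    return 0;
}
```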
Determining the inner organizational structure of sets of networked elements is of paramount importance for analyzing real-world systems such as social, biological, or economic networks. To this end, it is necessary to identify communities of interrelated nodes within the networks. Recently, a fuzzy community detection approach based on the minimization of a topological error functional has been proposed in the form of a gradient-based algorithm design pattern. However, the intrinsic quadratic algorithmic complexity of the procedure limits the problem size that can be treated efficiently. Here, we extend the ability of this approach to analyze larger networks by resorting to parallelism. Thus, we identify the sources of concurrency in the gradient-based algorithm design pattern. To determine the limits of parallelization, we develop a two-dimensional performance model as a function of the number of processors and the network size. The model permits computing the maximum possible speedup. Another model is presented to find the maximum problem size tractable in a given amount of time. Application of these models to a set of benchmark networks shows that parallelization speeds up the proposed fuzzy community detection approach by more than an order of magnitude. This allows treatment of networks with several hundred thousand nodes in a time frame of hours.
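The abstract gives neither the functional form of the performance model nor its coefficients; purely as an illustration of how a two-dimensional runtime model T(p, n) yields a maximum speedup and a largest tractable problem size, the sketch below uses an assumed model with a parallelizable quadratic term, a serial term, and a communication term.

```cpp
// Generic illustration of a two-dimensional performance model T(p, n) used to
// derive a maximum speedup and the largest problem size solvable within a time
// budget. The functional form and coefficients are assumptions for the sketch;
// the paper fits its own model to the gradient-based algorithm.
#include <cstdio>

// Hypothetical runtime model: parallel O(n^2) work + serial O(n) part + communication.
static double runtime(int p, double n, double a = 1e-8, double b = 1e-6, double c = 1e-4) {
    return a * n * n / p + b * n + c * p;
}

int main() {
    const double n = 200000.0;            // network size (nodes)
    const double t_serial = runtime(1, n);

    // Scan processor counts to find the maximum speedup predicted by the model.
    double best_speedup = 0.0;
    int best_p = 1;
    for (int p = 1; p <= 4096; ++p) {
        const double s = t_serial / runtime(p, n);
        if (s > best_speedup) { best_speedup = s; best_p = p; }
    }
    std::printf("max predicted speedup %.1f at p = %d\n", best_speedup, best_p);

    // Largest n tractable within a time budget on best_p processors (bisection).
    const double budget = 3600.0;         // one hour
    double lo = 1.0, hi = 1e9;
    for (int it = 0; it < 60; ++it) {
        const double mid = 0.5 * (lo + hi);
        (runtime(best_p, mid) <= budget ? lo : hi) = mid;
    }
    std::printf("largest n within %.0f s: ~%.0f\n", budget, lo);
    return 0;
}
```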
This paper formulates an incomplete projection algorithm that is applied to the image recovery problem. The algorithm allows an easy implementation of dynamic load balancing for parallel architectures. Furthermore, the local computation-to-communication load ratio can be adjusted, since each processor performs a finite number of iterations of any projection-type technique, and this number can be provided as a parameter of the algorithm. Numerical results compare favorably with those obtained by the extrapolated method of parallel subgradient projections.
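As a simplified stand-in for the image recovery setting, the sketch below applies the block-parallel projection idea to a small linear feasibility problem: each block runs a fixed number of Kaczmarz-style projection sweeps over its own rows (the inner iteration count plays the role of the adjustable computation-to-communication knob described in the abstract), and the block iterates are then averaged. The paper's incomplete projection algorithm itself is not reproduced.

```cpp
// Block-parallel projections on a tiny consistent system a_i . x = b_i with
// solution x = (1, 2). Each block refines its own copy of x (parallel), then
// the copies are averaged (the only communication step).
#include <cstdio>
#include <vector>

int main() {
    const std::vector<std::vector<double>> A = { {1, 0}, {0, 1}, {1, 1}, {1, -1} };
    const std::vector<double> b = { 1.0, 2.0, 3.0, -1.0 };
    const std::vector<std::vector<int>> blocks = { {0, 1}, {2, 3} };  // rows per processor
    const int inner_iters = 5, outer_iters = 50;                      // tunable parameters

    std::vector<double> x = { 0.0, 0.0 };
    for (int outer = 0; outer < outer_iters; ++outer) {
        std::vector<std::vector<double>> block_x(blocks.size(), x);

        // Each block runs `inner_iters` projection sweeps over its own rows.
        #pragma omp parallel for
        for (int blk = 0; blk < static_cast<int>(blocks.size()); ++blk) {
            std::vector<double>& y = block_x[blk];
            for (int it = 0; it < inner_iters; ++it)
                for (int i : blocks[blk]) {
                    double dot = 0.0, norm2 = 0.0;
                    for (size_t j = 0; j < y.size(); ++j) { dot += A[i][j] * y[j]; norm2 += A[i][j] * A[i][j]; }
                    const double step = (dot - b[i]) / norm2;      // project onto a_i . y = b_i
                    for (size_t j = 0; j < y.size(); ++j) y[j] -= step * A[i][j];
                }
        }

        // Combine the partial results by averaging.
        for (size_t j = 0; j < x.size(); ++j) {
            double s = 0.0;
            for (const auto& y : block_x) s += y[j];
            x[j] = s / blocks.size();
        }
    }
    std::printf("x = (%f, %f)\n", x[0], x[1]);   // approaches (1, 2)
    return 0;
}
```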
We consider the problem of sampling n numbers from the range {1,..., N} without replacement on modern architectures. The main result is a simple divide-and-conquer scheme that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time O(n/p + log p) on p processors, i.e., scales to massively parallel machines even for moderate values of n. The amount of communication between the processors is very small (at most O(log p)) and independent of the sample size. We also discuss modifications needed for load balancing, online sampling, sampling with replacement, Bernoulli sampling, and vectorization on SIMD units or GPUs.
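The sketch below illustrates the divide-and-conquer splitting step: the range is halved, the number of samples falling into the left half is drawn from a hypergeometric distribution, and the two now-independent subproblems can be handled by different processors. The naive O(n) hypergeometric sampler and the sequential recursion are simplifications of the scheme described in the abstract.

```cpp
// Divide-and-conquer sampling without replacement from {lo, ..., hi}:
// split the range, draw the left-half count hypergeometrically, recurse.
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

static std::mt19937_64 rng(42);

// Number of draws landing in the left half when taking n items without
// replacement from `left` + `right` items. Deliberately simple O(n) stand-in;
// real implementations use a constant-expected-time hypergeometric sampler.
static std::int64_t hypergeometric(std::int64_t left, std::int64_t right, std::int64_t n) {
    std::int64_t hits = 0;
    for (std::int64_t i = 0; i < n; ++i) {
        std::uniform_int_distribution<std::int64_t> pick(1, left + right);
        if (pick(rng) <= left) { ++hits; --left; } else { --right; }
    }
    return hits;
}

// Sample n distinct numbers from {lo, ..., hi} into out.
static void sample(std::int64_t lo, std::int64_t hi, std::int64_t n, std::vector<std::int64_t>& out) {
    if (n == 0) return;
    const std::int64_t size = hi - lo + 1;
    if (size <= 64) {                                        // base case: selection sampling
        for (std::int64_t v = lo; v <= hi && n > 0; ++v) {
            std::uniform_int_distribution<std::int64_t> pick(0, hi - v);
            if (pick(rng) < n) { out.push_back(v); --n; }
        }
        return;
    }
    const std::int64_t mid = lo + size / 2;                  // split the range in half
    const std::int64_t n_left = hypergeometric(mid - lo, hi - mid + 1, n);
    sample(lo, mid - 1, n_left, out);                        // independent subproblems:
    sample(mid, hi, n - n_left, out);                        // these could run on separate processors
}

int main() {
    std::vector<std::int64_t> out;
    sample(1, 1000000000LL, 20, out);
    for (std::int64_t v : out) std::printf("%lld ", static_cast<long long>(v));
    std::printf("\n");
    return 0;
}
```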