This paper investigates the parallel, distributed-memory computation of the translation operator with L + 1 multipoles in the three-dimensional Multilevel Fast Multipole Algorithm (MLFMA). A baseline, communication-free parallel algorithm can compute such a translation operator in O(L) time, using O(L^2) processes. We propose a parallel algorithm that reduces this complexity to O(log L) time. This complexity is theoretically supported and experimentally validated up to 16 384 parallel processes. For realistic cases, the implementation of the proposed algorithm proves to be up to ten times faster than the baseline algorithm. For a large-scale parallel MLFMA simulation with 4096 parallel processes, the runtime for the computation of all translation operators during the setup stage is reduced from roughly one hour to only a few minutes.
In genomic prediction, common analysis methods rely on a linear mixed-model framework to estimate SNP marker effects and breeding values of animals or plants. Ridge regression-best linear unbiased prediction (RR-BLUP) is based on the assumptions that SNP marker effects are normally distributed, are uncorrelated, and have equal variances. We propose DAIRRy-BLUP, a parallel, distributed-memory RR-BLUP implementation, based on single-trait observations (y), that uses the Average Information algorithm for restricted maximum-likelihood estimation of the variance components. The goal of DAIRRy-BLUP is to enable the analysis of large-scale data sets to provide more accurate estimates of marker effects and breeding values. A distributed-memory framework is required since the dimensionality of the problem, determined by the number of SNP markers, can become too large to be analyzed by a single computing node. Initial results show that DAIRRy-BLUP enables the analysis of very large-scale data sets (up to 1,000,000 individuals and 360,000 SNPs) and indicate that increasing the number of phenotypic and genotypic records has a more significant effect on the prediction accuracy than increasing the density of SNP arrays.
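Under the stated RR-BLUP assumptions (normally distributed, uncorrelated marker effects with equal variances), the marker-effect estimates reduce to a ridge regression solve. The following is a minimal serial sketch of that solve, not the DAIRRy-BLUP implementation: it assumes the phenotypes y are already centered (no fixed effects), and it takes the shrinkage parameter `lam` (the ratio of residual variance to marker variance) as a given input rather than estimating it by Average Information REML. The function names `solve_linear` and `rr_blup_marker_effects` are hypothetical and chosen for illustration only.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rr_blup_marker_effects(z, y, lam):
    """Ridge estimate of marker effects: solve (Z'Z + lam * I) u = Z'y.

    z   : list of rows, one per individual, one column per SNP marker
    y   : centered phenotypes (assumption: fixed effects already removed)
    lam : residual variance / marker variance (assumed known here)
    """
    n_markers = len(z[0])
    lhs = [
        [
            sum(row[i] * row[j] for row in z) + (lam if i == j else 0.0)
            for j in range(n_markers)
        ]
        for i in range(n_markers)
    ]
    rhs = [sum(row[i] * yi for row, yi in zip(z, y)) for i in range(n_markers)]
    return solve_linear(lhs, rhs)


if __name__ == "__main__":
    # Toy data: 3 individuals, 2 markers.
    genotypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    phenotypes = [0.5, -0.2, 0.4]
    print(rr_blup_marker_effects(genotypes, phenotypes, lam=1.0))
```

The coefficient matrix has one row and column per SNP marker, which is why the number of markers drives the memory footprint and motivates distributing the matrix over several computing nodes.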
We discuss the design and high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included. Copyright (C) 2007 John Wiley & Sons, Ltd.
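The abstract does not name the individual techniques it combines. As an illustration of the kind of classic collective algorithm in question, the sketch below computes the send schedule of a binomial-tree broadcast, which delivers a message from one root to p processes in ceil(log2 p) rounds. It is a pure-Python simulation with no MPI calls; the function name `binomial_broadcast_schedule` is hypothetical, and nothing here is taken from the paper's implementation.

```python
def binomial_broadcast_schedule(num_procs, root=0):
    """Return the per-round (sender, receiver) pairs of a binomial-tree broadcast.

    Ranks are renumbered relative to the root. In the round with stride
    `step`, every relative rank below `step` already holds the message and
    forwards it to the rank `step` positions above it, if that rank exists.
    """
    schedule = []
    step = 1
    while step < num_procs:
        sends = []
        for rel in range(step):
            dst_rel = rel + step
            if dst_rel < num_procs:
                src = (rel + root) % num_procs
                dst = (dst_rel + root) % num_procs
                sends.append((src, dst))
        schedule.append(sends)
        step *= 2  # the set of ranks holding the message doubles each round
    return schedule


if __name__ == "__main__":
    for round_index, sends in enumerate(binomial_broadcast_schedule(8)):
        print(round_index, sends)
```

Each process receives the message exactly once, and the number of rounds grows logarithmically with the process count, which is the property that makes tree-based schemes attractive for short messages.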
A suite of High Performance Fortran (HPF) coding examples of practical scientific algorithms is examined in detail, with the idea that on these simple but non-trivial examples, we can fairly well understand issues related to different data distributions, different parallel constructs, and different programming styles (static versus dynamic allocations). Coding examples include 2D stencil solutions of PDEs, the N-body problem, LU factorization, several vector/matrix library routines, and 2D and 3D array redistribution. The performance of the HPF codes is compared to that of hand-written Fortran codes using message-passing libraries. From 1997 to 1998, HPF compilers improved significantly, such that HPF codes perform as well as Fortran+MPI codes for all the examples investigated here. However, many important peculiarities of HPF coding still exist. (C) 1999 Elsevier Science B.V. All rights reserved.
Developing an efficient algorithm for solving a large linear system in a parallel computing environment is the major problem associated with the application of parallel processing to the numerical solution of large-scale engineering problems. This paper presents a new algorithm called Multiple Sequential Staging of Tasks (MSST) to speed up the solution of a large linear system. The technique of Sequential Staging of Tasks (SST) is a highly efficient approach to the parallel solution of a large linear system, but it is not suitable for middle- and large-scale parallel computers because of the idle periods of processors. The MSST technique partitions processors into groups and makes each group start its operation from a different row of a large linear system to remove the idle periods. Therefore, MSST can be performed effectively on middle- and large-scale parallel computers and achieves a higher speed-up. Numerical results were obtained from computer experiments performed with a numerical solution method of the Poisson equation on a Dawning-1000 supercomputer (a distributed-memory MIMD architecture). The parallel speed-up is satisfactory. Copyright (C) 1999 John Wiley & Sons, Ltd.