In this paper we present a parallel implementation of Arnoldi's subspace method on the Connection Machine. With a 16K-processor CM2, we obtained performance of a few hundred Megaflops for matrix sizes of several thousand when computing a small number of eigenvalues and eigenvectors. The extrapolated performance on a 64K-processor CM2 indicates that the asymptotic speed will be greater than 1 Gigaflop for very large matrices. We show that it is possible to use the subspace method with a good throughput or speed-up on massively-parallel architectures like the CM2. We remark that other classical methods for linear algebra problems, such as backsubstitution and the QR method, cannot exploit all the potential power of massively-parallel machines. Next, we propose using the subspace method as a programming methodology for massively-parallel machines in order to obtain good performance when solving some large linear algebra problems, especially eigenproblems. This method is the one most frequently used for very large eigenproblems, and it is also well adapted to massively-parallel architectures. The choice of the subspace size is very important both for numerical stability and speed-up. We conclude with a discussion of the effects of the chosen subspace orders on performance.
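For reference, a minimal serial sketch of the m-step Arnoldi iteration that the abstract's parallel implementation maps onto the CM2 (the matrix A, starting vector v0, and subspace size m are illustrative inputs, not the paper's test cases); the eigenvalues of the small Hessenberg matrix H are the Ritz approximations to a few eigenvalues of A.

```python
import numpy as np

def arnoldi(A, v0, m):
    """m-step Arnoldi iteration: builds an orthonormal Krylov basis V and an
    (m+1) x m upper-Hessenberg H satisfying A @ V[:, :m] = V @ H."""
    n = len(v0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                # modified Gram-Schmidt orthogonalization
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:               # happy breakdown: exact invariant subspace
            return V[:, :j + 1], H[:j + 1, :j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

# Ritz values: eigenvalues of the leading m x m block of H approximate
# a few extremal eigenvalues of A; the choice of m affects both stability
# and speed-up, as the abstract notes.
```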
We push the boundaries of electronic structure-based ab-initio molecular dynamics (AIMD) beyond 100 million atoms. This scale is otherwise barely reachable with classical force-field methods or novel neural network and machine learning potentials. We achieve this breakthrough by combining innovations in linear-scaling AIMD, efficient and approximate sparse linear algebra, low and mixed-precision floating-point computation on GPUs, and a compensation scheme for the errors introduced by numerical approximations. The core of our work is the non-orthogonalized local submatrix method (NOLSM), which scales very favorably to massively parallel computing systems and translates large sparse matrix operations into highly parallel, dense matrix operations that are ideally suited to hardware accelerators. We demonstrate that the NOLSM method, which is at the center point of each AIMD step, is able to achieve a sustained performance of 324 PFLOP/s in mixed FP16/FP32 precision corresponding to an efficiency of 67.7% when running on 1536 NVIDIA A100 GPUs.
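As a rough illustration of the submatrix idea underlying NOLSM (a serial NumPy/SciPy sketch, not the authors' mixed-precision GPU implementation), a matrix function of a large sparse matrix is approximated column by column: each column's dense local block is gathered from the sparsity pattern, the function is evaluated exactly on that small dense block, and only the corresponding column of the result is kept. The helper names submatrix_func and inv_sqrt are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csc_matrix

def submatrix_func(A_sparse, dense_func):
    """Approximate f(A) for a sparse symmetric A via dense local submatrices
    (illustrative sketch of the submatrix idea, not the NOLSM production code)."""
    A = csc_matrix(A_sparse)
    n = A.shape[0]
    FA = np.zeros((n, n))
    for i in range(n):
        idx = np.union1d(A[:, i].nonzero()[0], [i])  # rows coupled to column i
        block = A[:, idx][idx, :].toarray()          # dense local submatrix (GPU-friendly)
        fb = dense_func(block)                       # exact dense evaluation of f
        FA[idx, i] = fb[:, np.searchsorted(idx, i)]  # keep only column i of the result
    return FA

def inv_sqrt(M):
    """Dense inverse square root via eigendecomposition (symmetric M assumed)."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

# Example: FA = submatrix_func(A, inv_sqrt) approximates A^{-1/2}; each dense
# block evaluation is independent, which is what makes the method map so well
# onto many accelerators at once.
```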
This work presents strategies to massively parallelize recursive filters on inputs of one dimension (1D) or three dimensions (3D), complementing and improving on previous state-of-the-art algorithms for two dimensions (2D). Each strategy is reusable in different algorithms for parallel processing with feedback data dependencies, allowing the development of highly optimized algorithms for computing digital filters in general, with double-pass causal-anticausal feedback, in one or multiple dimensions. The algorithms are linear in time and memory, expose a high number of parallel tasks, and are implemented on graphics processing units (GPUs). One major barrier in this area is making such algorithms faster than generic counterparts in available libraries; another is making them easy to use. To overcome the latter, the implementation of the presented strategies is available as open source; to overcome the former, timing performance and comparison results are provided, covering a range of publicly available source codes and libraries, showing that this work outperforms the fastest prior algorithms. (C) 2021 Elsevier Inc. All rights reserved.
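The serial building block being parallelized is the double-pass causal-anticausal recursive filter; the sketch below (illustrative only, with a hypothetical first-order feedback coefficient a, not the paper's GPU kernels) makes the feedback data dependency explicit: each output sample depends on the previous one, which is what the paper's strategies restructure into many parallel tasks.

```python
import numpy as np

def causal_anticausal_filter(x, a):
    """Double-pass first-order recursive filter: y[i] = x[i] + a * y[i-1]
    applied causally (forward), then anticausally (backward)."""
    y = np.asarray(x, dtype=np.float64).copy()
    for i in range(1, len(y)):               # causal pass: forward feedback dependency
        y[i] += a * y[i - 1]
    for i in range(len(y) - 2, -1, -1):      # anticausal pass: backward feedback dependency
        y[i] += a * y[i + 1]
    return y

# A parallel version typically splits the input into blocks, computes per-block
# partial results independently, then fixes up the inter-block feedback terms,
# which is the kind of strategy the paper develops for 1D and 3D inputs.
```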