In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.
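As a rough illustration of the overall scheme (not of the authors' GPU kernels), the NumPy sketch below inverts each diagonal block with Gauss-Jordan elimination and applies the preconditioner as a block-wise matrix-vector product. Pivoting here is explicit row swapping, whereas the paper's kernels use an implicit pivoting strategy and batched execution; all function names and the block layout are illustrative assumptions.

import numpy as np

def gauss_jordan_inverse(A):
    """Invert a small dense block via Gauss-Jordan elimination (explicit partial pivoting)."""
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])       # augmented system [A | I]
    for col in range(n):
        piv = col + np.argmax(np.abs(M[col:, col]))   # pivot row with the largest entry
        if piv != col:
            M[[col, piv]] = M[[piv, col]]             # explicit swap (the paper avoids data movement)
        M[col] /= M[col, col]                         # scale the pivot row
        for row in range(n):
            if row != col:
                M[row] -= M[row, col] * M[col]        # eliminate the column in all other rows
    return M[:, n:]                                   # right half now holds the inverse

def block_jacobi_setup(A, block_starts):
    """Extract and invert the diagonal blocks D_i of A (one inversion per block)."""
    bounds = zip(block_starts[:-1], block_starts[1:])
    return [gauss_jordan_inverse(A[s:e, s:e]) for s, e in bounds]

def block_jacobi_apply(inv_blocks, block_starts, r):
    """Apply z = D^{-1} r block by block; on the GPU this is a variable-size batched GEMV."""
    z = np.empty_like(r, dtype=float)
    for (s, e), Binv in zip(zip(block_starts[:-1], block_starts[1:]), inv_blocks):
        z[s:e] = Binv @ r[s:e]
    return z

# Example: a 6x6 matrix split into diagonal blocks of sizes 2, 3, and 1.
A = np.diag(np.arange(1.0, 7.0)) + 0.1 * np.random.default_rng(0).standard_normal((6, 6))
starts = [0, 2, 5, 6]
blocks = block_jacobi_setup(A, starts)
z = block_jacobi_apply(blocks, starts, np.ones(6))

In a preconditioned Krylov solver, the setup runs once and the block-wise application is invoked at every iteration, which is why both steps are worth accelerating.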
Randomized algorithms for the generation of low rank approximations of large dense matrices have become popular methods in scientific computing and machine learning. In this paper, we extend the scope of these methods and present batched GPU randomized algorithms for the efficient generation of low rank representations of large sets of small dense matrices, as well as their generalization to the construction of hierarchically low rank symmetric H-2 matrices with general partitioning structures. In both cases, the algorithms need to access the matrices only through matrix-vector multiplication operations which can be done in blocks to increase the arithmetic intensity and substantially boost the resulting performance. The batched GPU kernels are adaptive, allow nonuniform sizes in the matrices of the batch, and are more effective than SVD factorizations on matrices with fast decaying spectra. The hierarchical matrix generation consists of two phases, interleaved at every level of the matrix hierarchy. A first phase adaptively generates low rank approximations of matrix blocks through randomized matrix-vector sampling. A second phase accumulates and compresses these blocks into a hierarchical matrix that is incrementally constructed. The accumulation expresses the low rank blocks of a given level as a set of local low rank updates that are performed simultaneously on the whole matrix allowing high-performance batched kernels to be used in the compression operations. When the ranks of the blocks generated in the first phase are too large to be processed in a single operation, the low rank updates can be split into smaller-sized updates and applied in sequence. Assuming representative rank k, the resulting matrix has optimal O(kN) asymptotic storage complexity because of the nested bases it uses. The ability to generate an H-2 matrix from matrix-vector products allows us to support a general randomized matrix-matrix multiplication operation, an important kernel in hiera
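The sampling idea can be sketched in a few lines of NumPy for a single symmetric matrix accessed only through blocked matrix-vector products; the paper's contribution is doing this adaptively, in batches of nonuniform sizes, and on the GPU, none of which is captured here. The fixed target rank, the oversampling parameter, and the function names are illustrative assumptions.

import numpy as np

def randomized_low_rank_sym(matvec, n, k, p=8, seed=None):
    """Fixed-rank randomized approximation A ~ Q @ B of a symmetric n x n matrix,
    touched only through matvec(X) = A @ X applied to blocks of vectors."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, k + p))   # block of random probe vectors (oversampled)
    Y = matvec(Omega)                         # one blocked sampling pass: Y = A @ Omega
    Q, _ = np.linalg.qr(Y)                    # orthonormal basis for the sampled range
    B = matvec(Q).T                           # Q^T A = (A @ Q)^T because A is symmetric
    return Q, B                               # A ~ Q @ B, rank at most k + p

# Example on a symmetric matrix with a rapidly decaying spectrum.
n, k = 200, 10
rng = np.random.default_rng(1)
X = rng.standard_normal((n, n))
A = X @ np.diag(0.5 ** np.arange(n)) @ X.T    # symmetric, fast spectral decay
Q, B = randomized_low_rank_sym(lambda V: A @ V, n, k)
rel_err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)

Because the matrix is sampled in blocks of vectors, the dominant cost is a handful of tall-skinny multiplications, which is what gives the batched GPU kernels their high arithmetic intensity.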
Large dense matrices are ubiquitous in scientific computing, arising from the discretization of integral operators associated with elliptic PDEs, Schur complement methods, covariances in spatial statistics, kernel-based machine learning, and numerical optimization problems. Hierarchical matrices are an efficient way of storing the dense matrices of very large dimension that appear in these and related settings. They exploit the fact that the underlying matrices, while formally dense, are data sparse. They have a structure consisting of blocks, many of which can be well-approximated by low rank factorizations. A hierarchical organization of the blocks avoids superlinear growth in memory requirements, allowing n × n dense matrices to be stored in a scalable manner with O(n) units of storage, where the constant depends on a representative rank k for the low rank blocks. The asymptotically optimal storage requirement of the resulting hierarchical matrices is a critical advantage, particularly in extreme computing environments, characterized by low memory per processing core. The challenge then becomes to develop the parallel linear algebra operations that can be performed directly on this compressed representation. In this dissertation, I implement a set of hierarchical basic linear algebra subroutines (HBLAS) optimized for GPUs, including hierarchical matrix-vector multiplication, orthogonalization, compression, low rank updates, and matrix multiplication. I develop a library of open source batched kernel operations previously missing on GPUs for the high performance implementation of the H2 operations, while relying wherever possible on existing open source and vendor kernels to ride future improvements in the technology. Fast marshaling routines extract the batch operation data from an efficient representation of the trees that compose the hierarchical matrices. The methods developed for GPUs extend to CPUs using the same code base with simple abstractions around the batched r
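To give a concrete flavor of one such building block, the sketch below shows a serial QR-plus-SVD recompression of a low rank factorization, the kind of operation that compression and low rank updates reduce to; a batched kernel would run this over many blocks of different sizes at once. This is an illustrative stand-in rather than code from the library, and the routine name and tolerance are assumptions.

import numpy as np

def recompress_low_rank(U, V, tol=1e-8):
    """Recompress a low rank product U @ V.T to a (usually smaller) rank meeting tol.
    Two thin QR factorizations reduce the problem to an SVD of a small core matrix."""
    Qu, Ru = np.linalg.qr(U, mode='reduced')
    Qv, Rv = np.linalg.qr(V, mode='reduced')
    W = Ru @ Rv.T                                   # small r x r core matrix
    Uw, s, Vwt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(np.sum(s > tol * s[0])))         # truncated rank, relative to s[0]
    U_new = Qu @ (Uw[:, :k] * s[:k])                # fold singular values into the left factor
    V_new = Qv @ Vwt[:k].T
    return U_new, V_new                             # U_new @ V_new.T approximates U @ V.T

# Example: a rank-20 product whose numerical rank is much lower.
rng = np.random.default_rng(2)
U = rng.standard_normal((500, 20)) * (0.3 ** np.arange(20))   # decaying column weights
V = rng.standard_normal((400, 20))
U2, V2 = recompress_low_rank(U, V, tol=1e-6)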