The direct solution of batches of band linear systems in parallel is important for many applications. In this paper, we elaborate on three new GPU algorithms for the data-parallel direct solution of linear system batches sharing a band structure. We develop algorithms for three matrix types: tridiagonal, small bandwidth, and wide bandwidth. We exploit the band structure of the matrix, store it in an efficient fashion (the LAPACK band matrix format), and ensure that the SIMD parallelism of the GPU is maximized. We develop a panel-based factorization for wide-band matrices to ensure coalesced access (with column-major storage) while minimizing main memory traffic. For the tridiagonal solvers, to ensure a high level of concurrency, we adapt a divide-and-conquer approach and use cooperative-group functionality to communicate efficiently between compute units (in registers) on GPUs. We implement these algorithms for NVIDIA GPUs and study their performance for varying matrix sizes (16 to 1024) and across a range of batch sizes (up to 1 × 10^6). We compare the performance of our implementations with the corresponding optimized vendor implementations (cuSPARSE and MKL), with the state-of-the-art GPU library MAGMA, and with the optimized LAPACK implementation provided by Intel MKL on Intel Skylake CPUs. We also showcase the effectiveness of our batched band solvers for matrices originating from XGC, a gyrokinetic Particle-In-Cell (PIC) code for modeling the plasma in the edge region of fusion devices. We show that our implementations are on average ~2x (for batched banded solvers, compared to MAGMA and MKL) to ~3x (for batched tridiagonal solvers, compared to cuSPARSE) faster than the state-of-the-art and vendor-provided implementations.
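The LAPACK band matrix format referred to above can be illustrated with a small NumPy/SciPy sketch (a serial CPU stand-in for the paper's batched GPU kernels; `to_band_storage` is a hypothetical helper name). A matrix with kl subdiagonals and ku superdiagonals is packed into a (kl + ku + 1) × n array, with entry A[i, j] stored in row ku + i − j:

```python
import numpy as np
from scipy.linalg import solve_banded

def to_band_storage(A, kl, ku):
    """Pack a dense banded matrix into LAPACK-style band storage.

    Entry A[i, j] is stored at ab[ku + i - j, j]; the result has
    kl + ku + 1 rows, one per diagonal.
    """
    n = A.shape[0]
    ab = np.zeros((kl + ku + 1, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j, j] = A[i, j]
    return ab

# A small batch of tridiagonal systems (kl = ku = 1), solved one by one;
# the paper's GPU kernels instead process the whole batch in parallel.
rng = np.random.default_rng(0)
batch = [np.diag(rng.uniform(2, 3, 8))
         + np.diag(rng.uniform(-1, 0, 7), 1)
         + np.diag(rng.uniform(-1, 0, 7), -1) for _ in range(4)]
rhs = [rng.standard_normal(8) for _ in range(4)]
xs = [solve_banded((1, 1), to_band_storage(A, 1, 1), b)
      for A, b in zip(batch, rhs)]
```

Only the kl + ku + 1 nonzero diagonals are stored, which is what makes coalesced column-major access of the band practical on the GPU.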
Large dense matrices are ubiquitous in scientific computing, arising from the discretization of integral operators associated with elliptic PDEs, Schur complement methods, covariances in spatial statistics, kernel-based machine learning, and numerical optimization problems. Hierarchical matrices are an efficient way of storing the dense matrices of very large dimension that appear in these and related settings. They exploit the fact that the underlying matrices, while formally dense, are data sparse. They have a structure consisting of blocks, many of which can be well approximated by low rank factorizations. A hierarchical organization of the blocks avoids superlinear growth in the memory required to store n × n dense matrices, scaling as O(n) units of storage with a constant depending on a representative rank k for the low rank blocks. The asymptotically optimal storage requirement of the resulting hierarchical matrices is a critical advantage, particularly in extreme computing environments characterized by low memory per processing core. The challenge then becomes to develop the parallel linear algebra operations that can be performed directly on this compressed representation. In this dissertation, I implement a set of hierarchical basic linear algebra subroutines (HBLAS) optimized for GPUs, including hierarchical matrix-vector multiplication, orthogonalization, compression, low rank updates, and matrix multiplication. I develop a library of open source batched kernel operations previously missing on GPUs for the high performance implementation of the H^2 operations, while relying wherever possible on existing open source and vendor kernels to ride future improvements in the technology. Fast marshaling routines extract the batch operation data from an efficient representation of the trees that compose the hierarchical matrices. The methods developed for GPUs extend to CPUs using the same code base with simple abstractions around the batched routines.
Randomized algorithms for the generation of low rank approximations of large dense matrices have become popular methods in scientific computing and machine learning. In this paper, we extend the scope of these methods and present batched GPU randomized algorithms for the efficient generation of low rank representations of large sets of small dense matrices, as well as their generalization to the construction of hierarchically low rank symmetric H^2 matrices with general partitioning structures. In both cases, the algorithms need to access the matrices only through matrix-vector multiplication operations, which can be done in blocks to increase the arithmetic intensity and substantially boost the resulting performance. The batched GPU kernels are adaptive, allow nonuniform sizes in the matrices of the batch, and are more effective than SVD factorizations on matrices with fast decaying spectra. The hierarchical matrix generation consists of two phases, interleaved at every level of the matrix hierarchy. A first phase adaptively generates low rank approximations of matrix blocks through randomized matrix-vector sampling. A second phase accumulates and compresses these blocks into a hierarchical matrix that is incrementally constructed. The accumulation expresses the low rank blocks of a given level as a set of local low rank updates that are performed simultaneously on the whole matrix, allowing high-performance batched kernels to be used in the compression operations. When the ranks of the blocks generated in the first phase are too large to be processed in a single operation, the low rank updates can be split into smaller-sized updates and applied in sequence. Assuming representative rank k, the resulting matrix has optimal O(kN) asymptotic storage complexity because of the nested bases it uses. The ability to generate an H^2 matrix from matrix-vector products allows us to support a general randomized matrix-matrix multiplication operation, an important kernel in hierarchical matrix computations.
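The matrix-vector-only access pattern described above is the classic randomized range finder; a minimal serial NumPy sketch (with hypothetical function names, standing in for the paper's adaptive batched GPU kernels) builds a rank-k approximation from k + p sampled products:

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_lowrank(matvec, rmatvec, n, k, p=5):
    """Randomized range finder using only mat-vec products.

    matvec(X) computes A @ X and rmatvec(X) computes A.T @ X;
    k is the target rank and p a small oversampling parameter.
    Returns Q, B with A ≈ Q @ B.
    """
    Omega = rng.standard_normal((n, k + p))   # blocked random probes
    Q, _ = np.linalg.qr(matvec(Omega))        # orthonormal range basis
    B = rmatvec(Q).T                          # B = Q.T @ A
    return Q, B

# Test matrix with a fast-decaying spectrum, where these methods excel.
n, k = 200, 10
U0 = np.linalg.qr(rng.standard_normal((n, n)))[0]
s = 2.0 ** -np.arange(n)
A = (U0 * s) @ U0.T

Q, B = randomized_lowrank(lambda X: A @ X, lambda X: A.T @ X, n, k)
err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
```

Because the probes are applied in blocks, the dominant cost is dense matrix-matrix multiplication, which is exactly the high-arithmetic-intensity operation GPUs execute best.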
In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.
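Gauss-Jordan elimination for inverting one small diagonal block can be sketched in NumPy (no pivoting shown here, and the blocks are assumed well conditioned; the paper's kernels add implicit pivoting and invert the whole variable-size batch in parallel):

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert a small dense block by Gauss-Jordan elimination.

    Reduces the augmented system [A | I] to [I | A^{-1}] by
    normalizing each pivot row and eliminating its column in
    all other rows. Pivoting is omitted for brevity.
    """
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])
    for c in range(n):
        M[c] /= M[c, c]                # normalize pivot row
        for r in range(n):
            if r != c:
                M[r] -= M[r, c] * M[c] # eliminate column c elsewhere
    return M[:, n:]

# Block-Jacobi setup: invert each diagonal block independently, the
# embarrassingly parallel pattern the batched GPU kernel exploits.
rng = np.random.default_rng(2)
blocks = [rng.standard_normal((4, 4)) + 4.0 * np.eye(4) for _ in range(8)]
invs = [gauss_jordan_inverse(B) for B in blocks]
```

Applying the preconditioner is then a batched matrix-vector product of each inverted block with the corresponding slice of the right-hand side.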