The direct solution of batches of band linear systems in parallel is important for many applications. In this paper, we elaborate on three new GPU algorithms for the data-parallel direct solution of linear system batches sharing a band structure. We develop algorithms for three matrix types: tridiagonal, small bandwidth, and wide bandwidth. We exploit the band structure of the matrix, store it in an efficient fashion (the LAPACK band matrix format), and ensure that the SIMD parallelism of the GPU is maximized. We develop a panel-based factorization for wide-band matrices to ensure coalesced access (with column-major storage) while minimizing main memory traffic. For the tridiagonal solvers, to ensure a high level of concurrency, we adapt a divide-and-conquer approach and use cooperative-group functionality to communicate efficiently between compute units (in registers) on GPUs. We implement these algorithms for NVIDIA GPUs and study their performance for varying matrix sizes (16 to 1024) and across a range of batch sizes (up to 1 × 10^6). We compare the performance of our implementations with the corresponding optimized vendor implementations (cuSPARSE and MKL), with the state-of-the-art GPU library MAGMA, and with the optimized LAPACK implementation provided by Intel MKL on Intel Skylake CPUs. We also showcase the effectiveness of our batched band solvers for matrices originating from XGC, a gyrokinetic Particle-In-Cell (PIC) code for modeling the plasma in the edge region of fusion devices. We show that our implementations are on average ~2x (for batched banded solvers, compared to MAGMA and MKL) to ~3x (for batched tridiagonal solvers, compared to cuSPARSE) faster than the state-of-the-art and vendor-provided implementations.
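The LAPACK band matrix format referred to above can be illustrated with a small NumPy/SciPy sketch (a serial CPU stand-in for the paper's batched GPU kernels; `to_band_storage` is a hypothetical helper name). A matrix with kl subdiagonals and ku superdiagonals is packed into a (kl + ku + 1) × n array, with entry A[i, j] stored in row ku + i − j:

```python
import numpy as np
from scipy.linalg import solve_banded

def to_band_storage(A, kl, ku):
    """Pack a dense banded matrix into LAPACK-style band storage.

    Entry A[i, j] is stored at ab[ku + i - j, j]; the result has
    kl + ku + 1 rows, one per diagonal.
    """
    n = A.shape[0]
    ab = np.zeros((kl + ku + 1, n))
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j, j] = A[i, j]
    return ab

# A small batch of tridiagonal systems (kl = ku = 1), solved one by one;
# the paper's GPU kernels instead process the whole batch in parallel.
rng = np.random.default_rng(0)
batch = [np.diag(rng.uniform(2, 3, 8))
         + np.diag(rng.uniform(-1, 0, 7), 1)
         + np.diag(rng.uniform(-1, 0, 7), -1) for _ in range(4)]
rhs = [rng.standard_normal(8) for _ in range(4)]
xs = [solve_banded((1, 1), to_band_storage(A, 1, 1), b)
      for A, b in zip(batch, rhs)]
```

Only the kl + ku + 1 nonzero diagonals are stored, which is what makes coalesced column-major access of the band practical on the GPU.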
Large dense matrices are ubiquitous in scientific computing, arising from the discretization of integral operators associated with elliptic PDEs, Schur complement methods, covariances in spatial statistics, kernel-based machine learning, and numerical optimization problems. Hierarchical matrices are an efficient way of storing the dense matrices of very large dimension that appear in these and related settings. They exploit the fact that the underlying matrices, while formally dense, are data sparse. They have a structure consisting of blocks, many of which can be well approximated by low rank factorizations. A hierarchical organization of the blocks avoids superlinear growth in the memory required to store n × n dense matrices, scaling as O(n) units of storage with a constant depending on a representative rank k for the low rank blocks. The asymptotically optimal storage requirement of the resulting hierarchical matrices is a critical advantage, particularly in extreme computing environments characterized by low memory per processing core. The challenge then becomes to develop the parallel linear algebra operations that can be performed directly on this compressed representation. In this dissertation, I implement a set of hierarchical basic linear algebra subroutines (HBLAS) optimized for GPUs, including hierarchical matrix-vector multiplication, orthogonalization, compression, low rank updates, and matrix multiplication. I develop a library of open source batched kernel operations previously missing on GPUs for the high performance implementation of the H^2 operations, while relying wherever possible on existing open source and vendor kernels to ride future improvements in the technology. Fast marshaling routines extract the batch operation data from an efficient representation of the trees that compose the hierarchical matrices. The methods developed for GPUs extend to CPUs using the same code base with simple abstractions around the batched routines.
Randomized algorithms for the generation of low rank approximations of large dense matrices have become popular methods in scientific computing and machine learning. In this paper, we extend the scope of these methods and present batched GPU randomized algorithms for the efficient generation of low rank representations of large sets of small dense matrices, as well as their generalization to the construction of hierarchically low rank symmetric H^2 matrices with general partitioning structures. In both cases, the algorithms need to access the matrices only through matrix-vector multiplication operations, which can be done in blocks to increase the arithmetic intensity and substantially boost the resulting performance. The batched GPU kernels are adaptive, allow nonuniform sizes in the matrices of the batch, and are more effective than SVD factorizations on matrices with fast decaying spectra. The hierarchical matrix generation consists of two phases, interleaved at every level of the matrix hierarchy. A first phase adaptively generates low rank approximations of matrix blocks through randomized matrix-vector sampling. A second phase accumulates and compresses these blocks into a hierarchical matrix that is incrementally constructed. The accumulation expresses the low rank blocks of a given level as a set of local low rank updates that are performed simultaneously on the whole matrix, allowing high-performance batched kernels to be used in the compression operations. When the ranks of the blocks generated in the first phase are too large to be processed in a single operation, the low rank updates can be split into smaller-sized updates and applied in sequence. Assuming representative rank k, the resulting matrix has optimal O(kN) asymptotic storage complexity because of the nested bases it uses. The ability to generate an H^2 matrix from matrix-vector products allows us to support a general randomized matrix-matrix multiplication operation, an important kernel in hierarchical matrix computations.
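The matrix-vector-only access pattern described above is the classic randomized range finder; a minimal serial NumPy sketch (with hypothetical function names, standing in for the paper's adaptive batched GPU kernels) builds a rank-k approximation from k + p sampled products:

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_lowrank(matvec, rmatvec, n, k, p=5):
    """Randomized range finder using only mat-vec products.

    matvec(X) computes A @ X and rmatvec(X) computes A.T @ X;
    k is the target rank and p a small oversampling parameter.
    Returns Q, B with A ≈ Q @ B.
    """
    Omega = rng.standard_normal((n, k + p))   # blocked random probes
    Q, _ = np.linalg.qr(matvec(Omega))        # orthonormal range basis
    B = rmatvec(Q).T                          # B = Q.T @ A
    return Q, B

# Test matrix with a fast-decaying spectrum, where these methods excel.
n, k = 200, 10
U0 = np.linalg.qr(rng.standard_normal((n, n)))[0]
s = 2.0 ** -np.arange(n)
A = (U0 * s) @ U0.T

Q, B = randomized_lowrank(lambda X: A @ X, lambda X: A.T @ X, n, k)
err = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
```

Because the probes are applied in blocks, the dominant cost is dense matrix-matrix multiplication, which is exactly the high-arithmetic-intensity operation GPUs execute best.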
In this work, we address the efficient realization of block-Jacobi preconditioning on graphics processing units (GPUs). This task requires the solution of a collection of small and independent linear systems. To fully realize this implementation, we develop a variable-size batched matrix inversion kernel that uses Gauss-Jordan elimination (GJE) along with a variable-size batched matrix-vector multiplication kernel that transforms the linear systems' right-hand sides into the solution vectors. Our kernels make heavy use of the increased register count and the warp-local communication associated with newer GPU architectures. Moreover, in the matrix inversion, we employ an implicit pivoting strategy that migrates the workload (i.e., operations) to the place where the data resides instead of moving the data to the executing cores. We complement the matrix inversion with extraction and insertion strategies that allow the block-Jacobi preconditioner to be set up rapidly. The experiments on NVIDIA's K40 and P100 architectures reveal that our variable-size batched matrix inversion routine outperforms the CUDA basic linear algebra subroutine (cuBLAS) library functions that provide the same (or even less) functionality. We also show that the preconditioner setup and preconditioner application cost can be somewhat offset by the faster convergence of the iterative solver. (C) 2018 Elsevier B.V. All rights reserved.
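Gauss-Jordan elimination for inverting one small diagonal block can be sketched in NumPy (no pivoting shown here, and the blocks are assumed well conditioned; the paper's kernels add implicit pivoting and invert the whole variable-size batch in parallel):

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Invert a small dense block by Gauss-Jordan elimination.

    Reduces the augmented system [A | I] to [I | A^{-1}] by
    normalizing each pivot row and eliminating its column in
    all other rows. Pivoting is omitted for brevity.
    """
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])
    for c in range(n):
        M[c] /= M[c, c]                # normalize pivot row
        for r in range(n):
            if r != c:
                M[r] -= M[r, c] * M[c] # eliminate column c elsewhere
    return M[:, n:]

# Block-Jacobi setup: invert each diagonal block independently, the
# embarrassingly parallel pattern the batched GPU kernel exploits.
rng = np.random.default_rng(2)
blocks = [rng.standard_normal((4, 4)) + 4.0 * np.eye(4) for _ in range(8)]
invs = [gauss_jordan_inverse(B) for B in blocks]
```

Applying the preconditioner is then a batched matrix-vector product of each inverted block with the corresponding slice of the right-hand side.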