Modern GPUs equipped with mixed-precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to a significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed-precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that, compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2x and 3.5x speedups on V100 and A100 GPUs, respectively.
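To make the left-looking scheme concrete, the sketch below is a minimal NumPy emulation of a blocked left-looking LU in which the matrix is stored in half precision and only the current block column is promoted to a single precision buffer; updates and the panel factorization run in fp32, standing in for tensor-core accumulation. It omits pivoting and the doubly partitioned panel kernel, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def left_looking_lu_fp16(A16, b=128):
    """Blocked left-looking LU (no pivoting) with the matrix stored in fp16.

    Only the current block column is promoted to an fp32 buffer of controlled
    size; all updates and the panel factorization run in fp32, emulating
    fp16-storage / fp32-compute tensor-core kernels. A16 is overwritten with
    the packed L\\U factors in half precision.
    """
    n = A16.shape[0]
    for k in range(0, n, b):
        w = min(b, n - k)
        panel = A16[:, k:k + w].astype(np.float32)   # fp32 buffer, one panel wide
        # left-looking phase: apply all previously factorized panels
        for p in range(0, k, b):
            pe = p + b
            Lpp = np.tril(A16[p:pe, p:pe].astype(np.float32), -1) \
                  + np.eye(pe - p, dtype=np.float32)
            panel[p:pe, :] = np.linalg.solve(Lpp, panel[p:pe, :])              # U block
            panel[pe:, :] -= A16[pe:, p:pe].astype(np.float32) @ panel[p:pe, :]  # GEMM update
        # unblocked panel factorization in fp32 (stand-in for the paper's
        # mixed-precision doubly partitioned tensor-core kernel)
        for c in range(w):
            panel[k + c + 1:, c] /= panel[k + c, c]
            panel[k + c + 1:, c + 1:] -= np.outer(panel[k + c + 1:, c],
                                                  panel[k + c, c + 1:])
        A16[:, k:k + w] = panel.astype(np.float16)   # store factors back in half precision
    return A16
```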
We introduce a novel approach to exploit mixed-precision arithmetic for low-rank approximations. Our approach is based on the observation that singular vectors associated with small singular values can be stored in lower precisions while preserving high accuracy overall. We provide an explicit criterion to determine which level of precision is needed for each singular vector. We apply this approach to block low-rank (BLR) matrices, most of whose off-diagonal blocks have low rank. We propose a new BLR LU factorization algorithm that exploits the mixed-precision representation of the blocks. We carry out the rounding error analysis of this algorithm and prove that the use of mixed-precision arithmetic does not compromise the numerical stability of the BLR LU factorization. Moreover, our analysis determines which level of precision is needed for each floating-point operation (flop), and therefore guides us toward an implementation that is both robust and efficient. We evaluate the potential of this new algorithm on a range of matrices coming from real-life problems in industrial and academic applications. We show that a large fraction of the entries in the LU factors and of the flops to perform the BLR LU factorization can be safely switched to lower precisions, leading to significant reductions of the storage and expected time costs, of up to a factor of three using fp64, fp32, and bfloat16 arithmetics.
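As an illustration of the idea, the sketch below builds a mixed-precision truncated SVD in NumPy: each singular vector is stored in the lowest precision whose unit roundoff keeps its contribution to the error of order ε·σ₁. The grouping rule σᵢ ≤ (ε/uₚ)·σ₁ is written in the spirit of the abstract rather than quoting the paper's exact criterion, and np.float16 stands in for bfloat16, which NumPy lacks.

```python
import numpy as np

# Unit roundoffs of the storage formats considered (fp16 is used below as a
# stand-in for bfloat16, whose u is 2^-8, since NumPy has no bfloat16 dtype).
U_FP32, U_BF16 = 2.0**-24, 2.0**-8

def mixed_precision_lra(A, eps):
    """Truncated SVD whose singular vectors are stored in the lowest precision
    compatible with an overall error of order eps * ||A||_2.
    The rule sigma_i <= (eps / u_p) * sigma_1 is an assumed criterion in the
    spirit of the abstract, not the paper's exact formula."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > eps * s[0]))            # usual rank truncation
    factors = []
    for i in range(r):
        if s[i] > (eps / U_FP32) * s[0]:
            prec = np.float64                  # too large: keep full precision
        elif s[i] > (eps / U_BF16) * s[0]:
            prec = np.float32
        else:
            prec = np.float16                  # bfloat16 stand-in
        factors.append((U[:, i].astype(prec), float(s[i]), Vt[i].astype(prec)))
    return factors

def reconstruct(factors):
    """Rebuild the approximation in fp64 to check the error against eps."""
    return sum(s * np.outer(u.astype(np.float64), v.astype(np.float64))
               for u, s, v in factors)
```

On a matrix with rapidly decaying singular values, most vectors end up in the two lower precisions, which is where the storage savings quoted in the abstract come from.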
ISBN (print): 9783031695827; 9783031695834
Randomized projection methods have been shown to be very efficient at computing low-rank approximations (LRA) of large matrices. In this work, we investigate the design and development of such methods capable of exploiting recent mixed-precision accelerators such as GPUs equipped with tensor core units. We combine three new ideas to exploit mixed-precision arithmetic in randomized LRA. The first is to perform the matrix multiplication with mixed-precision fp16/fp32 tensor cores. The second is to use CholeskyQR orthonormalization, which is much faster on GPUs, while mitigating its numerical instability by using fp64 arithmetic. The third is to use a recently proposed iterative refinement method for LRA, applying it twice to improve the accuracy of the approximation. We implement the proposed approach on various GPU architectures and analyze its performance and accuracy. We compare with a standard randomized LRA entirely in fp32 arithmetic, which achieves an average accuracy of order 10⁻⁴. Our results show that our approach without refinement is up to 8x faster, with an average accuracy of order 10⁻², which may be acceptable for some applications. Otherwise, we show that using refinement significantly improves the accuracy to an average of order 10⁻⁵, while remaining up to 2.2x faster than the standard fp32 randomized LRA. This work illustrates the convergence of approximate computing techniques by combining low-rank approximations, randomization, mixed-precision arithmetic, and GPU acceleration.
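A CPU-side sketch of how these three ingredients fit together is given below; NumPy casts emulate the fp16/fp32 tensor-core GEMM and the fp64 CholeskyQR, and the refinement is written as a second randomized pass on the residual, which is one plausible reading of the abstract. The function names, the oversampling parameter p, and the seed are illustrative, not from the paper.

```python
import numpy as np

def rand_lra_mixed(A, r, p=8, refine=True, seed=0):
    """Randomized rank-r approximation with emulated mixed-precision kernels:
      * sketch Y = A @ Omega with fp16 operands accumulated in fp32,
      * orthonormalize Y with CholeskyQR, forming the Gram matrix in fp64,
      * optionally run a second pass on the residual as a refinement step.
    """
    rng = np.random.default_rng(seed)

    def sketch(M, k):
        Omega = rng.standard_normal((M.shape[1], k)).astype(np.float16)
        # emulate an fp16/fp32 tensor-core GEMM: round operands to fp16,
        # accumulate the product in fp32
        return M.astype(np.float16).astype(np.float32) @ Omega.astype(np.float32)

    def cholesky_qr(Y):
        Y64 = Y.astype(np.float64)
        R = np.linalg.cholesky(Y64.T @ Y64).T        # Y^T Y = R^T R in fp64
        return np.linalg.solve(R.T, Y64.T).T         # Q = Y R^{-1}

    def one_pass(M):
        Q = cholesky_qr(sketch(M, r + p))            # orthonormal range basis
        return Q @ (Q.T @ M)                         # M ~ Q Q^T M

    A1 = one_pass(A)
    return A1 if not refine else A1 + one_pass(A - A1)
```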
ISBN (print): 9780769548647; 9781467330275
We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement, which run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) the tree reduction technique, which exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the PowerPack framework and the Running Average Power Limit model, respectively.
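The first strategy is the classical mixed precision iterative refinement scheme; a minimal SciPy sketch is shown below, with the O(n³) factorization done in fp32 and the residual correction loop in fp64. The tolerance and iteration cap are illustrative, and this does not reproduce PLASMA's tile implementation.

```python
import numpy as np
import scipy.linalg as sla

def mixed_precision_solve(A, b, tol=1e-14, max_iter=10):
    """Mixed precision iterative refinement: LU factorization and triangular
    solves in fp32, residuals and solution updates in fp64, aiming at double
    precision accuracy at single precision speed."""
    A32 = A.astype(np.float32)
    lu, piv = sla.lu_factor(A32)                           # O(n^3) work in fp32
    x = sla.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                      # residual in fp64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = sla.lu_solve((lu, piv), r.astype(np.float32))  # cheap fp32 correction
        x += d.astype(np.float64)
    return x
```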
ISBN (print): 9781467379526
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries and combining them with architectural features, we developed a generic approach to solve dense linear systems on various architectures including CPU and GPU. We have then extended our work with an example of a least squares solver based on semi-normal equations in mixed precision, which cannot be found in current libraries. For the automatically generated solvers, we report performance comparisons with state-of-the-art codes, and show that it is possible to obtain a generic code with a high-level interface (similar to MATLAB) which runs either on CPU or GPU without generating a significant overhead.
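For reference, the semi-normal-equations least squares method mentioned above amounts to the following: compute only the R factor of A (here in fp32), solve R^T R x = A^T b in fp64, and apply a refinement step to recover the accuracy lost in the low-precision factorization. The NumPy/SciPy sketch below follows the textbook (corrected) semi-normal equations method under these assumptions; it is not NT2-generated code.

```python
import numpy as np
from scipy.linalg import solve_triangular

def seminormal_lstsq(A, b, refine_steps=1):
    """Least squares via corrected semi-normal equations in mixed precision:
    R comes from a QR factorization in fp32; the triangular solves of
    R^T R x = A^T b and the refinement residuals are carried out in fp64."""
    R = np.linalg.qr(A.astype(np.float32), mode='r').astype(np.float64)

    def sne_solve(rhs):
        y = solve_triangular(R, rhs, trans='T')   # R^T y = rhs
        return solve_triangular(R, y)             # R x = y

    x = sne_solve(A.T @ b)
    for _ in range(refine_steps):                 # one correction step usually
        r = b - A @ x                             # recovers near-QR accuracy
        x += sne_solve(A.T @ r)
    return x
```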
This article describes a new high performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architectures enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) a matrix-matrix multiplication to get the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops when computing all singular vectors compared to the standard SVD solver, our new high performance implementation on a single GPU results in up to 4x improvements for asymptotic matrix sizes, compared to the equivalent routines from existing state-of-the-art open-source and commercial libraries. However, when only singular values are needed, QDWH-SVD is penalized by performing an order of magnitude more flops. The singular-value-only implementation of QDWH-SVD on a single GPU can still run up to 18% faster than the best existing equivalent routines.
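The three computational stages can be summarized by the short NumPy/SciPy sketch below, which uses scipy.linalg.polar as a stand-in for the QDWH polar iteration and makes no attempt to reproduce the multi-GPU implementation.

```python
import numpy as np
from scipy.linalg import polar, eigh

def polar_based_svd(A):
    """SVD via the three stages described in the abstract:
      (1) polar decomposition A = Up @ H (QDWH iteration in the paper),
      (2) symmetric eigendecomposition H = V @ diag(s) @ V.T,
      (3) matrix multiply U = Up @ V to recover the left singular vectors."""
    Up, H = polar(A)              # stage 1: polar factor and Hermitian factor
    s, V = eigh(H)                # stage 2: singular values + right vectors
    s, V = s[::-1], V[:, ::-1]    # reorder to the usual descending convention
    U = Up @ V                    # stage 3: left singular vectors
    return U, s, V.T              # A ~ U @ diag(s) @ V.T
```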