Modern GPUs equipped with mixed-precision tensor core units present great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, leading to expensive data movement and storage costs. This is explained by the fact that simply switching the storage precision from single to half leads to a significant loss of accuracy, forfeiting all accuracy benefits from using tensor core technology. In this article, we propose a new factorization algorithm that is able to store the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme employing single precision buffers of controlled size and a mixed-precision doubly partitioned algorithm exploiting tensor cores in the panel factorizations. Our numerical results show that, compared with the state of the art, the proposed approach is of similar accuracy but with only half the data movement and memory footprint, and hence potentially much faster: it achieves up to 2x and 3.5x speedups on V100 and A100 GPUs, respectively.
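To make the left-looking scheme concrete, the sketch below is a minimal NumPy emulation of a blocked left-looking LU in which the matrix is stored in half precision and only the current block column is promoted to a single precision buffer; updates and the panel factorization run in fp32, standing in for tensor-core accumulation. It omits pivoting and the doubly partitioned panel kernel, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def left_looking_lu_fp16(A16, b=128):
    """Blocked left-looking LU (no pivoting) with the matrix stored in fp16.

    Only the current block column is promoted to an fp32 buffer of controlled
    size; all updates and the panel factorization run in fp32, emulating
    fp16-storage / fp32-compute tensor-core kernels. A16 is overwritten with
    the packed L\\U factors in half precision.
    """
    n = A16.shape[0]
    for k in range(0, n, b):
        w = min(b, n - k)
        panel = A16[:, k:k + w].astype(np.float32)   # fp32 buffer, one panel wide
        # left-looking phase: apply all previously factorized panels
        for p in range(0, k, b):
            pe = p + b
            Lpp = np.tril(A16[p:pe, p:pe].astype(np.float32), -1) \
                  + np.eye(pe - p, dtype=np.float32)
            panel[p:pe, :] = np.linalg.solve(Lpp, panel[p:pe, :])              # U block
            panel[pe:, :] -= A16[pe:, p:pe].astype(np.float32) @ panel[p:pe, :]  # GEMM update
        # unblocked panel factorization in fp32 (stand-in for the paper's
        # mixed-precision doubly partitioned tensor-core kernel)
        for c in range(w):
            panel[k + c + 1:, c] /= panel[k + c, c]
            panel[k + c + 1:, c + 1:] -= np.outer(panel[k + c + 1:, c],
                                                  panel[k + c, c + 1:])
        A16[:, k:k + w] = panel.astype(np.float16)   # store factors back in half precision
    return A16
```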
We introduce a novel approach to exploit mixed-precision arithmetic for low-rank approximations. Our approach is based on the observation that singular vectors associated with small singular values can be stored in lower precisions while preserving high accuracy overall. We provide an explicit criterion to determine which level of precision is needed for each singular vector. We apply this approach to block low-rank (BLR) matrices, most of whose off-diagonal blocks have low rank. We propose a new BLR LU factorization algorithm that exploits the mixed-precision representation of the blocks. We carry out the rounding error analysis of this algorithm and prove that the use of mixed-precision arithmetic does not compromise the numerical stability of the BLR LU factorization. Moreover, our analysis determines which level of precision is needed for each floating-point operation (flop), and therefore guides us toward an implementation that is both robust and efficient. We evaluate the potential of this new algorithm on a range of matrices coming from real-life problems in industrial and academic applications. We show that a large fraction of the entries in the LU factors and of the flops to perform the BLR LU factorization can be safely switched to lower precisions, leading to significant reductions of the storage and expected time costs, of up to a factor of three using fp64, fp32, and bfloat16 arithmetics.
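As an illustration of the idea, the sketch below builds a mixed-precision truncated SVD in NumPy: each singular vector is stored in the lowest precision whose unit roundoff keeps its contribution to the error of order ε·σ₁. The grouping rule σᵢ ≤ (ε/uₚ)·σ₁ is written in the spirit of the abstract rather than quoting the paper's exact criterion, and np.float16 stands in for bfloat16, which NumPy lacks.

```python
import numpy as np

# Unit roundoffs of the storage formats considered (fp16 is used below as a
# stand-in for bfloat16, whose u is 2^-8, since NumPy has no bfloat16 dtype).
U_FP32, U_BF16 = 2.0**-24, 2.0**-8

def mixed_precision_lra(A, eps):
    """Truncated SVD whose singular vectors are stored in the lowest precision
    compatible with an overall error of order eps * ||A||_2.
    The rule sigma_i <= (eps / u_p) * sigma_1 is an assumed criterion in the
    spirit of the abstract, not the paper's exact formula."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > eps * s[0]))            # usual rank truncation
    factors = []
    for i in range(r):
        if s[i] > (eps / U_FP32) * s[0]:
            prec = np.float64                  # too large: keep full precision
        elif s[i] > (eps / U_BF16) * s[0]:
            prec = np.float32
        else:
            prec = np.float16                  # bfloat16 stand-in
        factors.append((U[:, i].astype(prec), float(s[i]), Vt[i].astype(prec)))
    return factors

def reconstruct(factors):
    """Rebuild the approximation in fp64 to check the error against eps."""
    return sum(s * np.outer(u.astype(np.float64), v.astype(np.float64))
               for u, s, v in factors)
```

On a matrix with rapidly decaying singular values, most vectors end up in the two lower precisions, which is where the storage savings quoted in the abstract come from.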
ISBN (print): 9783031695827; 9783031695834
Randomized projection methods have been shown to be very efficient at computing low-rank approximations (LRA) of large matrices. In this work, we investigate the design and development of such methods capable of exploiting recent mixed-precision accelerators such as GPUs equipped with tensor core units. We combine three new ideas to exploit mixed-precision arithmetic in randomized LRA. The first is to perform the matrix multiplication with mixed-precision fp16/fp32 tensor cores. The second is to use CholeskyQR orthonormalization, which is much faster on GPUs, while mitigating its numerical instability by using fp64 arithmetic. The third is to use a recently proposed iterative refinement method for LRA, applying it twice to improve the accuracy of the approximation. We implement the proposed approach on various GPU architectures and analyze its performance and accuracy. We compare with a standard randomized LRA entirely in fp32 arithmetic, which achieves an average accuracy of order 10⁻⁴. Our results show that our approach without refinement is up to 8x faster, with an average accuracy of order 10⁻², which may be acceptable for some applications. Otherwise, we show that using refinement significantly improves the accuracy to an average of order 10⁻⁵, while remaining up to 2.2x faster than the standard fp32 randomized LRA. This work illustrates the convergence of approximate computing techniques by combining low-rank approximations, randomization, mixed-precision arithmetic, and GPU acceleration.
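A CPU-side sketch of how these three ingredients fit together is given below; NumPy casts emulate the fp16/fp32 tensor-core GEMM and the fp64 CholeskyQR, and the refinement is written as a second randomized pass on the residual, which is one plausible reading of the abstract. The function names, the oversampling parameter p, and the seed are illustrative, not from the paper.

```python
import numpy as np

def rand_lra_mixed(A, r, p=8, refine=True, seed=0):
    """Randomized rank-r approximation with emulated mixed-precision kernels:
      * sketch Y = A @ Omega with fp16 operands accumulated in fp32,
      * orthonormalize Y with CholeskyQR, forming the Gram matrix in fp64,
      * optionally run a second pass on the residual as a refinement step.
    """
    rng = np.random.default_rng(seed)

    def sketch(M, k):
        Omega = rng.standard_normal((M.shape[1], k)).astype(np.float16)
        # emulate an fp16/fp32 tensor-core GEMM: round operands to fp16,
        # accumulate the product in fp32
        return M.astype(np.float16).astype(np.float32) @ Omega.astype(np.float32)

    def cholesky_qr(Y):
        Y64 = Y.astype(np.float64)
        R = np.linalg.cholesky(Y64.T @ Y64).T        # Y^T Y = R^T R in fp64
        return np.linalg.solve(R.T, Y64.T).T         # Q = Y R^{-1}

    def one_pass(M):
        Q = cholesky_qr(sketch(M, r + p))            # orthonormal range basis
        return Q @ (Q.T @ M)                         # M ~ Q Q^T M

    A1 = one_pass(A)
    return A1 if not refine else A1 + one_pass(A - A1)
```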
ISBN (print): 9780769548647; 9781467330275
We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement, which run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) the tree reduction technique, which exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the PowerPack framework and the Running Average Power Limit model, respectively.
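The first strategy is the classical mixed precision iterative refinement scheme; a minimal SciPy sketch is shown below, with the O(n³) factorization done in fp32 and the residual correction loop in fp64. The tolerance and iteration cap are illustrative, and this does not reproduce PLASMA's tile implementation.

```python
import numpy as np
import scipy.linalg as sla

def mixed_precision_solve(A, b, tol=1e-14, max_iter=10):
    """Mixed precision iterative refinement: LU factorization and triangular
    solves in fp32, residuals and solution updates in fp64, aiming at double
    precision accuracy at single precision speed."""
    A32 = A.astype(np.float32)
    lu, piv = sla.lu_factor(A32)                           # O(n^3) work in fp32
    x = sla.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                      # residual in fp64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = sla.lu_solve((lu, piv), r.astype(np.float32))  # cheap fp32 correction
        x += d.astype(np.float64)
    return x
```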
ISBN (print): 9781467379526
The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from numerical libraries and combining them with architectural features, we developed a generic approach to solve dense linear systems on various architectures including CPU and GPU. We have then extended our work with an example of a least squares solver based on semi-normal equations in mixed precision, which cannot be found in current libraries. For the automatically generated solvers, we report performance comparisons with state-of-the-art codes, and show that it is possible to obtain a generic code with a high-level interface (similar to MATLAB) which runs either on CPU or GPU without generating a significant overhead.
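For reference, the semi-normal-equations least squares method mentioned above amounts to the following: compute only the R factor of A (here in fp32), solve R^T R x = A^T b in fp64, and apply a refinement step to recover the accuracy lost in the low-precision factorization. The NumPy/SciPy sketch below follows the textbook (corrected) semi-normal equations method under these assumptions; it is not NT2-generated code.

```python
import numpy as np
from scipy.linalg import solve_triangular

def seminormal_lstsq(A, b, refine_steps=1):
    """Least squares via corrected semi-normal equations in mixed precision:
    R comes from a QR factorization in fp32; the triangular solves of
    R^T R x = A^T b and the refinement residuals are carried out in fp64."""
    R = np.linalg.qr(A.astype(np.float32), mode='r').astype(np.float64)

    def sne_solve(rhs):
        y = solve_triangular(R, rhs, trans='T')   # R^T y = rhs
        return solve_triangular(R, y)             # R x = y

    x = sne_solve(A.T @ b)
    for _ in range(refine_steps):                 # one correction step usually
        r = b - A @ x                             # recovers near-QR accuracy
        x += sne_solve(A.T @ r)
    return x
```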
This article describes a new high performance implementation of the QR-based Dynamically Weighted Halley Singular Value Decomposition (QDWH-SVD) solver on multicore architectures enhanced with multiple GPUs. The standard QDWH-SVD algorithm was introduced by Nakatsukasa and Higham (SIAM SISC, 2013) and combines three successive computational stages: (1) the polar decomposition of the original matrix using the QDWH algorithm, (2) the symmetric eigendecomposition of the resulting polar factor to obtain the singular values and the right singular vectors, and (3) a matrix-matrix multiplication to get the associated left singular vectors. A comprehensive test suite highlights the numerical robustness of the QDWH-SVD solver. Although it performs up to two times more flops when computing all singular vectors compared to the standard SVD solver, our new high performance implementation on a single GPU results in up to 4x improvements for asymptotic matrix sizes, compared to the equivalent routines from existing state-of-the-art open-source and commercial libraries. However, when only singular values are needed, QDWH-SVD is penalized by performing an order of magnitude more flops. The singular-value-only implementation of QDWH-SVD on a single GPU can still run up to 18% faster than the best existing equivalent routines.
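The three computational stages can be summarized by the short NumPy/SciPy sketch below, which uses scipy.linalg.polar as a stand-in for the QDWH polar iteration and makes no attempt to reproduce the multi-GPU implementation.

```python
import numpy as np
from scipy.linalg import polar, eigh

def polar_based_svd(A):
    """SVD via the three stages described in the abstract:
      (1) polar decomposition A = Up @ H (QDWH iteration in the paper),
      (2) symmetric eigendecomposition H = V @ diag(s) @ V.T,
      (3) matrix multiply U = Up @ V to recover the left singular vectors."""
    Up, H = polar(A)              # stage 1: polar factor and Hermitian factor
    s, V = eigh(H)                # stage 2: singular values + right vectors
    s, V = s[::-1], V[:, ::-1]    # reorder to the usual descending convention
    U = Up @ V                    # stage 3: left singular vectors
    return U, s, V.T              # A ~ U @ diag(s) @ V.T
```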