We propose fast numerical algorithms to improve the accuracy of singular vectors for a real matrix. Recently, Ogita and Aishima proposed an iterative refinement algorithm for singular value decomposition that is constructed with highly accurate matrix multiplications carried out six times per iteration. The algorithm applies to problems with no multiple or clustered singular values. In this study, we show that the same algorithm can be run with highly accurate matrix multiplications carried out five times. We also propose four algorithms constructed with highly accurate matrix multiplications: two with the multiplications carried out four times, and two with the multiplications carried out five times. These algorithms adopt the idea of a mixed-precision iterative refinement method for linear systems. Numerical experiments demonstrate speed-up and quadratic convergence of the proposed algorithms. The fastest algorithm is 1.7 and 1.4 times faster than the Ogita-Aishima algorithm per iteration on a CPU and a GPU, respectively.
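The abstract refers to mixed-precision iterative refinement for linear systems. As a rough illustration of that classical idea only (not the SVD refinement algorithms proposed in the paper), the following sketch solves A x = b in float32 and corrects the result with residuals computed in float64. The function name `refine_solve`, the test matrix, and the iteration count are assumptions made for this example.

```python
import numpy as np

def refine_solve(A, b, iters=5):
    """Solve A x = b using float32 solves and float64 residual corrections.

    Illustrative sketch: a practical code would factor A once in low
    precision and reuse the factors; np.linalg.solve refactors every call.
    """
    A32 = A.astype(np.float32)
    # Cheap initial solve in low precision.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in float64
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in float32
        x = x + d.astype(np.float64)                     # update in float64
    return x

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true

# Plain float32 solve versus the refined solve, both measured in float64.
x32 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
x_ref = refine_solve(A, b)
err32 = np.linalg.norm(x32 - x_true) / np.linalg.norm(x_true)
err_ref = np.linalg.norm(x_ref - x_true) / np.linalg.norm(x_true)
print(f"float32 only: {err32:.2e}, refined: {err_ref:.2e}")
```

For a well-conditioned matrix such as this one, the refined solution should approach float64 accuracy even though every solve runs in float32; only the residual needs the higher precision.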
ISBN (digital): 9798350352917
ISBN (print): 9798350352924; 9798350352917
We present the design and scalable implementation of an exascale climate emulator for addressing the escalating computational and storage requirements of high-resolution Earth System Model simulations. We utilize the spherical harmonic transform to stochastically model spatio-temporal variations in climate data. This provides tunable spatio-temporal resolution and significantly improves the fidelity and granularity of climate emulation, achieving an ultra-high spatial resolution of 0.034 degrees (approximately 3.5 km). Our emulator, trained on 318 billion hourly temperature data points from a 35-year and 31 billion daily data points from an 83-year global simulation ensemble, generates statistically consistent climate emulations. We extend linear solver software to mixed-precision arithmetic on GPUs, applying different precisions within a single solver to adapt to different correlation strengths. The PaRSEC runtime system supports efficient parallel matrix operations by optimizing the dynamic balance between computation, communication, and memory requirements. Our BLAS3-rich code is optimized for systems equipped with four different families and generations of GPUs, scaling well to achieve 0.976 EFlop/s on 9,025 nodes (36,100 AMD MI250X multi-chip module (MCM) GPUs) of Frontier (nearly full system), 0.739 EFlop/s on 1,936 nodes (7,744 Grace-Hopper Superchips (GH200)) of Alps, 0.243 EFlop/s on 1,024 nodes (4,096 A100 GPUs) of Leonardo, and 0.375 EFlop/s on 3,072 nodes (18,432 V100 GPUs) of Summit.
ISBN (print): 9781728186139
Tensor-specialized hardware for supporting low-precision arithmetic has become an inevitable trend due to the ever-increasing demand for computational capability and energy efficiency in intelligent applications. The main challenge in accelerating a tensor program on tensor-specialized hardware is achieving the best possible performance in reduced precision by fully utilizing the hardware's computational resources while keeping the precision loss under control. In this paper, we address this challenge by proposing QUANTENSOR, a new approach for accelerating general-purpose tensor programs by replacing their tensor computations with low-precision quantized tensor computations on NVIDIA Tensor Cores. The key novelty is a new residual-based precision refinement technique for controlling the quantization errors, allowing tradeoffs between performance and precision to be made. Evaluation with GEMM, deep neural networks, and linear algebra applications shows that QUANTENSOR can achieve remarkable performance improvements while significantly reducing the precision loss incurred, at acceptable overheads.
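The abstract does not spell out how the residual-based refinement works. As a generic illustration of the idea only (an assumption, not QUANTENSOR's actual algorithm), the sketch below quantizes two matrices to 8-bit integers, then quantizes the leftover quantization residuals and adds first-order correction products. All function names here are hypothetical, and NumPy integer matrix products stand in for Tensor Core operations.

```python
import numpy as np

def quantize(X, bits=8):
    """Symmetric per-tensor quantization: X is approximated by scale * Q."""
    qmax = 2 ** (bits - 1) - 1
    m = np.max(np.abs(X))
    scale = m / qmax if m > 0 else 1.0
    Q = np.clip(np.round(X / scale), -qmax, qmax).astype(np.int32)
    return Q, scale

def quantized_matmul(A, B):
    """Baseline: a single low-precision integer product, rescaled."""
    Qa, sa = quantize(A)
    Qb, sb = quantize(B)
    return (sa * sb) * (Qa @ Qb)

def refined_matmul(A, B):
    """Baseline product plus first-order corrections from quantized residuals."""
    Qa, sa = quantize(A)
    Qb, sb = quantize(B)
    # Residuals: what the first quantization failed to capture.
    Qra, sra = quantize(A - sa * Qa)
    Qrb, srb = quantize(B - sb * Qb)
    C = (sa * sb) * (Qa @ Qb)
    C = C + (sa * srb) * (Qa @ Qrb)   # A_q times residual of B
    C = C + (sra * sb) * (Qra @ Qb)   # residual of A times B_q
    return C                          # the residual-by-residual term is dropped

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
C_exact = A @ B
err_q = np.linalg.norm(quantized_matmul(A, B) - C_exact) / np.linalg.norm(C_exact)
err_ref = np.linalg.norm(refined_matmul(A, B) - C_exact) / np.linalg.norm(C_exact)
print(f"quantized: {err_q:.2e}, refined: {err_ref:.2e}")
```

The refined version costs three integer products instead of one, which is the kind of performance-versus-precision tradeoff the abstract describes: each extra correction term buys accuracy at the price of additional low-precision work.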