ISBN: 9781450383356 (print)
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides performance comparable to modern state-of-the-art BLAS libraries on widely used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and compute-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an algorithm-based fault tolerance mechanism for compute-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from careful assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance: it is faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for the routines we benchmarked spanning all three levels of BLAS, even under hundreds of errors injected per minute.
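The algorithm-based fault tolerance idea mentioned for Level-3 routines can be illustrated with the classic checksum encoding for matrix multiplication. The following is a minimal sketch in plain Python, not FT-BLAS's assembly kernels: the function names and the `fault` parameter (used only to inject a simulated soft error) are hypothetical, and integer matrices are assumed so that checksums compare exactly.

```python
def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def abft_matmul(a, b, fault=None):
    """Multiply a @ b with checksum verification and single-error correction.

    `fault` is a hypothetical (row, col, delta) triple that corrupts one
    entry of the result to simulate a soft error.
    Returns (c, corrected_position_or_None).
    """
    m, n = len(a), len(b[0])
    # Encode: append a column-checksum row to A and a row-checksum column to B.
    a_c = a + [[sum(col) for col in zip(*a)]]
    b_r = [row + [sum(row)] for row in b]
    # The product of the encoded operands carries both checksums of C.
    full = matmul(a_c, b_r)  # shape (m + 1) x (n + 1)

    if fault is not None:
        i, j, delta = fault
        full[i][j] += delta

    # Verify: each data row/column must sum to its checksum entry.
    bad_rows = [i for i in range(m) if sum(full[i][:n]) != full[i][n]]
    bad_cols = [j for j in range(n)
                if sum(full[i][j] for i in range(m)) != full[m][j]]

    fixed = None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # A single corrupted entry sits at the intersection; rebuild it
        # from the row checksum and the other entries of that row.
        i, j = bad_rows[0], bad_cols[0]
        full[i][j] = full[i][n] - sum(full[i][k] for k in range(n) if k != j)
        fixed = (i, j)

    c = [row[:n] for row in full[:m]]
    return c, fixed
```

For example, `abft_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], fault=(0, 1, 100))` locates the corrupted entry at row 0, column 1 and restores the correct product. A floating-point implementation would compare checksums against a tolerance rather than for equality.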
ISBN: 9783030576752; 9783030576745 (print)
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing (HPC). The core idea of coded computing is the same as that of algorithm-based fault tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplication, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires approximately 50% less redundancy than replication, while the overhead in execution time is only about 5-10%.
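The MatDot construction referred to above can be sketched in a few lines. This is an illustrative single-process sketch, not the 3D SUMMA integration the paper describes: A is split column-wise and B row-wise into m blocks, each simulated worker multiplies one evaluation of the two encoding polynomials, and the product AB is read off as the coefficient of x^(m-1) from any 2m-1 surviving results. All function names, the choice of integer evaluation points, and the `failed` parameter are assumptions made for the example; exact rational arithmetic is used so no rounding is involved.

```python
from fractions import Fraction


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_cols(a, m):
    w = len(a[0]) // m
    return [[row[i * w:(i + 1) * w] for row in a] for i in range(m)]


def split_rows(b, m):
    h = len(b) // m
    return [b[i * h:(i + 1) * h] for i in range(m)]


def encode(blocks, powers, x):
    """Entrywise evaluation of sum_i blocks[i] * x**powers[i]."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(blk[r][c] * x ** p for blk, p in zip(blocks, powers))
             for c in range(cols)] for r in range(rows)]


def lagrange_weights(xs, degree):
    """Weights w_i such that the x**degree coefficient of the polynomial
    interpolating (xs[i], y_i) equals sum_i w_i * y_i."""
    weights = []
    for i, xi in enumerate(xs):
        poly = [Fraction(1)]  # coefficients of the i-th Lagrange basis
        for j, xj in enumerate(xs):
            if j == i:
                continue
            denom = Fraction(xi - xj)
            new = [Fraction(0)] * (len(poly) + 1)
            for k, ck in enumerate(poly):
                new[k] += ck * (-xj) / denom
                new[k + 1] += ck / denom
            poly = new
        weights.append(poly[degree])
    return weights


def matdot_multiply(a, b, m, points, failed=()):
    """Compute a @ b from coded worker results, skipping failed workers."""
    a_blocks, b_blocks = split_cols(a, m), split_rows(b, m)
    results = {}
    for x in points:  # one simulated worker per evaluation point
        if x in failed:
            continue
        p_a = encode(a_blocks, list(range(m)), x)
        p_b = encode(b_blocks, [m - 1 - j for j in range(m)], x)
        results[x] = matmul(p_a, p_b)
    need = 2 * m - 1  # the product polynomial has degree 2m - 2
    if len(results) < need:
        raise RuntimeError("too many failed workers to decode")
    xs = sorted(results)[:need]
    w = lagrange_weights(xs, m - 1)
    return [[int(sum(wi * results[x][r][c] for wi, x in zip(w, xs)))
             for c in range(len(b[0]))] for r in range(len(a))]
```

With m = 2 and five evaluation points, any three results suffice to decode, so the sketch tolerates any two failed workers, e.g. `matdot_multiply(a, b, 2, [1, 2, 3, 4, 5], failed=(2, 4))`.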
ISBN: 9781450351140 (digital)
ISBN: 9781450351140 (print)
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.
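The difference between offline and online checking can be sketched with a two-layer FFT decomposition. The sketch below is an assumption-laden illustration, not the paper's scheme or FFTW code: it uses the simplest possible invariant (the outputs of a DFT sum to n times the first input), a naive O(n^2) `dft` for the sub-transforms, and a hypothetical `fault_at` parameter to inject an error. Each first-layer sub-FFT is verified as soon as it finishes, so a corrupted run stops before the second layer is computed; the second layer itself is left unchecked here.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform, used for the small sub-FFTs."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n))
            for k in range(n)]


def online_checked_fft(x, n1, n2, fault_at=None, tol=1e-9):
    """Length n1*n2 FFT with a check after every first-layer sub-FFT.

    Returns (X, None) on success, or (None, r) where r is the index of
    the first sub-FFT whose checksum failed.
    `fault_at` (hypothetical) corrupts the output of sub-FFT number r.
    """
    n = n1 * n2
    inner = []
    for r in range(n2):
        sub_in = x[r::n2]            # strided input of sub-FFT r
        sub_out = dft(sub_in)
        if fault_at == r:
            sub_out[0] += 1.0        # simulated soft error
        # Invariant of an exact DFT: sum of outputs == n1 * first input.
        if abs(sum(sub_out) - n1 * sub_in[0]) > tol:
            return None, r           # terminate early, before layer two
        inner.append(sub_out)

    # Second layer: apply twiddle factors and combine the sub-FFT results.
    out = [0j] * n
    for k in range(n):
        out[k] = sum(inner[r][k % n1] * cmath.exp(-2j * cmath.pi * r * k / n)
                     for r in range(n2))
    return out, None
```

An offline scheme would apply a single check to the final output instead, which means the whole transform has to finish before a fault can be noticed.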
ISBN: 9781538683842 (print)
Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them only maintain checksums in one dimension; (2) their checking scheme is inefficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) their checksum calculation, based on a special type of matrix multiplication, is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach providing stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, we make the checking scheme more efficient by prioritizing checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x through better utilization of GPUs.
This paper continues to develop a fault-tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J. Electron. Suppl., 54 (2013), pp. C394-C411]. This approach to fault tolerance is novel for two reasons: first, the combination technique adds an additional level of parallelism, and second, it provides algorithm-based fault tolerance so that solutions can still be recovered if failures occur during computation. Previous work indicates how the combination technique may be adapted for a low number of faults. In this paper we develop a generalization of the combination technique in which arbitrary collections of coarse approximations may be combined to obtain an accurate approximation. A general fault-tolerant combination technique for large numbers of faults is a natural consequence of this work. Using a renewal model for the time between faults on each node of a high-performance computer, we also provide bounds on the expected error for interpolation with this algorithm in the presence of faults. Numerical experiments solving the scalar advection PDE demonstrate that the algorithm is resilient to faults in a real application. It is observed that the time to solution is not significantly affected by the presence of (simulated) faults. Additionally, the expected error increases with the number of faults but is relatively small even for high fault rates. A comparison with traditional checkpoint-restart methods applied to the combination technique shows that our approach is highly scalable with respect to the number of faults.
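One way to picture how a combination can be recomputed after a fault is the standard inclusion-exclusion formula for combination coefficients on a downward-closed set of grid levels. The sketch below assumes that formula and a simple recovery rule (discard the failed grid and every finer grid above it); it is not taken from the paper, whose generalization may choose coefficients differently, and the function names are hypothetical.

```python
from itertools import product


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed index set.

    For each level vector l, c_l = sum over z in {0,1}^d of
    (-1)^|z| * [l + z is in the set]. Zero coefficients are omitted.
    """
    idx = set(index_set)
    d = len(next(iter(idx)))
    coeffs = {}
    for level in idx:
        c = 0
        for z in product((0, 1), repeat=d):
            shifted = tuple(a + b for a, b in zip(level, z))
            if shifted in idx:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[level] = c
    return coeffs


def drop_failed(index_set, failed):
    """Remove failed grids and all grids at or above them componentwise,
    so that the remaining set stays downward closed."""
    return {level for level in index_set
            if not any(all(a >= b for a, b in zip(level, f)) for f in failed)}


# Classical 2-D combination at level 3: grids (i, j) with i + j <= 3.
full = {(i, j) for i in range(4) for j in range(4) if i + j <= 3}
before = combination_coefficients(full)

# Suppose the solution on grid (1, 2) is lost to a fault.
after = combination_coefficients(drop_failed(full, [(1, 2)]))
```

In this example the coefficients before the fault are +1 on the diagonal i + j = 3 and -1 on i + j = 2. After losing grid (1, 2), the recomputed combination uses only grids that were already being computed, at coarser levels (0, 1) and (2, 0), without restarting the lost one.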
To enhance the security and reliability of widely used stream ciphers, a novel mesh check-sum ABFT scheme for stream ciphers is developed. By utilising the ready-made arithmetic unit in stream ciphers, single and multiple errors can be detected and corrected cheaply. To meet different requirements in practical applications, a 4-D mesh check-sum ABFT scheme is proposed that can be applied to RC4 or other stream ciphers. The 2-D mesh check-sum ABFT scheme is able to detect and correct a single error with high efficiency. The 4-D mesh check-sum ABFT scheme is capable of correcting up to three errors located randomly in an N-element matrix with acceptable computation and bandwidth overhead. The workload can be remarkably reduced when most communications are error-free. Our scheme also provides a one-to-one mapping between index and check-sum, so that errors can be located and recovered with easier logic and simpler operations.