ISBN (digital): 9781665461863
ISBN (print): 9781665461863
General Matrix Multiplication (GEMM) is instrumental in a myriad of scientific, high-performance computing, and machine learning applications such as computer vision, recommendation models, and weather forecasting. It is vital to make such computations fail-safe in safety-critical and high-precision applications. Companies such as Meta and Google have recently reported sporadic silent errors in GEMM computations traced to hardware sources. Silent errors are hard to detect and require specialized detection solutions. Hardware redundancy approaches such as double or triple modular redundancy effectively detect or correct such errors, but they carry a large area and power overhead. Algorithm-based fault tolerance (ABFT) has been shown to offer an effective alternative at a far lower overhead. Modern CPUs feature advanced vector extensions (AVX) capable of executing SIMD instructions. This paper describes a new ABFT approach designed to take advantage of the AVX feature. Our core algorithm relies on the classical tile-based outer-product approach but enhances the standard checksum calculation using a tile vector. The implementation parameters are fine-tuned to fit the available number of AVX registers. Our results indicate that we can achieve 100% error detection in GEMM at an overhead of just 0.21% for the integer data type. Unfortunately, due to rounding errors, the addition of floating-point numbers is not associative, which creates difficulties for ABFT. To mitigate the impact of rounding errors, we introduce the concept of relative error checking and perform error analysis for various error classes to show that the proposed approach completely eliminates false positive errors.
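As a rough illustration of the checksum relations this kind of approach builds on, the sketch below verifies a GEMM result against column and row checksums and uses a relative (rather than absolute) criterion so that floating-point rounding does not raise false positives. The tile-vector and AVX-register details of the paper are not modeled, and the function names and the 1e-10 tolerance are illustrative assumptions.

```python
import numpy as np

def abft_checksums(A, B):
    """Pre-compute ABFT reference checksums for C = A @ B.

    colsum(C) must equal colsum(A) @ B and rowsum(C) must equal A @ rowsum(B),
    so both references can be formed from A and B alone (O(n^2) extra work).
    """
    return A.sum(axis=0) @ B, A @ B.sum(axis=1)

def abft_verify(C, col_ref, row_ref, rel_tol=1e-10):
    """Relative error checking: compare C's checksums against the references,
    scaling the tolerance by the checksum magnitude so rounding does not
    produce false positives (the tolerance value is an assumed placeholder)."""
    col_ok = np.all(np.abs(C.sum(axis=0) - col_ref) <= rel_tol * (np.abs(col_ref) + 1.0))
    row_ok = np.all(np.abs(C.sum(axis=1) - row_ref) <= rel_tol * (np.abs(row_ref) + 1.0))
    return col_ok and row_ok

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 32))
col_ref, row_ref = abft_checksums(A, B)

C = A @ B
print(abft_verify(C, col_ref, row_ref))   # True: clean result passes

C[3, 5] += 1e-3                           # inject a small silent error
print(abft_verify(C, col_ref, row_ref))   # False: corruption detected
```

The reference checksums add only O(n^2) work on top of the O(n^3) multiplication, which is why ABFT overheads for GEMM can stay at the sub-percent level reported above.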
ISBN (print): 9781450383356
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly but also provides performance comparable to modern state-of-the-art BLAS libraries on widely used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and compute-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an algorithm-based fault tolerance mechanism for compute-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from careful assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance, running faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for routines spanning all three levels of BLAS that we benchmarked, even under hundreds of errors injected per minute.
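A minimal sketch of the duplication side of this hybrid strategy, shown for a Level-1 DOT routine (Level-3 routines would instead use checksum-based ABFT as in the previous sketch). FT-BLAS fuses the duplicated instructions into hand-written assembly kernels; the illustrative `ddot_duplicated` helper below is an assumption of this write-up and simply evaluates the dot product twice and compares.

```python
import numpy as np

def ddot_duplicated(x, y):
    """Duplicated-computation check for a memory-bound Level-1 routine (DOT).

    The two evaluations run on the same operands, so they agree bit-for-bit
    in a clean run; a transient soft error in one of them makes the results
    differ and is therefore detected (rounding cannot cause a mismatch).
    """
    r1 = float(np.dot(x, y))    # primary evaluation
    r2 = float(np.dot(x, y))    # duplicated evaluation
    if r1 != r2:
        raise RuntimeError("soft error detected in DOT, recomputation needed")
    return r1

x = np.arange(1024, dtype=np.float64)
y = np.ones(1024)
print(ddot_duplicated(x, y))    # 523776.0
```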
As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face scalability challenges in the era of peta- and exa-scale systems because of their enormous numbers of processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of the interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a separate scheme for handling node failure. We evaluate the two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.
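As an assumption-level illustration of block-checksum recovery after a node failure, the sketch below appends one checksum block per group of blocks so that a single lost block can be rebuilt by subtraction. The process grid, communication, and the actual FT-PBLAS interfaces are not modeled, and the helper names are hypothetical.

```python
import numpy as np

def add_block_checksum(blocks):
    """blocks: list of equally shaped sub-matrices held by different nodes.
    Returns the list with one extra checksum block equal to their sum."""
    return blocks + [sum(blocks)]

def recover_block(blocks_with_checksum, lost):
    """Rebuild the block at index `lost` from the surviving blocks and the
    checksum block (which is stored last)."""
    checksum = blocks_with_checksum[-1]
    survivors = [b for i, b in enumerate(blocks_with_checksum[:-1]) if i != lost]
    return checksum - sum(survivors)

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((4, 4)) for _ in range(3)]
stored = add_block_checksum(blocks)          # distributed with redundancy
rebuilt = recover_block(stored, lost=1)      # node holding block 1 failed
print(np.allclose(rebuilt, blocks[1]))       # True
```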
ISBN (print): 9783030483401, 9783030483395
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to "algorithm-based fault tolerance" checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
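A toy version of a physics-based checksum: a conservative finite-volume update of 1-D periodic advection preserves the total mass exactly (up to rounding), so comparing the stored state against the recorded total detects silent corruption. The grid size, CFL number, and tolerance below are illustrative choices of this sketch, not values from the paper, and the subdomain/flux bookkeeping is not modeled.

```python
import numpy as np

n = 256
dx, dt, a = 1.0 / n, 0.5 / n, 1.0                # CFL = 0.5, advection speed a
x = np.linspace(0.0, 1.0, n, endpoint=False)
u = np.exp(-200.0 * (x - 0.5) ** 2)              # initial profile

mass_checksum = u.sum() * dx                     # conserved quantity, recorded once
for step in range(200):
    flux = a * u                                 # upwind flux, periodic boundaries
    u = u - (dt / dx) * (flux - np.roll(flux, 1))
    # The conservative update moves mass between cells but never creates or
    # destroys it, so the state must still match the recorded total; a silent
    # bit flip in u would violate this relation.
    if abs(u.sum() * dx - mass_checksum) > 1e-10 * abs(mass_checksum):
        print(f"silent data corruption suspected at step {step}")
        break
else:
    print("all steps passed the conservation check")
```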
ISBN (print): 9783030576752, 9783030576745
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing (HPC). The core idea of coded computing is the same as that of algorithm-based fault tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplication, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires approximately 50% less redundancy than replication, while the overhead in execution time is only about 5-10%.
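The sketch below shows the MatDot encoding and decoding step in isolation for m = 2 blocks and 5 workers (any 2m - 1 = 3 outputs suffice, so two failures are tolerated), without the SUMMA data layout or communication schedule; the helper names and the evaluation points are illustrative.

```python
import numpy as np

def matdot_encode(A, B, xs, m=2):
    """Encode A (column blocks) and B (row blocks) as matrix polynomials and
    return one coded product per worker/evaluation point."""
    Ab = np.hsplit(A, m)                                     # A = [A_0 A_1]
    Bb = np.vsplit(B, m)                                     # B = [B_0; B_1]
    pA = lambda x: sum(Ab[i] * x**i for i in range(m))       # sum_i A_i x^i
    pB = lambda x: sum(Bb[j] * x**(m - 1 - j) for j in range(m))
    return [pA(x) @ pB(x) for x in xs]

def matdot_decode(outputs, xs, m=2):
    """Interpolate the degree-2(m-1) matrix polynomial from any 2m-1 worker
    outputs and return the coefficient of x^(m-1), which equals A @ B."""
    k = 2 * m - 1
    V = np.vander(np.asarray(xs[:k], dtype=float), k, increasing=True)
    stacked = np.stack([w.ravel() for w in outputs[:k]])     # (k, rows*cols)
    coeffs = np.linalg.solve(V, stacked)
    return coeffs[m - 1].reshape(outputs[0].shape)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 8))
B = rng.standard_normal((8, 4))
xs = [1.0, 2.0, 3.0, 4.0, 5.0]                               # 5 workers
work = matdot_encode(A, B, xs)
# Pretend the workers at x = 2 and x = 4 failed: decode from the other three.
survivors = [0, 2, 4]
C = matdot_decode([work[i] for i in survivors], [xs[i] for i in survivors])
print(np.allclose(C, A @ B))                                 # True
```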
ISBN (print): 9781538666142
With the scaling up of high-performance computers, resilience has become a big challenge. Among various kinds of software-based fault-tolerant approaches, algorithm-based fault tolerance (ABFT) has some attractive characteristics in the era of exa-scale systems, such as high efficiency and light weight. In particular, considering that many engineering and scientific applications rely on a set of fundamental algorithms, it is possible to provide algorithm-based fault-tolerant mechanisms at a low level and make them application-independent. Previous fault-tolerant mechanisms for matrix computation use row and column checksums, which cannot be directly used in large-scale parallel systems. This paper proposes an algorithm-based fault-tolerant approach for matrix multiplication on large-scale parallel systems. The mechanism uses block checksums, which not only meet the requirements of matrix computations on large-scale parallel systems but also reduce the overhead of fault tolerance compared to traditional schemes based on row and column checksums. In addition, this paper gives a method for choosing the block size to achieve a balance between accuracy and efficiency. The complexity analysis and examples demonstrate the effectiveness and feasibility of our approach.
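A simplified rendering of block-level checksum verification for C = A x B: each block of C is checked independently, so an error is localized to one block rather than to a whole row or column of the global matrix. The distribution across nodes and the block-size selection analysis are not modeled, and the tolerance is an assumed placeholder.

```python
import numpy as np

def check_blocks(A, B, C, nb, rel_tol=1e-10):
    """Return the (block-row, block-col) indices of blocks of C whose column
    checksum disagrees with the reference formed from A's block row and B."""
    n = A.shape[0]
    bad = []
    for I in range(0, n, nb):
        for J in range(0, n, nb):
            # colsum(C[I:I+nb, J:J+nb]) must equal colsum(A[I:I+nb, :]) @ B[:, J:J+nb]
            ref = A[I:I + nb, :].sum(axis=0) @ B[:, J:J + nb]
            got = C[I:I + nb, J:J + nb].sum(axis=0)
            if np.any(np.abs(got - ref) > rel_tol * (np.abs(ref) + 1.0)):
                bad.append((I // nb, J // nb))
    return bad

rng = np.random.default_rng(3)
n, nb = 64, 16
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C = A @ B
C[40, 9] += 1e-2                    # inject an error into block (2, 0)
print(check_blocks(A, B, C, nb))    # [(2, 0)]
```

A larger block size lowers the checking overhead but localizes errors more coarsely, which is the accuracy/efficiency balance discussed above.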
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in the faulty and the healthy subdomains must be coordinated carefully; in particular, both under- and over-solving must be avoided, since both waste computational resources and therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The recoupling process is steered by the local contributions of the error estimator before the fault. We report failure scenarios when solving up to 6.9 x 10^11 unknowns on more than 245,766 parallel processes on a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.
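A toy serial rendering of the recoupling idea, with a plain local residual norm standing in for the hierarchical error estimator described above: after a fault wipes part of the iterate, only the faulty window is re-solved, and the local solve stops once its residual is back at the pre-fault level, so that neither under- nor over-solving occurs. The 1-D Poisson problem, the Jacobi smoother, and all sizes are illustrative assumptions, not the paper's multigrid setting.

```python
import numpy as np

n = 128
h2 = 1.0 / (n + 1) ** 2
f = np.ones(n)                                   # -u'' = 1, u(0) = u(1) = 0

def residual(u):
    """r_i = h^2 f_i + u_{i-1} + u_{i+1} - 2 u_i (zero Dirichlet boundaries)."""
    r = f * h2
    r = r - 2.0 * u
    r[1:] += u[:-1]
    r[:-1] += u[1:]
    return r

u = np.zeros(n)
for _ in range(2000):                            # global Jacobi iterations
    u = u + 0.5 * residual(u)

lo, hi = 40, 80                                  # faulty subdomain
pre_fault_local = np.linalg.norm(residual(u)[lo:hi])   # estimator value before fault
u[lo:hi] = 0.0                                   # node failure wipes the window

sweeps = 0
while np.linalg.norm(residual(u)[lo:hi]) > pre_fault_local:
    u[lo:hi] += 0.5 * residual(u)[lo:hi]         # local Jacobi sweeps only
    sweeps += 1
print(f"recovered faulty window in {sweeps} local sweeps")
```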
We consider the problem of computing a matrix-vector product Ax using a set of P parallel or distributed processing nodes prone to "straggling," i.e., unpredictable delays. Every processing node can access only a fraction (s/N) of the N-length vector x, and all processing nodes compute an equal number of dot products. We propose a novel error-correcting code, called "Short-Dot," that introduces redundant, shorter dot products such that only a subset of the nodes' outputs is sufficient to compute Ax. To address the problem of straggling in computing matrix-vector products, prior work uses replication or erasure coding to encode parts of the matrix A, but the length of the dot products computed at each processing node is still N. The key novelty in our work is that instead of computing the long dot products required in the original matrix-vector product, we construct a larger number of redundant, short dot products that require only a fraction of x to be accessed during the computation. Short-Dot is thus useful in communication-constrained scenarios, as it allows only a fraction of x to be accessed by each processing node. Further, we show that in the particular regime where the number of available processing nodes is greater than the total number of dot products, Short-Dot has, in a scaling sense, lower expected computation time under straggling (modeled with exponential delays) than existing strategies such as replication. We also derive fundamental limits on the trade-off between the length of the dot products and the recovery threshold, i.e., the required number of processing nodes, showing that Short-Dot is near-optimal.
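Short-Dot's sparse encoding is more involved than can be shown briefly, so the sketch below only illustrates the erasure-coded baseline the abstract contrasts against: one parity row lets any M of M + 1 workers recover Ax, but every worker still computes a full N-length dot product, which is exactly the cost Short-Dot reduces by making the coded rows shorter.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 4, 10
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Coded task matrix: the M rows of A plus one parity row (their sum).
F = np.vstack([A, A.sum(axis=0, keepdims=True)])
outputs = F @ x                                    # one full-length dot product per worker

straggler = 2                                      # worker 2 never returns
# Parity relation: outputs[M] equals the sum of the M data-row dot products,
# so the missing result is recovered from the workers that did respond.
missing = outputs[M] - sum(outputs[i] for i in range(M) if i != straggler)
result = outputs[:M].copy()
result[straggler] = missing
print(np.allclose(result, A @ x))                  # True
```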
ISBN (print): 9780769545905
As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming pronounced. In this paper, we present an algorithm-based fault tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugate Gradient (CG) method, taking advantage of numerical defect correction. This method is "pay as you go": there is practically a runtime overhead only if errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance, and no occurrence of silent data corruption.
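A sketch in the spirit of this "pay as you go" idea (the checking interval, tolerance, and test problem are illustrative assumptions): every few CG iterations, the recursively updated residual is compared against the true residual b - A x, and on a mismatch the solver restarts from the current iterate with the corrected residual, so significant extra work is done only when an error has actually occurred, unlike TMR which triplicates everything.

```python
import numpy as np

def cg_abft(A, b, tol=1e-10, check=10, max_iter=2000):
    """Conjugate Gradient with a periodic self-check and a defect-corrected
    restart when the recursive residual disagrees with the true residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if (k + 1) % check == 0:
            true_r = b - A @ x                       # periodic self-check
            if np.linalg.norm(true_r - r) > 1e-6 * np.linalg.norm(b):
                # A corrupted x or r would trip this branch: restart from the
                # current iterate using the true residual (defect correction).
                r, p, rs = true_r, true_r.copy(), true_r @ true_r
                continue
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# SPD test problem: 1-D Laplacian.
n = 200
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iters = cg_abft(A, b)
print(iters, np.linalg.norm(A @ x - b))
```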
ISBN (print): 9781538683842
Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking schemes are not efficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) the checksum calculation, based on a special type of matrix multiplication, is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach that provides stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, our checking scheme is more efficient because it prioritizes checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x by better utilizing GPUs.
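A sketch of the two-dimensional checksum layout only (the checksum-preserving decomposition updates, GPU offloading, and PCIe protection discussed above are not modeled): checksums in both dimensions let a corrupted data element be localized at the intersection of the failing row and column checks, which is the "full matrix protection" a one-dimensional scheme cannot give.

```python
import numpy as np

def encode_2d(M):
    """Append a checksum column (row sums) and a checksum row (column sums)."""
    with_col = np.hstack([M, M.sum(axis=1, keepdims=True)])
    return np.vstack([with_col, with_col.sum(axis=0, keepdims=True)])

def verify_2d(E, rel_tol=1e-10):
    """Return the indices of row and column checks that fail; an error in a
    data element shows up in both lists, so their intersection localizes it.
    The relative tolerance is an assumed placeholder."""
    data, rsum, csum = E[:-1, :-1], E[:-1, -1], E[-1, :-1]
    bad_rows = np.flatnonzero(np.abs(data.sum(axis=1) - rsum) > rel_tol * (np.abs(rsum) + 1.0))
    bad_cols = np.flatnonzero(np.abs(data.sum(axis=0) - csum) > rel_tol * (np.abs(csum) + 1.0))
    return bad_rows, bad_cols

rng = np.random.default_rng(5)
E = encode_2d(rng.standard_normal((8, 8)))
print(verify_2d(E))          # (array([]), array([])): clean
E[3, 5] += 1e-3              # soft error in a data element
print(verify_2d(E))          # (array([3]), array([5])): located at (3, 5)
```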