The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches are based on worst-case assumptions about the failure of system components, so the evaluation results often do not reflect the actual fault-tolerance capabilities of the system. In this paper, we develop a probabilistic approach to evaluate the fault-detecting and fault-locating capabilities of on-line checks in a system. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model [1]. Based on these probabilistic matrices, estimates for the fault-tolerance capabilities of various systems are derived analytically.
Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. In this short note, we present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the system designer can control the type of checks used in the final system. We also present some new bounds on the number of checks needed in ABFT system design.
Designing checks to detect or locate errors in the data plays an important role in the design of fault-tolerant systems. Recently, the problem of synthesizing the data-check (DC) relationship has received a lot of attention in the context of a natural paradigm for concurrent error detection/location known as algorithm-based fault tolerance (ABFT). Banerjee and Abraham have shown that an ABFT scheme can be modeled as a tripartite graph consisting of processors (P), data (D), and checks (C). Any technique for designing ABFT systems requires a procedure for synthesizing a DC relationship that not only has a low overhead but also has all the properties required by the designer. The main contribution of this work is a simple and novel algorithm called RANDGEN for generating DC graphs. The synthesis approach is very fast and can be fully parallelized. By simply varying its parameters, RANDGEN can produce DC graphs with a wide spectrum of properties, many of which have been considered very important in recent ABFT designs. For the first time, RANDGEN produces s-error-detectable DC graphs with asymptotically the least number of checks. RANDGEN can also produce s-error-locatable DC graphs using only a small number of checks. This is the first general procedure for producing error-locatable graphs for any value of s. Another important outstanding problem in DC graph design is providing fast and practical methods for actually locating the errors in the data from the output pattern at the checks. We show that RANDGEN can be used to design DC graphs that permit easy diagnosis, again with a small number of checks. It has been pointed out previously that "uniform" checks may simplify the design of the ABFT system. We show how RANDGEN can be modified very simply to produce uniform s-error-detectable/locatable DC graphs. Finally, we show how one can generalize these results to synthesize strictly s-error-detectable/locatable DC graphs which ca
Algorithm-based fault tolerance (ABFT) is a technique to provide system-level error detection and correction on array processors as well as multiprocessors at a low cost. Since the early 1980s the technique has been extensively applied to several linear algebraic algorithms, e.g., matrix multiplication, Gaussian elimination, QR factorization, and singular value decomposition. However, an important class of problems in numerical linear algebra has been overlooked: the iterative solution of linear algebraic equations arising from the finite difference or finite element discretization of a partial differential equation. The only exception is the recent application of algorithm-based error detection (ABED) encodings to the successive overrelaxation algorithm for Laplace's equation [11]. In this paper, ABED is applied to a multigrid algorithm for the iterative solution of a Poisson equation in two dimensions. Invariants are created to implement checking in the relaxation, the restriction, and the interpolation operators. Roundoff errors accumulated within the operators perturb the invariants and often lead to false alarms. This is addressed by deriving expressions for the roundoff errors in the algebraic processes in the operators and correcting the invariants accordingly. The ABED-encoded multigrid algorithm is shown to be insensitive to the size and the range of the input data, besides providing excellent error coverage at a low latency for floating-point, integer, and memory errors.
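To make the idea of a checksum invariant with a roundoff tolerance concrete, here is a minimal sketch for a much simpler setting than the paper's: one weighted-Jacobi sweep on the 1D Poisson matrix tridiag(-1, 2, -1). The function names, the 1D problem, the damping factor, and the tolerance rule are all assumptions made for illustration; they are not the invariants or the roundoff bounds derived in the paper.

```python
# Hypothetical illustration: a checksum invariant for one weighted-Jacobi
# sweep on A = tridiag(-1, 2, -1). Not the paper's ABED encoding.

def jacobi_sweep(u, f, omega=2.0 / 3.0):
    """One relaxation sweep: u_new = u + omega * D^-1 * (f - A u), D = 2I."""
    n = len(u)
    new = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        residual = f[i] - (2.0 * u[i] - left - right)
        new.append(u[i] + omega * residual / 2.0)
    return new

def predicted_checksum(u, f, omega=2.0 / 3.0):
    """Predict sum(jacobi_sweep(u, f)) without running the sweep (len(u) >= 2).

    The column sums of A are 1 for the first and last column and 0 elsewhere,
    so the sum of the entries of A u reduces to u[0] + u[-1].
    """
    return sum(u) + omega / 2.0 * (sum(f) - (u[0] + u[-1]))

def check(u_new, u, f, rel_tol=1e-9):
    """Compare the actual and predicted checksums within a tolerance.

    The tolerance is an assumed, crude bound scaled by the input magnitude;
    it stands in for a properly derived roundoff expression.
    """
    scale = sum(abs(x) for x in u) + sum(abs(x) for x in f) + 1.0
    return abs(sum(u_new) - predicted_checksum(u, f)) <= rel_tol * scale

u = [0.1 * i for i in range(8)]
f = [1.0] * 8
u_new = jacobi_sweep(u, f)
ok_before = check(u_new, u, f)   # fault-free sweep passes the check
u_new[3] += 1e-3                 # simulate a corrupted result element
ok_after = check(u_new, u, f)    # the corrupted sweep fails the check
```

A tolerance that is too tight produces false alarms on fault-free runs, while one that is too loose lets small errors through. Choosing it well is the problem that deriving explicit roundoff expressions addresses.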
The weighted checksum scheme has been proposed as a low-cost fault-tolerant procedure for parallel matrix computations. To guarantee multiple error detection and correction, the chosen weight vectors must satisfy some very specific linear independence properties. However, previous weight-generating methods that fulfill the independence criteria suffer from numerical overflow. We present a new scheme that generates weight vectors via Chebyshev polynomials to meet the independence requirements and to avoid the difficulties with overflow.
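As a rough illustration of how two independent weight vectors allow an error to be located and corrected, the sketch below uses the weights (1, 1, ..., 1) and (1, 2, ..., n) on an integer vector. These weights and the function names are assumptions chosen for simplicity. This is not the Chebyshev-based construction of the paper, and exact integer arithmetic is used so that overflow and rounding do not arise here.

```python
# Hypothetical illustration of a two-checksum weighted scheme that corrects
# one corrupted element. The weights 1..n are an assumed, simple choice.

def encode(data):
    """Return (plain checksum, weighted checksum) using weights 1, 2, ..., n."""
    s1 = sum(data)
    s2 = sum((i + 1) * x for i, x in enumerate(data))
    return s1, s2

def correct_single_error(data, s1, s2):
    """Locate and repair at most one corrupted element in place.

    Returns the index that was repaired, or None if both checksums match.
    If element j is off by e, the syndromes are d1 = e and d2 = (j + 1) * e,
    so their ratio reveals the position.
    """
    d1 = sum(data) - s1
    d2 = sum((i + 1) * x for i, x in enumerate(data)) - s2
    if d1 == 0 and d2 == 0:
        return None
    if d1 == 0 or d2 % d1 != 0 or not (1 <= d2 // d1 <= len(data)):
        raise ValueError("error pattern is not a single-element error")
    idx = d2 // d1 - 1
    data[idx] -= d1
    return idx

data = [5, -3, 7, 11]
s1, s2 = encode(data)
data[2] += 9                                  # simulate a single corrupted element
repaired = correct_single_error(data, s1, s2)
```

Weights that grow quickly with the index (for example, powers of two) make the weighted sums grow quickly as well, which is the kind of overflow concern the abstract raises for earlier weight-generating methods.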
The authors present an algorithm-based fault-tolerant scheme for recursive least squares, appropriate for applications in adaptive signal processing. The technique focuses on the Gentleman-Kung-McWhirter triangular systolic array architecture for QR decomposition. Assuming that the array is subject to transient faults, widely separated in time and each affecting a single processor, an algorithm is given that corrects the full triangular array with computational overhead equivalent, on average, to the interpolation of a single extra vector into the data stream. No output residuals are lost in the fault recovery. The analysis is extended to a fault-tolerant algorithm for linearly constrained QR decomposition.
The JPEG 2000 image compression standard is designed for a broad range of data compression applications. The new standard is based on wavelet technology and layered coding in order to provide a feature-rich compressed image stream. Implementations of the JPEG 2000 codec are susceptible to computer-induced soft errors. One situation requiring fault tolerance is remote-sensing satellites, where high-energy particles and radiation produce single event upsets that corrupt the highly susceptible data compression operations. This paper develops fault-tolerant error-detecting capabilities for the major subsystems that constitute a JPEG 2000 standard. The nature of each subsystem dictates the realistic fault model: some parts have numerical error impacts, whereas others are properly modeled using bit-level variables. The critical operations of subunits such as the Discrete Wavelet Transform (DWT) and quantization are protected against numerical errors. Concurrent error detection techniques are applied to accommodate the data type and numerical operations in each processing unit. On the other hand, the Embedded Block Coding with Optimal Truncation (EBCOT) system and the bitstream formation unit are protected against soft-error effects using binary decision variables and cyclic redundancy check (CRC) parity values, respectively. The techniques achieve excellent error-detecting capability at only a slight increase in complexity. The design strategies have been tested using Matlab programs, and simulation results are presented.
Managing random hardware faults requires the faults to be detected online, thus simplifying recovery. Algorithm-based fault tolerance has been proposed as a low-cost mechanism to check online the result of computations against random hardware failures. In this case, the checksum of the actual result is checked against a predicted checksum computed in parallel by a hardware checker. In this work, we target the design of such checkers for convolution engines, which are currently the most critical building block in image processing and computer vision applications. The proposed convolution checksum checker, named ConvGuard, utilizes a newly introduced invariance condition of convolution to predict implicitly the output checksum using only the pixels at the border of the input image. In this way, ConvGuard reduces the power required for accumulating the input pixels without requiring large buffers to hold intermediate checksum results. The design of ConvGuard is generic and can be configured for different output sizes and strides. The experimental results show that ConvGuard utilizes only a small percentage of the area/power of an efficient convolution engine while being significantly smaller and more power efficient than a state-of-the-art checksum checker for various practical cases.
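The general idea of predicting an output checksum for convolution can be sketched in one dimension. The example below is a hypothetical software illustration with assumed function names and a stride-1 "valid" sliding window; it is not ConvGuard's invariance condition or its hardware design. It relies only on the fact that each kernel tap sees every input sample except a few at the borders, so the sum of the outputs follows from the input total and the border samples.

```python
# Hypothetical 1D illustration of a predicted output checksum for a
# stride-1 "valid" sliding-window convolution. Not ConvGuard's design.

def conv_valid(x, h):
    """Sliding-window correlation: y[i] = sum_j h[j] * x[i + j]."""
    n, k = len(x), len(h)
    return [sum(h[j] * x[i + j] for j in range(k)) for i in range(n - k + 1)]

def predicted_output_checksum(x, h):
    """Predict sum(conv_valid(x, h)) from the input total and border samples."""
    n, k = len(x), len(h)
    total = sum(x)
    predicted = 0
    for j in range(k):
        # Tap j covers x[j .. n-k+j]: it misses the first j samples and
        # the last k-1-j samples of the input.
        missed = sum(x[:j]) + sum(x[n - k + j + 1:])
        predicted += h[j] * (total - missed)
    return predicted

x = [3, 1, 4, 1, 5, 9, 2, 6]
h = [2, -1, 3]
y = conv_valid(x, h)
match_before = sum(y) == predicted_output_checksum(x, h)   # fault-free result
y[1] += 1                                                  # simulate a faulty output
match_after = sum(y) == predicted_output_checksum(x, h)
```

Integer inputs are used so the comparison is exact. A floating-point version would compare the two checksums within a tolerance instead.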
Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this article, we present a new BLAS implementation, FT-BLAS, that provides performance comparable to or faster than state-of-the-art BLAS libraries, while being capable of tolerating soft errors on-the-fly. At the algorithmic level, we propose a hybrid strategy to incorporate fault-tolerant functionality. For memory-bound Level-1 and Level-2 BLAS routines, we duplicate computing instructions and re-use data at the register level to avoid memory overhead when validating the runtime correctness. Here we propose to utilize mask registers on AVX512-enabled processors and SIMD registers on AVX2-enabled processors to store intermediate comparison results. For compute-bound Level-3 BLAS routines, we fuse memory-intensive operations such as checksum encoding and verification into the GEMM assembly kernels to optimize the memory footprint. We also design cache-friendly parallel algorithms for our fault-tolerant library. Through a series of architecture-aware optimizations, we manage to keep the fault-tolerance overhead negligible (<3%). Experimental results obtained on widely used processors such as Intel Skylake, Intel Cascade Lake, and AMD Zen2 demonstrate that FT-BLAS offers high reliability and high performance: it is faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for both serial and parallel routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.
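The checksum encoding and verification mentioned for Level-3 routines can be illustrated with the classic row/column checksum relation for matrix multiplication. The sketch below is a plain, unoptimized illustration with assumed function names. It is not the FT-BLAS API and does not reflect how FT-BLAS fuses these steps into its GEMM kernels.

```python
# Hypothetical illustration of checksum-based verification for C = A * B.
# Plain Python with integer matrices; not the FT-BLAS implementation.

def matmul(a, b):
    """Naive matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def expected_checksums(a, b):
    """Predict the row sums and column sums of a * b without forming it."""
    col_sums_of_a = [sum(row[p] for row in a) for p in range(len(b))]
    row_sums_of_b = [sum(row) for row in b]
    # Column sums of C equal (column sums of A) times B.
    col_sums = [sum(col_sums_of_a[p] * b[p][j] for p in range(len(b)))
                for j in range(len(b[0]))]
    # Row sums of C equal A times (row sums of B).
    row_sums = [sum(a[i][p] * row_sums_of_b[p] for p in range(len(b)))
                for i in range(len(a))]
    return row_sums, col_sums

def locate_errors(c, row_sums, col_sums):
    """Return the rows and columns of c whose sums disagree with the prediction."""
    bad_rows = [i for i, row in enumerate(c) if sum(row) != row_sums[i]]
    bad_cols = [j for j in range(len(c[0]))
                if sum(row[j] for row in c) != col_sums[j]]
    return bad_rows, bad_cols

a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8, 9], [10, 11, 12]]
c = matmul(a, b)
row_sums, col_sums = expected_checksums(a, b)
clean = locate_errors(c, row_sums, col_sums)    # no mismatches expected
c[2][1] += 4                                    # simulate a soft error in one element
faulty = locate_errors(c, row_sums, col_sums)   # one bad row and one bad column
```

A single corrupted element shows up as exactly one mismatched row and one mismatched column, and the size of either mismatch gives the correction. With floating-point data the equality tests would need a tolerance.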
Algorithm-based fault tolerance (ABFT) is a scheme to improve the reliability of parallel architectures used for computation-intensive tasks. The exact implementation of an ABFT scheme is algorithm-dependent. ABFT systems have very low overhead compared to other fault-tolerance schemes with similar benefits. Few results are available in the area of general synthesis of ABFT systems. A two-stage approach to the synthesis of ABFT systems is proposed. In the first stage, a system-level code is chosen to encode the data used in the algorithm. In the second stage, the optimal architecture to implement the scheme is chosen using dependence graphs, a graph-theoretic form of algorithm representation. We demonstrate that not all architectures are ideal for the implementation of a particular ABFT scheme. We propose new measures to characterize the fault-tolerance capability of a system to better exploit the proposed synthesis method. Dependence graphs can also be used for the synthesis of ABFT schemes for non-linear problems. An example of a fault-tolerant median filter is provided to illustrate their utility for such problems.