The need for reliability of computers has been increasing, as computers have been put to use in more and more practical applications. Multiprocessor architectures have provided elegant solutions for certain computatio...
详细信息
The need for reliability of computers has been increasing, as computers have been put to use in more and more practical applications. Multiprocessor architectures have provided elegant solutions for certain computationally expensive problems which find wide-ranging applications in areas such as defense and industry. Since computer-intensive applications are run on these architectures, the probability that some computations will incur error is not negligible. Hence faulttolerance plays an important role in the design of multiprocessor architectures. In this paper, we review a low-cost scheme for adding faulttolerance in multiprocessor architectures, called algorithm-based fault tolerance (ABFT). The concurrent error detecting and correcting capabilities of this scheme are demonstrated with the help of examples. Various issues of interest, the areas open to research and the limitations of ABFT are also pointed out. (C) 1997 Elsevier Science B.V.
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix-matrix multiplication kernel can be to...
详细信息
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in the previous algorithm-based fault tolerance research that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship in the input checksum matrices can be maintained in the middle of the computation or not remains open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, however, for the outer product version matrix-matrix multiplication algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodical checkpointing is involved, the faulttolerance overhead for this approach is surprisingly low.
The LU decomposition followed by forward/backward substitution is a very powerful technique for power flow studies. In order to ensure the reliability of computation, the algorithm-based fault tolerance (ABFT) is appl...
详细信息
ISBN:
(纸本)0780300815
The LU decomposition followed by forward/backward substitution is a very powerful technique for power flow studies. In order to ensure the reliability of computation, the algorithm-based fault tolerance (ABFT) is applied to LU decomposition in power flow studies. This technique is proposed not only to detect and correct errors caused by hardware failure but also to debug programs. Since the ABFT often suffers from roundoff errors when applied to the floating-point number system, a new technique called significant-bit maintenance arithmetic (SBMA) is also suggested for handling numerical problems.
VLSI-based processor arrays have been widely used for computation intensive applications such as matrix and graph algorithms. algorithm-based fault tolerance designs employing Various encoding/decoding schemes have be...
详细信息
VLSI-based processor arrays have been widely used for computation intensive applications such as matrix and graph algorithms. algorithm-based fault tolerance designs employing Various encoding/decoding schemes have been proposed for such systems to effectively tolerate operation time fault. In this paper, we propose an efficient algorithm-based fault tolerance design using the weighted data-check relationship, where the checks are obtained from the weighted data. The relationship is systematically defined as a new (n, k, N-w) Hamming checksum code, where n is the size of the code word, k is the number of information elements in the code word, and N-w is the number of weights employed, respectively. The proposed design with various weights is evaluated in terms of time and hardware overhead as well as overflow probability and round-off error. Two different schemes employing the (n, k, 2) and (n, k, 3) Hamming checksum code are illustrated using important matrix computations. Comparison with other schemes reveals that the (n, k, 3) Hamming checksum scheme is very efficient, while the hardware overhead is small.
We establish new lower and upper bounds for the combinatorial problem of constructing minimal test sets for error detection in multiprocessor systems. Our construction for detecting two errors produces minimal test se...
详细信息
We establish new lower and upper bounds for the combinatorial problem of constructing minimal test sets for error detection in multiprocessor systems. Our construction for detecting two errors produces minimal test sets, while that for three errors produces test sets whose size exceeds our lower bound by at most one. We also present a divide-and-conquer construction scheme for four or more errors.
algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by con...
详细信息
algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. In this paper, we investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.
algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. The design problem for an ABFT system is concerned with the construction of a check set for detec...
详细信息
algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. The design problem for an ABFT system is concerned with the construction of a check set for detecting errors or faults. In this paper, we analyze the construction of check sets from a combinatorial perspective, and propose a necessary and sufficient condition for the design of a check set that detects a given number of errors. We also propose a new bound for detecting three errors for algorithm-based fault tolerance systems.
algorithm-based fault tolerance (ABFT) is a popular approach to achieve fault and error detection in multiprocessor systems. The design problem for ABFT is concerned with the construction of a check set of minimum car...
详细信息
algorithm-based fault tolerance (ABFT) is a popular approach to achieve fault and error detection in multiprocessor systems. The design problem for ABFT is concerned with the construction of a check set of minimum cardinality that detects a specified number of errors or faults. Previous work on this problem has assumed an a priori-bound on size of a check. We motivate and carry out an investigation of the problem without the bounded check size assumption. We establish upper and lower bounds on the number of checks needed to detect a given number of errors. The upper bounds are obtained through new schemes which are easy to implement, and the lower bounds are established using new types of arguments. These bounds are sharply different from those previously established under the bounded check size model. We also show that unlike error detection, the design problem for fault detection is NP-hard even for detecting only one fault.
Soft errors in scientific computing applications are becoming inevitable with the ever-increasing system scale and execution time, and new technologies that feature increased transistor density and lower voltage. Soft...
详细信息
Soft errors in scientific computing applications are becoming inevitable with the ever-increasing system scale and execution time, and new technologies that feature increased transistor density and lower voltage. Soft errors can be mainly classified into two categories: bit-flipping error (e.g. 1 becomes -1) in random access memory;and computation error (e.g. 1+1=3) in floating point units. Traditionally, bit-flipping error is handled by the Error Correcting Code (ECC) technique, and computation error is dealt with the Triple Modular Redundancy (TMR) method. Note that, ECC cannot handle computation error, while TMR cannot deal with bit-flipping error and is not efficient on handling computation error. To uniformly and efficiently handle both computation and bit-flipping errors in matrix operations, the algorithm-based fault tolerance (ABFT) method is developed. This paper focuses on the detection of soft errors in the LU Decomposition with Partial Pivoting (LUPP) algorithm, which is widely used in scientific computing applications. First, this paper notes that existing ABFT methods are not adequate to detect soft errors in LUPP in terms of time or space. Then we propose a new ABFT algorithm which can detect soft errors in LUPP both flexible in time and comprehensive in space. Flexible in time means that soft errors can be detected flexibly during the execution instead of only at the end of LUPP, while comprehensive in space indicates that all of the elements in data matrices (L and U) will be covered for detecting soft errors. To show the feasibility and efficiency of the proposed algorithm, this paper has incorporated it into the implementation of LUPP in the widely used benchmark High Performance Linpack (HPL). Experiment results verify the feasibility of this algorithm: for soft errors injected at various timings and to different elements in LUPP, this algorithm has detected most of the injected errors, which have covered all of the errors that cannot pass the re
algorithm-based fault tolerance is a scheme of low-cost error protection in real-time digital signal processing environments and other computation-intensive tasks. In this paper, a new method for encoding data is prop...
详细信息
algorithm-based fault tolerance is a scheme of low-cost error protection in real-time digital signal processing environments and other computation-intensive tasks. In this paper, a new method for encoding data is proposed and, furthermore, tow kinds of error-correcting codes over Z2m, which can be used with fixed-point arithmetic in practical algorithm-basedfault tolerant systems, are introduced.
暂无评论