Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. In this paper, we investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of locating data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.
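To make the idea of a check assignment concrete, here is a minimal sketch of how a small set of checks can locate one erroneous data element. It is not the assignment proposed in the paper: it uses the familiar binary-index construction, assumes at most one error, and models each check as a comparison of a sum against a trusted expected value. All function names are hypothetical.

```python
def check_assignment(n):
    """Assumed assignment: check j covers the 1-based indices whose bit j is set."""
    num_checks = n.bit_length()
    return [[i for i in range(1, n + 1) if (i >> j) & 1]
            for j in range(num_checks)]


def failing_checks(checks, observed, expected_sums):
    """A check fails when the sum over its members differs from the expected sum."""
    return [sum(observed[i - 1] for i in members) != expected
            for members, expected in zip(checks, expected_sums)]


def locate(fails):
    """Read the failing checks as a binary number: the 1-based index of the
    single erroneous element, or 0 when every check passes."""
    return sum(1 << j for j, failed in enumerate(fails) if failed)


# Usage: six data elements, with element 5 corrupted after the checksums were taken.
data = [10, 20, 30, 40, 50, 60]
checks = check_assignment(len(data))
expected = [sum(data[i - 1] for i in members) for members in checks]
corrupted = list(data)
corrupted[4] += 7
error_index = locate(failing_checks(checks, corrupted, expected))
```

Under the single-error assumption, every index from 1 to n has at least one bit set, so every error trips at least one check, and the pattern of failures identifies the element.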
Algorithm-based fault tolerance (ABFT) is a popular approach to achieve fault and error detection in multiprocessor systems. The design problem for ABFT is concerned with the construction of a check set of minimum cardinality that detects a specified number of errors or faults. Previous work on this problem has assumed an a priori bound on the size of a check. We motivate and carry out an investigation of the problem without the bounded check size assumption. We establish upper and lower bounds on the number of checks needed to detect a given number of errors. The upper bounds are obtained through new schemes which are easy to implement, and the lower bounds are established using new types of arguments. These bounds are sharply different from those previously established under the bounded check size model. We also show that, unlike error detection, the design problem for fault detection is NP-hard even for detecting only one fault.
Algorithm-based fault tolerance is a scheme for low-cost error protection in real-time digital signal processing environments and other computation-intensive tasks. In this paper, a new method for encoding data is proposed and, furthermore, two kinds of error-correcting codes over Z_{2^m}, which can be used with fixed-point arithmetic in practical algorithm-based fault-tolerant systems, are introduced.
This short note considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This short note proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. It is shown that the partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.
The authors present an algorithm-based fault-tolerant scheme for recursive least squares, appropriate for applications in adaptive signal processing. The technique is closely focused on the Gentleman-Kung-McWhirter triangular systolic array architecture for QR decomposition. Assuming that the array is subject to transient faults, widely separated in time and each affecting a single processor, an algorithm is given that corrects the full triangular array with computational overhead equivalent, on average, to the interpolation of a single extra vector into the data stream. No output residuals are lost in the fault recovery. The analysis is extended to a fault-tolerant algorithm for linearly constrained QR decomposition.
ISBN: 0818672617 (print)
We have developed an automated, compile-time approach to generating error-detecting parallel programs. The compiler is used to identify statements implementing affine transformations within the program and automatically insert code for computing, manipulating, and comparing checksums in order to detect data errors at runtime. Statements which do not implement affine transformations are checked by duplication. Checksums are reused from one loop to the next where possible, rather than being recomputed for every statement. A global dataflow analysis is performed in order to determine points at which checksums need to be recomputed. We also use a novel method of specifying the data distributions of the check data using data distribution directives so that the computations on the original data and the corresponding check computations are performed on different processors. Results on the time overhead and error coverage of the error-detecting parallel programs over the original programs are presented on an Intel Paragon distributed memory multicomputer.
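As a rough illustration of why affine statements lend themselves to checksum checking, the sketch below verifies y = A x + b by comparing the sum of the outputs against a checksum predicted from the inputs alone. It is a hand-written example under simplifying assumptions (integer data, so the comparison is exact), not the code the compiler in the paper generates. The function names and the inject_error flag are hypothetical.

```python
def affine(A, b, x):
    """Compute y = A x + b, with A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) + bi
            for row, bi in zip(A, b)]


def column_checksum(A):
    """Return the row vector 1^T A, i.e. the sum of each column of A."""
    return [sum(col) for col in zip(*A)]


def checked_affine(A, b, x, inject_error=False):
    """Return (y, passed), where passed reports whether the checksum matched."""
    y = affine(A, b, x)
    if inject_error:
        y[0] += 1  # simulate a single corrupted data element
    # Predicted checksum, computed without using y: (1^T A) x + sum(b).
    predicted = sum(c * xi for c, xi in zip(column_checksum(A), x)) + sum(b)
    return y, sum(y) == predicted
```

The predicted checksum costs one extra inner product instead of a second full evaluation, which is the saving over checking by duplication. With floating-point data, the equality test would need a tolerance.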
We establish new lower and upper bounds for the combinatorial problem of constructing minimal test sets for error detection in multiprocessor systems. Our construction for detecting two errors produces minimal test sets, while that for three errors produces test sets whose size exceeds our lower bound by at most one. We also present a divide-and-conquer construction scheme for four or more errors.
Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. In this short note, we present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.
ISBN: 0819416207 (print)
Artificial neural networks are an interesting solution for several real-time applications in the area of signal and image processing, in particular since recent advances in VLSI integration technologies allow for efficient hardware realizations. The use of dedicated circuits implementing the neural networks in mission-critical applications requires a high level of protection with respect to errors due to faults to guarantee output credibility and system availability. In this paper, the problem of concurrent error detection in dedicated neural networks is discussed by adopting an algorithm-based approach to check the inner product, i.e., the bulk of the computation performed in the neural network. The effectiveness and efficiency of this technique are shown and evaluated for the widely used classes of neural paradigms.
ISBN: 0780300815 (print)
LU decomposition followed by forward/backward substitution is a very powerful technique for power flow studies. In order to ensure the reliability of computation, algorithm-based fault tolerance (ABFT) is applied to LU decomposition in power flow studies. This technique is proposed not only to detect and correct errors caused by hardware failure but also to debug programs. Since ABFT often suffers from roundoff errors when applied to the floating-point number system, a new technique called significant-bit maintenance arithmetic (SBMA) is also suggested for handling numerical problems.
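The roundoff issue can be seen in a small sketch of checksum-protected elimination. Each row is augmented with its row sum, the same row operations are applied to the extra column, and each row of the resulting upper-triangular factor is compared against its carried checksum. The sketch assumes nonzero pivots (no pivoting) and uses a plain relative tolerance as a stand-in for the comparison step. It does not implement SBMA, and the function names are hypothetical.

```python
import math


def eliminate_with_checksum(A):
    """Gaussian elimination on [A | A·1]; returns (U, carried row checksums).

    Assumes every pivot is nonzero, so no row exchanges are performed.
    """
    n = len(A)
    M = [[float(v) for v in row] + [float(sum(row))] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):  # the checksum column is updated too
                M[i][j] -= factor * M[k][j]
    U = [row[:n] for row in M]
    checks = [row[n] for row in M]
    return U, checks


def rows_consistent(U, checks, rel_tol=1e-9):
    """Compare each row sum with its checksum, allowing for roundoff."""
    return [math.isclose(sum(row), c, rel_tol=rel_tol, abs_tol=1e-12)
            for row, c in zip(U, checks)]


# Usage: a clean run passes; corrupting one entry of U flags that row.
U, checks = eliminate_with_checksum([[4, 3, 2], [2, 1, 3], [3, 4, 1]])
clean = rows_consistent(U, checks)
U[2][2] += 1.0
flagged = rows_consistent(U, checks)
```

The tolerance is the weak point: too tight, and roundoff raises false alarms; too loose, and small errors slip through. That trade-off is the numerical problem the abstract refers to.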