Search results - Inner Mongolia University Library

Algorithm-based fault tolerance: a review

MICROPROCESSORS AND MICROSYSTEMS, 1997, Vol. 21, No. 3, pp. 151-161

Authors: Vijay, M; Mittal, R. Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India

The need for reliability of computers has been increasing, as computers have been put to use in more and more practical applications. Multiprocessor architectures have provided elegant solutions for certain computationally expensive problems which find wide-ranging applications in areas such as defense and industry. Since computer-intensive applications are run on these architectures, the probability that some computations will incur error is not negligible. Hence fault tolerance plays an important role in the design of multiprocessor architectures. In this paper, we review a low-cost scheme for adding fault tolerance in multiprocessor architectures, called algorithm-based fault tolerance (ABFT). The concurrent error detecting and correcting capabilities of this scheme are demonstrated with the help of examples. Various issues of interest, the areas open to research and the limitations of ABFT are also pointed out. (C) 1997 Elsevier Science B.V.

Keywords: algorithm-based fault tolerance; fault tolerance; multiprocessor architectures; space redundancy; time redundancy

Algorithm-Based Fault Tolerance for Fail-Stop Failures

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, Vol. 19, No. 12, pp. 1628-1641

Authors: Chen, Zizhong; Dongarra, Jack. Colorado Sch Mines, Dept Math & Comp Sci, Golden, CO 80401, USA; Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996, USA

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance research that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship in the input checksum matrices can be maintained in the middle of the computation remains open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer product version of the matrix-matrix multiplication algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead for this approach is surprisingly low.

Keywords: algorithm-based fault tolerance; checkpointing; fail-stop failures; parallel matrix-matrix multiplication; ScaLAPACK
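
The checksum property described in this abstract can be illustrated with a small serial sketch. It is not the ScaLAPACK implementation from the paper: the helper names (`with_column_checksum`, `outer_product_steps`, and so on) and the tiny integer matrices are assumptions chosen for illustration. The sketch compares the outer-product ordering with a row-by-row ordering and records whether the row and column checksums hold after each intermediate step.

```python
def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def checksums_hold(c):
    """True if the last entry of every row and column equals the sum of the rest."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c))
    return rows_ok and cols_ok


def outer_product_steps(ac, br):
    """Yield the partial product after each rank-1 (outer product) update."""
    rows, cols, inner = len(ac), len(br[0]), len(br)
    c = [[0] * cols for _ in range(rows)]
    for s in range(inner):
        for i in range(rows):
            for j in range(cols):
                c[i][j] += ac[i][s] * br[s][j]
        yield [row[:] for row in c]


def row_by_row_steps(ac, br):
    """Yield the partial product after each completed row of the result."""
    rows, cols, inner = len(ac), len(br[0]), len(br)
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c[i][j] = sum(ac[i][s] * br[s][j] for s in range(inner))
        yield [row[:] for row in c]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Ac, Br = with_column_checksum(A), with_row_checksum(B)

# Checksum status after every intermediate step of each ordering.
outer_ok = [checksums_hold(c) for c in outer_product_steps(Ac, Br)]
rowwise_ok = [checksums_hold(c) for c in row_by_row_steps(Ac, Br)]
print(outer_ok, rowwise_ok)
```

In this toy case the outer-product ordering satisfies the checksums after every update, while the row-by-row ordering satisfies them only once the last row is complete, which matches the distinction the abstract draws.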

ALGORITHM-BASED FAULT TOLERANCE IN COMPUTATION OF POWER FLOW

33RD MIDWEST SYMP ON CIRCUITS AND SYSTEMS

Authors: CHEN, YP; HAN, JY. Dept of Electr & Comput Eng, Illinois Inst of Technol, Chicago, IL, USA

ISBN: (print) 0780300815

The LU decomposition followed by forward/backward substitution is a very powerful technique for power flow studies. In order to ensure the reliability of computation, the algorithm-based fault tolerance (ABFT) is applied to LU decomposition in power flow studies. This technique is proposed not only to detect and correct errors caused by hardware failure but also to debug programs. Since the ABFT often suffers from roundoff errors when applied to the floating-point number system, a new technique called significant-bit maintenance arithmetic (SBMA) is also suggested for handling numerical problems.

Keywords: algorithm-based fault tolerance; POWER FLOW; LU DECOMPOSITION; FORWARD AND BACKWARD SUBSTITUTION; INSITU METHOD; PROGRAM DEBUGGING; FLOATING-POINT ARITHMETIC; ROUND-OFF ERROR; SIGNIFICANT-BIT MAINTENANCE ARITHMETIC
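
As a rough illustration of how a checksum can ride along with LU-style elimination, the sketch below appends a row-sum column to the matrix and applies the same row operations to it. It assumes no pivoting and uses a plain relative tolerance for the floating-point comparison; it does not implement the significant-bit maintenance arithmetic (SBMA) mentioned in the abstract, and the function names are hypothetical.

```python
def eliminate_with_checksum(a):
    """Gaussian elimination without pivoting on [A | row sums]; returns augmented U."""
    n = len(a)
    u = [[float(v) for v in row] + [float(sum(row))] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            factor = u[i][k] / u[k][k]
            # The checksum column (index n) gets the same row operation.
            for j in range(k, n + 1):
                u[i][j] -= factor * u[k][j]
    return u


def faulty_rows(u, tol=1e-9):
    """Indices of rows whose checksum entry disagrees with the row sum beyond tol."""
    return [
        i for i, row in enumerate(u)
        if abs(sum(row[:-1]) - row[-1]) > tol * max(1.0, abs(row[-1]))
    ]


A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
U = eliminate_with_checksum(A)
clean = faulty_rows(U)

# Simulate a transient fault by corrupting one entry of the result.
corrupted = [row[:] for row in U]
corrupted[1][2] += 0.5
flagged = faulty_rows(corrupted)
print(clean, flagged)
```

The tolerance is the weak point of this kind of check in floating point: too tight and roundoff raises false alarms, too loose and small errors slip through, which is the numerical problem the abstract raises.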

An efficient algorithm-based fault tolerance design using the weighted data-check relationship

IEEE TRANSACTIONS ON COMPUTERS, 2001, Vol. 50, No. 4, pp. 371-383

Authors: Youn, HY; Oh, CG; Choo, H; Chung, JW; Lee, D. Sungkyunkwan Univ, Sch Elect & Comp Engn, Suwon 440746, South Korea; Nsyst Commun, San Diego, CA 92127, USA; Informat & Commun Uni, Sch Engn, Taejon, South Korea

VLSI-based processor arrays have been widely used for computation-intensive applications such as matrix and graph algorithms. Algorithm-based fault tolerance designs employing various encoding/decoding schemes have been proposed for such systems to effectively tolerate operation-time faults. In this paper, we propose an efficient algorithm-based fault tolerance design using the weighted data-check relationship, where the checks are obtained from the weighted data. The relationship is systematically defined as a new (n, k, N_w) Hamming checksum code, where n is the size of the code word, k is the number of information elements in the code word, and N_w is the number of weights employed. The proposed design with various weights is evaluated in terms of time and hardware overhead as well as overflow probability and round-off error. Two different schemes employing the (n, k, 2) and (n, k, 3) Hamming checksum codes are illustrated using important matrix computations. Comparison with other schemes reveals that the (n, k, 3) Hamming checksum scheme is very efficient, while the hardware overhead is small.

Keywords: algorithm-based fault tolerance; Hamming correcting code; matrix computations; overflow; round-off error; VLSI processor array
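
To give a feel for what "checks obtained from the weighted data" can buy, here is a minimal sketch of the classical two-checksum idea: one plain sum and one sum weighted by position. It is not the (n, k, N_w) Hamming checksum code proposed in the paper; the weights 1..n and the function names are assumptions made for illustration. With integer data, the ratio of the two checksum discrepancies points at the corrupted element.

```python
def encode(data):
    """Return (plain checksum, position-weighted checksum) with weights 1..n."""
    return sum(data), sum(w * x for w, x in enumerate(data, start=1))


def locate_and_correct(data, checks):
    """Repair at most one corrupted element in place; return its index or None."""
    s1, s2 = encode(data)
    d1, d2 = s1 - checks[0], s2 - checks[1]
    if d1 == 0 and d2 == 0:
        return None  # both checks agree, nothing to fix
    if d1 == 0 or d2 % d1 != 0 or not 1 <= d2 // d1 <= len(data):
        raise ValueError("error pattern is not a single data-element error")
    pos = d2 // d1 - 1  # weight of the faulty element minus one
    data[pos] -= d1     # d1 is the size of the error
    return pos


original = [3, 1, 4, 1, 5]
checks = encode(original)

received = original[:]
received[2] = 9  # single corrupted element
pos = locate_and_correct(received, checks)
print(pos, received)
```

With growing weights the weighted checksum itself grows quickly, which is one way to see why the abstract evaluates overflow probability and round-off error alongside overhead.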

IMPROVED BOUNDS FOR ALGORITHM-BASED FAULT-TOLERANCE

IEEE TRANSACTIONS ON COMPUTERS, 1993, Vol. 42, No. 5, pp. 630-635

Authors: ROSENKRANTZ, DJ; RAVI, SS. Department of Computer Science, State University of New York Albany, Albany, NY, USA

We establish new lower and upper bounds for the combinatorial problem of constructing minimal test sets for error detection in multiprocessor systems. Our construction for detecting two errors produces minimal test sets, while that for three errors produces test sets whose size exceeds our lower bound by at most one. We also present a divide-and-conquer construction scheme for four or more errors.

Keywords: algorithm-based fault tolerance; ERROR FAULT DETECTION; LOWER BOUND; ONLINE TEST; UPPER BOUND

ALMOST CERTAIN FAULT-DIAGNOSIS THROUGH ALGORITHM-BASED FAULT-TOLERANCE

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1994, Vol. 5, No. 5, pp. 532-539

Authors: BLOUGH, DM; PELC, A. UNIV QUEBEC, DEPT INFORMAT, HULL J8X 3X7, QUEBEC, CANADA

Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. In this paper, we investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.

Keywords: algorithm-based fault tolerance; CONCURRENT ERROR DETECTION; FAULT DIAGNOSIS; INTERMITTENT FAULTS; PROBABILISTIC ANALYSIS

Combinatorial analysis of check set construction for algorithm-based fault tolerance systems

JOURNAL OF ELECTRONIC TESTING-THEORY AND APPLICATIONS, 1998, Vol. 12, No. 3, pp. 255-260

Authors: Wang, DQ; Zhao, LC. Dalian Maritime Univ, Dept Basic Sci, Dalian 116026, Peoples R China

Algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. The design problem for an ABFT system is concerned with the construction of a check set for detecting errors or faults. In this paper, we analyze the construction of check sets from a combinatorial perspective, and propose a necessary and sufficient condition for the design of a check set that detects a given number of errors. We also propose a new bound for detecting three errors for algorithm-based fault tolerance systems.

Keywords: algorithm-based fault tolerance; check set; combinatorial problem; error detecting

CONSTRUCTION OF CHECK SETS FOR ALGORITHM-BASED FAULT-TOLERANCE

IEEE TRANSACTIONS ON COMPUTERS, 1994, Vol. 43, No. 6, pp. 641-650

Authors: GU, DC; ROSENKRANTZ, DJ; RAVI, SS. SUNY ALBANY, DEPT COMP SCI, ALBANY, NY 12222

Algorithm-based fault tolerance (ABFT) is a popular approach to achieve fault and error detection in multiprocessor systems. The design problem for ABFT is concerned with the construction of a check set of minimum cardinality that detects a specified number of errors or faults. Previous work on this problem has assumed an a priori bound on the size of a check. We motivate and carry out an investigation of the problem without the bounded check size assumption. We establish upper and lower bounds on the number of checks needed to detect a given number of errors. The upper bounds are obtained through new schemes which are easy to implement, and the lower bounds are established using new types of arguments. These bounds are sharply different from those previously established under the bounded check size model. We also show that, unlike error detection, the design problem for fault detection is NP-hard even for detecting only one fault.

Keywords: algorithm-based fault tolerance; ONLINE CHECK; ERROR FAULT DETECTION; UPPER LOWER BOUNDS; NP-COMPLETE
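
The check-set design problem can be explored by brute force on small instances. The sketch below rests on an assumed detection model that the abstract does not spell out: a check is a set of data elements and is guaranteed to fire only when exactly one of its elements is erroneous, since several errors inside one check may mask each other. Under that assumption, a check set detects t errors if every error pattern of size 1 to t leaves some check with exactly one erroneous element. The function name `detects` and the example check sets are hypothetical.

```python
from itertools import combinations


def detects(checks, num_data, t):
    """True if every pattern of 1..t erroneous data elements surely trips a check.

    Assumed model: a check fires for certain only when exactly one of the
    data elements it covers is erroneous.
    """
    for size in range(1, t + 1):
        for pattern in combinations(range(num_data), size):
            bad = set(pattern)
            if not any(len(bad & check) == 1 for check in checks):
                return False
    return True


# Four data elements, two disjoint checks: single errors only.
checks_a = [{0, 1}, {2, 3}]
# Two more checks that cut across the first pair.
checks_b = [{0, 1}, {2, 3}, {0, 2}, {1, 3}]

print(detects(checks_a, 4, 1), detects(checks_a, 4, 2), detects(checks_b, 4, 2))
```

The search is exponential in t and is meant only for experimenting with small examples; the papers listed here are about bounds and constructions that avoid such enumeration.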

Detection of soft errors in LU decomposition with partial pivoting using algorithm-based fault tolerance

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2015, Vol. 29, No. 4, pp. 422-436

Authors: Yao, Erlin; Zhang, Jiutian; Chen, Mingyu; Tan, Guangming; Sun, Ninghui. Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing 100190, Peoples R China

Soft errors in scientific computing applications are becoming inevitable with the ever-increasing system scale and execution time, and new technologies that feature increased transistor density and lower voltage. Soft errors can be mainly classified into two categories: bit-flipping error (e.g. 1 becomes -1) in random access memory; and computation error (e.g. 1+1=3) in floating point units. Traditionally, bit-flipping error is handled by the Error Correcting Code (ECC) technique, and computation error is dealt with by the Triple Modular Redundancy (TMR) method. Note that ECC cannot handle computation error, while TMR cannot deal with bit-flipping error and is not efficient at handling computation error. To uniformly and efficiently handle both computation and bit-flipping errors in matrix operations, the algorithm-based fault tolerance (ABFT) method is developed. This paper focuses on the detection of soft errors in the LU Decomposition with Partial Pivoting (LUPP) algorithm, which is widely used in scientific computing applications. First, this paper notes that existing ABFT methods are not adequate to detect soft errors in LUPP in terms of time or space. Then we propose a new ABFT algorithm which can detect soft errors in LUPP in a way that is both flexible in time and comprehensive in space. Flexible in time means that soft errors can be detected flexibly during the execution instead of only at the end of LUPP, while comprehensive in space indicates that all of the elements in data matrices (L and U) will be covered for detecting soft errors. To show the feasibility and efficiency of the proposed algorithm, this paper has incorporated it into the implementation of LUPP in the widely used benchmark High Performance Linpack (HPL). Experiment results verify the feasibility of this algorithm: for soft errors injected at various timings and to different elements in LUPP, this algorithm has detected most of the injected errors, which have covered all of the errors that cannot pass the re

Keywords: Soft error; error detection; LU Decomposition with Partial Pivoting; algorithm-based fault tolerance

ERROR-CORRECTING CODES OVER Z_{2^m} FOR ALGORITHM-BASED FAULT-TOLERANCE

IEEE TRANSACTIONS ON COMPUTERS, 1994, Vol. 43, No. 3, pp. 370-374

Authors: FENG, GL; RAO, TRN; KOLLURU, MS. Center of Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA, USA

Algorithm-based fault tolerance is a scheme of low-cost error protection in real-time digital signal processing environments and other computation-intensive tasks. In this paper, a new method for encoding data is proposed and, furthermore, two kinds of error-correcting codes over Z_{2^m}, which can be used with fixed-point arithmetic in practical algorithm-based fault tolerant systems, are introduced.

Keywords: algorithm-based fault tolerance; BCH-LIKE CODES; DECODING DATA; ENCODING DATA; ERROR-CORRECTING CODES OVER A RING; REED SOLOMON-LIKE CODES
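
Why the ring Z_{2^m} suits fixed-point arithmetic can be seen with a plain modular checksum. The sketch below is not the BCH-like or Reed-Solomon-like codes of the paper and only detects errors; the word size `M_BITS = 8` and the function names are assumptions for illustration. Because m-bit wraparound is exactly arithmetic modulo 2^m, the checksum relation survives overflow and needs no round-off tolerance.

```python
M_BITS = 8            # assumed word size for the illustration
MOD = 1 << M_BITS     # arithmetic is carried out in Z_{2^m}


def with_checksum_row(a):
    """Append a row holding each column sum modulo 2^m."""
    return a + [[sum(col) % MOD for col in zip(*a)]]


def matvec_mod(a, x):
    """Matrix-vector product with every result reduced modulo 2^m."""
    return [sum(aij * xj for aij, xj in zip(row, x)) % MOD for row in a]


def check(y):
    """True if the last entry equals the sum of the others modulo 2^m."""
    return sum(y[:-1]) % MOD == y[-1]


A = [[200, 17], [99, 250]]
x = [3, 7]

# The products overflow 8 bits, yet the checksum still matches exactly.
y = matvec_mod(with_checksum_row(A), x)
ok = check(y)

# Flip one bit in a result word to simulate an error.
corrupted = y[:]
corrupted[0] ^= 0x10
bad = check(corrupted)
print(y, ok, bad)
```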
