Search results - Inner Mongolia University Library

Algorithm-based fault location and recovery for matrix computations on multiprocessor systems

IEEE TRANSACTIONS ON COMPUTERS, 1996, vol. 45, no. 11, pp. 1239-1247

Authors: RoyChowdhury, A; Banerjee, P. Northwestern Univ, Ctr Parallel & Distributed Comp, Evanston, IL 60208; Northwestern Univ, Dept Elect & Comp Engn, Evanston, IL 60208

Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results, which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we first present a general scheme for performing fault location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses it to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single- and double-fault location and recovery cases.

Keywords: algorithm-based fault-tolerance; parallel numerical algorithms; fault location; fault recovery; system level diagnosis; coding theory
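
The abstract above describes computing on encoded data, then locating and recovering from errors. As a rough illustration only, the sketch below uses the classic row/column checksum encoding for matrix multiplication, not the processor-level diagnosis and coding scheme of the paper. It assumes a single corrupted element in the data part of the product; all function names are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Append a column holding the sum of each row."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(c_full, tol=1e-9):
    """Locate and repair one corrupted data element of a full-checksum product.

    Returns the repaired (row, col), or None if every check passes.
    Assumption: at most one data element is wrong; other patterns raise.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is not a single corrupted data element")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the element from its row checksum and the rest of the row.
    c_full[i][j] = c_full[i][m] - sum(c_full[i][k] for k in range(m) if k != j)
    return (i, j)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c_full = matmul(column_checksum(a), row_checksum(b))  # encoded result
c_full[0][1] += 100.0                                 # inject a fault
repaired = locate_and_correct(c_full)
```

In this run the failing row check and the failing column check intersect at element (0, 1), which is then restored to 22.0 from the row checksum.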

A new error analysis based method for tolerance computation for algorithm-based checks

IEEE TRANSACTIONS ON COMPUTERS, 1996, vol. 45, no. 2, pp. 238-243

Authors: Chowdhury, AR; Banerjee, P. Coordinated Science Laboratory, University of Illinois, Urbana, IL, USA

Algorithm-based techniques are based on checking for the preservation of certain properties possessed by global data following a set of computations. This often involves the introduction of a check variable which is updated in such a manner that, in the absence of roundoff errors, it equals the value of some function which involves all the data elements participating in the algorithm. However, the fact that roundoff errors accumulate in different ways in the updates involving the check variables and the computations involving data elements makes it highly unlikely that the equality is preserved exactly for an implementation of the algorithm on a real computer. Thus, the check step involves verifying the preservation of the equality to within a tolerance value. In this brief contribution, we propose a method for determination of the tolerance based on error analysis techniques. We present results on three numerical algorithms which show the effectiveness of our approach for data sets of varying sizes and data ranges.

Keywords: parallel algorithms; algorithm-based fault-tolerance; checksum encodings; check thresholding; roundoff error analysis
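
To make the idea of a tolerance derived from error analysis concrete, here is a minimal sketch for a checksum-protected matrix-vector product. It is not the method of the paper: the tolerance comes from the standard first-order rounding bound gamma_k = k*u / (1 - k*u) applied to both sides of the check, and the function name, the `inject` hook, and the factor of 2 are assumptions made for illustration.

```python
import sys

def checked_matvec(a, x, inject=None):
    """Compute y = A x and test its checksum against a derived tolerance.

    inject = (i, delta) optionally corrupts y[i] to simulate a fault.
    Returns (y, ok), where ok is True if the check passes.
    """
    u = sys.float_info.epsilon / 2            # unit roundoff for doubles
    m, n = len(a), len(x)
    y = [sum(a[i][j] * x[j] for j in range(n)) for i in range(m)]
    if inject is not None:
        i, delta = inject
        y[i] += delta
    # Check variable: (column sums of A) . x equals sum(y) in exact arithmetic.
    check_row = [sum(a[i][j] for i in range(m)) for j in range(n)]
    expected = sum(check_row[j] * x[j] for j in range(n))
    observed = sum(y)
    # Each product term passes through fewer than m + n rounded operations
    # on either side, so gamma_(m+n) * sum|a_ij||x_j| bounds each side's error.
    k = m + n
    gamma = k * u / (1 - k * u)
    magnitude = sum(abs(a[i][j]) * abs(x[j])
                    for i in range(m) for j in range(n))
    tol = 2 * gamma * magnitude
    return y, abs(observed - expected) <= tol

a = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
x = [1e3, -2e3, 3e3]
_, clean_ok = checked_matvec(a, x)
_, faulty_ok = checked_matvec(a, x, inject=(1, 1e-3))
```

Because the tolerance scales with the magnitude of the data, the same check accepts the roundoff of the fault-free run while flagging the injected perturbation of 1e-3, without a hand-tuned threshold.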

Algorithm based error-detection schemes for iterative solution of partial differential equations

IEEE TRANSACTIONS ON COMPUTERS, 1996, vol. 45, no. 4, pp. 394-407

Authors: RoyChowdhury, A; Bellas, N; Banerjee, P. Coordinated Science Lab, University of Illinois, Urbana, IL, USA

Algorithm-based fault tolerance is an inexpensive method of achieving fault tolerance without requiring any hardware modifications. Algorithm-based schemes have been proposed for a wide variety of numerical applications. However, for a particular class of numerical applications, namely those involving the iterative solution of linear systems arising from discretization of various PDEs, there exist almost no fault-tolerant algorithms in the literature. In this paper, we first describe an error-detecting version of a parallel algorithm for iteratively solving the Laplace equation over a rectangular grid. This error-detecting algorithm is based on the popular successive overrelaxation scheme with red-black ordering. We use the Laplace equation merely as a vehicle for discussion; later in the paper we show how to modify the algorithm to devise error-detecting iterative schemes for solving linear systems arising from discretizations of other PDEs, such as the Poisson equation and a variant of the Laplace equation with a mixed derivative term. We also discuss a modification of the basic scheme to handle situations where the underlying solution domain is not rectangular. We then discuss a somewhat different error-detecting algorithm for iterative solution of PDEs which can be expected to yield better error coverage. We also present a new way of dealing with the roundoff errors which complicate the check phase of algorithm-based schemes. Our approach is based on error analysis incorporating some simplifications and gives high fault coverage and no false alarms for a large variety of data sets. We report experimental results on the error coverage and performance overhead of our algorithm-based error-detection schemes on an Intel iPSC/2 hypercube multiprocessor. The timing overheads of our error-detecting algorithms over the basic iterative algorithms involving no error detection decrease with increasing problem dimension and become small for large data sizes.

Keywords: algorithm-based fault-tolerance; parallel algorithms; partial differential equations; error analysis; fault injection
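
One way to picture an error-detecting red-black SOR sweep for the Laplace equation is the serial sketch below. The check it uses is an assumption, not necessarily the paper's: because every neighbour of a red point is black (or on the boundary), the sum of the freshly updated points of one colour can be predicted beforehand from a weighted sum of the other colour's values. The fixed relative tolerance, the `fault` hook, and all names are hypothetical.

```python
def make_grid(n):
    """Square grid with the top boundary held at 1.0 and the rest at 0.0."""
    return [[1.0 if i == 0 else 0.0 for _ in range(n)] for i in range(n)]

def sor_with_check(u, omega=1.5, sweeps=10, rel_tol=1e-9, fault=None):
    """Red-black SOR for the Laplace equation with a checksum test per half-sweep.

    u is updated in place; boundary values stay fixed.
    fault = (sweep, i, j, delta) optionally corrupts one interior point
    right after it is updated (a fault-injection hook for this sketch).
    Returns the list of sweep indices at which a check failed.
    """
    n, m = len(u), len(u[0])
    alarms = []
    for s in range(sweeps):
        for colour in (0, 1):
            pts = [(i, j) for i in range(1, n - 1) for j in range(1, m - 1)
                   if (i + j) % 2 == colour]
            # Count how often each neighbouring point feeds an update of this colour.
            weight = {}
            for i, j in pts:
                for p in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    weight[p] = weight.get(p, 0) + 1
            old_sum = sum(u[i][j] for i, j in pts)
            neighbour_sum = sum(w * u[a][b] for (a, b), w in weight.items())
            # Predicted sum of the updated points, obtained by a different
            # computation path than the point-by-point update below.
            predicted = (1 - omega) * old_sum + omega / 4 * neighbour_sum
            for i, j in pts:
                nb = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                u[i][j] = (1 - omega) * u[i][j] + omega / 4 * nb
            if (fault is not None and fault[0] == s
                    and (fault[1] + fault[2]) % 2 == colour):
                u[fault[1]][fault[2]] += fault[3]
            new_sum = sum(u[i][j] for i, j in pts)
            if abs(new_sum - predicted) > rel_tol * (1.0 + abs(predicted)):
                alarms.append(s)
    return alarms

clean_alarms = sor_with_check(make_grid(8))
faulty_alarms = sor_with_check(make_grid(8), fault=(3, 2, 3, 0.5))
```

The fault-free run raises no alarm, while the corrupted point is flagged in the half-sweep where it was written. The fixed `rel_tol` is a placeholder; the abstract's point is that this threshold is better derived from error analysis.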
