Search results - Inner Mongolia University Library

PROBABILISTIC EVALUATION OF ONLINE CHECKS IN FAULT-TOLERANT MULTIPROCESSOR SYSTEMS

IEEE TRANSACTIONS ON COMPUTERS, 1992, vol. 41, no. 5, pp. 532-541

Authors: NAIR, VSS; HOSKOTE, YV; ABRAHAM, JA (UNIV TEXAS, COMP ENGN RES CTR, AUSTIN, TX 78758, USA)

The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches are based on worst-case assumptions related to the failure of system components. Often, the evaluation results do not reflect the actual fault tolerance capabilities of the system. In this paper, we develop a probabilistic approach to evaluate the fault detecting and locating capabilities of on-line checks in a system. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model [1]. Based on these probabilistic matrices, estimates for the fault tolerance capabilities of various systems are derived analytically.

Keywords: ALGORITHM-BASED FAULT TOLERANCE; CONCURRENT ERROR DETECTION; FAULT COVERAGE; LOCATABILITY; PROBABILISTIC TECHNIQUES

DESIGN OF ALGORITHM-BASED FAULT-TOLERANT MULTIPROCESSOR SYSTEMS FOR CONCURRENT ERROR-DETECTION AND FAULT-DIAGNOSIS

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1994, vol. 5, no. 10, pp. 1099-1106

Authors: VINNAKOTA, B; JHA, NK (PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544)

Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. In this short note, we present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.

Keywords: ALGORITHM-BASED FAULT TOLERANCE; CONCURRENT ERROR DETECTION; FAULT DETECTABILITY; FAULT DIAGNOSABILITY; SYSTEM-LEVEL FAULT TOLERANCE

OPTIMAL-DESIGN OF CHECKS FOR ERROR-DETECTION AND LOCATION IN FAULT-TOLERANT MULTIPROCESSOR SYSTEMS

IEEE TRANSACTIONS ON COMPUTERS, 1993, vol. 42, no. 7, pp. 780-793

Authors: SITARAMAN, RK; JHA, NK (PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544)

Designing checks to detect or locate errors in the data plays an important role in the design of fault-tolerant systems. Recently, the problem of synthesizing the data-check (DC) relationship has received a lot of attention in the context of a natural paradigm for concurrent error detection/location known as algorithm-based fault tolerance (ABFT). Banerjee and Abraham have shown that an ABFT scheme can be modeled as a tripartite graph consisting of processors (P), data (D), and checks (C). Any technique for designing ABFT systems requires a procedure for synthesizing a DC relationship, which not only has a low overhead but also has all the properties required by the designer. The main contribution of this work is to propose a simple and novel algorithm called RANDGEN to generate DC graphs. This synthesis approach itself is very fast and can be fully parallelized. By simply varying its parameters, the same algorithm RANDGEN can produce DC graphs with a wide spectrum of properties, many of which have been considered very important in recent ABFT designs. RANDGEN produces s-error-detectable DC graphs with asymptotically the least number of checks for the first time. RANDGEN can also produce s-error-locatable DC graphs using only a small number of checks. This is the first general procedure for producing error-locatable graphs for any value of s. Another important outstanding problem in DC graph design is providing fast and practical methods for actually locating the errors in the data from the output pattern at the checks. We show that RANDGEN can be used to design DC graphs, which permit easy diagnosis, again with a small number of checks. It has been pointed out previously that "uniform" checks may simplify the design of the ABFT system. We show how RANDGEN can be modified very simply to produce uniform s-error-detectable/locatable DC graphs. Finally, we show how one can generalize these results to synthesize strictly s-error-detectable/locatable DC graphs which ca

Keywords: ALGORITHM-BASED FAULT TOLERANCE; CHECKSUM ENCODING; CONCURRENT ERROR DETECTION; CONCURRENT ERROR LOCATION; MAJORITY DIAGNOSIS; RANDOMIZED ALGORITHMS; UNIFORM CHECKS

An algorithm-based error detection scheme for the multigrid method

IEEE TRANSACTIONS ON COMPUTERS, 2003, vol. 52, no. 9, pp. 1089-1099

Authors: Mishra, A; Banerjee, P (Virginia Polytech Inst & State Univ, Bradley Dept Elect & Comp Engn, Blacksburg, VA 24061, USA; Northwestern Univ, Dept Elect & Comp Engn, Ctr Parallel & Distributed Comp, Evanston, IL 60208, USA)

Algorithm-based fault tolerance (ABFT) is a technique to provide system level error detection and correction on array processors as well as multiprocessors at a low cost. Since the early 80s the technique has been extensively applied to several linear algebraic algorithms, e.g., matrix multiplication, Gaussian elimination, QR factorization, and singular value decompositions. An important class of problems in numerical linear algebra dealing with the iterative solution of linear algebraic equations arising due to the finite difference discretization or the finite element discretization of a partial differential equation, however, has been overlooked. The only exception is the recent application of algorithm based error detection (ABED) encodings to the successive overrelaxation algorithm for Laplace's equation [11]. In this paper, ABED is applied to a multigrid algorithm for the iterative solution of a Poisson equation in two dimensions. Invariants are created to implement checking in the relaxation, the restriction, and the interpolation operators. Modifications to invariants due to roundoff errors accumulated within the operators, which often lead to a situation known as false alarms, have been addressed by deriving the expressions for the roundoff errors in the algebraic processes in the operators and correcting the invariants accordingly. The ABED-encoded multigrid algorithm is shown to be insensitive to the size and the range of the input data besides providing excellent error coverage at a low latency for floating-point, integer, and memory errors.

Keywords: algorithm-based fault tolerance; multigrid method; rounding error analysis; parallel error detection; partial differential equations

A WELL CONDITIONED CHECKSUM SCHEME FOR ALGORITHMIC FAULT TOLERANCE

INTEGRATION-THE VLSI JOURNAL, 1991, vol. 12, no. 1, pp. 21-32

Authors: BOLEY, DL; LUK, FT (CORNELL UNIV, SCH ELECT ENGN, ITHACA, NY 14853, USA)

The weighted checksum scheme has been proposed as a low-cost fault tolerant procedure for parallel matrix computations. To guarantee multiple error detection and correction, the chosen weight vectors must satisfy some very specific properties about linear independence. However, previous weight generating methods that fulfill the independence criteria have troubles with numerical overflow. We will present a new scheme that generates weight vectors via Chebyshev polynomials to meet the requirements about independence and to avoid the difficulties with overflow.

Keywords: ALGORITHM-BASED FAULT TOLERANCE; CHEBYSHEV POLYNOMIALS; LANCZOS ALGORITHM; BERLEKAMP-MASSEY ALGORITHM
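
As a rough illustration of the weighted checksum idea this abstract refers to, the sketch below protects a vector of integers with two checksums and uses their syndromes to locate and correct a single corrupted element. It assumes the simple linear weights 1, 2, ..., n and exact integer arithmetic; it does not reproduce the Chebyshev-polynomial weights proposed in the paper, and the names `encode` and `locate_and_correct` are hypothetical.

```python
def encode(data):
    """Return (plain checksum, weighted checksum) for a list of integers."""
    plain = sum(data)
    # Assumed weights 1, 2, ..., n (illustrative, not the paper's choice).
    weighted = sum((i + 1) * x for i, x in enumerate(data))
    return plain, weighted


def locate_and_correct(received, plain, weighted):
    """Detect, locate, and correct at most one corrupted data element.

    Returns (corrected_list, error_index); error_index is None when both
    syndromes are zero. The checksums themselves are assumed error-free.
    """
    s1 = sum(received) - plain
    s2 = sum((i + 1) * x for i, x in enumerate(received)) - weighted
    if s1 == 0 and s2 == 0:
        return list(received), None
    # A single error e at index k gives s1 = e and s2 = (k + 1) * e.
    if s1 == 0 or s2 % s1 != 0:
        raise ValueError("syndromes are inconsistent with a single data error")
    index = s2 // s1 - 1
    if not 0 <= index < len(received):
        raise ValueError("located index is out of range")
    corrected = list(received)
    corrected[index] -= s1
    return corrected, index
```

With larger weights (for example powers of a base) the weighted sum grows quickly, which is the kind of overflow difficulty the abstract mentions; floating-point data would also need a tolerance instead of exact comparisons.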

Fault-tolerant QRD recursive least squares

IEE PROCEEDINGS-COMPUTERS AND DIGITAL TECHNIQUES, 1996, vol. 143, no. 2, pp. 137-144

Authors: Connolly, MP; Fitzpatrick, P (National Microelectronics Research Centre, Cork, Ireland)

The authors present an algorithm-based fault tolerant scheme for recursive least squares, appropriate for applications in adaptive signal processing. The technique is closely focused on the Gentleman-Kung-McWhirter triangular systolic array architecture for QR decomposition. Assuming that the array is subject to transient faults, widely separated in time and each affecting a single processor, an algorithm is given that corrects the full triangular array with computational overhead equivalent, on average, to the interpolation of a single extra vector into the data stream. No output residuals are lost in the fault recovery. The analysis is extended to a fault-tolerant algorithm for linearly constrained QR decomposition.

Keywords: algorithm-based fault tolerance; error correction; QR decomposition; adaptive filtering; linearly constrained QRD

Fault tolerance design in JPEG 2000 image compression system

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2005, vol. 2, no. 1, pp. 57-75

Authors: Nguyen, C; Redinbo, GR (Univ Calif Davis, Dept Elect & Comp Engn, Davis, CA 95616, USA)

The JPEG 2000 image compression standard is designed for a broad range of data compression applications. The new standard is based on wavelet technology and layered coding in order to provide a rich feature compressed image stream. The implementations of the JPEG 2000 codec are susceptible to computer-induced soft errors. One situation requiring fault tolerance is remote-sensing satellites, where high-energy particles and radiation produce single event upsets corrupting the highly susceptible data compression operations. This paper develops fault tolerance error-detecting capabilities for the major subsystems that constitute a JPEG 2000 standard. The nature of the subsystem dictates the realistic fault model where some parts have numerical error impacts whereas others are properly modeled using bit-level variables. The critical operations of subunits such as Discrete Wavelet Transform (DWT) and quantization are protected against numerical errors. Concurrent error detection techniques are applied to accommodate the data type and numerical operations in each processing unit. On the other hand, the Embedded Block Coding with Optimal Truncation (EBCOT) system and the bitstream formation unit are protected against soft-error effects using binary decision variables and cyclic redundancy check (CRC) parity values, respectively. The techniques achieve excellent error-detecting capability at only a slight increase in complexity. The design strategies have been tested using Matlab programs and simulation results are presented.

Keywords: fault-tolerant source coding; soft errors; JPEG 2000 standard; data compression; Discrete Wavelet Transform (DWT); algorithm-based fault tolerance; error control codes; Huffman coding; error-checking; concurrent error detection; hardware reliability; weighted sum parity

Low-Cost Online Convolution Checksum Checker

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2022, vol. 30, no. 2, pp. 201-212

Authors: Filippas, Dionysios; Margomenos, Nikolaos; Mitianoudis, Nikolaos; Nicopoulos, Chrysostomos; Dimitrakopoulos, Giorgos (Democritus Univ Thrace, Dept Elect & Comp Engn, Xanthi 67100, Greece; Univ Cyprus, Dept Elect & Comp Engn, CY-1678 Nicosia, Cyprus)

Managing random hardware faults requires the faults to be detected online, thus simplifying recovery. Algorithm-based fault tolerance has been proposed as a low-cost mechanism to check online the result of computations against random hardware failures. In this case, the checksum of the actual result is checked against a predicted checksum computed in parallel by a hardware checker. In this work, we target the design of such checkers for convolution engines that are currently the most critical building block in image processing and computer vision applications. The proposed convolution checksum checker, named ConvGuard, utilizes a newly introduced invariance condition of convolution to predict implicitly the output checksum using only the pixels at the border of the input image. In this way, ConvGuard reduces the power required for accumulating the input pixels without requiring large buffers to hold intermediate checksum results. The design of ConvGuard is generic and can be configured for different output sizes and strides. The experimental results show that ConvGuard utilizes only a small percentage of the area/power of an efficient convolution engine while being significantly smaller and more power efficient than a state-of-the-art checksum checker for various practical cases.

Keywords: Convolution; Hardware; Engines; Computer architecture; fault tolerant systems; fault tolerance; Convolutional neural networks; algorithm-based fault tolerance; convolution; error detection; reliability
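
The general principle in this abstract, comparing the checksum of the actual output against a checksum predicted from the inputs, can be sketched in a few lines. The example below is not ConvGuard's border-pixel method; it uses a plainer identity (the sum of all outputs equals each kernel weight times the sum of the input window that weight sees) and assumes a stride-1 "valid" convolution without kernel flipping, as is common in CNN engines. All function names are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Stride-1 'valid' 2D convolution (no kernel flip) on lists of lists."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)]
            for i in range(oh)]


def predicted_checksum(image, kernel):
    """Predict the sum of all outputs from the inputs alone."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    total = 0
    for a in range(kh):
        for b in range(kw):
            # Sum of the input pixels that kernel weight (a, b) multiplies.
            window_sum = sum(image[i + a][j + b]
                             for i in range(oh) for j in range(ow))
            total += kernel[a][b] * window_sum
    return total


def output_checksum(output):
    """Checksum of the actual result: the sum of all output pixels."""
    return sum(sum(row) for row in output)
```

A checker would flag an error whenever `output_checksum(conv2d_valid(image, kernel))` differs from `predicted_checksum(image, kernel)`. With integer pixels the comparison is exact; a floating-point engine would need a tolerance.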

FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, vol. 34, no. 12, pp. 3207-3223

Authors: Zhai, Yujia; Giem, Elisabeth; Zhao, Kai; Liu, Jinyang; Huang, Jiajun; Wong, Bryan M.; Shelton, Christian R.; Chen, Zizhong (Univ Calif Riverside, Riverside, CA 92521, USA; Univ Alabama Birmingham, Birmingham, AL 35294, USA)

Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this article, we present a new BLAS implementation, FT-BLAS, that provides performance comparable to or faster than state-of-the-art BLAS libraries, while being capable of tolerating soft errors on-the-fly. At the algorithmic level, we propose a hybrid strategy to incorporate fault-tolerant functionality. For memory-bound Level-1 and Level-2 BLAS routines, we duplicate computing instructions and re-use data at the register level to avoid memory overhead when validating the runtime correctness. Here we novelly propose to utilize mask registers on AVX512-enabled processors and SIMD registers on AVX2-enabled processors to store intermediate comparison results. For compute-bound Level-3 BLAS routines, we fuse memory-intensive operations such as checksum encoding and verification into the GEMM assembly kernels to optimize the memory footprint. We also design cache-friendly parallel algorithms for our fault-tolerant library. Through a series of architectural-aware optimizations, we manage to maintain the fault-tolerant overhead at a negligible order (<3%). Experimental results obtained on widely-used processors such as Intel Skylake, Intel Cascade Lake, and AMD Zen2 demonstrate that FT-BLAS offers high reliability and high performance - faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for both serial and parallel routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.

Keywords: BLAS; SIMD; assembly optimization; dual modular redundancy; algorithm-based fault tolerance; AVX-512; AVX2; OpenMP; parallel algorithm
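
The "checksum encoding and verification" step mentioned for Level-3 routines can be illustrated with the textbook row/column checksum scheme for matrix multiplication. The sketch below is plain Python on small integer matrices, with hypothetical function names; it says nothing about how FT-BLAS fuses these steps into its GEMM assembly kernels.

```python
def matmul(a, b):
    """Naive matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append a column holding the sum of each row of b."""
    return [row[:] + [sum(row)] for row in b]


def verify(c_full):
    """Return (bad_rows, bad_cols): data rows/columns whose checksum disagrees."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols
```

Multiplying the encoded operands, `matmul(add_column_checksum(a), add_row_checksum(b))`, yields the product with its own checksum row and column attached. A single corrupted element then shows up as exactly one bad row and one bad column, whose intersection locates it. Floating-point data would require a tolerance rather than exact equality.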

SYNTHESIS OF ALGORITHM-BASED FAULT-TOLERANT SYSTEMS FOR DEPENDENCE GRAPHS

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1993, vol. 4, no. 8, pp. 864-874

Authors: VINNAKOTA, B; JHA, NK (PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544)

Algorithm-based fault tolerance (ABFT) is a scheme to improve the reliability of parallel architectures used for computation-intensive tasks. The exact implementation of an ABFT scheme is algorithm-dependent. ABFT systems have very low overhead compared to other fault tolerance schemes with similar benefits. Few results are available in the area of general synthesis of ABFT systems. A two-stage approach to the synthesis of ABFT systems is proposed. In the first stage a system-level code is chosen to encode the data used in the algorithm. In the second stage the optimal architecture to implement the scheme is chosen using dependence graphs. Dependence graphs are a graph-theoretic form of algorithm representation. We demonstrate that not all architectures are ideal for the implementation of a particular ABFT scheme. We propose new measures to characterize the fault tolerance capability of a system to better exploit the proposed synthesis method. Dependence graphs can also be used for the synthesis of ABFT schemes for non-linear problems. An example of a fault-tolerant median filter is provided to illustrate their utility for such problems.

Keywords: ALGORITHM-BASED FAULT TOLERANCE; CHECKSUM ENCODING; CONCURRENT ERROR DETECTION; DEPENDENCE GRAPHS; FAULT DETECTABILITY; FAULT LOCATABILITY; SYSTEM SYNTHESIS FOR FAULT TOLERANCE
