Search Results - Inner Mongolia University Library

20th IEEE International Conference on High Performance Computing and Communications (HPCC) / 16th IEEE International Conference on Smart City (SmartCity) / 4th IEEE International Conference on Data Science and Systems (DSS)

Authors: Zhu, Yanchao; Liu, Yi; Li, Mingzhen; Qian, Depei. Affiliation: Beihang Univ, Sino German Joint Software Inst, Beijing, Peoples R China

ISBN: (print) 9781538666142

With the scaling up of high performance computers, resilience has become a big challenge. Among the various software-based fault-tolerant approaches, algorithm-based fault tolerance (ABFT) has some attractive characteristics in the era of exa-scale systems, such as high efficiency and light weight. In particular, because many engineering and scientific applications rely on a few fundamental algorithms, it is possible to provide algorithm-based fault-tolerant mechanisms at a low level and make them application-independent. Previous fault-tolerant mechanisms for matrix computation use row and column checksums, which cannot be directly used in large-scale parallel systems. This paper proposes an algorithm-based fault-tolerant approach for matrix multiplication on large-scale parallel systems. The mechanism uses a block checksum, which not only meets the requirements of matrix computations on large-scale parallel systems but also reduces the overhead of fault tolerance compared with traditional schemes based on row and column checksums. In addition, this paper gives a method for choosing the block size to balance accuracy and efficiency. The complexity analysis and examples demonstrate the effectiveness and feasibility of our approach.

Keywords: algorithm-based fault tolerance; large-scale parallel system; matrix multiplication; matrix computation
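
For orientation, the traditional row and column checksum scheme that the abstract contrasts with can be sketched as follows. This is a minimal illustration of the classic full-checksum product, not the block-checksum mechanism proposed in the paper; all function names and the tolerance value are assumptions made for this sketch.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b one extra entry holding the sum of that row."""
    return [row[:] + [sum(row)] for row in b]


def locate_error(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data element, else None.

    c_full is the full-checksum product: its last column should hold the
    row sums and its last row the column sums of the data part.
    """
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if abs(sum(c_full[i][:n_cols]) - c_full[i][n_cols]) > tol]
    bad_cols = [j for j in range(n_cols)
                if abs(sum(c_full[i][j] for i in range(n_rows))
                       - c_full[n_rows][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(c_full, pos):
    """Repair the element at pos in place from its row checksum."""
    i, j = pos
    n_cols = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(n_cols) if k != j)
    c_full[i][j] = c_full[i][n_cols] - others


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_full = matmul(with_column_checksum(A), with_row_checksum(B))
C_full[0][1] += 10.0             # inject a single fault into the data part
fault = locate_error(C_full)     # the inconsistent row and column intersect here
correct_error(C_full, fault)
```

In this sketch a single wrong element shows up as exactly one inconsistent row and one inconsistent column, which is what makes location and forward correction possible. Note that `locate_error` returns `None` both for a clean result and for error patterns it cannot pin to one element.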

Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under an extended model

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1997, Vol. 8, No. 7, pp. 757-768

Authors: Yajnik, S; Jha, NK. Affiliation: PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544

Reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm-based fault tolerance (ABFT) is a low-cost scheme which provides the required fault tolerance to the system through system-level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required fault tolerance capability. The model considers failures in the processors performing the checking operations.

Keywords: algorithm-based fault tolerance; concurrent error detection; concurrent fault location; randomized algorithms; fault diagnosis; transient faults

MODIFYING REAL CONVOLUTIONAL-CODES FOR PROTECTING DIGITAL FILTERING SYSTEMS

IEEE TRANSACTIONS ON INFORMATION THEORY, 1993, Vol. 39, No. 2, pp. 553-564

Authors: REDINBO, GR; ZAGAR, B. Affiliation: GRAZ TECH UNIV, INST ALLEGEMEINE ELEKTROTECH & ELEKT MESSTECH, A-8010 GRAZ, AUSTRIA

Digital filters when implemented with very dense high-speed electronic devices are susceptible to both temporary and permanent failures, not easily protected by conventional fault-tolerant computer design principles. A new method is presented for protecting the overall realization against both hard and soft errors at the data sample level using the error-detecting properties of real convolutional codes. The normal filter system is surrounded with parallel parity channels defined by a real systematic convolutional code. Erroneous behavior is detected by comparing externally the calculated and regenerated parity samples. A rate k/n real convolutional code produces (n - k) parity samples for every k input samples causing the parity channels to operate at a rate decimated by k. Significant complexity reductions are possible by modifying the code structure, without loss of error protection, yielding simplified parity channels with finite impulse response (FIR) structures with computational rates decimated by k. The code modification procedure determines k nonzero scaling coefficients for weighting respective rows of the convolutional code's generator matrix. A matrix equation involving the code's original parity values and the denominator polynomial of the digital filter's transfer function is formed. Row manipulations separate this equation into two parts, a set of homogeneous equations constraining the modifying scaling coefficients and another set defining the implementation of the code parity values. An annihilator subspace in the dual space related to a condensed matrix defines the family of acceptable scaling coefficients when the code parameter k is less than the number of parity values, (n - k), times the number of poles in the filter. The code modification process has been automated in a computer algorithm. The effects of parity filter quantizations are analyzed and a bound on the mean-square error in the parity comparisons is given.

Keywords: REAL CONVOLUTIONAL CODES; FAULT-TOLERANT PROCESSING; ALGORITHM-BASED FAULT TOLERANCE; PROTECTING DIGITAL FILTERING; ANNIHILATOR SUBSPACES

DIAGNOSABILITY AND DIAGNOSIS OF ALGORITHM-BASED FAULT-TOLERANT SYSTEMS

IEEE TRANSACTIONS ON COMPUTERS, 1993, Vol. 42, No. 8, pp. 924-937

Authors: VINNAKOTA, B; JHA, NK. Affiliation: PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544, USA

Parallel processing architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high throughput and long processing periods. Such characteristics decrease the reliability of high-performance architectures. The erroneous data produced by faulty processors could have damaging consequences, particularly in critical real-time applications. It is therefore desirable that any erroneous data produced by the system be detected and located as quickly as possible. Algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. We apply methods used in the analysis of multiprocessor systems employing system-level diagnosis to the analysis of ABFT systems. A new algorithm to analyze an ABFT system for its fault diagnosability is developed using these methods. Based on this work, a fault diagnosis algorithm is developed for ABFT systems. No such algorithm has been presented previously.

Keywords: ALGORITHM-BASED FAULT TOLERANCE; CHECKSUM ENCODING; CONCURRENT ERROR DETECTION; CONCURRENT FAULT DIAGNOSIS; FAULT DIAGNOSABILITY

Tests and tolerances for high-performance software-implemented fault detection

IEEE TRANSACTIONS ON COMPUTERS, 2003, Vol. 52, No. 5, pp. 579-591

Authors: Turmon, M; Granat, R; Katz, DS; Lou, JZ. Affiliations: Jet Prop Lab, Data Understanding Syst Grp, Pasadena, CA 91109, USA; Jet Prop Lab, Parallel Applicat Technol Grp, Pasadena, CA 91109, USA

We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.

Keywords: algorithm-based fault tolerance; result checking; error analysis; aerospace; parallel numerical algorithms
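
The idea of testing a linear checksum condition against a numerical tolerance can be illustrated with a matrix-vector product. The sketch below is only a rough illustration: the tolerance formula (`safety * n * eps * sum|a_ij||x_j|`) and the names `checksum_test` and `safety` are assumptions chosen for this example, not the tests or thresholds evaluated in the paper.

```python
import sys


def matvec(a, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def checksum_test(a, x, y, safety=10.0):
    """Return True if y passes the linear checksum test for y = a @ x.

    The test compares sum(y) with (column sums of a) . x. The tolerance
    is a heuristic picked for this sketch, scaled by machine epsilon and
    by the magnitude of the terms that enter the sums.
    """
    n = len(x)
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    scale = sum(abs(aij) * abs(xj) for row in a for aij, xj in zip(row, x))
    tol = safety * n * sys.float_info.epsilon * scale
    return abs(sum(y) - expected) <= tol


A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
x = [1.0, 2.0, 3.0]
y = matvec(A, x)
ok_clean = checksum_test(A, x, y)          # rounding noise only

y_faulty = y[:]
y_faulty[1] += 1e-6                        # simulated fault in one output entry
ok_faulty = checksum_test(A, x, y_faulty)
```

A tolerance that is too tight flags ordinary rounding error as a fault, while one that is too loose lets small corruptions through; the `safety` factor here is simply a knob for exploring that trade-off.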

A backward/forward recovery approach for the preconditioned conjugate gradient method

JOURNAL OF COMPUTATIONAL SCIENCE, 2016, Vol. 17, pp. 522-534

Authors: Fasi, Massimiliano; Langou, Julien; Robert, Yves; Ucar, Bora. Affiliations: Univ Manchester, Manchester M13 9PL, Lancs, England; Univ Colorado Denver, Denver, CO, USA; ENS Lyon, Lyon, France; Univ Tennessee, Knoxville, TN, USA; Univ Lyon, INRIA, CNRS, LIP, UMR5668, ENS Lyon, UCBL, Lyon, France

Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen (2013, pp. 167-176) has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c x d iterations. When a silent error is detected by the verification mechanism, one can roll back to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (and no rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations. (C) 2016 Elsevier B.V. All rights reserved.

Keywords: Fault-tolerance; Silent errors; Algorithm-based fault tolerance; Checkpointing; Sparse matrix-vector multiplication; Preconditioned conjugate gradient method
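
The "verify every d iterations, checkpoint every c x d iterations, roll back on detection" control flow described in the abstract can be sketched with a toy iteration. The solver, the invariant used for verification, and every name below (`run_with_verification`, `step`, `verify`, `inject`) are placeholders assumed for illustration; the sketch shows only the rollback path, not the ABFT forward recovery the paper adds.

```python
import copy


def run_with_verification(state, step, verify, n_iters, d, c, inject=None):
    """Run n_iters steps, verifying every d and checkpointing every c*d.

    Returns (final_state, number_of_rollbacks). A failed verification
    restores the last checkpoint and re-executes from there.
    """
    checkpoint = (0, copy.deepcopy(state))
    rollbacks = 0
    it = 0
    while it < n_iters:
        state = step(state)
        it += 1
        if inject is not None:
            state = inject(it, state)      # optional fault injection hook
        if it % d == 0:
            if not verify(state):
                it, state = checkpoint[0], copy.deepcopy(checkpoint[1])
                rollbacks += 1
                continue
            if it % (c * d) == 0:          # only verified states are saved
                checkpoint = (it, copy.deepcopy(state))
    return state, rollbacks


def step(s):
    """Toy iteration: the invariant value == 2 * count holds when fault-free."""
    return {"value": s["value"] + 2, "count": s["count"] + 1}


def verify(s):
    return s["value"] == 2 * s["count"]


fired = []


def inject(it, s):
    """Corrupt the state once, the first time iteration 7 is executed."""
    if it == 7 and not fired:
        fired.append(it)
        return {"value": s["value"] + 100, "count": s["count"]}
    return s


final, rollbacks = run_with_verification(
    {"value": 0, "count": 0}, step, verify, n_iters=12, d=2, c=2, inject=inject)
```

In this run the corruption at iteration 7 is caught by the verification at iteration 8, and execution resumes from the checkpoint taken at iteration 4. If `n_iters` is not a multiple of `d`, the trailing iterations go unverified in this sketch.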

FT-PBLAS: PBLAS-based fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

IEEE ACCESS, 2020, Vol. 8, pp. 42674-42688

Authors: Zhu, Yanchao; Liu, Yi; Zhang, Guozhen. Affiliations: Beihang Univ, Sch Comp Sci & Engn, Sino German Joint Software Inst, Beijing 100191, Peoples R China; Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China

As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.

Keywords: Fault tolerance; Fault tolerant systems; Libraries; Linear algebra; Computational modeling; Software; Integrated circuits; algorithm-based fault tolerance; HPC systems; node failure; matrix multiplication; linear algebra computations

Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, Vol. 35, No. 7, pp. 1307-1319

Authors: Loreti, Daniela; Artioli, Marcello; Ciampolini, Anna. Affiliations: Univ Bologna, Dept Comp Sci & Engn, I-40126 Bologna, Italy; Italian Natl Agcy New Technol Energy & Sustainable, I-40129 Bologna, Italy

The scale of today's High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their relatively limited reliability. Over the last decade, specific areas of the HPC research field have addressed the issue at different levels, by enriching the infrastructure, the platforms, or the algorithms with fault tolerance features. In this work, we focus on the rather pervasive task of computing the solution of a dense, unstructured linear system and we propose an algorithm-based technique to obtain fault tolerance to multiple anywhere-located faults during the parallel computation. We particularly study the ways to boost the performance of the rollback-free recovery, and we provide an extensive evaluation of our technique with respect to other state-of-the-art algorithm-based methods.

Keywords: Fault tolerant systems; Fault tolerance; Linear systems; Circuit faults; Vectors; Program processors; Task analysis; Rollback-free recovery; algorithm-based fault tolerance; high performance computing; linear systems solver

Adaptive control in roll-forward recovery for extreme scale multigrid

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2019, Vol. 33, No. 5, pp. 817-837

Authors: Huber, Markus; Ruede, Ulrich; Wohlmuth, Barbara. Affiliations: Tech Univ Munich, Bolzmannstr 3, D-85748 Munich, Germany; Friedrich Alexander Univ Nurnberg Erlangen, Erlangen, Germany; CERFACS, Parallel Algorithms Project, Toulouse, France

With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in both the faulty and the healthy subdomains must be coordinated in a sensitive way; in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The recoupling process is steered by local contributions of the error estimator before the fault. Failure scenarios when solving up to 6.9 × 10^11 unknowns on more than 245,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method.

Keywords: algorithm-based fault tolerance; high-performance computing; multigrid methods; error estimator; adaptive recovery

EVALUATION AND COMPARISON OF FAULT-TOLERANT SOFTWARE TECHNIQUES

IEEE TRANSACTIONS ON RELIABILITY, 1993, Vol. 42, No. 2, pp. 190-204

Authors: HUDAK, J; SUH, BH; SIEWIOREK, D; SEGALL, Z. Affiliation: CARNEGIE MELLON UNIV, DEPT ELECT & COMP ENGN, PITTSBURGH, PA 15213

Various fault-tolerant software techniques have been proposed in order to meet the reliability requirements of critical systems. This paper evaluates 4 implementations of fault-tolerant software techniques with respect to hardware and design faults. Project participants were divided into 4 groups, each of which developed fault-tolerant software based on a common specification. Each group applied one of the following techniques: n-version programming, recovery block, concurrent error-detection, and algorithm-based fault tolerance. Independent testing and modeling groups within the project then thoroughly analyzed the fault-tolerant software. Using fault-injection tools, the testing group subjected the fault-tolerant software to simulated design and hardware faults. Simulated design-faults included control flow, array boundary, computational, and post/pre increment/decrement software mutations. Simulated hardware-faults included code and data corruption. Data collected from the fault-injection experiment were then mapped into a discrete-time Markov model developed by the modeling group. Based on this model, the effectiveness of each implementation of the fault-tolerant software technique with respect to availability, correctness, and time to failure given an error, is contrasted with measured data. Finally, the model is analyzed with respect to additional figures of merit identified during the modeling process, and the techniques are ranked using an application taxonomy.

Keywords: FAULT-TOLERANT SOFTWARE; N-VERSION PROGRAMMING; RECOVERY BLOCK; CONCURRENT ERROR-DETECTION; ALGORITHM-BASED FAULT TOLERANCE; MARKOV MODEL; FAULT-INJECTION TESTING
