With the scaling up of high performance computers, resilience has become a big challenge. Among various kinds of software-basedfault-tolerant approaches, the algorithm-based fault tolerance (ABFT) has some attractive...
详细信息
ISBN:
(纸本)9781538666142
With the scaling up of high performance computers, resilience has become a big challenge. Among various kinds of software-basedfault-tolerant approaches, the algorithm-based fault tolerance (ABFT) has some attractive characteristics in the era of exa-scale systems, such as high efficiency and light-weight. In particular, considering that many engineering and scientific applications rely on some fundamental algorithms, it is possible to provide algorithm-basedfault-tolerant mechanisms in low level and make it application-independent. Previous fault-tolerant mechanisms for matrix computation use row and column checksums, which cannot be directly used in large-scale parallel systems. This paper proposes an algorithm-basedfault tolerant approach for matrix multiplication on large-scale parallel systems. The mechanism uses block-checksum which not only meets the requirement of matrix computations on large-scale parallel systems but also reduces the overhead of fault-tolerance compared to traditional schemes based on row and column checksums. In addition, this paper gives method for choosing the size of blocks to achieve balance between accuracy and efficiency. The complexity analysis and examples demonstrate effectiveness and feasibility of our approach.
Reliability of compute-intensive applications can be improved by introducing faulttolerance into the system. algorithm-based fault tolerance (ABFT) is a low-cost scheme which provides the required faulttolerance to ...
详细信息
Reliability of compute-intensive applications can be improved by introducing faulttolerance into the system. algorithm-based fault tolerance (ABFT) is a low-cost scheme which provides the required faulttolerance to the system through system level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required faulttolerance capability. The model considers failures in the processors performing the checking operations.
Digital filters when implemented with very dense high-speed electronic devices are susceptible to both temporary and permanent failures, not easily protected by conventional fault-tolerant computer design principles. ...
详细信息
Digital filters when implemented with very dense high-speed electronic devices are susceptible to both temporary and permanent failures, not easily protected by conventional fault-tolerant computer design principles. A new method is presented for protecting the overall realization against both hard and soft errors at the data sample level using the error-detecting properties of real convolutional codes. The normal filter system is surrounded with parallel parity channels defined by a real systematic convolutional code. Erroneous behavior is detected by comparing externally the calculated and regenerated parity samples. A rate k/n real convolutional code produces (n - k) parity samples for every k input samples causing the parity channels to operate at a rate decimated by k. Significant complexity reductions are possible by modifying the code structure, without loss of error protection, yielding simplified parity channels with finite impulse response (FIR) structures with computational rates decimated by k. The code modification procedure determines k nonzero scaling coefficients for weighting respective rows of the convolutional code's generator matrix. A matrix equation involving the code's original parity values and the denominator polynomial of the digital filter's transfer function is formed. Row manipulations separate this equation into two parts, a set of homogeneous equations constraining the modifying scaling coefficients and another set defining the implementation of the code parity values. An annihilator subspace in the dual space related to a condensed matrix defines the family of acceptable scaling coefficients when the code parameter k is less then the number of parity values, (n - k), times the number of poles in the filter. The code modification process has been automated in a computer algorithm. The effects of parity filter quantizations are analyzed and a bound on the mean-square error in the parity comparisons is given.
Parallel processing architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high throughput and long processing periods. Such cha...
详细信息
Parallel processing architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high throughput and long processing periods. Such characteristics decrease the reliability of high-performance architectures. The erroneous data produced by faulty processors could have damaging consequences, particularly in critical real-time applications. It is therefore desirable that any erroneous data produced by the system be detected and located as quickly as possible. algorithm-based fault tolerance (ABFT) is a low-cost system-level concurrent error detection and fault location scheme. We apply methods used in the analysis of multiprocessor systems employing system-level diagnosis to the analysis of ABFT systems. A new algorithm to analyze an ABFT system for its fault diagnosability is developed using these methods. based on this work, a fault diagnosis algorithm is developed for ABFT systems. No such algorithm has been presented previously.
We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event ups...
详细信息
We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form;the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen (2013, pp. 167-176) has shown how to combine such a verification mechanism (a stability test c...
详细信息
Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen (2013, pp. 167-176) has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every d iterations, and to checkpoint every c x d iterations. When a silent error is detected by the verification mechanism, one can rollback to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (and no rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations. (C) 2016 Elsevier B.V. All rights reserved.
As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popula...
详细信息
As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face challenges of scalability in the era of peta- and exa-scale systems due to their numerous processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving faulttolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems;e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a scheme for addressing node failure, respectively. We evaluate two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.
The scale of nowadays High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their relatively limited reliability. Over the las...
详细信息
The scale of nowadays High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their relatively limited reliability. Over the last decade, specific areas of the High Performance Computing (HPC) research field have addressed the issue at different levels, by enriching the infrastructure, the platforms, or the algorithms with faulttolerance features. In this work, we focus on the rather-pervasive task of computing the solution of a dense, unstructured linear system and we propose an algorithm-based technique to obtain faulttolerance to multiple anywhere-located faults during the parallel computation. We particularly study the ways to boost the performance of the rollback-free recovery, and we provide an extensive evaluation of our technique w.r.t. to other state-of-the-art algorithm-based methods.
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently...
详细信息
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in both the faulty and the healthy subdomains must be coordinated in a sensitive way, in particular, both under- and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The recoupling process is steered by local contributions of the error estimator before the fault. Failure scenarios when solving up to 6.9 x 10(11) unknowns on more than 245,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method.
Various fault-tolerant software techniques have been proposed in order to meet the reliability requirements of critical systems. This paper evaluates 4 implementations of fault-tolerant software techniques with respec...
详细信息
Various fault-tolerant software techniques have been proposed in order to meet the reliability requirements of critical systems. This paper evaluates 4 implementations of fault-tolerant software techniques with respect to hardware and design faults. Project participants were divided into 4 groups, each of which developed fault-tolerant software based on a common specification. Each group applied one of the following techniques: n-version programming, recovery block, concurrent error-detection, and algorithm-based fault tolerance. Independent testing and modeling groups within the project then thoroughly analyzed the fault-tolerant software. Using fault-injection tools, the testing group subjected the fault-tolerant software to simulated design and hardware faults. Simulated design-faults included control flow, array boundary, computational, and post/pre increment/decrement software mutations. Simulated hardware-faults included code and data corruption. Data collected from the fault-injection experiment were then mapped into a discrete-time Markov model developed by the modeling group. based on this model, the effectiveness of each implementation of the fault-tolerant software technique with respect to availability, correctness, and time to failure given an error, is contrasted with measured data. Finally, the model is analyzed with respect to additional figures of merit identified during the modeling process, and the techniques are ranked using an application taxonomy.
暂无评论