Convolutional neural networks (CNNs) are increasingly important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Ensuring the stability of the CNN inference process against soft errors is therefore of critical importance. Traditional fault-tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementation. (2) We design a novel workflow integrating all the proposed schemes to obtain high detection/correction ability with limited total runtime overhead. (3) We evaluate our approach on ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation handles soft errors with very limited runtime overhead (4% to 8% in both error-free and error-injected situations).
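To make the checksum idea concrete, here is a minimal sketch of one way a filter-level checksum can protect a convolution layer, exploiting the linearity of convolution in the filter; it is an illustration of the general technique, not the paper's actual schemes, and the shapes and tolerance are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D cross-correlation (the CNN convention)."""
    h, w = kernel.shape
    out = np.empty((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+h, j:j+w] * kernel)
    return out

def checked_conv_layer(image, filters, tol=1e-8):
    """Run one conv layer and verify it with a filter checksum.

    Convolution is linear in the filter, so the sum of all output
    feature maps must equal one extra convolution with the summed
    filter; a mismatch flags a soft error in the computation.
    """
    outputs = [conv2d(image, f) for f in filters]   # protected work
    checksum_out = conv2d(image, sum(filters))      # redundant checksum pass
    if not np.allclose(sum(outputs), checksum_out, atol=tol):
        raise RuntimeError("soft error detected in convolution layer")
    return outputs

image = np.random.rand(8, 8)
filters = [np.random.rand(3, 3) for _ in range(4)]
outputs = checked_conv_layer(image, filters)
```

Note that the checksum pass costs one extra single-filter convolution regardless of the number of filters, which is why such schemes can stay within a few percent of overhead.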
Soft errors in scientific computing applications are becoming inevitable with the ever-increasing system scale and execution time, and with new technologies featuring increased transistor density and lower voltage. Soft errors can be classified mainly into two categories: bit-flipping errors (e.g., 1 becomes -1) in random access memory, and computation errors (e.g., 1+1=3) in floating-point units. Traditionally, bit-flipping errors are handled by the Error Correcting Code (ECC) technique, and computation errors are dealt with by the Triple Modular Redundancy (TMR) method. Note that ECC cannot handle computation errors, while TMR cannot deal with bit-flipping errors and is not efficient at handling computation errors. To handle both computation and bit-flipping errors in matrix operations uniformly and efficiently, the algorithm-based fault tolerance (ABFT) method was developed. This paper focuses on the detection of soft errors in the LU Decomposition with Partial Pivoting (LUPP) algorithm, which is widely used in scientific computing applications. First, this paper notes that existing ABFT methods are not adequate to detect soft errors in LUPP in terms of time or space. We then propose a new ABFT algorithm that detects soft errors in LUPP both flexibly in time and comprehensively in space. Flexible in time means that soft errors can be detected during the execution instead of only at the end of LUPP, while comprehensive in space indicates that all of the elements in the data matrices (L and U) are covered. To show the feasibility and efficiency of the proposed algorithm, this paper incorporates it into the implementation of LUPP in the widely used High Performance Linpack (HPL) benchmark. Experimental results verify the feasibility of this algorithm: for soft errors injected at various timings and into different elements in LUPP, the algorithm detected most of the injected errors, covering all of the errors that cannot pass the re…
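A minimal sketch of the underlying invariant, assuming a toy dense Gaussian elimination rather than the paper's HPL-integrated algorithm: a checksum column appended to A is preserved by every row operation (including pivoting swaps), so it can be re-verified at any step, which is what "flexible in time" refers to. The tolerance is illustrative; a production check would use a rounding-error bound:

```python
import numpy as np

def lu_with_checksum(A, tol=1e-8):
    n = A.shape[0]
    # Append a checksum column: each row's entry is that row's sum.
    M = np.hstack([A.astype(float), A.sum(axis=1, keepdims=True)])
    for k in range(n):
        p = k + np.argmax(np.abs(M[k:, k]))   # partial pivoting
        M[[k, p]] = M[[p, k]]
        for i in range(k + 1, n):
            M[i, k:] -= (M[i, k] / M[k, k]) * M[k, k:]
        # Row operations are linear, so the invariant "checksum column
        # equals the row sums" survives every step; check it *during*
        # execution, not only at the end.
        residual = np.abs(M[:, :n].sum(axis=1) - M[:, n]).max()
        if residual > tol:
            raise RuntimeError(f"soft error detected at elimination step {k}")
    return np.triu(M[:, :n])   # the U factor

U = lu_with_checksum(np.random.rand(6, 6))
```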
ISBN (Print): 9781479922338
Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of software-based fault tolerance is attractive. Algorithm-based fault tolerance (ABFT) protects important scientific operations like matrix multiplications. However, applying it to floating-point operations necessitates the runtime classification of errors into inevitable rounding errors, tolerable compute errors on the order of such rounding errors, and critical errors that are larger than those and not tolerable. Hence, an ABFT scheme needs suitable rounding error bounds to detect errors reliably. Determining such error bounds is a highly challenging task, especially since it has to be integrated tightly into the algorithm and executed autonomously with low performance overhead. In this work, A-ABFT for matrix multiplications on GPUs is introduced: a new, parallel ABFT scheme that determines rounding error bounds autonomously at runtime with low performance overhead and high error coverage.
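A minimal sketch of the error-classification problem this solves, using the textbook worst-case bound on |A||B| rather than the paper's autonomous GPU scheme; the slack factor in `gamma` is a coarse simplification I introduce for illustration:

```python
import numpy as np

def abft_gemm(A, B):
    """C = A @ B with a column-checksum test against a rounding bound."""
    m, n = A.shape
    C = A @ B                                        # protected computation
    # Column checksums: e^T C should equal (e^T A) B up to rounding.
    diff = np.abs(C.sum(axis=0) - A.sum(axis=0) @ B)
    # Coarse per-column rounding bound separating harmless floating-point
    # noise from critical errors (illustrative; not A-ABFT's bound).
    eps = np.finfo(C.dtype).eps
    gamma = 2 * (m + n) * eps / (1 - (m + n) * eps)
    bound = gamma * (np.abs(A).sum(axis=0) @ np.abs(B))
    if np.any(diff > bound):
        raise RuntimeError("critical error detected (exceeds rounding bound)")
    return C

C = abft_gemm(np.random.rand(64, 64), np.random.rand(64, 64))
```

A fixed absolute tolerance would either mask real faults on large matrices or flag pure rounding noise on small ones, which is why the bound must scale with the data as above.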
ISBN (Print): 9781450323789
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish a theoretical proof of the correctness and numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an algorithm-based fault tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
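A minimal sketch of the diskless-checkpointing half of such a hybrid scheme, assuming a toy in-memory parity over finished panels (the panel layout is illustrative, not ScaLAPACK's): losing any single panel can be repaired from the parity and the survivors, with no rollback and no disk I/O.

```python
import numpy as np

panels = [np.random.rand(4, 4) for _ in range(5)]   # finished panels
parity = np.sum(panels, axis=0)                     # checkpoint held in memory

# Suppose panel 2 is lost to a fault: rebuild it from the parity and
# the surviving panels.
lost = 2
recovered = parity - sum(p for i, p in enumerate(panels) if i != lost)
assert np.allclose(recovered, panels[lost])
```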
ISBN (Print): 9781450323789
Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by both ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view spanning both software and hardware, with the goal of improving the performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% of system energy (and up to 40% of dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.
The scale of today's High Performance Computing (HPC) systems is the key element that enables their impressive performance, as well as the reason for their relatively limited reliability. Over the last decade, specific areas of the HPC research field have addressed the issue at different levels, by enriching the infrastructure, the platforms, or the algorithms with fault-tolerance features. In this work, we focus on the pervasive task of computing the solution of a dense, unstructured linear system, and we propose an algorithm-based technique to tolerate multiple anywhere-located faults during the parallel computation. We particularly study ways to boost the performance of the rollback-free recovery, and we provide an extensive evaluation of our technique with respect to other state-of-the-art algorithm-based methods.
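A minimal sketch of what rollback-free recovery means in the checksum setting, shown in the classic Huang-Abraham style for a single faulty element rather than the paper's multiple anywhere-located faults: row and column checksums locate the corrupted entry, and the residual magnitude repairs it in place, so no recomputation is needed.

```python
import numpy as np

def correct_single_error(C, row_cs, col_cs, tol=1e-8):
    """row_cs = e^T C and col_cs = C e, computed independently from the
    encoded inputs. Locate one corrupted entry of C and repair it."""
    r = C.sum(axis=1) - col_cs          # nonzero row residual -> row i
    c = C.sum(axis=0) - row_cs          # nonzero column residual -> col j
    i, j = np.argmax(np.abs(r)), np.argmax(np.abs(c))
    if abs(r[i]) > tol and abs(c[j]) > tol:
        C[i, j] -= r[i]                 # the residual *is* the error
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
C = A @ B
row_cs = A.sum(axis=0) @ B              # e^T C via encoded inputs
col_cs = A @ B.sum(axis=1)              # C e via encoded inputs
C[3, 5] += 7.0                          # inject a fault
C = correct_single_error(C, row_cs, col_cs)
assert np.allclose(C, A @ B)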
Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this article, we present a new BLAS implementation, FT-BLAS, that provides performance comparable to or faster than state-of-the-art BLAS libraries, while being capable of tolerating soft errors on-the-fly. At the algorithmic level, we propose a hybrid strategy to incorporate fault-tolerant functionality. For memory-bound Level-1 and Level-2 BLAS routines, we duplicate computing instructions and reuse data at the register level to avoid memory overhead when validating runtime correctness. Here we propose the novel use of mask registers on AVX512-enabled processors and SIMD registers on AVX2-enabled processors to store intermediate comparison results. For compute-bound Level-3 BLAS routines, we fuse memory-intensive operations such as checksum encoding and verification into the GEMM assembly kernels to optimize the memory footprint. We also design cache-friendly parallel algorithms for our fault-tolerant library. Through a series of architecture-aware optimizations, we keep the fault-tolerance overhead at a negligible level (<3%). Experimental results obtained on widely used processors such as Intel Skylake, Intel Cascade Lake, and AMD Zen2 demonstrate that FT-BLAS offers high reliability and high performance, running faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for both serial and parallel routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.
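A minimal model of the duplicate-and-compare half of this hybrid strategy, for a memory-bound Level-1 routine. The paper does this with independent SIMD/mask registers at the assembly level; deterministic NumPy can only illustrate the control flow, not actually catch a register upset:

```python
import numpy as np

def ft_axpy(alpha, x, y):
    """y = alpha*x + y, validated by a duplicated computation."""
    r1 = alpha * x + y               # original instruction stream
    r2 = alpha * x + y               # duplicated instruction stream
    if not np.array_equal(r1, r2):   # stand-in for the register comparison
        raise RuntimeError("soft error detected in axpy")
    return r1

y = ft_axpy(2.0, np.random.rand(1024), np.random.rand(1024))
```

The compute-bound Level-3 half uses fused checksum verification, along the lines of the GEMM sketches shown earlier in this listing.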
ISBN (Print): 9798400714436
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particularly in the attention mechanism, which is a critical component of transformer-based LLMs. In this paper, we investigate the impact of faults on LLM training, focusing on INF, NaN, and near-INF values in the computation results, with systematic fault injection experiments. We observe the propagation patterns of these errors, which can trigger non-trainable states in the model and disrupt training, forcing the procedure to load from checkpoints. To mitigate the impact of these faults, we propose ATTNChecker, the first algorithm-based fault tolerance (ABFT) technique tailored for the attention mechanism in LLMs. ATTNChecker is designed based on the fault propagation patterns of LLMs and incorporates performance optimizations to adapt to both system reliability and model vulnerability, while providing lightweight protection for fast LLM training. Evaluations on four LLMs show that ATTNChecker incurs an average 7% overhead on training while detecting and correcting all extreme errors. Compared with the state-of-the-art checkpoint/restore approach, ATTNChecker reduces recovery overhead by up to 49x.
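A minimal sketch of checksum protection for one attention score matrix, loosely in the spirit of ABFT for attention; the shapes, tolerance, and the extreme-value test are illustrative, not ATTNChecker's actual design:

```python
import numpy as np

def checked_attention_scores(Q, K, tol=1e-6):
    S = Q @ K.T                                   # protected computation
    # Linear checksum: e^T S must equal (e^T Q) K^T up to rounding.
    if np.abs(S.sum(axis=0) - Q.sum(axis=0) @ K.T).max() > tol:
        raise RuntimeError("soft error detected in attention scores")
    # Extreme values (INF/NaN/near-INF) can push the model into a
    # non-trainable state, so flag them before softmax.
    if not np.all(np.isfinite(S)):
        raise RuntimeError("INF/NaN detected in attention scores")
    return S

S = checked_attention_scores(np.random.rand(16, 8), np.random.rand(16, 8))
```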
Managing random hardware faults requires the faults to be detected online, thus simplifying recovery. Algorithm-based fault tolerance has been proposed as a low-cost mechanism to check the result of computations online against random hardware failures. In this case, the checksum of the actual result is checked against a predicted checksum computed in parallel by a hardware checker. In this work, we target the design of such checkers for convolution engines, which are currently the most critical building block in image processing and computer vision applications. The proposed convolution checksum checker, named ConvGuard, utilizes a newly introduced invariance condition of convolution to predict the output checksum implicitly, using only the pixels at the border of the input image. In this way, ConvGuard reduces the power required for accumulating the input pixels without requiring large buffers to hold intermediate checksum results. The design of ConvGuard is generic and can be configured for different output sizes and strides. The experimental results show that ConvGuard utilizes only a small percentage of the area/power of an efficient convolution engine while being significantly smaller and more power-efficient than a state-of-the-art checksum checker for various practical cases.
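A minimal software paraphrase of the kind of invariance such a checker exploits, for a stride-1 "valid" convolution: the sum of all outputs is a fixed weighting of input-region sums, so an output checksum can be predicted without running the convolution. This is a sketch of the mathematics only, not the hardware design; where the checker derives these region sums from border pixels, the code below computes them directly.

```python
import numpy as np

def predicted_output_checksum(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    # Each kernel tap w[p, q] multiplies the (oh x ow) input region that
    # starts at offset (p, q); adjacent regions differ only by border
    # rows/columns, which is what enables a border-pixel formulation.
    return sum(kernel[p, q] * image[p:p+oh, q:q+ow].sum()
               for p in range(kh) for q in range(kw))

image = np.random.rand(16, 16)
kernel = np.random.rand(3, 3)
out = np.array([[np.sum(image[i:i+3, j:j+3] * kernel)
                 for j in range(14)] for i in range(14)])
assert np.isclose(out.sum(), predicted_output_checksum(image, kernel))
```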
ISBN (Print): 9798400701559
General matrix-matrix multiplication (GEMM) is crucial for scientific computing and machine learning. However, the increased scale of computing platforms raises concerns about hardware and software reliability. In this poster, we present FT-GEMM, a high-performance GEMM capable of tolerating soft errors on-the-fly. We incorporate the fault-tolerant functionality at the algorithmic level by fusing the memory-intensive operations into the GEMM assembly kernels. We design a cache-friendly scheme for parallel FT-GEMM. Experimental results on Intel Cascade Lake demonstrate that FT-GEMM offers high reliability and performance, running faster than Intel MKL, OpenBLAS, and BLIS by 3.50% to 22.14% for both serial and parallel GEMM, even under hundreds of errors injected per minute.
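A minimal sketch of the "fused" flavor of checksum ABFT for GEMM: each block of C is verified right after it is computed (while it is still cache-resident), rather than once at the end. The blocking and tolerance here are illustrative, not FT-GEMM's kernel design:

```python
import numpy as np

def ft_gemm_blocked(A, B, bs=32, tol=1e-8):
    m, n = A.shape[0], B.shape[1]
    C = np.empty((m, n))
    for j in range(0, n, bs):
        Bj = B[:, j:j+bs]
        Cj = A @ Bj                                  # one block of C
        # Fused check: e^T Cj must equal (e^T A) Bj up to rounding.
        if np.abs(Cj.sum(axis=0) - A.sum(axis=0) @ Bj).max() > tol:
            raise RuntimeError(f"soft error detected in block {j // bs}")
        C[:, j:j+bs] = Cj
    return C

C = ft_gemm_blocked(np.random.rand(96, 96), np.random.rand(96, 96))
```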