ISBN (digital): 9781665461863
ISBN (print): 9781665461863
General Matrix Multiplication (GEMM) is instrumental in a myriad of scientific, high-performance computing, and machine learning applications such as computer vision, recommendation models, and weather forecasting. It is vital to make such computations fail-safe in safety-critical and high-precision applications. Companies such as Meta and Google have recently reported sporadic silent errors in GEMM computations traced to hardware sources. Silent errors are hard to detect and require specialized detection solutions. Hardware redundancy approaches such as double or triple modular redundancy effectively detect or correct such errors, but they carry a large area and power overhead. Algorithm-based fault tolerance (ABFT) has been shown to offer an effective alternative at a far lower overhead. Modern CPUs feature advanced vector extensions (AVX) capable of executing SIMD instructions. This paper describes a new ABFT approach designed to take advantage of the AVX feature. Our core algorithm relies on the classical tile-based outer-product approach but enhances the standard checksum calculation using a tile vector. The implementation parameters are fine-tuned to fit the available number of AVX registers. Our results indicate that we can achieve 100% error detection in GEMM at an overhead of just 0.21% for the integer data type. Unfortunately, due to rounding errors, the addition of floating-point numbers is not associative, which creates difficulties for ABFT. To mitigate the impact of rounding errors, we introduce the concept of relative error checking and perform error analysis for various error classes to show that the proposed approach completely eliminates false positive errors.
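As a rough illustration of the checksum relations this kind of approach builds on, the sketch below verifies a GEMM result against column and row checksums and uses a relative (rather than absolute) criterion so that floating-point rounding does not raise false positives. The tile-vector and AVX-register details of the paper are not modeled, and the function names and the 1e-10 tolerance are illustrative assumptions.

```python
import numpy as np

def abft_checksums(A, B):
    """Pre-compute ABFT reference checksums for C = A @ B.

    colsum(C) must equal colsum(A) @ B and rowsum(C) must equal A @ rowsum(B),
    so both references can be formed from A and B alone (O(n^2) extra work).
    """
    return A.sum(axis=0) @ B, A @ B.sum(axis=1)

def abft_verify(C, col_ref, row_ref, rel_tol=1e-10):
    """Relative error checking: compare C's checksums against the references,
    scaling the tolerance by the checksum magnitude so rounding does not
    produce false positives (the tolerance value is an assumed placeholder)."""
    col_ok = np.all(np.abs(C.sum(axis=0) - col_ref) <= rel_tol * (np.abs(col_ref) + 1.0))
    row_ok = np.all(np.abs(C.sum(axis=1) - row_ref) <= rel_tol * (np.abs(row_ref) + 1.0))
    return col_ok and row_ok

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48))
B = rng.standard_normal((48, 32))
col_ref, row_ref = abft_checksums(A, B)

C = A @ B
print(abft_verify(C, col_ref, row_ref))   # True: clean result passes

C[3, 5] += 1e-3                           # inject a small silent error
print(abft_verify(C, col_ref, row_ref))   # False: corruption detected
```

The reference checksums add only O(n^2) work on top of the O(n^3) multiplication, which is why ABFT overheads for GEMM can stay at the sub-percent level reported above.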
ISBN (print): 9781450383356
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly but also provides performance comparable to modern state-of-the-art BLAS libraries on widely used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and compute-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an algorithm-based fault tolerance mechanism for compute-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from careful assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance, running faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14%, and 21.70%, respectively, for routines spanning all three levels of BLAS that we benchmarked, even under hundreds of errors injected per minute.
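A minimal sketch of the duplication side of this hybrid strategy, shown for a Level-1 DOT routine (Level-3 routines would instead use checksum-based ABFT as in the previous sketch). FT-BLAS fuses the duplicated instructions into hand-written assembly kernels; the illustrative `ddot_duplicated` helper below is an assumption of this write-up and simply evaluates the dot product twice and compares.

```python
import numpy as np

def ddot_duplicated(x, y):
    """Duplicated-computation check for a memory-bound Level-1 routine (DOT).

    The two evaluations run on the same operands, so they agree bit-for-bit
    in a clean run; a transient soft error in one of them makes the results
    differ and is therefore detected (rounding cannot cause a mismatch).
    """
    r1 = float(np.dot(x, y))    # primary evaluation
    r2 = float(np.dot(x, y))    # duplicated evaluation
    if r1 != r2:
        raise RuntimeError("soft error detected in DOT, recomputation needed")
    return r1

x = np.arange(1024, dtype=np.float64)
y = np.ones(1024)
print(ddot_duplicated(x, y))    # 523776.0
```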
As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popular software fault-tolerant techniques, both the checkpoint-restart approach and the replication technique face scalability challenges in the era of peta- and exa-scale systems because of their enormous numbers of processes. In this situation, algorithm-based approaches, or algorithm-based fault tolerance (ABFT) mechanisms, have become attractive because they are efficient and lightweight. Although the ABFT technique is algorithm-dependent, it is possible to implement it at a low level (e.g., in libraries for basic numerical algorithms) and make it application-independent. However, previous ABFT approaches have mainly aimed at achieving fault tolerance in integrated circuits (ICs) or at the architecture level and are therefore not suitable for HPC systems; e.g., they use checksums of rows and columns of matrices rather than checksums of blocks to detect errors. Furthermore, they cannot deal with errors caused by node failure, which are common in current HPC systems. To solve these problems, this paper proposes FT-PBLAS, a PBLAS-based library for fault-tolerant parallel linear algebra computations that can be regarded as a fault-tolerant version of the parallel basic linear algebra subprograms (PBLAS), because it provides a series of fault-tolerant versions of the interfaces in PBLAS. To support the underlying error detection and recovery mechanisms in the library, we propose a block-checksum approach for non-fatal errors and a separate scheme for handling node failure. We evaluate the two fault-tolerant mechanisms and FT-PBLAS on HPC systems, and the experimental results demonstrate the performance of our library.
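As an assumption-level illustration of block-checksum recovery after a node failure, the sketch below appends one checksum block per group of blocks so that a single lost block can be rebuilt by subtraction. The process grid, communication, and the actual FT-PBLAS interfaces are not modeled, and the helper names are hypothetical.

```python
import numpy as np

def add_block_checksum(blocks):
    """blocks: list of equally shaped sub-matrices held by different nodes.
    Returns the list with one extra checksum block equal to their sum."""
    return blocks + [sum(blocks)]

def recover_block(blocks_with_checksum, lost):
    """Rebuild the block at index `lost` from the surviving blocks and the
    checksum block (which is stored last)."""
    checksum = blocks_with_checksum[-1]
    survivors = [b for i, b in enumerate(blocks_with_checksum[:-1]) if i != lost]
    return checksum - sum(survivors)

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((4, 4)) for _ in range(3)]
stored = add_block_checksum(blocks)          # distributed with redundancy
rebuilt = recover_block(stored, lost=1)      # node holding block 1 failed
print(np.allclose(rebuilt, blocks[1]))       # True
```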
ISBN (print): 9783030483401, 9783030483395
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to "algorithm-based fault tolerance" checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
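A toy version of a physics-based checksum: a conservative finite-volume update of 1-D periodic advection preserves the total mass exactly (up to rounding), so comparing the stored state against the recorded total detects silent corruption. The grid size, CFL number, and tolerance below are illustrative choices of this sketch, not values from the paper, and the subdomain/flux bookkeeping is not modeled.

```python
import numpy as np

n = 256
dx, dt, a = 1.0 / n, 0.5 / n, 1.0                # CFL = 0.5, advection speed a
x = np.linspace(0.0, 1.0, n, endpoint=False)
u = np.exp(-200.0 * (x - 0.5) ** 2)              # initial profile

mass_checksum = u.sum() * dx                     # conserved quantity, recorded once
for step in range(200):
    flux = a * u                                 # upwind flux, periodic boundaries
    u = u - (dt / dx) * (flux - np.roll(flux, 1))
    # The conservative update moves mass between cells but never creates or
    # destroys it, so the state must still match the recorded total; a silent
    # bit flip in u would violate this relation.
    if abs(u.sum() * dx - mass_checksum) > 1e-10 * abs(mass_checksum):
        print(f"silent data corruption suspected at step {step}")
        break
else:
    print("all steps passed the conservation check")
```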
ISBN (print): 9783030576752, 9783030576745
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault tolerance in high-performance computing (HPC). The core idea of coded computing is the same as that of algorithm-based fault tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplication, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires approximately 50% less redundancy than replication, while the overhead in execution time is only about 5-10%.
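The sketch below shows the MatDot encoding and decoding step in isolation for m = 2 blocks and 5 workers (any 2m - 1 = 3 outputs suffice, so two failures are tolerated), without the SUMMA data layout or communication schedule; the helper names and the evaluation points are illustrative.

```python
import numpy as np

def matdot_encode(A, B, xs, m=2):
    """Encode A (column blocks) and B (row blocks) as matrix polynomials and
    return one coded product per worker/evaluation point."""
    Ab = np.hsplit(A, m)                                     # A = [A_0 A_1]
    Bb = np.vsplit(B, m)                                     # B = [B_0; B_1]
    pA = lambda x: sum(Ab[i] * x**i for i in range(m))       # sum_i A_i x^i
    pB = lambda x: sum(Bb[j] * x**(m - 1 - j) for j in range(m))
    return [pA(x) @ pB(x) for x in xs]

def matdot_decode(outputs, xs, m=2):
    """Interpolate the degree-2(m-1) matrix polynomial from any 2m-1 worker
    outputs and return the coefficient of x^(m-1), which equals A @ B."""
    k = 2 * m - 1
    V = np.vander(np.asarray(xs[:k], dtype=float), k, increasing=True)
    stacked = np.stack([w.ravel() for w in outputs[:k]])     # (k, rows*cols)
    coeffs = np.linalg.solve(V, stacked)
    return coeffs[m - 1].reshape(outputs[0].shape)

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 8))
B = rng.standard_normal((8, 4))
xs = [1.0, 2.0, 3.0, 4.0, 5.0]                               # 5 workers
work = matdot_encode(A, B, xs)
# Pretend the workers at x = 2 and x = 4 failed: decode from the other three.
survivors = [0, 2, 4]
C = matdot_decode([work[i] for i in survivors], [xs[i] for i in survivors])
print(np.allclose(C, A @ B))                                 # True
```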
ISBN (print): 9781538666142
With the scaling up of high-performance computers, resilience has become a big challenge. Among various kinds of software-based fault-tolerant approaches, algorithm-based fault tolerance (ABFT) has some attractive characteristics in the era of exa-scale systems, such as high efficiency and light weight. In particular, considering that many engineering and scientific applications rely on a set of fundamental algorithms, it is possible to provide algorithm-based fault-tolerant mechanisms at a low level and make them application-independent. Previous fault-tolerant mechanisms for matrix computation use row and column checksums, which cannot be directly used in large-scale parallel systems. This paper proposes an algorithm-based fault-tolerant approach for matrix multiplication on large-scale parallel systems. The mechanism uses block checksums, which not only meet the requirements of matrix computations on large-scale parallel systems but also reduce the overhead of fault tolerance compared to traditional schemes based on row and column checksums. In addition, this paper gives a method for choosing the block size to achieve a balance between accuracy and efficiency. The complexity analysis and examples demonstrate the effectiveness and feasibility of our approach.
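A simplified rendering of block-level checksum verification for C = A x B: each block of C is checked independently, so an error is localized to one block rather than to a whole row or column of the global matrix. The distribution across nodes and the block-size selection analysis are not modeled, and the tolerance is an assumed placeholder.

```python
import numpy as np

def check_blocks(A, B, C, nb, rel_tol=1e-10):
    """Return the (block-row, block-col) indices of blocks of C whose column
    checksum disagrees with the reference formed from A's block row and B."""
    n = A.shape[0]
    bad = []
    for I in range(0, n, nb):
        for J in range(0, n, nb):
            # colsum(C[I:I+nb, J:J+nb]) must equal colsum(A[I:I+nb, :]) @ B[:, J:J+nb]
            ref = A[I:I + nb, :].sum(axis=0) @ B[:, J:J + nb]
            got = C[I:I + nb, J:J + nb].sum(axis=0)
            if np.any(np.abs(got - ref) > rel_tol * (np.abs(ref) + 1.0)):
                bad.append((I // nb, J // nb))
    return bad

rng = np.random.default_rng(3)
n, nb = 64, 16
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C = A @ B
C[40, 9] += 1e-2                    # inject an error into block (2, 0)
print(check_blocks(A, B, C, nb))    # [(2, 0)]
```

A larger block size lowers the checking overhead but localizes errors more coarsely, which is the accuracy/efficiency balance discussed above.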
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is reconstructed by an asynchronous online recovery. The computations in the faulty and the healthy subdomains must be coordinated carefully; in particular, both under- and over-solving must be avoided, since both waste computational resources and therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal recoupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchically weighted sums of residuals on uniformly refined meshes and is well suited to parallel high-performance computing. The recoupling process is steered by the local contributions of the error estimator before the fault. We report failure scenarios when solving up to 6.9 x 10^11 unknowns on more than 245,766 parallel processes on a state-of-the-art peta-scale supercomputer, demonstrating the robustness of the method.
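A toy serial rendering of the recoupling idea, with a plain local residual norm standing in for the hierarchical error estimator described above: after a fault wipes part of the iterate, only the faulty window is re-solved, and the local solve stops once its residual is back at the pre-fault level, so that neither under- nor over-solving occurs. The 1-D Poisson problem, the Jacobi smoother, and all sizes are illustrative assumptions, not the paper's multigrid setting.

```python
import numpy as np

n = 128
h2 = 1.0 / (n + 1) ** 2
f = np.ones(n)                                   # -u'' = 1, u(0) = u(1) = 0

def residual(u):
    """r_i = h^2 f_i + u_{i-1} + u_{i+1} - 2 u_i (zero Dirichlet boundaries)."""
    r = f * h2
    r = r - 2.0 * u
    r[1:] += u[:-1]
    r[:-1] += u[1:]
    return r

u = np.zeros(n)
for _ in range(2000):                            # global Jacobi iterations
    u = u + 0.5 * residual(u)

lo, hi = 40, 80                                  # faulty subdomain
pre_fault_local = np.linalg.norm(residual(u)[lo:hi])   # estimator value before fault
u[lo:hi] = 0.0                                   # node failure wipes the window

sweeps = 0
while np.linalg.norm(residual(u)[lo:hi]) > pre_fault_local:
    u[lo:hi] += 0.5 * residual(u)[lo:hi]         # local Jacobi sweeps only
    sweeps += 1
print(f"recovered faulty window in {sweeps} local sweeps")
```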
We consider the problem of computing a matrix-vector product Ax using a set of P parallel or distributed processing nodes prone to "straggling," i.e., unpredictable delays. Every processing node can access only a fraction (s/N) of the N-length vector x, and all processing nodes compute an equal number of dot products. We propose a novel error-correcting code, called "Short-Dot," that introduces redundant, shorter dot products such that only a subset of the nodes' outputs is sufficient to compute Ax. To address the problem of straggling in computing matrix-vector products, prior work uses replication or erasure coding to encode parts of the matrix A, but the length of the dot products computed at each processing node is still N. The key novelty in our work is that instead of computing the long dot products required in the original matrix-vector product, we construct a larger number of redundant, short dot products that require only a fraction of x to be accessed during the computation. Short-Dot is thus useful in communication-constrained scenarios, as it allows only a fraction of x to be accessed by each processing node. Further, we show that in the particular regime where the number of available processing nodes is greater than the total number of dot products, Short-Dot has, in a scaling sense, lower expected computation time under straggling (modeled with exponential delays) than existing strategies such as replication. We also derive fundamental limits on the trade-off between the length of the dot products and the recovery threshold, i.e., the required number of processing nodes, showing that Short-Dot is near-optimal.
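Short-Dot's sparse encoding is more involved than can be shown briefly, so the sketch below only illustrates the erasure-coded baseline the abstract contrasts against: one parity row lets any M of M + 1 workers recover Ax, but every worker still computes a full N-length dot product, which is exactly the cost Short-Dot reduces by making the coded rows shorter.

```python
import numpy as np

rng = np.random.default_rng(4)
M, N = 4, 10
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Coded task matrix: the M rows of A plus one parity row (their sum).
F = np.vstack([A, A.sum(axis=0, keepdims=True)])
outputs = F @ x                                    # one full-length dot product per worker

straggler = 2                                      # worker 2 never returns
# Parity relation: outputs[M] equals the sum of the M data-row dot products,
# so the missing result is recovered from the workers that did respond.
missing = outputs[M] - sum(outputs[i] for i in range(M) if i != straggler)
result = outputs[:M].copy()
result[straggler] = missing
print(np.allclose(result, A @ x))                  # True
```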
ISBN (print): 9780769545905
As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming pronounced. In this paper, we present an algorithm-based fault tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugate Gradient (CG) method, taking advantage of numerical defect correction. This method is "pay as you go": there is practically a runtime overhead only if errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance, and no occurrence of silent data corruption.
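A sketch in the spirit of this "pay as you go" idea (the checking interval, tolerance, and test problem are illustrative assumptions): every few CG iterations, the recursively updated residual is compared against the true residual b - A x, and on a mismatch the solver restarts from the current iterate with the corrected residual, so significant extra work is done only when an error has actually occurred, unlike TMR which triplicates everything.

```python
import numpy as np

def cg_abft(A, b, tol=1e-10, check=10, max_iter=2000):
    """Conjugate Gradient with a periodic self-check and a defect-corrected
    restart when the recursive residual disagrees with the true residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if (k + 1) % check == 0:
            true_r = b - A @ x                       # periodic self-check
            if np.linalg.norm(true_r - r) > 1e-6 * np.linalg.norm(b):
                # A corrupted x or r would trip this branch: restart from the
                # current iterate using the true residual (defect correction).
                r, p, rs = true_r, true_r.copy(), true_r @ true_r
                continue
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# SPD test problem: 1-D Laplacian.
n = 200
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iters = cg_abft(A, b)
print(iters, np.linalg.norm(A @ x - b))
```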
ISBN (print): 9781538683842
Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking schemes are not efficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) the checksum calculation, based on a special type of matrix multiplication, is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach that provides stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, our checking scheme is more efficient because it prioritizes checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x by better utilizing GPUs.
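A sketch of the two-dimensional checksum layout only (the checksum-preserving decomposition updates, GPU offloading, and PCIe protection discussed above are not modeled): checksums in both dimensions let a corrupted data element be localized at the intersection of the failing row and column checks, which is the "full matrix protection" a one-dimensional scheme cannot give.

```python
import numpy as np

def encode_2d(M):
    """Append a checksum column (row sums) and a checksum row (column sums)."""
    with_col = np.hstack([M, M.sum(axis=1, keepdims=True)])
    return np.vstack([with_col, with_col.sum(axis=0, keepdims=True)])

def verify_2d(E, rel_tol=1e-10):
    """Return the indices of row and column checks that fail; an error in a
    data element shows up in both lists, so their intersection localizes it.
    The relative tolerance is an assumed placeholder."""
    data, rsum, csum = E[:-1, :-1], E[:-1, -1], E[-1, :-1]
    bad_rows = np.flatnonzero(np.abs(data.sum(axis=1) - rsum) > rel_tol * (np.abs(rsum) + 1.0))
    bad_cols = np.flatnonzero(np.abs(data.sum(axis=0) - csum) > rel_tol * (np.abs(csum) + 1.0))
    return bad_rows, bad_cols

rng = np.random.default_rng(5)
E = encode_2d(rng.standard_normal((8, 8)))
print(verify_2d(E))          # (array([]), array([])): clean
E[3, 5] += 1e-3              # soft error in a data element
print(verify_2d(E))          # (array([3]), array([5])): located at (3, 5)
```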