Search Results - Inner Mongolia University Library

35th ACM International Conference on Supercomputing (ICS)

Authors: Zhai, Yujia; Giem, Elisabeth; Fan, Quan; Zhao, Kai; Liu, Jinyang; Chen, Zizhong (Univ Calif Riverside, Riverside, CA 92521, USA)

ISBN: (print) 9781450383356

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an algorithm-based fault tolerance mechanism for computing-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from delicate assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance - faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14% and 21.70%, respectively, for routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.

Keywords: BLAS; SIMD; Assembly Optimization; Dual Modular Redundancy; algorithm-based fault tolerance; AVX-512
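The hybrid strategy in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not FT-BLAS code: the function names `dmr_dot` and `abft_gemm_check` and the `fault_at` fault-injection parameter are hypothetical, and the real library works at the assembly/SIMD level rather than in Python. The first function duplicates the computation of a Level-1 style dot product and compares the two results; the second verifies a Level-3 style product C = A·B through the column-checksum identity eᵀC = (eᵀA)·B.

```python
def dmr_dot(x, y, fault_at=None):
    """Dot product protected by duplicated computation (DMR).

    fault_at is a hypothetical hook that corrupts one term of the first
    pass, to simulate a soft error. Returns (result, error_detected).
    """
    def one_pass(inject):
        acc = 0.0
        for i, (a, b) in enumerate(zip(x, y)):
            term = a * b
            if inject and i == fault_at:
                term += 1.0  # simulated soft error
            acc += term
        return acc

    first = one_pass(inject=True)    # possibly corrupted pass
    second = one_pass(inject=False)  # duplicated pass
    if first == second:
        return first, False
    # The two copies disagree: recompute to recover a clean result.
    return one_pass(inject=False), True


def matmul(A, B):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def abft_gemm_check(A, B, C, tol=1e-9):
    """Return the column indices of C whose checksum is inconsistent.

    Uses the identity: column sums of (A @ B) == (column sums of A) @ B.
    """
    col_sum_A = [sum(row[k] for row in A) for k in range(len(A[0]))]
    expected = [sum(col_sum_A[k] * B[k][j] for k in range(len(B)))
                for j in range(len(B[0]))]
    actual = [sum(row[j] for row in C) for j in range(len(C[0]))]
    return [j for j in range(len(expected))
            if abs(expected[j] - actual[j]) > tol]
```

The checksum test costs one extra vector-matrix product, which is small next to the full matrix product; that asymmetry is why checksum-based protection suits compute-bound routines, while duplication is affordable where memory traffic dominates.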


3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication

26th International Conference on Parallel and Distributed Computing (Euro-Par)

Authors: Jeong, Haewon; Yang, Yaoqing; Gupta, Vipul; Engelmann, Christian; Low, Tze Meng; Cadambe, Viveck; Ramchandran, Kannan; Grover, Pulkit (Carnegie Mellon Univ, Pittsburgh, PA 15213, USA; Univ Calif Berkeley, Berkeley, CA, USA; Oak Ridge Natl Lab, Oak Ridge, TN, USA; Penn State Univ, State Coll, PA, USA)

ISBN: (print) 9783030576752; 9783030576745

In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as that of algorithm-based fault-tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires approximately 50% less redundancy than replication, while the overhead in execution time is only about 5-10%.

Keywords: Parallel matrix multiplication; fault-tolerant algorithms; algorithm-based fault tolerance; Coded computing; Communication-efficient algorithms; Error detection and correction
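The MatDot construction mentioned in the abstract above can be illustrated with a minimal sketch. It covers only the code itself, with A split into two column blocks and B into two row blocks; the 3D SUMMA integration and its communication pattern are not reproduced. The function names, the five integer evaluation points, and the `failed` parameter are assumptions made for illustration. Each simulated worker multiplies one pair of encoded blocks, and the product A·B is the coefficient of x in the resulting matrix polynomial, so any three of the five workers suffice.

```python
from fractions import Fraction


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(A, s):
    return [[s * a for a in row] for row in A]


def matdot_encode(A, B, x):
    """Encode A = [A0 | A1] and B = [B0 ; B1] at evaluation point x.

    pA(x) = A0 + A1*x and pB(x) = B0*x + B1, so that
    pA(x) @ pB(x) = A0@B1 + (A0@B0 + A1@B1)*x + A1@B0*x**2,
    whose x coefficient is exactly A @ B.
    Assumes the inner dimension is even.
    """
    half = len(B) // 2
    A0 = [row[:half] for row in A]
    A1 = [row[half:] for row in A]
    B0, B1 = B[:half], B[half:]
    return add(A0, scale(A1, x)), add(scale(B0, x), B1)


def matdot_decode(results):
    """Recover A @ B from any three {x: pA(x) @ pB(x)} evaluations.

    Lagrange interpolation of a degree-2 polynomial, keeping only the
    coefficient of x.
    """
    pts = list(results.items())[:3]
    rows, cols = len(pts[0][1]), len(pts[0][1][0])
    out = [[Fraction(0)] * cols for _ in range(rows)]
    for i, (xi, Pi) in enumerate(pts):
        a, b = [x for j, (x, _) in enumerate(pts) if j != i]
        # x coefficient of the Lagrange basis polynomial for point xi
        weight = Fraction(-(a + b), (xi - a) * (xi - b))
        out = add(out, scale(Pi, weight))
    return out


def coded_multiply(A, B, failed=()):
    """Simulate five workers; those listed in `failed` return nothing."""
    results = {}
    for x in (1, 2, 3, 4, 5):
        if x in failed:
            continue
        pA, pB = matdot_encode(A, B, x)
        results[x] = matmul(pA, pB)
    return matdot_decode(results)
```

For example, `coded_multiply(A, B, failed=(2, 5))` still returns A·B, because three evaluations remain. Exact rational arithmetic is used here only to keep the sketch free of rounding concerns.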


Correcting Soft Errors Online in Fast Fourier Transform

International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Authors: Liang, Xin; Chen, Jieyang; Tao, Dingwen; Li, Sihuan; Wu, Panruo; Li, Hongbo; Ouyang, Kaiming; Liu, Yuanlai; Song, Fengguang; Chen, Zizhong (Univ Calif Riverside, Riverside, CA 92521, USA; Indiana Univ Purdue Univ, Indianapolis, IN 46202, USA)

ISBN: (digital) 9781450351140; (print) 9781450351140

While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.

Keywords: algorithm-based fault tolerance; Soft Errors; DFT; FFT; FFTW
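The checksum relation that ABFT schemes for the Fourier transform rely on can be sketched as follows. This is a generic after-the-fact check on one complete transform, not the paper's online scheme, whose details are not given in the abstract; the weight vector `w`, the tolerance, and the function names are assumptions chosen for illustration. For X = F·x and any weights w, the identity wᵀX = (Fᵀw)ᵀx holds, so a weighted sum of the outputs can be compared with a weighted sum of the inputs.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def checksum_weights(n):
    """Output weights w and the matching input weights c = F^T w.

    w_k = k + 1 is an arbitrary illustrative choice; c depends only on
    n, so it could be precomputed once per transform size.
    """
    w = [k + 1 for k in range(n)]
    c = [sum(w[k] * cmath.exp(-2j * cmath.pi * k * t / n)
             for k in range(n)) for t in range(n)]
    return w, c


def verify(x, X, tol=1e-8):
    """Check w . X == c . x up to a relative rounding tolerance."""
    w, c = checksum_weights(len(x))
    lhs = sum(wk * Xk for wk, Xk in zip(w, X))
    rhs = sum(ct * xt for ct, xt in zip(c, x))
    return abs(lhs - rhs) <= tol * max(1.0, abs(rhs))
```

An online variant would presumably apply checks of this kind to intermediate sub-transforms rather than waiting for the final output, which is what allows a corrupted run to be stopped early.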


Fault Tolerant One-sided Matrix Decompositions on Heterogeneous Systems with GPUs

International Conference on High Performance Computing, Networking, Storage, and Analysis (SC)

Authors: Chen, Jieyang; Li, Hongbo; Li, Sihuan; Liang, Xin; Wu, Panruo; Tao, Dingwen; Ouyang, Kaiming; Liu, Yuanlai; Zhao, Kai; Guan, Qiang; Chen, Zizhong (Univ Calif Riverside, Riverside, CA 92521, USA; Univ Houston, Houston, TX 77004, USA; Univ Alabama, Tuscaloosa, AL 35487, USA; Kent State Univ, Kent, OH 44242, USA)

ISBN: (print) 9781538683842

Current algorithm-based fault tolerance (ABFT) approaches for one-sided matrix decomposition on heterogeneous systems with GPUs have the following limitations: (1) they do not provide sufficient protection, as most of them maintain checksums in only one dimension; (2) their checking scheme is inefficient due to redundant checksum verifications; (3) they fail to protect PCIe communication; and (4) the checksum calculation, based on a special type of matrix multiplication, is far from efficient. By overcoming the above limitations, we design an efficient ABFT approach providing stronger protection for one-sided matrix decomposition methods on heterogeneous systems. First, we provide full matrix protection by using checksums in two dimensions. Second, we make our checking scheme more efficient by prioritizing the checksum verification according to the sensitivity of matrix operations to soft errors. Third, we protect PCIe communication by reordering checksum verifications and decomposition steps. Fourth, we accelerate the checksum calculation by 1.7x via better utilizing GPUs.

Keywords: algorithm-based fault tolerance; Linear algebra; Matrix decomposition; GPU; Heterogeneous system
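Why checksums in two dimensions give stronger protection than checksums in one can be seen in a small sketch. It is a generic illustration on a stored matrix, assuming at most one corrupted element; it does not model the decomposition steps, the GPU kernels, or the PCIe transfers from the abstract above, and the function names are hypothetical. A row checksum alone reveals which row is wrong, while the row and column checksums together pinpoint the element and allow it to be repaired.

```python
def encode(M):
    """Return (row_sums, col_sums) for a matrix stored as lists of rows."""
    row_sums = [sum(row) for row in M]
    col_sums = [sum(col) for col in zip(*M)]
    return row_sums, col_sums


def locate_and_correct(M, row_sums, col_sums, tol=1e-9):
    """Find and repair a single corrupted element of M in place.

    Returns the (row, column) that was repaired, or None if both sets
    of checksums are consistent. Raises ValueError if the mismatch
    pattern does not correspond to a single corrupted element.
    """
    bad_rows = [i for i, row in enumerate(M)
                if abs(sum(row) - row_sums[i]) > tol]
    bad_cols = [j for j, col in enumerate(zip(*M))
                if abs(sum(col) - col_sums[j]) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        # The row-checksum discrepancy equals the injected error.
        M[i][j] -= sum(M[i]) - row_sums[i]
        return (i, j)
    raise ValueError("mismatch pattern is not a single corrupted element")
```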


FAULT TOLERANT COMPUTATION WITH THE SPARSE GRID COMBINATION TECHNIQUE

SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2015, Vol. 37, No. 3, pp. C331-C353

Authors: Harding, Brendan; Hegland, Markus; Larson, Jay; Southern, James (Australian Natl Univ, Inst Math Sci, Acton, ACT 2601, Australia; Fujitsu Labs Europe, Hayes UB4 8FE, Middx, England)

This paper continues to develop a fault tolerant extension of the sparse grid combination technique recently proposed in [B. Harding and M. Hegland, ANZIAM J. Electron. Suppl., 54 (2013), pp. C394-C411]. This approach to fault tolerance is novel for two reasons: First, the combination technique adds an additional level of parallelism, and second, it provides algorithm-based fault tolerance so that solutions can still be recovered if failures occur during computation. Previous work indicates how the combination technique may be adapted for a low number of faults. In this paper we develop a generalization of the combination technique for which arbitrary collections of coarse approximations may be combined to obtain an accurate approximation. A general fault tolerant combination technique for large numbers of faults is a natural consequence of this work. Using a renewal model for the time between faults on each node of a high performance computer, we also provide bounds on the expected error for interpolation with this algorithm in the presence of faults. Numerical experiments solving the scalar advection PDE demonstrate that the algorithm is resilient to faults on a real application. It is observed that the time to solution is not significantly affected by the presence of (simulated) faults. Additionally the expected error increases with the number of faults but is relatively small even for high fault rates. A comparison with traditional checkpoint-restart methods applied to the combination technique shows that our approach is highly scalable with respect to the number of faults.

Keywords: exascale computing; algorithm-based fault tolerance; sparse grid combination technique; parallel algorithms


A mesh check-sum ABFT scheme for stream ciphers

INTERNATIONAL JOURNAL OF COMMUNICATION NETWORKS AND DISTRIBUTED SYSTEMS, 2009, Vol. 3, No. 4, pp. 285-300

Authors: Zhang, Chang N.; Liu, Xiao Wei (Univ Regina, TRLabs, Dept Comp Sci, Regina, SK S4S 0A2, Canada)

To enhance the security and reliability of the widely-used stream ciphers, a novel mesh check-sum ABFT scheme for stream ciphers is developed. By utilising the ready-made arithmetic unit in stream ciphers, single and multiple errors can be detected and corrected cheaply. To meet different requirements in practical applications, a 4-D mesh check-sum ABFT scheme is proposed which can be applied to RC4 or other stream ciphers. The 2-D mesh check-sum ABFT scheme is able to detect and correct a single error with high efficiency. The 4-D mesh check-sum ABFT scheme is capable of correcting up to three errors located randomly in an N-element matrix with acceptable computation and bandwidth overhead. The workload can be remarkably reduced when most communications are error-free. Our scheme also provides a one-to-one mapping between index and check-sum, so that an error can be located and recovered with simpler logic and operations.

Keywords: stream cipher; algorithm-based fault tolerance; error detection; error correction; RC4; parity check sum; arithmetic unit; matrix computation; exclusive or
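A two-dimensional check-sum over a block of bytes, built with exclusive-or as the keywords suggest, can be sketched as below. The abstract does not spell out the paper's construction, so this is a generic row and column XOR parity grid offered only as an illustration of how a single corrupted byte can be located and repaired. The function names and the `width` parameter are hypothetical, and the block length is assumed to be a multiple of `width`.

```python
from functools import reduce
from operator import xor


def mesh_checksums(block, width):
    """XOR parity of every row and column of a byte block laid out
    as a grid with `width` bytes per row."""
    rows = [block[i:i + width] for i in range(0, len(block), width)]
    row_par = [reduce(xor, row, 0) for row in rows]
    col_par = [reduce(xor, col, 0) for col in zip(*rows)]
    return row_par, col_par


def correct_single_error(block, width, row_par, col_par):
    """Return the block with at most one corrupted byte repaired.

    Raises ValueError if the parity mismatches do not point at exactly
    one grid position.
    """
    new_rows, new_cols = mesh_checksums(block, width)
    bad_r = [i for i, (a, b) in enumerate(zip(new_rows, row_par)) if a != b]
    bad_c = [j for j, (a, b) in enumerate(zip(new_cols, col_par)) if a != b]
    if not bad_r and not bad_c:
        return bytes(block)
    if len(bad_r) == 1 and len(bad_c) == 1:
        fixed = bytearray(block)
        # The row-parity difference is exactly the error pattern.
        error = new_rows[bad_r[0]] ^ row_par[bad_r[0]]
        fixed[bad_r[0] * width + bad_c[0]] ^= error
        return bytes(fixed)
    raise ValueError("more than one corrupted byte; cannot correct")
```

When a block arrives intact, only the parity comparison is performed, which matches the abstract's remark that the workload drops when most communications are error-free.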

