Refine search results

Document type

  • 51 journal articles
  • 28 conference papers

Collection scope

  • 79 electronic documents
  • 0 print holdings

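Subject classification

  • 78 items: Engineering
    • 71 items: Computer Science and Technology...
    • 57 items: Electrical Engineering
    • 6 items: Software Engineering
    • 3 items: Electronic Science and Technology (...
    • 3 items: Information and Communication Engineering
    • 2 items: Cyberspace Security
    • 1 item: Control Science and Engineering
  • 6 items: Science
    • 5 items: Mathematics
    • 1 item: Physics
  • 2 items: Management
    • 2 items: Management Science and Engineering (...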

Topics

  • 79 items: algorithm-based ...
  • 14 items: concurrent error...
  • 8 items: fault tolerance
  • 8 items: matrix multiplic...
  • 7 items: error detection
  • 5 items: fault tolerant s...
  • 4 items: error correction
  • 4 items: sparse grid comb...
  • 4 items: checkpointing
  • 4 items: checksum encodin...
  • 3 items: fault diagnosis
  • 3 items: weighted sum par...
  • 3 items: simd
  • 3 items: silent errors
  • 3 items: silent data corr...
  • 3 items: avx-512
  • 3 items: high-performance...
  • 3 items: parallel computi...
  • 3 items: high performance...
  • 3 items: pde solvers

Institutions

  • 6 items: univ calif river...
  • 6 items: princeton univ d...
  • 6 items: univ calif davis...
  • 2 items: princeton univ d...
  • 2 items: univ calif river...
  • 2 items: chinese acad sci...
  • 2 items: australian natl ...
  • 2 items: oak ridge natl l...
  • 1 item: italian natl agc...
  • 1 item: penn state univ ...
  • 1 item: univ calif davis...
  • 1 item: univ quebec dept...
  • 1 item: national microel...
  • 1 item: sungkyunkwan uni...
  • 1 item: georgia inst tec...
  • 1 item: oak ridge natl l...
  • 1 item: univ lyon inria ...
  • 1 item: politecn milan d...
  • 1 item: carnegie mellon ...
  • 1 item: sandia natl labs...

Authors

  • 9 items: chen zizhong
  • 8 items: jha nk
  • 8 items: redinbo gr
  • 4 items: wu panruo
  • 4 items: zhai yujia
  • 4 items: chen jieyang
  • 4 items: banerjee p
  • 4 items: zhao kai
  • 3 items: nguyen c
  • 3 items: ouyang kaiming
  • 3 items: liang xin
  • 3 items: strazdins peter ...
  • 3 items: harding brendan
  • 3 items: li sihuan
  • 3 items: vinnakota b
  • 3 items: abraham ja
  • 2 items: grover pulkit
  • 2 items: liu jinyang
  • 2 items: mayo jackson r.
  • 2 items: tao dingwen

Language

  • 78 items: English
  • 1 item: Other
Search criteria: "Subject term = Algorithm-Based Fault Tolerance"
79 records; showing 1-10
FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, vol. 32, no. 7, pp. 1677-1689
Authors: Zhao, Kai; Di, Sheng; Li, Sihuan; Liang, Xin; Zhai, Yujia; Chen, Jieyang; Ouyang, Kaiming; Cappello, Franck; Chen, Zizhong
Affiliations: Univ Calif Riverside Dept Comp Sci & Engn Riverside CA 92521 USA; Argonne Natl Lab Math & Comp Sci Div Lemont IL 60439 USA; Oak Ridge Natl Lab Comp Sci & Math Div Oak Ridge TN 37831 USA
Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which m...
Detection of Soft Errors in LU Decomposition with Partial Pivoting Using Algorithm-Based Fault Tolerance
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2015, vol. 29, no. 4, pp. 422-436
Authors: Yao, Erlin; Zhang, Jiutian; Chen, Mingyu; Tan, Guangming; Sun, Ninghui
Affiliations: Chinese Acad Sci Inst Comp Technol State Key Lab Comp Architecture Beijing 100190 Peoples R China
Soft errors in scientific computing applications are becoming inevitable with the ever-increasing system scale and execution time, and new technologies that feature increased transistor density and lower voltage. Soft...
A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units
44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Authors: Braun, Claus; Halder, Sebastian; Wunderlich, Hans-Joachim
Affiliations: Univ Stuttgart Inst Comp Architecture & Comp Engn D-70569 Stuttgart Germany
Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of ...
Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance
International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13)
Authors: Jia, Yulu; Bosilca, George; Dongarra, Jack J.
Affiliations: Univ Tennessee Knoxville TN 37996 USA
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctnes...
Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach
International Conference for High Performance Computing, Networking, Storage and Analysis (SC '13)
Authors: Li, Dong; Chen, Zizhong; Wu, Panruo; Vetter, Jeffrey S.
Affiliations: Oak Ridge Natl Lab Oak Ridge TN 37831 USA; Univ Calif Riverside Riverside CA 92521 USA; Georgia Inst Technol Atlanta GA 30332 USA
Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any...
Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, vol. 35, no. 7, pp. 1307-1319
Authors: Loreti, Daniela; Artioli, Marcello; Ciampolini, Anna
Affiliations: Univ Bologna Dept Comp Sci & Engn I-40126 Bologna Italy; Italian Natl Agcy New Technol Energy & Sustainable I-40129 Bologna Italy
The scale of nowadays High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their relatively limited reliability. Over the las...
FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, vol. 34, no. 12, pp. 3207-3223
Authors: Zhai, Yujia; Giem, Elisabeth; Zhao, Kai; Liu, Jinyang; Huang, Jiajun; Wong, Bryan M.; Shelton, Christian R.; Chen, Zizhong
Affiliations: Univ Calif Riverside Riverside CA 92521 USA; Univ Alabama Birmingham Birmingham AL 35294 USA
Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific computing and machine learning. In this article, we present a new BLAS implementation, FT-BLAS, that provides performance comparab...
ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training
30th Symposium on Principles and Practice of Parallel Programming (PPoPP '25)
Authors: Liang, Yuhang; Li, Xinyi; Ren, Jie; Li, Ang; Fang, Bo; Chen, Jieyang
Affiliations: Univ Oregon Eugene OR 97403 USA; Pacific Northwest Natl Lab Richland WA 99352 USA; Coll William & Mary Williamsburg VA USA
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particu...
Low-Cost Online Convolution Checksum Checker
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2022, vol. 30, no. 2, pp. 201-212
Authors: Filippas, Dionysios; Margomenos, Nikolaos; Mitianoudis, Nikolaos; Nicopoulos, Chrysostomos; Dimitrakopoulos, Giorgos
Affiliations: Democritus Univ Thrace Dept Elect & Comp Engn Xanthi 67100 Greece; Univ Cyprus Dept Elect & Comp Engn CY-1678 Nicosia Cyprus
Managing random hardware faults requires the faults to be detected online, thus simplifying recovery. Algorithm-based fault tolerance has been proposed as a low-cost mechanism to check online the result of computation...
FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs
32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '23), part of the ACM Federated Computing Research Conference (FCRC)
Authors: Wu, Shixun; Zhai, Yujia; Huang, Jiajun; Jian, Zizhe; Chen, Zizhong
Affiliations: Univ Calif Riverside Riverside CA 92521 USA
General matrix/matrix multiplication (GEMM) is crucial for scientific computing and machine learning. However, the increased scale of the computing platforms raises concerns about hardware and software reliability. In...