Search query: subject term = "algorithm-based fault tolerance"
79 records; showing 21-30
A-ABFT: Autonomous Algorithm-Based Fault Tolerance for Matrix Multiplications on Graphics Processing Units
44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Authors: Braun, Claus; Halder, Sebastian; Wunderlich, Hans-Joachim
Affiliations: Univ Stuttgart, Inst Comp Architecture & Comp Engn, D-70569 Stuttgart, Germany
Graphics processing units (GPUs) enable large-scale scientific applications and simulations on the desktop. To allow scientific computing on GPUs with high performance and reliability requirements, the application of ...
ATTNChecker: Highly-Optimized Fault Tolerant Attention for Large Language Model Training
30th Symposium on Principles and Practice of Parallel Programming
Authors: Liang, Yuhang; Li, Xinyi; Ren, Jie; Li, Ang; Fang, Bo; Chen, Jieyang
Affiliations: Univ Oregon, Eugene, OR 97403, USA; Pacific Northwest Natl Lab, Richland, WA 99352, USA; Coll William & Mary, Williamsburg, VA, USA
Large Language Models (LLMs) have demonstrated remarkable performance in various natural language processing tasks. However, the training of these models is computationally intensive and susceptible to faults, particu...
Block-checksum-based Fault Tolerance for Matrix Multiplication on Large-Scale Parallel Systems
20th IEEE International Conference on High Performance Computing and Communications (HPCC) / 16th IEEE International Conference on Smart City (SmartCity) / 4th IEEE International Conference on Data Science and Systems (DSS)
Authors: Zhu, Yanchao; Liu, Yi; Li, Mingzhen; Qian, Depei
Affiliations: Beihang Univ, Sino German Joint Software Inst, Beijing, Peoples R China
With the scaling up of high performance computers, resilience has become a big challenge. Among various kinds of software-based fault-tolerant approaches, the algorithm-based fault tolerance (ABFT) has some attractive...
Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under an extended model
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1997, vol. 8, no. 7, pp. 757-768
Authors: Yajnik, S; Jha, NK
Affiliations: PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544
Reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm-based fault tolerance (ABFT) is a low-cost scheme which provides the required fault tolerance to ...
MODIFYING REAL CONVOLUTIONAL-CODES FOR PROTECTING DIGITAL FILTERING SYSTEMS
IEEE TRANSACTIONS ON INFORMATION THEORY, 1993, vol. 39, no. 2, pp. 553-564
Authors: REDINBO, GR; ZAGAR, B
Affiliations: GRAZ TECH UNIV, INST ALLEGEMEINE ELEKTROTECH & ELEKT MESSTECH, A-8010 GRAZ, AUSTRIA
Digital filters when implemented with very dense high-speed electronic devices are susceptible to both temporary and permanent failures, not easily protected by conventional fault-tolerant computer design principles. ...
Exploiting data representation for fault tolerance
JOURNAL OF COMPUTATIONAL SCIENCE, 2016, vol. 14, pp. 51-60
Authors: Elliott, J.; Hoemmen, M.; Mueller, F.
Affiliations: North Carolina State Univ, Dept Comp Sci, Raleigh, NC 27695, USA; Sandia Natl Labs, Ctr Res Comp, POB 5800, Albuquerque, NM 87185, USA
Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. ...
DIAGNOSABILITY AND DIAGNOSIS OF ALGORITHM-BASED FAULT-TOLERANT SYSTEMS
IEEE TRANSACTIONS ON COMPUTERS, 1993, vol. 42, no. 8, pp. 924-937
Authors: VINNAKOTA, B; JHA, NK
Affiliations: PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544, USA
Parallel processing architectures are now in common use for signal processing and other computation-intensive applications. These applications are characterized by high throughput and long processing periods. Such cha...
Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, vol. 35, no. 7, pp. 1307-1319
Authors: Loreti, Daniela; Artioli, Marcello; Ciampolini, Anna
Affiliations: Univ Bologna, Dept Comp Sci & Engn, I-40126 Bologna, Italy; Italian Natl Agcy New Technol Energy & Sustainable, I-40129 Bologna, Italy
The scale of nowadays High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their relatively limited reliability. Over the las...
A backward/forward recovery approach for the preconditioned conjugate gradient method
JOURNAL OF COMPUTATIONAL SCIENCE, 2016, vol. 17, Part 3, pp. 522-534
Authors: Fasi, Massimiliano; Langou, Julien; Robert, Yves; Ucar, Bora
Affiliations: Univ Manchester, Manchester M13 9PL, Lancs, England; Univ Colorado Denver, Denver, CO, USA; ENS Lyon, Lyon, France; Univ Tennessee, Knoxville, TN, USA; Univ Lyon, INRIA, CNRS, LIP, UMR5668, ENS Lyon, UCBL, Lyon, France
Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen (2013, pp. 167-176) has shown how to combine such a verification mechanism (a stability test c...
Tests and tolerances for high-performance software-implemented fault detection
IEEE TRANSACTIONS ON COMPUTERS, 2003, vol. 52, no. 5, pp. 579-591
Authors: Turmon, M; Granat, R; Katz, DS; Lou, JZ
Affiliations: Jet Prop Lab, Data Understanding Syst Grp, Pasadena, CA 91109, USA; Jet Prop Lab, Parallel Applicat Technol Grp, Pasadena, CA 91109, USA
We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event ups...