Refine search results

Document type

  • 51 journal articles
  • 28 conference papers

Collection scope

  • 79 electronic documents
  • 0 print holdings

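Subject classification

  • 78 records: Engineering
    • 71 records: Computer Science and Technology...
    • 57 records: Electrical Engineering
    • 6 records: Software Engineering
    • 3 records: Electronic Science and Technology (...
    • 3 records: Information and Communication Engineering
    • 2 records: Cyberspace Security
    • 1 record: Control Science and Engineering
  • 6 records: Science
    • 5 records: Mathematics
    • 1 record: Physics
  • 2 records: Management
    • 2 records: Management Science and Engineering (...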

Topics

  • 79 records: algorithm-based ...
  • 14 records: concurrent error...
  • 8 records: fault tolerance
  • 8 records: matrix multiplic...
  • 7 records: error detection
  • 5 records: fault tolerant s...
  • 4 records: error correction
  • 4 records: sparse grid comb...
  • 4 records: checkpointing
  • 4 records: checksum encodin...
  • 3 records: fault diagnosis
  • 3 records: weighted sum par...
  • 3 records: simd
  • 3 records: silent errors
  • 3 records: silent data corr...
  • 3 records: avx-512
  • 3 records: high-performance...
  • 3 records: parallel computi...
  • 3 records: high performance...
  • 3 records: pde solvers

Institutions

  • 6 records: univ calif river...
  • 6 records: princeton univ d...
  • 6 records: univ calif davis...
  • 2 records: princeton univ d...
  • 2 records: univ calif river...
  • 2 records: chinese acad sci...
  • 2 records: australian natl ...
  • 2 records: oak ridge natl l...
  • 1 record: italian natl agc...
  • 1 record: penn state univ ...
  • 1 record: univ calif davis...
  • 1 record: univ quebec dept...
  • 1 record: national microel...
  • 1 record: sungkyunkwan uni...
  • 1 record: georgia inst tec...
  • 1 record: oak ridge natl l...
  • 1 record: univ lyon inria ...
  • 1 record: politecn milan d...
  • 1 record: carnegie mellon ...
  • 1 record: sandia natl labs...

Authors

  • 9 records: chen zizhong
  • 8 records: jha nk
  • 8 records: redinbo gr
  • 4 records: wu panruo
  • 4 records: zhai yujia
  • 4 records: chen jieyang
  • 4 records: banerjee p
  • 4 records: zhao kai
  • 3 records: nguyen c
  • 3 records: ouyang kaiming
  • 3 records: liang xin
  • 3 records: strazdins peter ...
  • 3 records: harding brendan
  • 3 records: li sihuan
  • 3 records: vinnakota b
  • 3 records: abraham ja
  • 2 records: grover pulkit
  • 2 records: liu jinyang
  • 2 records: mayo jackson r.
  • 2 records: tao dingwen

Language

  • 78 records: English
  • 1 record: other

Search criteria: "Subject term = algorithm-Based fault tolerance"
79 records; showing 11-20

A Highly-Efficient Error Detection Technique for General Matrix Multiplication using Tiled Processing on SIMD Architecture
IEEE 40th International Conference on Computer Design (ICCD)
Authors: Mummidi, Chandra Sekhar; Bal, Sandeep; Goldstein, Brunno F.; Srinivasan, Sudarshan; Kundu, Sandip
Affiliations: Univ Massachusetts Amherst MA 01003 USA; Univ Fed Rio de Janeiro UFRJ Rio De Janeiro Brazil; Intel Labs Mumbai Maharashtra India
General Matrix Multiplication (GEMM) is instrumental in myriads of scientific, high-performance computing, and machine learning applications such as computer vision, recommendation models, and weather forecasts. It is...

FT-BLAS: A High Performance BLAS Implementation With Online Fault Tolerance
35th ACM International Conference on Supercomputing (ICS)
Authors: Zhai, Yujia; Giem, Elisabeth; Fan, Quan; Zhao, Kai; Liu, Jinyang; Chen, Zizhong
Affiliations: Univ Calif Riverside Riverside CA 92521 USA
Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly...

FT-PBLAS: PBLAS-based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems
IEEE Access, 2020, vol. 8, pp. 42674-42688
Authors: Zhu, Yanchao; Liu, Yi; Zhang, Guozhen
Affiliations: Beihang Univ Sch Comp Sci & Engn Sino German Joint Software Inst Beijing 100191 Peoples R China; Beihang Univ Beijing Key Lab Network Technol Beijing 100191 Peoples R China
As high-performance computing (HPC) systems have scaled up, resilience has become a great challenge. To guarantee resilience, various kinds of hardware and software techniques have been proposed. However, among popula...

Physics-Based Checksums for Silent-Error Detection in PDE Solvers
25th International Conference on Parallel and Distributed Computing (Euro-Par)
Authors: Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C.
Affiliations: Sandia Natl Labs POB 969 Livermore CA 94551 USA
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. ...

3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication
26th International Conference on Parallel and Distributed Computing (Euro-Par)
Authors: Jeong, Haewon; Yang, Yaoqing; Gupta, Vipul; Engelmann, Christian; Low, Tze Meng; Cadambe, Viveck; Ramchandran, Kannan; Grover, Pulkit
Affiliations: Carnegie Mellon Univ Pittsburgh PA 15213 USA; Univ Calif Berkeley Berkeley CA USA; Oak Ridge Natl Lab Oak Ridge TN USA; Penn State Univ State Coll PA USA
In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. ...

Block-checksum-based Fault Tolerance for Matrix Multiplication on Large-Scale Parallel Systems
20th IEEE International Conference on High Performance Computing and Communications (HPCC) / 16th IEEE International Conference on Smart City (SmartCity) / 4th IEEE International Conference on Data Science and Systems (DSS)
Authors: Zhu, Yanchao; Liu, Yi; Li, Mingzhen; Qian, Depei
Affiliations: Beihang Univ Sino German Joint Software Inst Beijing Peoples R China
With the scaling up of high performance computers, resilience has become a big challenge. Among various kinds of software-based fault-tolerant approaches, the algorithm-based fault tolerance (ABFT) has some attractive...

Adaptive control in roll-forward recovery for extreme scale multigrid
International Journal of High Performance Computing Applications, 2019, vol. 33, no. 5, pp. 817-837
Authors: Huber, Markus; Ruede, Ulrich; Wohlmuth, Barbara
Affiliations: Tech Univ Munich Bolzmannstr 3 D-85748 Munich Germany; Friedrich Alexander Univ Nurnberg Erlangen Erlangen Germany; CERFACS Parallel Algorithms Project Toulouse France
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently...

"Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products
IEEE Transactions on Information Theory, 2019, vol. 65, no. 10, pp. 6171-6193
Authors: Dutta, Sanghamitra; Cadambe, Viveck; Grover, Pulkit
Affiliations: Carnegie Mellon Univ Dept Elect & Comp Engn Pittsburgh PA 15213 USA; Penn State Univ Dept Elect Engn University Pk PA 16802 USA
We consider the problem of computing a matrix-vector product Ax using a set of P parallel or distributed processing nodes prone to "straggling," i.e., unpredictable delays. Every processing node can access o...

Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers
17th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC)
Authors: Oboril, Fabian; Tahoori, Mehdi B.; Heuveline, Vincent; Lukarski, Dimitar; Weiss, Jan-Philipp
Affiliations: Karlsruhe Inst Technol KIT Chair Dependable Nano Comp CDNC Karlsruhe Germany; Karlsruhe Inst Technol KIT Engn Math & Comp Lab EMCL Karlsruhe Germany; Karlsruhe Inst Technol KIT Shared Res Grp New Frontiers High Performance Comp Exploit Multicore & Coprocessor Technol Karlsruhe Germany
As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault tolerant numerical computing engines, as used in many critical applicatio...

Fault Tolerant One-sided Matrix Decompositions on Heterogeneous Systems with GPUs
International Conference on High Performance Computing, Networking, Storage, and Analysis (SC)
Authors: Chen, Jieyang; Li, Hongbo; Li, Sihuan; Liang, Xin; Wu, Panruo; Tao, Dingwen; Ouyang, Kaiming; Liu, Yuanlai; Zhao, Kai; Guan, Qiang; Chen, Zizhong
Affiliations: Univ Calif Riverside Riverside CA 92521 USA; Univ Houston Houston TX 77004 USA; Univ Alabama Tuscaloosa AL 35487 USA; Kent State Univ Kent OH 44242 USA
Current algorithm-based fault tolerance (ABFT) approach for one-sided matrix decomposition on heterogeneous systems with GPUs have following limitations: (1) they do not provide sufficient protection as most of them o...