Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation...
详细信息
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results cannot be trusted any more. A well known technique to correct soft errors in matrix-matrix multiplication is algorithm-basedfaulttolerance (ABFT). While ABET achieves much better efficiency than triple modular redundancy (TMR) - a traditional general technique to correct soft errors, both ABFT and TMR detect errors off-line after the computation is finished. This paper extends the traditional ABET technique from off-line to on-line so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during the program execution and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e. less than 1%) performance penalty over the ATLAS dgemm (). (C) 2013 Elsevier B.V. All rights reserved.
暂无评论