ISBN: (print) 9781479956180
Resilience to failure is a key concern for next-generation high-performance computing systems. The dominant fault tolerance mechanism, coordinated checkpoint/restart, is projected to no longer be a viable option on these systems due to its predicted overheads. Rollback avoidance has the potential to prolong the viability of coordinated checkpoint/restart by allowing an application to make meaningful forward progress, perhaps with degraded performance, despite the occurrence or imminence of a failure. In this paper, we present two general analytic models for the performance of rollback avoidance techniques and validate these models against the performance of existing rollback avoidance techniques. We then use these models to evaluate the applicability of rollback avoidance for next-generation exascale systems. This includes analysis of exascale system design questions such as: (1) how effective must an application-specific rollback avoidance technique be to usefully augment checkpointing in an exascale system? (2) when is rollback avoidance on its own a viable alternative to coordinated checkpointing? and (3) how do rollback avoidance techniques and system characteristics interact to influence application performance?