ISBN: (print) 9783319273082; 9783319273075
The global application checkpoint-restart approach in use today will not be a suitable solution for HPC applications running at large scale: given the predicted fault rates, it will impose a high load on the I/O subsystem and lead to inefficient resource usage. Combining application checkpointing with message logging is appealing because it allows restarting only the processes that actually failed. One major issue with message logging protocols is the large amount of memory required to store the logs. In this work we propose to use additional dedicated resources to save the part of the logs that would not fit in the memory of a compute node. We show that, combined with a cluster-based hierarchical logging technique, only a few dedicated nodes would be required to accommodate the memory requirements of message logging protocols. We additionally show that the proposed technique achieves a reasonable performance overhead.
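The idea of spilling logs to dedicated resources can be illustrated with a minimal sketch. It is not the protocol from the paper: the class `SenderLog`, its fields, the byte-based budget, and the oldest-first spill policy are all hypothetical, and the dedicated logger node is simulated by an in-memory list. The sketch also assumes sender-based logging and that a cluster-based hierarchical scheme logs only messages that cross cluster boundaries.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SenderLog:
    """Sender-side message log with a fixed local memory budget (hypothetical)."""
    rank: int                 # rank of the process that owns this log
    cluster_of: dict          # rank -> cluster id (assumed static mapping)
    local_budget: int         # bytes of compute-node memory reserved for logs
    local: deque = field(default_factory=deque)   # entries kept in node memory
    spilled: list = field(default_factory=list)   # stand-in for a dedicated node
    local_bytes: int = 0

    def send(self, dest: int, payload: bytes) -> bool:
        """Record an outgoing message; return True if it was logged."""
        # Assumption: messages inside a cluster are not logged.
        if self.cluster_of[dest] == self.cluster_of[self.rank]:
            return False
        self.local.append((dest, payload))
        self.local_bytes += len(payload)
        # Move the oldest entries to the dedicated resource until the
        # local log fits in its budget again (illustrative policy only).
        while self.local_bytes > self.local_budget and self.local:
            old_dest, old_payload = self.local.popleft()
            self.local_bytes -= len(old_payload)
            self.spilled.append((old_dest, old_payload))
        return True


# Three processes: ranks 0 and 1 share cluster 0, rank 2 is in cluster 1.
log = SenderLog(rank=0, cluster_of={0: 0, 1: 0, 2: 1}, local_budget=8)
log.send(1, b"aaaa")     # intra-cluster message, skipped
log.send(2, b"bbbbbb")   # inter-cluster message, fits in the local budget
log.send(2, b"cccc")     # exceeds the budget, so the oldest entry is spilled
```

In this toy model, restricting logging to inter-cluster traffic shrinks the volume that has to be stored at all, and the spill step bounds what each compute node keeps in its own memory.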