ISBN: 9781538678794 (print)
Checkpointing is a common approach to prevent loss of a program's state after permanent node failures. When it is performed at the application level, less data needs to be saved. This paper suggests an uncoordinated application-level checkpointing technique for task pools. It selectively and incrementally saves only those tasks that have stayed in the pool during some period of time and that have not been saved before. The checkpoints are held in a resilient in-memory data store. Our technique applies to any task pool variant in which workers operate at the top of local pools, and work stealing operates at the bottom. Furthermore, the tasks must be free of side effects, and the final result must be calculated by reduction from individual task results. We implemented the technique for the lifeline-based global load balancing variant of task pools. This variant couples random victim selection with an overlay graph for termination detection. A fault-tolerant realization already exists in the form of a Java library, called JFT_GLB. It uses the APGAS and Hazelcast libraries underneath. Our implementation modifies JFT_GLB by replacing its nonselective checkpointing scheme with our new one. In experiments, we compared the overhead of the new scheme to that of JFT_GLB, with UTS, BC and two synthetic benchmarks. The new scheme required slightly more running time when local pools were small, and paid off otherwise.
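The selective, incremental saving described in the abstract can be illustrated with a minimal Java sketch. It is not the JFT_GLB implementation: the class `SelectiveCheckpointPool` and all of its methods are hypothetical names, a plain `HashMap` stands in for the resilient in-memory data store, and "stayed in the pool during some period of time" is approximated as "was already present at the previous checkpoint". Dropping a saved task from the store when it is popped or stolen is also a simplifying assumption; the abstract does not say how the actual scheme handles this.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a local task pool with selective, incremental checkpointing.
class SelectiveCheckpointPool<T> {
    private static final class Entry<T> {
        final long id;
        final T task;
        boolean seenAtCheckpoint; // was in the pool at an earlier checkpoint
        boolean saved;            // already written to the store
        Entry(long id, T task) { this.id = id; this.task = task; }
    }

    // Last element is the top (worker side), first element is the bottom (steal side).
    private final Deque<Entry<T>> pool = new ArrayDeque<>();
    // Assumption: stand-in for a resilient in-memory data store.
    private final Map<Long, T> store = new HashMap<>();
    private long nextId = 0;

    // The worker inserts new tasks at the top.
    void push(T task) { pool.addLast(new Entry<>(nextId++, task)); }

    // The worker takes tasks from the top; returns null if the pool is empty.
    T pop() { return remove(pool.pollLast()); }

    // A thief takes tasks from the bottom; returns null if the pool is empty.
    T steal() { return remove(pool.pollFirst()); }

    private T remove(Entry<T> e) {
        if (e == null) return null;
        if (e.saved) store.remove(e.id); // simplification: the backup copy is dropped
        return e.task;
    }

    // Saves only tasks that survived since the previous checkpoint and were
    // not saved before. Returns the number of tasks written in this round.
    int checkpoint() {
        int written = 0;
        for (Entry<T> e : pool) {
            if (e.saved) continue;          // incremental: never write twice
            if (e.seenAtCheckpoint) {       // selective: task has stayed in the pool
                store.put(e.id, e.task);
                e.saved = true;
                written++;
            } else {
                e.seenAtCheckpoint = true;  // candidate for the next round
            }
        }
        return written;
    }

    int storedCount() { return store.size(); }
}
```

In this sketch, short-lived tasks near the top of the pool are typically processed before they ever qualify for saving, which is the intuition behind writing less data than a nonselective scheme would.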