ISBN (print): 9781538674628
In distributed computing systems, computational redundancy is used to mitigate the adverse effect of stragglers on computation time. The redundancy can be added proactively at the beginning of the computation, or reactively after some time based on the observed delay pattern of the workers. While most existing work on reactive mitigation considers only task replication, we propose a coded reactive straggler mitigation strategy for distributed matrix-matrix multiplication that consists of an uncoded phase and a coded phase. Specifically, in the uncoded phase, the master distributes the computational job among the workers without redundancy and waits for some time. After this waiting time, the master cancels the remaining tasks, encodes them, and distributes the encoded tasks among the workers that have already completed their computations. We analytically derive the expected execution time of the proposed method. Furthermore, we investigate the optimal waiting time for the uncoded phase and the optimal code rate for the coded phase. Our simulation results demonstrate that the proposed coded reactive mitigation strategy significantly decreases the execution time compared with either the proactive mitigation strategy or the repetition-based reactive mitigation strategy.
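The two-phase strategy can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation. The following choices are hypothetical: A is split row-wise into one block per worker; task times follow a shifted-exponential model scaled by task size; a random Gaussian generator matrix stands in for an MDS code (any k of its rows are invertible with probability 1); and `code_rate` is interpreted as k_c / f, where the canceled rows are re-split into k_c pieces and encoded into one coded task for each of the f workers that finished by `t_wait`. The names `reactive_coded_matmul`, `mean_time`, `shift`, and `mu` are illustrative only.

```python
import numpy as np


def split_rows(M, k):
    """Split M row-wise into k equal blocks (k must divide M.shape[0])."""
    return np.split(M, k, axis=0)


def encode(blocks, n_coded, rng):
    """Encode k equal-size blocks into n_coded random linear combinations.

    Assumption: a Gaussian generator matrix stands in for an MDS code.
    """
    G = rng.standard_normal((n_coded, len(blocks)))
    coded = np.tensordot(G, np.stack(blocks), axes=1)  # (n_coded, rows, cols)
    return G, coded


def decode(G, idx, results):
    """Recover the k uncoded products from the k coded products listed in idx."""
    Y = np.stack(results)  # (k, rows, p)
    X = np.linalg.solve(G[idx], Y.reshape(len(idx), -1))
    return X.reshape(Y.shape)


def reactive_coded_matmul(A, B, n_workers, t_wait, code_rate, rng,
                          shift=1.0, mu=1.0):
    """Return (A @ B, simulated finish time) under two-phase mitigation.

    Hypothetical delay model: a task of relative size s takes
    s * (shift + Exp(mean 1/mu)) time units on any worker.
    """
    assert 0 < code_rate <= 1
    blocks = split_rows(A, n_workers)
    rows = blocks[0].shape[0]
    # Uncoded phase: worker i computes blocks[i] @ B with no redundancy.
    t1 = shift + rng.exponential(1.0 / mu, n_workers)
    done = t1 <= t_wait
    if done.all() or not done.any():
        # No task to cancel, or no idle worker to help: wait out phase 1.
        return np.vstack([b @ B for b in blocks]), float(t1.max())
    out = [blocks[i] @ B if done[i] else None for i in range(n_workers)]
    remaining = np.flatnonzero(~done)
    f = int(done.sum())  # workers that finished by t_wait
    # Coded phase: cancel unfinished tasks, re-split their rows into k_c
    # pieces (zero-padded), and encode them into f coded tasks (rate k_c / f).
    k_c = max(1, int(code_rate * f))
    R = np.vstack([blocks[i] for i in remaining])
    n_rem = R.shape[0]
    pad = (-n_rem) % k_c
    R = np.vstack([R, np.zeros((pad, R.shape[1]))])
    G, coded = encode(split_rows(R, k_c), f, rng)
    # A coded task's time scales with its size relative to an uncoded block.
    size = (R.shape[0] / k_c) / rows
    t2 = size * (shift + rng.exponential(1.0 / mu, f))
    fastest = np.argsort(t2)[:k_c]  # master decodes from the k_c fastest
    pieces = decode(G, fastest, [coded[j] @ B for j in fastest])
    rec = pieces.reshape(-1, B.shape[1])[:n_rem]  # drop zero padding
    for n, i in enumerate(remaining):
        out[i] = rec[n * rows:(n + 1) * rows]
    return np.vstack(out), t_wait + float(t2[fastest].max())


def mean_time(A, B, n_workers, t_wait, code_rate, trials=200, seed=0):
    """Monte Carlo estimate of the expected execution time."""
    rng = np.random.default_rng(seed)
    return float(np.mean([
        reactive_coded_matmul(A, B, n_workers, t_wait, code_rate, rng)[1]
        for _ in range(trials)]))
```

Sweeping `t_wait` and `code_rate` in `mean_time` gives a rough empirical view of the waiting-time and code-rate trade-offs described above. It does not reproduce the paper's analytical expression for the expected execution time, and its numbers depend entirely on the assumed delay model.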