In distributed computing systems, computation redundancy is used to mitigate the adverse effect of stragglers on computation time. The redundancy can be added proactively at the beginning, or reactively after some time based on the delay pattern of the workers. While most existing work on reactive mitigation considers only task replication, we propose a coded reactive straggler mitigation strategy with an uncoded phase and a coded phase for distributed matrix-matrix multiplication. Specifically, in the uncoded phase of the proposed strategy, the master distributes the computational job among the workers without redundancy. After a predetermined waiting time, the master cancels the unfinished tasks, encodes them, and distributes the coded tasks among the workers. In the uncoded phase, in addition to the conventional erasure model, where workers can communicate only once, we consider a multi-message communication (MMC) model that exploits the partial work done by workers. The optimum waiting time for the uncoded phase and the optimum code rate for the coded phase are also obtained. Our simulation results demonstrate that the proposed coded reactive mitigation significantly decreases the execution time in comparison with both the proactive mitigation strategy and the existing reactive mitigation strategy.
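As a rough illustration of the two-phase idea described above, the following Python sketch simulates one uncoded phase followed by one coded phase under the erasure model. It is not the authors' simulation. The shifted-exponential delay model, the one-task-per-worker assignment, the MDS-style "any r of m coded results suffice" recovery rule, and all function and parameter names (`sample_task_time`, `two_phase_time`, `average_time`, `wait`, `code_rate`) are assumptions made for illustration only. The MMC model and the actual matrix encoding are not modeled.

```python
import math
import random


def sample_task_time(rng, shift=1.0, rate=1.0):
    """Assumed delay model: shifted exponential task completion time."""
    return shift + rng.expovariate(rate)


def two_phase_time(n_workers, wait, code_rate, rng):
    """Completion time of one run of the uncoded-then-coded sketch."""
    # Uncoded phase: one task per worker, no redundancy.
    times = [sample_task_time(rng) for _ in range(n_workers)]
    if max(times) <= wait:
        # Every task finished before the waiting time expired.
        return max(times)
    # The master cancels the tasks still unfinished at time `wait`.
    remaining = sum(1 for t in times if t > wait)
    # Coded phase (assumption): the remaining tasks are encoded into
    # `coded` tasks at the given code rate, at most one per worker.
    coded = min(n_workers, math.ceil(remaining / code_rate))
    coded_times = sorted(sample_task_time(rng) for _ in range(coded))
    # Assumed MDS-style recovery: any `remaining` coded results suffice.
    return wait + coded_times[remaining - 1]


def average_time(n_workers, wait, code_rate, runs=2000, seed=0):
    """Monte Carlo estimate of the mean completion time."""
    rng = random.Random(seed)
    total = sum(two_phase_time(n_workers, wait, code_rate, rng)
                for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    # Crude grid over waiting times for a fixed, hypothetical code rate.
    for wait in (1.5, 2.0, 3.0):
        print(wait, round(average_time(20, wait, 0.5), 3))
```

Sweeping `wait` and `code_rate` over a grid in this way gives a crude numerical view of the trade-off that the optimum waiting time and optimum code rate are meant to resolve. The values it prints depend entirely on the assumed delay model.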