arXiv

DeFT: Mitigating Data Dependencies for Flexible Communication Scheduling in Distributed Training

Authors: Meng, Lin; Sun, Yuzhong

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China

Published in: arXiv

Year: 2025

Keywords: Scheduling algorithms

Abstract: Communication scheduling aims to reduce communication bottlenecks in data-parallel (DP) training by maximizing the overlap between computation and communication. However, existing schemes fall short due to three main issues: (1) hard data dependencies break some of the overlap between communication and computation; (2) high coverage rates leave little room for further performance improvement; (3) imbalanced communication/computation times of tensors, caused by partitioning/fusion strategies, introduce additional bubbles. To address these drawbacks, we propose DeFT, a new communication scheduling scheme whose key insight is to mitigate data dependencies and support flexible scheduling in distributed training. DeFT uncovers new overlap opportunities in training by transforming the scheduling problem into multiple knapsack problems. Specifically, DeFT eliminates hard dependencies with delayed updates, reduces the coverage rate by adjusting the update frequency and utilizing heterogeneous communication links, and merges the backward or forward computation times into the knapsack capacity to avoid the negative impact of unbalanced tensors. Additionally, DeFT preserves training accuracy by adjusting its scheduling strategy via convergence loss quantification. Extensive experiments with 16 A100 GPUs showed that DeFT achieved speedups of 29% to 115% on three representative benchmarks compared to US-Byte and Bytescheduler, with no loss of accuracy. Copyright © 2025, The Authors. All rights reserved.
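The abstract's core idea, treating a computation window as a knapsack whose capacity is filled with tensor communications, can be sketched as below. This is an illustrative 0/1-knapsack formulation under stated assumptions, not the authors' implementation: `pack_comms`, the discretized millisecond timings, and the choice of communication time as both weight and value are all hypothetical simplifications for exposition.

```python
# Illustrative sketch (not DeFT's actual algorithm): pick which tensor
# communications to overlap with a computation window of fixed length,
# maximizing total overlapped communication time via 0/1 knapsack DP.
# Assumptions: times are discretized to integer milliseconds, and an
# item's value equals its weight (its communication time).

def pack_comms(comm_times_ms, window_ms):
    """Return (indices, total) of tensor communications that fit the
    computation window while maximizing overlapped communication time."""
    n, W = len(comm_times_ms), window_ms
    dp = [0] * (W + 1)                       # dp[w]: best overlap with capacity w
    keep = [[False] * (W + 1) for _ in range(n)]
    for i, t in enumerate(comm_times_ms):
        for w in range(W, t - 1, -1):        # iterate downward: each item used once
            if dp[w - t] + t > dp[w]:
                dp[w] = dp[w - t] + t
                keep[i][w] = True
    chosen, w = [], W                        # backtrack to recover selected tensors
    for i in range(n - 1, -1, -1):
        if keep[i][w]:
            chosen.append(i)
            w -= comm_times_ms[i]
    return sorted(chosen), dp[W]

# Example: a 10 ms backward-computation window, four tensors with the
# given per-tensor communication times in ms.
idx, overlapped = pack_comms([4, 3, 6, 2], 10)
print(idx, overlapped)  # → [0, 2] 10
```

The downward capacity loop is the standard 0/1-knapsack trick that prevents a tensor from being scheduled twice; tensors left unselected would, in the paper's framing, spill into a later window or a different (heterogeneous) communication link.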
