With the growing volume of training data and the increasing size of deep neural network (DNN) models, efficiently scaling DNN training has become a significant challenge for server clusters with AI accelerators in terms of memory and computing efficiency. Existing parallelism schemes can be broadly classified into three categories: data parallelism (splitting data samples), model parallelism (splitting model parameters), and pipeline model parallelism (splitting model layers). Hybrid approaches split both data and models, offering a more comprehensive solution for parallel training. However, these methods struggle to scale larger models efficiently across more computing nodes, as substantial memory constraints degrade training efficiency and overall throughput. In this paper, we propose HIPPIE, a hybrid parallel training framework designed to enhance the memory efficiency and scalability of large DNN training. First, to evaluate the effect of optimization more soundly, we propose a Memory Efficiency (ME) metric that quantifies the tradeoff between throughput and memory overhead. Second, driven by this ME optimization objective, we automatically partition the pipeline to balance throughput and memory. Third, we optimize the training process via a novel hybrid parallel scheduler that improves throughput and scalability through informed pipeline scheduling and communication scheduling with gradient-hidden optimization. Experiments on various models show that HIPPIE achieves over 90% scaling efficiency on a 16-GPU platform. Moreover, HIPPIE increases throughput by up to 80%, while saving 57% of memory overhead and achieving a 4.18x improvement in memory efficiency.
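The abstract does not give the precise definition of the ME metric, only that it captures the tradeoff between throughput and memory overhead. The sketch below is a minimal illustration, assuming ME is measured as training throughput (samples/s) per GiB of peak accelerator memory; the function name measure_memory_efficiency and the measurement loop are illustrative, not taken from HIPPIE.

```python
# Minimal sketch of a Memory Efficiency (ME) style measurement, assuming
# ME = throughput (samples/s) / peak GPU memory (GiB). The exact formula
# used by HIPPIE is not given in the abstract; names here are illustrative.
import time
import torch

def measure_memory_efficiency(model, data_loader, loss_fn, optimizer, device="cuda"):
    """Run one pass of training steps and report throughput, peak memory, and their ratio."""
    model.to(device)
    torch.cuda.reset_peak_memory_stats(device)  # assumes a CUDA device is available
    samples, start = 0, time.time()

    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        samples += inputs.size(0)

    elapsed = time.time() - start
    throughput = samples / elapsed                                 # samples per second
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30     # peak memory in GiB
    return {"throughput": throughput,
            "peak_memory_gib": peak_gib,
            "memory_efficiency": throughput / peak_gib}            # assumed ME definition
```

The "gradient-hidden" communication scheduling mentioned in the abstract generally refers to overlapping gradient synchronization (e.g., all-reduce) with backward computation so that communication latency is hidden behind compute; the specific scheduling policy HIPPIE uses is not described here.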
Training large Deep Convolutional Neural Networks (DCNNs) with increasingly large datasets to improve model accuracy has become extremely time-consuming. Distributed training methods, such as data parallelism (DP) and...