Recently, the data-parallel pipeline approach has been widely used to train DNN models on commodity GPU servers. However, three challenges remain for hybrid parallelism on commodity GPU servers: i) a balanced model partition is crucial for efficiency, yet prior works lack a sound solution for generating a balanced partition automatically; ii) an orchestrated device mapping is essential to reduce communication contention, yet prior works ignore server heterogeneity and thereby exacerbate contention; iii) startup overhead is inevitable and especially significant for deep pipelines, making it a major source of pipeline bubbles that severely limits pipeline scalability. We propose AutoPipe-H to address these three problems. It contains i) a pipeline partitioner component that automatically and quickly generates a balanced sub-block partition scheme; ii) a device mapping component that assigns pipeline stages to devices with server heterogeneity in mind, reducing communication contention; and iii) a distributed-training runtime component that reduces pipeline startup overhead by splitting the micro-batch evenly. Experimental results show that AutoPipe-H accelerates training by up to 1.26x over the hybrid parallelism frameworks DAPPLE and Piper, with a 2.73x-12.7x improvement in partition balance and an order-of-magnitude reduction in partition-search time.
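The abstract does not disclose how AutoPipe-H's partitioner works internally, but the balanced-partition problem it targets has a classic formulation: split a sequence of per-layer costs into k contiguous pipeline stages so that the heaviest stage is as light as possible. The sketch below is purely illustrative, not the paper's algorithm; the function names and the integer per-layer cost estimates are hypothetical.

    # Illustrative sketch of balanced contiguous partitioning (NOT AutoPipe-H's
    # published method): binary-search the smallest feasible per-stage cost cap,
    # then greedily cut the layer sequence under that cap.

    def stages_needed(costs, cap):
        """Greedily count stages required if no stage may exceed `cap`."""
        stages, acc = 1, 0
        for c in costs:
            if acc + c > cap:
                stages, acc = stages + 1, c
            else:
                acc += c
        return stages

    def balanced_partition(costs, k):
        """Partition `costs` into at most `k` contiguous stages, minimizing
        the maximum stage cost."""
        lo, hi = max(costs), sum(costs)
        while lo < hi:
            mid = (lo + hi) // 2
            if stages_needed(costs, mid) <= k:
                hi = mid
            else:
                lo = mid + 1
        # Emit the contiguous stages under the found cap `lo`.
        stages, cur, acc = [], [], 0
        for c in costs:
            if acc + c > lo:
                stages.append(cur)
                cur, acc = [c], c
            else:
                cur.append(c)
                acc += c
        stages.append(cur)
        return stages

    if __name__ == "__main__":
        layer_costs = [4, 2, 7, 3, 5, 6, 1, 8]    # made-up per-layer costs
        print(balanced_partition(layer_costs, 3))  # [[4, 2, 7], [3, 5, 6], [1, 8]]

This binary-search-plus-greedy scheme runs in O(n log(sum of costs)) time, which is consistent with the abstract's claim that a fast search procedure matters; the paper's actual search space (sub-block partitions rather than whole layers) is likely richer than this toy version.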
ISBN (digital): 9781728110851
ISBN (print): 9781728110851
Computational storage devices enable in-storage processing of data in place. These devices contain 64-bit application processors and hardware accelerators that can help improve performance and save power by reducing or eliminating data movement between host computers and storage units. This paper proposes a framework, named Stannis, for distributed in-storage training of deep neural networks on clusters of computational storage devices. This in-storage style of training ensures that private data never leaves the storage while the public sharing of data remains fully controlled. The Stannis framework distributes the workload according to the processing power of each worker by determining the proper batch size for each node. Stannis also ensures the availability of input data for all nodes to avoid rank stalls while maximizing utilization and overall processing speed. Experimental results show up to 2.7x speedup and a 69% reduction in energy consumption with no significant loss in accuracy.
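The abstract says Stannis sizes each node's batch according to its processing power but does not give the exact rule. A minimal sketch of one natural choice, proportional allocation with the rounding remainder handed to the fastest node, is shown below; the function name and the throughput figures are assumptions for illustration, not Stannis's published algorithm.

    # Illustrative sketch: size per-node batches in proportion to measured
    # throughput so all workers finish a training step at roughly the same
    # time, avoiding rank stalls in synchronous data parallelism.

    def proportional_batch_sizes(global_batch, throughputs):
        """Split `global_batch` across nodes proportionally to `throughputs`
        (hypothetical samples/sec measured per computational storage device)."""
        total = sum(throughputs)
        sizes = [int(global_batch * t / total) for t in throughputs]
        # Give the integer-rounding remainder to the fastest node so the
        # per-node sizes sum exactly to the global batch size.
        fastest = max(range(len(sizes)), key=lambda i: throughputs[i])
        sizes[fastest] += global_batch - sum(sizes)
        return sizes

    if __name__ == "__main__":
        print(proportional_batch_sizes(256, [90.0, 60.0, 30.0]))  # [129, 85, 42]

Matching step latency across heterogeneous workers is what makes the claimed utilization gains plausible: in synchronous training the step time is set by the slowest node, so equalizing per-node compute time removes idle waiting on the faster devices.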