In distributed deep learning (DL), collective communication algorithms such as Allreduce, which are used to share training results between graphics processing units (GPUs), are an inevitable bottleneck. We hypothesize that the cache access latency incurred at every Allreduce is a significant bottleneck in current computational systems with high-bandwidth interconnects for distributed DL. To reduce how often this latency is incurred, it is important to aggregate data at the network interfaces. We implement a data aggregation circuit in a field-programmable gate array (FPGA). Using this FPGA, we propose a novel Allreduce architecture and training strategy without accuracy degradation. Measurement results show that the Allreduce latency is reduced to 1/4. Our system can also conceal about 90% of the communication overhead and improve scalability by 20%. The end-to-end training time for distributed DL with ResNet-50 on ImageNet is reduced to 87.3% without any degradation in validation accuracy.
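To make the role of Allreduce concrete, the sketch below shows the per-step gradient averaging that such a collective performs in data-parallel training, using PyTorch's torch.distributed. This is only an illustration of the baseline operation the paper targets; the FPGA-based aggregation at the network interface is not represented here, and the launch setup (torchrun, NCCL backend, the model and batch sizes) is an assumption for the example.

```python
import os
import torch
import torch.distributed as dist

def average_gradients(model):
    """Allreduce (sum) each gradient across all workers, then divide by world size."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Every worker contributes its local gradient; all workers receive the sum.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    # Hypothetical launch, e.g. `torchrun --nproc_per_node=4 allreduce_example.py`.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Linear(1024, 1000).cuda()
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()

    # This collective runs at every training step, which is why its latency
    # dominates when the interconnect bandwidth is already high.
    average_gradients(model)

    dist.destroy_process_group()
```

Because this Allreduce is issued once per iteration, any fixed per-call latency (such as the cache access latency discussed above) is paid at every step, which motivates aggregating data closer to the network interface rather than in host or GPU memory.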