Despite recent breakthroughs in distributed Graph Neural Network (GNN) training, large-scale graphs still generate significant network communication overhead, reducing time and resource efficiency. Although recently proposed partitioning and caching methods attempt to reduce communication overhead, they are not sufficiently effective because they are agnostic to the sampling pattern. This paper proposes Pacer, a Pipelined Partition-Aware Caching and Communication-Efficient Refinement System for communication-efficient distributed GNN training. First, Pacer estimates each partition's access frequency to each vertex by jointly considering the sampling method and the graph topology. It then uses the estimated access frequencies to refine partitions and to cache vertices in its two-level (CPU and GPU) cache, minimizing data-transfer latency. Furthermore, Pacer incorporates a pipeline-based minibatching method to mask the effect of network communication. Experimental results on real-world graphs show that Pacer outperforms state-of-the-art distributed GNN training systems in training time by 40% on average.
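The abstract describes placing vertices into a two-level (GPU and CPU) cache according to estimated access frequency. The following is a minimal sketch of that general idea, not Pacer's actual implementation; the function names, capacities, and toy frequency estimates are hypothetical.

```python
# Hypothetical sketch (not Pacer's code): rank vertices by estimated access
# frequency and place the hottest ones in a small GPU cache, the next tier
# in a larger CPU cache, and leave the rest on their remote partitions.

def plan_two_level_cache(access_freq, gpu_capacity, cpu_capacity):
    """access_freq: dict vertex_id -> estimated accesses per epoch.
    Returns (gpu_cached, cpu_cached) as sets of vertex ids."""
    ranked = sorted(access_freq, key=access_freq.get, reverse=True)
    gpu_cached = set(ranked[:gpu_capacity])
    cpu_cached = set(ranked[gpu_capacity:gpu_capacity + cpu_capacity])
    return gpu_cached, cpu_cached


def lookup(vertex, gpu_cached, cpu_cached):
    """Resolve where a requested vertex's features would be served from."""
    if vertex in gpu_cached:
        return "gpu"      # already on the device, no transfer
    if vertex in cpu_cached:
        return "cpu"      # host-to-device copy only
    return "remote"       # network fetch from the owning partition


if __name__ == "__main__":
    freq = {0: 120, 1: 95, 2: 40, 3: 7, 4: 3}   # toy per-epoch estimates
    gpu, cpu = plan_two_level_cache(freq, gpu_capacity=2, cpu_capacity=2)
    print(lookup(0, gpu, cpu), lookup(2, gpu, cpu), lookup(4, gpu, cpu))
```

In this sketch the estimated frequencies stand in for Pacer's joint sampling-and-topology estimate; the placement policy itself is just a greedy ranking under fixed capacities.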
Graph neural networks (GNNs) are a class of deep learning models trained on graphs and have been successfully applied in various domains. Despite their effectiveness, it remains challenging for GNNs to scale efficiently to large graphs. As a remedy, distributed computing is a promising approach to training large-scale GNNs, since it provides abundant computing resources. However, the dependencies imposed by the graph structure make high-efficiency distributed GNN training difficult, as it suffers from massive communication and workload imbalance. In recent years, many efforts have been devoted to distributed GNN training, and an array of training algorithms and systems have been proposed. Yet, there is a lack of systematic review of the optimization techniques for the distributed execution of GNN training. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. We then introduce a new taxonomy of the optimization techniques that address these challenges, classifying existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. In the conclusion, we summarize existing distributed GNN systems for multiple graphics processing units (GPUs), GPU clusters, and central processing unit (CPU) clusters, respectively, and present a discussion of future directions for distributed GNN training.
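The survey's first challenge, massive feature communication, arises because a minibatch on one partition needs the features of neighbors owned by other partitions. The short sketch below illustrates that effect under a simple edge-cut partitioning; it is an illustrative assumption, not code from the survey, and the graph, ownership map, and 128-dimensional float32 features are made up.

```python
# Illustrative sketch: estimate the remote-feature traffic one partition
# incurs for a single minibatch when neighbor features live on other ranks.

def remote_feature_volume(batch_vertices, neighbors, owner, local_rank,
                          feature_bytes):
    """neighbors: dict vertex -> list of neighbor ids.
    owner: dict vertex -> rank of the partition storing its features."""
    remote = set()
    for v in batch_vertices:
        for u in neighbors[v]:
            if owner[u] != local_rank:
                remote.add(u)          # feature must be pulled over the network
    return len(remote) * feature_bytes  # bytes transferred for this minibatch


if __name__ == "__main__":
    neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    owner = {0: 0, 1: 0, 2: 1, 3: 1}          # two-partition edge cut
    print(remote_feature_volume([0, 1], neighbors, owner, local_rank=0,
                                feature_bytes=4 * 128))  # 128-dim float32
```

Techniques in the survey's four categories (data partition, batch generation, execution model, communication protocol) can all be read as different ways of shrinking or hiding this remote-feature volume.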