ISBN:
(Print) 9798350301199
In recent years, with the continuous development of artificial intelligence, deep learning algorithms have become increasingly complex and the scale of model training keeps growing. The artificial intelligence platform in our computing network operating system project also involves large-scale model training. However, as datasets and models grow, traditional single-card training becomes very slow and the training accuracy converges poorly, which no longer meets practical computational needs. This has motivated well-known pipeline-parallel systems such as GPipe and PipeDream. In this paper, an efficient pipeline-parallel training optimization method is proposed. In our approach, multiple computing nodes process small batches of data in parallel in a pipelined manner. Our work covers two main aspects. First, we design a weight buffer strategy that limits the number of weight versions generated and preserves model accuracy, and we develop a tensor compression mechanism to improve the transmission rate. Second, we propose a prefix-sum partition algorithm that balances the pipeline partitions and saves memory on the computing resources. Compared with several popular pipeline-parallel frameworks, the proposed method achieves about twice the training speed and saves about 30%-40% of memory usage.
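The abstract does not spell out the prefix-sum partition algorithm, but its core idea can be sketched. The following is an illustrative reconstruction, not the paper's actual code: given integer per-layer compute costs, prefix sums let each candidate stage load be evaluated in O(1), and a binary search over the bottleneck cost finds a balanced contiguous partition. All names here are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def balanced_partition(costs, num_stages):
    """Minimize the bottleneck (maximum stage cost) over all partitions
    of `costs` into at most `num_stages` contiguous pipeline stages."""
    prefix = [0] + list(accumulate(costs))  # prefix[i] = sum(costs[:i])

    def stages_needed(cap):
        # Greedy: extend each stage as far as the prefix sums allow
        # without exceeding `cap`. Requires cap >= max(costs), which
        # the binary search below guarantees.
        stages, start = 0, 0
        while start < len(costs):
            # largest end such that prefix[end] - prefix[start] <= cap
            end = bisect_right(prefix, prefix[start] + cap) - 1
            stages += 1
            start = end
        return stages

    lo, hi = max(costs), sum(costs)  # bounds on the optimal bottleneck
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, layer costs [4, 2, 3, 1, 5, 2] split into 3 stages yield a best-possible bottleneck of 7 ([4, 2] | [3, 1] | [5, 2]).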
ISBN:
(Digital) 9781510614086
ISBN:
(Print) 9781510614086; 9781510614079
Currently, embedded signal processing systems are often designed for a specific application, an approach that is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on a multi-core DSP platform is designed, mainly suited to complex algorithms composed of different modules. The model combines the ideas of multi-level pipeline parallelism and message passing, and draws on the advantages of the mainstream multi-core DSP models (the Master-Slave model and the Data Flow model) to achieve better performance. A three-dimensional image generation algorithm is used to validate the efficiency of the proposed model by comparing it with the Master-Slave and Data Flow models.
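The combination of multi-level pipelining and message passing described above can be illustrated in a minimal, platform-neutral sketch (plain Python threads standing in for DSP cores; the paper's actual implementation targets multi-core DSP hardware and is not reproduced here): each stage is a worker that consumes messages from an input queue, applies its module, and forwards results downstream.

```python
import threading
import queue

SENTINEL = object()  # end-of-stream marker passed down the pipeline

def stage(fn, q_in, q_out):
    """One pipeline stage: apply `fn` to each message until shutdown."""
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)   # propagate shutdown downstream
            break
        q_out.put(fn(item))

def run_pipeline(items, fns):
    """Chain one worker thread per stage with message queues between them."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(fns)]
    for t in threads:
        t.start()
    for x in items:
        queues[0].put(x)          # feed the first stage
    queues[0].put(SENTINEL)
    out = []
    while (y := queues[-1].get()) is not SENTINEL:
        out.append(y)             # drain the last stage
    for t in threads:
        t.join()
    return out
```

Because each stage has a single worker and FIFO queues, output order matches input order while different stages overlap their work on different messages.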
ISBN:
(Print) 9781595939609
Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, applicable under weakly to strongly ordered memory consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated on real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general-purpose commodity hardware.
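FastForward's central cache optimization is that full/empty status is inferred from the slot contents themselves (an empty marker in the slot), so the producer never reads the consumer's index and vice versa, avoiding cache-line ping-pong on shared counters. The following plain-Python sketch models only that protocol; the real FastForward is C over shared memory with careful cache-line placement, which Python cannot express.

```python
class SPSCQueue:
    """Conceptual single-producer/single-consumer ring: None marks an
    empty slot, so neither side needs to read the other's index."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0   # written only by the producer
        self.tail = 0   # written only by the consumer

    def enqueue(self, item):              # producer side only
        if self.buf[self.head] is not None:
            return False                  # next slot still full: queue full
        self.buf[self.head] = item
        self.head = (self.head + 1) % len(self.buf)
        return True

    def dequeue(self):                    # consumer side only
        item = self.buf[self.tail]
        if item is None:
            return None                   # queue empty
        self.buf[self.tail] = None        # free the slot for the producer
        self.tail = (self.tail + 1) % len(self.buf)
        return item
```

Note that `None` is reserved as the empty marker, so it cannot itself be enqueued; the C original uses a NULL pointer the same way.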
Graph neural networks (GNNs) have been successfully applied to many important application domains on graph data. As graphs become increasingly large, existing GNN training frameworks typically use mini-batch sampling during feature aggregation to lower resource burdens, which unfortunately suffers from long memory access latency and inefficient transfer of vertex features from CPU to GPU. This paper proposes 2PGraph, a system that addresses these limitations of mini-batch sampling and feature aggregation and supports fast and efficient single-GPU and distributed GNN training. First, 2PGraph presents a locality-aware GNN training scheduling method that schedules vertices based on the locality of the graph topology, significantly accelerating sampling and aggregation, improving the data locality of vertex access, and limiting the range of neighborhood expansion. Second, 2PGraph proposes a GNN-layer-aware feature caching method on available GPU resources with a hit rate of up to 100%, which avoids redundant data transfer between CPU and GPU. Third, 2PGraph presents a self-dependence cluster-based graph partition method, achieving high sampling and cache efficiency in distributed environments. Experimental results on real-world graph datasets show that 2PGraph reduces mini-batch sampling memory access latency by up to 90% and data transfer time by up to 99%. For distributed GNN training over an 8-GPU cluster, 2PGraph achieves up to 8.7x speedup over state-of-the-art approaches.
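The GPU feature-caching idea can be illustrated with a toy sketch. This is not 2PGraph's layer-aware policy; the selection heuristic below (cache the highest-degree vertices, which are sampled most often) is a common stand-in, and all names are hypothetical.

```python
def build_cache(degrees, budget):
    """Pick the `budget` highest-degree vertices whose features should
    stay resident on the GPU (degree as a proxy for access frequency)."""
    ranked = sorted(range(len(degrees)), key=lambda v: degrees[v], reverse=True)
    return set(ranked[:budget])

def batch_hit_rate(batch_vertices, cache):
    """Fraction of a mini-batch served from the GPU-resident cache;
    misses must fetch features over the CPU-GPU link."""
    hits = sum(1 for v in batch_vertices if v in cache)
    return hits / len(batch_vertices)
```

The higher the hit rate, the less vertex feature data crosses the CPU-GPU link per mini-batch, which is the transfer cost the abstract reports reducing by up to 99%.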
ISBN:
(Print) 9781665435741
Graph neural networks (GNNs) have been emerging as powerful learning tools for unstructured data and have been successfully applied to many graph-based application domains. Sampling-based GNN inference is commonly adopted in existing graph learning frameworks to handle large-scale graphs. However, this approach is restricted by redundant vertex embedding computation on the GPU and inefficient loading of vertex features from CPU to GPU. In this paper, we propose PCGraph, a system that supports adaptive GNN inference and feature partition caching. PCGraph significantly reduces vertex embedding computation time through adaptive GNN inference, which selects the optimal inference algorithm and minimizes vertex embedding computation. PCGraph also reduces redundant data transfer between the CPU and GPU by partitioning the target vertices and caching the corresponding feature partitions in turn. We evaluate PCGraph against two state-of-the-art industrial GNN frameworks, i.e., PyG and DGL, on a diverse array of benchmarks. Experimental results show that PCGraph reduces vertex embedding computation by up to 99% and data loading time by up to 98.5%, and achieves up to 360x speedup over the state-of-the-art baselines.
ISBN:
(Print) 9781728196664
Graph neural networks (GNNs) have been emerging as powerful learning tools for unstructured data and have been successfully applied to many graph-based application domains. Sampling-based graph training is commonly used in existing GNN training frameworks to handle large-scale graphs. However, this type of approach is restricted by long memory access latency, neighborhood explosion during mini-batch sampling, and inefficient loading of vertex features from CPU to GPU. In this paper, we propose 2PGraph, a system that supports high-speed locality-aware mini-batch sampling and GNN-layer-aware feature caching. 2PGraph significantly reduces sampling time through vertex-cluster sampling, which improves the locality of vertex access and limits the range of neighborhood expansion. To further reduce sampling time in a distributed environment, we renumber the vertices in subgraphs after graph partition, which improves the data locality of each partition. 2PGraph also avoids redundant data transfer between CPU and GPU by caching feature data on available GPU resources with a hit rate of 100%. Furthermore, 2PGraph develops a GNN-layer-aware feature caching policy during data-parallel training and achieves better cache efficiency and memory utilization. We evaluate 2PGraph against two state-of-the-art industrial GNN frameworks, i.e., PyG and DGL, on a diverse array of benchmarks. Experimental results show that 2PGraph reduces mini-batch sampling time by up to 90% and data loading time by up to 99%, and achieves up to 8.7x speedup over the state-of-the-art baselines on an 8-GPU cluster.