ISBN:
(Print) 9798350301199
In recent years, with the continuous development of artificial intelligence, deep learning algorithms have become increasingly complex and the scale of model training keeps growing. The artificial intelligence platform in our computing network operating system project also involves large-scale model training. However, as datasets and models grow, traditional single-card training becomes very slow and the training accuracy converges poorly, which no longer meets practical computational needs. This has motivated well-known pipeline-parallel systems such as GPipe and PipeDream. In this paper, an efficient pipeline-parallel training optimization method is proposed. In our approach, multiple computing nodes process small batches of data in parallel in a pipelined manner. Our work covers two main aspects. First, we design a weight buffer strategy that limits the number of weight versions generated and preserves model accuracy, and we develop a tensor compression mechanism to improve the transmission rate. Second, we propose a prefix-sum partition algorithm that balances the pipeline partitions and saves memory on the computing resources. Compared with several popular pipeline-parallel frameworks, the proposed method achieves about twice the training speed and saves about 30%-40% of memory usage.
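The abstract does not spell out the prefix-sum partition algorithm, but its core idea can be sketched. The following is an illustrative reconstruction, not the paper's actual code: given integer per-layer compute costs, prefix sums let each candidate stage load be evaluated in O(1), and a binary search over the bottleneck cost finds a balanced contiguous partition. All names here are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def balanced_partition(costs, num_stages):
    """Minimize the bottleneck (maximum stage cost) over all partitions
    of `costs` into at most `num_stages` contiguous pipeline stages."""
    prefix = [0] + list(accumulate(costs))  # prefix[i] = sum(costs[:i])

    def stages_needed(cap):
        # Greedy: extend each stage as far as the prefix sums allow
        # without exceeding `cap`. Requires cap >= max(costs), which
        # the binary search below guarantees.
        stages, start = 0, 0
        while start < len(costs):
            # largest end such that prefix[end] - prefix[start] <= cap
            end = bisect_right(prefix, prefix[start] + cap) - 1
            stages += 1
            start = end
        return stages

    lo, hi = max(costs), sum(costs)  # bounds on the optimal bottleneck
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For example, layer costs [4, 2, 3, 1, 5, 2] split into 3 stages yield a best-possible bottleneck of 7 ([4, 2] | [3, 1] | [5, 2]).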
ISBN:
(Digital) 9781510614086
ISBN:
(Print) 9781510614086; 9781510614079
Currently, embedded signal processing systems are often designed for a specific application, an approach that is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on a multi-core DSP platform is designed, mainly suited to complex algorithms composed of different modules. The model combines the ideas of multi-level pipeline parallelism and message passing, and draws on the advantages of the mainstream multi-core DSP models (the Master-Slave model and the Data Flow model) to achieve better performance. A three-dimensional image generation algorithm is used to validate the efficiency of the proposed model by comparing it with the Master-Slave and Data Flow models.
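The combination of multi-level pipelining and message passing described above can be illustrated in a minimal, platform-neutral sketch (plain Python threads standing in for DSP cores; the paper's actual implementation targets multi-core DSP hardware and is not reproduced here): each stage is a worker that consumes messages from an input queue, applies its module, and forwards results downstream.

```python
import threading
import queue

SENTINEL = object()  # end-of-stream marker passed down the pipeline

def stage(fn, q_in, q_out):
    """One pipeline stage: apply `fn` to each message until shutdown."""
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)   # propagate shutdown downstream
            break
        q_out.put(fn(item))

def run_pipeline(items, fns):
    """Chain one worker thread per stage with message queues between them."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(fns)]
    for t in threads:
        t.start()
    for x in items:
        queues[0].put(x)          # feed the first stage
    queues[0].put(SENTINEL)
    out = []
    while (y := queues[-1].get()) is not SENTINEL:
        out.append(y)             # drain the last stage
    for t in threads:
        t.join()
    return out
```

Because each stage has a single worker and FIFO queues, output order matches input order while different stages overlap their work on different messages.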
ISBN:
(Print) 9781595939609
Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, applicable under weakly to strongly ordered memory consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated on real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general-purpose commodity hardware.
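FastForward's central cache optimization is that full/empty status is inferred from the slot contents themselves (an empty marker in the slot), so the producer never reads the consumer's index and vice versa, avoiding cache-line ping-pong on shared counters. The following plain-Python sketch models only that protocol; the real FastForward is C over shared memory with careful cache-line placement, which Python cannot express.

```python
class SPSCQueue:
    """Conceptual single-producer/single-consumer ring: None marks an
    empty slot, so neither side needs to read the other's index."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0   # written only by the producer
        self.tail = 0   # written only by the consumer

    def enqueue(self, item):              # producer side only
        if self.buf[self.head] is not None:
            return False                  # next slot still full: queue full
        self.buf[self.head] = item
        self.head = (self.head + 1) % len(self.buf)
        return True

    def dequeue(self):                    # consumer side only
        item = self.buf[self.tail]
        if item is None:
            return None                   # queue empty
        self.buf[self.tail] = None        # free the slot for the producer
        self.tail = (self.tail + 1) % len(self.buf)
        return item
```

Note that `None` is reserved as the empty marker, so it cannot itself be enqueued; the C original uses a NULL pointer the same way.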
Graph neural networks (GNNs) have been successfully applied to many important application domains on graph data. As graphs become increasingly large, existing GNN training frameworks typically use mini-batch sampling during feature aggregation to lower resource burdens, which unfortunately suffers from long memory access latency and inefficient transfer of vertex features from CPU to GPU. This paper proposes 2PGraph, a system that addresses these limitations of mini-batch sampling and feature aggregation and supports fast and efficient single-GPU and distributed GNN training. First, 2PGraph presents a locality-aware GNN training scheduling method that schedules vertices based on the locality of the graph topology, significantly accelerating sampling and aggregation, improving the data locality of vertex access, and limiting the range of neighborhood expansion. Second, 2PGraph proposes a GNN-layer-aware feature caching method on available GPU resources with a hit rate of up to 100%, which avoids redundant data transfer between CPU and GPU. Third, 2PGraph presents a self-dependence cluster-based graph partition method, achieving high sampling and cache efficiency in distributed environments. Experimental results on real-world graph datasets show that 2PGraph reduces mini-batch sampling memory access latency by up to 90% and data transfer time by up to 99%. For distributed GNN training over an 8-GPU cluster, 2PGraph achieves up to 8.7x speedup over state-of-the-art approaches.
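The GPU feature-caching idea can be illustrated with a toy sketch. This is not 2PGraph's layer-aware policy; the selection heuristic below (cache the highest-degree vertices, which are sampled most often) is a common stand-in, and all names are hypothetical.

```python
def build_cache(degrees, budget):
    """Pick the `budget` highest-degree vertices whose features should
    stay resident on the GPU (degree as a proxy for access frequency)."""
    ranked = sorted(range(len(degrees)), key=lambda v: degrees[v], reverse=True)
    return set(ranked[:budget])

def batch_hit_rate(batch_vertices, cache):
    """Fraction of a mini-batch served from the GPU-resident cache;
    misses must fetch features over the CPU-GPU link."""
    hits = sum(1 for v in batch_vertices if v in cache)
    return hits / len(batch_vertices)
```

The higher the hit rate, the less vertex feature data crosses the CPU-GPU link per mini-batch, which is the transfer cost the abstract reports reducing by up to 99%.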
ISBN:
(Print) 9781665435741
Graph neural networks (GNNs) have been emerging as powerful learning tools for unstructured data and have been successfully applied to many graph-based application domains. Sampling-based GNN inference is commonly adopted in existing graph learning frameworks to handle large-scale graphs. However, this approach is restricted by redundant vertex embedding computation on the GPU and inefficient loading of vertex features from CPU to GPU. In this paper, we propose PCGraph, a system that supports adaptive GNN inference and feature partition caching. PCGraph significantly reduces vertex embedding computation time through adaptive GNN inference, which selects the optimal inference algorithm and minimizes vertex embedding computation. PCGraph also reduces redundant data transfer between the CPU and GPU by partitioning the target vertices and caching the corresponding feature partitions in turn. We evaluate PCGraph against two state-of-the-art industrial GNN frameworks, i.e., PyG and DGL, on a diverse array of benchmarks. Experimental results show that PCGraph reduces vertex embedding computation by up to 99% and data loading time by up to 98.5%, and achieves up to 360x speedup over the state-of-the-art baselines.
ISBN:
(Print) 9781728196664
Graph neural networks (GNNs) have been emerging as powerful learning tools for unstructured data and have been successfully applied to many graph-based application domains. Sampling-based graph training is commonly used in existing GNN training frameworks to handle large-scale graphs. However, this type of approach is restricted by long memory access latency, neighborhood explosion during mini-batch sampling, and inefficient loading of vertex features from CPU to GPU. In this paper, we propose 2PGraph, a system that supports high-speed locality-aware mini-batch sampling and GNN-layer-aware feature caching. 2PGraph significantly reduces sampling time through vertex-cluster sampling, which improves the locality of vertex access and limits the range of neighborhood expansion. To further reduce sampling time in a distributed environment, we renumber the vertices in subgraphs after graph partition, which improves the data locality of each partition. 2PGraph also avoids redundant data transfer between CPU and GPU by caching feature data on available GPU resources with a hit rate of 100%. Furthermore, 2PGraph develops a GNN-layer-aware feature caching policy during data-parallel training and achieves better cache efficiency and memory utilization. We evaluate 2PGraph against two state-of-the-art industrial GNN frameworks, i.e., PyG and DGL, on a diverse array of benchmarks. Experimental results show that 2PGraph reduces mini-batch sampling time by up to 90% and data loading time by up to 99%, and achieves up to 8.7x speedup over the state-of-the-art baselines on an 8-GPU cluster.