检索结果-内蒙古大学图书馆

Training large-scale language models with limited GPU memory:a survey

Frontiers of Information Technology & Electronic Engineering 2025年第3期26卷 309-331页

作者： Yu TANG Linbo QIAO Lujia YIN Peng LIANG Ao SHEN Zhilin YANG Lizhi ZHANG Dongsheng LI National Key Laboratory of Parallel and Distributed Computing College of ComputerNational University of Defense TechnologyChangsha 410073China

Large-scale models have gained significant attention in a wide range of fields,such as computer vision and natural language processing,due to their effectiveness across various ***,a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units(GPUs).In this paper,we present a comprehensive survey focused on training large-scale models with limited GPU *** exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process,namely model parameters,model states,and model *** this analysis,we present an in-depth overview of the relevant research work that addresses these aspects ***,the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models,emphasizing the necessity for continued research and innovation in this *** survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.

关键词： Training techniques Memory optimization Model parameters Model states Model activations

来源：评论

学校读者我要写书评

暂无评论

SIGNGD with Error Feedback Meets Lazily Aggregated Technique:Communication-Efficient Algorithms for distributed Learning

引用

Tsinghua Science and Technology 2022年第1期27卷 174-185页

作者： Xiaoge Deng Tao Sun Feng Liu Dongsheng Li National Laboratory for Parallel and Distributed Processing(PDL) College of ComputerNational University of Defense TechnologyChangsha 410073China

The proliferation of massive datasets has led to significant interests in distributed algorithms for solving large-scale machine learning ***,the communication overhead is a major bottleneck that hampers the scalability of distributed machine learning *** this paper,we design two communication-efficient algorithms for distributed learning *** first one is named EF-SIGNGD,in which we use the 1-bit(sign-based) gradient quantization method to save the communication ***,the error feedback technique,i.e.,incorporating the error made by the compression operator into the next step,is employed for the convergence *** second algorithm is called LE-SIGNGD,in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect the gradients with small changes and reuse the outdated ***-SIGNGD saves communication costs both in transmitted bits and communication ***,we show that LE-SIGNGD is convergent under some mild *** effectiveness of the two proposed algorithms is demonstrated through experiments on both real and synthetic data.

关键词： distributed learning communication-efficient algorithm convergence analysis

来源：评论

学校读者我要写书评

暂无评论

FMCC-RT: a scalable and fine-grained all-reduce algorithm for large-scale SMP clusters

引用

Science China(Information Sciences) 2025年第5期68卷 362-379页

作者： Jintao PENG Jie LIU Jianbin FANG Min XIE Yi DAI Zhiquan LAI Bo YANG Chunye GONG Xinjun MAO Guo MAO Jie REN School of Computer Science and Technology National University of Defense Technology Science and Technology on Parallel and Distributed Processing Laboratory National University of Defense Technology Laboratory of Digitizing Software for Frontier Equipment National University of Defense Technology National Supercomputer Center in Tianjin School of Computer Science Shaanxi Normal University

All-reduce is a widely used communication technique for distributed and parallel applications typically implemented using either a tree-based or ring-based scheme. Each of these approaches has its own limitations: tree-based schemes struggle with efficiently exchanging large messages, while ring-based solutions assume constant communication throughput,an unrealistic expectation in modern network communication infrastructures. We present FMCC-RT, an all-reduce approach that combines the advantages of tree-and ring-based implementations while mitigating their drawbacks. FMCC-RT dynamically switches between tree and ring-based implementations depending on the size of the message being processed. It utilizes an analytical model to assess the impact of message sizes on the achieved throughput, enabling the derivation of optimal work partitioning parameters. Furthermore, FMCC-RT is designed with an Open MPI-compatible API, requiring no modification to user code. We evaluated FMCC-RT through micro-benchmarks and real-world application tests. Experimental results show that FMCC-RT outperforms state-of-the-art tree-and ring-based methods, achieving speedups of up to 5.6×.

关键词： all-reduce collective communication MPI scalability

来源：评论

学校读者我要写书评

暂无评论

CD-Sched: An Automated Scheduling Framework for Accelerating Neural Network Training on Shared Memory CPU-DSP Platforms

CD-Sched: An Automated Scheduling Framework for Accelerating...

引用

2023 International Conference on Power, Communication, Computing and Networking Technologies, PCCNT 2023

作者： Xiao, Yuanyuan Lai, Zhiquan Li, Dongsheng National Key Laboratory of Parallel and Distributed Processing Computer College National University of Defense Technology Changsha China

ISBN: (纸本)9781450399951

DSP holds significant potential for important applications in Deep Neural Networks. However, there is currently a lack of research focused on shared-memory CPU-DSP heterogeneous chips. This paper proposes CD-Sched, an automated scheduling framework that aims to address this gap. By predicting the latency of operators on both CPU and DSP, CD-Sched automatically schedules the computation of operators to the appropriate computing device. This scheduling optimization accelerates the computation of individual operators and ultimately improves the overall training time of neural networks. In end-to-end training tasks, CD-Sched can significantly reduce the overall training time, with an average reduction of approximately 10.77%. © 2023 ACM.

关键词： Deep neural networks

来源：评论

学校读者我要写书评

暂无评论

Smoothing Point Adjustment-Based Evaluation of Time Series Anomaly Detection 48

Smoothing Point Adjustment-Based Evaluation of Time Series A...

引用

48th IEEE International Conference on Acoustics, Speech and Signal processing, ICASSP 2023

作者： Liu, Mingyu Wang, Yijie Xu, Hongzuo Zhou, Xiaohui Li, Bin Wang, Yongjun National University of Defense Technology Science and Technology on Parallel and Distributed Processing Laboratory College of Computer Changsha China

ISBN: (纸本)9781728163277

Anomalies in time series appear consecutively, forming anomaly segments. Applying the classical point-based evaluation metrics to evaluate the detection performance of segments leads to considerable underestimation, so most related studies resort to point adjustment. This operation treats all points as true positives within a segment equally when only one individual point alarms, resulting in significant overestimation and creating an illusion of superior performance. This paper proposes smoothing point adjustment, a novel range-based evaluation protocol for time series anomaly detection. Our protocol reflects detection performance impartially by carefully considering the specific location and frequency of alarms in the raw results. It is achieved by smoothly determining the adjustment range and rewarding early detection via a ranging function and a rewarding function. Compared with other evaluation metrics, experiments on different datasets show that our protocol can yield a performance ranking of various methods more consistent with the desired situation. © 2023 IEEE.

关键词： Anomaly Detection Evaluation Protocol Point Adjustment Time Series

来源：评论

学校读者我要写书评

暂无评论

AFMA-Track: Adaptive Fusion of Motion and Appearance for Robust Multi-object Tracking 27th

AFMA-Track: Adaptive Fusion of Motion and Appearance for ...

引用

27th International Conference on Pattern Recognition, ICPR 2024

作者： Liao, Wei Luo, Lei Zhang, Chunyuan College of Computer Science and Technology National University of Defence Technology Changsha China Science and Technology on Parallel and Distributed Processing Laboratory College of Computer Science and Technology National University of Defense Technology Changsha China

ISBN: (纸本)9783031784439

Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distinctive appearance representations, the use of appearance and motion cues is often confined to simplistic association techniques. For instance, fixed weights are commonly employed to combine the intersection-over-union (IoU) matrix and appearance similarity matrix, yielding an association cost matrix. To harness the full potential of motion and appearance cues across diverse scenarios, we propose an innovative approach that dynamically balances motion and appearance cues based on scene and object information during the association process. Furthermore, we introduce a new mechanism for updating appearance representations, effectively mitigating noise introduced by occlusion. Our method demonstrates state-of-the-art performance on the MOT17 and MOT20 test sets. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

关键词： Object detection

来源：评论

学校读者我要写书评

暂无评论

Funnel: An Efficient Sparse Attention Accelerator with Multi-Dataflow Fusion 22

Funnel: An Efficient Sparse Attention Accelerator with Multi...

引用

22nd IEEE International Symposium on parallel and distributed processing with Applications, ISPA 2024

作者： Ma, Shenghong Xu, Jinwei Jiang, Jingfei Wang, Yaohua Li, Dongsheng National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing College of Computer Changsha China

ISBN: (纸本)9798331509712

The self-attention mechanism is the core component of Transformer, which provides a powerful ability to understand the sequence context. However, the self-attention mechanism also suffers from a large amount of redundant computation. Model sparsification can effectively reduce computational load, but the irregularity of non-zeros introduced by sparsification significantly decreases hardware efficiency. This paper proposes Funnel, an accelerator that dynamically predicts sparse attention patterns and efficiently processes unstructured sparse data. Firstly, we adopt a fast quantization method based on lookup table to minimize the cost of sparse patterns prediction. Secondly, we propose Funnel Computing Unit (FCU), a hardware architecture that efficiently handles sparse attention through multi-dataflow fusion. Sampled Dense-Dense Matrix Multiplication (SDDMM) and Sparse-Dense Matrix Multiplication (SpMM) are core components of sparse attention mechanism. FCU unifies the computation ways of matrix inner product and row-wise product to support SDDMM and SpMM at the same time, which greatly reduces the storage and movement overhead of intermediate results. Lastly, we devise a lightweight buffer and data tiling strategy tailored to the proposed accelerator, aimed at enhancing data reuse. Experiments demonstrate that our accelerator achieves 0.10-0.25 sparsity with small accuracy loss. When computing the self-attention layer, it attains hardware efficiency ranging from 60% to 85%. Compared to CPU and GPU, it achieves 5.60x and 8.20x speedup. Compared to the state-of-the-art attention accelerators A3, SpAtten, FTRANS, and Sanger, it achieves 7.37x, 4.52x, 9.58x, and 3.08x speedup. © 2024 IEEE.

关键词： FPGA Funnel computing unit Sparse attention Transformer

来源：评论

学校读者我要写书评

暂无评论

Mbapp: Efficient Memory-Balanced Pipeline parallelism for Large Model Fine-Tuning on Commodity GPU Servers 24

Mbapp: Efficient Memory-Balanced Pipeline Parallelism for La...

引用

5th International Conference on computer Information and Big Data Applications, CIBDA 2024

作者： Liu, Yujie Lai, Zhiquan Li, Dongsheng National Key Laboratory of Parallel and Distributed Computing College of Computer National University of Defense Technology Changsha410000 China

ISBN: (纸本)9798400718106

Large-scale models have demonstrated outstanding performance across various downstream tasks. Pipeline parallelism is essential for fine-tuning large models on commodity GPU servers, as it plays a crucial role in making their exceptional performance more accessible and widespread. Existing approaches have encountered challenges in attaining effective memory-balanced pipeline parallelism. This paper presents a novel memory load-balanced pipeline parallel solution that aims to distribute memory usage evenly across stages on commodity GPU servers by leveraging NVLink bridges. The solution presents a novel approach for transferring data from GPUs to CPUs through the PCI-e link between adjacent GPUs interconnected by the NVLink bridge. Moreover, our methodology orchestrates data transfer operations to reduce offloading latency during the fine-tuning of large models. The evaluation demonstrates that our approach enhances switching efficiency by 1.46x to 1.76x and improves throughput by 2.04x to 3.59x compared to the PyTorch offloading technique. © 2024 ACM.

关键词： Deep neural networks

来源：评论

学校读者我要写书评

暂无评论

Communication Analysis for Multidimensional parallel Training of Large-scale DNN Models 25

Communication Analysis for Multidimensional Parallel Trainin...

引用

25th IEEE International Conferences on High Performance Computing and Communications, 9th International Conference on Data Science and Systems, 21st IEEE International Conference on Smart City and 9th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC/DSS/SmartCity/DependSys 2023

作者： Lai, Zhiquan Hao, Yanqi Li, Shengwei Li, Dongsheng College of Computer National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing Changsha China

ISBN: (纸本)9798350330014

Multidimensional parallel training has been widely applied to train large-scale deep learning models like GPT-3. The efficiency of parameter communication among training devices/processes is often the performance bottleneck of large model training. Analysis of parameter communication mode and traffic has important reference significance for the research of interconnection network design and computing task scheduling to improve the training performance. In this paper, we analyze the parametric communication modes in typical 3D parallel training (data parallelism, pipeline parallelism, and tensor parallelism), and model the traffic in different communication modes. Finally, taking GPT-3 as an example, we present the communication in its 3D parallel training. © 2023 IEEE.

关键词： Deep neural networks

来源：评论

学校读者我要写书评

暂无评论

Efficient Large Models Fine-tuning on Commodity Servers via Memory-balanced Pipeline parallelism 25

Efficient Large Models Fine-tuning on Commodity Servers via ...

引用

作者： Liu, Yujie Lai, Zhiquan Liu, Weijie Wang, Wei Li, Dongsheng College of Computer National University of Defense Technology National Key Laboratory of Parallel and Distributed Computing Changsha China

ISBN: (纸本)9798350330014

Large models have achieved impressive performance in many downstream tasks. Using pipeline parallelism to fine-tune large models on commodity GPU servers is an important way to make the excellent performance of large models available to the general public. Previous solutions fail to achieve an efficient memory-balanced pipeline parallelism. In this poster, we introduce a memory load-balanced pipeline parallel solution. This solution balances memory consumption across stages on commodity GPU servers via NVLink bridges. It establishes a new pathway to offload data from GPU to CPU by using the PCIe link of adjacent GPUs connected by the NVLink bridge. Furthermore, our method orchestrates offload operations to minimize the offload latency during large model fine-tuning. Experiments demonstrate that our solution can balance the memory footprint among pipeline stages without sacrificing training performance. © 2023 IEEE.

关键词： Program processors

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：