
Refine search results

Document type

  • 115 conference papers
  • 55 journal articles
  • 2 books

Holdings

  • 172 electronic documents
  • 0 print holdings

Date distribution

Subject classification

  • 100 Engineering
    • 88 Computer Science and Technology
    • 67 Software Engineering
    • 11 Information and Communication Engineering
    • 11 Control Science and Engineering
    • 10 Electronic Science and Technology
    • 9 Bioengineering
    • 7 Mechanical Engineering
    • 7 Electrical Engineering
    • 6 Chemical Engineering and Technology
    • 4 Power Engineering and Engineering Thermophysics
    • 3 Cyberspace Security
    • 2 Materials Science and Engineering
    • 2 Civil Engineering
    • 2 Aeronautics and Astronautics Science and Technology
    • 2 Agricultural Engineering
    • 2 Safety Science and Engineering
  • 41 Science
    • 23 Mathematics
    • 9 Biology
    • 7 Chemistry
    • 6 Physics
    • 5 Statistics
    • 4 Systems Science
  • 34 Management
    • 19 Management Science and Engineering
    • 15 Library, Information and Archival Management
    • 7 Business Administration
  • 4 Economics
    • 4 Applied Economics
  • 2 Law
    • 2 Sociology
  • 2 Agronomy
    • 2 Crop Science
  • 2 Medicine
    • 2 Public Health and Preventive Medicine
  • 1 Education
    • 1 Pedagogy

Topics

  • 8 data models
  • 7 laboratories
  • 7 distributed comp...
  • 6 semantics
  • 6 distributed proc...
  • 5 concurrent compu...
  • 5 anomaly detectio...
  • 5 computational mo...
  • 5 accuracy
  • 4 deep learning
  • 4 application soft...
  • 4 feature extracti...
  • 4 training
  • 3 knowledge based ...
  • 3 parallel process...
  • 3 routing
  • 3 programming
  • 3 self-supervised ...
  • 3 speech processin...
  • 3 neural networks

Institutions

  • 23 national key lab...
  • 14 college of compu...
  • 13 national laborat...
  • 12 national laborat...
  • 12 national key lab...
  • 10 national key lab...
  • 8 laboratory of di...
  • 7 science and tech...
  • 6 key laboratory o...
  • 6 national univers...
  • 6 college of compu...
  • 4 department of co...
  • 4 department of co...
  • 4 school of automa...
  • 4 department of co...
  • 4 college of compu...
  • 4 key laboratory o...
  • 4 sino-german coll...
  • 4 jiangnan institu...
  • 4 nuclear power in...

Authors

  • 15 li dongsheng
  • 11 dongsheng li
  • 11 huaimin wang
  • 10 dou yong
  • 9 gang yin
  • 9 ji wang
  • 9 tao wang
  • 8 qiao peng
  • 8 yue yu
  • 7 wang yijie
  • 7 lai zhiquan
  • 7 huang zhen
  • 7 yong dou
  • 6 yijie wang
  • 5 peng qiao
  • 5 liu jie
  • 5 wang wei
  • 5 zhiquan lai
  • 5 wang ji
  • 5 qiang fan

Language

  • 145 English
  • 22 Other
  • 5 Chinese

Search criteria: "Affiliation = National Laboratory for Parallel and Distributed Computing"
172 records; results 1-10 are shown below.
Automatic parallelism strategy generation with minimal memory redundancy
Frontiers of Information Technology & Electronic Engineering, 2025, Vol. 26, No. 1, pp. 109-118
Authors: Yanqi SHI, Peng LIANG, Hao ZHENG, Linbo QIAO, Dongsheng LI (National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410000, China)
Large-scale deep learning models are trained in a distributed manner due to memory and computing resource *** existing strategy generation approaches take optimal memory minimization as the *** fill in this gap, we propose a nov...
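
The abstract above is truncated, so as context only: a minimal back-of-the-envelope sketch in Python (not the paper's algorithm) of where memory redundancy in a parallelism strategy comes from, contrasting fully replicated mixed-precision training state with optimizer-state sharding. The helper name per_gpu_memory_gib, the byte counts, and the 7B/8-GPU setting are assumptions for illustration.

    def per_gpu_memory_gib(num_params, num_gpus, shard_optimizer_states):
        """Rough per-GPU footprint of mixed-precision Adam training, ignoring activations."""
        weights = 2 * num_params      # fp16 weights, bytes
        grads = 2 * num_params        # fp16 gradients
        optim = 12 * num_params       # fp32 master weights + Adam m and v (4 + 4 + 4 bytes)
        if shard_optimizer_states:
            optim /= num_gpus         # sharding the optimizer state removes this redundancy
        return (weights + grads + optim) / 2**30

    params = 7e9                      # hypothetical 7B-parameter model
    for shard in (False, True):
        print(f"shard optimizer states={shard}: "
              f"{per_gpu_memory_gib(params, 8, shard):.1f} GiB per GPU")
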
Training large-scale language models with limited GPU memory: a survey
Frontiers of Information Technology & Electronic Engineering, 2025, Vol. 26, No. 3, pp. 309-331
Authors: Yu TANG, Linbo QIAO, Lujia YIN, Peng LIANG, Ao SHEN, Zhilin YANG, Lizhi ZHANG, Dongsheng LI (National Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha 410073, China)
Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various ***, a notable hurdle in training these l...
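
Since the survey's abstract is cut off here, the following is only a small illustrative sketch of one widely used memory-saving technique that falls under its topic: activation (gradient) checkpointing in PyTorch, which trades recomputation for activation memory. The module, its depth, and the tensor sizes are invented for the example.

    import torch
    from torch import nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedMLP(nn.Module):
        def __init__(self, dim=1024, depth=8):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
            )

        def forward(self, x):
            for block in self.blocks:
                # Activations inside `block` are not kept; they are recomputed during
                # backward, lowering peak memory at the cost of extra forward FLOPs.
                x = checkpoint(block, x, use_reentrant=False)
            return x

    model = CheckpointedMLP()
    x = torch.randn(4, 1024, requires_grad=True)
    model(x).sum().backward()
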
Optimizing Fine-Tuning in Quantized Language Models: An In-Depth Analysis of Key Variables
Computers, Materials & Continua, 2025, Vol. 82, No. 1, pp. 307-325
Authors: Ao Shen, Zhiquan Lai, Dongsheng Li, Xiaoyu Hu (National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China; Strategic Assessments and Consultation Institute, Academy of Military Science, Beijing 100091, China)
Large-scale Language Models (LLMs) have achieved significant breakthroughs in Natural Language Processing (NLP), driven by the pre-training and fine-tuning *** this approach allows models to specialize in specific tasks w...
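
The truncated abstract does not say which fine-tuning recipe the analysis covers, so the sketch below shows only one common setup for fine-tuning a quantized model: a frozen per-tensor int8 base weight with trainable LoRA-style low-rank adapters. The class name, the rank, and the use of LoRA here are assumptions for illustration, not the paper's method.

    import torch
    from torch import nn

    class QuantizedLinearWithAdapter(nn.Module):
        """Frozen int8 base weight plus a trainable low-rank adapter (generic sketch)."""
        def __init__(self, in_features, out_features, rank=8):
            super().__init__()
            w = torch.randn(out_features, in_features)            # stand-in pretrained weight
            self.scale = w.abs().max() / 127.0                    # per-tensor symmetric scale
            self.register_buffer("w_int8", torch.round(w / self.scale).to(torch.int8))
            self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

        def forward(self, x):
            w = self.w_int8.float() * self.scale                  # dequantize the frozen base
            return x @ (w + self.lora_b @ self.lora_a).t()        # only the adapters get gradients

    layer = QuantizedLinearWithAdapter(256, 256)
    layer(torch.randn(2, 256)).sum().backward()
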
Exploring Quantization Techniques for Large-Scale Language Models: Methods, Challenges and Future Directions
9th International Conference on Cyber Security and Information Engineering, ICCSIE 2024
Authors: Shen, Ao; Lai, Zhiquan; Li, Dongsheng (National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, China)
Breakthroughs in natural language processing (NLP) by large-scale language models (LLMs) have led to superior performance in multilingual tasks such as translation, summarization, and Q&A. However, the size and co...
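
Because the abstract is cut off, the next block is just a generic NumPy sketch of the simplest technique in this family, per-tensor symmetric int8 weight quantization, to make the memory-versus-error trade-off concrete. The function names and the matrix size are illustrative assumptions.

    import numpy as np

    def quantize_int8(w):
        """Per-tensor symmetric quantization: w is approximated by scale * q, q in [-127, 127]."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)    # stand-in weight matrix
    q, s = quantize_int8(w)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"{w.nbytes / 2**20:.0f} MiB fp32 -> {q.nbytes / 2**20:.0f} MiB int8, "
          f"mean abs error {err:.4f}")
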
U-shaped Dual Attention Transformer: An Efficient Transformer Based on Channel and Spatial Attention
4th International Conference on Artificial Intelligence, Robotics, and Communication, ICAIRC 2024
Authors: Zhai, Zhaoyuan; Qiao, Peng; Li, Rongchun; Zhou, Zhen (National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, Changsha, China)
Transformer-based methods have demonstrated remarkable performance on image super-resolution tasks. Due to high computational complexity, researchers have been working to achieve a balance between computation costs an...
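
The abstract is cut off before the method itself, so the block below is only a generic PyTorch sketch of the two ingredients named in the title, channel attention followed by spatial attention over a feature map (a CBAM-like arrangement); it is not the paper's actual block, and the class name, reduction ratio, and tensor sizes are assumptions.

    import torch
    from torch import nn

    class DualAttention(nn.Module):
        """Channel attention, then spatial attention, on a (B, C, H, W) feature map."""
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.channel_mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid(),
            )
            self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

        def forward(self, x):
            b, c, _, _ = x.shape
            ca = self.channel_mlp(x.mean(dim=(2, 3))).view(b, c, 1, 1)   # per-channel weights
            x = x * ca
            pooled = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
            return x * torch.sigmoid(self.spatial_conv(pooled))          # per-location weights

    print(DualAttention(64)(torch.randn(2, 64, 32, 32)).shape)           # torch.Size([2, 64, 32, 32])
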
Funnel: An Efficient Sparse Attention Accelerator with Multi-Dataflow Fusion
22nd IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2024
Authors: Ma, Shenghong; Xu, Jinwei; Jiang, Jingfei; Wang, Yaohua; Li, Dongsheng (National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, College of Computer, Changsha, China)
The self-attention mechanism is the core component of Transformer, which provides a powerful ability to understand the sequence context. However, the self-attention mechanism also suffers from a large amount of redund...
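
The truncated abstract motivates sparse attention by the redundancy of full self-attention; as a point of reference only, here is a minimal NumPy sketch of one simple sparsification, masking all but the highest-scoring keys per query before the softmax. It illustrates the idea, not the accelerator's dataflow, and all names and sizes are assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def topk_sparse_attention(q, k, v, keep=4):
        """Scaled dot-product attention that masks all but the highest `keep` scores per query."""
        scores = q @ k.T / np.sqrt(q.shape[-1])                   # (Lq, Lk) score matrix
        thresh = np.sort(scores, axis=-1)[:, -keep][:, None]      # keep-th largest score per row
        scores = np.where(scores >= thresh, scores, -np.inf)      # low-scoring keys get zero weight
        return softmax(scores) @ v

    L, d = 16, 32
    q, k, v = (np.random.randn(L, d) for _ in range(3))
    print(topk_sparse_attention(q, k, v).shape)                   # (16, 32)
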
Mbapp: Efficient Memory-Balanced Pipeline Parallelism for Large Model Fine-Tuning on Commodity GPU Servers
5th International Conference on Computer Information and Big Data Applications, CIBDA 2024
Authors: Liu, Yujie; Lai, Zhiquan; Li, Dongsheng (National Key Laboratory of Parallel and Distributed Computing, College of Computer, National University of Defense Technology, Changsha 410000, China)
Large-scale models have demonstrated outstanding performance across various downstream tasks. Pipeline parallelism is essential for fine-tuning large models on commodity GPU servers, as it plays a crucial role in maki...
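
The abstract stops before the method, so the sketch below only illustrates the underlying balancing problem in memory-balanced pipeline parallelism: splitting a sequence of layers into contiguous stages so that the heaviest stage is as light as possible. The per-layer memory numbers and the binary-search-plus-greedy approach are assumptions for illustration, not the paper's algorithm.

    def min_max_stage_memory(layer_mem, num_stages):
        """Smallest per-stage memory cap that fits the layers into `num_stages` contiguous stages."""
        def stages_needed(cap):
            count, cur = 1, 0
            for m in layer_mem:
                if m > cap:
                    return float("inf")          # a single layer exceeds the cap
                if cur + m > cap:
                    count, cur = count + 1, 0    # open a new stage
                cur += m
            return count

        lo, hi = max(layer_mem), sum(layer_mem)
        while lo < hi:                           # binary search on the memory cap
            mid = (lo + hi) // 2
            if stages_needed(mid) <= num_stages:
                hi = mid
            else:
                lo = mid + 1
        return lo

    # e.g. an embedding, ten transformer blocks, and an output head (arbitrary memory units)
    layers = [6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 8]
    print(min_max_stage_memory(layers, num_stages=4))
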
HAF: a hybrid annotation framework based on expert knowledge and learning technique
Science China (Information Sciences), 2022, Vol. 65, No. 1, pp. 276-278
Authors: Zhixing LI, Yue YU, Tao WANG, Gang YIN, Xinjun MAO, Huaimin WANG (Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology; College of Computer, National University of Defense Technology)
Dear editor, The increasing awareness of the potential value hidden in data has resulted in many data mining studies being conducted. In the domain of software engineering, for example, developers' behavioral data ...
Communication Analysis for Multidimensional Parallel Training of Large-scale DNN Models
25th IEEE International Conference on High Performance Computing and Communications, 9th International Conference on Data Science and Systems, 21st IEEE International Conference on Smart City and 9th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC/DSS/SmartCity/DependSys 2023
Authors: Lai, Zhiquan; Hao, Yanqi; Li, Shengwei; Li, Dongsheng (College of Computer, National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, Changsha, China)
Multidimensional parallel training has been widely applied to train large-scale deep learning models like GPT-3. The efficiency of parameter communication among training devices/processes is often the performance bott...
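
As a concrete feel for the kind of accounting such a communication analysis involves, the short calculation below estimates per-step all-reduce traffic for the data-parallel dimension using the standard ring all-reduce volume of roughly 2(N-1)/N bytes per gradient byte. The model size, precision, and worker counts are assumptions, not results from the paper.

    def ring_allreduce_bytes(grad_bytes, num_workers):
        """Approximate bytes sent per worker per step by a ring all-reduce."""
        return 2 * (num_workers - 1) / num_workers * grad_bytes

    params = 1.3e9                 # hypothetical 1.3B-parameter model
    grad_bytes = 2 * params        # fp16 gradients
    for n in (2, 8, 32):
        gib = ring_allreduce_bytes(grad_bytes, n) / 2**30
        print(f"{n:>2} data-parallel workers: ~{gib:.1f} GiB communicated per step")
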
Efficient Large Models Fine-tuning on Commodity Servers via Memory-balanced Pipeline Parallelism
25th IEEE International Conference on High Performance Computing and Communications, 9th International Conference on Data Science and Systems, 21st IEEE International Conference on Smart City and 9th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC/DSS/SmartCity/DependSys 2023
Authors: Liu, Yujie; Lai, Zhiquan; Liu, Weijie; Wang, Wei; Li, Dongsheng (College of Computer, National University of Defense Technology, National Key Laboratory of Parallel and Distributed Computing, Changsha, China)
Large models have achieved impressive performance in many downstream tasks. Using pipeline parallelism to fine-tune large models on commodity GPU servers is an important way to make the excellent performance of large ...