Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various tasks. However, a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units (GPUs). In this paper, we present a comprehensive survey focused on training large-scale models with limited GPU memory. Our exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process, namely model parameters, model states, and model activations. Based on this analysis, we present an in-depth overview of the relevant research work that addresses these aspects. Finally, the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models, emphasizing the necessity for continued research and innovation in this area. This survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.
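As a rough illustration of the three memory consumers named above, the sketch below estimates per-GPU training memory for a dense model trained with Adam in mixed precision. The bytes-per-parameter counts and the activation term are common rules of thumb assumed here for illustration, not figures taken from the survey.

# Hedged sketch: rough per-GPU memory estimate for mixed-precision training with Adam.
# Assumptions (not from the survey): fp16 weights + fp16 gradients (2 + 2 bytes/param),
# fp32 master weights + Adam first/second moments (4 + 4 + 4 bytes/param),
# and a crude activation term proportional to batch * seq_len * hidden * layers.

def training_memory_gb(num_params, layers, hidden, seq_len, micro_batch,
                       act_bytes_per_element=2, act_factor=16):
    GB = 1024 ** 3
    params_and_grads = num_params * (2 + 2)             # fp16 weights + fp16 gradients
    optimizer_states = num_params * (4 + 4 + 4)         # fp32 master copy + Adam m and v
    activations = (micro_batch * seq_len * hidden * layers
                   * act_factor * act_bytes_per_element)  # crude activation estimate
    return {
        "parameters+gradients": params_and_grads / GB,
        "optimizer states": optimizer_states / GB,
        "activations": activations / GB,
    }

# Example: a hypothetical 1.3B-parameter model.
print(training_memory_gb(1.3e9, layers=24, hidden=2048, seq_len=2048, micro_batch=4))

Even this coarse estimate shows that optimizer states and activations, not the weights themselves, dominate the footprint, which is why the survey treats model states and activations as separate targets for optimization.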
Community detection is a vital task in many fields, such as social networks and financial analysis, to name a few. The Louvain method, the main workhorse of community detection, is a popular heuristic method. To apply it to large-scale graph networks, researchers have proposed several parallel Louvain methods (PLMs), which suffer from two challenges: the latency in the information synchronization, and the community swap. To tackle these two challenges, we propose an isolate sets based parallel Louvain method (IPLM) and a fusion IPLM with the hashtables based Louvain method (FIPLM), which are based on a novel graph partition algorithm. The graph partition algorithm divides the graph network into subgraphs called isolate sets, in which the vertices are relatively decoupled from others. We first describe the concepts and properties of the isolate sets. Then we propose an algorithm to divide the graph network into isolate sets, which enjoys the same computation complexity as the breadth-first search. Next, we propose IPLM, which can efficiently calculate and update vertices information in parallel without latency or community swap. Furthermore, we achieve further acceleration by FIPLM, which maintains a high quality of community detection with a faster speedup than IPLM. The two methods are for shared-memory architecture, and we implement our methods on an 8-core PC; the experiments show that IPLM achieves a maximum speedup of 4.62x and outputs higher modularity (maximum 4.76%) than the serial Louvain method on 14 of 18 graphs. Moreover, FIPLM achieves a maximum speedup of 7.26x.
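For context, the per-vertex work that any parallel Louvain variant has to coordinate is the standard local-move step of the serial Louvain method, sketched below. This is the textbook computation only, under the usual modularity-gain formula; it is not the isolate-set partitioning or the IPLM/FIPLM update rules from the paper.

# Hedged sketch of the standard (serial) Louvain local-move step.
from collections import defaultdict

def best_move(v, graph, community, degree, sigma_tot, m):
    """graph: {v: {u: weight}}; community: {v: community id};
    degree: {v: weighted degree}; sigma_tot: {community id: total weighted degree,
    with v assumed already removed from its own community}; m: total edge weight."""
    # Weight of the links from v to each neighbouring community.
    links = defaultdict(float)
    for u, w in graph[v].items():
        if u != v:
            links[community[u]] += w

    best_c, best_gain = community[v], 0.0
    for c, k_in in links.items():
        # Standard modularity gain (up to a constant factor) of moving v into community c.
        gain = k_in / m - sigma_tot[c] * degree[v] / (2.0 * m * m)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c

In the serial method this step is applied vertex by vertex; the point of partitioning the graph into relatively decoupled isolate sets is to let many such moves proceed concurrently without the synchronization latency or community swaps described above.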
All-reduce is a widely used communication technique for distributed and parallel applications, typically implemented using either a tree-based or ring-based scheme. Each of these approaches has its own limitations: tree-based schemes struggle with efficiently exchanging large messages, while ring-based solutions assume constant communication throughput, an unrealistic expectation in modern network communication infrastructures. We present FMCC-RT, an all-reduce approach that combines the advantages of tree- and ring-based implementations while mitigating their drawbacks. FMCC-RT dynamically switches between tree- and ring-based implementations depending on the size of the message being processed. It utilizes an analytical model to assess the impact of message sizes on the achieved throughput, enabling the derivation of optimal work partitioning parameters. Furthermore, FMCC-RT is designed with an Open MPI-compatible API, requiring no modification to user code. We evaluated FMCC-RT through micro-benchmarks and real-world application tests. Experimental results show that FMCC-RT outperforms state-of-the-art tree- and ring-based methods, achieving speedups of up to 5.6×.
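The trade-off FMCC-RT exploits can be sketched with the usual alpha-beta (latency-bandwidth) cost model: a tree all-reduce pays few latency terms but sends the full message on every round, while a ring all-reduce is bandwidth-optimal but pays 2(p-1) latency terms. The formulas below are the textbook estimates and the switching rule is an illustrative assumption, not FMCC-RT's actual analytical model.

# Hedged sketch: alpha-beta cost model for choosing between tree- and ring-based
# all-reduce by message size. alpha = per-message latency (s), beta = per-byte
# transfer time (s/byte), p = number of processes, n = message size in bytes.
import math

def tree_allreduce_cost(n, p, alpha, beta):
    # Reduce up a binomial tree, then broadcast down: ~2*log2(p) rounds, full message each.
    return 2 * math.ceil(math.log2(p)) * (alpha + n * beta)

def ring_allreduce_cost(n, p, alpha, beta):
    # Reduce-scatter + all-gather: 2*(p-1) steps, each moving n/p bytes.
    return 2 * (p - 1) * (alpha + (n / p) * beta)

def choose_scheme(n, p, alpha, beta):
    # Illustrative switching rule: pick whichever model predicts the lower cost.
    tree = tree_allreduce_cost(n, p, alpha, beta)
    ring = ring_allreduce_cost(n, p, alpha, beta)
    return "tree" if tree <= ring else "ring"

# Small messages favour the tree (fewer latency terms); large messages favour the ring
# (bandwidth-optimal). Example with alpha=5e-6 s, beta=1e-10 s/B, p=16:
print(choose_scheme(8 * 1024, 16, 5e-6, 1e-10))            # -> tree
print(choose_scheme(256 * 1024 * 1024, 16, 5e-6, 1e-10))   # -> ring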
Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distincti...
The multiplier is an important component of the processor's computing unit. Multiplication, multiply-add, and multiply-subtract operations are widely used in various signal processing algo...
The proliferation of massive datasets has led to significant interest in distributed algorithms for solving large-scale machine learning problems. However, the communication overhead is a major bottleneck that hampers the scalability of distributed machine learning systems. In this paper, we design two communication-efficient algorithms for distributed learning problems. The first one is named EF-SIGNGD, in which we use the 1-bit (sign-based) gradient quantization method to save the communication costs. Moreover, the error feedback technique, i.e., incorporating the error made by the compression operator into the next step, is employed for the convergence guarantee. The second algorithm is called LE-SIGNGD, in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect the gradients with small changes and reuse the outdated gradients. LE-SIGNGD saves communication costs both in transmitted bits and communication rounds. In addition, we show that LE-SIGNGD is convergent under some mild assumptions. The effectiveness of the two proposed algorithms is demonstrated through experiments on both real and synthetic data.
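The compression step in EF-SIGNGD follows the widely used error-feedback sign-compression pattern; the sketch below shows that generic pattern from a single worker's perspective and is an assumption for illustration, not the authors' exact update rule or the LE-SIGNGD lazy-aggregation test.

# Hedged sketch of 1-bit (sign-based) gradient compression with error feedback:
# the residual left over by compression is added back into the next step's gradient.
import numpy as np

def ef_sign_compress(grad, error):
    corrected = grad + error                      # fold the previous compression error back in
    scale = np.mean(np.abs(corrected))            # common magnitude-preserving scaling (assumption)
    compressed = scale * np.sign(corrected)       # 1 bit per coordinate plus one scalar
    new_error = corrected - compressed            # residual carried to the next iteration
    return compressed, new_error

# Toy usage: one worker over a few steps.
rng = np.random.default_rng(0)
error = np.zeros(5)
for step in range(3):
    grad = rng.normal(size=5)
    update, error = ef_sign_compress(grad, error)
    print(step, update)

Only the sign vector and one scale per gradient need to be transmitted, which is where the savings in bits come from; the lazy aggregation rule of LE-SIGNGD additionally skips rounds whose gradients change little.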
Anomalies in time series appear consecutively, forming anomaly segments. Applying the classical point-based evaluation metrics to evaluate the detection performance of segments leads to considerable underestimation, s...
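The underestimation mentioned above can be seen in a toy case: a detector that fires on only a few points inside a long anomaly segment scores poorly under point-wise recall even though the segment has effectively been found. The naive segment-level count below is a generic illustration, not the evaluation metric proposed in this work.

# Hedged toy example: point-wise recall vs. a naive segment-level recall.
import numpy as np

labels = np.zeros(100, dtype=int)
labels[40:60] = 1                 # one ground-truth anomaly segment of length 20
preds = np.zeros(100, dtype=int)
preds[44:46] = 1                  # detector flags only 2 points inside the segment

point_recall = (preds & labels).sum() / labels.sum()   # 2/20 = 0.10
segment_hit = (preds & labels).sum() > 0               # the segment was detected at least once
print(f"point-wise recall = {point_recall:.2f}, segment-level recall = {float(segment_hit):.2f}")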
The self-attention mechanism is the core component of Transformer, which provides a powerful ability to understand the sequence context. However, the self-attention mechanism also suffers from a large amount of redund...
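For reference, the scaled dot-product self-attention referred to above can be written compactly as in the sketch below; this is the textbook single-head formulation, independent of whatever redundancy-reduction technique the paper proposes.

# Hedged sketch of standard scaled dot-product self-attention (single head, no mask).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence dimension
    return weights @ V                               # each output is a weighted mix of values

# Toy usage: 4 tokens with dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, *W).shape)                   # (4, 8)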
Multivariate time series anomaly detection (MTAD) poses a challenge due to temporal and feature dependencies. The critical aspects of enhancing the detection performance lie in accurately capturing the dependencies be...
Large-scale models have demonstrated outstanding performance across various downstream tasks. Pipeline parallelism is essential for fine-tuning large models on commodity GPU servers, as it plays a crucial role in maki...