检索结果-内蒙古大学图书馆

Automatic parallelism strategy generation with minimalmemory redundancy

Frontiers of Information Technology & Electronic Engineering 2025年第1期26卷 109-118页

作者： Yanqi SHI Peng LIANG Hao ZHENG Linbo QIAO Dongsheng LI National Key Laboratory of Parallel and Distributed Computing National University of Defense TechnologyChangsha 410000China

Large-scale deep learning models are trained distributedly due to memory and computing resource *** existing strategy generation approaches take optimal memory minimization as the *** fill in this gap,we propose a novel algorithm that generates optimal parallelism strategies with the constraint of minimal memory *** propose a novel redundant memory cost model to calculate the memory overhead of each operator in a given parallel *** generate the optimal parallelism strategy,we formulate the parallelism strategy search problem into an integer linear programming problem and use an efficient solver to find minimal-memory intra-operator parallelism ***,the proposed algorithm has been extended and implemented in a multi-dimensional parallel training framework and is characterized by high throughput and minimal memory *** results demonstrate that our approach achieves memory savings of up to 67%compared to the latest Megatron-LM strategies;in contrast,the gap between the throughput of our approach and its counterparts is not large.

关键词： Deep learning Automatic parallelism Minimal memory redundancy

来源：评论

学校读者我要写书评

暂无评论

Training large-scale language models with limited GPU memory:a survey

引用

Frontiers of Information Technology & Electronic Engineering 2025年第3期26卷 309-331页

作者： Yu TANG Linbo QIAO Lujia YIN Peng LIANG Ao SHEN Zhilin YANG Lizhi ZHANG Dongsheng LI National Key Laboratory of Parallel and Distributed Computing College of ComputerNational University of Defense TechnologyChangsha 410073China

Large-scale models have gained significant attention in a wide range of fields,such as computer vision and natural language processing,due to their effectiveness across various ***,a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units(GPUs).In this paper,we present a comprehensive survey focused on training large-scale models with limited GPU *** exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process,namely model parameters,model states,and model *** this analysis,we present an in-depth overview of the relevant research work that addresses these aspects ***,the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models,emphasizing the necessity for continued research and innovation in this *** survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.

关键词： Training techniques Memory optimization Model parameters Model states Model activations

来源：评论

学校读者我要写书评

暂无评论

Optimizing Fine-Tuning in Quantized Language Models:An In-Depth Analysis of Key Variables

引用

Computers, Materials & Continua 2025年第1期82卷 307-325页

作者： Ao Shen Zhiquan Lai Dongsheng Li Xiaoyu Hu National Key Laboratory of Parallel and Distributed Computing National University of Defense TechnologyChangsha410073China Strategic Assessments and Consultation Institute Academy of Military ScienceBeijing100091China

Large-scale Language Models(LLMs)have achieved significant breakthroughs in Natural Language Processing(NLP),driven by the pre-training and fine-tuning *** this approach allows models to specialize in specific tasks with reduced training costs,the substantial memory requirements during fine-tuning present a barrier to broader ***-Efficient Fine-Tuning(PEFT)techniques,such as Low-Rank Adaptation(LoRA),and parameter quantization methods have emerged as solutions to address these challenges by optimizing memory usage and computational *** these,QLoRA,which combines PEFT and quantization,has demonstrated notable success in reducing memory footprints during fine-tuning,prompting the development of various QLoRA *** these advancements,the quantitative impact of key variables on the fine-tuning performance of quantized LLMs remains *** study presents a comprehensive analysis of these key variables,focusing on their influence across different layer types and depths within LLM *** investigation uncovers several critical findings:(1)Larger layers,such as MLP layers,can maintain performance despite reductions in adapter rank,while smaller layers,like self-attention layers,aremore sensitive to such changes;(2)The effectiveness of balancing factors depends more on specific values rather than layer type or depth;(3)In quantization-aware fine-tuning,larger layers can effectively utilize smaller adapters,whereas smaller layers struggle to do *** insights suggest that layer type is a more significant determinant of fine-tuning success than layer depth when optimizing quantized ***,for the same discount of trainable parameters,reducing the trainable parameters in a larger layer is more effective in preserving fine-tuning accuracy than in a smaller *** study provides valuable guidance for more efficient fine-tuning strategies and opens avenues for further research into optimizing LLM

关键词： Large-scale Language Model Parameter-Efficient Fine-Tuning parameter quantization key variable trainable parameters experimental analysis

来源：评论

学校读者我要写书评

暂无评论

Deep Time Series Anomaly Detection with Local Temporal Pattern Learning

Deep Time Series Anomaly Detection with Local Temporal Patte...

引用

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

作者： Li, Yizhou Wang, Yijie Xu, Hongzuo Zhou, Xiaohui National Key Laboratory of Parallel and Distributed Computing College of Computer Science and Technology National University of Defense Technology Changsha410073 China Beijing100091 China

ISBN: (纸本)9798350368741

Self-supervised time series anomaly detection (TSAD) demonstrates remarkable performance improvement by extracting high-level data semantics through proxy tasks. Nonetheless, most existing self-supervised TSAD techniques rely on manual- or neural-based transformations when designing proxy tasks, overlooking the intrinsic temporal patterns of time series. This paper proposes a local temporal pattern learning-based time series anomaly detection (LTPAD). LTPAD first generates sub-sequences. Pairwise sub-sequences naturally manifest proximity relationships along the time axis, and such correlations can be used to construct supervision and train neural networks to facilitate the learning of temporal patterns. Time intervals between two sub-sequences serve as labels for sub-sequence pairs. By classifying these labeled data pairs, our model captures the local temporal patterns of time series, thereby modeling the temporal pattern-aware "normality". Abnormal scores of testing data are acquired by evaluating their conformity to these learned patterns shared in training data. Extensive experiments show that LTPAD significantly outperforms state-of-the-art competitors. © 2025 IEEE.

关键词： Local Temporal Pattern Self-supervised Learning Time Series Anomaly Detection

来源：评论

学校读者我要写书评

暂无评论

LSSM-SpMM: A Long-Row Splitting and Short-Row Merging Approach for parallel SpMM on PEZY-SC3s 24th

LSSM-SpMM: A Long-Row Splitting and Short-Row Merging Appro...

引用

24th International Conference on Algorithms and Architectures for parallel Processing, ICA3PP 2024

作者： Cao, Ligang Wang, Qinglin Yang, Shun Xia, Rui Guo, Weihao Liu, Jie Laboratory of Digitizing Software for Frontier Equipment National University of Defense Technology Changsha410073 China National Key Laboratory of Parallel and Distributed Computing National University of Defense Technology Changsha410073 China

ISBN: (纸本)9789819615506

Sparse Matrix-Dense Matrix Multiplication (SpMM) is a crucial kernel used in a wide range of fields including machine learning and linear algebra solvers. Thus, enhancing the performance of SpMM is essential. The uneven distribution of non-zeros in sparse matrices and the tight data dependency between sparse and dense matrices make efficiently running SpMM on various hardware platforms challenging. To address these issues, optimisations are tailored according to the characteristics of the different hardware platforms. In this study, we propose a Long Row Split and Short Row Merge (LSSM) approach on the new MIMD computing platform PEZY-SC3s, utilising the standard Compressed Sparse Row (CSR) format to optimise SpMM. Specifically, LSSM divides the rows of the sparse matrix into short rows (rows with a number of non-zeros less than blockSize) and long rows (rows with a number of non-zeros greater than blockSize), applying splitting to the long rows and merging to the short rows to optimize their computations separately. Additionally, based on the hardware features of PEZY-SC3s, we employed the atomic cache for workload scheduling, SIMD instructions to accelerate computation, and Local Memory to reduce data read-write operations, addressing issues of poor data locality, workload imbalance, and vectorisation during SpMM execution. As the first study of SpMM on PEZY-SC3s, compared with BS-SpMM and RoDe-SpMM implemented on PEZY-SC3s, LSSM-SpMM offers up to 26.17× and 13.59× acceleration on the SuiteSparse and deep learning datasets, respectively, with geometric mean speedups of 1.56× and 1.71×. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

关键词： Deep learning

来源：评论

学校读者我要写书评

暂无评论

Comprehensive Deadlock Prevention for GPU Collective Communication 25

Comprehensive Deadlock Prevention for GPU Collective Communi...

引用

20th European Conference on Computer Systems, EuroSys 2025, co-located 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2025

作者： Pan, Lichen Liu, Juncheng Fu, Yongquan Yuan, Jinhui Zhang, Rongkai Li, Pengze Xiao, Zhen School of Computer Science Peking University China OneFlow Research China National Key Laboratory of Parallel and Distributed Computing College of Computer Science and Technology National University of Defense Technology China

ISBN: (纸本)9798400711961

distributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applications when multiple collectives circularly wait for each other. GPU collective deadlocks pose a significant challenge to the correct functioning and efficiency of distributed deep learning, and no general effective solutions are currently available. Only in specific scenarios, ad-hoc methods, making an application invoke collectives in a consistent order across GPUs, can be used to prevent circular collective dependency and deadlocks. This paper presents DFCCL, a novel GPU collective communication library that provides a comprehensive approach for GPU collective deadlock prevention while maintaining high performance. DFCCL achieves preemption for GPU collectives at the bottom library level, effectively preventing deadlocks even if applications cause circular collective dependency. DFCCL ensures high performance with its execution and scheduling methods for collectives. Experiments show that DFCCL effectively prevents GPU collective deadlocks in various situations. Moreover, extensive evaluations demonstrate that DFCCL delivers performance comparable to or superior to NCCL, the state-of-the-art collective communication library highly optimized for NVIDIA GPUs. © 2025 Copyright held by the owner/author(s).

关键词： Deep neural networks

来源：评论

学校读者我要写书评

暂无评论

A Counterfactual Ultrasound Anti-Interference Self-Supervised Network for B-mode Ultrasound Tongue Extraction

A Counterfactual Ultrasound Anti-Interference Self-Supervise...

引用

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

作者： Jia, Yan Cheng, Yuqing Xu, Kele Dou, Yong Qiao, Peng He, Zhouyu National Key Laboratory of Parallel and Distributed Computing College of Computer Science and Technology National University of Defense Technology Changsha China College of Systems Engineering National University of Defense Technology Changsha China

ISBN: (纸本)9798350368741

B-mode ultrasound tongue imaging is a non-invasive and real-time method for visualizing vocal tract deformation. However, accurately extracting the tongue's surface contour remains a significant challenge due to the low signal-to-noise ratio (SNR) and prevalent speckle noise in ultrasound images. Traditional supervised learning models often require large labeled datasets, which are labor-intensive to produce and susceptible to noise interference. To address these limitations, we present a novel Counterfactual Ultrasound Anti-Interference Self-Supervised Network (CUAI-SSN), which integrates self-supervised learning (SSL) with counterfactual data augmentation, progressively disentangles confounding factors, ensuring that the model generalizes well across varied ultrasound conditions. Our approach leverages causal reasoning to decouple noise from relevant features, enabling the model to learn robust representations that focus on essential tongue structures. By generating counterfactual image-label pairs, our method introduces alternative, noise-independent scenarios that enhance model training. Furthermore, we introduce attention mechanisms to enhance the network's ability to capture fine-grained details even in noisy conditions. Extensive experiments on real ultrasound tongue images demonstrate that CUAI-SSN outperforms existing methods, setting a new benchmark for automated contour extraction in ultrasound tongue imaging. Our code is publicly available at https://***/inexhaustible419/CounterfactualultrasoundAI. © 2025 IEEE.

关键词： contour extraction counterfactual inference self-supervised learning speech production ultrasound image

来源：评论

学校读者我要写书评

暂无评论

MARO: Enabling Full MPI Automatic Refactoring in DSL-Based Programming Framework 24th

MARO: Enabling Full MPI Automatic Refactoring in DSL-Based ...

引用

24th International Conference on Algorithms and Architectures for parallel Processing, ICA3PP 2024

作者： Lei, Tong Chen, Zongjing Che, Yonggang Xu, Chuanfu Laboratory of Digitizing Software for Frontier Equipment National University of Defense Technology Changsha410073 China National Key Laboratory of Parallel and Distributed Computing College of Computer Science and Technology National University of Defense Technology Changsha410073 China

ISBN: (纸本)9789819615445

Currently, the landscape of computer hardware architecture presents the characteristics of heterogeneity and diversity, prompting widespread attention to cross-platform portable parallel programming techniques. Most existing portable programming approaches adopt the "MPI+X" strategy, focusing mainly on node-level "X" parallelism, while inter-node parallelization still relies on manual MPI programming. OP2 is a domain-specific language-based portable parallel programming framework for unstructured mesh applications that supports the automatic generation of MPI parallel code. However, programmers must manually write pre-processing code during code refactoring, such as handling the mapping relationship between global and process-local variables and the initial distribution of process-local data. In this paper, we propose MARO (MPI Automatic Refactoring for OP2), an automatic refactoring method for generating pre-processing code targeting MPI code generation within the OP2 programming framework for unstructured mesh applications. This method automatically generates the pre-processing code required for MPI parallelism through the source-to-source translator, enabling OP2 to generate MPI parallel code without requiring users to write MPI-related code manually. We compare the automatically refactored OP2 applications using the proposed method with those manually refactored to validate its effectiveness. The proposed method enhances OP2’s automatic parallel code generation capability, enabling it to achieve fully automatic MPI parallelization. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

关键词： Automatic programming

来源：评论

学校读者我要写书评

暂无评论

Graph Structure Learning via Transfer Entropy for Multivariate Time Series Anomaly Detection

Graph Structure Learning via Transfer Entropy for Multivaria...

引用

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2025

作者： Liu, Mingyu Wang, Yijie Zhou, Xiaohui Wang, Yongjun National Key Laboratory of Parallel and Distributed Computing College of Computer Science and Technology National University of Defense Technology Changsha China College of Computer Science and Technology National University of Defense Technology Changsha China

ISBN: (纸本)9798350368741

Multivariate time series anomaly detection (MTAD) poses a challenge due to temporal and feature dependencies. The critical aspects of enhancing the detection performance lie in accurately capturing the dependencies between variables within the sliding window and effectively leveraging them. Existing studies rely on domain knowledge to pre-set the window size, and overlook the strength of dependencies while calculating direction based on variable similarity. This paper proposes GSLTE, a graph structure learning method for MTAD. GSLTE employs Fast Fourier Transform to conduct iterative segmentation of the whole series, selecting the dominant Fourier frequency as the window size for each subsequence within the minimum interval. GSLTE quantifies the direction and strength of the dependencies based on variable-lag transfer entropy which is achieved through Dynamic Time Warping method to learn asymmetric links between variables. Extensive experiments show that GNN-based MTAD methods applying GSLTE can further improve anomaly detection performance while outperforming state-of-the-art competitors. © 2025 IEEE.

关键词： Anomaly detection Graph structure learning Multivariate time series Window size selection

来源：评论

学校读者我要写书评

暂无评论

Partial Order-centered Hyperbolic Representation Learning for Few-shot Relation Extraction 31

Partial Order-centered Hyperbolic Representation Learning fo...

引用

31st International Conference on Computational Linguistics, COLING 2025

作者： Hu, Biao Huang, Zhen Hu, Minghao Yang, Pinglv Qiao, Peng Dou, Yong Wang, Zhilin National Key Laboratory of Parallel and Distributed Computing National University of Defense Technology China Center of Information Research Academy of Military Science China College of Meteorology and Oceanology National University of Defense Technology China

ISBN: (纸本)9798891761964

Prototype network-based methods have made substantial progress in few-shot relation extraction (FSRE) by enhancing relation prototypes with relation descriptions. However, the distribution of relations and instances in distinct representation spaces isolates the constraints of relations on instances, making relation prototypes biased. In this paper, we propose an end-to-end partial order-centered hyperbolic representation learning (PO-HRL) framework, which imposes the constraints of relations on instances by modeling partial order in hyperbolic space, so as to effectively learn the distribution of instance representations. Specifically, we develop the hyperbolic supervised contrastive learning based on Lorentzian cosine similarity to align representations of relations and instances, and model the partial order by constraining instances to reside within the Lorentzian entailment cone of their respective relation. Experiments on three benchmark datasets show that PO-HRL outperforms the strong baselines, especially in 1-shot settings lacking relation descriptions. © 2025 Association for Computational Linguistics.

关键词： Contrastive Learning

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：