Large-scale deep learning models are trained distributedly due to memory and computing resource limitations. Few existing strategy generation approaches take optimal memory minimization as the objective. To fill in this gap, we propose a novel algorithm that generates optimal parallelism strategies with the constraint of minimal memory redundancy. We propose a novel redundant memory cost model to calculate the memory overhead of each operator in a given parallel strategy. To generate the optimal parallelism strategy, we formulate the parallelism strategy search problem into an integer linear programming problem and use an efficient solver to find minimal-memory intra-operator parallelism strategies. Furthermore, the proposed algorithm has been extended and implemented in a multi-dimensional parallel training framework and is characterized by high throughput and minimal memory redundancy. Experimental results demonstrate that our approach achieves memory savings of up to 67% compared to the latest Megatron-LM strategies; in contrast, the gap between the throughput of our approach and its counterparts is not large.
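To make the integer-linear-programming formulation above concrete, the following is a minimal sketch of a strategy-selection ILP using the PuLP solver. The operator names, candidate strategies, and memory costs are invented for illustration and do not reproduce the paper's redundant memory cost model.

```python
# Illustrative only: a toy ILP in the spirit of minimal-memory intra-operator
# parallelism search. Candidate strategies and costs below are made up.
import pulp

# Hypothetical per-operator candidate strategies and redundant memory costs (MB).
candidates = {
    "matmul_1": {"data_parallel": 120, "tensor_parallel": 40},
    "matmul_2": {"data_parallel": 200, "tensor_parallel": 60},
}

prob = pulp.LpProblem("min_memory_parallelism", pulp.LpMinimize)

# One binary decision variable per (operator, strategy) pair.
choose = {
    (op, s): pulp.LpVariable(f"x_{op}_{s}", cat=pulp.LpBinary)
    for op, strategies in candidates.items()
    for s in strategies
}

# Objective: minimize total redundant memory across all operators.
prob += pulp.lpSum(cost * choose[(op, s)]
                   for op, strategies in candidates.items()
                   for s, cost in strategies.items())

# Constraint: each operator must pick exactly one strategy.
for op, strategies in candidates.items():
    prob += pulp.lpSum(choose[(op, s)] for s in strategies) == 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (op, s), var in choose.items():
    if var.value() is not None and var.value() > 0.5:
        print(op, "->", s)
```

A real system would add constraints coupling neighboring operators (communication cost, device counts); the sketch only shows how strategy choice maps onto binary variables and a memory objective.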
Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various tasks. However, a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units (GPUs). In this paper, we present a comprehensive survey focused on training large-scale models with limited GPU memory. Our exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process, namely model parameters, model states, and model activations. Based on this analysis, we present an in-depth overview of the relevant research work that addresses these aspects individually. Finally, the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models, emphasizing the necessity for continued research and innovation in this area. This survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.
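As a rough illustration of the three memory categories the survey enumerates (parameters, model states, activations), the sketch below estimates training memory for mixed-precision Adam. The 16-bytes-per-parameter breakdown follows common practice (fp16 parameters and gradients plus fp32 master weights and two fp32 Adam states); the activation figure is a placeholder, since activation memory depends heavily on architecture, sequence length, and batch size.

```python
# Back-of-envelope estimate of GPU training memory by category.
def training_memory_gb(num_params: float, activation_gb: float) -> dict:
    bytes_per_param = {
        "fp16 parameters": 2,
        "fp16 gradients": 2,
        "fp32 master parameters": 4,
        "fp32 Adam momentum": 4,
        "fp32 Adam variance": 4,
    }
    states_gb = {k: num_params * b / 2**30 for k, b in bytes_per_param.items()}
    return {**states_gb,
            "activations (assumed)": activation_gb,
            "total": sum(states_gb.values()) + activation_gb}

# Example: a 7B-parameter model needs roughly 104 GB for parameters and model
# states alone, before activations -- already beyond a single 80 GB GPU.
for k, v in training_memory_gb(7e9, activation_gb=20).items():
    print(f"{k}: {v:.1f} GB")
```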
In computational fluid dynamics (CFD), mesh-smoothing methods are widely used to refine the mesh quality for achieving high-precision numerical simulations. Typically, optimization-based smoothing is used for high-quality mesh smoothing, but it incurs significant computational costs. Previous works have improved its smoothing efficiency by adopting supervised learning to learn smoothing methods from high-quality meshes. However, they pose difficulties in smoothing mesh nodes with varying degrees and require data augmentation to address the node input sequence problem. In addition, the required labeled high-quality meshes further limit the applicability of the proposed methods. In this paper, we present graph-based smoothing mesh net (GMSNet), a lightweight neural network model for intelligent mesh smoothing. GMSNet adopts graph neural networks (GNNs) to extract features of a node's neighbors and outputs the optimal node position. When smoothing, we also introduce a fault-tolerance mechanism to prevent GMSNet from generating negative volume elements. As a lightweight model, GMSNet can effectively smooth mesh nodes with varying degrees and remains unaffected by the order of input data. A novel loss function, MetricLoss, is developed to eliminate the need for high-quality meshes and provides stable and rapid convergence during training. We compare GMSNet with commonly used mesh-smoothing methods on two-dimensional (2D) triangle meshes. The results show that GMSNet achieves outstanding mesh-smoothing performance with 5% of the model parameters of the previous model, while offering a speedup of 13.56 times over optimization-based smoothing.
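The following is a minimal sketch (not the paper's GMSNet architecture) of the two ideas the abstract highlights: a permutation-invariant network that predicts a node's new position from its one-ring neighbors, and a fault-tolerance check that rejects updates producing inverted elements. All layer sizes and helper names are assumptions for illustration.

```python
# Sketch of degree-agnostic, order-invariant node smoothing with a rollback
# check against inverted (negative-area) triangles.
import torch
import torch.nn as nn

class TinySmoother(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        self.decode = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, 2))

    def forward(self, node_xy: torch.Tensor, neigh_xy: torch.Tensor) -> torch.Tensor:
        # Mean over neighbours keeps the prediction independent of input order
        # and of the node degree.
        rel = neigh_xy - node_xy              # (num_neighbours, 2) relative coords
        feat = self.encode(rel).mean(dim=0)   # permutation-invariant aggregation
        return node_xy + self.decode(feat)    # predicted new node position

def signed_area(a, b, c):
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))

def safe_update(node_xy, new_xy, incident_edges):
    # Fault tolerance: keep the old position if any incident triangle would flip.
    for a, b in incident_edges:   # (a, b) = the other two vertices of a triangle
        if signed_area(new_xy, a, b) <= 0:
            return node_xy
    return new_xy
```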
Large-scale Language Models (LLMs) have achieved significant breakthroughs in Natural Language Processing (NLP), driven by the pre-training and fine-tuning paradigm. While this approach allows models to specialize in specific tasks with reduced training costs, the substantial memory requirements during fine-tuning present a barrier to broader adoption. Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), and parameter quantization methods have emerged as solutions to address these challenges by optimizing memory usage and computational efficiency. Among these, QLoRA, which combines PEFT and quantization, has demonstrated notable success in reducing memory footprints during fine-tuning, prompting the development of various QLoRA variants. Despite these advancements, the quantitative impact of key variables on the fine-tuning performance of quantized LLMs remains underexplored. This study presents a comprehensive analysis of these key variables, focusing on their influence across different layer types and depths within LLM architectures. Our investigation uncovers several critical findings: (1) larger layers, such as MLP layers, can maintain performance despite reductions in adapter rank, while smaller layers, like self-attention layers, are more sensitive to such changes; (2) the effectiveness of balancing factors depends more on specific values rather than layer type or depth; (3) in quantization-aware fine-tuning, larger layers can effectively utilize smaller adapters, whereas smaller layers struggle to do so. These insights suggest that layer type is a more significant determinant of fine-tuning success than layer depth when optimizing quantized models. Moreover, for the same reduction in trainable parameters, shrinking the trainable parameters of a larger layer is more effective in preserving fine-tuning accuracy than shrinking those of a smaller one. This study provides valuable guidance for more efficient fine-tuning strategies and opens avenues for further research into optimizing LLM...
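To illustrate the variable under study, here is a hedged sketch of a LoRA adapter whose rank is chosen per layer type, mirroring the abstract's finding that large MLP projections tolerate smaller adapter ranks than attention projections. The ranks and dimensions are arbitrary stand-ins, and this is not the paper's QLoRA implementation (the base weights would be quantized there; here they are simply frozen).

```python
# Minimal LoRA adapter with per-layer-type rank selection.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # frozen base (quantized in QLoRA)
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = Wx + (alpha/rank) * B(Ax); only A and B are trainable.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Hypothetical dimensions: a "large" MLP projection with a small rank versus a
# "small" attention projection with a larger rank.
mlp_adapter  = LoRALinear(nn.Linear(512, 2048), rank=4)
attn_adapter = LoRALinear(nn.Linear(512, 512), rank=16)
trainable = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(trainable(mlp_adapter), trainable(attn_adapter))
```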
All-reduce is a widely used communication technique for distributed and parallel applications, typically implemented using either a tree-based or ring-based scheme. Each of these approaches has its own limitations: tree-based schemes struggle to exchange large messages efficiently, while ring-based solutions assume constant communication throughput, an unrealistic expectation in modern network communication infrastructures. We present FMCC-RT, an all-reduce approach that combines the advantages of tree- and ring-based implementations while mitigating their drawbacks. FMCC-RT dynamically switches between tree- and ring-based implementations depending on the size of the message being processed. It utilizes an analytical model to assess the impact of message sizes on the achieved throughput, enabling the derivation of optimal work partitioning parameters. Furthermore, FMCC-RT is designed with an Open MPI-compatible API, requiring no modification to user code. We evaluated FMCC-RT through micro-benchmarks and real-world application tests. Experimental results show that FMCC-RT outperforms state-of-the-art tree- and ring-based methods, achieving speedups of up to 5.6×.
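A toy dispatcher in the spirit of this size-based switching is sketched below. The alpha-beta cost model and the parameter values are textbook approximations chosen for illustration, not the paper's analytical model or its Open MPI-compatible implementation.

```python
# Choose between tree and ring all-reduce from a simple latency/bandwidth model:
# small, latency-bound messages favour a tree (log2(p) rounds on the full
# message); large, bandwidth-bound messages favour a ring (chunks of size msg/p).
import math

def allreduce_cost(msg_bytes, p, alpha, beta, scheme):
    """alpha: per-message latency (s); beta: seconds per byte; p: process count."""
    if scheme == "tree":
        # reduce up the tree, then broadcast down: ~2*log2(p) rounds on the full message
        return 2 * math.ceil(math.log2(p)) * (alpha + beta * msg_bytes)
    # ring: reduce-scatter + all-gather, 2*(p-1) rounds on msg/p chunks
    return 2 * (p - 1) * (alpha + beta * msg_bytes / p)

def pick_scheme(msg_bytes, p, alpha=5e-6, beta=1e-10):
    tree = allreduce_cost(msg_bytes, p, alpha, beta, "tree")
    ring = allreduce_cost(msg_bytes, p, alpha, beta, "ring")
    return "tree" if tree <= ring else "ring"

print(pick_scheme(4 * 1024, p=16))        # small message -> tree
print(pick_scheme(256 * 1024**2, p=16))   # large message -> ring
```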
Self-supervised anomaly detection (AD) methods define transformations and surrogate tasks to deeply learn data "normality", presenting superior performance. Different from most existing work designed for ima...
Self-supervised time series anomaly detection (TSAD) demonstrates remarkable performance improvement by extracting high-level data semantics through proxy tasks. Nonetheless, most existing self-supervised TSAD techniq...
Sparse Matrix-Dense Matrix Multiplication (SpMM) is a crucial kernel used in a wide range of fields including machine learning and linear algebra solvers. Thus, enhancing the performance of SpMM is essential. The unev...
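For reference, a plain CSR-based SpMM routine is sketched below; it only defines what the kernel computes and does not reflect the GPU optimizations or load-balancing issues the abstract alludes to.

```python
# Reference SpMM: C = A @ B with A in CSR format and B dense.
import numpy as np

def spmm_csr(indptr, indices, data, B):
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            C[i, :] += data[k] * B[indices[k], :]   # accumulate a scaled row of B
    return C

# Tiny example: a 3x3 sparse matrix times a 3x2 dense matrix.
indptr  = np.array([0, 2, 3, 4])
indices = np.array([0, 2, 1, 0])
data    = np.array([1.0, 2.0, 3.0, 4.0])
B = np.arange(6, dtype=float).reshape(3, 2)
print(spmm_csr(indptr, indices, data, B))
```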
Sparse triangular solve (SpTRSV) is a vital component in various scientific applications, and numerous GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSV is currently the mainstream algorithm ...
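As a reference point, here is a sequential CSR forward-substitution sketch of SpTRSV; it only defines the kernel, and the synchronization-free GPU scheduling the abstract refers to is not reproduced.

```python
# Reference SpTRSV: solve L x = b for lower-triangular L stored in CSR.
import numpy as np

def sptrsv_lower_csr(indptr, indices, data, b):
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        s, diag = b[i], None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j == i:
                diag = data[k]           # diagonal entry of row i
            else:
                s -= data[k] * x[j]      # subtract already-solved unknowns
        x[i] = s / diag
    return x

# L = [[2,0,0],[1,3,0],[0,4,5]], b = [2, 4, 14]  ->  x = [1, 1, 2]
indptr  = np.array([0, 1, 3, 5])
indices = np.array([0, 0, 1, 1, 2])
data    = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
print(sptrsv_lower_csr(indptr, indices, data, np.array([2.0, 4.0, 14.0])))
```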
Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distincti...