Large-scale models have gained significant attention in a wide range of fields, such as computer vision and natural language processing, due to their effectiveness across various tasks. However, a notable hurdle in training these large-scale models is the limited memory capacity of graphics processing units (GPUs). In this paper, we present a comprehensive survey focused on training large-scale models with limited GPU memory. Our exploration commences by scrutinizing the factors that contribute to the consumption of GPU memory during the training process, namely model parameters, model states, and model activations. Based on this analysis, we present an in-depth overview of the relevant research work that addresses these aspects. Finally, the paper concludes by presenting an outlook on the future of memory optimization in training large-scale language models, emphasizing the necessity for continued research and innovation in this area. This survey serves as a valuable resource for researchers and practitioners keen on comprehending the challenges and advancements in training large-scale language models with limited GPU memory.
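As a rough illustration of the three memory consumers named above, the sketch below estimates per-GPU training memory for a dense model trained with Adam in mixed precision. The bytes-per-parameter counts and the activation term are common rules of thumb assumed here for illustration, not figures taken from the survey.

# Hedged sketch: rough per-GPU memory estimate for mixed-precision training with Adam.
# Assumptions (not from the survey): fp16 weights + fp16 gradients (2 + 2 bytes/param),
# fp32 master weights + Adam first/second moments (4 + 4 + 4 bytes/param),
# and a crude activation term proportional to batch * seq_len * hidden * layers.

def training_memory_gb(num_params, layers, hidden, seq_len, micro_batch,
                       act_bytes_per_element=2, act_factor=16):
    GB = 1024 ** 3
    params_and_grads = num_params * (2 + 2)             # fp16 weights + fp16 gradients
    optimizer_states = num_params * (4 + 4 + 4)         # fp32 master copy + Adam m and v
    activations = (micro_batch * seq_len * hidden * layers
                   * act_factor * act_bytes_per_element)  # crude activation estimate
    return {
        "parameters+gradients": params_and_grads / GB,
        "optimizer states": optimizer_states / GB,
        "activations": activations / GB,
    }

# Example: a hypothetical 1.3B-parameter model.
print(training_memory_gb(1.3e9, layers=24, hidden=2048, seq_len=2048, micro_batch=4))

Even this coarse estimate shows that optimizer states and activations, not the weights themselves, dominate the footprint, which is why the survey treats model states and activations as separate targets for optimization.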
Community detection is a vital task in many fields, such as social networks and financial analysis, to name a few. The Louvain method, the main workhorse of community detection, is a popular heuristic method. To apply it to large-scale graph networks, researchers have proposed several parallel Louvain methods (PLMs), which suffer from two challenges: the latency in the information synchronization, and the community swap. To tackle these two challenges, we propose an isolate sets based parallel Louvain method (IPLM) and a fusion IPLM with the hashtables based Louvain method (FIPLM), which are based on a novel graph partition algorithm. The graph partition algorithm divides the graph network into subgraphs called isolate sets, in which the vertices are relatively decoupled from others. We first describe the concepts and properties of the isolate sets. Then we propose an algorithm to divide the graph network into isolate sets, which enjoys the same computation complexity as the breadth-first search. Next, we propose IPLM, which can efficiently calculate and update vertices information in parallel without latency or community swap. Furthermore, we achieve further acceleration by FIPLM, which maintains a high quality of community detection with a faster speedup than IPLM. The two methods are for shared-memory architecture, and we implement our methods on an 8-core PC; the experiments show that IPLM achieves a maximum speedup of 4.62x and outputs higher modularity (maximum 4.76%) than the serial Louvain method on 14 of 18 graphs. Moreover, FIPLM achieves a maximum speedup of 7.26x.
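For context, the per-vertex work that any parallel Louvain variant has to coordinate is the standard local-move step of the serial Louvain method, sketched below. This is the textbook computation only, under the usual modularity-gain formula; it is not the isolate-set partitioning or the IPLM/FIPLM update rules from the paper.

# Hedged sketch of the standard (serial) Louvain local-move step.
from collections import defaultdict

def best_move(v, graph, community, degree, sigma_tot, m):
    """graph: {v: {u: weight}}; community: {v: community id};
    degree: {v: weighted degree}; sigma_tot: {community id: total weighted degree,
    with v assumed already removed from its own community}; m: total edge weight."""
    # Weight of the links from v to each neighbouring community.
    links = defaultdict(float)
    for u, w in graph[v].items():
        if u != v:
            links[community[u]] += w

    best_c, best_gain = community[v], 0.0
    for c, k_in in links.items():
        # Standard modularity gain (up to a constant factor) of moving v into community c.
        gain = k_in / m - sigma_tot[c] * degree[v] / (2.0 * m * m)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c

In the serial method this step is applied vertex by vertex; the point of partitioning the graph into relatively decoupled isolate sets is to let many such moves proceed concurrently without the synchronization latency or community swaps described above.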
All-reduce is a widely used communication technique for distributed and parallel applications, typically implemented using either a tree-based or ring-based scheme. Each of these approaches has its own limitations: tree-based schemes struggle with efficiently exchanging large messages, while ring-based solutions assume constant communication throughput, an unrealistic expectation in modern network communication infrastructures. We present FMCC-RT, an all-reduce approach that combines the advantages of tree- and ring-based implementations while mitigating their drawbacks. FMCC-RT dynamically switches between tree- and ring-based implementations depending on the size of the message being processed. It utilizes an analytical model to assess the impact of message sizes on the achieved throughput, enabling the derivation of optimal work partitioning parameters. Furthermore, FMCC-RT is designed with an Open MPI-compatible API, requiring no modification to user code. We evaluated FMCC-RT through micro-benchmarks and real-world application tests. Experimental results show that FMCC-RT outperforms state-of-the-art tree- and ring-based methods, achieving speedups of up to 5.6×.
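The trade-off FMCC-RT exploits can be sketched with the usual alpha-beta (latency-bandwidth) cost model: a tree all-reduce pays few latency terms but sends the full message on every round, while a ring all-reduce is bandwidth-optimal but pays 2(p-1) latency terms. The formulas below are the textbook estimates and the switching rule is an illustrative assumption, not FMCC-RT's actual analytical model.

# Hedged sketch: alpha-beta cost model for choosing between tree- and ring-based
# all-reduce by message size. alpha = per-message latency (s), beta = per-byte
# transfer time (s/byte), p = number of processes, n = message size in bytes.
import math

def tree_allreduce_cost(n, p, alpha, beta):
    # Reduce up a binomial tree, then broadcast down: ~2*log2(p) rounds, full message each.
    return 2 * math.ceil(math.log2(p)) * (alpha + n * beta)

def ring_allreduce_cost(n, p, alpha, beta):
    # Reduce-scatter + all-gather: 2*(p-1) steps, each moving n/p bytes.
    return 2 * (p - 1) * (alpha + (n / p) * beta)

def choose_scheme(n, p, alpha, beta):
    # Illustrative switching rule: pick whichever model predicts the lower cost.
    tree = tree_allreduce_cost(n, p, alpha, beta)
    ring = ring_allreduce_cost(n, p, alpha, beta)
    return "tree" if tree <= ring else "ring"

# Small messages favour the tree (fewer latency terms); large messages favour the ring
# (bandwidth-optimal). Example with alpha=5e-6 s, beta=1e-10 s/B, p=16:
print(choose_scheme(8 * 1024, 16, 5e-6, 1e-10))            # -> tree
print(choose_scheme(256 * 1024 * 1024, 16, 5e-6, 1e-10))   # -> ring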
Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distincti...
The multiplier is an important component of the processor's computing unit. Multiplication, multiply-add, and multiply-subtract operations are widely used in various signal processing algo...
The proliferation of massive datasets has led to significant interest in distributed algorithms for solving large-scale machine learning problems. However, the communication overhead is a major bottleneck that hampers the scalability of distributed machine learning systems. In this paper, we design two communication-efficient algorithms for distributed learning problems. The first one is named EF-SIGNGD, in which we use the 1-bit (sign-based) gradient quantization method to save the communication costs. Moreover, the error feedback technique, i.e., incorporating the error made by the compression operator into the next step, is employed for the convergence guarantee. The second algorithm is called LE-SIGNGD, in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect the gradients with small changes and reuse the outdated gradients. LE-SIGNGD saves communication costs both in transmitted bits and communication rounds. In addition, we show that LE-SIGNGD is convergent under some mild assumptions. The effectiveness of the two proposed algorithms is demonstrated through experiments on both real and synthetic data.
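The compression step in EF-SIGNGD follows the widely used error-feedback sign-compression pattern; the sketch below shows that generic pattern from a single worker's perspective and is an assumption for illustration, not the authors' exact update rule or the LE-SIGNGD lazy-aggregation test.

# Hedged sketch of 1-bit (sign-based) gradient compression with error feedback:
# the residual left over by compression is added back into the next step's gradient.
import numpy as np

def ef_sign_compress(grad, error):
    corrected = grad + error                      # fold the previous compression error back in
    scale = np.mean(np.abs(corrected))            # common magnitude-preserving scaling (assumption)
    compressed = scale * np.sign(corrected)       # 1 bit per coordinate plus one scalar
    new_error = corrected - compressed            # residual carried to the next iteration
    return compressed, new_error

# Toy usage: one worker over a few steps.
rng = np.random.default_rng(0)
error = np.zeros(5)
for step in range(3):
    grad = rng.normal(size=5)
    update, error = ef_sign_compress(grad, error)
    print(step, update)

Only the sign vector and one scale per gradient need to be transmitted, which is where the savings in bits come from; the lazy aggregation rule of LE-SIGNGD additionally skips rounds whose gradients change little.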
Anomalies in time series appear consecutively, forming anomaly segments. Applying the classical point-based evaluation metrics to evaluate the detection performance of segments leads to considerable underestimation, s...
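The underestimation mentioned above can be seen in a toy case: a detector that fires on only a few points inside a long anomaly segment scores poorly under point-wise recall even though the segment has effectively been found. The naive segment-level count below is a generic illustration, not the evaluation metric proposed in this work.

# Hedged toy example: point-wise recall vs. a naive segment-level recall.
import numpy as np

labels = np.zeros(100, dtype=int)
labels[40:60] = 1                 # one ground-truth anomaly segment of length 20
preds = np.zeros(100, dtype=int)
preds[44:46] = 1                  # detector flags only 2 points inside the segment

point_recall = (preds & labels).sum() / labels.sum()   # 2/20 = 0.10
segment_hit = (preds & labels).sum() > 0               # the segment was detected at least once
print(f"point-wise recall = {point_recall:.2f}, segment-level recall = {float(segment_hit):.2f}")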
The self-attention mechanism is the core component of Transformer, which provides a powerful ability to understand the sequence context. However, the self-attention mechanism also suffers from a large amount of redund...
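For reference, the scaled dot-product self-attention referred to above can be written compactly as in the sketch below; this is the textbook single-head formulation, independent of whatever redundancy-reduction technique the paper proposes.

# Hedged sketch of standard scaled dot-product self-attention (single head, no mask).
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence dimension
    return weights @ V                               # each output is a weighted mix of values

# Toy usage: 4 tokens with dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, *W).shape)                   # (4, 8)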
Multivariate time series anomaly detection (MTAD) poses a challenge due to temporal and feature dependencies. The critical aspects of enhancing the detection performance lie in accurately capturing the dependencies be...
Large-scale models have demonstrated outstanding performance across various downstream tasks. Pipeline parallelism is essential for fine-tuning large models on commodity GPU servers, as it plays a crucial role in maki...