ISBN (Print): 9798350330946; 9798350330953
In Artificial Intelligence (AI), training expansive models with billions of parameters necessitates substantial computational resources. This requirement has led to the adoption of parallel computing frameworks. However, these frameworks often confront node performance imbalances due to disparities in computational capabilities and network conditions. To address this issue, we introduce the BalanceNet Orchestrator (BNO), a dynamic task allocation algorithm designed to equilibrate workloads in parallel training environments. BalanceNet Orchestrator assesses and adjusts to node-specific performance in real time, facilitating optimal workload distribution and resource utilization. This method significantly enhances training efficiency and accelerates model convergence, presenting an efficient approach for training large-scale AI models within parallel training architectures.
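The abstract describes BNO as measuring node-specific performance in real time and redistributing work accordingly. The Python sketch below illustrates one way such performance-proportional assignment could look; the Node class, the rebalance function, and the throughput-based heuristic are assumptions for illustration only, not the paper's actual algorithm.

    # Minimal sketch of performance-proportional work assignment, assuming each
    # node reports a recent throughput measurement (samples/second). All names
    # here are illustrative and not taken from the BNO paper.
    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        throughput: float  # measured samples/second over the last interval
        assigned: int = 0  # samples to process in the next step

    def rebalance(nodes: list[Node], global_batch: int) -> None:
        """Split the next global batch in proportion to measured throughput."""
        total = sum(n.throughput for n in nodes)
        remaining = global_batch
        for n in nodes[:-1]:
            n.assigned = round(global_batch * n.throughput / total)
            remaining -= n.assigned
        nodes[-1].assigned = remaining  # last node absorbs the rounding remainder

    nodes = [Node("gpu-0", 920.0), Node("gpu-1", 480.0), Node("gpu-2", 610.0)]
    rebalance(nodes, global_batch=1024)
    for n in nodes:
        print(f"{n.name}: {n.assigned} samples")

Re-running such a rebalance step with fresh throughput measurements is what makes the assignment dynamic: slower or congested nodes automatically receive smaller shares of the next batch.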
Modern society places ever higher demands on the power system, and its reliability and intelligence have received widespread attention. In this paper, data transmission in the digitalized...
ISBN (Print): 9798350383225
Deep Learning (DL) model sizes are increasing at a rapid pace, as larger models typically offer better statistical performance. Modern Large Language Models (LLMs) and image processing models contain billions of trainable parameters. Training such massive neural networks incurs significant memory requirements and financial cost. Hybrid-parallel training approaches have emerged that combine pipelining with data and tensor parallelism to facilitate the training of large DL models on distributed hardware setups. However, existing approaches to design a hybrid-parallel partitioning and parallelization plan for DL models focus on achieving high throughput and not on minimizing memory usage and financial cost. We introduce CAPTURE, a partitioning and parallelization approach for hybrid parallelism that minimizes peak memory usage. CAPTURE combines a profiling-based approach with statistical modeling to recommend a partitioning and parallelization plan that minimizes the peak memory usage across all the Graphics Processing Units (GPUs) in the hardware setup. Our results show a reduction in memory usage of up to 43.9% compared to partitioners in state-of-the-art hybrid-parallel training systems. The reduced memory footprint enables the training of larger DL models on the same hardware resources and training with larger batch sizes. CAPTURE can also train a given model on a smaller hardware setup than other approaches, reducing the financial cost of training massive DL models.
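CAPTURE is described as combining profiling with statistical modeling to choose a partitioning and parallelization plan that minimizes peak per-GPU memory. As a rough illustration of that memory-aware objective (not the paper's method), the sketch below assumes per-layer memory costs have already been profiled and exhaustively searches pipeline cut points that minimize the largest stage; min_peak_partition and the example figures are hypothetical.

    # Illustrative sketch of memory-aware pipeline partitioning: given profiled
    # per-layer memory estimates, choose stage boundaries that minimize the peak
    # memory of any stage. The exhaustive search and all names are assumptions,
    # not CAPTURE's actual algorithm.
    from itertools import combinations

    def stage_memory(layer_mem, boundaries):
        """Per-stage memory totals for a given set of cut points."""
        cuts = [0, *boundaries, len(layer_mem)]
        return [sum(layer_mem[a:b]) for a, b in zip(cuts, cuts[1:])]

    def min_peak_partition(layer_mem, num_stages):
        """Pick cut points that minimize the maximum stage memory."""
        best_cuts, best_peak = None, float("inf")
        for cuts in combinations(range(1, len(layer_mem)), num_stages - 1):
            peak = max(stage_memory(layer_mem, cuts))
            if peak < best_peak:
                best_cuts, best_peak = cuts, peak
        return best_cuts, best_peak

    # Hypothetical profiled per-layer memory (GB): parameters plus activations.
    layers = [1.2, 2.5, 2.5, 3.0, 1.8, 0.9, 2.2, 1.4]
    cuts, peak = min_peak_partition(layers, num_stages=4)
    print("cut before layers:", cuts, "peak stage memory (GB):", round(peak, 2))

A throughput-oriented partitioner would instead balance compute time per stage; optimizing for peak memory, as sketched here, is what allows the same model to fit on fewer or smaller GPUs.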
In response to high electrical energy consumption, it has become an important challenge to develop a reasonable grid dispatching strategy that can achieve energy saving and emission redu...
This study proposes a robust framework for the training of software engineers specializing in parallel computing. We first curated essential content for parallel computing education based on international standards an...
With the progress of smart grid, power systems are faced with more and more complex data challenges, including large-scale data, multi-sample types and low resource integration. This research takes power big data as t...
In traditional CPU scheduling systems, it is challenging to customize scheduling policies for datacenter workloads. Therefore, distributed cluster managers can only perform coarse-grained job scheduling rather than fi...
Load balancing plays a critical role in large-scale heterogeneous edge computing, aiming to enhance the accuracy of computing resource matching and reduce processing delays caused by imbalanced resource allocation. In...
In an effort to offer compute services to a wide range of individuals, cloud computing is a paradigm that involves integrating high-end technology infrastructure into the backend. Although cloud computing was first in...
Artificial Intelligence is an interdisciplinary field that combines Machine Learning, Big Data, Cloud computing, and Information Theory. The major advantage of machine learning is that it builds on prior experience, s...