Multi-objective neural architecture search (NAS) algorithms aim to automatically search the neural architecture suitable for different computing power platforms by using multi-objective optimization methods. The LEMON...
Storing files at the network edge has become a new paradigm of storage systems, which is promising to mitigate network congestion and reduce file retrieval latency. However, the traditional file storage scheme cannot ...
Sparse triangular solve (SpTRSV) is a vital component in various scientific applications, and numerous GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSV is currently the mainstream algorithm ...
Mixture of Experts (MoE) has received increasing attention for scaling DNN models to extra-large sizes with negligible increases in computation, and MoE models have achieved the highest accuracy in several domains. However, significant load imbalance arises on the devices during the training of an MoE model, which substantially reduces throughput. Previous work on load balancing either harms model convergence or suffers from high execution overhead. To address these issues, we present Prophet, a fine-grained load balancing method for parallel training of large-scale MoE models that consists of a planner and a scheduler. The Prophet planner first uses a fine-grained resource allocation method to determine the possible expert placements, and then efficiently searches for a well-balanced placement that evens out the load without introducing additional overhead. The Prophet scheduler exploits the locality of the token distribution to schedule the resource allocation operations with a layer-wise, fine-grained strategy that hides their overhead. We conduct extensive experiments on four clusters with five representative models. The results indicate that Prophet achieves up to 2.3x speedup over state-of-the-art MoE frameworks, including DeepSpeed-MoE and FasterMoE. Additionally, Prophet improves load balance by up to 12.06x compared with FasterMoE.
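To make the idea of searching for a balanced expert placement concrete, here is a minimal sketch. It is not Prophet's planner: it swaps in a generic greedy heuristic (heaviest expert first, always onto the least-loaded device), and the function names and token counts are hypothetical, chosen only for illustration.

```python
import heapq

def greedy_expert_placement(token_counts, num_devices):
    """Assign experts to devices, heaviest first, onto the least-loaded device.

    Generic greedy heuristic for illustration; NOT the Prophet planner.
    token_counts[e] is the (assumed known) number of tokens routed to expert e.
    """
    heap = [(0, d) for d in range(num_devices)]  # (current load, device id)
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    by_load = sorted(range(len(token_counts)),
                     key=lambda e: token_counts[e], reverse=True)
    for expert in by_load:
        load, device = heapq.heappop(heap)  # least-loaded device so far
        placement[device].append(expert)
        heapq.heappush(heap, (load + token_counts[expert], device))
    return placement

def device_loads(token_counts, placement):
    """Total tokens each device must process under a given placement."""
    return {d: sum(token_counts[e] for e in experts)
            for d, experts in placement.items()}

# Hypothetical skewed token distribution over six experts.
token_counts = [900, 100, 400, 300, 200, 100]
placement = greedy_expert_placement(token_counts, num_devices=2)
loads = device_loads(token_counts, placement)
print(placement, loads)
```

With this skewed input, a naive split of three consecutive experts per device would leave one device with 1400 tokens and the other with 600, whereas the greedy pass evens the two out. A real planner would also have to account for communication and migration cost, which this sketch ignores.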
ISBN: (print) 9781450399951
DSP holds significant potential for important applications in Deep Neural Networks. However, there is currently a lack of research focused on shared-memory CPU-DSP heterogeneous chips. This paper proposes CD-Sched, an automated scheduling framework that aims to address this gap. By predicting the latency of each operator on both the CPU and the DSP, CD-Sched automatically schedules the operator's computation to the more suitable device. This scheduling optimization accelerates individual operators and ultimately shortens the overall training time of neural networks. In end-to-end training tasks, CD-Sched reduces the overall training time by approximately 10.77% on average.
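The scheduling rule described above (predict each operator's latency per device, then run it where the prediction is lowest) can be sketched as follows. This is an illustrative toy, not CD-Sched's implementation: the linear latency models, their coefficients, and the operator descriptions are all assumptions made up for the example.

```python
def schedule_operators(ops, predictors):
    """Map each operator name to the device with the lowest predicted latency.

    ops: list of dicts with a "name" and a "flops" field (hypothetical format).
    predictors: dict of device name -> function(op) -> predicted latency in ms.
    """
    plan = {}
    for op in ops:
        plan[op["name"]] = min(predictors, key=lambda dev: predictors[dev](op))
    return plan

# Assumed toy latency models: the DSP pays a larger fixed dispatch cost
# but has higher throughput, so it only wins on large operators.
predictors = {
    "cpu": lambda op: 0.5 + op["flops"] * 2e-3,
    "dsp": lambda op: 2.0 + op["flops"] * 5e-4,
}

ops = [
    {"name": "bias_add", "flops": 100},     # small: fixed cost dominates
    {"name": "conv2d", "flops": 10_000},    # large: throughput dominates
]

plan = schedule_operators(ops, predictors)
print(plan)
```

Under these assumed models the break-even point is 1000 FLOPs, so the small operator stays on the CPU and the large one goes to the DSP. In practice the predictors would be learned or profiled rather than hand-written.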
Machine learning is broadly used in many intelligent cybernetic systems. With the burgeoning of the communities of AI, the number of machine learning-based models is rapidly increasing, but picking a suitable and opti...
Scalability is a crucial factor determining the performance of massive heterogeneous parallel CFD applications on the multi-GPUs platforms, particularly after the single-GPU implementations have achieved optimal perfo...
Motion and appearance cues play a crucial role in Multi-object Tracking (MOT) algorithms for associating objects across consecutive frames. While most MOT methods prioritize accurate motion modeling and distincti...
SHA-256 plays an important role in widely used applications, such as data security, data integrity, digital signatures, and cryptocurrencies. However, most of the current optimized implementations of SHA-256 are based...
The role-oriented learning approach could improve the performance of multi-agent reinforcement learning by decomposing complex multi-agent tasks into different roles. However, due to the dynamic environment and intera...