The parameter server architecture is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers' computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during synchronization and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is updated only when the slowest worker uploads its gradients, ensuring that all workers pull down a consistent global model and that the global model converges. Building upon WBSP, we propose an optimized version that further reduces communication overhead: it executes communication and computation tasks on workers in parallel to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism reduces the time required to reach the target accuracy by up to 60% compared with the fastest existing method and raises the proportion of computation time from 55%-72% in existing methods to 91%.
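As a rough illustration of this mechanism, the sketch below simulates a single parameter server and a few workers of different speeds on a toy quadratic loss: fast workers take several local steps and upload several gradients per round, while the global update is applied only once the slowest worker has uploaded, so every worker then pulls the same model. The class and function names (ParameterServer, Worker, run_round) and the learning rates are illustrative assumptions, not the paper's implementation.

```python
# Toy, single-process simulation of the Worker-Busy Synchronous Parallel (WBSP)
# idea described in the abstract: fast workers keep taking local steps and
# uploading gradients, and the global model is updated only once the slowest
# worker has uploaded, so all workers pull a consistent model afterwards.

import random

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.model = [0.0] * dim
        self.lr = lr
        self.gradient_buffer = []          # gradients uploaded during the round

    def upload(self, grad):
        self.gradient_buffer.append(grad)

    def global_update(self):
        # Triggered only when the slowest worker has uploaded: average all
        # buffered gradients (fast workers may have contributed several).
        n = len(self.gradient_buffer)
        for j in range(len(self.model)):
            avg = sum(g[j] for g in self.gradient_buffer) / n
            self.model[j] -= self.lr * avg
        self.gradient_buffer.clear()
        return list(self.model)            # consistent model pulled by all workers

class Worker:
    def __init__(self, speed, target):
        self.speed = speed                 # local steps finished per round (>= 1)
        self.target = target               # per-worker data summarized by a target point
        self.model = None

    def run_round(self, server):
        # A fast worker completes `speed` local steps and uploads each gradient;
        # the slowest worker (speed == 1) uploads exactly one gradient.
        local = list(self.model)
        for _ in range(self.speed):
            grad = [w - t for w, t in zip(local, self.target)]   # d/dw 0.5*(w-t)^2
            server.upload(grad)
            local = [w - 0.05 * g for w, g in zip(local, grad)]

if __name__ == "__main__":
    server = ParameterServer(dim=2)
    workers = [Worker(speed=random.randint(1, 4), target=[1.0, -1.0]) for _ in range(4)]
    for _ in range(50):
        model = list(server.model)
        for w in workers:
            w.model = model                # everyone starts the round from the same model
            w.run_round(server)            # fast workers upload more gradients, no idle waiting
        server.global_update()             # fires once the slowest worker has uploaded
    print("final model:", server.model)
```

In this toy form, the benefit is simply that fast workers never idle at a barrier; the optimized variant described in the abstract would additionally overlap these uploads with local computation to shorten the synchronization interval.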
In this paper we introduce an abstract algebra for reasoning about concurrent programs that includes an abstract algebra of atomic steps, with sub-algebras of program and environment steps, and an abstract synchronisation operator. We show how the abstract synchronisation operator can be instantiated as a synchronous parallel operator with interpretations in rely/guarantee concurrency for shared-memory systems, and in the process algebras CCS and CSP. It is also instantiated as a weak conjunction operator, an operator that is useful for specifying rely and guarantee conditions in rely/guarantee concurrency. The main difference between the parallel and weak conjunction instantiations of the synchronisation operator lies in how they combine individual atomic steps. Lemmas common to these different instantiations are proved once using the axiomatisation of the abstract synchronous operator. Using the sub-algebras of program and environment atomic steps, rely and guarantee conditions, as well as Morgan-style specification commands, are defined at a high level of abstraction in the program algebra. Lifting these concepts from rely/guarantee concurrency to a higher level of abstraction makes them more widely applicable. We demonstrate the practicality of the algebra by showing how a core law from rely/guarantee theory, the parallel introduction law, can be abstracted and verified easily in the algebra. In addition to proving fundamental properties for reasoning about concurrent shared-variable programs, the algebra is instantiated to prove abstract process synchronisation properties familiar from the process algebras CCS and CSP. The algebra has been encoded in Isabelle/HOL to provide a basis for tool support for concurrent program verification based on the rely/guarantee technique. It facilitates simpler, more general proofs that allow a higher level of automation than is possible in low-level, model-specific interpretations.
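To make the contrast between the two instantiations concrete, the following sketch models single atomic steps in one shared-memory reading: a step is either a program step or an environment step labelled by a transition relation, parallel lets a program step of one side synchronise with an environment step of the other, and weak conjunction requires both sides to take the same kind of step. The Python encoding, the None-for-infeasible convention, and the exact step rules are illustrative assumptions; the paper's actual development is the Isabelle/HOL encoding mentioned above.

```python
# A toy model of how the two instantiations of the synchronisation operator
# combine single atomic steps, in one shared-memory reading of the abstract.
# Atomic steps are program ("pgm") or environment ("env") steps, each labelled
# with a transition relation given as a set of (pre, post) state pairs; the
# relation parts are combined by intersection in both instantiations.

from typing import FrozenSet, Optional, Tuple

State = str
Relation = FrozenSet[Tuple[State, State]]
Step = Tuple[str, Relation]            # ("pgm" | "env", transition relation)

def parallel(a: Step, b: Step) -> Optional[Step]:
    """Synchronous parallel: a program step of one side synchronises with an
    environment step of the other (the other side observes it); two program
    steps cannot synchronise. None stands for the infeasible step."""
    (ka, ra), (kb, rb) = a, b
    if ka == "pgm" and kb == "pgm":
        return None
    kind = "pgm" if "pgm" in (ka, kb) else "env"
    return (kind, ra & rb)

def weak_conjunction(a: Step, b: Step) -> Optional[Step]:
    """Weak conjunction: both sides must agree on the kind of step (both
    program or both environment); mismatched kinds are infeasible."""
    (ka, ra), (kb, rb) = a, b
    if ka != kb:
        return None
    return (ka, ra & rb)

if __name__ == "__main__":
    r1: Relation = frozenset({("s0", "s1"), ("s0", "s2")})
    r2: Relation = frozenset({("s0", "s1")})
    pgm, env = ("pgm", r1), ("env", r2)
    print(parallel(pgm, env))                  # ('pgm', frozenset({('s0', 's1')}))
    print(parallel(pgm, ("pgm", r2)))          # None: two program steps do not synchronise
    print(weak_conjunction(pgm, ("pgm", r2)))  # ('pgm', frozenset({('s0', 's1')}))
    print(weak_conjunction(pgm, env))          # None: kinds disagree
```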
As the volume of model data increases, traditional machine learning can no longer train models efficiently, so distributed machine learning is gradually being adopted for large-scale data training. Commonly used distributed machine learning algorithms are based on data parallelism and often adopt a bulk synchronous parallel strategy when exchanging data, but this strategy limits the overall training speed to the computation speed of the slower workers in the cluster. Although the asynchronous parallel strategy maximizes the computational speed of the cluster, it introduces delays in updating the parameters of the global model, which may lead to excessive computational error or non-convergence of the model. In this paper, the author combines these two data delivery methods by grouping workers, using synchronous parallelism for the workers within a group and asynchronous parallelism for communication between groups. The experiment shows that the hybrid parallel strategy can reduce training time while guaranteeing correctness.
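A minimal sketch of this grouping idea, under the assumption that each group is simulated by a thread and the shared parameters by a lock-protected list: workers inside a group are synchronised (the group averages their gradients before pushing), while groups push their averaged updates asynchronously, without waiting for each other. The loss, group sizes, and all names are illustrative, not the paper's implementation.

```python
# Toy sketch of the hybrid strategy described above: synchronous parallelism
# within a group (the group waits for all of its members and averages their
# gradients) and asynchronous parallelism between groups (each group updates
# the shared parameters as soon as it is ready, on a possibly stale model).

import threading
import time

params = [0.0, 0.0]                      # global model shared by all groups
lock = threading.Lock()                  # protects asynchronous updates
LR = 0.05
TARGET = [1.0, -1.0]

def local_gradient(model, speed):
    time.sleep(speed)                    # simulate heterogeneous worker speed
    return [w - t for w, t in zip(model, TARGET)]   # gradient of 0.5*(w-t)^2

def run_group(worker_speeds, rounds):
    for _ in range(rounds):
        with lock:
            model = list(params)         # pull the current (possibly stale) model
        # Synchronous part: every worker in the group computes on the same model,
        # and the group implicitly waits for its slowest member.
        grads = [local_gradient(model, s) for s in worker_speeds]
        avg = [sum(g[j] for g in grads) / len(grads) for j in range(len(model))]
        # Asynchronous part: push the group's averaged gradient without waiting
        # for the other groups.
        with lock:
            for j in range(len(params)):
                params[j] -= LR * avg[j]

if __name__ == "__main__":
    groups = [[0.01, 0.02], [0.03, 0.05]]        # two groups (per-worker delays in seconds)
    threads = [threading.Thread(target=run_group, args=(g, 40)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final model:", params)
```

In this toy form, the slow group only delays its own updates rather than the whole cluster, while the within-group averaging keeps each pushed update based on a single consistent model snapshot.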