The parameter server architecture is widely used in distributed machine learning to accelerate training. However, the increasing heterogeneity of workers' computing capabilities leads to the issue of stragglers, making parameter synchronization challenging. To address this issue, we propose a solution called Worker-Busy Synchronous Parallel (WBSP). This approach eliminates the waiting time of fast workers during synchronization and decouples the gradient upload and model download of fast workers into asymmetric parts. By doing so, it allows fast workers to complete multiple steps of local training and upload more gradients to the server, improving computational resource utilization. Additionally, the global model is updated only when the slowest worker uploads its gradients, ensuring that all workers pull down a consistent global model and that the global model converges. Building upon WBSP, we propose an optimized version that further reduces communication overhead: it executes communication and computation tasks on workers in parallel to shorten the global synchronization interval, thereby improving training speed. We conduct theoretical analyses for the proposed mechanisms. Extensive experiments verify that our mechanism reduces the time required to reach the target accuracy by up to 60% compared with the fastest existing method and raises the proportion of computation time from 55%-72% in existing methods to 91%.
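As a rough illustration of this mechanism, the sketch below simulates a single parameter server and a few workers of different speeds on a toy quadratic loss: fast workers take several local steps and upload several gradients per round, while the global update is applied only once the slowest worker has uploaded, so every worker then pulls the same model. The class and function names (ParameterServer, Worker, run_round) and the learning rates are illustrative assumptions, not the paper's implementation.

```python
# Toy, single-process simulation of the Worker-Busy Synchronous Parallel (WBSP)
# idea described in the abstract: fast workers keep taking local steps and
# uploading gradients, and the global model is updated only once the slowest
# worker has uploaded, so all workers pull a consistent model afterwards.

import random

class ParameterServer:
    def __init__(self, dim, lr=0.1):
        self.model = [0.0] * dim
        self.lr = lr
        self.gradient_buffer = []          # gradients uploaded during the round

    def upload(self, grad):
        self.gradient_buffer.append(grad)

    def global_update(self):
        # Triggered only when the slowest worker has uploaded: average all
        # buffered gradients (fast workers may have contributed several).
        n = len(self.gradient_buffer)
        for j in range(len(self.model)):
            avg = sum(g[j] for g in self.gradient_buffer) / n
            self.model[j] -= self.lr * avg
        self.gradient_buffer.clear()
        return list(self.model)            # consistent model pulled by all workers

class Worker:
    def __init__(self, speed, target):
        self.speed = speed                 # local steps finished per round (>= 1)
        self.target = target               # per-worker data summarized by a target point
        self.model = None

    def run_round(self, server):
        # A fast worker completes `speed` local steps and uploads each gradient;
        # the slowest worker (speed == 1) uploads exactly one gradient.
        local = list(self.model)
        for _ in range(self.speed):
            grad = [w - t for w, t in zip(local, self.target)]   # d/dw 0.5*(w-t)^2
            server.upload(grad)
            local = [w - 0.05 * g for w, g in zip(local, grad)]

if __name__ == "__main__":
    server = ParameterServer(dim=2)
    workers = [Worker(speed=random.randint(1, 4), target=[1.0, -1.0]) for _ in range(4)]
    for _ in range(50):
        model = list(server.model)
        for w in workers:
            w.model = model                # everyone starts the round from the same model
            w.run_round(server)            # fast workers upload more gradients, no idle waiting
        server.global_update()             # fires once the slowest worker has uploaded
    print("final model:", server.model)
```

In this toy form, the benefit is simply that fast workers never idle at a barrier; the optimized variant described in the abstract would additionally overlap these uploads with local computation to shorten the synchronization interval.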
In this paper we introduce an abstract algebra for reasoning about concurrent programs that includes an abstract algebra of atomic steps, with sub-algebras of program and environment steps, and an abstract synchronisation operator. We show how the abstract synchronisation operator can be instantiated as a synchronous parallel operator with interpretations in rely/guarantee concurrency for shared-memory systems, and in the process algebras CCS and CSP. It is also instantiated as a weak conjunction operator, an operator that is useful for specifying rely and guarantee conditions in rely/guarantee concurrency. The main difference between the parallel and weak conjunction instantiations of the synchronisation operator lies in how they combine individual atomic steps. Lemmas common to these different instantiations are proved once using the axiomatisation of the abstract synchronous operator. Using the sub-algebras of program and environment atomic steps, rely and guarantee conditions, as well as Morgan-style specification commands, are defined at a high level of abstraction in the program algebra. Lifting these concepts from rely/guarantee concurrency to a higher level of abstraction makes them more widely applicable. We demonstrate the practicality of the algebra by showing how a core law from rely/guarantee theory, the parallel introduction law, can be abstracted and verified easily in the algebra. In addition to proving fundamental properties for reasoning about concurrent shared-variable programs, the algebra is instantiated to prove abstract process synchronisation properties familiar from the process algebras CCS and CSP. The algebra has been encoded in Isabelle/HOL to provide a basis for tool support for concurrent program verification based on the rely/guarantee technique. It facilitates simpler, more general proofs that allow a higher level of automation than is possible in low-level, model-specific interpretations.
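To make the contrast between the two instantiations concrete, the following sketch models single atomic steps in one shared-memory reading: a step is either a program step or an environment step labelled by a transition relation, parallel lets a program step of one side synchronise with an environment step of the other, and weak conjunction requires both sides to take the same kind of step. The Python encoding, the None-for-infeasible convention, and the exact step rules are illustrative assumptions; the paper's actual development is the Isabelle/HOL encoding mentioned above.

```python
# A toy model of how the two instantiations of the synchronisation operator
# combine single atomic steps, in one shared-memory reading of the abstract.
# Atomic steps are program ("pgm") or environment ("env") steps, each labelled
# with a transition relation given as a set of (pre, post) state pairs; the
# relation parts are combined by intersection in both instantiations.

from typing import FrozenSet, Optional, Tuple

State = str
Relation = FrozenSet[Tuple[State, State]]
Step = Tuple[str, Relation]            # ("pgm" | "env", transition relation)

def parallel(a: Step, b: Step) -> Optional[Step]:
    """Synchronous parallel: a program step of one side synchronises with an
    environment step of the other (the other side observes it); two program
    steps cannot synchronise. None stands for the infeasible step."""
    (ka, ra), (kb, rb) = a, b
    if ka == "pgm" and kb == "pgm":
        return None
    kind = "pgm" if "pgm" in (ka, kb) else "env"
    return (kind, ra & rb)

def weak_conjunction(a: Step, b: Step) -> Optional[Step]:
    """Weak conjunction: both sides must agree on the kind of step (both
    program or both environment); mismatched kinds are infeasible."""
    (ka, ra), (kb, rb) = a, b
    if ka != kb:
        return None
    return (ka, ra & rb)

if __name__ == "__main__":
    r1: Relation = frozenset({("s0", "s1"), ("s0", "s2")})
    r2: Relation = frozenset({("s0", "s1")})
    pgm, env = ("pgm", r1), ("env", r2)
    print(parallel(pgm, env))                  # ('pgm', frozenset({('s0', 's1')}))
    print(parallel(pgm, ("pgm", r2)))          # None: two program steps do not synchronise
    print(weak_conjunction(pgm, ("pgm", r2)))  # ('pgm', frozenset({('s0', 's1')}))
    print(weak_conjunction(pgm, env))          # None: kinds disagree
```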
As the volume of model data increases, traditional machine learning can no longer train models efficiently, so distributed machine learning is gradually being adopted for large-scale data training. Commonly used distributed machine learning algorithms are based on data parallelism and often adopt a bulk synchronous parallel strategy when exchanging data, but this strategy limits the overall training speed to the computation speed of the slower workers in the cluster. Although the asynchronous parallel strategy maximizes the computational speed of the cluster, it introduces delays in updating the parameters of the global model, which may lead to excessive computational error or non-convergence of the model. In this paper, the author combines these two data delivery methods by grouping workers, using synchronous parallelism for the workers within a group and asynchronous parallelism for communication between groups. The experiment shows that the hybrid parallel strategy can reduce training time while guaranteeing correctness.
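A minimal sketch of this grouping idea, under the assumption that each group is simulated by a thread and the shared parameters by a lock-protected list: workers inside a group are synchronised (the group averages their gradients before pushing), while groups push their averaged updates asynchronously, without waiting for each other. The loss, group sizes, and all names are illustrative, not the paper's implementation.

```python
# Toy sketch of the hybrid strategy described above: synchronous parallelism
# within a group (the group waits for all of its members and averages their
# gradients) and asynchronous parallelism between groups (each group updates
# the shared parameters as soon as it is ready, on a possibly stale model).

import threading
import time

params = [0.0, 0.0]                      # global model shared by all groups
lock = threading.Lock()                  # protects asynchronous updates
LR = 0.05
TARGET = [1.0, -1.0]

def local_gradient(model, speed):
    time.sleep(speed)                    # simulate heterogeneous worker speed
    return [w - t for w, t in zip(model, TARGET)]   # gradient of 0.5*(w-t)^2

def run_group(worker_speeds, rounds):
    for _ in range(rounds):
        with lock:
            model = list(params)         # pull the current (possibly stale) model
        # Synchronous part: every worker in the group computes on the same model,
        # and the group implicitly waits for its slowest member.
        grads = [local_gradient(model, s) for s in worker_speeds]
        avg = [sum(g[j] for g in grads) / len(grads) for j in range(len(model))]
        # Asynchronous part: push the group's averaged gradient without waiting
        # for the other groups.
        with lock:
            for j in range(len(params)):
                params[j] -= LR * avg[j]

if __name__ == "__main__":
    groups = [[0.01, 0.02], [0.03, 0.05]]        # two groups (per-worker delays in seconds)
    threads = [threading.Thread(target=run_group, args=(g, 40)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("final model:", params)
```

In this toy form, the slow group only delays its own updates rather than the whole cluster, while the within-group averaging keeps each pushed update based on a single consistent model snapshot.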