Distributed implementations are crucial for speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded distributed computation techniques have recently been introduced to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers. In this paper, we introduce a novel paradigm of dynamic coded computation, which assigns redundant data to workers so as to acquire the flexibility to dynamically choose from among a set of possible codes depending on the past straggling behavior. In particular, we propose gradient coding (GC) with dynamic clustering, called GC-DC, which regulates the number of stragglers in each cluster by dynamically forming the clusters at each iteration. With time-correlated straggling behavior, GC-DC adapts to the straggling behavior over time; in particular, at each iteration, GC-DC aims to distribute the stragglers across clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time without any increase in the communication load compared to the original GC scheme.
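As an illustration of the clustering step, the Python sketch below (with hypothetical names such as `dynamic_clustering`; it ignores the data-placement constraints that the actual GC-DC scheme must respect) greedily re-forms equal-size clusters at each iteration so that workers that straggled in the previous iteration are spread across clusters as uniformly as possible.

```python
# Illustrative sketch only, not the paper's exact algorithm: re-form clusters
# so that workers that straggled in the previous iteration are spread across
# clusters as evenly as possible. Data-placement constraints are ignored.
import numpy as np

def dynamic_clustering(straggled_last_iter, num_clusters, rng=None):
    """Return cluster_of[i] in {0, ..., num_clusters - 1} for each worker i."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(straggled_last_iter)
    stragglers = [i for i in range(n) if straggled_last_iter[i]]
    others = [i for i in range(n) if not straggled_last_iter[i]]
    rng.shuffle(stragglers)
    rng.shuffle(others)
    cluster_of = np.empty(n, dtype=int)
    # Round-robin the predicted stragglers first, then the remaining workers,
    # so every cluster ends up with (nearly) the same number of each.
    for k, w in enumerate(stragglers + others):
        cluster_of[w] = k % num_clusters
    return cluster_of

# Example: 12 workers, 3 clusters; workers 0, 1, 5 and 9 straggled last time.
straggled = np.zeros(12, dtype=bool)
straggled[[0, 1, 5, 9]] = True
print(dynamic_clustering(straggled, num_clusters=3))
```

Since each cluster can tolerate only a limited number of stragglers, balancing the predicted stragglers across clusters in this way reduces the chance that any single cluster exceeds its tolerance in the next iteration.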
The emerging large-scale and data-hungry algorithms require computations to be delegated from a central server to several worker nodes. One major challenge in distributed computation is to tackle delays and failures caused by stragglers. To address this challenge, introducing an efficient amount of redundant computation via distributed coded computation has received significant attention. Recent approaches in this area have mainly focused on introducing the minimum computational redundancy needed to tolerate a certain number of stragglers. To the best of our knowledge, the current literature lacks a unified end-to-end design for a heterogeneous setting in which workers vary in their computation and communication capabilities. The contribution of this paper is a novel framework for joint scheduling and coding in a setting where the workers and the arrivals of streaming computational jobs follow stochastic models. In our initial joint scheme, we propose a systematic framework that shows how to select a set of workers and how to split the computational load among the selected workers, based on their differences, in order to minimize the average in-order job execution delay. Through simulations, we demonstrate that the performance of our framework is dramatically better than that of a naive method that splits the computational load uniformly among the workers, and that it is close to the ideal performance.
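As a toy illustration of why splitting by worker capability matters, the sketch below (hypothetical names, deterministic service rates; the paper's scheduler additionally handles stochastic job arrivals and in-order delivery, which are not modeled here) compares a uniform split with a split proportional to worker speed.

```python
# Minimal sketch (hypothetical names, deterministic service rates): split a job
# of `total_rows` rows among workers in proportion to their speeds and compare
# with the naive uniform split mentioned above.
def proportional_split(total_rows, service_rates):
    """Per-worker row counts proportional to each worker's rows-per-second rate."""
    total_rate = sum(service_rates)
    shares = [int(round(total_rows * r / total_rate)) for r in service_rates]
    shares[-1] += total_rows - sum(shares)  # absorb rounding error
    return shares

def completion_time(shares, service_rates):
    """The job finishes when the slowest worker finishes its share."""
    return max(s / r for s, r in zip(shares, service_rates))

rates = [4.0, 2.0, 1.0]                    # rows per second for three workers
uniform = [334, 333, 333]                  # naive uniform split of 1000 rows
print(completion_time(uniform, rates))                           # 333.0 s
print(completion_time(proportional_split(1000, rates), rates))   # 143.0 s
```

With the uniform split, the slowest worker dominates the completion time, while splitting in proportion to worker speed roughly equalizes the finish times and cuts the completion time by more than half in this example.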
We propose two coding schemes for distributed matrix multiplication in the presence of stragglers. These coding schemes are adaptations of Luby Transform (LT) codes and Raptor codes to distributed matrix multiplication and are termed Factored LT (FLT) codes and Factored Raptor (FRT) codes. We show that all nodes in the Tanner graph of a randomly sampled code have a tree-like neighborhood with high probability. This ensures that a density evolution analysis gives a reasonable estimate of the average recovery threshold of FLT codes. The recovery threshold of the proposed FLT codes is asymptotically optimal when the output degree distribution is the Soliton distribution. Empirically, we show that FRT codes have an excellent recovery threshold when the number of worker nodes is moderately large. In addition, using the Azuma-Hoeffding inequality, we derive concentration results showing that the recovery threshold of a randomly chosen FLT code is close to the ensemble average. FLT and FRT codes have better recovery thresholds than Product codes, are expected to have better numerical stability than Polynomial codes, and can be decoded with a low-complexity decoding algorithm. Finally, the proposed codes are better matched to the practically important case of sparse matrix-matrix multiplication than many previous schemes.
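For intuition, the sketch below shows the LT-style ingredients only, with hypothetical helper names (`lt_encode`, `peel_decode`): random sums of row blocks of one matrix, workers multiplying the coded blocks, and a peeling decoder at the master. The actual FLT construction is "factored", i.e. it codes over products of blocks of both matrices, and uses a Soliton-type degree distribution rather than the toy one used here.

```python
# Simplified LT-coded matrix multiplication: not the paper's FLT construction,
# just the encode / multiply / peel pipeline it builds on.
import numpy as np

rng = np.random.default_rng(0)

def lt_encode(blocks, num_coded, degree_dist):
    """Each coded block is the sum of a random subset of the source blocks."""
    k = len(blocks)
    coded = []
    for _ in range(num_coded):
        d = int(rng.choice(len(degree_dist), p=degree_dist)) + 1   # sampled degree
        idx = rng.choice(k, size=min(d, k), replace=False)
        coded.append((set(int(i) for i in idx), sum(blocks[i] for i in idx)))
    return coded

def peel_decode(received, k):
    """Standard peeling: use coded blocks with one unknown source to recover it."""
    recovered = {}
    progress = True
    while progress and len(recovered) < k:
        progress = False
        for support, value in received:
            unknown = support.difference(recovered)
            if len(unknown) == 1:
                i = unknown.pop()
                recovered[i] = value - sum(recovered[j] for j in support if j != i)
                progress = True
    return recovered if len(recovered) == k else None

# Toy example: A (6x4) split into k = 3 row blocks, multiplied by B (4x2).
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 2))
blocks = np.split(A, 3)
coded = lt_encode(blocks, num_coded=8, degree_dist=[0.4, 0.4, 0.2])
# Each worker computes its coded block times B; suppose only 6 of 8 return.
results = [(support, cA @ B) for support, cA in coded][:6]
dec = peel_decode(results, k=3)
if dec is None:
    print("peeling failed; wait for more worker results")
else:
    print("recovered A @ B:", np.allclose(np.vstack([dec[i] for i in range(3)]), A @ B))
```

The number of returned worker results needed for the peeling decoder to succeed is the recovery threshold that the abstract's density evolution and concentration analyses characterize.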