Gradient codes use data replication to mitigate the effect of straggling machines in distributed machine learning. Approximate gradient codes consider the regime where the data replication factor is too low to recover the full gradient exactly. Our work is motivated by the challenge of designing approximate gradient codes that simultaneously work well in both the adversarial and the random straggler models. We introduce novel approximate gradient codes based on expander graphs. We analyze the decoding error under both random and adversarial stragglers when optimal decoding coefficients are used. With random stragglers, our codes achieve a gradient approximation error that decays exponentially in the replication factor. With adversarial stragglers, the error is smaller than that of any existing code with comparable performance in the random setting. We prove convergence bounds for coded gradient descent in both settings under standard assumptions. With random stragglers, our convergence rate improves upon rates obtained via black-box approaches. With adversarial stragglers, we show that gradient descent converges down to a noise floor that scales linearly with the adversarial gradient error. We demonstrate empirically that our codes achieve near-optimal error with random stragglers and converge faster than algorithms that do not use optimal decoding coefficients.
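As a minimal illustration of the "optimal decoding coefficients" step (not the expander-graph construction itself), the sketch below assumes a hypothetical 0/1 assignment matrix `B` (workers x data parts) and computes the least-squares decoding weights that best approximate the sum of all partial gradients from the surviving workers; the residual is the kind of decoding error discussed above.

```python
import numpy as np

# Hypothetical setup: n workers, k data parts, each part stored on d workers.
# B[i, j] = 1 means worker i holds data part j (B is an illustrative random
# placement, not the expander-graph construction from the paper).
rng = np.random.default_rng(0)
n, k, d = 12, 12, 3
B = np.zeros((n, k))
for j in range(k):                         # place each part on d random workers
    B[rng.choice(n, size=d, replace=False), j] = 1.0

survivors = rng.choice(n, size=8, replace=False)   # non-straggling workers
B_S = B[survivors]                                 # rows the master receives

# Optimal decoding coefficients w minimize ||B_S^T w - 1||_2: the weighted sum
# of the received coded gradients should be as close to the full sum as possible.
w, *_ = np.linalg.lstsq(B_S.T, np.ones(k), rcond=None)

err = np.linalg.norm(B_S.T @ w - np.ones(k)) ** 2   # squared decoding error
print("decoding error:", err)
```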
Gradient coding allows a master node to derive the aggregate of the partial gradients, computed by worker nodes over their local data sets, with minimum communication cost and in the presence of stragglers. In this paper, for gradient coding with linear encoding, we characterize the optimum communication cost for heterogeneous distributed systems with arbitrary data placement, with $s \in \mathbb{N}$ stragglers and $a \in \mathbb{N}$ adversarial nodes. In particular, we show that the optimum communication cost, normalized by the size of the gradient vectors, is equal to $(r - s - 2a)^{-1}$, where $r \in \mathbb{N}$ is the minimum number of times any data partition is replicated. In other words, the communication cost is determined by the data partition with the minimum replication, irrespective of the structure of the placement. The proposed achievable scheme also allows us to target the computation of a polynomial function of the aggregated gradient matrix. It further allows us to borrow ideas from approximate computing and propose an approximate gradient coding scheme for the cases where the replication in the data placement is smaller than what is needed to meet the restriction imposed on the communication cost, or where the number of stragglers turns out to be larger than the value presumed in the system design.
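For a quick, purely illustrative check of the stated optimum (parameter values are made up, not from the paper): with minimum replication $r = 5$, $s = 1$ straggler, and $a = 1$ adversarial node, the normalized communication cost is $(5 - 1 - 2)^{-1} = 1/2$, i.e., each worker transmits a vector half the length of the full gradient.

```python
# Minimal sketch of the normalized communication cost (r - s - 2a)^{-1}
# stated in the abstract; the values below are illustrative.
def normalized_comm_cost(r: int, s: int, a: int) -> float:
    """Per-worker communication cost, normalized by the gradient length."""
    assert r > s + 2 * a, "need minimum replication r > s + 2a for exact recovery"
    return 1.0 / (r - s - 2 * a)

print(normalized_comm_cost(r=5, s=1, a=1))   # -> 0.5
```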
We consider distributed computation of a sequence of J gradients {g(0), . . . , g(J - 1)}. Each worker node computes a fraction of g(t) in round-t and attempts to communicate the result to a master. Master is required to obtain the full gradient g(t) by the end of round-(t+T). The goal here is to finish all the J gradient computations, keeping the cumulative processing time as short as possible. Delayed availability of results from individual workers causes bottlenecks in this setting. These delays can be due to factors such as processing delay of workers and packet losses. gradient coding (GC) framework introduced by Tandon et al. uses coding theoretic techniques to mitigate the effect of delayed responses from workers. In this paper, we primarily target mitigating communication-level delays. In contrast to the classical GC approach which performs coding only across workers (T = 0), the proposed sequential gradient coding framework is more general, as it allows for coding across workers as well as time. We present a new sequential gradient coding scheme which offers improved resiliency against communication-level delays compared to the GC scheme, without increasing computational load. Our experimental results establish performance improvement offered by the new coding scheme.
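For context, the sketch below illustrates the baseline GC idea of coding only across workers ($T = 0$), using the simple fractional-repetition placement attributed to Tandon et al.; the sizes are made up, and this is not the sequential (across-time) scheme proposed in the paper.

```python
import numpy as np

# Baseline GC with fractional repetition: workers are split into groups of
# s+1, every worker in a group holds the same block of data parts, and any
# s stragglers leave at least one survivor per group. Sizes are illustrative.
n, s = 6, 2                              # 6 workers, tolerate any 2 stragglers
n_groups = n // (s + 1)                  # groups of s+1 workers share one block
worker_group = np.repeat(np.arange(n_groups), s + 1)   # worker -> block id

rng = np.random.default_rng(1)
k, dim = 8, 4                            # 8 data parts, gradients of dimension 4
g = rng.normal(size=(k, dim))            # partial gradient of each data part
blocks = np.array_split(np.arange(k), n_groups)        # block id -> data parts

def worker_msg(w):
    """Each worker sends the summed partial gradient of its group's block."""
    return g[blocks[worker_group[w]]].sum(axis=0)

stragglers = {1, 4}                      # any s workers may fail to respond
recovered = np.zeros(dim)
for b in range(n_groups):
    alive = [w for w in range(n) if worker_group[w] == b and w not in stragglers]
    recovered += worker_msg(alive[0])    # one survivor per group is enough
assert np.allclose(recovered, g.sum(axis=0))
```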
ISBN (Print): 9781665403122
In distributed machine learning (DML), the training data is distributed across multiple worker nodes to perform the underlying training in parallel. One major problem affecting the performance of DML algorithms is the presence of stragglers: nodes that are very slow in performing their task, which results in under-utilization of the training data stored on them. Gradient coding mitigates the impact of stragglers by adding sufficient redundancy to the data. Gradient coding and other straggler mitigation schemes assume that the straggler behavior of the worker nodes is identical. Our experiments on an Amazon AWS cluster, however, suggest otherwise: we see that there is correlation in the straggler behavior across iterations. To model this, we introduce a heterogeneous straggler model where nodes are categorized into two classes, slow and active. To better utilize the training data stored with slow nodes, we modify existing gradient coding schemes by shuffling the training data among workers. Our results (both simulations and cloud experiments) show a remarkable improvement of shuffling over existing schemes. We also provide a theoretical analysis of the proposed models that justifies their utility.
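The following is a hypothetical sketch of the shuffling idea under the slow/active model: partitions held by workers that were slow in the previous iteration are swapped with partitions held by active workers, so all of the data keeps contributing. The swap rule and the `reshuffle` helper are illustrative assumptions, not the paper's actual scheme.

```python
import random

def reshuffle(assignment, slow, active, rng=random):
    """Swap one data partition from each slow worker with a random active worker.

    assignment: dict worker_id -> list of partition ids (illustrative layout).
    """
    new_assignment = {w: list(parts) for w, parts in assignment.items()}
    for s_w in slow:
        if not new_assignment[s_w] or not active:
            continue
        a_w = rng.choice(list(active))
        # move one partition off the slow worker and take one from the active worker
        p_slow = new_assignment[s_w].pop()
        p_active = new_assignment[a_w].pop()
        new_assignment[s_w].append(p_active)
        new_assignment[a_w].append(p_slow)
    return new_assignment

assignment = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
print(reshuffle(assignment, slow={3}, active={0, 1, 2}))
```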
ISBN (Print): 9781728125190
Gradient descent algorithms are widely used in machine learning. To deal with huge volumes of data, we consider the implementation of gradient descent algorithms in a distributed computing setting where multiple workers compute the gradient over partial data and the master node aggregates their results to obtain the gradient over the whole data. However, performance can be severely affected by straggler workers. Recently, coding-based approaches have been introduced to mitigate the straggler problem, but they are efficient only when the workers are homogeneous, i.e., have the same computation capabilities. In this paper, we consider heterogeneous workers, which are common in modern distributed systems. We propose a novel heterogeneity-aware gradient coding scheme that can not only tolerate a predetermined number of stragglers but also fully utilize the computation capabilities of heterogeneous workers. We show that this scheme is optimal when the computation capabilities of the workers are estimated accurately. A variant of this scheme is further proposed to improve performance when the estimates of the computation capabilities are less accurate. We run our schemes for gradient-descent-based image classification on QingCloud clusters. Evaluation results show that our schemes can reduce the overall computation time by up to 3x compared with a state-of-the-art coding scheme.
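As a hedged sketch of the heterogeneity-aware idea (assigning work in proportion to estimated worker speeds), the `allocate` helper below is a hypothetical illustration and not the paper's coding scheme.

```python
import numpy as np

def allocate(num_parts: int, speeds: np.ndarray) -> np.ndarray:
    """Assign data partitions to workers in proportion to their estimated speeds."""
    raw = num_parts * speeds / speeds.sum()
    alloc = np.floor(raw).astype(int)
    # hand leftover partitions to the workers with the largest fractional remainder
    remainder = raw - alloc
    for i in np.argsort(-remainder)[: num_parts - alloc.sum()]:
        alloc[i] += 1
    return alloc

speeds = np.array([1.0, 1.0, 2.0, 4.0])       # estimated samples/second per worker
print(allocate(num_parts=16, speeds=speeds))  # -> [2 2 4 8]
```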
Linear regression is a fundamental primitive in supervised machine learning, with applications ranging from epidemiology to finance. In this work, we propose methods for speeding up distributed linear regression. We do so by leveraging randomized techniques, while also ensuring security and straggler resiliency in asynchronous distributed computing systems. Specifically, we randomly rotate the basis of the system of equations and then subsample blocks, to simultaneously secure the information and reduce the dimension of the regression problem. In our setup, the basis rotation corresponds to an encoded encryption in an approximate gradient coding scheme, and the subsampling corresponds to the responses of the non-straggling servers in the centralized coded computing framework. This results in a distributed, iterative, stochastic approach for matrix compression and steepest descent.
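A minimal sketch of the rotate-then-subsample idea, assuming a Gaussian-QR orthogonal rotation and uniform block subsampling (both illustrative stand-ins for the paper's specific encoding/encryption and server model):

```python
import numpy as np

# Rotate the basis of the least-squares system by a random orthogonal matrix,
# then solve using only a random subset of blocks (the "non-straggler" rows).
rng = np.random.default_rng(0)
m, d = 400, 10
A = rng.normal(size=(m, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=m)

Q, _ = np.linalg.qr(rng.normal(size=(m, m)))   # random orthogonal rotation
A_rot, b_rot = Q @ A, Q @ b                    # rotated (masked) system

blocks = np.array_split(np.arange(m), 20)      # 20 blocks of rows
received = [blocks[i] for i in rng.choice(20, size=12, replace=False)]
rows = np.concatenate(received)                # responses that actually arrived

x_hat = np.linalg.lstsq(A_rot[rows], b_rot[rows], rcond=None)[0]
print("estimation error:", np.linalg.norm(x_hat - x_true))
```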
Gradient coding is a coding-theoretic framework for providing robustness against slow or unresponsive machines, known as stragglers, in distributed machine learning applications. Recently, Kadhe et al. (2019) proposed a gradient code based on a combinatorial design called a balanced incomplete block design (BIBD), which is shown to outperform many existing gradient codes in worst-case adversarial straggling scenarios. However, the parameters for which such BIBD constructions exist are very limited (Colbourn and Dinitz, 2006). In this paper, we aim to overcome these limitations and construct gradient codes that exist for a wide range of system parameters while retaining the superior performance of BIBD gradient codes. Two such constructions are proposed: one based on a probabilistic construction that relaxes the stringent BIBD constraints, and the other based on taking the Kronecker product of existing gradient codes. The proposed gradient codes allow flexible choices of system parameters while retaining comparable error performance.
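The Kronecker-product construction can be illustrated on toy assignment matrices; the 3x3 cyclic base codes below are placeholders, and the point is only that worker counts, data parts, and replication factors multiply under `np.kron`.

```python
import numpy as np

# Two small base gradient-code assignment matrices (workers x data parts,
# 1 = "worker stores this part"); cyclic placements used purely as placeholders.
B1 = np.array([[1, 1, 0],
               [0, 1, 1],
               [1, 0, 1]])
B2 = B1.copy()

B = np.kron(B1, B2)          # combined code: 9 workers x 9 data parts
print(B.shape)               # (9, 9)
print(B.sum(axis=0))         # each part now replicated 2 * 2 = 4 times
```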
ISBN (Print): 9781538665961
Distributed gradient descent is an optimization algorithm used to solve a minimization problem distributed over a network by minimizing local functions that sum up to form the overall objective function. These local functions $f_i(\cdot)$ contribute local gradients that add up incrementally to form the overall gradient. Recently, the gradient coding paradigm was introduced for networks with a centralized fusion center to resolve the problem of straggler nodes. By introducing some redundancy on each node, such coding schemes form new coded local functions $\bar{g}_i$ from the original local functions $f_i$. In this work, we consider a distributed network with a defined network topology and no fusion center. At each node, linear combinations of the coded local gradients $\nabla \bar{g}_i$ can be constructed to form the overall gradient. Our iterative method, referred to as Code-Based Distributed Gradient Descent (CDGD), updates each node's local estimate by applying a suitable weighting scheme that combines the coded local gradient descent step with the local estimates of neighboring nodes. We provide a convergence analysis for CDGD and analytically show that it enhances the convergence rate by a scaling factor over conventional incremental methods, without any predefined tuning. Furthermore, numerical results demonstrate significant improvements in convergence rate.
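A hypothetical sketch of a CDGD-style update, assuming a ring topology, a doubly stochastic weight matrix `W`, and simple quadratic local functions (all illustrative choices, not the paper's exact method): each node mixes its neighbors' estimates and then steps along its coded local gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, alpha = 5, 3, 0.1
targets = rng.normal(size=(n, dim))        # g_i(x) = 0.5 * ||x - targets[i]||^2

# ring topology: each node averages itself and its two neighbors
W = np.zeros((n, n))
for i in range(n):
    W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3.0

x = np.zeros((n, dim))                     # local estimates, one row per node
for _ in range(200):
    grads = x - targets                    # gradient of each local function
    x = W @ x - alpha * grads              # consensus step + local gradient step

# With a constant step size, each node settles in a neighborhood of the global
# minimizer (the mean of the targets) whose size shrinks with the step size.
print(np.abs(x - targets.mean(axis=0)).max())
```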