ISBN: 9781450382175 (print)
As the scale and complexity of deep learning models continue to grow, model training is becoming an expensive job that only a small number of well-financed organizations can afford. Are the resources in commodity clusters well utilized for training, and how much room remains for further improving training efficiency in commodity clusters? These are urgent questions to answer. In this paper, we review the process of distributed deep learning (DDL) training in commodity GPU clusters and find that current resource utilization is not only low but also imbalanced. We observe two features that can be exploited to further improve training efficiency: partially predictable training and unified CPU-GPU training. Based on these observations, we present AITurbo, a novel resource scheduler that treats predictable and unpredictable jobs separately but allocates heterogeneous CPU-GPU resources in a unified way. For predictable jobs, AITurbo designs a prediction model to estimate their performance under various heterogeneous resource allocations. For unpredictable jobs, it schedules them in a least-attained-service-first manner. AITurbo further designs a Borda-count-based multi-level feedback queue method to combine the two. AITurbo demonstrates that there is still significant room for improving training efficiency in commodity clusters. We evaluate AITurbo using jobs from the TensorFlow benchmarks, submitted following real traces from three production systems. Experimental results show that, compared with the state of the art, AITurbo reduces the average job completion time of DDL jobs by 3x.
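The two scheduling ingredients named in this abstract, least-attained-service ordering and Borda-count merging, can be illustrated with a minimal Python sketch. This is not AITurbo's actual algorithm: the function names, the job identifiers, the attained-service figures, and the second ranking (assumed to come from some performance predictor) are all hypothetical, and the multi-level feedback queue itself is omitted.

```python
from collections import defaultdict


def least_attained_service_order(attained):
    """Order job ids by attained service (e.g. GPU-seconds used), lowest first."""
    return sorted(attained, key=lambda job: (attained[job], job))


def borda_merge(rankings):
    """Merge several rankings of the same jobs with a Borda count.

    Each ranking lists job ids from most to least preferred. A job at
    position p in a ranking of n jobs earns n - 1 - p points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, job in enumerate(ranking):
            scores[job] += n - 1 - position
    # Highest total first; ties are broken by job id for determinism.
    return sorted(scores, key=lambda job: (-scores[job], job))


# Hypothetical inputs, chosen only for illustration.
attained = {"job_a": 120.0, "job_b": 15.0, "job_c": 60.0}
las_ranking = least_attained_service_order(attained)
predicted_ranking = ["job_c", "job_b", "job_a"]  # assumed predictor output
merged = borda_merge([las_ranking, predicted_ranking])
```

Here `job_b` and `job_c` tie on Borda points and are ordered by id, which is one arbitrary tie-breaking choice among many.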
ISBN: 9781450375825 (print)
We study the effect of noise on the n-party beeping model. In this model, in every round, each party may decide to either 'beep' or not. All parties hear a beep if and only if at least one party beeps. The beeping model is becoming increasingly popular, as it offers a very simple abstraction of wireless networks and is well suited for studying biological phenomena. Still, the noise resilience of the beeping model is yet to be understood. Our main result is a lower bound showing that making protocols in the beeping model resilient to noise may incur a large performance overhead. Specifically, we give a protocol that works over the (noiseless) beeping model and prove that any scheme that simulates this protocol over the beeping model with correlated stochastic noise will blow up the number of rounds by an Ω(log n) multiplicative factor. We complement this result with a matching upper bound, constructing a noise-resilient simulation scheme with O(log n) overhead for any noiseless beeping protocol.
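A single round of the beeping model, with and without noise, might be sketched in Python as follows. The sketch assumes "correlated" noise means that all parties hear the same, possibly flipped, outcome. The function names and the `flip_prob` parameter are hypothetical, and the repetition-with-majority helper is a naive illustration of noise resilience, not the simulation scheme constructed in the paper.

```python
import random


def beep_round(beeps):
    """Noiseless round: everyone hears a beep iff at least one party beeps."""
    return any(beeps)


def noisy_beep_round(beeps, flip_prob, rng):
    """Noisy round: the shared outcome is flipped with probability flip_prob.

    The flip is drawn once per round, so every party hears the same
    (possibly wrong) value; this is the assumed correlated-noise model.
    """
    heard = any(beeps)
    if rng.random() < flip_prob:
        heard = not heard
    return heard


def repeated_round(beeps, flip_prob, repetitions, rng):
    """Naive resilience: repeat the round and take a strict majority vote."""
    votes = sum(noisy_beep_round(beeps, flip_prob, rng) for _ in range(repetitions))
    return 2 * votes > repetitions


rng = random.Random(0)
decisions = [False, True, False]  # only the second party beeps
heard = repeated_round(decisions, flip_prob=0.1, repetitions=9, rng=rng)
```

With an odd number of repetitions the majority vote is never tied.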
ISBN: 9781450369794 (print)
Local differential privacy (LDP) is a model where users send privatized data to an untrusted central server whose goal is to solve some data analysis task. In the non-interactive version of this model, the protocol consists of a single round in which a server sends requests to all users and then receives their responses. This version is deployed in industry due to its practical advantages and has attracted significant research interest. Our main result is an exponential lower bound on the number of samples necessary to solve the standard task of learning a large-margin linear separator in the non-interactive LDP model. Via a standard reduction, this lower bound implies an exponential lower bound for stochastic convex optimization and, specifically, for learning linear models with a convex, Lipschitz, and smooth loss. These results answer the questions posed by Smith, Thakurta, and Upadhyay (IEEE Symposium on Security and Privacy 2017) and Daniely and Feldman (NeurIPS 2019). Our lower bound relies on a new technique for constructing pairs of distributions with nearly matching moments but whose supports can be nearly separated by a large-margin hyperplane. These lower bounds also hold in the model where communication from each user is limited, and they follow from a lower bound on learning using non-adaptive statistical queries.
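To make the non-interactive LDP setting concrete, the Python sketch below uses binary randomized response, a standard textbook mechanism rather than anything taken from this paper. Each user privatizes one bit locally and sends a single report; the server then debiases the average. The function names are hypothetical, and the task shown (estimating a mean) is far simpler than the learning problems the abstract discusses.

```python
import math
import random


def truth_probability(epsilon):
    """Probability of reporting the true bit under epsilon-LDP randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def privatize(bit, epsilon, rng):
    """User side: report the true bit with probability p, else the flipped bit."""
    if rng.random() < truth_probability(epsilon):
        return bit
    return 1 - bit


def estimate_mean(reports, epsilon):
    """Server side: debias the average report to estimate the true mean.

    E[report] = (1 - p) + (2p - 1) * mean, so invert that affine map.
    Requires epsilon > 0, otherwise 2p - 1 is zero.
    """
    p = truth_probability(epsilon)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
bits = [1] * 6000 + [0] * 14000          # true mean is 0.3
reports = [privatize(b, 1.0, rng) for b in bits]  # one round, no interaction
estimate = estimate_mean(reports, 1.0)
```

The single pass over `bits` mirrors the one-round protocol: no user's report depends on any other user's response.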
ISBN: 9781450368667 (print)
Cloud resources have become a preferred operational model for distributed Database Management Systems (DBMS) by offering elasticity and virtually unlimited scalability, but they increase the risk of failures as cluster sizes grow. While distributed DBMS provide high-availability mechanisms, it is currently an open research question to what extent they can provide availability and performance guarantees in case of cloud resource failures, especially as existing DBMS benchmarks do not consider availability. We present a comprehensive methodology for evaluating the availability of distributed DBMS in case of cloud resource failures. Based on this methodology, we introduce a novel framework that automates the full evaluation process, including failure injection, and emphasizes reproducibility. The framework is validated by 16 diverse availability evaluations. The results show that distributed DBMS are not necessarily available even if sufficient replicas are available, and clients can experience significant downtimes.
ISBN: 9781450368667 (print)
In this paper, we consider the load balancing problem as a bi-objective problem that includes minimizing jobs' response time and minimizing utilization imbalance among servers. We consider both objectives in an integrated manner and formulate them as an optimization model, which we cast into a game-theoretic setting. To derive the solution of the game, we propose a co-evolutionary computing framework called CO-evolutionary framework based on Differential Evolution (CODE). Simulation results show that CODE not only minimizes the jobs' response time but also significantly reduces the utilization imbalance among servers.
ISBN: 9798400700156 (print)
As deep learning models grow larger, training takes more time and more resources, making fault tolerance critical. Existing state-of-the-art methods like Check-Freq and Elastic Horovod need to back up a copy of the model state in memory, which is costly for large models and leads to non-trivial overhead. This paper presents Swift, a novel failure recovery design for distributed deep neural network training that significantly reduces the failure recovery overhead without affecting training throughput or model accuracy. Instead of making an additional copy of the model state, Swift resolves the inconsistencies of the model state caused by the failure and exploits replicas of the model state in data parallelism for failure recovery. When replicas are unavailable, we propose a logging-based approach that records intermediate data and replays the computation to recover the lost state after a failure. Evaluations show that Swift significantly reduces the failure recovery time and achieves similar or better training throughput during failure-free execution compared to state-of-the-art methods, without degrading final model accuracy.
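The general idea of logging-based recovery, recording inputs and replaying a deterministic computation, can be shown with a toy Python sketch. It is not Swift's design: the `LoggedWorker` class, the scalar "model", and the plain gradient-descent update are hypothetical simplifications, and the sketch assumes the snapshot and log survive the failure (for example because they are stored off the failed worker).

```python
class LoggedWorker:
    """Toy worker whose state can be rebuilt from a snapshot plus a log."""

    def __init__(self, weight, learning_rate=0.1):
        self.weight = weight
        self.learning_rate = learning_rate
        self.snapshot = weight   # last saved state
        self.log = []            # gradients applied since the snapshot

    def apply(self, gradient):
        """Record the incoming gradient, then apply the update."""
        self.log.append(gradient)
        self.weight -= self.learning_rate * gradient

    def checkpoint(self):
        """Save the current state and discard the now-redundant log."""
        self.snapshot = self.weight
        self.log.clear()

    def recover(self):
        """Replay logged gradients on top of the snapshot, in order."""
        weight = self.snapshot
        for gradient in self.log:
            weight -= self.learning_rate * gradient
        return weight


worker = LoggedWorker(weight=1.0)
worker.apply(0.5)
worker.checkpoint()
worker.apply(0.2)
worker.apply(-0.4)
expected = worker.weight
worker.weight = None            # simulate losing the in-memory state
recovered = worker.recover()
```

Because the replay applies the same operations in the same order, the recovered value matches the lost one exactly; real training would additionally need deterministic kernels and data ordering for that to hold.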
ISBN: 9781450368667 (print)
Coordination is required to solve a multi-robot navigation problem and to allow an efficient and fast search for a solution while avoiding any possible collisions. Planning with a fleet of robots can rely on Multi-agent Markovian Decision Process (MMDP) models. This assumes that it is possible to share the local perceptions of the robots at all times. However, the computation of a distributed policy is not necessarily distributable among robots, as with multiple path planning, where the movement of one robot depends on the paths of all the others. The global search space would be of exponential size (in the number of robots) in most multi-robot scenarios. Distributing the planning over the robots would allow each robot to plan its own policy while taking advantage of parallel computing. In this paper, this problem is addressed by presenting an approach that starts from a simplified model that can be distributed and then adds robot interaction constraints while keeping the model distributable. The results of experiments with different configurations highlight some of the strengths and limitations of the current approach.
ISBN: 9781450368667 (print)
This paper describes a distributed approach for autonomous cooperative transportation in a dynamic multi-robot environment. The proposed approach forms an optimal coalition at runtime for cooperative transportation and assigns a group of robots for the task. Explicit communication is used to acquire information as the robots do not have a global knowledge of the environment, i.e., no robot knows the location and state of another robot. For cooperative transportation, such information is essential as objects may arrive at any time and at any location. The proposed approach deals with on-demand missions, where the number of robots required to solve the problem is not known a priori. The applicability of the approach is demonstrated on a road clearance scenario in a realistic urban search and rescue simulation environment. The experimental results validate the correctness of the approach.
ISBN: 9781450391658 (print)
Edge computing (EC) has shown promise in supporting Deep Neural Network (DNN) applications in IoT environments. However, the resources at each edge node are limited, which increases response time. To reduce latency, we propose to distribute the computation of multiple DNN models to nearby IoT devices. In particular, we propose a piece-wise multilevel partitioning and scheduling algorithm to improve the completion time of DNN inference.
ISBN: 9781450368667 (print)
The combination of edge and cloud in the fog computing paradigm enables a new breed of data-intensive applications. These applications, however, face a number of fog-specific challenges, caused by the highly distributed and heterogeneous operating environment, which developers have to address repeatedly for every single application. In this paper, we derive a set of requirements for a replication service that aims to simplify the development of data-intensive fog applications. Furthermore, we propose a design for such a service that addresses our requirements.