ISBN: (Print) 9781450382946
In recent years, distributed deep learning has become popular in both industry and academia. Although researchers want to use distributed systems for training, the communication cost of synchronizing gradients can be a bottleneck. Using low-precision gradients is a promising technique for reducing the bandwidth requirement. In this work, we propose Auto Precision Scaling (APS), an algorithm that improves accuracy when gradients are communicated as low-precision floating-point values. APS improves accuracy at all precisions with a trivial communication cost. Our experimental results show that, for both image classification and segmentation, applying APS can train state-of-the-art models with 8-bit floating-point gradients at no or only a tiny accuracy loss (<0.05%). Furthermore, we can avoid any accuracy loss by designing a hybrid-precision technique. Finally, we propose a performance model to evaluate the proposed method. Our experimental results show that APS achieves a significant speedup over the state-of-the-art method. To make it available to researchers and developers, we design and implement a high-performance system for customized-precision deep learning (CPD), which can simulate the training process using an arbitrary low-precision customized floating-point format. We integrate CPD into PyTorch and make it open source to the public.
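The core idea behind precision scaling can be sketched as follows: before casting a gradient tensor to a low-precision float format, rescale it so its values occupy the format's representable range, then undo the scale after communication. This is a minimal illustration of the principle, not the paper's CPD implementation; the fp8-style quantizer (4 exponent / 3 mantissa bits, max value 240, subnormals flushed) is a simplified stand-in for whatever format APS would actually target.

```python
# Sketch: rescale gradients into the representable range of a low-precision
# float before quantizing, then unscale after communication (the APS idea).
import numpy as np

FP8_MAX = 240.0  # largest normal value of a hypothetical e4m3-style format

def quantize_fp8(x):
    """Crude fp8 simulation: round the mantissa to ~3 bits, clamp the range,
    and flush values below the smallest normal (2**-6) to zero."""
    x = np.clip(x, -FP8_MAX, FP8_MAX)
    m, e = np.frexp(x)                    # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16             # keep a few mantissa bits
    y = np.ldexp(m, e)
    return np.where(np.abs(y) < 2.0 ** -6, 0.0, y)

def communicate_with_aps(grad):
    scale = FP8_MAX / (np.abs(grad).max() + 1e-30)  # per-tensor scaling factor
    low = quantize_fp8(grad * scale)      # what would be sent over the wire
    return low / scale                    # receiver unscales

rng = np.random.default_rng(0)
g = rng.normal(scale=1e-4, size=1000)     # small gradients underflow without APS
naive = quantize_fp8(g)                   # every value flushes to zero here
aps = communicate_with_aps(g)
print(np.abs(naive - g).mean(), np.abs(aps - g).mean())  # APS error is far smaller
```

Without the scaling step, gradients of magnitude ~1e-4 fall below the format's smallest normal value and are lost entirely; with it, only mantissa rounding error remains.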
ISBN: (Print) 9781450391498
FPGAs and other 2D and 3D spatial computing fabrics share several common characteristics, e.g. a deterministic computing model with distributed memories, but also differ along important dimensions, e.g. granularity and communication infrastructure. This talk will position Groq's Tensor Streaming Processing (TSP) chip relative to other spatial computing fabrics and highlight common aspects of programming models for such spatial computing engines, as well as some of the unique characteristics of the TSP architecture and its programming model. One common characteristic this talk will focus on is determinism, and specifically the use of static scheduling to know at compile time exactly how many cycles a program will take to execute, regardless of the specific values of the input data to the TSP chip. Groq's TSP architecture is based on a parallel collection of coarse-grain processing units, including matrix multiplication or convolution (MXM), vector operations (VXM), arbitrary vector transformations (SXM), and various numerical operations. Data originates from a memory read (MEM) and is then chained through a tensor pipeline of one or more computing blocks, terminated by a memory write (MEM). The chip can operate on in-flight tensors as they are produced and consumed, aggressively exploiting data-flow locality. A variety of deterministic computing models will be surveyed as we seek abstractions that support rapid multiplexing between multiple workloads co-resident on TSP chips, with scheduling and coordination facilitated by determinism.
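The determinism the talk emphasizes can be illustrated with a toy static scheduler: if every functional unit has a fixed, data-independent latency, the compiler can assign each operation an exact start cycle and know the total runtime before any data arrives. The unit names below follow the abstract (MEM, MXM, VXM, SXM); the latencies are invented for the example and bear no relation to real TSP timing.

```python
# Toy static scheduler: with fixed per-unit latencies, every op's start
# cycle and the total cycle count are known at "compile" time.
LATENCY = {"MEM.read": 2, "MXM": 20, "VXM": 4, "SXM": 6, "MEM.write": 2}

def schedule(pipeline):
    """Assign a start cycle to each op in a chained tensor pipeline."""
    cycle, plan = 0, []
    for op in pipeline:
        plan.append((op, cycle))
        cycle += LATENCY[op]          # next op starts when this one finishes
    return plan, cycle                # total cycles, known before execution

plan, total = schedule(["MEM.read", "MXM", "VXM", "MEM.write"])
print(plan)    # [('MEM.read', 0), ('MXM', 2), ('VXM', 22), ('MEM.write', 26)]
print(total)   # 28 cycles, regardless of the input data
```

Because the schedule is a pure function of the program, two workloads can be interleaved on the same fabric by offsetting their schedules, which is the kind of multiplexing the talk surveys.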
ISBN: (Print) 9781450362177
We develop randomized distributed algorithms for many of the most fundamental communication problems in the wireless SINR model, including (multi-message) broadcast, local broadcast, coloring, MIS, and aggregation. The complexity of the algorithms is optimal up to polylogarithmic preprocessing time. This shows, contrary to expectation, that the plain vanilla SINR model is just as powerful and fast (modulo the preprocessing) as the various extensions studied, including power control, carrier sense, collision detection, free acknowledgements, and GPS location. A key component of the algorithms is an efficient simulation of Congest algorithms on a constant-density SINR backbone.
ISBN: (Print) 9781450399135
We prove a few new lower bounds on the randomized competitive ratio for the k-server problem and other related problems, resolving some long-standing conjectures. In particular, for metrical task systems (MTS) we asymptotically settle the competitive ratio and obtain the first improvement to an existential lower bound since the introduction of the model 35 years ago (in 1987). More concretely, we show: (1) There exist (k+1)-point metric spaces in which the randomized competitive ratio for the k-server problem is Ω(log² k). This refutes the folklore conjecture (which is known to hold in some families of metrics) that in all metric spaces with at least k+1 points, the competitive ratio is Θ(log k). (2) Consequently, there exist n-point metric spaces in which the randomized competitive ratio for MTS is Ω(log² n). This matches the upper bound that holds for all metrics. The previously best existential lower bound was Ω(log n) (which was known to be tight for some families of metrics). (3) For all k [...] distributed paging, and metric allocation. These lower bounds share a common thread, and other than the third bound, also a common construction.
ISBN: (Print) 9781450370523
Transient computing has become popular in public cloud environments for running delay-insensitive batch and data processing applications at low cost. Since transient cloud servers can be revoked at any time by the cloud provider, they are considered unsuitable for running interactive applications such as web services. In this paper, we present VM deflation as an alternative mechanism to server preemption for reclaiming resources from transient cloud servers under resource pressure. Using real traces from top-tier cloud providers, we show the feasibility of using VM deflation as a resource reclamation mechanism for interactive applications in public clouds. We show how current hypervisor mechanisms can be used to implement VM deflation and present cluster deflation policies for resource management of transient and on-demand cloud VMs. Experimental evaluation of our deflation system on a Linux cluster shows that microservice-based applications can be deflated by up to 50% with negligible performance overhead. Our cluster-level deflation policies allow overcommitment levels as high as 50%, with less than a 1% decrease in application throughput, and can enable cloud platforms to increase revenue by 30%.
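A cluster-level deflation policy of the kind the abstract describes can be sketched as follows: under resource pressure, shrink transient VMs proportionally up to a deflation cap instead of preempting them, while leaving on-demand VMs untouched. The 50% cap mirrors the figure quoted in the abstract, but the VM data and policy details here are illustrative, not the authors' implementation.

```python
# Sketch of a proportional-deflation policy: reclaim `needed` cores from
# transient VMs only, capped so no VM loses more than half its allocation.
MAX_DEFLATION = 0.5  # never shrink a VM below half its allocation

def deflate_cluster(vms, needed):
    """vms: dicts with 'name', 'alloc' (cores), and a 'transient' flag.
    Returns a {name: new_alloc} plan freeing `needed` cores, or None if
    the transient pool cannot cover the shortfall within the cap."""
    pool = [vm for vm in vms if vm["transient"]]
    reclaimable = sum(vm["alloc"] * MAX_DEFLATION for vm in pool)
    if needed > reclaimable:
        return None                       # would require preemption instead
    frac = needed / sum(vm["alloc"] for vm in pool)
    return {vm["name"]: vm["alloc"] * (1 - frac) for vm in pool}

vms = [
    {"name": "web-1", "alloc": 8, "transient": False},   # on-demand: untouched
    {"name": "batch-1", "alloc": 16, "transient": True},
    {"name": "batch-2", "alloc": 8, "transient": True},
]
plan = deflate_cluster(vms, needed=6)
print(plan)   # batch-1 and batch-2 each give up 25% of their cores
```

Proportional sharing is only one possible policy; the paper's evaluation compares several, but the key property shown here is that reclamation degrades transient VMs gracefully rather than killing them.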
ISBN: (Print) 9781450369350
We present a new deterministic algorithm for distributed weighted all pairs shortest paths (APSP) in both undirected and directed graphs. Our algorithm runs in Õ(n^{4/3}) rounds in the Congest model on graphs with arbitrary edge weights, and it improves on the previous Õ(n^{3/2}) bound of Agarwal et al. [3]. The main components of our new algorithm are a new, faster technique for constructing a blocker set deterministically, and a new pipelined method for deterministically propagating distance values from source nodes to the blocker set nodes in the network. Both of these techniques have potential applications to other distributed algorithms. Our new deterministic algorithm for computing a blocker set adapts the NC approximate hypergraph set cover algorithm in [5] to the distributed construction of a blocker set. It follows the two-step process of first designing a randomized algorithm that uses only pairwise independence, and then derandomizing this algorithm using a sample space of linear size. This algorithm runs in almost the same number of rounds as the initial step in our APSP algorithm that computes h-hop shortest paths. This result significantly improves on the deterministic blocker set algorithms in [1, 3] by removing an additional n·|Q| term in the round bound, where Q is the blocker set. The other new component in our APSP algorithm is a deterministic pipelined approach to propagating distance values from source nodes to blocker nodes. We use a simple, natural round-robin method for this step, and we show, using a suitable progress measure, that it achieves the Õ(n^{4/3}) bound on the number of rounds. It appears that the standard deterministic methods for efficiently broadcasting multiple values, and for sending or receiving messages using the routing schedule of an undirected APSP algorithm [13, 16], do not apply to this setting.
ISBN: (Print) 9781450368186
Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck. Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, while state-of-the-art algorithms fail to process long queries in real time. We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state-of-the-art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to an order of magnitude above Sparta's.
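The top-k building block the abstract describes looks like this in its simplest serial form: a document's score is the sum of its per-term scores from an inverted index, and a bounded heap keeps the k best candidates. Sparta's contribution is making this scale across cores and long queries; the sketch below shows only the baseline primitive, with made-up posting data.

```python
# Baseline top-k retrieval: aggregate per-term scores, keep the k best.
import heapq
from collections import defaultdict

# Toy inverted index: term -> {doc_id: score}
index = {
    "big":  {1: 0.4, 2: 0.1, 3: 0.6},
    "data": {1: 0.5, 3: 0.2, 4: 1.0},
}

def top_k(query_terms, k):
    scores = defaultdict(float)
    for term in query_terms:
        for doc, s in index.get(term, {}).items():
            scores[doc] += s                       # aggregate feature scores
    # nlargest maintains only k candidates internally
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

result = top_k(["big", "data"], k=2)
print(result)    # documents 4 and 1 rank highest
```

The per-term accumulation loop is what grows with query verbosity, which is why long queries stress this primitive and why parallelizing it (as Sparta does) pays off.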
ISBN: (Print) 9781665433266
Simulations of social dynamics play an important role in public policy analysis and design, as they help to estimate the effects that alternative policies can have on society. We argue for a tight integration of a novel generation of social dynamics meta-simulation tools in a policy development framework. Such tools provide a common, agent-based modeling and concurrent execution environment for creating and running such simulations. Furthermore, they enable reasoning about simulation outcomes by examining and ranking alternative policy objectives based on a set of criteria supplied by the designer. We describe our implementation of such an idea in Politika, our Elixir-based web framework for interactive, meta-simulation-based policy design, and provide appropriate examples as part of the PolicyCLOUD EU research project.
ISBN: (Print) 9781450362177
Consider a computer network that consists of a path with n nodes. The nodes are labeled with inputs from a constant-sized set, and the task is to find output labels from a constant-sized set subject to some local constraints; more formally, we have an LCL (locally checkable labeling) problem. How many communication rounds are needed (in the standard LOCAL model of computing) to solve this problem? It is well known that the answer is always either O(1) rounds, Θ(log* n) rounds, or Θ(n) rounds. In this work we show that this question is decidable (albeit PSPACE-hard): we present an algorithm that, given any LCL problem defined on a path, outputs the distributed computational complexity of this problem and the corresponding asymptotically optimal algorithm.
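What makes a problem an LCL is that solutions can be verified by a constant-radius check around every node. The sketch below encodes one classic example on a path, proper 3-coloring (whose distributed complexity is Θ(log* n)), together with its radius-1 local verifier. This is only an illustration of the problem class; the paper's decision algorithm, which classifies any such problem as O(1), Θ(log* n), or Θ(n), is far more involved.

```python
# A radius-1 local verifier for one LCL on a path: proper 3-coloring.
LABELS = {0, 1, 2}

def locally_valid(labels, i):
    """Check node i: its label is legal and differs from both neighbours."""
    if labels[i] not in LABELS:
        return False
    return all(labels[j] != labels[i]
               for j in (i - 1, i + 1) if 0 <= j < len(labels))

def is_valid_labeling(labels):
    # A labeling solves the LCL iff every node's local check passes.
    return all(locally_valid(labels, i) for i in range(len(labels)))

print(is_valid_labeling([0, 1, 2, 0, 1]))   # True: proper 3-coloring
print(is_valid_labeling([0, 1, 1, 2, 0]))   # False: nodes 1 and 2 share a label
```

Because the verifier only ever looks a constant distance away, the problem's description is finite, which is what makes the complexity classification question well-posed and, as the paper shows, decidable.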