SoC-Cluster, a novel server architecture composed of massive mobile system-on-chips (SoCs), is gaining popularity in industrial edge computing due to its energy efficiency and compatibility with existing mobile applic...
详细信息
ISBN:
(纸本)9798400703720
SoC-Cluster, a novel server architecture composed of massive mobile system-on-chips (SoCs), is gaining popularity in industrial edge computing due to its energy efficiency and compatibility with existing mobile applications. However, we observe that the deployed SoC-Cluster servers are not fully utilized, because the hosted workloads are mostly usertriggered and have significant tidal phenomena. To harvest the free cycles, we propose to co-locate deep learning tasks on them. We present SoCFlow, the first framework that can efficiently train deep learning models on SoC-Cluster. To deal with the intrinsic inadequacy of commercial SoC-Cluster servers, SoCFlow incorporates two novel techniques: (1) the group-wise parallelism with delayed aggregation that can train deep learning models fast and scalably without being influenced by the network bottleneck;(2) the data-parallel mixed-precision training algorithm that can fully unleash the heterogeneous processors' capability of mobile SoCs. We have fully implemented SoCFlow and demonstrated its effectiveness through extensive experiments. The experiments show that SoCFlow significantly and consistently outperforms all baselines regarding the training speed while preserving the convergence accuracy, e.g., 1.6x-740x convergence speedup with 32 SoCs. Compared to commodity GPU (NVIDIA V100) under the same power budget, SoCFlow achieves comparable training speed but reduces energy consumption by 2.31x-10.23x with the same convergence accuracy.
computing the posterior distribution of a probabilistic program is a hard task for which no one-fit-for-all solution exists. We propose Gaussian Semantics, which approximates the exact probabilistic semantics of a bou...
详细信息
computing the posterior distribution of a probabilistic program is a hard task for which no one-fit-for-all solution exists. We propose Gaussian Semantics, which approximates the exact probabilistic semantics of a bounded program by means of Gaussian mixtures. It is parametrized by a map that associates each program location with the moment order to be matched in the approximation. We provide two main contributions. The first is a universal approximation theorem stating that, under mild conditions, Gaussian Semantics can approximate the exact semantics arbitrarily closely. The second is an approximation that matches up to second-order moments analytically in face of the generally difficult problem of matching moments of Gaussian mixtures with arbitrary moment order. We test our second-order Gaussian approximation (SOGA) on a number of case studies from the literature. We show that it can provide accurate estimates in models not supported by other approximation methods or when exact symbolic techniques fail because of complex expressions or non-simplified integrals. On two notable classes of problems, namely collaborative filtering and programs involving mixtures of continuous and discrete distributions, we show that SOGA significantly outperforms alternative techniques in terms of accuracy and computational time.
The growing demands across various scientific fields have led to a significant shift in applications that consume data at the edge of the computing continuum. These applications require unified programmingmodels for ...
详细信息
Clients rely on database systems to be correct, which requires the system not only to implement transactions’ semantics correctly but also to provide isolation guarantees for the transactions. This paper presents a c...
详细信息
Current programmingmodels face challenges in dealing with modern supercomputers' growing parallelism and heterogeneity. Emerging programmingmodels, like the task-based programming model found in the asynchronous...
详细信息
The actor programming model supports the development of concurrent applications by encapsulating state and behavior into independent actors. Each actor is a computational entity with strictly private state and behavio...
详细信息
ISBN:
(纸本)9798400704000
The actor programming model supports the development of concurrent applications by encapsulating state and behavior into independent actors. Each actor is a computational entity with strictly private state and behavior. Actors communicate via asynchronous messaging and, in this way, require neither shared memory nor locking. This makes the actor model suitable not only for parallel programming but also for distributed applications engineering. The Rust programming language is a statically-typed language that gained a lot of attention in the past years due to its efficient, economical and safe memory management. To ease the development of parallel applications, several actor model frameworks have been built for Rust. However, no actively maintained Rust actor framework provides the necessary features to write distributed applications. For this reason, we propose an extension for Rust's Actix library, called ActixTelepathy, that enables remote messaging and offers clustering support. It allows developers to setup remote actors that can communicate across a computer network with the help of a straight forward and easy to understand interface. Our evaluation demonstrates that Actix-Telepathy competes well in remote messaging performance and memory consumption with other actor libraries, such as Scala's popular Akka library.
As inference on Large Language models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not ...
详细信息
ISBN:
(纸本)9798400714436
As inference on Large Language models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are achievable also in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batchsizes up to 16-32 can be practically supported with close to maximum (4x) quantization speedup, and larger batchsizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8x) when integrated with the popular vLLM opensource serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.
Swarm behaviour engineering is an area of research that seeks to investigate methods for coordinating computation and action within groups of simple agents to achieve complex global goals like collective movement, clu...
详细信息
ISBN:
(纸本)9783031353604;9783031353611
Swarm behaviour engineering is an area of research that seeks to investigate methods for coordinating computation and action within groups of simple agents to achieve complex global goals like collective movement, clustering, and distributed sensing. Despite recent progress in the study and engineering of swarms (of drones, robots, vehicles), there is still need for general design and implementation methods that can be used to define complex swarm coordination in a principled way. To face this need, this paper proposes a new field-based coordination approach, called MacroSwarm, to design fully composable and reusable blocks of swarm behaviour. Based on the macroprogramming approach of aggregate computing, it roots on the idea of modelling each block of swarm behaviour by a purely functional transformation of sensing fields into actuation description fields, typically including movement vectors. We showcase the potential of MacroSwarm as a framework for collective intelligence by simulation, in a variety of scenarios including flocking, morphogenesis, and collective decision-making.
Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On...
详细信息
ISBN:
(纸本)9798400703850
Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at a price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving the training efficiency by sacrificing some of the flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language, Slapo, to decouple model execution from definition. Specifically, Slapo works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Com- pared to existing optimization solutions, Slapo progressively optimizes the model "as-needed" through high-level primi- tives, and thus preserving programmability and debuggabil- ity for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way using Slapo, we are able to improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.41x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.
Training modern large neural networks (NNs) requires a combination of parallelization strategies, including data, model, or optimizer sharding. To address the growing complexity of these strategies, we introduce PartI...
详细信息
ISBN:
(纸本)9798400706981
Training modern large neural networks (NNs) requires a combination of parallelization strategies, including data, model, or optimizer sharding. To address the growing complexity of these strategies, we introduce PartIR, a hardware-and-untime agnostic NN partitioning system. PartIR is: 1) Expressive: It allows for the composition of multiple sharding strategies, whether user-defined or automatically derived;2) Decoupled: the strategies are separate from the ML implementation;and 3) Predictable: It follows a set of welldefined general rules to partition the NN. PartIR utilizes a schedule-like API that incrementally rewrites the ML program intermediate representation (IR) after each strategy, allowing simulators and users to verify the strategy's performance. PartIR has been successfully used both for training large models and across diverse model architectures, demonstrating its predictability, expressiveness, and performance.
暂无评论