Search Results - Inner Mongolia University Library

29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Authors: Xu, Daliang; Xu, Mengwei; Lou, Chiheng; Zhang, Li; Huang, Gang; Jin, Xin; Liu, Xuanzhe (Peking Univ, Minist Educ, Key Lab High Confidence Software Technol, Beijing, Peoples R China; Peking Univ, Sch Comp Sci, Beijing, Peoples R China; State Key Lab Networking & Switching Technol, Beijing, Peoples R China; Natl Key Lab Data Space Technol & Syst, Beijing, Peoples R China)

ISBN: (print) 9798400703720

SoC-Cluster, a novel server architecture composed of massive mobile system-on-chips (SoCs), is gaining popularity in industrial edge computing due to its energy efficiency and compatibility with existing mobile applications. However, we observe that the deployed SoC-Cluster servers are not fully utilized, because the hosted workloads are mostly user-triggered and have significant tidal phenomena. To harvest the free cycles, we propose to co-locate deep learning tasks on them. We present SoCFlow, the first framework that can efficiently train deep learning models on SoC-Cluster. To deal with the intrinsic inadequacy of commercial SoC-Cluster servers, SoCFlow incorporates two novel techniques: (1) the group-wise parallelism with delayed aggregation that can train deep learning models fast and scalably without being influenced by the network bottleneck; (2) the data-parallel mixed-precision training algorithm that can fully unleash the heterogeneous processors' capability of mobile SoCs. We have fully implemented SoCFlow and demonstrated its effectiveness through extensive experiments. The experiments show that SoCFlow significantly and consistently outperforms all baselines regarding the training speed while preserving the convergence accuracy, e.g., 1.6x-740x convergence speedup with 32 SoCs. Compared to a commodity GPU (NVIDIA V100) under the same power budget, SoCFlow achieves comparable training speed but reduces energy consumption by 2.31x-10.23x with the same convergence accuracy.

Keywords: SoC-Cluster; distributed machine learning; mixed precision training

Inference of Probabilistic Programs with Moment-Matching Gaussian Mixtures

Proceedings of the ACM on Programming Languages (PACMPL), 2024, Vol. 8, Issue POPL, pp. 1882-1912

Authors: Randone, Francesca; Bortolussi, Luca; Incerto, Emilio; Tribastone, Mirco (IMT Sch Adv Studies Lucca, Lucca, LU, Italy; Univ Trieste, Trieste, Italy)

Computing the posterior distribution of a probabilistic program is a hard task for which no one-size-fits-all solution exists. We propose Gaussian Semantics, which approximates the exact probabilistic semantics of a bounded program by means of Gaussian mixtures. It is parametrized by a map that associates each program location with the moment order to be matched in the approximation. We provide two main contributions. The first is a universal approximation theorem stating that, under mild conditions, Gaussian Semantics can approximate the exact semantics arbitrarily closely. The second is an approximation that matches up to second-order moments analytically in the face of the generally difficult problem of matching moments of Gaussian mixtures with arbitrary moment order. We test our second-order Gaussian approximation (SOGA) on a number of case studies from the literature. We show that it can provide accurate estimates in models not supported by other approximation methods or when exact symbolic techniques fail because of complex expressions or non-simplified integrals. On two notable classes of problems, namely collaborative filtering and programs involving mixtures of continuous and discrete distributions, we show that SOGA significantly outperforms alternative techniques in terms of accuracy and computational time.

Keywords: probabilistic programming; inference; Gaussian mixtures
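
The abstract above mentions matching moments up to second order. As a rough illustration of that general idea only (not the SOGA algorithm itself, whose details are not given in this record), the sketch below collapses a one-dimensional Gaussian mixture into a single Gaussian with the same mean and variance. The function name `collapse_mixture` and the example numbers are assumptions made for illustration.

```python
def collapse_mixture(weights, means, variances):
    """Return (mean, variance) of the single Gaussian that matches the
    first two moments of a 1-D Gaussian mixture.

    Hypothetical helper for illustration; not taken from the paper.
    """
    total = sum(weights)
    w = [x / total for x in weights]  # normalize weights so they sum to 1

    # First moment: weighted average of the component means.
    mean = sum(wi * mi for wi, mi in zip(w, means))

    # Second raw moment of each component is variance + mean^2.
    second = sum(wi * (vi + mi * mi) for wi, mi, vi in zip(w, means, variances))

    # Variance is the second raw moment minus the squared mean.
    return mean, second - mean * mean


# Two equally weighted unit-variance components centered at -1 and +1.
mean, var = collapse_mixture([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Here the matched variance is larger than either component variance, because the spread between the component means contributes to the second moment.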

Applying a Task-Based Approach to Distributed Machine Learning Workflows

2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024

Authors: Vazquez-Novoa, Fernando; Lezzi, Daniele; Lordan, Francesc; Baghdadi, Fatemeh; Cirillo, Davide (Barcelona Supercomputing Center, Department of Computer Sciences, Barcelona, Spain; Barcelona Supercomputing Center, Department of Life Sciences, Barcelona, Spain)

ISBN: (print) 9798350355543

The growing demands across various scientific fields have led to a significant shift toward applications that consume data at the edge of the computing continuum. These applications require unified programming models for composing components and coordinating the execution of computational workloads, including training machine learning (ML) models on distributed resources. Personalized healthcare, which often leverages data generated by wearable devices to train ML models, can benefit from distributed computing approaches. Specifically, stroke care can benefit greatly from distributed ML, because modifiable risk factors can be monitored using wearable devices. In this work, we present an implementation that leverages distributed techniques for large-scale ML workflows using electrocardiogram (ECG) recordings for atrial fibrillation (AF) classification. The application was evaluated using the PhysioNet database, showcasing the potential of distributed ML in stroke care and opening the way for the future creation of more advanced models embedded in edge devices. © 2024 IEEE.

Keywords: ECG; machine learning; PyCOMPSs

Checking Observational Correctness of Database Systems

Proceedings of the ACM on Programming Languages, 2025, Vol. 9, Issue 1, pp. 1661-1688

Authors: Pick, Lauren; Xu, Amanda; Desai, Ankush; Seshia, Sanjit A.; Albarghouthi, Aws (The Chinese University of Hong Kong, Hong Kong; University of Wisconsin-Madison, United States; Amazon Web Services, United States; University of California, Berkeley, United States)

Clients rely on database systems to be correct, which requires the system not only to implement transactions’ semantics correctly but also to provide isolation guarantees for the transactions. This paper presents a client-centric technique for checking both semantic correctness and isolation-level guarantees for black-box database systems based on observations collected from running transactions on these systems. Our technique verifies observational correctness with respect to a given set of transactions and observations for them, which holds iff there exists a possible correct execution of the transactions under a given isolation level that could result in these observations. Our technique relies on novel symbolic encodings of (1) the semantic correctness of database transactions in the presence of weak isolation and (2) isolation-level guarantees. These are used by the checker to query a Satisfiability Modulo Theories solver. We applied our tool Troubadour to verify observational correctness of several database systems, including PostgreSQL and an industrial system under development, in which the tool helped detect two new bugs. We also demonstrate that Troubadour is able to find known semantic correctness bugs and detect isolation-related anomalies. © 2025 Copyright held by the owner/author(s).

Keywords: distributed database systems

Speeding-Up LULESH on HPX: Useful Tricks and Lessons Learned using a Many-Task-Based Approach

2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024

Authors: Kalkhof, Torben; Koch, Andreas (Technical University of Darmstadt, Embedded Systems and Applications Group, Darmstadt, Germany)

ISBN: (print) 9798350355543

Current programming models face challenges in dealing with modern supercomputers' growing parallelism and heterogeneity. Emerging programming models, like the task-based programming model found in the asynchronous many-task HPX programming framework, offer new ways to express parallelism, enhance scalability, and mask synchronization and communication latency on multi-core and distributed systems. Regular high-performance computing benchmarks are often unsuitable for comparing different programming models due to their limited code complexity. However, real-world scientific applications are usually too complex. As a middle ground, proxy applications model the behavior of actual scientific problems, while reducing code complexity. In our research on using HPX to program machines with heterogeneous compute units (e.g., GPU and FPGA/AI Engines), we have also substantially optimized a pure HPX-based software baseline of the LULESH proxy application. This paper discusses the techniques we applied yielding single-node speed-ups of 1.33x to 2.25x for different problem sizes relative to the LULESH OpenMP reference implementation. © 2024 IEEE.

Keywords: HPC; HPX; LULESH; task-based programming

Actix-Telepathy

10th ACM SIGPLAN International Workshop on Reactive and Event-Based Languages and Systems (REBLS)

Authors: Wenig, Phillip; Papenbrock, Thorsten (Univ Potsdam, Hasso Plattner Inst, Potsdam, Germany; Philipps Univ Marburg, Marburg, Germany)

ISBN: (print) 9798400704000

The actor programming model supports the development of concurrent applications by encapsulating state and behavior into independent actors. Each actor is a computational entity with strictly private state and behavior. Actors communicate via asynchronous messaging and, in this way, require neither shared memory nor locking. This makes the actor model suitable not only for parallel programming but also for distributed applications engineering. The Rust programming language is a statically-typed language that gained a lot of attention in the past years due to its efficient, economical and safe memory management. To ease the development of parallel applications, several actor model frameworks have been built for Rust. However, no actively maintained Rust actor framework provides the necessary features to write distributed applications. For this reason, we propose an extension for Rust's Actix library, called Actix-Telepathy, that enables remote messaging and offers clustering support. It allows developers to set up remote actors that can communicate across a computer network with the help of a straightforward and easy-to-understand interface. Our evaluation demonstrates that Actix-Telepathy competes well in remote messaging performance and memory consumption with other actor libraries, such as Scala's popular Akka library.

Keywords: Actor Model; distributed computing; Rust
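
The abstract above describes actors as entities with private state that interact only through asynchronous messages. A minimal local sketch of that idea, using only the Python standard library, might look like the following. It is not the Actix or Actix-Telepathy API and does no remote messaging; the class `CounterActor` and the message names `"increment"`, `"get"`, and `"stop"` are hypothetical.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private counter, mutated only by its own thread."""

    def __init__(self):
        self._mailbox = queue.Queue()  # asynchronous message inbox
        self._count = 0                # private state, never shared directly
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message, reply_to=None):
        # Senders only enqueue a message; they never touch the state.
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, so no lock on _count is needed.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)

    def join(self):
        self._thread.join()


actor = CounterActor()
for _ in range(3):
    actor.send("increment")

reply = queue.Queue()          # reply channel for the "get" request
actor.send("get", reply)
result = reply.get(timeout=5)  # mailbox is FIFO, so all increments ran first

actor.send("stop")
actor.join()
```

Because the mailbox is processed in order by a single thread, the reply reflects all three earlier increments without any explicit locking in user code.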

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

30th Symposium on Principles and Practice of Parallel Programming

Authors: Frantar, Elias; Castro, Roberto L.; Chen, Jiale; Hoefler, Torsten; Alistarh, Dan (IST Austria, Klosterneuburg, Austria; Univ A Coruna, CITIC, La Coruna, Spain; Swiss Fed Inst Technol, Zurich, Switzerland; Neural Mag Inc, Somerville, NJ, USA)

ISBN: (print) 9798400714436

As inference on Large Language Models (LLMs) emerges as an important workload in machine learning applications, model weight quantization has become a standard technique for efficient GPU deployment. Quantization not only reduces model size, but has also been shown to yield substantial speedups for single-user inference, due to reduced memory movement, with low accuracy impact. Yet, it remains a key open question whether speedups are achievable also in batched settings with multiple parallel clients, which are highly relevant for practical serving. It is unclear whether GPU kernels can be designed to remain practically memory-bound, while supporting the substantially increased compute requirements of batched workloads. In this paper, we resolve this question positively by introducing a new design for Mixed-precision Auto-Regressive LINear kernels, called MARLIN. Concretely, given a model whose weights are compressed via quantization to, e.g., 4 bits per element, MARLIN shows that batch sizes up to 16-32 can be practically supported with close to maximum (4x) quantization speedup, and larger batch sizes up to 64-128 with gradually decreasing, but still significant, acceleration. MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining, and bespoke quantization support. Our experiments show that MARLIN's near-optimal performance on individual LLM layers across different scenarios can also lead to significant end-to-end LLM inference speedups (of up to 2.8x) when integrated with the popular vLLM open-source serving engine. Finally, we show that MARLIN is extensible to further compression techniques, like NVIDIA 2:4 sparsity, leading to additional speedups.

Keywords: Large language model (LLM) inference; GPU programming; Batch parallelism
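
The abstract above refers to weights compressed to 4 bits per element. The sketch below shows one simple way such compression can work in principle: symmetric rounding to integers in [-7, 7] with a single shared scale, then packing two codes per byte. This is an assumption-laden illustration in plain Python, not MARLIN's kernel or storage format; all function names are hypothetical.

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization with one shared scale (illustrative)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 7.0                          # map values to [-7, 7]
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def pack_nibbles(codes):
    """Pack signed 4-bit codes two per byte, stored with an offset of 8."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even number of codes
    pairs = zip(codes[0::2], codes[1::2])
    return bytes(((a + 8) << 4) | (b + 8) for a, b in pairs)


def unpack_nibbles(packed, n):
    """Recover the first n signed codes from the packed bytes."""
    out = []
    for byte in packed:
        out.append((byte >> 4) - 8)    # high nibble
        out.append((byte & 0x0F) - 8)  # low nibble
    return out[:n]


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [1.75, -0.75, 0.0, 0.25]
codes, scale = quantize_4bit(weights)
packed = pack_nibbles(codes)
restored = dequantize(unpack_nibbles(packed, len(weights)), scale)
```

Four weights occupy two bytes here, versus eight bytes if each were stored as a 16-bit float, which is the kind of 4x reduction in memory traffic that a quantized kernel can try to exploit.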

MACROSWARM: A Field-Based Compositional Framework for Swarm Programming

25th International Conference on Coordination Models and Languages (COORDINATION)

Authors: Aguzzi, Gianluca; Casadei, Roberto; Viroli, Mirko (Univ Bologna, Alma Mater Studiorum, Cesena, Italy)

ISBN: (print) 9783031353604; 9783031353611

Swarm behaviour engineering is an area of research that seeks to investigate methods for coordinating computation and action within groups of simple agents to achieve complex global goals like collective movement, clustering, and distributed sensing. Despite recent progress in the study and engineering of swarms (of drones, robots, vehicles), there is still a need for general design and implementation methods that can be used to define complex swarm coordination in a principled way. To address this need, this paper proposes a new field-based coordination approach, called MacroSwarm, to design fully composable and reusable blocks of swarm behaviour. Based on the macroprogramming approach of aggregate computing, it is rooted in the idea of modelling each block of swarm behaviour by a purely functional transformation of sensing fields into actuation description fields, typically including movement vectors. We showcase the potential of MacroSwarm as a framework for collective intelligence by simulation, in a variety of scenarios including flocking, morphogenesis, and collective decision-making.

Keywords: Swarm Behaviours; Field-based Coordination; Aggregate Computing; Collective Intelligence; Distributed Computing; DSLs

Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Authors: Chen, Hongzheng; Yu, Cody Hao; Zheng, Shuai; Zhang, Zhen; Zhang, Zhiru; Wang, Yida (Cornell Univ, Ithaca, NY 14850, USA; Boson AI Inc, Santa Clara, CA, USA; Amazon Web Serv, Santa Clara, CA, USA; Amazon, Seattle, WA, USA)

ISBN: (print) 9798400703850

Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice is struggling with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers at a price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving the training efficiency by sacrificing some of the flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization towards large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language, Slapo, to decouple model execution from definition. Specifically, Slapo works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, Slapo progressively optimizes the model "as-needed" through high-level primitives, thus preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way using Slapo, we are able to improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.41x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.

Keywords: Schedule Language; Distributed Training; Compiler Optimization; Deep Learning; Large Language Models

PartIR: Composing SPMD Partitioning Strategies for Machine Learning

30th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

Authors: Alabed, Sami; Belov, Daniel; Chrzaszcz, Bart; Franco, Juliana; Grewe, Dominik; Maclaurin, Dougal; Molloy, James; Natan, Tom; Norman, Tamara; Pan, Xiaoyue; Paszke, Adam; Rink, Norman A.; Schaarschmidt, Michael; Sitdikov, Timur; Swietlik, Agnieszka; Vytiniotis, Dimitrios; Wee, Joel (Google DeepMind, London, England; Google DeepMind, Warsaw, Poland; Isomorph Labs, London, England)

ISBN: (print) 9798400706981

Training modern large neural networks (NNs) requires a combination of parallelization strategies, including data, model, or optimizer sharding. To address the growing complexity of these strategies, we introduce PartIR, a hardware-and-runtime agnostic NN partitioning system. PartIR is: 1) Expressive: It allows for the composition of multiple sharding strategies, whether user-defined or automatically derived; 2) Decoupled: the strategies are separate from the ML implementation; and 3) Predictable: It follows a set of well-defined general rules to partition the NN. PartIR utilizes a schedule-like API that incrementally rewrites the ML program intermediate representation (IR) after each strategy, allowing simulators and users to verify the strategy's performance. PartIR has been successfully used both for training large models and across diverse model architectures, demonstrating its predictability, expressiveness, and performance.

Keywords: distributed systems
