ISBN (Print): 9798400704352
Distributed large model inference still faces a dilemma in balancing cost and effectiveness. Online scenarios demand intra-operator parallelism to achieve low latency, but its intensive communication makes it costly. Conversely, inter-operator parallelism can achieve high throughput with far less communication, but it fails to deliver low latency. In this paper, we present Liger, a distributed large model inference runtime system capable of achieving low latency at high throughput on multi-GPU architectures. The key idea lies in a novel interleaved parallelism, which interleaves computation and communication across requests. Liger enables this parallelism by carefully scheduling computation and communication kernels across requests onto multiple streams of multiple GPUs. It achieves precise and efficient control of kernel execution order by combining CPU-GPU synchronization with inter-stream synchronization. To prevent scheduling failures caused by resource contention, Liger introduces a contention-factor strategy that anticipates the penalty of contention. It enables a higher degree of overlap by decomposing lengthy kernels into smaller, more manageable units at runtime. Extensive evaluations show that Liger, in most cases, outperforms existing parallelism approaches across models and devices, delivering the best latency and throughput results. In a 4-device case, Liger reduces average latency by 36.0% at the same throughput as the inter-operator approach, and improves throughput by 1.34x with better average latency than the intra-operator approach.
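The core idea of interleaving computation and communication across requests can be illustrated with a toy cost model. This is a minimal sketch, not Liger's actual scheduler: it assumes each request consists of one compute phase and one communication phase, and that the two phases run on disjoint resources (separate GPU streams) so one request's communication can overlap the next request's computation.

```python
# Toy cost model of interleaved parallelism across requests.
# Sequential: each request's compute and comm phases run back to back.
# Interleaved: request i's communication overlaps request i+1's computation.

def sequential_makespan(requests):
    """Total time when no phases overlap (inter-operator-style baseline)."""
    return sum(comp + comm for comp, comm in requests)

def interleaved_makespan(requests):
    """Time when communication of one request overlaps computation of the next.
    Assumes compute and comm occupy two independent streams."""
    t_comp_done = 0.0  # when the compute stream frees up
    t_comm_done = 0.0  # when the comm stream frees up
    for comp, comm in requests:
        t_comp_done += comp  # computes run back to back on their stream
        # a request's comm starts only after both its own compute and the
        # previous comm have finished
        t_comm_done = max(t_comm_done, t_comp_done) + comm
    return t_comm_done

reqs = [(4.0, 2.0), (4.0, 2.0), (4.0, 2.0)]  # (compute, comm) per request
print(sequential_makespan(reqs))   # 18.0
print(interleaved_makespan(reqs))  # 14.0
```

Under these assumed costs, overlap hides most of the communication time behind computation of later requests, which is the effect the abstract describes.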
ISBN (Print): 9798400704130
The proceedings contain 47 papers. The topics discussed include: ElasticRoom: multi-tenant DNN inference engine via co-design with resource-constrained compilation and strong priority scheduling; efficient all-to-all collective communication schedules for direct-connect topologies; ESG: pipeline-conscious efficient scheduling of DNN workflows on serverless platforms with shareable GPUs; ETS: deep learning training iteration time prediction based on execution trace sliding window; IDT: intelligent data placement for multi-tiered main memory with reinforcement learning; FaaSKeeper: learning from building serverless services with ZooKeeper as an example; accelerating function-centric applications by discovering, distributing, and retaining reusable context in workflow systems; and Faast: an efficient serverless framework made snapshot-based function response fast.
The dual of a planar graph G is a planar graph G∗ that has a vertex for each face of G and an edge for each pair of adjacent faces of G. The profound relationship between a planar graph and its dual has been the algor...
ISBN (Print): 9798400706479
The proceedings contain 3 papers. The topics discussed include: analysis and evaluation of load management strategies in a decentralized FaaS environment: a simulation-based framework; live migration of multi-container Kubernetes pods in multi-cluster serverless edge systems; and comparing actor-critic and neuroevolution approaches for traffic offloading in FaaS-powered edge systems.
ISBN (Print): 9798400705328
With over 95% of the top one million websites not being fully accessible, teaching digital accessibility to computing students is crucial. In this study, conducted over three spring semesters within a Software Engineering project course, we introduced dedicated sprints focused on accessibility. During these sprints, students were taught about various types of disabilities, accessibility principles, the Web Content Accessibility Guidelines (WCAG), and various automated and manual techniques for accessibility testing. Students used automated tools to identify accessibility issues in their web projects and subsequently dedicated another sprint to addressing and resolving these issues. We systematically documented common accessibility mistakes students make, highlighting the WCAG success criteria that require careful instructional focus. Additionally, we identify which accessibility issues are frequently and easily resolved by students and which challenges persist despite their efforts. This analysis provides insights that enable the computing education community to effectively integrate and emphasize accessibility instruction in their curricula, ultimately fostering more inclusive and accessible web development practices.
ISBN (Print): 9798400701214
We consider leader election in clique networks, where n nodes are connected by point-to-point communication links. For the synchronous clique under simultaneous wake-up, i.e., where all nodes start executing the algorithm in round 1, we show a tradeoff between the number of messages and the amount of time. The previous lower-bound side of such a tradeoff, in the seminal paper of Afek and Gafni (1991), was shown only assuming adversarial wake-up. Interestingly, our new tradeoff also improves the previous lower bounds for a large part of the spectrum, even under simultaneous wake-up. More specifically, we show that any deterministic algorithm with a message complexity of n·f(n) requires Omega(log n/log f(n) + 1) rounds, for f(n) > 1. Our result holds even if the node IDs are chosen from a relatively small set of size Theta(n log n), as we are able to avoid using Ramsey's theorem, in contrast to many existing lower bounds for deterministic algorithms. We also give an upper bound that improves over the previously best tradeoff achieved by the algorithm of Afek and Gafni. Our second contribution for the synchronous clique under simultaneous wake-up is to show that Omega(n log n) is in fact a lower bound on the message complexity that holds for any deterministic algorithm with a termination time T(n) (i.e., any function of n), for a sufficiently large ID space. We complement this result by giving a simple deterministic algorithm that achieves leader election in sublinear time while sending only o(n log n) messages, if the ID space is of at most linear size. We also show that Las Vegas algorithms (that never fail) require Theta(n) messages. This exhibits a gap between Las Vegas and Monte Carlo algorithms. For the synchronous clique under adversarial wake-up, we show that Omega(n^(3/2)) is a lower bound for 2-round algorithms. Our result is the first superlinear lower bound for randomized leader election algorithms in the clique. We also give a simple algorithm that matches this lower bound.
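The message/time tradeoff has a natural baseline at one extreme: in a single synchronous round, every node can send its ID to every other node and each node locally elects the maximum ID seen, at a cost of n(n-1) messages. The sketch below simulates this naive 1-round baseline (it is not any of the paper's algorithms, which trade more rounds for far fewer messages):

```python
import random

def naive_election(ids):
    """One synchronous round in a clique: every node sends its ID to every
    other node; each node locally elects the maximum ID it has seen.
    Cost: n*(n-1) messages, 1 round."""
    n = len(ids)
    messages = n * (n - 1)
    leader = max(ids)  # every node computes the same maximum
    return leader, messages, 1

# Unique IDs drawn from a large ID space, as in the model.
ids = random.sample(range(10**6), 8)
leader, msgs, rounds = naive_election(ids)
print(leader == max(ids), msgs, rounds)  # True 56 1
```

Against this Theta(n^2)-message, 1-round point, the abstract's tradeoff quantifies how many extra rounds a deterministic algorithm must pay to get the message count down toward n·f(n).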
ISBN (Print): 9798400706523
The proceedings contain 3 papers. The topics discussed include: ECO-LLM: LLM-based edge cloud optimization; toward using representation learning for cloud resource usage forecasting; and MPIrigen: MPI code generation through domain-specific language models.
ISBN (Print): 9798400700156
The proceedings contain 43 papers. The topics discussed include: provably good randomized strategies for data placement in distributed key-value stores; provably fast and space-efficient parallel biconnectivity; practically and theoretically efficient garbage collection for multiversioning; fast and scalable channels in Kotlin coroutines; high-performance GPU-to-CPU transpilation and optimization via high-level parallel constructs; lifetime-based optimization for simulating quantum circuits on a new Sunway supercomputer; merchandiser: data placement on heterogeneous memory for task-parallel HPC applications with load-balance awareness; visibility algorithms for dynamic dependence analysis and distributed coherence; Block-STM: scaling blockchain execution by turning ordering curse to a performance blessing; TDC: towards extremely efficient CNNs on GPUs via hardware-aware tucker decomposition; and improving energy saving of one-sided matrix decompositions on CPU-GPU heterogeneous systems.
ISBN (Print): 9798400703874
The scale of deep learning models has grown tremendously in recent years. State-of-the-art models have reached billions of parameters and terabyte-scale model sizes. Training these models demands memory bandwidth and capacity that can only be accommodated distributively over hundreds to thousands of GPUs. However, large-scale distributed training suffers from GPU memory inefficiency, such as memory under-utilization and out-of-memory events (OOMs). There is a lack of understanding of the actual GPU memory behavior of distributed training on terabyte-size models, which hinders the development of effective solutions to such inefficiency. In this paper, we present a systematic analysis of the GPU memory behavior of large-scale distributed training jobs in production at Meta. Our analysis is based on offline training jobs of multi-terabyte Deep Learning Recommendation Models from one of Meta's largest production clusters. We measure GPU memory inefficiency, characterize GPU memory utilization, and provide a fine-grained GPU memory usage analysis. We further show how to build on this understanding to develop a practical GPU provisioning system in production.
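The under-utilization the abstract measures can be made concrete with a simple metric. This is an illustrative sketch only, not Meta's methodology: it assumes hypothetical per-iteration samples of allocated GPU memory and computes how much of the device's capacity goes unused, both at the job's peak and on average.

```python
def underutilization(samples, capacity):
    """Given per-iteration allocated-memory samples (GiB) and device
    capacity (GiB), return the unused fraction at peak and on average.
    A large peak slack suggests the job could run on fewer/smaller GPUs;
    a large gap between peak and average slack suggests bursty usage."""
    peak = max(samples)
    avg = sum(samples) / len(samples)
    return 1 - peak / capacity, 1 - avg / capacity

# Hypothetical allocated-memory samples on an 80 GiB GPU.
samples = [52.0, 61.5, 58.0, 60.0]
peak_slack, avg_slack = underutilization(samples, capacity=80.0)
print(peak_slack, avg_slack)
```

A provisioning system of the kind the abstract mentions would use peak-style statistics (plus headroom for OOM avoidance) rather than averages when deciding how much GPU memory a job actually needs.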
Program Wars is a web-based card game for teaching fundamental concepts of programming and cybersecurity. We propose Program Wars v.3.0, an extension to Program Wars v.2.0, to explore the effectiveness of teaching fun...