ISBN: 9781939133403 (print)
This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs). By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality status of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10-200X across various LLM inference workloads.
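The idea behind contribution (iii) can be illustrated with a small sketch. This is not ServerlessLLM's scheduler: the tier names, the bandwidth figures, the `Server` class, and the cost model (queueing delay plus checkpoint size divided by tier bandwidth) are all assumptions made for illustration. It only shows how checkpoint locality could feed into a choice of the server with the lowest estimated startup time.

```python
from dataclasses import dataclass, field

# Hypothetical storage tiers and bandwidths in GB/s.
# Illustrative values only; they are not taken from the paper.
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    # Maps a model name to the fastest local tier holding its checkpoint.
    checkpoint_tier: dict = field(default_factory=dict)
    # Assumed time (seconds) before this server can begin loading.
    queue_delay_s: float = 0.0


def estimated_startup_s(server: Server, model: str, size_gb: float) -> float:
    """Estimate startup time as queueing delay plus checkpoint load time."""
    # A checkpoint that is not stored locally must come from remote storage.
    tier = server.checkpoint_tier.get(model, "remote")
    return server.queue_delay_s + size_gb / TIER_BANDWIDTH_GBPS[tier]


def pick_server(servers: list, model: str, size_gb: float) -> Server:
    """Return the server with the smallest estimated startup time."""
    return min(servers, key=lambda s: estimated_startup_s(s, model, size_gb))


servers = [
    Server("a", {"llm": "dram"}, queue_delay_s=5.0),  # cached in DRAM but busy
    Server("b", {"llm": "ssd"}),                      # cached on SSD and idle
    Server("c"),                                      # no local copy
]
best = pick_server(servers, "llm", size_gb=10.0)
print(best.name, estimated_startup_s(best, "llm", 10.0))
```

Under these assumed numbers the idle server with an SSD copy is chosen over the busy server with a DRAM copy, which shows why a locality-aware estimate has to weigh more than the storage tier alone.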
ISBN: 9781939133236 (print)
On-device learning is an emerging technique to pave the last mile of enabling edge intelligence, which eliminates the limitations of conventional in-cloud computing, where substantial computational capacity and memory are needed. A high-performance on-device learning system requires breaking the constraints of limited resources and alleviating computational overhead. In this paper, we show that employing 8-bit fixed-point (INT8) quantization in both the forward and backward passes over a deep model is a promising way to enable tiny on-device learning in practice. The key to an efficient quantization-aware training method is to exploit hardware-level acceleration while preserving the training quality in each layer. However, off-the-shelf quantization methods cannot handle the on-device learning paradigm of fixed-point processing. To overcome these challenges, we propose a novel INT8 training method, which optimizes the computation of the forward and backward passes via the carefully designed Loss-aware Compensation (LAC) and Parameterized Range Clipping (PRC), respectively. Specifically, we build a new network component, the compensation layer, to automatically counteract the quantization error of tensor arithmetic. We implement our method in Octo, a lightweight cross-platform system for tiny on-device learning. Evaluation on commercial AI chips shows that Octo achieves higher training efficiency than state-of-the-art quantization training methods, while achieving adequate processing speedup and memory reduction over full-precision training.
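For readers unfamiliar with INT8 quantization, the sketch below shows generic symmetric quantization with a fixed clipping threshold. It is not Octo's LAC or PRC: the function names and the `clip` parameter are hypothetical, and the clipping range here is a constant chosen by the caller rather than a learned parameter. It only illustrates how clipping a value range and rounding to 8-bit integers introduces a bounded error for in-range values and saturates out-of-range ones.

```python
def quantize_int8(values, clip):
    """Symmetric INT8 quantization with a fixed clipping threshold.

    `clip` is a hypothetical, caller-chosen range: values outside
    [-clip, clip] saturate to -127 or 127.
    """
    scale = clip / 127.0  # real-valued width of one integer step
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in quantized]


values = [0.0, 1.0, -1.0, 5.0, 0.3]
codes, scale = quantize_int8(values, clip=1.0)
restored = dequantize(codes, scale)

for v, c, r in zip(values, codes, restored):
    print(f"{v:+.3f} -> code {c:+4d} -> {r:+.4f} (error {abs(v - r):.4f})")
```

For values inside the clipping range the round-trip error is at most half a quantization step, while the out-of-range value 5.0 saturates at the threshold. That saturation loss is the kind of quantization error the abstract says the compensation layer is meant to counteract.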
ISBN: 9781931971799 (print)
In classical machine virtualization, a hypervisor runs multiple operating systems simultaneously, each on its own virtual machine. In nested virtualization, a hypervisor can run multiple other hypervisors with their associated virtual machines. As operating systems gain hypervisor functionality (Microsoft Windows 7 already runs Windows XP in a virtual machine), nested virtualization will become necessary in hypervisors that wish to host them. We present the design, implementation, analysis, and evaluation of high-performance nested virtualization on Intel x86-based systems. The Turtles project, which is part of the Linux/KVM hypervisor, runs multiple unmodified hypervisors (e.g., KVM and VMware) and operating systems (e.g., Linux and Windows). Despite the lack of architectural support for nested virtualization in the x86 architecture, it can achieve performance that is within 6-8% of single-level (non-nested) virtualization for common workloads, through multi-dimensional paging for MMU virtualization and multi-level device assignment for I/O virtualization.
ISBN: 9781931971652 (print)
The proceedings contain 26 papers. The topics discussed include: DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language; Everest: scaling down peak loads through I/O off-loading; improving MapReduce performance in heterogeneous environments; Corey: an operating system for many cores; Redline: first class support for interactivity in commodity operating systems; network imprecision: a new consistency metric for scalable monitoring; and lightweight, high-resolution monitoring for troubleshooting production systems.
ISBN: 9781424421923 (print)
This paper proposes a trouble-diagnosing system for analyzing signal integrity and describes a complete architecture for the system, together with the crucial software for scanning and control. The software is developed with LabVIEW 8.5 from National Instruments. The paper also describes the implementation of the data processing and the display of the radiation graph. Finally, results are presented from using the system to scan part of an actual main board in an operating computer.
ISBN: 9781424435548 (print)
Operational continuity of data centers is challenged by experienced cyber attackers and occasional natural disasters. Assessing a data center's resilience under complex and realistic scenarios is important for system specification, design, and enhancement. Yet data center resilience evaluation is a demanding process because of the complexity of its systems and the multidimensional aspects required for a resilient system. This paper illustrates a data center resilience evaluation test-bed and its monitoring system. The test-bed provides a realistic testing environment and is capable of implementing multiple operating and attack scenarios.