ISBN (print): 9781595939890
The proceedings contain 68 papers. The topics discussed include: distributed computation of the mode; sublogarithmic distributed MIS algorithm for sparse graphs using Nash-Williams decomposition; a log-star distributed maximal independent set algorithm for growth-bounded graphs; a jamming-resistant MAC protocol for single-hop wireless networks; failure detectors in loosely named systems; every problem has a weakest failure detector; sharing is harder than agreeing; collaborative enforcement of firewall policies in virtual private networks; secure communication over radio channels; on the tradeoff between network connectivity, phase complexity and communication complexity of reliable communication tolerating mixed adversary; distributed algorithms for ultrasparse spanners and linear size skeletons; and efficient distributed approximation algorithms via probabilistic tree embeddings.
Bounded timestamping systems (Israeli and Li in Proceedings of the 28th annual IEEE Symposium on Foundations of Computer Science (FOCS), pp 371-382, 1987; Dolev and Shavit in SIAM J Comput 26(2):418-455, 1997) allow a temporal ordering of events in executions of concurrent algorithms. They are a fundamental and well-studied building block used in many shared-memory algorithms (Haldar and Vitányi in J ACM 49(1):101-126, 2002; Afek et al. in ACM Trans Program Lang Syst 16:939-953, 1994; Abrahamson in Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (PODC), pp 291-302, 1988; Bashari and Woelfel in Proceedings of the 40th ACM Symposium on Principles of Distributed Computing (PODC), pp 545-555, 2021). A concurrent bounded timestamping system keeps track of m timestamps, where m is usually greater than or equal to the number of processes in the system, n. A process may, at any point, obtain a new timestamp, and later determine a total order of all processes' most recent timestamps. Known bounded timestamping algorithms (Dolev and Shavit in SIAM J Comput 26(2):418-455, 1997; Dwork and Waarts in J ACM 46(5):633-666, 1999; Dwork et al. in SIAM J Comput 28(5):1848-1874, 1999; Gawlick et al. in Theory of Computing and Systems (ISTCS), pp 171-183, 1992; Israeli and Pinhasov in Distributed Algorithms, pp 95-109, 1992; Haldar and Vitányi in J ACM 49(1):101-126, 2002) do not scale well in the number of processes, as getting a new timestamp takes at least Ω(n) steps. Moreover, a lower bound by Israeli and Li (Proceedings of the 28th annual IEEE Symposium on Foundations of Computer Science (FOCS), pp 371-382, 1987) implies that timestamps need to be represented by Ω(m) bits.
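For intuition, the interface such a system exposes can be sketched with a trivially unbounded implementation backed by a shared counter (a toy model for illustration only, not one of the bounded constructions cited above; bounded algorithms must recycle labels, which is where the Ω(n) step and Ω(m) size costs arise):

    import itertools

    class UnboundedTimestamps:
        """Toy timestamping object for n processes: take a new timestamp,
        then totally order every process's most recent one. Labels grow
        without bound, which is exactly what bounded systems avoid."""

        def __init__(self, n):
            self.counter = itertools.count(1)  # fresh, ever-growing labels
            self.latest = [0] * n              # latest[i] = newest label of process i

        def get_timestamp(self, pid):
            label = next(self.counter)         # O(1) here; bounded systems pay Omega(n)
            self.latest[pid] = label
            return label

        def scan_order(self):
            # Total order of all processes' most recent timestamps.
            return sorted(range(len(self.latest)), key=lambda i: self.latest[i])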
ISBN (print): 9798400714436
The carbon and water footprint of large-scale computing systems poses serious environmental sustainability risks. In this study, we discover that, unfortunately, carbon and water sustainability are at odds with each other: optimizing one alone hurts the other. To address this tension, we introduce WaterWise, a novel job scheduler for parallel workloads that intelligently co-optimizes carbon and water footprint to improve the sustainability of geographically distributed data centers.
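The co-optimization idea can be illustrated with a toy placement rule (a hypothetical sketch, not WaterWise's actual policy; the site data and the weight alpha are invented for illustration):

    def pick_site(sites, alpha=0.5):
        """Choose the datacenter minimizing a weighted sum of normalized
        carbon and water intensity; alpha trades one footprint against the
        other, since optimizing either alone hurts the other."""
        max_c = max(s["carbon"] for s in sites)
        max_w = max(s["water"] for s in sites)
        score = lambda s: alpha * s["carbon"] / max_c + (1 - alpha) * s["water"] / max_w
        return min(sites, key=score)

    # Hypothetical grid-intensity numbers, for illustration only.
    sites = [{"name": "us-east", "carbon": 420.0, "water": 1.8},
             {"name": "eu-north", "carbon": 110.0, "water": 3.1}]
    print(pick_site(sites)["name"])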
ISBN (print): 9798400714436
Deterministic parallelism is a key building block for distributed and fault-tolerant systems that offers substantial performance benefits while guaranteeing determinism. By studying existing deterministically parallel systems (DPS), we identify certain design pitfalls, such as batched execution and inefficient runtime synchronization, that preclude them from meeting the demands of μs-scale and high-throughput distributed systems deployed in modern datacenters. We present DORADD, a deterministically parallel runtime with low latency and high throughput, designed for modern datacenter services. DORADD introduces a hybrid scheduling scheme that effectively decouples request dispatching from execution. It employs a single dispatcher to deterministically construct a dynamic dependency graph of incoming requests, and worker pools that can independently execute requests in a work-conserving and synchronization-free manner. Furthermore, DORADD overcomes the single-dispatcher throughput bottleneck via core pipelining. We use DORADD to build an in-memory database and compare it with Caracal, the current state-of-the-art deterministic database, via the YCSB and TPC-C benchmarks. Our evaluation shows up to 2.5x better throughput and more than 150x and 300x better tail latency in non-contended and contended cases, respectively. We also compare DORADD with Caladan, the state-of-the-art non-deterministic remote procedure call (RPC) scheduler, and demonstrate that determinism in DORADD does not incur any performance overhead.
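The dispatch/execution split can be sketched in miniature as follows (a simplified single-threaded model, assuming each request declares the keys it touches; the real runtime executes the graph with synchronization-free worker pools and pipelined dispatchers):

    from collections import defaultdict

    def dispatch(requests):
        """Deterministically build a dependency graph: request j depends on
        the most recent earlier request that touched any of the same keys."""
        last = {}                          # key -> index of last request touching it
        deps = defaultdict(set)
        for j, req in enumerate(requests):
            for key in req["keys"]:
                if key in last:
                    deps[j].add(last[key])
                last[key] = j
        return deps

    def execute(requests, deps):
        """Run any request whose dependencies have finished (simulated here
        with a simple topological sweep)."""
        done, order = set(), []
        while len(done) < len(requests):
            for j in range(len(requests)):
                if j not in done and deps[j] <= done:
                    done.add(j)
                    order.append(j)
        return order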
ISBN (print): 9798400714436
Training large language models (LLMs) has become increasingly expensive due to the rapid expansion in model size. Pipeline parallelism (PP) is a widely used distributed training technique. However, as LLMs with larger contexts become prevalent and memory optimization techniques advance, traditional PP methods encounter greater communication challenges due to the increased size of activations and their gradients. To address this issue, we introduce weight-pipeline parallelism (WeiPipe), which transitions from an activation-passing pipeline to a weight-passing pipeline. WeiPipe reduces communication costs and achieves more balanced utilization by transmitting only weights and their gradients between workers in a pipeline manner. WeiPipe does not rely on collective communication primitives, thus ensuring scalability. We present four variations of WeiPipe parallelism, including WeiPipe-Interleave, which emphasizes communication efficiency, and WeiPipe-zero-bubble, which explores the potential for minimal bubble ratios. Our implementation of WeiPipe-Interleave, performed on up to 32 GPUs and tested in various model configurations, including large-context LLM training, demonstrates a significant improvement in throughput compared with state-of-the-art pipeline parallelism and fully sharded data parallelism on different underlying infrastructures, including NVLink within a cluster with Ethernet across clusters, and PCIe within a cluster with Ethernet across clusters. Additionally, WeiPipe shows greater scalability in communication-constrained scenarios compared with state-of-the-art strategies.
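The contrast with activation passing can be sketched with a toy forward pass in which stage weights, rather than activations, move one hop per step (a hypothetical illustration; the actual WeiPipe variants also circulate weight gradients and overlap communication with compute):

    def weipipe_forward(microbatches, stages):
        """Toy weight-passing forward pass: worker w keeps microbatch w's
        activations local; the weights of stage s reach worker w at step
        w + s, so the only traffic is stage weights moving hop by hop."""
        n, S = len(microbatches), len(stages)
        acts = list(microbatches)
        for t in range(n + S - 1):         # pipeline fill and drain
            for w in range(n):
                s = t - w                  # stage currently held by worker w
                if 0 <= s < S:
                    acts[w] = stages[s](acts[w])
        return acts

    # Each microbatch passes through both stages in order: prints [4, 6, 8].
    print(weipipe_forward([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]))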
ISBN (print): 9798400714436
There are several strategies to parallelize graph neural network (GNN) training over multiple GPUs. We observe that there is no consistent winner (i.e., one with the shortest running time), and the optimal strategy depends on the graph dataset, GNN model, training algorithm, and hardware configuration. As such, we design the APT system to automatically select efficient parallelization strategies for GNN training tasks. To this end, we analyze the trade-offs of the strategies and design simple yet effective cost models to compare their execution time and facilitate strategy selection. Moreover, we propose a general abstraction of the strategies, which allows us to implement a unified execution engine that can be configured to run different strategies. Our experiments show that APT usually chooses the optimal or a close-to-optimal strategy, and the training time can be reduced by over 2x compared with always using a single strategy. APT is open-source at https://***/kaihaoma/APT.
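Cost-model-driven selection of this kind can be sketched as follows (a hypothetical illustration; the strategy names, cost terms, and numbers are invented and are not APT's actual cost models):

    def choose_strategy(task, cost_models):
        """Pick the parallelization strategy with the lowest estimated
        epoch time under simple analytical cost models."""
        est = {name: model(task) for name, model in cost_models.items()}
        return min(est, key=est.get), est

    # Invented models: a compute term plus a communication term.
    cost_models = {
        "graph_parallel": lambda t: t["compute"] / t["gpus"] + t["halo_bytes"] / t["bw"],
        "data_parallel":  lambda t: t["compute"] / t["gpus"] + t["feature_bytes"] / t["bw"],
    }
    task = {"compute": 1e12, "gpus": 8, "halo_bytes": 4e9, "feature_bytes": 9e9, "bw": 1e10}
    print(choose_strategy(task, cost_models)[0])   # -> "graph_parallel"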
ISBN (print): 9798400714436
Online GNN inference has been widely explored by applications such as online recommendation and financial fraud detection systems, where even minor delays can result in significant financial impact. Real-time dynamic graph sampling enables online GNN inference to reflect the latest updates in real-world graphs. However, online GNN inference typically demands millisecond-level latency Service Level Objectives (SLOs) as its performance guarantee, which poses great challenges for existing dynamic graph sampling approaches based on graph databases. The issues mainly arise from two aspects: long tail latency due to imbalanced, data-dependent sampling, and large communication overhead incurred by distributed sampling. To address these issues, we propose Helios, an efficient distributed dynamic graph sampling service that meets stringent latency SLOs. The key ideas of Helios are 1) pre-sampling the dynamic graph in an event-driven manner, and 2) maintaining a query-aware sample cache to build complete K-hop sampling results locally for inference requests. Experiments on multiple datasets show that Helios achieves up to 67x higher serving throughput and up to 32x lower P99 query latency compared to baselines.
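Event-driven pre-sampling of this kind can be sketched with per-vertex reservoirs (a toy single-node model; the class name and fanout parameter are illustrative, and the real service shards this state and layers a query-aware cache on top):

    import random
    from collections import defaultdict

    class PreSampler:
        """On every edge event, keep a fixed-size uniform reservoir of
        neighbors per vertex, so a K-hop sample is assembled locally at
        query time instead of being sampled on demand."""

        def __init__(self, fanout=10):
            self.fanout = fanout
            self.reservoir = defaultdict(list)
            self.seen = defaultdict(int)

        def on_edge(self, u, v):           # called for each graph update event
            self.seen[u] += 1
            r = self.reservoir[u]
            if len(r) < self.fanout:
                r.append(v)
            else:                          # classic reservoir sampling step
                j = random.randrange(self.seen[u])
                if j < self.fanout:
                    r[j] = v

        def k_hop(self, root, k):
            """Assemble a K-hop sample from the pre-computed reservoirs."""
            frontier, sample = [root], {root}
            for _ in range(k):
                frontier = [v for u in frontier for v in self.reservoir[u]]
                sample.update(frontier)
            return sample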
This conference proceedings contains 27 papers. The main subjects are shared-memory multiprocessors, achievement of causal consistency, semantics of distributed services, storage management, message-passing systems, clock synchronization algorithms, high-speed network control, analysis of communication protocols, reasoning about probabilistic algorithms, and totally asynchronous systems.
ISBN (print): 0897911431
This conference proceedings contains 27 papers. The main topics discussed are: distributed computer systems, computer network protocols, distributed database systems, and some aspects of automata theory. Some general applications of concurrent programming, as well as deadlock detection and resolution, are also presented.