Agent-to-agent communication is an important operation in multi-agent systems and their simulation. Given the data-centric nature of agent simulations, direct agent-to-agent communication is generally an orthogonal o...
ISBN: 9781450370523 (print)
The proceedings contain 22 papers. The topics discussed include: spying on the floating point behavior of existing, unmodified scientific applications; high accuracy matrix computations on neural engines: a study of QR factorization and its applications; space-efficient k-d tree-based storage format for sparse tensors; modeling the temporally constrained preemptions of transient cloud VMs; cloud-scale VM-deflation for running interactive applications on transient servers; funcX: a federated function serving fabric for science; towards HPC I/O performance prediction through large-scale log analysis; significantly improving lossy compression for HPC datasets with second-order prediction and parameter optimization; DCDB Wintermute: enabling online and holistic operational data analytics on HPC systems; and FFT-based gradient sparsification for the distributed training of deep neural networks.
The emerging edge computing paradigm is enabling a series of real-time, location-aware applications. The data produced by applications like autonomous driving, collaborative machine learning, and real-time video proce...
ISBN: 9781450380287 (print)
The proceedings contain 28 papers. The topics discussed include: Triggerflow: trigger-based orchestration of serverless workflows; leaving stragglers at the window: low-latency stream sampling with accuracy guarantees; EdgeScaler: effective elastic scaling for graph stream processing systems; mechanisms for outsourcing computation via a decentralized market; FaaSdom: a benchmark suite for serverless computing; ByzGame: Byzantine generals game; the Kaiju project: enabling event-driven observability; DeepMatch: deep matching for in-vehicle presence detection in transportation; Hermes: enabling energy-efficient IoT networks with generalized deduplication; and doctoral symposium: trade-off analysis of thermal-constrained scheduling strategies in multi-core systems.
ISBN: 9781450368186 (print)
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and inefficient utilization of network and/or GPU resources. In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact with the network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.
ISBN: 9781450368186 (print)
In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where barriers are intensively used. We find that (1) order-preserving approaches without involving the bus significantly outperform other approaches, and (2) the tremendous overhead mostly comes from barriers strictly following remote memory references. Usually, such barriers are inserted when threads are exchanging data, and they are used to ensure the relative order between storing the data to a shared buffer and setting a flag to inform the receiver. Based on the observations, we propose a new mechanism, Pilot, to remove such barriers by leveraging the single-copy atomicity to piggyback the flag with the data. Applying Pilot only requires minor changes to applications and provides 10%-360% performance improvements in multiple benchmarks, which are close to the ideal performance without barriers.
ISBN: 9781450385480 (print)
Differential privacy is a mathematically rigorous definition of privacy tailored to statistical analysis of large datasets. Differentially private algorithms are equipped with a parameter which controls the formal measure of privacy loss. All algorithms have utility/privacy tradeoffs, and the goal of algorithmic research in differential privacy is to optimize this tradeoff. Differential privacy is most widely studied in the centralized model, in which a trusted and trustworthy curator has access to raw data. Deployment in industry has focused on the local model, where privacy is "rolled into" the data on the client via randomization before being collected. There is a separation between the power of the centralized and local models. After a brief recap of differential privacy and its properties, we will survey a few highlights of differential privacy in a variety of distributed settings that lie between the local and centralized models, and conclude with suggestions for future research.
ISBN: 9781450370523 (print)
Containers have emerged as a new technology in clouds to replace virtual machines (VMs) for deploying and operating distributed applications. As an increasing number of new cloud-focused applications, such as deep learning and high-performance applications, start to rely on the high computing throughput of GPUs, efficiently supporting GPUs in container clouds becomes essential. While GPU virtualization has been extensively studied for VMs, limited work has been done for containers. One of the key challenges is the lack of support for GPU sharing between multiple concurrent containers. This limitation leads to low resource utilization when a GPU device cannot be fully utilized by a single application due to the burstiness of GPU workloads and limited memory bandwidth. To overcome this issue, we designed and implemented KubeShare, which extends Kubernetes to enable GPU sharing with fine-grained allocation. KubeShare is the first solution for Kubernetes to make GPU devices first-class resources for scheduling and allocation. Using real deep learning workloads, we demonstrate that KubeShare can significantly increase GPU utilization and overall system throughput by around 2x, with less than 10% performance overhead during container initialization and execution.
ISBN: 9781450386104 (print)
Variational quantum algorithm (VQA), which comprises a classical optimizer and a parameterized quantum circuit, emerges as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependent noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure make quantum computers extremely keen on high utilization. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device, which tends to introduce ever-changing device-specific noise with less reliable performance as time-since-calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices and adjusts the ensemble configuration with respect to machine status. In addition to reduced machine-dependent noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC, a distributed, gradient-based, processor-performance-aware optimization system, that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output with respect to its architecture, transpilation, and runtime conditions; and (iii) a weighting mechanism to adjust the quantum ensemble's computational contribution according to the systems' current performance. Evaluations comprising 500K circuit evaluations across 10 IBMQ NISQ devices using VQE and QAOA applications demonstrate that EQC can attain error rates very close to the most performant device of the ensemble, while boosting the training speed by 10.5× on average (up to 86× and at least 5.2×). EQC is available at https://