ISBN: (Print) 9798350329223
The proceedings contain 153 papers. The topics discussed include: transaction data management optimization based on multi-partitioning in blockchain systems; semi-asynchronous federated learning optimized for non-IID data communication based on tensor decomposition; HKTGNN: hierarchical knowledge transferable graph neural network-based supply chain risk assessment; DQR-TTS: semi-supervised text-to-speech synthesis with dynamic quantized representation; deep reinforcement learning-based network moving target defense in DPDK; iNUMAlloc: towards intelligent memory allocation for AI accelerators with NUMA; and predictive queue-based low latency congestion detection in data center networks.
Erasure-coded storage systems can achieve highly reliable data storage with low storage overhead. However, updating data blocks necessitates updating parity blocks, and updating multiple data blocks incurs heavy I/O a...
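As a hedged illustration of the parity-update cost described above, the following minimal sketch shows a delta-based parity update in a RAID-5-style XOR code; the helper names and block contents are illustrative assumptions, not the paper's scheme:

```python
# Minimal sketch (assumed RAID-5-style XOR parity, not the paper's code).
# Updating one data block forces a parity update: read the old block and
# old parity, then write the new block and the patched parity.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    # new_parity = old_parity XOR old_block XOR new_block
    return xor_blocks(xor_blocks(old_parity, old_block), new_block)

data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = b"\x00\x00"
for blk in data:                      # initial parity = XOR of all data blocks
    parity = xor_blocks(parity, blk)

new_block = b"\xff\x00"               # update data block 0
parity = update_parity(parity, data[0], new_block)
data[0] = new_block

check = b"\x00\x00"                   # invariant: parity still covers all blocks
for blk in data:
    check = xor_blocks(check, blk)
assert check == parity
```

Even this single-block update costs two reads and two writes; updating many blocks multiplies that amplification, which is the I/O burden the abstract refers to.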
In traditional CPU scheduling systems, it is challenging to customize scheduling policies for datacenter workloads. Therefore, distributed cluster managers can only perform coarse-grained job scheduling rather than fi...
ISBN: (Print) 9798400701559
In the exascale computing era, optimizing MPI collective performance in high-performance computing (HPC) applications is critical. Current algorithms face performance degradation due to system call overhead, page faults, or data-copy latency, affecting HPC applications' efficiency and scalability. To address these issues, we propose PiP-MColl, a Process-in-Process-based Multi-object Interprocess MPI Collective design that maximizes small-message MPI collective performance at scale. PiP-MColl features efficient multiple-sender and multiple-receiver collective algorithms and leverages Process-in-Process shared memory techniques to eliminate unnecessary system calls, page faults, and extra data copies, improving intra- and inter-node message rate and throughput. Our design also boosts performance for larger messages, resulting in comprehensive improvement across message sizes. Experimental results show that PiP-MColl outperforms popular MPI libraries, including OpenMPI, MVAPICH2, and Intel MPI, by up to 4.6x for MPI collectives such as MPI_Scatter and MPI_Allgather.
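PiP-MColl lives inside the MPI runtime, so its internals are not reproduced here. As a hedged baseline, the sketch below uses mpi4py (an assumption for brevity; the paper evaluates C MPI libraries) to time small-message MPI_Allgather, the regime where system-call and copy overheads dominate:

```python
# Hedged sketch: a small-message MPI_Allgather latency microbenchmark.
# This is a baseline one would compare PiP-MColl against, not PiP-MColl itself.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

msg_bytes = 8                                  # small-message regime
send = np.full(msg_bytes, rank, dtype=np.uint8)
recv = np.empty(msg_bytes * size, dtype=np.uint8)

iters = 10_000
comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allgather(send, recv)                 # buffer-based collective
t1 = MPI.Wtime()

if rank == 0:
    print(f"avg MPI_Allgather latency: {(t1 - t0) / iters * 1e6:.2f} us")
```

Run with, for example, mpirun -n 4 python bench.py; rerunning under different MPI libraries (OpenMPI, MVAPICH2, Intel MPI) is the kind of comparison behind the paper's reported 4.6x speedup.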
In this poster, the energy and carbon management problem in a Data Center Microgrid (DCMG) is modeled as a Decentralized Partially Observable Markov Decision Process, and the Multi-Agent Deep Deterministic Policy Gradie...
Emerging technologies such as cloud computing and artificial intelligence raise significant concerns about data security and privacy. Homomorphic encryption (HE) is a promising technique, which enables computation...
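As a hedged illustration of computing on encrypted data, the toy Paillier sketch below demonstrates the additive homomorphism that HE schemes provide; the tiny fixed keys are for illustration only, and this is an assumed example, not the scheme studied in the paper:

```python
# Toy Paillier cryptosystem (assumed example, deliberately insecure key sizes):
# multiplying ciphertexts modulo n^2 decrypts to the SUM of the plaintexts.
import math
import random

p, q = 293, 433                  # toy primes; real keys are thousands of bits
n = p * q
n2 = n * n
g = n + 1                        # standard choice that simplifies decryption
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)             # L(g^lam mod n^2) = lam when g = n + 1

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:   # r must be invertible modulo n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

a, b = 1234, 5678
c = (encrypt(a) * encrypt(b)) % n2   # multiply ciphertexts ...
assert decrypt(c) == a + b           # ... to add the underlying plaintexts
```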
ISBN: (Print) 9798350364613; 9798350364606
Researchers conduct post-processing on simulation results by running an interactive data analysis tool on a High-Performance Computing (HPC) system installed at an HPC center and retrieving the post-processed results. Certain data analysis scenarios require transferring the simulation results directly from the center. In such scenarios, a portion of the data is usually streamed over the network to achieve interactivity. However, two challenges remain in maintaining interactivity: (1) limited network bandwidth and (2) long network latency. To tackle these challenges, we propose a system that enables interactive array analysis over the network. We employ error-bounded lossy compression to increase the effective network bandwidth. Furthermore, we employ multi-level caching to hide the network latency, combined with prefetching to improve the cache hit ratio. The cache replacement and prefetching policies are designed around the data access patterns of interactive analysis. We compared our proposed system with TileDB, one of the state-of-the-art array databases, by measuring the average latency for various access patterns. With a 10% error bound, the proposed system reduces the average latency by up to 91.6% compared to TileDB: the cache hit ratio improves by more than 40% thanks to the cache replacement and prefetching policies, and lossy compression reduces network transfer time by more than 75%.
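As a hedged sketch of the caching-plus-prefetching idea (the tile ids, the fetch_tile callback, and the next-tile policy are illustrative assumptions, not the paper's actual policies):

```python
# LRU tile cache with sequential prefetching: on a miss, fetch the requested
# tile and speculatively fetch the next one to hide network latency.
from collections import OrderedDict

class PrefetchingTileCache:
    def __init__(self, fetch_tile, capacity=128):
        self.fetch_tile = fetch_tile   # fetches one (possibly compressed) tile over the network
        self.capacity = capacity
        self.cache = OrderedDict()     # tile_id -> tile data, kept in LRU order

    def _put(self, tile_id, data):
        self.cache[tile_id] = data
        self.cache.move_to_end(tile_id)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used

    def get(self, tile_id):
        if tile_id in self.cache:               # hit: no network round trip
            self.cache.move_to_end(tile_id)
            return self.cache[tile_id]
        data = self.fetch_tile(tile_id)         # miss: pay the network latency
        self._put(tile_id, data)
        nxt = tile_id + 1                       # assume a sequential access pattern
        if nxt not in self.cache:
            self._put(nxt, self.fetch_tile(nxt))
        return data

cache = PrefetchingTileCache(fetch_tile=lambda t: f"tile-{t}")
cache.get(0)             # miss: fetches tile 0 and prefetches tile 1
assert 1 in cache.cache  # a sequential read of tile 1 now hits the cache
```

A real system would prefetch asynchronously and adapt the policy to the observed access pattern, as the abstract describes; the synchronous version above only shows the bookkeeping.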
With the rapid development of deep learning, the parameters of modern neural network models, especially in the field of Natural Language Processing (NLP), have become extremely large. When the parameters of the model are larger...
ISBN: (Print) 9781665455800
Deep Neural Networks (DNNs) process big datasets, achieving high accuracy on incredibly complex tasks. However, this progress has led to a scalability impasse, as DNNs require massive amounts of processing power and local memory to be trained, making them impossible or impractical to train on a single device. This situation has led to the design of distributed training architectures, where the DNN and the training data can be split among multiple processors. How to choose the appropriate distributed training architecture, however, remains an open question. To help bring insights into this debate, in this work we design a Distributed Training Simulator (DTS) that estimates the training time of a DNN in a distributed architecture through a mathematical model of the architecture and resource-allocation heuristics. We illustrate the power of the proposed DTS by implementing five different distributed architectures: Pipeline Learning, Federated Learning, Split Learning, Parallel Split Learning, and Federated Split Learning. We validate the accuracy of the training estimates using three datasets of varying complexity and two different DNNs. Finally, we present a trade-off analysis that demonstrates the coherence of DTS estimates for diverse high-performance computing scenarios by comparing the estimates with the behavior of a real computer cluster.
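As a hedged sketch of what such a simulator's mathematical model might look like (generic cost terms and parameters assumed here, not the paper's actual model), the following estimates one data-parallel training epoch from compute throughput and all-reduce communication volume:

```python
# Back-of-the-envelope distributed-training time model, in the spirit of a
# simulator like DTS. All constants are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Cluster:
    workers: int
    flops_per_worker: float   # sustained FLOP/s per device
    bandwidth: float          # inter-device bandwidth, bytes/s

def epoch_time_data_parallel(cluster, samples, flops_per_sample,
                             model_bytes, steps):
    # Compute is split evenly across workers.
    compute = samples * flops_per_sample / (cluster.workers * cluster.flops_per_worker)
    # A ring all-reduce moves roughly 2 * model_bytes per worker per step.
    comm = steps * 2 * model_bytes / cluster.bandwidth
    return compute + comm

c = Cluster(workers=8, flops_per_worker=100e12, bandwidth=25e9)
t = epoch_time_data_parallel(c, samples=1_000_000, flops_per_sample=3e9,
                             model_bytes=400e6, steps=1000)
print(f"estimated epoch time: {t:.1f} s")   # ~3.8 s compute + ~32 s communication
```

Comparing such closed-form estimates across architectures (pipeline, federated, split, and their combinations) is the kind of trade-off analysis the abstract describes.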
ISBN: (Print) 9798350301199
The proceedings contain 59 papers. The topics discussed include: rethinking design paradigm of graph processing system with a CXL-like memory semantic fabric; a case study of data management challenges presented in large-scale machine learning workflows; an asynchronous dataflow-driven execution model for distributed accelerator computing; an empirical study of container image configurations and their impact on start times; how workflow engines should talk to resource managers: a proposal for a common workflow scheduling interface; Layercake: efficient inference serving with cloud and mobile resources; optimal sizing of a globally distributed low carbon cloud federation; predicting the performance-cost trade-off of applications across multiple systems; a deep learning pipeline parallel optimization method; HDFL: a heterogeneity and client dropout-aware federated learning framework; heterogeneous federated learning using dynamic model pruning and adaptive gradient; and measuring the impact of gradient accumulation on cloud-based distributed training.