The rapid rise in spatial data volumes from diverse sources necessitates efficient spatial data processing capabilities. Although most relational databases support spatial extensions of SQL query features, they offer lim...
ISBN:
(Print) 9798400701559
Temporal Graph Neural Networks (TGNNs) extend the success of Graph Neural Networks to dynamic graphs. Distributed TGNN training must efficiently handle temporal dependencies, which often lead to excessive cross-device communication carrying significant redundant data. Existing systems are unable to remove this redundancy in data reuse and transfer, and suffer severe communication overhead in distributed settings. This paper presents Sven, an algorithm-system co-designed TGNN training library for end-to-end performance optimization on multi-node, multi-GPU systems. Exploiting the dependency patterns of TGNN models and the characteristics of dynamic graph datasets, we design redundancy-free data organization and load-balanced partitioning strategies that mitigate redundant data communication and evenly partition dynamic graphs at the vertex level. Furthermore, we develop a hierarchical pipeline mechanism integrating data prefetching, micro-batch pipelining, and asynchronous pipelining to hide communication overhead. In the first scaling study of memory-based TGNN training, experiments conducted on an HPC cluster of 64 GPUs show that Sven achieves a 1.7x-3.3x speedup over state-of-the-art approaches and up to a 5.26x improvement in communication efficiency.
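To make the pipelining idea concrete, here is a minimal Python sketch of how prefetching can overlap the communication for the next micro-batch with the computation on the current one; fetch_batch and train_step are hypothetical placeholders, not Sven's actual API.

```python
# Minimal sketch of micro-batch pipelining with data prefetching, in the
# spirit of a hierarchical pipeline. All names are illustrative placeholders.
import queue
import threading

def fetch_batch(i):
    # Placeholder for the remote fetch (the communication stage).
    return {"batch_id": i, "features": [i] * 4}

def train_step(batch):
    # Placeholder for the local GPU compute stage.
    return sum(batch["features"])

def pipelined_training(num_batches, depth=2):
    prefetched = queue.Queue(maxsize=depth)

    def producer():
        for i in range(num_batches):
            prefetched.put(fetch_batch(i))  # overlaps with consumer's compute
        prefetched.put(None)                # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := prefetched.get()) is not None:
        train_step(batch)  # fetch of the next batch proceeds concurrently

pipelined_training(num_batches=8)
```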
ISBN:
(Print) 9781665476522
Hardware-assisted enclaves with memory encryption have been widely adopted in prevailing architectures, e.g., Intel SGX/TDX, AMD SEV, and ARM CCA. However, existing enclave designs fall short in supporting efficient cooperation among cross-node enclaves (i.e., across machines) because hardware memory protection only covers a single node. A naive approach is to apply cryptography at the application level and transfer data between nodes through secure channels (e.g., SSL). However, this incurs orders-of-magnitude overhead due to expensive encryption/decryption, especially for distributed applications with large data transfers, e.g., MapReduce and graph computing. A secure and efficient mechanism for distributed secure memory is necessary but still missing. This paper proposes the Migratable Merkle Tree (MMT), a design enabling efficient distributed secure memory to support distributed confidential computing. MMT sets up an integrity forest over distributed memory on multiple nodes. It allows an enclave to securely delegate an MMT closure, which contains both the data and the metadata of a subtree, to a remote enclave. By reusing the memory encryption mechanisms of existing enclaves, our design achieves secure data transfer without software re-encryption. We have implemented a prototype of MMT along with a trusted firmware for management, and applied MMT to real-world distributed applications. The evaluation results show that, compared with existing systems using AES-NI instructions, MMT achieves up to a 13x speedup in data transfer and a 12%-58% improvement in the end-to-end performance of MapReduce and PageRank.
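The closure idea can be illustrated with a toy Merkle tree: a subtree is handed over together with the sibling hashes needed to re-verify it against the forest root. The Python sketch below is a simplified illustration of that general mechanism, not the paper's exact protocol or data layout.

```python
# Toy Merkle tree: delegate one leaf's subtree plus the sibling hashes
# ("closure") so a remote party can check integrity against the root.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"|".join(parts)).digest()

def build_tree(leaves):
    """Return all levels of a Merkle tree, leaves first (power-of-two count)."""
    levels = [[h(x) for x in leaves]]
    while len(levels[-1]) > 1:
        lvl = levels[-1]
        levels.append([h(lvl[i], lvl[i + 1]) for i in range(0, len(lvl), 2)])
    return levels

def closure(levels, leaf_idx):
    """Sibling hashes that authenticate one leaf up to the root."""
    path, idx = [], leaf_idx
    for lvl in levels[:-1]:
        path.append((idx % 2, lvl[idx ^ 1]))  # (am-I-the-right-child, sibling)
        idx //= 2
    return path

def verify(leaf, path, root):
    node = h(leaf)
    for is_right, sib in path:
        node = h(sib, node) if is_right else h(node, sib)
    return node == root

pages = [b"page0", b"page1", b"page2", b"page3"]
levels = build_tree(pages)
root = levels[-1][0]
proof = closure(levels, 2)              # "delegate" page 2 with its metadata
assert verify(pages[2], proof, root)    # remote side re-verifies integrity
```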
In an era where digital commerce continues to burgeon, the conventional supply chain confronts challenges of inefficiency, fraud, and a dearth of transparency. Blockchain, renowned for its decentralized and immutable ...
ISBN:
(Print) 9781665497473
Memory caching has long been used to bridge the performance gap between processor and disk and reduce the data access time of data-intensive computations. Previous studies on caching mostly focus on optimizing the hit rate of a single machine. In this paper, we argue that caching decisions in a distributed memory system should be made cooperatively for parallel data analytic applications, which are commonly used by emerging technologies such as Big Data and AI (Artificial Intelligence) to perform data mining and sophisticated analytics over larger data volumes in a shorter time. A parallel data analytic job consists of multiple parallel tasks, so its completion time is bounded by its slowest task: the job cannot benefit from caching until the inputs of all of its tasks are cached. To address this problem, we propose a cooperative caching design that periodically rearranges cache placement among nodes according to the data access pattern while taking task dependency and network locality into account. We evaluate our approach with a trace-driven simulator using both synthetic workloads and real-world traces. The results show that we reduce average completion times by up to 33% compared to non-collaborative caching policies and by up to 25% compared to other state-of-the-art collaborative caching policies.
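The all-or-nothing property suggests assigning cache capacity to whole jobs rather than to individual hot files. The greedy ranking below is a hypothetical Python illustration of that intuition, not the paper's actual placement algorithm.

```python
# Sketch: a job only benefits once the inputs of ALL its tasks are cached,
# so cache whole input sets atomically, favoring frequently re-run jobs.
def plan_cache(jobs, capacity):
    """jobs: {job: {"inputs": {file: size}, "accesses": int}}"""
    placed, used = set(), 0
    ranked = sorted(jobs.items(),
                    key=lambda kv: kv[1]["accesses"] / sum(kv[1]["inputs"].values()),
                    reverse=True)
    for job, info in ranked:
        need = sum(s for f, s in info["inputs"].items() if f not in placed)
        if used + need <= capacity:        # cache the job's inputs atomically
            placed.update(info["inputs"])
            used += need
    return placed

jobs = {
    "J1": {"inputs": {"a": 4, "b": 4}, "accesses": 10},
    "J2": {"inputs": {"b": 4, "c": 8}, "accesses": 3},
}
print(plan_cache(jobs, capacity=10))  # pins J1's full input set: {'a', 'b'}
```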
ISBN:
(Print) 9798350364613; 9798350364606
Parallel high-performance computing relies on cache-efficient, branch-free algorithms that are often expressed as imperative computations over multi-dimensional arrays. Numerous problem domains, spanning from image processing to graph analytics, and from state space exploration in combinatorial optimization to computer chess, require carefully crafted algorithms that capitalize on patterns inherent in the underlying problem structure. A renowned technique, SIMD-Within-A-Register (SWAR), harnesses integer arithmetic to attain significant hardware parallelism. However, this approach typically demands labor-intensive effort from domain experts with specialized knowledge of the underlying hardware architecture. We therefore present a compiler-driven approach that automates the transformation of conventional array-based C code into highly tuned integer arithmetic, exploiting SWAR parallelism without tedious manual optimization. Our approach achieves substantial performance improvements, exhibiting an average speedup of 30x over conventional array-based implementations.
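A classic example of the SWAR style such a compiler targets: adding eight packed 8-bit lanes inside a single 64-bit integer using only plain integer arithmetic, with masking so carries cannot leak between lanes. Sketched here in Python (the paper's compiler works on C):

```python
# SWAR lane-wise add: eight packed bytes in one 64-bit word.
MASK64 = (1 << 64) - 1
HI = 0x8080808080808080  # the top bit of every 8-bit lane

def swar_add8(x, y):
    """Lane-wise (x + y) mod 256 on eight packed bytes."""
    low = ((x & ~HI) + (y & ~HI)) & MASK64  # add low 7 bits; carries stay in-lane
    return low ^ ((x ^ y) & HI)             # restore each lane's top bit, no carry-out

def pack(bytes8):
    n = 0
    for i, b in enumerate(bytes8):
        n |= (b & 0xFF) << (8 * i)
    return n

a = pack([250, 1, 2, 3, 4, 5, 6, 7])
b = pack([10, 1, 1, 1, 1, 1, 1, 1])
out = swar_add8(a, b)
assert [(out >> (8 * i)) & 0xFF for i in range(8)] == [4, 2, 3, 4, 5, 6, 7, 8]
```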
ISBN:
(Print) 9781665497473
In today's Big Data era, data scientists require modern workflows to quickly analyze large-scale datasets using complex codes to maintain the rate of scientific progress. These scientists often rely on available campus resources or off-the-shelf computational systems for their applications. Unified infrastructure or over-provisioned servers can quickly become bottlenecks for specific tasks, wasting time and resources. Composable infrastructure helps solve these problems by providing users with new ways to increase resource utilization. Composable infrastructure disaggregates a computer's components (CPU, GPU and other accelerators, storage, and networking) into fluid pools of resources, but typically relies on infrastructure engineers to architect individual machines. Infrastructure is managed with specialized command-line utilities, user interfaces, or specification files. These management models are cumbersome and difficult to incorporate into data-science workflows. We developed a high-level software API, Composastructure, which, when integrated into modern workflows, can be used by infrastructure engineers as well as data scientists to reorganize composable resources on demand. Composastructure makes infrastructure programmable, secure, persistent, and reproducible. Our API composes machines, frees resources, supports multi-rack operations, and includes a Python module for Jupyter Notebooks.
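As a rough illustration of what composing a machine through such an API might look like from a notebook, the toy class below models a disaggregated resource pool; every name here is a hypothetical placeholder, not Composastructure's published interface.

```python
# Hypothetical sketch of compose/free over a disaggregated resource pool.
class ComposableFabric:
    """Toy model of a pool of disaggregated components."""
    def __init__(self, pool):
        self.pool = pool          # e.g. {"gpu": 8, "nvme_tb": 32}
        self.machines = {}

    def compose(self, name, **req):
        for res, amount in req.items():
            if self.pool.get(res, 0) < amount:
                raise RuntimeError(f"pool exhausted: {res}")
        for res, amount in req.items():
            self.pool[res] -= amount
        self.machines[name] = req
        return name

    def free(self, name):
        for res, amount in self.machines.pop(name).items():
            self.pool[res] += amount

fabric = ComposableFabric({"gpu": 8, "nvme_tb": 32})
node = fabric.compose("trainer-0", gpu=4, nvme_tb=8)  # build a machine on demand
fabric.free(node)                                     # return resources to the pool
```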
ISBN:
(Print) 9798350364613; 9798350364606
The matching problem formulated as Maximum Cardinality Matching in General Graphs (MCMGG) finds the largest matching on graphs without restrictions. The Micali-Vazirani algorithm has the best asymptotic complexity for solving MCMGG on sparse graphs. Parallelizing matching in general graphs on the GPU is difficult for multiple reasons. First, the augmenting path procedure is highly recursive, and NVIDIA GPUs use registers to store kernel arguments, which eventually spill into cached device memory at a performance penalty. Second, extracting parallelism from the matching process requires partitioning the graph to avoid overlapping augmenting paths. We propose an implementation of the Micali-Vazirani algorithm that identifies bridge edges using thread-parallel breadth-first search, followed by block-parallel path augmentation and blossom contraction. The augmenting path and union-find methods were implemented as stack-based iterative methods, with the stack allocated in shared memory. Our experiments show that, compared to the serial implementation, our approach yields up to a 15-fold speedup for very sparse regular graphs, up to a 5-fold slowdown for denser regular graphs, and a 50-fold slowdown for power-law-distributed Kronecker graphs. This implementation has been open-sourced for further research on developing combinatorial graph algorithms on GPUs.
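The recursion-to-stack conversion is the key implementation trick here: a deeply recursive search becomes a loop over an explicit stack that a kernel can keep in fast shared memory instead of spilled registers. Below is a simplified host-side Python sketch of both patterns (the paper's actual implementation is CUDA).

```python
# Iterative stand-ins for the two recursive procedures named in the abstract.
def find_path_iterative(graph, src, dst):
    """Return one src->dst path using an explicit DFS stack."""
    stack, visited = [(src, [src])], {src}
    while stack:                      # replaces the recursive call chain
        node, path = stack.pop()
        if node == dst:
            return path
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None

def uf_find(parent, x):
    """Iterative union-find 'find' with path compression (no recursion)."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:          # second pass compresses the path
        parent[x], x = root, parent[x]
    return root

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(find_path_iterative(g, 0, 3))   # e.g. [0, 2, 3]
parent = {0: 0, 1: 0, 2: 1}
print(uf_find(parent, 2))             # 0, and parent[2] now points at the root
```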
ISBN:
(Print) 9781665481069
Modern distributed storage systems with massive data volumes and many storage nodes place higher demands on the data placement strategy. Furthermore, with the emergence of new storage devices, heterogeneous storage architectures have become increasingly common. Traditional strategies show serious limitations in the face of these requirements; in particular, they do not adequately account for the distinct characteristics of heterogeneous storage nodes, which leads to suboptimal performance. In this paper, we present and evaluate RLRP, a deep reinforcement learning (RL) based replica placement strategy. RLRP constructs placement and migration agents with the Deep Q-Network (DQN) model to achieve fair distribution and adaptive data migration. In addition, RLRP delivers optimal performance in heterogeneous environments through an attentional Long Short-Term Memory (LSTM) model. Finally, RLRP adopts stagewise training and model fine-tuning to accelerate the training of RL models with large-scale state and action spaces. RLRP is implemented on Park, and the evaluation results indicate that RLRP is a highly efficient data placement strategy for modern distributed storage systems. RLRP reduces read latency by 10%-50% in heterogeneous environments compared with existing strategies. In addition, RLRP is deployed in the real-world system Ceph, where it improves read performance by 30%-40%.
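In spirit, the placement agent observes the nodes' load, picks a node for each incoming replica, and is rewarded for balanced placement on fast nodes. The toy loop below substitutes a tabular Q update for the paper's DQN to keep the sketch dependency-free; node speeds, rewards, and hyperparameters are illustrative.

```python
# Toy RL replica-placement loop: state = rounded load vector,
# action = which node gets the next replica.
import random
from collections import defaultdict

NODES = [{"speed": 1.0}, {"speed": 1.0}, {"speed": 2.0}]  # node 2 is faster (e.g. SSD)
Q = defaultdict(float)

def reward(load, action):
    # Prefer fast nodes, penalize piling replicas on an already-loaded node.
    return NODES[action]["speed"] - load[action]

def place_replicas(n, eps=0.2, alpha=0.5):
    load = [0.0] * len(NODES)
    for _ in range(n):
        state = tuple(round(l, 1) for l in load)
        if random.random() < eps:                      # explore
            a = random.randrange(len(NODES))
        else:                                          # exploit learned values
            a = max(range(len(NODES)), key=lambda i: Q[(state, i)])
        r = reward(load, a)
        Q[(state, a)] += alpha * (r - Q[(state, a)])   # one-step value update
        load[a] += 0.5
    return load

random.seed(0)
print(place_replicas(12))  # heavier use of the fast node, without overloading it
```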
Distributed deep learning systems commonly use synchronous data parallelism to train models. However, communication overhead can be costly in distributed environments with limited communication bandwidth. To reduce co...