Distributed shared memory (DSM) systems can handle data-intensive applications and have recently been receiving more attention. A majority of existing DSM implementations are based on write-invalidation (WI) protocols, which achieve sub-optimal performance when the cache size is small. Specifically, the vast majority of invalidation messages become useless when evictions are frequent. The problem is aggravated by the scarcity of memory resources in data centers. To this end, we propose Falcon, a self-invalidation protocol that eliminates invalidation messages. It relies on per-operation timestamps to achieve the global memory order required by sequential consistency (SC). Furthermore, we conduct a comprehensive discussion of the two protocols with an emphasis on the impact of cache size. We also implement both protocols atop a recent DSM system, Grappa. The evaluation shows that the optimal protocol can improve the performance of a KV database by 27% and a graph processing application by 71.4% against the vanilla cache-free scheme.
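To make the self-invalidation idea concrete, the sketch below (hypothetical names and lease mechanism; it is not the Falcon protocol itself, whose details are not given in this abstract) shows how a reader can drop its own cached copy once a per-block timestamp lease expires, so the writer never needs to send invalidation messages:

    import time

    class CachedBlock:
        def __init__(self, data, write_ts, lease):
            self.data = data
            self.write_ts = write_ts   # timestamp assigned by the last writer
            self.lease = lease         # validity window granted with the data

    class SelfInvalidatingCache:
        """Hypothetical sketch: the reader invalidates its own copy when the
        per-operation timestamp says it may be stale, so the writer never has
        to broadcast invalidation messages."""
        def __init__(self, fetch_remote):
            self.blocks = {}
            self.fetch_remote = fetch_remote   # callback to the home node

        def read(self, addr, now=None):
            now = time.monotonic() if now is None else now
            blk = self.blocks.get(addr)
            if blk is None or now - blk.write_ts > blk.lease:
                # self-invalidate: the copy is (possibly) stale, re-fetch it
                data, write_ts, lease = self.fetch_remote(addr)
                blk = self.blocks[addr] = CachedBlock(data, write_ts, lease)
            return blk.data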
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
The concept of memory disaggregation has recently been gaining traction in research. With memory disaggregation, data center compute nodes can directly access memory on adjacent nodes and are therefore able to overcome local memory restrictions, introducing a new data management paradigm for distributed computing. This paper proposes and demonstrates a memory-disaggregated in-memory object store framework for big data applications by leveraging the newly introduced ThymesisFlow memory disaggregation system. The framework extends the functionality of the pre-existing Apache Arrow Plasma object store framework to distributed systems by enabling clients to easily and efficiently produce and consume data objects across multiple compute nodes. This allows big data applications to increasingly leverage parallel processing at reduced development costs. In addition, the paper includes latency and throughput measurements indicating that only a modest performance penalty is incurred for remote disaggregated memory access as opposed to local access (~6.5 vs ~5.75 GiB/s). The results can be used to guide the design of future systems that leverage memory disaggregation as well as the newly presented framework. This work is open-source and publicly accessible at https://***/10.5281/zenodo.6368998.
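The produce/consume pattern that the framework extends across nodes can be sketched against the classic single-node pyarrow.plasma API (available in pyarrow releases before 12.0); the disaggregated framework's own API is not shown in this abstract, so the snippet below is only an analogy:

    import numpy as np
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")      # a plasma_store must already be running

    # Producer side: allocate an object in the store, fill it, then seal it.
    object_id = plasma.ObjectID(np.random.bytes(20))
    payload = b"hello disaggregated memory"
    buf = memoryview(client.create(object_id, len(payload)))
    buf[:] = payload
    client.seal(object_id)                      # sealed objects become immutable and visible

    # Consumer side (same node with vanilla Plasma; a remote node in the
    # disaggregated framework): fetch the object by ID without copying.
    [view] = client.get_buffers([object_id])
    print(bytes(view))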
ISBN (print): 9781665417303
The constant growth of social media, unconventional web technologies, mobile applications, and Internet of Things (IoT) devices creates challenges for cloud data systems to support huge datasets and very high request rates. NoSQL distributed databases such as Cassandra have been used for unstructured data storage and to increase horizontal scalability and high availability. In this paper, we evaluate Cassandra on a low-power, low-cost cluster of commodity Single Board Computers (SBCs). The cluster has 15 Raspberry Pi v3 nodes and uses the Docker Swarm orchestration tool for Cassandra service deployment and ingress load balancing over the SBCs. Experimental results demonstrate that hardware limitations impacted workload throughput, but read and write latencies were comparable to results from other works on high-end or virtualized platforms. Despite the observed limitations, the results show that a low-cost SBC cluster can support cloud serving goals such as scale-out, elasticity, and high availability.
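As a rough illustration of how such a deployment would be exercised from a client (hypothetical host name and keyspace; this is not the paper's benchmark setup), a few lines with the Python cassandra-driver suffice to measure single-request read and write latencies against the Swarm ingress:

    import time
    from cassandra.cluster import Cluster

    cluster = Cluster(["raspberrypi-node1"])    # any SBC behind the ingress load balancer
    session = cluster.connect()
    session.execute("CREATE KEYSPACE IF NOT EXISTS bench WITH replication = "
                    "{'class': 'SimpleStrategy', 'replication_factor': 3}")
    session.execute("CREATE TABLE IF NOT EXISTS bench.kv (k int PRIMARY KEY, v text)")

    t0 = time.perf_counter()
    session.execute("INSERT INTO bench.kv (k, v) VALUES (%s, %s)", (1, "x" * 100))
    write_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    session.execute("SELECT v FROM bench.kv WHERE k = %s", (1,)).one()
    read_ms = (time.perf_counter() - t0) * 1000
    print(f"write {write_ms:.2f} ms, read {read_ms:.2f} ms")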
ISBN (print): 9781450386104
Variational quantum algorithm (VQA), which comprises a classical optimizer and a parameterized quantum circuit, emerges as one of the most promising approaches for harvesting the power of quantum computers in the noisy intermediate-scale quantum (NISQ) era. However, the deployment of VQAs on contemporary NISQ devices often faces considerable system and time-dependent noise and prohibitively slow training speeds. On the other hand, the expensive supporting resources and infrastructure make high utilization of quantum computers extremely important. In this paper, we propose a virtualized way of building up a quantum backend for variational quantum algorithms: rather than relying on a single physical device, which tends to introduce ever-changing device-specific noise and increasingly unreliable performance as the time since calibration grows, we propose to constitute a quantum ensemble, which dynamically distributes quantum tasks asynchronously across a set of physical devices and adjusts the ensemble configuration with respect to machine status. In addition to reduced machine-dependent noise, the ensemble can provide significant speedups for VQA training. With this idea, we build a novel VQA training framework called EQC - a distributed, gradient-based, processor-performance-aware optimization system - that comprises: (i) a system architecture for asynchronous parallel VQA cooperative training; (ii) an analytical model for assessing the quality of a circuit output with respect to its architecture, transpilation, and runtime conditions; (iii) a weighting mechanism to adjust the quantum ensemble's computational contribution according to the systems' current performance. Evaluations comprising 500K circuit evaluations across 10 IBMQ NISQ devices using VQE and QAOA applications demonstrate that EQC can attain error rates very close to those of the most performant device of the ensemble, while boosting the training speed by 10.5x on average (up to 86x and at least 5.2x). EQC is available at
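A toy version of the performance-aware weighting idea in point (iii) might look as follows (the exponential weighting function and the error estimates are assumptions, not EQC's actual analytical model): gradient estimates coming from noisier backends are simply down-weighted before the parameter update:

    import numpy as np

    def combine_gradients(grads, error_rates, sharpness=10.0):
        """grads: list of parameter-gradient vectors, one per backend.
        error_rates: estimated per-backend error (e.g. from recent calibration data).
        Returns a weighted average that down-weights noisier backends."""
        grads = np.asarray(grads)
        weights = np.exp(-sharpness * np.asarray(error_rates))
        weights /= weights.sum()
        return (weights[:, None] * grads).sum(axis=0)

    # Example: three backends return slightly different gradient estimates;
    # the last backend is currently noisy and contributes little.
    g = [[0.10, -0.31], [0.12, -0.29], [0.40, -0.05]]
    err = [0.01, 0.02, 0.20]
    print(combine_gradients(g, err))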
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
Graph neural networks (GNNs) operate on data represented as graphs and are useful for a wide variety of tasks, from chemical reaction and protein structure prediction to content recommendation systems. However, training on large graphs and improving training performance remain significant challenges. Existing distributed training systems partition a graph among all compute nodes to train on large graphs; however, this incurs a communication overhead that degrades training performance. In this study, to solve these two problems, we propose a scalable data-parallel distributed GNN training system designed to partition a graph redundantly. It is implemented using remote direct memory access (RDMA) and non-blocking active messages to efficiently utilize network performance and to hide communication overhead by overlapping it with the training computation. Experimental results show the strong scalability of the proposed approach, which achieved parallel efficiencies of 0.93 using eight compute nodes for the ogbn-products dataset in the Open Graph Benchmark (OGB) and 0.95 using 32 compute nodes (relative to two compute nodes) for the ogbn-papers100M dataset. The proposed system exhibited training performance 18.9% better than the state-of-the-art DistDGL, even with only a single compute node. The results demonstrate that the proposed approach is a promising method for achieving scalable training performance on large graphs.
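For reference, parallel efficiency as quoted above is usually computed as speedup divided by the increase in node count; a quick back-of-the-envelope check with hypothetical epoch times (not figures from the paper) reproduces a value like 0.93 for eight nodes:

    # efficiency = (baseline_time * baseline_nodes) / (time_on_n_nodes * n_nodes)
    def parallel_efficiency(t_base, n_base, t_n, n):
        return (t_base * n_base) / (t_n * n)

    # Hypothetical epoch times in seconds, chosen only to illustrate the formula:
    print(parallel_efficiency(t_base=800, n_base=1, t_n=107.5, n=8))   # ~0.93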
Modern distributed storage systems with massive data and many storage nodes pose higher requirements on the data placement strategy. Furthermore, with the emergence of new storage devices, heterogeneous storage architectures have become increasingly common and popular. However, traditional strategies expose great limitations in the face of these requirements; in particular, they do not adequately consider the distinct characteristics of heterogeneous storage nodes, which leads to suboptimal performance. In this paper, we present and evaluate RLRP, a deep reinforcement learning (RL) based replica placement strategy. RLRP constructs placement and migration agents through the Deep Q-Network (DQN) model to achieve fair distribution and adaptive data migration. Besides, RLRP provides optimal performance for heterogeneous environments through an attentional Long Short-Term Memory (LSTM) model. Finally, RLRP adopts stagewise training and model fine-tuning to accelerate the training of RL models with large-scale state and action spaces. RLRP is implemented on Park, and the evaluation results indicate that RLRP is a highly efficient data placement strategy for modern distributed storage systems. RLRP can reduce read latency by 10%∼50% in heterogeneous environments compared with existing strategies. In addition, RLRP is used in the real-world system Ceph, improving the read performance of Ceph by 30%∼40%.
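A minimal sketch of how a DQN-style agent can drive replica placement (hypothetical state encoding and network; not the RLRP implementation) is shown below: the state summarizes per-node load and latency, and the action is the index of the node that receives the next replica:

    import random
    import torch
    import torch.nn as nn

    class PlacementQNet(nn.Module):
        def __init__(self, state_dim, num_nodes):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, num_nodes))       # one Q-value per candidate node

        def forward(self, state):
            return self.net(state)

    def choose_node(qnet, state, epsilon=0.1):
        """Epsilon-greedy placement decision over the per-node Q-values."""
        if random.random() < epsilon:
            return random.randrange(qnet.net[-1].out_features)
        with torch.no_grad():
            return int(qnet(state).argmax())

    # Example state: per-node utilization and recent read latency (ms)
    # for a 4-node heterogeneous cluster.
    state = torch.tensor([0.7, 0.2, 0.5, 0.1,
                          2.0, 0.5, 1.0, 8.0])
    qnet = PlacementQNet(state_dim=8, num_nodes=4)
    print("place replica on node", choose_node(qnet, state))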
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
To amortize the cost of MPI communications, distributed parallel HPC applications can overlap network communications with computations in the hope that this improves global application performance. When using this technique, both computations and communications run at the same time. But computation usually also performs some data movement. Since data for computations and data for communications use the same memory system, memory contention may occur when computations are memory-bound and large messages are transmitted through the network at the same time. In this paper we propose a model to predict the memory bandwidth available to computations and to communications when they are executed side by side, according to data locality and taking contention into account. Building the model allowed us to better understand where bottlenecks are located in the memory system and which strategies the memory system applies in case of contention. The model was evaluated on many platforms with different characteristics and showed an average prediction error lower than 4%.
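A deliberately simplified contention model (not the one proposed in the paper) conveys the underlying intuition: when the standalone bandwidth demands of the computation and of the communication exceed what the memory system can sustain, both streams are throttled:

    def shared_bandwidth(compute_demand, comm_demand, peak):
        """All values in GB/s; returns (compute_bw, comm_bw) under contention,
        assuming the available bandwidth is shared proportionally."""
        total = compute_demand + comm_demand
        if total <= peak:
            return compute_demand, comm_demand      # no contention
        scale = peak / total
        return compute_demand * scale, comm_demand * scale

    # Example: a memory-bound kernel (60 GB/s) plus large MPI transfers (30 GB/s)
    # on a socket whose sustainable memory bandwidth is 70 GB/s.
    print(shared_bandwidth(60, 30, 70))             # both streams are slowed down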
The use of accelerators such as GPUs has become mainstream to achieve high performance on modern computing systems. GPUs come with their own (limited) memory and are connected to the main memory of the machine through a bus (with limited bandwidth). When a computation is started on a GPU, the corresponding data needs to be transferred to the GPU before the computation starts. Such data movements may become a bottleneck for performance, especially when several GPUs have to share the communication bus. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has knowledge of all tasks available for processing on a GPU, as well as of their input data dependencies. Hence, it is able to choose which task to allocate to which GPU and to reorder tasks so as to minimize data movements. We focus on this problem of partitioning and ordering tasks that share some of their input data. We present a novel dynamic strategy based on data selection to efficiently allocate tasks to GPUs, together with a custom eviction policy, and compare them to existing strategies using either a well-known graph partitioner or standard scheduling techniques in runtime systems. We also improve an offline scheduler recently proposed for a single GPU by adding load balancing and task stealing capabilities. All strategies have been implemented on top of the StarPU runtime, and we show that our dynamic strategy achieves better performance when scheduling tasks on multiple GPUs with limited memory.
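The data-affinity intuition behind such dynamic strategies can be sketched with a hypothetical helper (this is not StarPU's scheduler API): among the ready tasks, pick the one whose missing inputs would require the least data transfer to the GPU:

    def pick_task(ready_tasks, data_on_gpu, data_size):
        """ready_tasks: {task: set of input data ids}; data_on_gpu: set of data ids
        already resident on this GPU; data_size: {data_id: bytes}.
        Returns the ready task that minimizes the bytes still to be fetched."""
        def missing_bytes(task):
            return sum(data_size[d] for d in ready_tasks[task] - data_on_gpu)
        return min(ready_tasks, key=missing_bytes)

    tasks = {"t1": {"A", "B"}, "t2": {"B", "C"}, "t3": {"C", "D"}}
    on_gpu = {"A", "B"}
    sizes = {"A": 4e6, "B": 4e6, "C": 8e6, "D": 8e6}
    print(pick_task(tasks, on_gpu, sizes))          # -> "t1" (no transfer needed)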
ISBN (print): 9781450382175
Several trends in the IT industry are driving an increasing specialization of the hardware layers. On the one hand, demanding workloads, large data volumes, diversity in data types, etc. are all factors contributing to making general-purpose computing too inefficient. On the other hand, cloud computing and its economies of scale allow vendors to invest in specialized hardware for particular tasks that would otherwise be too expensive or consume resources needed elsewhere. In this talk I will discuss the shift towards hardware acceleration and show with several examples why specialized systems are here to stay and are likely to dominate the computer landscape for years to come. I will also discuss Enzian, an open research platform developed at ETH to enable the exploration of hardware acceleration, and present some preliminary results achieved with it.
Graph streaming has received substantial attention for the past 10+ years to cope with large-scale graph computation. Two major approaches, one using conventional data-streaming tools and the other accessing graph dat...