Balancing robustness and computational efficiency in machine learning models is challenging, especially in settings with limited resources like mobile and IoT devices. This study introduces Adaptive and Localized Adve...
ISBN:
(Print) 9798400701559
Remote direct memory access (RDMA) supports zero-copy networking by transferring data from clients directly to host memory, eliminating the need to copy data between clients' memory and the data buffers in the hosting server. However, the hosting server requires an efficient memory management scheme to handle incoming client data. In this paper, we propose a high-performance host memory management scheme called HM2 for RDMA-enabled distributed systems. We present a new buffer structure for incoming data from clients. In addition, we propose efficient data processing methods to reduce network transfers between clients and servers. We conducted a preliminary experiment to evaluate HM2, and the results showed HM2 achieved higher throughput than existing schemes, including L5 and FaRM.
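The abstract does not describe HM2's buffer structure in detail, but the core idea of pre-registered host memory for incoming RDMA writes can be sketched with a simple fixed-slot pool. Everything below (the `BufferPool` class, `acquire`/`release` names) is a hypothetical illustration of the general technique, not HM2's actual design:

```python
class BufferPool:
    """Hypothetical fixed-slot receive buffer pool. An RDMA host might
    pre-register one contiguous region and hand out fixed-size slots to
    incoming client writes, avoiding per-message allocation and copies."""

    def __init__(self, num_slots, slot_size):
        self.slot_size = slot_size
        self.free = list(range(num_slots))              # indices of free slots
        self.memory = bytearray(num_slots * slot_size)  # one contiguous region

    def acquire(self):
        """Reserve a slot for an incoming write; returns (slot_id, byte_offset)
        into the registered region, or None when the pool is exhausted."""
        if not self.free:
            return None
        slot = self.free.pop()
        return slot, slot * self.slot_size

    def release(self, slot):
        """Return a slot to the pool once the server has consumed its data."""
        self.free.append(slot)
```

A real implementation would register `memory` with the RDMA NIC and advertise the slot offsets to clients; the pool here only shows the allocation discipline.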
ISBN:
(Print) 9798350304817
Distributed data-parallel (DDP) training improves overall application throughput as multiple devices train on a subset of data and aggregate updates to produce a globally shared model. The periodic synchronization at each iteration incurs considerable overhead, exacerbated by the increasing size and complexity of state-of-the-art neural networks. Although many gradient compression techniques propose to reduce communication cost, the ideal compression factor that leads to maximum speedup or minimum data exchange remains an open-ended problem since it varies with the quality of compression, model size and structure, hardware, network topology and bandwidth. We propose GraVAC, a framework to dynamically adjust compression factor throughout training by evaluating model progress and assessing gradient information loss associated with compression. GraVAC works in an online, black-box manner without any prior assumptions about a model or its hyperparameters, while achieving the same or better accuracy than dense SGD (i.e., no compression) in the same number of iterations/epochs. As opposed to using a static compression factor, GraVAC reduces end-to-end training time for ResNet101, VGG16 and LSTM by 4.32x, 1.95x and 6.67x respectively. Compared to other adaptive schemes, our framework provides 1.94x to 5.63x overall speedup.
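The idea of adapting the compression factor to measured gradient information loss can be sketched with top-k sparsification and a simple feedback rule. This is a minimal illustration of the general mechanism, not GraVAC's actual controller; the function names, the energy-retention metric, and the `retain_target` threshold are all assumptions:

```python
import numpy as np

def topk_compress(grad, k):
    """Top-k gradient sparsification: keep the k largest-magnitude
    entries, zero the rest."""
    idx = np.argsort(np.abs(grad))[-k:]
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

def adjust_k(grad, k, retain_target=0.9):
    """Hypothetical feedback rule: measure how much gradient energy
    (L2 norm) the compressed gradient retains, then loosen compression
    (grow k) if too much is lost, or tighten it (shrink k) otherwise."""
    retained = np.linalg.norm(topk_compress(grad, k)) / np.linalg.norm(grad)
    if retained < retain_target:
        return min(len(grad), k * 2)
    return max(1, k // 2)
```

In a DDP loop, each worker would compress its gradient with the current `k`, all-reduce only the sparse entries, and periodically call `adjust_k` to rebalance communication savings against information loss.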
In this paper, we present a novel unified framework that seamlessly integrates distributed computing and high-density graph computing. Our approach leverages a hybrid architecture that combines the strengths of both p...
This article examines directions and mechanisms for increasing data reliability in computer networks. Currently, the rapid development of information technologies, the rapid growth of data flow, high-quality data proc...
Serverless computing has shown vast potential for big data analytics applications, especially involving machine learning algorithms. Nevertheless, little consideration has been given in the literature to cloud-agnosti...
ISBN:
(Print) 9798350364613; 9798350364606
Community detection is a fundamental operation in graph mining, and by uncovering hidden structures and patterns within complex systems it helps solve fundamental problems pertaining to social networks, such as information diffusion, epidemics, and recommender systems. Scaling graph algorithms for massive networks becomes challenging on modern distributed-memory multi-GPU (Graphics Processing Unit) systems due to limitations such as irregular memory access patterns, load imbalances, higher communication-computation ratios, and cross-platform support. We present a novel algorithm HiPDPL-GPU (distributed parallel Louvain) to address these challenges. We conduct experiments involving different partitioning techniques to achieve optimized performance of HiPDPL-GPU on the two largest supercomputers: Frontier and Summit. Remarkably, HiPDPL-GPU processes a graph with 4.2 billion edges in less than 3 minutes using 1024 GPUs. Qualitatively, the performance of HiPDPL-GPU is similar to or better than other state-of-the-art CPU- and GPU-based implementations. While prior GPU implementations have predominantly employed CUDA, our first-of-its-kind implementation for community detection is cross-platform, accommodating both AMD and NVIDIA GPUs.
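Louvain-style community detection optimizes Newman modularity, Q = (1/2m) Σij [Aij − ki·kj/(2m)] δ(ci, cj). As a small reference for the objective the abstract's algorithm scales to billions of edges, here is a straightforward sequential modularity computation for an undirected, unweighted edge list (the function name and input format are illustrative, unrelated to HiPDPL-GPU's code):

```python
def modularity(edges, community):
    """Newman modularity Q of a partition, for an undirected,
    unweighted graph given as a list of (u, v) edges and a dict
    mapping each node to its community label."""
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Fraction of edges that fall inside a community (the A_ij term).
    within = sum(1 for u, v in edges if community[u] == community[v])
    # Sum of degrees per community (for the expected-edges k_i*k_j term).
    comm_deg = {}
    for node, k in deg.items():
        c = community[node]
        comm_deg[c] = comm_deg.get(c, 0) + k
    return within / m - sum(s * s for s in comm_deg.values()) / (4 * m * m)
```

Louvain repeatedly moves nodes between communities to increase this Q, then contracts each community into a super-node and repeats; the distributed multi-GPU challenge lies in partitioning the graph so those moves can proceed in parallel.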
ISBN:
(Print) 9798400701559
The serverless computing model has been on the rise in recent years due to a lower barrier to entry and elastic scalability. However, our experimental evidence suggests that multiple serverless computing platforms suffer from serious performance inefficiencies when a high number of concurrent function instances are invoked, which is a desirable capability for parallel applications. To mitigate this challenge, this paper introduces ProPack, a novel solution that provides higher performance and yields cost savings for end users running applications with high concurrency. ProPack leverages insights obtained from an experimental study to build a simple and effective analytical model that mitigates the scalability bottleneck. Our evaluation on multiple serverless platforms including AWS Lambda and Google confirms that ProPack can improve average performance by 85% and save cost by 66%. ProPack provides significant improvement (over 50%) over state-of-the-art serverless workload managers such as Pywren, and is also effective at mitigating the concurrency bottleneck for FuncX, a recent on-premise serverless execution platform for parallel applications.
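One way to mitigate a platform's concurrency bottleneck is to pack several logical tasks into each function invocation so that the number of simultaneously running instances stays below the point where performance degrades. The sketch below shows only that general packing idea under an assumed per-platform concurrency limit; it is not ProPack's analytical model, and `pack_tasks`/`concurrency_limit` are hypothetical names:

```python
import math

def pack_tasks(tasks, concurrency_limit):
    """Group tasks into batches so that at most `concurrency_limit`
    function instances run at once, each instance processing one batch
    sequentially. Returns the list of batches."""
    per_instance = math.ceil(len(tasks) / concurrency_limit)
    return [tasks[i:i + per_instance]
            for i in range(0, len(tasks), per_instance)]
```

For example, 10 tasks with a limit of 3 become 3 batches of at most 4 tasks each; a system like ProPack would instead derive the limit from a measured model of the platform's scaling behavior rather than a fixed constant.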
Big data technology is increasingly penetrating various industries, bringing unprecedented opportunities to enterprises and society with its powerful data processing and analysis capabilities. At the same time, the ra...
Recent advances in imaging and computing technology generate tremendous image data daily. Searching image collections has been made easier with the introduction of some content-based image retrieval (CBIR) approaches....