ISBN (print): 9798400701559
Checkpointing is an I/O-intensive operation increasingly used by High-Performance Computing (HPC) applications to revisit previous intermediate datasets at scale. Unlike resilience, where only the last checkpoint is needed for application restart and is rarely read except to recover from failures, this scenario requires optimizing frequent reads and writes over an entire history of checkpoints. State-of-the-art checkpointing approaches often rely on asynchronous multi-level techniques to hide I/O overheads by writing to fast local tiers (e.g., an SSD) and asynchronously flushing to slower, potentially remote tiers (e.g., a parallel file system) in the background while the application keeps running. However, such approaches have two limitations. First, although HPC infrastructures routinely rely on accelerators (e.g., GPUs), so that most checkpoints involve GPU memory, efficient asynchronous data movement between GPU memory and host memory lags behind. Second, revisiting previous data often involves predictable access patterns, which are not exploited to accelerate read operations. In this paper, we address these limitations by proposing a scalable and asynchronous multi-level checkpointing approach optimized for both reading and writing of an arbitrarily long history of checkpoints. Our approach treats GPU memory as a first-class citizen in the multi-level storage hierarchy to enable informed caching and prefetching of checkpoints, leveraging foreknowledge about the access order passed by the application as hints. Our evaluation across a variety of scenarios under I/O concurrency shows up to 74x faster checkpoint and restore throughput compared with a state-of-the-art runtime and optimized unified virtual memory (UVM) based prefetching strategies, and at least 2x shorter I/O wait time for the application across various workloads and configurations.
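The abstract describes hint-driven caching and prefetching of checkpoints across a GPU/host/parallel-file-system hierarchy, but not its interface. The Python sketch below illustrates the general idea only; every name here (CheckpointCache, set_access_hints, the tier sizes) is hypothetical and not the paper's actual API.

```python
# Minimal sketch (not the paper's API) of hint-driven checkpoint prefetching
# across a multi-level hierarchy: GPU memory -> host memory -> parallel file
# system (PFS). All class and method names are hypothetical illustrations.
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

class CheckpointCache:
    def __init__(self, gpu_slots=2, host_slots=8):
        self.capacity = {"gpu": gpu_slots, "host": host_slots}
        self.resident = {"gpu": OrderedDict(), "host": OrderedDict()}  # id -> data
        self.pfs = {}                    # stand-in for the parallel file system
        self.pool = ThreadPoolExecutor(max_workers=2)  # background flush/prefetch
        self.hints = []                  # future access order supplied by the app

    def set_access_hints(self, ordered_ids):
        """Application foreknowledge: the order checkpoints will be revisited."""
        self.hints = list(ordered_ids)

    def write(self, ckpt_id, data):
        """Write to the fastest tier, then flush down asynchronously."""
        self._insert("gpu", ckpt_id, data)
        self.pool.submit(self._flush_down, ckpt_id, data)

    def read(self, ckpt_id):
        """Serve from the highest tier holding the data, then prefetch the
        next hinted checkpoint in the background."""
        for tier in ("gpu", "host"):
            if ckpt_id in self.resident[tier]:
                data = self.resident[tier][ckpt_id]
                break
        else:
            data = self.pfs[ckpt_id]     # slow path: fetch from the PFS
        self._prefetch_next(ckpt_id)
        return data

    def _insert(self, tier, ckpt_id, data):
        cache = self.resident[tier]
        cache[ckpt_id] = data
        cache.move_to_end(ckpt_id)
        while len(cache) > self.capacity[tier]:
            cache.popitem(last=False)    # evict the oldest entry in this tier

    def _flush_down(self, ckpt_id, data):
        self._insert("host", ckpt_id, data)
        self.pfs[ckpt_id] = data         # durable copy on the slowest tier

    def _prefetch_next(self, ckpt_id):
        if ckpt_id in self.hints:
            idx = self.hints.index(ckpt_id)
            for nxt in self.hints[idx + 1: idx + 2]:
                if nxt in self.pfs and nxt not in self.resident["host"]:
                    self.pool.submit(self._insert, "host", nxt, self.pfs[nxt])
```

In this sketch, an application would call set_access_hints with the order in which it plans to revisit checkpoints, so each read of checkpoint k can trigger a background promotion of checkpoint k+1 from the PFS into host memory before it is requested.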
ISBN (digital): 9781665481373
ISBN (print): 9781665481373
Deep learning models span a wide spectrum of GPU execution times and memory footprints. When scheduling distributed training jobs, however, these characteristics are typically not taken into account, which leads to high variance in job completion time (JCT). Moreover, jobs often hit the GPU out-of-memory (OoM) problem, forcing the affected job to restart from scratch. To address these problems, we propose Xonar, which profiles deep learning jobs and orders them in the queue accordingly. Experiments show that Xonar with TensorFlow v1.6 reduces tail JCT by 44% while eliminating the OoM problem.
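As a rough illustration of the profile-then-order idea this abstract describes (not Xonar's actual algorithm or API), the sketch below estimates each job's GPU memory and step time from a short profiling run, rejects jobs that would exceed device memory, and orders the remaining queue by estimated runtime. All class names and numbers are invented for the example.

```python
# Hedged sketch of profile-guided job ordering: skip jobs that cannot fit in
# GPU memory (avoiding OoM restarts) and run shorter jobs first to curb tail
# job completion time. Policy and names are illustrative, not Xonar's.
from dataclasses import dataclass, field
from typing import List

@dataclass
class JobProfile:
    name: str
    est_step_time_s: float   # measured on a short profiling run
    est_gpu_mem_gb: float    # peak memory observed while profiling
    steps: int = 1000

    @property
    def est_runtime_s(self) -> float:
        return self.est_step_time_s * self.steps

@dataclass
class Scheduler:
    gpu_mem_gb: float
    queue: List[JobProfile] = field(default_factory=list)

    def submit(self, job: JobProfile) -> bool:
        # Reject (or redirect to a larger GPU pool) jobs that would hit OoM.
        if job.est_gpu_mem_gb > self.gpu_mem_gb:
            print(f"{job.name}: needs {job.est_gpu_mem_gb} GB, only "
                  f"{self.gpu_mem_gb} GB available; not enqueued")
            return False
        self.queue.append(job)
        # Shortest estimated runtime first reduces queueing-delay variance.
        self.queue.sort(key=lambda j: j.est_runtime_s)
        return True

sched = Scheduler(gpu_mem_gb=16)
sched.submit(JobProfile("resnet50", est_step_time_s=0.30, est_gpu_mem_gb=9))
sched.submit(JobProfile("bert-large", est_step_time_s=0.85, est_gpu_mem_gb=14))
sched.submit(JobProfile("gpt2-xl", est_step_time_s=1.20, est_gpu_mem_gb=22))
print([j.name for j in sched.queue])   # -> ['resnet50', 'bert-large']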
The multiplier is an important component of a processor's computing unit. Multiplication, multiply-add, and multiply-subtract operations are widely used in various signal processing algorithms...
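As a small aside illustrating why these operations dominate signal processing workloads (not taken from the paper), the inner loop of an FIR filter is essentially a chain of multiply-accumulate operations, as sketched below.

```python
# Tiny illustration of multiply-accumulate in signal processing: each FIR
# filter output sample is a running sum of coefficient-sample products.
def fir_filter(samples, coeffs):
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]   # one multiply-add per tap
        out.append(acc)
    return out

print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.25]))
```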
In this study, we present a distributed Intelligent Video Surveillance (DIVS) system that is deployed in an edge-computing environment and based on Deep Learning (DL). For the DIVS system, we developed a distributed...
Traditional graph-processing algorithms have been widely used in Graph Neural Networks (GNNs). This combination has shown state-of-the-art performance in many real-world network mining tasks. Current approaches to gra...
Deep Neural Network (DNN) models have been widely utilized in various applications. However, the growing complexity of DNNs has led to increased challenges and prolonged training durations. Despite the availability of...
Blockchain and federated learning, as two key technologies for trusted and privacy-preserving collaboration in distributed environments, have been intensively studied in recent years. Federated learning aims to train ...
Large models have achieved impressive performance in many downstream tasks. Using pipeline parallelism to fine-tune large models on commodity GPU servers is an important way to make the excellent performance of large ...
We present two algorithms in the Quantum CONGEST-CLIQUE model of distributed computation that succeed with high probability; one for producing an approximately optimal Steiner Tree, and one for producing an exact direc...