A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow resumption of processing at a later time. Checkpointing is the process of saving this status information. This paper surveys the algorithms reported in the literature for checkpointing parallel/distributed systems. Most of the algorithms published for checkpointing in message-passing systems are based on the seminal article by Chandy and Lamport. A large number of articles in this area relax the assumptions made in that article and extend it to minimise the overheads of coordination and context saving. Checkpointing algorithms for shared memory systems primarily extend cache coherence protocols to maintain a consistent memory. All of them assume that the main memory is safe for storing the context. Recently, algorithms have been published for distributed shared memory systems, which extend the cache coherence protocols used in shared memory systems. They also, however, include methods for storing the status of distributed memory in stable storage. Most of the algorithms assume no knowledge of the programs being executed. However, in developing parallel programs the user has to do a fair amount of work in distributing tasks, and this information can be used effectively to simplify checkpointing and rollback recovery.
ISBN: 9781538616291 (print)
Replication is useful for supporting fault-tolerant, reliable, and recovery-oriented distributed systems. Popular application areas include databases, P2P systems, web services, and the Internet of Things. In this study, we propose using checkpointing to improve the efficiency of the well-known primary-backup replication protocol in distributed systems. We developed a software framework based on an in-memory replicated key-value store to evaluate various checkpointing algorithms. Using the framework over geographically distributed nodes of the PlanetLab platform, we performed extensive experiments and analysis with several metrics, including blocking time, checkpointing time, checkpoint size, and recovery time. The experimental scenarios use the well-known benchmarking tool YCSB to perform realistic read/update queries through exemplary workloads. Our findings indicate that incremental checkpointing applied periodically is the most efficient approach, achieving up to 30 times higher system throughput and a 50% decrease in average blocking time compared to traditional primary-backup replication and other checkpointing algorithms.
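To make the idea of incremental checkpointing in a key-value store concrete, the following is a minimal sketch, not the framework described in the abstract. The class `CheckpointedStore` and all of its method names are hypothetical, and the sketch assumes stored values are immutable so that shallow copies are enough. It tracks which keys changed since the last checkpoint and saves only those, then rebuilds the state by replaying the checkpoint chain.

```python
class CheckpointedStore:
    """Illustrative in-memory key-value store (hypothetical, not the paper's code)."""

    def __init__(self):
        self.data = {}
        self.dirty = set()     # keys changed since the last checkpoint
        self.checkpoints = []  # chain of ("full" | "incr", payload) entries

    def put(self, key, value):
        self.data[key] = value
        self.dirty.add(key)

    def delete(self, key):
        if key in self.data:
            del self.data[key]
            self.dirty.add(key)

    def full_checkpoint(self):
        # Save a copy of the entire store.
        self.checkpoints.append(("full", dict(self.data)))
        self.dirty.clear()

    def incremental_checkpoint(self):
        # Save only the keys touched since the previous checkpoint.
        # Each entry is (present, value); present=False records a deletion.
        delta = {
            k: ((True, self.data[k]) if k in self.data else (False, None))
            for k in self.dirty
        }
        self.checkpoints.append(("incr", delta))
        self.dirty.clear()
        return len(delta)

    @staticmethod
    def recover(checkpoints):
        # Replay the chain: a full checkpoint resets the state,
        # an incremental one applies its delta on top.
        state = {}
        for kind, payload in checkpoints:
            if kind == "full":
                state = dict(payload)
            else:
                for k, (present, v) in payload.items():
                    if present:
                        state[k] = v
                    else:
                        state.pop(k, None)
        return state
```

In this sketch the size of an incremental checkpoint grows with the number of keys updated since the previous checkpoint rather than with the size of the store, which is the intuition behind smaller checkpoints and shorter blocking; the price is that recovery has to replay the chain.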
ISBN: 9781479942497 (print)
Autonomic Computing Systems aim to avoid human intervention and to enable distributed systems to manage themselves. One of their challenges is efficient runtime monitoring that collects information from which the system can automatically repair itself in case of failure. Quasi-synchronous checkpointing is a well-known technique that allows processes to recover in spite of failures. Based on this technique, several checkpointing algorithms have been developed. According to the checkpoint properties they detect and ensure, they are classified as Strictly Z-Path Free (SZPF), Z-Path Free (ZPF), and Z-Cycle Free (ZCF). In the literature, simulation has been the method adopted for the performance evaluation of checkpointing algorithms. However, few works have been designed to validate their correctness. In this paper, we propose a validation approach based on graph transformation to automatically detect the aforementioned checkpointing properties. To achieve this, we take the vector clocks resulting from the algorithm execution and model them as a causal graph. We then design and apply transformation rules to verify whether, in such a causal graph, the algorithm is free of undesirable patterns, such as Z-paths or Z-cycles, depending on the case.
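As a small illustration of how vector clocks expose causal dependencies between checkpoints, the sketch below checks whether one checkpoint per process forms a consistent global checkpoint. It is not the graph-transformation approach of the abstract and it does not detect Z-paths or Z-cycles. It only applies the standard vector-clock condition, under the assumption that component `i` of a clock counts events of process `i`. The function name `is_consistent_cut` is hypothetical.

```python
def is_consistent_cut(clocks):
    """Return True if the checkpoints form a consistent global checkpoint.

    clocks[i] is the vector clock of process i at its checkpoint.
    The cut is consistent when no process j has seen more events of
    process i than process i itself had executed at its own checkpoint,
    i.e. clocks[j][i] <= clocks[i][i] for every pair (i, j).
    """
    n = len(clocks)
    return all(
        clocks[j][i] <= clocks[i][i]
        for i in range(n)
        for j in range(n)
    )


# Process 1 knows about 1 event of process 0, which had executed 2: consistent.
consistent = is_consistent_cut([[2, 0], [1, 3]])

# Process 1 knows about 2 events of process 0, which had executed only 1:
# a message was received before the cut but sent after it (an orphan message).
inconsistent = is_consistent_cut([[1, 0], [2, 3]])
```

A violation of this condition corresponds to a direct causal dependency between two checkpoints; Z-paths generalise this to zigzag sequences of messages that need not be causal, which is why they call for the richer graph analysis the abstract describes.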
The adjoint code of a nonlinear computer model calculates gradients along a trajectory that has to be known at integration time. When storing the whole trajectory requires too much memory, the calculation of the adjoint code is split and done part by part from restart points called checkpoints. Griewank proposed a checkpointing method named Revolve, which provides optimal logarithmic behavior with respect to time and memory requirements. In this work, some checkpointing schedules are proposed. Some of them correspond to special cases of Revolve. The user's preference is essential in choosing between time and memory requirements. This is a key point for adjoint codes of temporal models such as the meteorological model Meso-NH, which may be used for weather forecasts. When computational time is the top priority, a particular checkpointing scheme allows computation of the adjoint code with at most one extra integration of the model. The memory requirement then behaves as the square root of the number of iterations of the model. The checkpointing schemes are tested on adjoint simulations of Meso-NH.
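The trade-off between one extra integration and square-root memory can be illustrated with a uniform checkpointing schedule. The sketch below is an assumption-laden toy, not Revolve and not necessarily the schedule proposed in the paper: the scalar logistic model in `step`, its derivative `dstep`, and the function names are all hypothetical stand-ins for a real model and its adjoint. Checkpoints are stored every `s = isqrt(n)` steps during the forward sweep, and each segment is recomputed once during the reverse sweep.

```python
import math


def step(x, dt=0.01):
    # One iteration of a toy nonlinear model (logistic growth, an assumption).
    return x + dt * x * (1.0 - x)


def dstep(x, dt=0.01):
    # Derivative of step with respect to x (the scalar "adjoint operator").
    return 1.0 + dt * (1.0 - 2.0 * x)


def adjoint_full(x0, n):
    # Reference: store the whole trajectory, then sweep backwards.
    traj = [x0]
    for _ in range(n):
        traj.append(step(traj[-1]))
    lam = 1.0  # d x_n / d x_n
    for k in reversed(range(n)):
        lam *= dstep(traj[k])
    return lam  # d x_n / d x_0


def adjoint_checkpointed(x0, n):
    s = max(1, math.isqrt(n))  # segment length, about sqrt(n)
    checkpoints = {}
    x = x0
    for k in range(n):  # forward sweep: keep every s-th state only
        if k % s == 0:
            checkpoints[k] = x
        x = step(x)

    lam = 1.0
    extra_steps = 0  # recomputed model steps
    peak = len(checkpoints)  # peak number of stored states
    for start in sorted(checkpoints, reverse=True):
        end = min(start + s, n)
        seg = [checkpoints[start]]  # states x_start .. x_{end-1}
        for _ in range(end - start - 1):
            seg.append(step(seg[-1]))
            extra_steps += 1
        peak = max(peak, len(checkpoints) + len(seg))
        for xk in reversed(seg):  # adjoint sweep over this segment
            lam *= dstep(xk)
    return lam, extra_steps, peak
```

In this sketch each segment is recomputed exactly once, so the number of recomputed steps stays below `n` (less than one extra integration), while the number of states held at any time is roughly the number of checkpoints plus one segment, that is, about `2 * sqrt(n)`.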
ISBN: 9780818671203 (print)
In this paper, we present a hybrid algorithm that uses Time Warp between clusters of LPs and a sequential algorithm within each cluster. Time Warp is, of course, traditionally implemented between individual LPs. The algorithm was implemented in a digital logic simulator, and its performance was compared to that of Time Warp. [...] upon this platform, we develop a family of three checkpointing algorithms, each of which occupies a different point in the spectrum of possible trade-offs between memory usage and execution time. The algorithms were run on several digital logic circuits, and their speed, number of states saved, and maximal memory consumption were compared to those of Time Warp. One of the algorithms saved between 35% and 50% of the maximal memory consumed by Time Warp (depending on the number of processors used), while the other two decreased the maximal usage by up to 30%. The latter two algorithms exhibited a speed comparable to Time Warp, while the first algorithm was 30-60% [...] algorithms are also simpler to implement than optimal checkpointing algorithms.