Existing IEEE software reliability standards do not address the characteristics of distributed systems, including client-server systems. Furthermore, these standards were issued before the widespread application of COTS and safety-critical systems. In addition, these standards do not take into account the influence on reliability of such process improvement measures as inspections, reuse, and object-oriented design paradigms. Lastly, these standards do not consider both hardware and software reliability, nor do they include availability and maintainability. To be of value, the next generation of dependability standards must address these deficiencies. With the active participation of the audience, the panel will identify and debate the future direction of dependability standards.
ISBN: 9781467399418 (print)
The challenging new era of scientific data management in the coming decade, named "Big Data", requires giant complexes for distributed computing and corresponding grid-cloud internet services. Known common approaches to software reliability, based on probability theory or on considering software as an open non-equilibrium dynamic system, cannot conform to advanced grid-cloud software management systems. Therefore, to provide the optimality and reliability of such sophisticated systems, we choose the imitative simulation method oriented on knowledge of the dynamics of the system's functioning. A new grid and cloud service simulation system was developed in the JINR Dubna Laboratory of Information Technologies; it focuses on improving the efficiency and reliability of grid-cloud systems development by using work quality indicators of a real system to design and predict its evolution. For these purposes, the simulation program is combined with the real monitoring system of the grid-cloud service through a special database. Some examples of applying the program to simulate a sufficiently general cloud structure, which can be used for more common purposes, are given.
ISBN: 0818608757 (print)
The design and implementation of an experimental fault-tolerant distributed database management system is described. The system provides a logically integrated view of data with distribution transparency and controlled data replication. A commitment protocol used to guarantee atomicity of update operations is discussed. Efficient algorithms used to recover a site from a failure and restore data consistency are described. Recovery can be interleaved with the processing of regular database transactions and does not seriously limit the availability of data. The proposed solutions to the problems of fault recovery are designed to take advantage of the properties of a high-bandwidth local area network.
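The abstract does not name its commitment protocol. As a rough illustration of how a commitment protocol can make a replicated update atomic, the following minimal sketch assumes classic two-phase commit; the `Site` class, its methods, and `two_phase_commit` are hypothetical names, and the sketch runs in memory with no logging, timeouts, or recovery.

```python
from enum import Enum
from typing import Dict, List, Optional


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Site:
    """Hypothetical replica site; not taken from the described system."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit  # simulates a site that must refuse
        self.data: Dict[str, str] = {}
        self._pending: Optional[Dict[str, str]] = None

    def prepare(self, updates: Dict[str, str]) -> Vote:
        # Phase 1: stage the update and vote, without making it visible.
        if not self.can_commit:
            return Vote.NO
        self._pending = dict(updates)
        return Vote.YES

    def commit(self) -> None:
        # Phase 2 (commit): make the staged update visible.
        if self._pending is not None:
            self.data.update(self._pending)
            self._pending = None

    def abort(self) -> None:
        # Phase 2 (abort): discard the staged update.
        self._pending = None


def two_phase_commit(sites: List[Site], updates: Dict[str, str]) -> bool:
    """Apply updates at every site or at none; return True on commit."""
    votes = [site.prepare(updates) for site in sites]
    if all(vote is Vote.YES for vote in votes):
        for site in sites:
            site.commit()
        return True
    for site in sites:
        site.abort()
    return False


healthy = [Site("a"), Site("b")]
committed = two_phase_commit(healthy, {"x": "1"})

mixed = [Site("c"), Site("d", can_commit=False)]
aborted = two_phase_commit(mixed, {"x": "1"})
```

In the second run a single NO vote aborts the update everywhere, so no site ends up holding a partial result.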
ISBN: 9780769542508 (print)
This paper presents an instance-based approach to diagnosing failures in computing systems. Because a large portion of failures are recurrences of earlier ones, our method takes advantage of past experience by storing historical failures in a database and retrieving similar instances when a failure occurs. We extract the system 'invariants' by modeling consistent dependencies between system attributes during operation, and construct a network graph based on the learned invariants. When a failure happens, the status of the invariant network, i.e., whether each invariant link is broken or not, provides a view of the failure characteristics. We use a high-dimensional binary vector to store this failure evidence, and develop a novel algorithm to efficiently retrieve failure signatures from the database. Experimental results in a web-based system have demonstrated the effectiveness of our method in diagnosing the injected failures.
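To make the signature idea concrete, the sketch below encodes the broken/unbroken state of each invariant link as one bit and looks up the closest stored failures by Hamming distance. This brute-force nearest-neighbour search is an assumption chosen for illustration, not the paper's retrieval algorithm, and the failure labels and function names are hypothetical.

```python
from typing import Dict, List, Tuple


def to_signature(broken_links: List[bool]) -> int:
    """Pack per-invariant 'broken' flags into one integer bitmask."""
    signature = 0
    for index, broken in enumerate(broken_links):
        if broken:
            signature |= 1 << index
    return signature


def hamming(a: int, b: int) -> int:
    """Count the invariant links whose state differs between signatures."""
    return bin(a ^ b).count("1")


def retrieve(query: int, history: Dict[str, int], k: int = 1) -> List[Tuple[str, int]]:
    """Return the k stored failures closest to the query signature."""
    ranked = sorted(
        ((label, hamming(query, sig)) for label, sig in history.items()),
        key=lambda pair: (pair[1], pair[0]),  # distance first, label breaks ties
    )
    return ranked[:k]


# Hypothetical history of labelled failures over six invariant links.
history = {
    "db_connection_pool_exhausted": to_signature([True, True, False, False, False, False]),
    "web_server_memory_leak": to_signature([False, False, True, True, False, False]),
    "network_partition": to_signature([True, False, False, False, True, True]),
}

observed = to_signature([True, True, False, False, False, True])
best = retrieve(observed, history, k=2)
```

A linear scan like this costs one XOR and one bit count per stored signature; an indexed search would be needed once the history grows large.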
The proposed coordinator log transaction execution protocol centralizes logging on a per-transaction basis and exploits piggybacking to provide the semantics of a distributed atomic commit without the associated costs. This protocol eliminates two rounds of messages (one phase) from the presumed commit protocol and dramatically reduces the number of log forces needed for distributed atomic commit. The authors compare the coordinator log transaction execution protocol with existing protocols, describe when it is desirable, and discuss how it affects the write-ahead log protocol and the database crash recovery algorithm.
ISBN: 9781665451321 (print)
Distributed deep reinforcement learning (DDRL) has been used in distributed systems to improve their adaptability. However, DDRL-based systems are also inevitably under threat from Byzantine workers. There is an urgent need to enhance the resilience of DDRL-based systems against Byzantine failures. This paper proposes a resilient mechanism for mitigating the influence of Byzantine workers on DDRL-based systems. First, we formalize the DDRL-based system as a multi-armed bandit model to capture the collective effect of workers on the whole learning process, and then transform the resilient mechanism design problem into a sampling policy optimization problem. Second, we propose a self-adaptation process for filtering out the harmful data generated by Byzantine workers and give a theoretical analysis demonstrating its effectiveness under ideal conditions. Third, based on a typical DDRL-based system (i.e., Asynchronous Advantage Actor-Critic, A3C), we implement a resilient distributed A3C (ReD-A3C). With extensive experiments on DDRL benchmark tasks, we show that ReD-A3C outperforms available Byzantine-tolerant approaches.
ISBN: 9781728142227 (print)
Redundant array of independent disks (RAID) has been widely used to address the reliability and performance issues of storage systems. As the scale of modern storage systems continues to grow, disk failure becomes the norm. With ever-increasing disk capacity, RAID recovery based on disk rebuild becomes more and more costly, which causes significant performance degradation and even unavailability of storage systems. Declustered data layout enables parallel RAID reconstruction by shuffling data and parity blocks among all drives (including spares) in a RAID group. However, the reliability and performance of declustered RAID in real-world storage environments have not been thoroughly studied. Given the popularity of the ZFS file system and software RAID in production data centers, in this paper we extensively evaluate declustered RAID with regard to RAID recovery time and I/O performance on a high-performance storage platform at Los Alamos National Laboratory. Our empirical study reveals advantages and disadvantages of the declustered RAID technology. We qualitatively characterize the recovery performance of declustered RAID and compare it with that of ZFS RAIDZ under various I/O workloads and access patterns. The experimental results show that the speedup of declustered RAID over traditional RAID is sublinear in the parallelism of recovery I/O. Furthermore, we formally model and analyze the reliability of declustered RAID in terms of the mean-time-to-data-loss (MTTDL) and find that the improved recovery performance leads to higher storage reliability compared with traditional RAID.
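The link between recovery time and MTTDL can be illustrated with the textbook approximation for a single-parity array, MTTDL ≈ MTTF² / (N · (N − 1) · MTTR), which assumes independent, exponentially distributed disk failures and a repair time much shorter than the disk lifetime. This is not the paper's model, and it does not capture how a declustered layout changes which second failure causes data loss; the function name and the numbers below are illustrative assumptions only.

```python
def mttdl_single_parity(n_disks: int, mttf_hours: float, mttr_hours: float) -> float:
    """Textbook MTTDL approximation for an array that tolerates one disk failure.

    Data is lost when a second disk fails while the first is still rebuilding,
    so MTTDL scales inversely with the rebuild time (MTTR).
    """
    if n_disks < 2:
        raise ValueError("a parity array needs at least two disks")
    if mttf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTTF and MTTR must be positive")
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)


# Illustrative values, not measurements from the paper.
slow_rebuild = mttdl_single_parity(n_disks=10, mttf_hours=1_000_000, mttr_hours=24)
fast_rebuild = mttdl_single_parity(n_disks=10, mttf_hours=1_000_000, mttr_hours=6)
improvement = fast_rebuild / slow_rebuild
```

Under this approximation, cutting the rebuild time by a factor of four raises MTTDL by the same factor, which matches the direction of the abstract's finding that faster recovery improves reliability.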
Distributed synchronization for data sharing is discussed, and the design of a distributed lock manager for the Camelot transaction facility is presented. The lock manager is a component of a proposed implementation of data sharing in the Camelot environment. A number of experiments that demonstrate the correct operation of the lock manager are reported, and its performance is described. The performance metrics indicate that distributed lock management should not reduce the feasibility of data sharing in this environment. The similarity between the caching and synchronization strategies appropriate for locks and data suggests that protocols developed for distributed locks will be applicable to data sharing.
ISBN: 0818608757 (print)
The authors propose two protocols for transaction processing in quasi-partitioned databases. The protocols are pessimistic in that they permit the execution of update transactions in exactly one partition. The first protocol is defined for a fully partition-replicated database in which every partition contains a copy of every data object. The second protocol is defined for a partially partition-replicated database in which some objects have no copies in some partitions. For both protocols, the major performance measures improve linearly with the backup link speed but are not visibly affected by either the duration of the partitioning or the database size. This is a desirable property, since the backup link speed is the only controllable parameter.
ISBN: 0818606908 (print)
One property that makes failures difficult to handle in programs is that the actions of a failed component may occur asynchronously with respect to execution of the program. In this study, an approach to dealing with this asynchrony is presented. It is based on treating a failure as an event in a concurrent system of processes, and then integrating failure handling mechanisms into distributed programming languages. The technique is illustrated by considering the class of failures suffered by fail-stop processors, and proposing extensions of the Synchronizing Resources (SR) distributed programming language to handle such failures. Two SR programs using these mechanisms are presented.