ISBN (print): 0769520693
As software distributed shared memory (DSM) systems become attractive on larger clusters, the focus of attention moves toward improving the reliability of these systems. In this paper, we propose a lightweight logging scheme, called remote logging, and a recovery protocol for home-based DSM. Remote logging stores coherence-related data in the volatile memory of a remote node. The logging overhead can be moderated with a high-speed system area network and the user-level DMA operations supported by modern communication protocols. Remote logging tolerates multiple failures as long as the backup nodes of the failed nodes are alive, which considerably improves the reliability of the DSM. Experimental results show that our fault-tolerant DSM has low overhead compared to conventional stable logging and that it can be effectively recovered from some concurrent failures.
ISBN (print): 0818607491
The following topics are dealt with: real-time distributed programming systems; architecture and interconnection schemes; fault tolerance; reliability estimation and performance modeling; performance analysis; intercommunication protocols; operative systems; dynamic and distributed scheduling; task allocation and load balancing; a real-time operating system for a nuclear power plant computer; a real-time juggling robot; and real-time direct kinematics on a VLSI chip. 30 papers were presented, all of which are published in full in the present proceedings. Abstracts of individual papers can be found under the classification codes in this or other issues.
The proceedings contain 27 papers from the 1996 IEEE Real-Time Technology and Applications Symposium. Topics discussed include case studies and applications of real-time systems, database systems and concurrency control, software engineering, data communication systems, real-time system development and analysis tools, formal methods and processing scheduling, and operating systems and distributed systems.
ISBN (print): 9781728198705
Modern stream processing systems need to process large volumes of data in real time. Various stream processing frameworks have been developed, and messaging systems are widely applied to transfer streaming data among different applications. As a distributed messaging system with growing popularity, Apache Kafka processes streaming data in small batches for efficiency. However, the robustness of Kafka's batching method against variable operating conditions is not known. In this paper we study the impact of the batch size on the performance of Kafka. Both configuration parameters, the spatial and the temporal batch size, are considered. We build a Kafka testbed using Docker containers to analyze the distribution of Kafka's end-to-end latency. The experimental results indicate that evaluating only the mean latency is unreliable in the context of real-time systems. In the experiments where network faults are injected, we find that the batch size affects the message loss rate in the presence of an unstable network connection. However, allocating resources for message processing and delivery that will violate the reliability requirements, implemented as latency constraints of a real-time system, is inefficient. To address these challenges, we propose a reactive batching strategy. We evaluate our batching strategy in both good and poor network conditions. The results show that the strategy is powerful enough to meet both latency and throughput constraints even when network conditions are variable.
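The following is a minimal sketch of batching bounded by a spatial limit (bytes) and a temporal limit (waiting time), the two parameters the abstract studies. In the Kafka producer these limits presumably correspond to the `batch.size` and `linger.ms` settings; the abstract does not name them, so that mapping is an assumption. The `Batcher` class and its parameter names are hypothetical and illustrate only the general mechanism, not the reactive strategy proposed in the paper.

```python
class Batcher:
    """Toy size- and time-bounded batcher (hypothetical, not Kafka's code)."""

    def __init__(self, max_bytes, max_wait_s):
        self.max_bytes = max_bytes      # spatial batch limit, in bytes
        self.max_wait_s = max_wait_s    # temporal batch limit, in seconds
        self._buffer = []
        self._size = 0
        self._oldest = None             # arrival time of the oldest buffered message

    def _flush(self):
        batch = self._buffer
        self._buffer, self._size, self._oldest = [], 0, None
        return batch

    def add(self, message, now):
        """Buffer one message at time `now`; return the batches sent as a result."""
        sent = []
        # Temporal limit: the oldest buffered message has waited long enough.
        if self._buffer and now - self._oldest >= self.max_wait_s:
            sent.append(self._flush())
        if not self._buffer:
            self._oldest = now
        self._buffer.append(message)
        self._size += len(message)
        # Spatial limit: the buffer has reached the configured size.
        if self._size >= self.max_bytes:
            sent.append(self._flush())
        return sent


batcher = Batcher(max_bytes=8, max_wait_s=0.5)
first = batcher.add(b"aaaa", now=0.0)   # buffered, nothing sent yet
second = batcher.add(b"bbbb", now=0.1)  # size limit reached, batch sent
third = batcher.add(b"cc", now=0.2)     # buffered
fourth = batcher.add(b"dd", now=1.0)    # time limit expired for b"cc"
```

Timestamps are passed in explicitly so the behavior is deterministic. A larger `max_bytes` or `max_wait_s` trades latency for throughput, which is the tension the abstract examines.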
A fault-tolerant mutual exclusion algorithm for distributed systems is presented. The algorithm uses a distributed queue strategy and maintains alternative paths at each site to provide a high degree of fault tolerance. However, owing to these alternative paths, the algorithm must use reverse messages to avoid the occurrence of directed cycles, which may form when the direction of edges is reversed after the token passes through. If there is no alternative path, the total number of messages exchanged is O(2 × log N) in light traffic and two messages in heavy traffic; however, in this case the system cannot tolerate even a single communication link or site failure. If there are alternative paths between sites, the system can achieve a higher degree of fault tolerance at the expense of increased message traffic (owing to reverse messages). Thus, there is a tradeoff between efficiency and reliability, and a system can be designed to balance these two criteria properly. A recovery procedure for restoring a recovering site consistently into the system is also presented.
ISBN (print): 3540410554
In cost-conscious industries, such as automotive, it is imperative for designers to adhere to policies that reduce system resources to the extent feasible, even for safety-critical sub-systems. However, the overall reliability requirement, typically on the order of 10^-9 faults/hour, must be both analysable and met. Faults can be hardware, software or timing faults, the latter being handled by hard real-time schedulability analysis, which is used to prove that no timing violations will occur. However, from a reliability and cost perspective there is a tradeoff between timing guarantees, the level of hardware and software faults, and the per-unit cost of meeting the overall reliability requirement. This paper outlines a reliability analysis method that considers the effect of faults on schedulability analysis and its impact on the reliability estimation of the system. The ideas have general applicability, but the method has been developed with modeling of external interferences of automotive CAN buses in mind. We illustrate the method using the example of a distributed braking system.
ISBN (print): 9781450351997
The proceedings contain 6 papers. The topics discussed include: sharing a connectivity via informed proxy selection; message-oriented middleware for edge computing applications; continuous experimentation for software developers; end-to-end regression testing for distributed systems; towards a framework for orchestrated distributed database evaluation in the cloud; toward software updates in IoT environments: why existing P2P systems are not enough; and towards accelerating synchrophasor based linear state estimation of power grid systems.
ISBN (print): 0818607491
The goal of checkpointing in database management systems is to save database states on a separate secure device so that the database can be recovered when errors and failures occur. Recently, the possibility of having a checkpointing mechanism which does not interfere with transaction processing has been studied. The property of noninterference is highly desirable in real-time applications, where restricting transaction activity during the checkpointing operation is in many cases not feasible. The practicality of a noninterfering checkpointing algorithm is studied here by analyzing the extra workload of the algorithm. Noninterfering checkpointing results in some overhead on the one hand, and increases system availability on the other. For applications where continuous processing of transactions is so critical that blocking transaction processing for checkpointing is not feasible, it is believed that noninterfering checkpointing provides a practical solution to the problem of constructing globally consistent states in distributed database systems.
ISBN (print): 9781538683019
The reliability issue in deduplication-based storage systems has not received adequate attention. Existing approaches introduce data redundancy after files have been deduplicated, either by replication of critical data chunks, i.e., chunks with a high reference count, or by RAID schemes on unique data chunks, which means that these schemes are based on individual unique data chunks rather than individual files. This can leave individual files vulnerable to losses, particularly in the presence of transient and unrecoverable data chunk errors such as latent sector errors. To address this file reliability issue, this paper proposes a Per-File Parity (PFP) scheme to improve the reliability of deduplication-based storage systems. PFP computes the XOR parity within parity groups of data chunks of each file after the chunking process but before the data chunks are deduplicated. Therefore, PFP can provide parity redundancy protection for all files by intra-file recovery and a higher-level protection for data chunks with high reference counts by inter-file recovery. Our reliability analysis and extensive data-driven, failure-injection based experiments conducted on a prototype implementation of PFP show that PFP significantly outperforms the existing redundancy solutions, DTR and RCR, in system reliability, tolerating multiple data chunk failures and guaranteeing file availability upon multiple data chunk failures. Moreover, a performance evaluation shows that PFP incurs only an average of 5.7% performance degradation to the deduplication-based storage system.
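The XOR parity idea behind intra-file recovery can be sketched as follows. This is a minimal illustration, assuming fixed-size chunks and a single parity group per file; the function names `xor_parity` and `recover_chunk` are hypothetical, and the abstract does not describe PFP's actual chunking, group size, or on-disk layout.

```python
def xor_parity(chunks):
    """Return the byte-wise XOR of equal-length chunks (assumed fixed size)."""
    if not chunks:
        raise ValueError("need at least one chunk")
    size = len(chunks[0])
    if any(len(chunk) != size for chunk in chunks):
        raise ValueError("all chunks must have the same length")
    parity = bytearray(size)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)


def recover_chunk(surviving_chunks, parity):
    """Rebuild the single missing chunk of a group from the rest plus parity."""
    return xor_parity(list(surviving_chunks) + [parity])


# One parity group made of the chunks of a single file, computed before
# any deduplication would take place.
file_chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(file_chunks)

# Simulate losing the second chunk and rebuilding it within the file.
lost_index = 1
surviving = file_chunks[:lost_index] + file_chunks[lost_index + 1:]
recovered = recover_chunk(surviving, parity)
```

A single XOR parity can rebuild only one lost chunk per group, so tolerating several failures in one file would require multiple groups or additional redundancy.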
ISBN (print): 0897910974
We describe a reliability algorithm being considered for DDM, a distributed database system under development at Computer Corporation of America. The algorithm is designed to tolerate clean site failures in which sites simply stop running. The algorithm allows the system to reconfigure itself to run correctly as sites fail and recover. The algorithm solves the subproblems of atomic commit and replicated data handling in an integrated manner.