ISBN: 0818608757 (print)
Highly reliable and effective failure detection and isolation (FDI) software is crucial in modern avionics systems that tolerate hardware failures in real time. The FDI function is an excellent opportunity for applying the principle of software design diversity to the fullest, i.e., algorithm diversity, in order to improve functional performance and potentially enhance the reliability of the software. The authors examine algorithm diversity applied to the redundancy management software for a hardware fault-tolerant sensor array. Results of an experiment are presented that show the performance gains that can be provided by utilizing the consensus of three diverse algorithms for sensor FDI.
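To make the idea of a consensus among three diverse FDI algorithms concrete, the following is a minimal sketch in Python. The three detectors, their names, the tolerance parameter, and the 2-of-3 quorum are all assumptions chosen for illustration; the abstract does not say which algorithms the authors used.

```python
from collections import Counter
from statistics import median


def detect_by_median(readings, tol):
    """Hypothetical detector 1: flag sensors far from the array median."""
    m = median(readings)
    return {i for i, r in enumerate(readings) if abs(r - m) > tol}


def detect_by_mean_of_others(readings, tol):
    """Hypothetical detector 2: flag sensors far from the mean of the others.

    A single large outlier drags every "mean of the others" away, so this
    detector is prone to false alarms; it is included to show what the
    consensus step can filter out.
    """
    total = sum(readings)
    n = len(readings)
    return {
        i
        for i, r in enumerate(readings)
        if abs(r - (total - r) / (n - 1)) > tol
    }


def detect_by_pairwise(readings, tol):
    """Hypothetical detector 3: flag sensors that disagree with most peers."""
    n = len(readings)
    flagged = set()
    for i, r in enumerate(readings):
        disagreements = sum(
            1 for j, s in enumerate(readings) if j != i and abs(r - s) > tol
        )
        if disagreements > (n - 1) / 2:
            flagged.add(i)
    return flagged


DETECTORS = (detect_by_median, detect_by_mean_of_others, detect_by_pairwise)


def consensus_fdi(readings, tol, detectors=DETECTORS, quorum=2):
    """Isolate a sensor only if at least `quorum` detectors flag it."""
    votes = Counter()
    for detect in detectors:
        votes.update(detect(readings, tol))
    return {i for i, count in votes.items() if count >= quorum}


if __name__ == "__main__":
    readings = [10.0, 10.1, 9.9, 14.0]  # sensor 3 has failed
    print(detect_by_mean_of_others(readings, 1.0))  # false alarms on 0, 1, 2
    print(consensus_fdi(readings, 1.0))             # only sensor 3 is isolated
```

In this toy case one detector raises false alarms on the healthy sensors, and the vote suppresses them while still isolating the failed sensor.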
A recursive algorithm for computing a lower bound on the all-terminal reliability of an n-dimensional hypercube is presented. The recursive step decomposes an n-dimensional hypercube into lower dimension hypercubes that are linked together. As an illustration of the effectiveness and power of this method, a lower bound is computed on the all-terminal reliability of the 16-dimensional hypercube (Connection Machine architecture), which has 2^19 links. The notation and assumptions are defined, and background information on bounding the reliability polynomial is provided. Methods for tightening these bounds for the analysis of the hypercube architecture are discussed.
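The link count and the quantity being bounded can be illustrated with a short Python sketch. It is not the recursive bounding algorithm from the paper: it enumerates every link state by brute force, which is only feasible for very small hypercubes, and it assumes that links fail independently with the same probability while nodes never fail. The function names are hypothetical.

```python
def hypercube_edges(n):
    """Links of the n-dimensional hypercube: node pairs differing in one bit."""
    return [
        (u, u ^ (1 << b))
        for u in range(1 << n)
        for b in range(n)
        if u < (u ^ (1 << b))  # list each undirected link once
    ]


def is_connected(num_nodes, edges):
    """Union-find check that the given links connect all nodes."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = num_nodes
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1


def all_terminal_reliability(n, p):
    """Exact all-terminal reliability by enumerating all 2^m link states.

    p is the probability that a single link is up. Only practical for
    n <= 3 (12 links); a 16-dimensional hypercube needs a bound instead.
    """
    edges = hypercube_edges(n)
    m = len(edges)
    total = 0.0
    for mask in range(1 << m):
        up = [e for i, e in enumerate(edges) if (mask >> i) & 1]
        if is_connected(1 << n, up):
            total += p ** len(up) * (1 - p) ** (m - len(up))
    return total


if __name__ == "__main__":
    print(len(hypercube_edges(16)) == 2 ** 19)  # n * 2^(n-1) links
    print(all_terminal_reliability(2, 0.9))
    print(all_terminal_reliability(3, 0.9))
```

For n = 2 the hypercube is a 4-cycle, so the result can be checked by hand as p^4 + 4p^3(1 - p). The exponential cost of the enumeration is the reason bounds are needed for large n.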
Critical infrastructures provide services upon which society depends heavily; these applications are themselves dependent on distributed information systems for all aspects of their operation, and so survivability of the information systems is an important issue. Fault tolerance is a mechanism by which survivability can be achieved in these information systems. We outline a specification-based approach to fault tolerance, called RAPTOR, that enables structuring of fault-tolerance specifications and an implementation partially synthesized from the formal specification. The RAPTOR approach uses three specifications describing the fault-tolerant system, the errors to be detected, and the actions to take to recover from those errors. System specification utilizes an object-oriented database to store the descriptions associated with these large, complex systems. The error detection and recovery specifications are defined using the formal specification notation Z. We also describe an implementation architecture and explore our solution with a case study.
The following topics are dealt with: distributed operating systems; local area networks; network fault tolerance; hypercubes; distributed databases; real-time systems; replicated programs; computer architectures; and voting. Abstracts of individual papers can be found under the relevant classification codes in this or other issues.
ISBN: 0818606908 (print)
The authors address the problem of state inconsistencies (i.e., interacting processes having different and inconsistent views of one another) that arise at the kernel level of distributed systems based on local area networks. Such systems are particularly susceptible to state inconsistencies because entities are highly autonomous and thus may fail independently. The problem is compounded by the inherent delays and errors in communicating events between machines in the network. A description is given of three common classes of events that may cause state inconsistencies: (1) failures of processes, machines, and/or the network; (2) new machines joining or exiting from the system; and (3) processes or hosts migrating from one machine to another in the network. Systematic solutions to the problems, based mainly on the concept of kernel-supported process aliases, are presented. The solutions are structured and easy to understand.
ISBN: 0818606908 (print)
Focus is on recovery of transactions in a distributed DB/DC system. The objective is to use transaction-level structural information to eliminate costly lower-level handshaking protocols, eliminate the need for any centralized recovery management mechanism by making recovery actions local to interacting components, and eliminate propagation of recovery actions to more than one antecedent or precedent component. Progressive recovery is a way of tracking the progress of a transaction to meet the above objective. Transaction processing involves different execution stages (DC, DB, followed by the DC), perhaps on different processors. Some stages make database changes and others are purely transformations of messages. The latter permit re-executions without side effects. The former must be well protected from re-executions. In contrast with optimistic recovery schemes, progressive recovery does not track communication and state dependencies.
ISBN: 9781450330565 (print)
Non-determinism in concurrent or distributed software systems (i.e., various possible execution orders among different distributed components) presents new challenges to the existing reliability analysis methods based on Markov chains. In this work, we present a toolkit, RaPiD, for the reliability analysis of non-deterministic systems. Taking a Markov decision process as the reliability model, RaPiD can help in the analysis of three fundamental and rewarding aspects of software reliability. First, to provide reliability assurance for a system, RaPiD can synthesize the overall system reliability given the reliability values of system components. Second, given a requirement on the overall system reliability, RaPiD can distribute the reliability requirement to each component. Lastly, RaPiD can identify the component that affects the system reliability most significantly. RaPiD has been applied to analyze several real-world systems, including a financial stock trading system, a proton therapy control system and an ambient assisted living room system. The is available at http://***/cfp/demos
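The first analysis described above, deriving system reliability from component reliabilities when execution order is non-deterministic, can be sketched with value iteration over a small Markov decision process. This is not RaPiD's interface or implementation. The model encoding, the state names, the component reliabilities, and the function `reach_probability` are all hypothetical, and the example assumes a tiny acyclic model.

```python
def reach_probability(mdp, target, pick=min, sweeps=1000, eps=1e-12):
    """Probability of eventually reaching `target` from each state.

    mdp maps state -> {action: [(probability, next_state), ...]}.
    pick=min resolves non-determinism adversarially (worst-case reliability);
    pick=max resolves it favourably (best-case reliability).
    """
    value = {s: 0.0 for s in mdp}
    value[target] = 1.0
    for _ in range(sweeps):
        delta = 0.0
        for state, actions in mdp.items():
            if state == target or not actions:
                continue  # target and absorbing failure states keep their value
            new = pick(
                sum(p * value[nxt] for p, nxt in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(new - value[state]))
            value[state] = new
        if delta < eps:
            break
    return value


# Hypothetical system: the scheduler may route a request through
# component A (reliability 0.99) followed by C (0.98), or through B (0.95).
MODEL = {
    "start": {
        "route_a": [(0.99, "mid"), (0.01, "fail")],
        "route_b": [(0.95, "ok"), (0.05, "fail")],
    },
    "mid": {"run_c": [(0.98, "ok"), (0.02, "fail")]},
    "ok": {},
    "fail": {},
}

if __name__ == "__main__":
    worst = reach_probability(MODEL, "ok", pick=min)["start"]
    best = reach_probability(MODEL, "ok", pick=max)["start"]
    print(worst, best)  # worst case uses route_b, best case uses route_a
```

Reporting the minimum over all resolutions of the non-determinism gives a reliability figure that holds whichever execution order occurs, which a plain Markov chain cannot express.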
A major obstacle in implementing a rollback recovery scheme for fault tolerance in a concurrent distributed system is the domino effect. A low overhead checkpointing scheme is proposed to prevent this effect. Each process saves its state periodically. The state-save synchronization among processes is implemented by bounding clock drifts. A communication protocol that assures that all saved states are consistent is developed.
ISBN: 078030943X (print)
Fault tolerant software utilizes redundancy and diversity in an attempt to tolerate software design faults. The two most widely studied approaches to software fault tolerance are called Recovery Blocks (RB) and N-version Programming (NVP). Both RB and NVP have been the subject of numerous research efforts and publications. These research efforts primarily address design issues such as independence, implementation issues that arise in distributed systems, and experimental performance analysis. Very few researchers have addressed the analysis of the reliability of fault tolerant software. In this paper we present fault tree models that can be used for qualitative and quantitative analysis of fault tolerant software. There are several advantages to a simple fault tree model of fault tolerant software, in addition to the intrinsic beauty of simplicity. First, the implications of the conclusions drawn from the model are easier for the reader to understand. Second, the qualitative effects of the input parameters are easier to deduce. Third, a reader can develop a model of a similar system and be confident of the results. Finally, and most important, a simple fault tree model of fault tolerant software can more easily be combined with an analysis of the hardware structure on which it executes. This combination will facilitate the integrated analysis of fault tolerant hardware and software systems.
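As a rough illustration of quantitative fault tree analysis for NVP, the Python sketch below evaluates a two-gate tree: the system fails if the voter fails OR at least two of three versions fail. The tree shape, the gate helpers, and the failure probabilities are assumptions for illustration and are not taken from the paper. The sketch also assumes statistically independent failures, which is exactly the independence question the abstract lists as a design issue.

```python
from itertools import product


def or_gate(*probs):
    """P(at least one input event occurs), assuming independent events."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def k_of_n_gate(k, probs):
    """P(at least k of the independent input events occur)."""
    total = 0.0
    for outcome in product((0, 1), repeat=len(probs)):
        if sum(outcome) >= k:
            weight = 1.0
            for occurred, p in zip(outcome, probs):
                weight *= p if occurred else 1.0 - p
            total += weight
    return total


def nvp_failure_probability(version_fail_probs, voter_fail_prob):
    """Top event of the assumed tree: voter fails OR a majority of versions fail."""
    majority = len(version_fail_probs) // 2 + 1
    return or_gate(voter_fail_prob, k_of_n_gate(majority, version_fail_probs))


if __name__ == "__main__":
    # Hypothetical inputs: each version fails with probability 0.1.
    print(nvp_failure_probability([0.1, 0.1, 0.1], 0.0))
    print(nvp_failure_probability([0.1, 0.1, 0.1], 0.001))
```

With a perfect voter the result reduces to 3q^2(1 - q) + q^3 for per-version failure probability q, so the input parameters can be traced through the tree by hand.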
ISBN: 9781479955848 (print)
Software systems running continuously for a long time often suffer from software aging, which is the progressive degradation of the execution environment caused by latent software faults. Removal of such faults during the software development process is a crucial issue for system reliability. A known major obstacle is typically the large latency to discover the existence of software aging. We propose a systematic approach to detect software aging which has shorter test time and higher accuracy compared to traditional aging detection via stress testing and trend detection. The approach is based on a differential analysis where a software version under test is compared against a previous version in terms of behavioral changes of resource metrics. A key instrument adopted is a divergence chart, which expresses time-dependent differences between two signals. Our experimental study focuses on memory-leak detection and evaluates divergence charts computed using multiple statistical techniques paired with application-level memory related metrics (RSS and Heap Usage). The results show that the proposed method achieves good performance for memory-leak detection in comparison to techniques widely adopted in previous works (e.g., linear regression, moving average and median).
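The differential idea can be sketched in Python as follows. The abstract does not define how a divergence chart is computed, so this sketch assumes one plausible reading: smooth the memory series of both versions with a moving average and chart the pointwise difference. The function names, the window size, the threshold, and the sample data are all hypothetical.

```python
def moving_average(samples, window):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    running = 0.0
    for i, x in enumerate(samples):
        running += x
        if i >= window:
            running -= samples[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def divergence_chart(baseline, candidate, window=3):
    """Pointwise difference of the smoothed candidate and baseline signals."""
    if len(baseline) != len(candidate):
        raise ValueError("signals must be sampled at the same time points")
    base = moving_average(baseline, window)
    cand = moving_average(candidate, window)
    return [c - b for b, c in zip(base, cand)]


def first_alarm(chart, threshold):
    """Index of the first sample whose divergence exceeds the threshold."""
    for i, d in enumerate(chart):
        if d > threshold:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical RSS samples in MB: the previous version is flat,
    # the version under test grows by 2 MB per sample.
    previous_rss = [100, 100, 100, 100, 100, 100]
    tested_rss = [100, 102, 104, 106, 108, 110]
    chart = divergence_chart(previous_rss, tested_rss)
    print(chart)
    print(first_alarm(chart, threshold=5.0))
```

Comparing against a previous version removes workload-driven memory variation that both versions share, which is one reason a differential test may need less run time than trend detection on a single version.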