ISBN (print): 0818606908
N-modular redundancy (NMR) protects against arbitrary types of hardware or software failures in a minority of system components, thereby yielding the highest degree of reliability. A study is made of the application of NMR, specifically triple modular redundancy (TMR), to general-purpose database processing. The authors discuss the structure and implementation tradeoffs of a TMR system that is "synchronized" at the transaction level, i.e., in which complete transactions are distributed to all nodes, where they are processed independently, and only the majority output is accepted. The inherent cost of such a TMR database system is examined on the basis of preliminary performance results from a version implemented on three SUN-2/120 workstations.
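The transaction-level voting described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `majority_vote` and `run_tmr` are hypothetical, replicas are modeled as plain functions, and outputs are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas."""
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of the votes.
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def run_tmr(transaction: dict,
            replicas: Sequence[Callable[[dict], Hashable]]) -> Hashable:
    """Send the whole transaction to every replica, then vote on the results."""
    outputs = [replica(transaction) for replica in replicas]
    return majority_vote(outputs)


# Hypothetical replicas: two behave correctly, one returns a corrupted result.
def healthy_replica(txn: dict) -> int:
    return txn["amount"] * 2


def faulty_replica(txn: dict) -> int:
    return -1


result = run_tmr({"amount": 21}, [healthy_replica, faulty_replica, healthy_replica])
```

With three replicas, a single faulty output is outvoted. Two disagreeing faults leave no strict majority, and the sketch raises an error rather than guessing.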
ISBN (print): 0818606908
Fault-tolerant distributed algorithms that are designed to reach agreement have been the subject of a great deal of recent study, primarily focussed on the Byzantine agreement paradigm. The author explores new paradigms and problems that arise in the context of maintaining agreement, rather than reaching agreement in an isolated instance. The emphasis is on open problem areas rather than on specific solutions.
ISBN (print): 0818606908
An approach is presented that will allow database applications to increase availability in the face of network partitions and other communications failures, by permitting a controlled amount of nonserializable database activity. The underlying replicated database substrate ensures mutual consistency, without serializability, by timestamping all updates issued by database interactions. Compensating actions, triggered by exception conditions in the database, attempt to correct problems arising from nonserializable execution or notify human agents to investigate and correct the problem. Probabilistic concurrency control uses a controlled amount of inter-site synchronization to reduce the likelihood of nonserializable execution and the burden of compensation, at the cost of slightly reduced availability. This approach, illustrated by means of examples, allows application designers to tailor the system to achieve any desired balance between availability and consistency.
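One common way to obtain mutual consistency from timestamped updates is a last-writer-wins merge, sketched below in Python. The abstract does not say which rule the authors use, so this is an assumption: the `(timestamp, site_id)` stamp format and the functions `apply_update` and `merge` are hypothetical.

```python
from typing import Any, Dict, Tuple

# (logical timestamp, site id); the site id breaks ties between equal timestamps.
Stamp = Tuple[int, str]
Store = Dict[str, Tuple[Stamp, Any]]


def apply_update(store: Store, key: str, stamp: Stamp, value: Any) -> None:
    """Keep the update only if it is newer than what the replica already holds."""
    current = store.get(key)
    if current is None or stamp > current[0]:
        store[key] = (stamp, value)


def merge(local: Store, remote: Store) -> None:
    """Fold a remote replica's state into the local one after a partition heals."""
    for key, (stamp, value) in remote.items():
        apply_update(local, key, stamp, value)


# Two sites update the same item while partitioned.
site_a: Store = {}
site_b: Store = {}
apply_update(site_a, "balance", (1, "A"), 100)
apply_update(site_b, "balance", (2, "B"), 80)

# After the partition heals, the sites exchange state in both directions.
merge(site_a, site_b)
merge(site_b, site_a)
```

Both replicas converge to the same value whatever the merge order, but the update from site A is silently overwritten. Effects of this kind are what the compensating actions in the abstract are meant to detect and repair.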
ISBN (print): 0818606908
One property that makes failures difficult to handle in programs is that the actions of a failed component may occur asynchronously with respect to execution of the program. In this study, an approach to dealing with this asynchrony is presented. It is based on treating a failure as an event in a concurrent system of processes, and then integrating failure handling mechanisms into distributed programming languages. The technique is illustrated by considering the class of failures suffered by fail-stop processors, and proposing extensions of the Synchronizing Resources (SR) distributed programming language to handle such failures. Two SR programs using these mechanisms are presented.
ISBN (print): 0818606908
The authors address the problem of state inconsistencies (i.e., interacting processes having different and inconsistent views of one another) that arise at the kernel level of distributed systems based on local area networks. Such systems are particularly susceptible to state inconsistencies because entities are highly autonomous and thus may fail independently. The problem is compounded by the inherent delays and errors in communicating events between machines in the network. A description is given of three common classes of events that may cause state inconsistencies: (1) failures of processes, machines, and/or the network; (2) new machines joining or exiting from the system; and (3) processes or hosts migrating from one machine to another in the network. Systematic solutions to the problems, based mainly on the concept of kernel-supported process aliases, are presented. The solutions are structured and easy to understand.
ISBN (print): 0818606908
Focus is on recovery of transactions in a distributed DB/DC system. The objective is to use transaction-level structural information to eliminate costly lower-level handshaking protocols, eliminate the need for any centralized recovery management mechanism by making recovery actions local to interacting components, and eliminate propagation of recovery actions to more than one antecedent or precedent component. Progressive recovery is a way of tracking the progress of a transaction to meet the above objective. Transaction processing involves different execution stages (DC, DB, followed by the DC), perhaps on different processors. Some stages make database changes and others are purely transformations of messages. The latter permit re-executions without side effects. The former must be well protected from re-executions. In contrast with optimistic recovery schemes, progressive recovery does not track communication and state dependencies.
ISBN (print): 0818607491
The following topics are dealt with: real-time distributed programming systems; architecture and interconnection schemes; fault tolerance; reliability estimation and performance modeling; performance analysis; intercommunication protocols; operative systems; dynamic and distributed scheduling; task allocation and load balancing; real-time operating system for nuclear power plant computer; real-time juggling robot; and real-time direct kinematics on a VLSI chip. 30 papers were presented, all of which are published in full in the present proceedings. Abstracts of individual papers can be found under the classification codes in this or other issues.
ISBN (print): 0818607491
One-to-many (group) interprocess communication is useful in many real-time distributed applications. It may be conveniently and efficiently realized using the multicast feature available in contemporary local area networks. A kernel model which supports reliable group communication in a distributed computing environment is presented. New semantic tools which capture the nondeterminism of the underlying low-level events concisely are introduced and a process alias-based structuring technique for the kernel to handle the reliability problems that may arise during group communication is described. The scheme works by maintaining a close association between group messages and their corresponding reply messages. Sample programs illustrate how the semantic tools may be used.
ISBN (print): 0818607491
For systems containing a large number of processing elements (PEs), the capability to recover from a PE fault is important. The dynamic redundancy (DR) network can tolerate faults in the network and support a system that tolerates PE faults without degradation in performance by adding spare PEs, while retaining the full capability of a multistage cube. A variation of the DR network, the reduced DR (RDR) network, is presented which can be implemented more cost effectively while retaining most of the advantages of the DR. The reliability of systems containing the DR or RDR networks and spare PEs and the reliability of systems with no spare PEs are also estimated and compared. It is shown that using the DR or RDR network and spare PEs in a system can achieve better system reliability over a wide range of N, where N is the number of functioning PEs, than can using any kind of N × N fault-tolerant network and no spares.
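The effect of spare PEs on system reliability can be illustrated with a simplified k-out-of-n calculation in Python. This is not the reliability model used in the paper: it assumes independent PE failures with a common per-PE reliability `r`, treats the interconnection network as perfectly reliable, and uses hypothetical function names.

```python
from math import comb


def reliability_with_spares(n_required: int, n_spares: int, r: float) -> float:
    """Probability that at least n_required of the n_required + n_spares PEs work."""
    total = n_required + n_spares
    return sum(
        comb(total, k) * r**k * (1 - r) ** (total - k)
        for k in range(n_required, total + 1)
    )


def reliability_without_spares(n_required: int, r: float) -> float:
    """With no spares, every one of the n_required PEs must work."""
    return r**n_required


# Illustrative values only: 4 functioning PEs needed, per-PE reliability 0.9.
no_spare = reliability_without_spares(4, 0.9)
one_spare = reliability_with_spares(4, 1, 0.9)
```

Under these assumptions, a single spare already raises the reliability of the PE set. A fuller comparison, like the one in the abstract, would also account for the reliability of the network itself.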
ISBN (print): 0818605642
The following topics are dealt with: reliable communication; network partition-handling; fault-tolerant systems; object management; concurrency control; reliable transaction management; and design of reliable software. 24 papers were presented, all of which are published in full in the present proceedings.