ISBN (print): 0818608757
A checkpoint algorithm is presented that benefits from the research in concurrency control, commit, and site recovery algorithms in transaction processing. In the authors' approach, a number of checkpointing processes, a number of rollback processes, and computations on operational processes can proceed concurrently while tolerating the failure of an arbitrary number of processes. Each process takes checkpoints independently. During recovery after a failure, a process invokes a two-phase rollback algorithm. It collects information about relevant message exchanges in the system in the first phase and uses it in the second phase to determine both the set of processes that must roll back and the set of checkpoints up to which rollback must occur. Concurrent rollbacks are completed in the order of the priorities of the recovering processes. The proposed solution is optimistic in the sense that it minimizes overhead during normal processing and therefore performs well when failures are infrequent.
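The second phase described above, in which message-exchange information determines which processes roll back and to which checkpoints, can be illustrated with a generic rollback-propagation sketch. This is not the authors' algorithm: the data model (local event indices, a `checkpoints` map, `messages` as 4-tuples) and the function name `recovery_line` are assumptions made for illustration, and priorities, concurrency, and lost messages are ignored.

```python
import math

def recovery_line(checkpoints, messages, failed):
    """Compute which processes must roll back, and to which checkpoint.

    Assumed (hypothetical) data model:
      checkpoints: {process: [local event indices]}, each list contains 0
                   (the initial state); checkpoint c retains events 1..c.
      messages:    iterable of (sender, send_idx, receiver, recv_idx).
      failed:      processes that restart from their latest checkpoint.
    """
    # math.inf means "keeps its current state" (no rollback needed so far).
    cut = {p: math.inf for p in checkpoints}
    for p in failed:
        cut[p] = max(checkpoints[p])

    changed = True
    while changed:
        changed = False
        for sender, send_idx, receiver, recv_idx in messages:
            send_undone = send_idx > cut[sender]
            recv_kept = recv_idx <= cut[receiver]
            if send_undone and recv_kept:
                # Orphan message: the receiver must forget the receive,
                # so it falls back to its latest checkpoint taken before it.
                cut[receiver] = max(
                    c for c in checkpoints[receiver] if c < recv_idx
                )
                changed = True

    # Report only the processes that actually roll back.
    return {p: c for p, c in cut.items() if c != math.inf}
```

The loop terminates because every change strictly lowers one process's cut, and each process has finitely many checkpoints. A rollback in one process can force rollbacks in others, which is why the result is computed as a fixed point rather than in a single pass.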
Replicated execution of distributed programs, which provides a means of masking hardware (processor) failures in a distributed system, is discussed. Application-level entities (processes, objects) are replicated to execute on distinct processors. Such replica entities communicate by message passing. Nondeterminism within the replicas could cause messages to be processed in nonidentical order, producing a divergence of state. Possible sources of nondeterminism are identified, and a generic mechanism for ensuring that nonfaulty replicas process messages in identical order, thereby preventing state divergence among such replica entities, is presented.
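To see why identical processing order prevents state divergence, consider the following minimal sketch. It is not the mechanism from the paper: it assumes each message carries a logical timestamp and a sender identifier that together form a unique key, and that every replica eventually holds the same set of messages. The message layout and the function name `apply_in_agreed_order` are hypothetical.

```python
def apply_in_agreed_order(messages, initial=0):
    """Apply messages in a deterministic order, independent of arrival order.

    Each message is an assumed 4-tuple (timestamp, sender_id, op, arg).
    Sorting on (timestamp, sender_id) gives every replica the same sequence.
    """
    state = initial
    for _, _, op, arg in sorted(messages, key=lambda m: (m[0], m[1])):
        if op == "add":
            state += arg
        elif op == "mul":
            state *= arg
        else:
            raise ValueError(f"unknown operation: {op}")
    return state
```

Because addition and multiplication do not commute with each other, two replicas that applied the same messages in arrival order could end in different states; sorting on an agreed key removes that source of divergence.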
ISBN (print): 0818608757
An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. First, a common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with a pseudorecovery block approach to develop a checkpointing algorithm that has the following advantages: (i) maximum process autonomy, (ii) no wait for commitment when establishing recovery lines, (iii) fewer messages to be exchanged, and (iv) lower memory requirements.
The various software systems developed for the DIII-D tokamak have played a highly visible and important role in tokamak operations and fusion research. Because of the heavy reliance on in-house developed software encompassing all aspects of operating the tokamak, much attention has been given to the careful design, development and maintenance of these software systems. Software systems responsible for tokamak control and monitoring, neutral beam injection, and data acquisition demand the highest level of reliability during plasma operations. These systems, made up of hundreds of programs totaling thousands of lines of code, have presented a wide variety of software design and development issues ranging from low-level hardware communications, database management, and distributed process control to man-machine interfaces. The focus of this paper will be to describe how software is developed and managed for the DIII-D control and data acquisition computers. It will include an overview and status of the software systems implemented for tokamak control, neutral beam control, and data acquisition. The issues and challenges faced in developing and managing the large amounts of software in support of the dynamic and ever-changing needs of the DIII-D experimental program will be addressed.
The problem of voting is studied for both the exact and inexact cases. Optimal solutions based on explicit computation of conditional probabilities are given. The most commonly used strategies, i.e., majority, median, and plurality, are compared quantitatively. The results show that plurality voting is the most powerful of these techniques and is, in fact, optimal for a certain class of probability distributions. An efficient method of implementing a generalized plurality voter when nonfaulty processes can produce differing answers is also given.
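The three strategies compared above can be sketched for the exact case as follows. These are textbook definitions, not the optimal or generalized voters from the paper; the function names are illustrative, and the sketch assumes a nonempty list of hashable (and, for the median, orderable) outputs.

```python
from collections import Counter
from statistics import median_low

def majority_vote(outputs):
    """Return the value produced by more than half of the voters, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def plurality_vote(outputs):
    """Return the most frequent value, or None if first place is tied."""
    ranked = Counter(outputs).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def median_vote(outputs):
    """Return the lower median, which is always one of the submitted values."""
    return median_low(outputs)
```

For the outputs `[1, 1, 2, 3]`, no value has a strict majority, so `majority_vote` returns `None`, while `plurality_vote` still selects `1`. This illustrates how plurality can reach a decision in cases where majority voting cannot.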
A knowledge-based approach for query processing during network partitioning is proposed. The approach uses available domain and summary knowledge to infer inaccessible data to answer a given query. A rule induction technique is used to extract correlated knowledge between attributes from the database contents. This knowledge is represented as rules for data inference. On the basis of a set of queries, simulation is used to evaluate the effectiveness of the proposed data inference technique for improving data availability under network partitioning. Object allocation has a significant impact on data availability. Allocating objects that increase remote redundancy and reduce local redundancy increases data availability during network partitioning. A prototype distributed database system that uses the proposed inference technique with correlated knowledge from a ship database has been implemented. Experience indicates that the proposed inference technique can significantly improve the availability of a distributed database during network partitioning.
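The idea of inducing inter-attribute rules and using them to answer queries about unreachable data can be illustrated with a very simple sketch. It is an assumption-laden simplification, not the induction technique from the paper: a rule is kept only when a source attribute value always co-occurs with a single target value in the locally available rows. The function names and the row format (a list of dictionaries) are hypothetical.

```python
def induce_rules(rows, src, dst):
    """Induce rules 'src value -> dst value' from locally available rows.

    A rule is kept only if the src value maps to exactly one dst value
    in every row seen, i.e. the correlation has no counterexample.
    """
    seen = {}
    for row in rows:
        seen.setdefault(row[src], set()).add(row[dst])
    return {
        value: next(iter(targets))
        for value, targets in seen.items()
        if len(targets) == 1
    }

def infer(rules, value):
    """Infer the dst attribute for a src value, or None if no rule applies."""
    return rules.get(value)
```

During a partition, a site holding only the source attribute of a record could call `infer` to estimate the inaccessible target attribute, returning `None` when the induced knowledge does not cover the case.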
Multicast communication in a distributed system connected by a local area network can increase parallelism, and it can also provide a greater functionality than one-to-one communication. In the authors' multicast protocol, the sender directs a message to a named group of receivers, which can be specified by function without requiring the sender to know the specific members of the group. Each host's kernel in the network can respond to every group message sent, providing various levels of reliability. It was found that the overhead of providing dependable multicast over a single local area network was very small, mainly because the protocol operates at the kernel level rather than the user level. Several forms of this multicast communication, expressed as simple message-passing communication primitives, are described, and the effectiveness of the protocol is evaluated using an example of a distributed algorithm. Performance analyses and actual performance data for the protocol are presented.
ISBN (print): 0818622601
The symposium materials contain 21 papers. The following topics are dealt with: checkpointing and logging algorithms; backward recovery schemes; replication and parallelism; dependability modeling and assessment; agreement; and garbage collection.
ISBN (print): 0818622601
The authors report a study of the dependability of the various communication topologies that can be used to construct a Delta-4 system. Single and dual bus and ring configurations are possible (based on 802.4, 802.5, and FDDI standards); the authors give closed-form expressions for the reliability and availability of each topology when repair is taken into account. It is shown that the dimensioning parameter in the dependability of the communication system is the coverage of the self-checking mechanisms built into the network attachment controllers.
ISBN (print): 0818608757
The authors present an election protocol that does not assume an underlying ring structure and that tolerates failures, including lost messages and network partitioning, during the execution of the protocol itself. The major problem to be solved is that when nodes cannot communicate with one another or messages are lost, a conflict in resolving the election will often arise. In the authors' approach, the conflict is detected by the cohorts (noncandidate participants in the election). Related election protocols are discussed, and the system model is described together with assumptions about the communication subsystem. The protocol and the lost-message situations are then examined.