A fault-tolerant mutual exclusion algorithm for distributed systems is presented. The algorithm uses a distributed queue strategy and maintains alternative paths at each site to provide a high degree of fault tolerance. However, owing to these alternative paths, the algorithm must use reverse messages to avoid directed cycles, which may form when the direction of edges is reversed after the token passes through. If there is no alternative path, the total number of messages exchanged is O(2 × log N) in light traffic and two messages in heavy traffic; however, in this case the system cannot tolerate even a single communication link or site failure. If there are alternative paths between sites, the system can achieve a higher degree of fault tolerance at the expense of increased message traffic (owing to reverse messages). Thus, there is a tradeoff between efficiency and reliability, and a system can be designed to balance these two criteria properly. A recovery procedure for consistently restoring a recovering site into the system is also presented.
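The light-traffic message count for the case without alternative paths can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a simplified token scheme on a balanced binary tree, one request at a time, no failures and no alternative paths. The names `build_tree` and `request_token` are hypothetical. A request follows `holder` pointers to the token, and the token returns along the same path, reversing each edge as it passes, so a request from depth d costs 2d messages.

```python
def build_tree(n):
    """Return holder pointers for n sites arranged as a balanced binary tree.

    holder[i] names the neighbour in the direction of the token;
    None marks the site that currently holds the token (site 0 at start).
    """
    holder = {0: None}
    for i in range(1, n):
        holder[i] = (i - 1) // 2
    return holder


def request_token(holder, requester):
    """Move the token to `requester` and return the number of messages used."""
    # The request is forwarded hop by hop along the holder pointers.
    path = [requester]
    while holder[path[-1]] is not None:
        path.append(holder[path[-1]])
    request_msgs = len(path) - 1

    # The token travels back along the same path; each hop reverses one edge.
    token_msgs = 0
    for i in range(len(path) - 1, 0, -1):
        sender, receiver = path[i], path[i - 1]
        holder[sender] = receiver   # edge now points toward the new holder
        holder[receiver] = None     # receiver holds the token
        token_msgs += 1

    return request_msgs + token_msgs
```

In this sketch a tree of 7 sites has depth 2, so the first request from a leaf costs 4 messages. A link or site failure on the single path would block the request, which is the weakness the alternative paths in the abstract are meant to address.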
ROSE, a modular distributed operating system that provides support for building reliable applications, is designed and implemented. Failure detection capabilities are provided by a failure detection server. Configuration objects can be used to capture the relationship among multiple processes that cooperate to replicate certain resources. Replicated address space (RAS) objects, whose content is accessible with a high probability despite hardware failures, can be used to increase data availability. Finally, a resistant process (RP) abstraction allows user processes to survive hardware failures with minimal interruption. Two different implementations of RP are provided: one periodically checkpoints its state in an RAS object; the other uses replicated execution, running the same code on different nodes at the same time.
The authors present an enhancement to distributed file systems that allows the users of the system to keep local copies of important files, decreasing their dependence on file servers. Using the notions of stashing and quasi-copies, the system allows users to tune the quality of the service they want to receive when the file server is not reachable. One of the key points of this work is the focus on the tradeoff between availability and degradation of service. The other main contribution is the design of a distributed file system that is ideally suited to very large distributed systems, in that it provides users with greater tolerance of network partitions and server failures. It is emphasized that the use of stashing does not preclude the use of other performance-enhancing or fault-tolerant techniques. The file system architecture has been implemented and FACE, a prototype of a file system service based on Sun's NFS, is described. Performance figures are reported. These figures show that the overhead of providing the service is negligible. Current plans also call for porting the FACE design to a number of other processors.
Traditional approaches to reliability and performance analysis become intractable when dealing with complex parallel and distributed processing systems, computer networks, and software for such systems. New approaches based on Petri nets, dataflow graphs, simulations and approximations are now used in such cases. In order to extend the utility of Petri nets and dataflow graphs, the authors present a decomposition technique that can be used to partition a large system into smaller subsystems, where performance indexes of the total system can be obtained (at least approximately) from the subsystem analyses. The decomposition reduces the computational complexity of analysis significantly. The approach (using marked graph components) is similar to the concept of 'near-completely decomposable' stochastic processes.
Adaptability is an essential tool for managing escalating software costs and for building high-reliability, high-performance systems. Algorithmic adaptability, which supports techniques for switching between classes of schedulers in distributed transaction systems, is modeled. RAID, an experimental system implemented to support experimentation in adaptability, is discussed. Adaptability features in RAID, including algorithmic adaptability, fault tolerance, and implementation techniques for an adaptable server-based design, are modeled.
This conference proceedings contains 19 papers. The following topics are dealt with: recovery; replication; network architecture; reliable communication; performance analysis; evaluation and modeling; and simulation and testing.
A major obstacle in implementing a rollback recovery scheme for fault tolerance in a concurrent distributed system is the domino effect. A low-overhead checkpointing scheme is proposed to prevent this effect. Each process saves its state periodically. The state-save synchronization among processes is implemented by bounding clock drifts. A communication protocol that ensures that all saved states are consistent is developed.
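One way to see how a communication rule can keep periodic checkpoints consistent is the sketch below. The abstract does not specify the protocol, so this is an assumption rather than the authors' scheme: each message carries the sender's checkpoint number (called `epoch` here, a hypothetical name), and a receiver that is behind saves its state before delivering the message. The bounded clock drift is only modeled implicitly, as one process's checkpoint timer firing slightly before another's.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A process that checkpoints periodically (hypothetical model)."""
    name: str
    epoch: int = 0                                   # number of checkpoints taken
    state: list = field(default_factory=list)        # delivered payloads
    checkpoints: dict = field(default_factory=dict)  # epoch -> saved state copy

    def checkpoint(self):
        """Save a copy of the current state under the next epoch number."""
        self.epoch += 1
        self.checkpoints[self.epoch] = list(self.state)

    def send(self, payload):
        """Tag the outgoing message with the sender's current epoch."""
        return (self.epoch, payload)

    def receive(self, message):
        """Catch up on checkpoints before delivering a message from a later epoch."""
        sender_epoch, payload = message
        while self.epoch < sender_epoch:
            self.checkpoint()
        self.state.append(payload)
```

Under this rule a message sent after the sender's k-th checkpoint is never recorded in the receiver's k-th checkpoint, so rolling both processes back to checkpoint k does not leave a received message that was never sent.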
HGPSS, a simulation language and environment aimed specifically at distributed systems, is described. HGPSS is upwardly compatible with GPSS, adding a number of features for the modeling of distributed database systems. The incorporation of these primitives reduces the complexity of the task of the simulation programmer. In addition, HGPSS is a portable system, thus permitting the use of more powerful processors. HGPSS presents a novel approach to simulation, namely, that of incorporating application-specific functionality into the basic tools. By enriching the simulation language with constructs designed explicitly for an application environment, the task of the modeler can be simplified substantially. Furthermore, for situations in which general algorithmic facilities are necessary, a direct C interface is provided. A software modeling environment for determining the performance of various distributed database systems is described, which provides the user with the tools needed to model and analyze such a system.
ISBN: (print) 0818619465
This paper presents a software modeling environment for estimating the performance of distributed database systems. This tool supports a simulation language, HGPSS, which comprises various simulation primitives, contains a collection of network modules, and allows for the collection of statistics. The paper provides an overview of the HGPSS environment, emphasizing its applicability to the modeling of distributed databases.
Multicast communication in a distributed system connected by a local area network can increase parallelism, and it can also provide greater functionality than one-to-one communication. In the authors' multicast protocol, the sender directs a message to a named group of receivers, which can be specified by function without requiring the sender to know the specific members of the group. Each host's kernel in the network can respond to every group message sent, providing various levels of reliability. It was found that the overhead of providing dependable multicast over a single local area network was very small, mainly because the protocol operates at the kernel level rather than the user level. Several forms of this multicast communication, expressed as simple message-passing communication primitives, are described, and the effectiveness of the protocol is evaluated using an example of a distributed algorithm. Performance analyses and actual performance data for the protocol are presented.
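The idea of addressing a group by name, with a selectable level of reliability, can be sketched in a few lines. This is an in-memory toy and not the authors' kernel-level protocol: the class `LocalNetwork`, the `required_acks` parameter and the `unreachable` argument are all hypothetical, and an acknowledgement is simply counted for every host that receives the message.

```python
class LocalNetwork:
    """Toy stand-in for hosts on one local area network (hypothetical)."""

    def __init__(self):
        self.groups = {}   # group name -> set of member host names
        self.inboxes = {}  # host name -> list of delivered payloads

    def join(self, group, host):
        """Register `host` as a member of the named group."""
        self.groups.setdefault(group, set()).add(host)
        self.inboxes.setdefault(host, [])

    def multicast(self, group, payload, required_acks=0, unreachable=()):
        """Deliver `payload` to every reachable member of `group`.

        Returns True if at least `required_acks` members acknowledged.
        The sender names only the group, never the individual members.
        """
        acks = 0
        for host in sorted(self.groups.get(group, ())):
            if host in unreachable:
                continue  # simulated lost message or crashed host
            self.inboxes[host].append(payload)
            acks += 1
        return acks >= required_acks
```

With `required_acks=0` the call behaves like an unreliable datagram to the group; setting it to the group size asks for confirmation from every member, which is one simple way to picture the "various levels of reliability" mentioned in the abstract.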