ISBN: (print) 0818682132
Fault tolerance is a survival attribute of complex computer systems and software in their ability to deliver continuous service to their users in the presence of faults. Formulating an analytic model for dependability and performance evaluation of hardware/software fault-tolerant architectures can be quite cumbersome. Also, in practice, isolating the effect of individual parameters on a system while holding the others constant requires exploring a variety of scenarios, and it is economically infeasible to build several such systems. Simulation offers an attractive mechanism for dependability evaluation and for studying the influence of various parameters on the failure behavior of the system. In this paper, we develop algorithms to simulate the failure behavior of three commonly used fault-tolerant architectures, viz., Distributed Recovery Block (DRB), N-Version Programming (NVP) and N-Self Checking Programming (NSCP). We demonstrate the ability of the approach to simulate complex failure scenarios with various dependencies using some illustrative numerical examples.
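As a rough illustration of what simulating the failure behavior of an NVP system can look like, the sketch below estimates the probability that a majority voter fails to obtain a correct majority. It is not the paper's algorithm: the function names `simulate_nvp` and `analytic_nvp` are hypothetical, and the model assumes that versions fail independently with a common per-run probability and that the voter is perfect, which ignores the dependencies the abstract mentions.

```python
import random
from math import comb


def simulate_nvp(n_versions, p_fail, runs, seed=0):
    """Monte Carlo estimate of the failure probability of an N-version system.

    Assumptions (not from the paper): each version fails independently with
    probability p_fail on every run, and the majority voter never fails.
    """
    rng = random.Random(seed)
    majority = n_versions // 2 + 1
    system_failures = 0
    for _ in range(runs):
        # Count the versions that produce a correct result on this run.
        correct = sum(1 for _ in range(n_versions) if rng.random() >= p_fail)
        if correct < majority:
            system_failures += 1
    return system_failures / runs


def analytic_nvp(n_versions, p_fail):
    """Closed-form failure probability under the same independence assumption."""
    majority = n_versions // 2 + 1
    return sum(
        comb(n_versions, k) * (1 - p_fail) ** k * p_fail ** (n_versions - k)
        for k in range(majority)  # fewer than `majority` correct versions
    )


if __name__ == "__main__":
    print("simulated:", simulate_nvp(3, 0.1, 200_000))
    print("analytic: ", analytic_nvp(3, 0.1))
```

Under the independence assumption the simulated estimate can be checked against the binomial closed form; the value of simulation lies in the cases where no such closed form is available, for example correlated version failures.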
NTT Software Labs is producing a distributed, self-configuring information navigation infrastructure designed to scale to global proportions. For reasons of large scale, unreliability (of the Internet, its connected computers, and the implementations), and the complete autonomy of the participants, a number of difficult database and cache consistency problems arise that are not solved by techniques commonly used either for the Internet (e.g., DNS) or for existing distributed database systems. This paper describes a set of strategies designed to solve these problems. In particular, it focuses on the use of third-party detection and notification of database and cache inconsistency.
Presents a technique that uses coverage measures in reliability estimation for fault-tolerant programs, particularly N-version software. For reliability prediction, this technique exploits both coverage and time measures collected during the testing phases of the individual program versions and of the N-version software system. The application of this technique to single-version software was presented in our previous research (IEEE 3rd Int. Symp. on Software Metrics, Berlin, Germany, March 1996). In this paper, we extend this technique and apply it to N-version programs. The results obtained from an experiment conducted on an industrial project demonstrate that our technique significantly reduces the hazard of reliability overestimation for both single-version and multi-version fault-tolerant software systems.
In a large-scale real-time control system, the functions of subsystems differ slightly although they seem to be identical. Functions change when hardware is replaced or when the function itself is upgraded. If a standard system is generated by merging the differences and then installed in each subsystem, the subsystems diverge again as maintenance is carried out in each one to fit its operational conditions. Furthermore, the subsystems are distributed over a wide area. Software maintenance therefore becomes very difficult. The paper describes distributed management for software maintenance in the Tokyo metropolitan area railway system. This management has two phases: generating programs, which makes maintenance easy, and executing programs, which increases the reliability of the system. It thereby contributes to the effectiveness of maintenance and increases the reliability of the total system.
Three algorithms designed to enforce different quality of service criteria are presented, as well as empirical assessments of the algorithms for three large industrial telecommunications systems. These assessments are made in terms of the simulated performance of each system on average loads selected from operational distributions collected during beta release and field use. In addition, synthetic heavy loads designed to cause the overall CPU utilization rates to exceed 90% of capacity were run. The algorithms build on previously defined load testing algorithms, and use parameters and operational distributions computed for that purpose. This makes the quality of service enforcement algorithms particularly efficient. The primary bases for the assessment of the algorithms were the overall deviation of the response time from the average, and the fraction of service requests that were throttled from clients under varying conditions.
The RPC (remote procedure call) protocol is one of the popular communication mechanisms. It makes writing distributed programs simple and transparent. The RPC protocol user need not have information on distributed environments and can easily construct distributed application systems. The RPC protocol reduces the communication overhead since it uses messages to communicate. This paper presents a group communication mechanism that cooperates with RPC. After a specific group is created at a user's request, its members can communicate with each other. A group RPC protocol can improve the reliability, transparency and facility of the classic RPC protocol. The proposed group RPC can be used in various applications such as video conferencing, replicated distributed databases and distributed network management.
Hardware-software co-synthesis is the process of partitioning an embedded system specification into hardware and software modules to meet performance, cost and reliability goals. In this paper, we address the problem of hardware-software co-synthesis of fault-tolerant real-time heterogeneous distributed embedded systems. Fault detection capability is imparted to the embedded system by adding assertion and duplicate-and-compare tasks to the task graph specification prior to co-synthesis. The reliability and availability of the architecture are evaluated during co-synthesis. Our algorithm allows the user to specify multiple types of assertions for each task. It uses the assertion or combination of assertions which achieves the required fault coverage without incurring too much overhead. We propose new methods to: 1) perform fault-tolerance-based task clustering, 2) derive the best error recovery topology using a small number of extra processing elements, 3) exploit multi-dimensional assertions, and 4) share assertions to reduce the fault tolerance overhead. Our algorithm can tackle multirate systems commonly found in multimedia applications. Application of the proposed algorithm to several real-life telecom transport system examples shows its efficacy.
Energy management systems/supervisory control and data acquisition (EMS/SCADA) systems are usually geographically distributed and have operational organizations. They change in accordance with their various and varying environments, and they should be flexible enough to adapt to those changes quickly. The paper proposes a new architecture called SCOPE (System Configuration of Power Control System) to realize flexible and reliable EMS/SCADA systems. SCOPE makes application programs independent of the operational organization and system configuration of the EMS/SCADA system, i.e., application programs are not influenced by changes in them. These properties make EMS/SCADA systems flexible and reliable, and make their development efficient and economical. Developing and evaluating a SCOPE prototype system has confirmed that the flexibility and maintainability of EMS/SCADA systems based on the SCOPE architecture are improved.
The various software systems developed for the DIII-D tokamak have played a highly visible and important role in tokamak operations and fusion research. Because of the heavy reliance on in-house developed software encompassing all aspects of operating the tokamak, much attention has been given to the careful design, development and maintenance of these software systems. Software systems responsible for tokamak control and monitoring, neutral beam injection, and data acquisition demand the highest level of reliability during plasma operations. These systems, made up of hundreds of programs totaling thousands of lines of code, have presented a wide variety of software design and development issues ranging from low-level hardware communications, database management, and distributed process control to man-machine interfaces. The focus of this paper is to describe how software is developed and managed for the DIII-D control and data acquisition computers. It includes an overview and status of the software systems implemented for tokamak control, neutral beam control, and data acquisition. The issues and challenges faced in developing and managing the large amounts of software in support of the dynamic and ever-changing needs of the DIII-D experimental program are also addressed.