The strategy of using multiple versions of independently developed software as a means to tolerate residual software design faults is suggested by the success of hardware redundancy for tolerating hardware failures. It is generally accepted that the independence of hardware failures resulting from physical wearout can lead to substantial increases in reliability for redundant hardware structures, but a similar conclusion is not immediate for software. The degree to which design faults are manifested as independent failures determines the effectiveness of redundancy as a method for improving software reliability. Interest in multiversion software centers on whether it provides an adequate measure of increased reliability to warrant its use in critical applications. The effectiveness of multiversion software is studied by comparing estimates of the failure probabilities of these systems with the failure probabilities of single versions. The estimates are obtained under a model of dependent failures and compared with estimates obtained when failures are assumed to be independent. The experimental results are based on 20 versions of an aerospace application developed and independently validated by 60 programmers from 4 universities. Descriptions of the application and development process are given, together with an analysis of the 20 versions.
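As a rough illustration of the independence baseline mentioned in this abstract, the sketch below computes the probability that a majority of n versions fail on the same input when each version fails independently with probability p. The function name and the value of p are assumptions chosen for illustration; the abstract's dependent-failure model is not specified here and is not reproduced.

```python
from math import comb


def majority_failure_prob(p: float, n: int) -> float:
    """Probability that more than half of n versions fail on one input.

    Assumes every version fails independently with the same probability p.
    This is the optimistic baseline; correlated failures make the real
    figure higher.
    """
    k_min = n // 2 + 1  # smallest number of failures that forms a majority
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


# Hypothetical single-version failure probability, not taken from the study.
p_single = 0.01
p_triple = majority_failure_prob(p_single, 3)
```

Under these assumptions a 3-version system fails with probability 3p²(1 − p) + p³, far below p itself. The abstract's point is that this gain is only realized to the degree that failures really are independent.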
Two different techniques have evolved for software fault tolerance: n-version programming and recovery blocks. Both of these techniques are based on design diversity and employ multiple software versions for the same problem. Due to the different methods employed for error detection by the two techniques, one may be better suited than the other for certain applications. We describe an environment that supports execution of programs using both n-version programming and recovery blocks in a uniform manner. The basic unit of fault tolerance supported by this system is at the procedure or function level. Each such program unit can be packaged as its own task, and different fault tolerance techniques can subsequently be employed, even within the same application. The environment also allows versions to be written in different programming languages, and executed on different machines. This enhances the independence between the different versions, making the fault tolerance techniques more effective. This environment has been developed for use on Unix-based hosts, and currently runs on a network of Sun and DEC workstations.
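The two techniques named in this abstract differ mainly in how errors are detected: recovery blocks run alternates one at a time against an acceptance test, while n-version programming runs all versions and votes. A minimal function-level sketch follows. All names here (`recovery_block`, `n_version`, the integer square root versions) are hypothetical and do not describe the environment's actual interface.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, x):
    """Try alternates in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


def n_version(versions, x):
    """Run all versions and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


# Hypothetical versions of an integer square root, one of them faulty.
def isqrt_a(x):
    return math.isqrt(x)


def isqrt_b(x):
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r


def isqrt_faulty(x):
    return x // 2  # wrong for most inputs


def accept(x, r):
    """Acceptance test: r is the floor of the square root of x."""
    return r * r <= x < (r + 1) * (r + 1)
```

The alternates here are pure functions, so no state has to be restored between attempts. A real recovery block would also roll back any side effects before trying the next alternate.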
For more than ten years, design diversity experiments have been conducted to study fault-tolerant multiple-version software systems. Design diversity is the approach by which multiple versions of a software system are independently developed. Our current focus is on distributed software engineering techniques and methods for improving the specification and testing phases. With multiversion development, multiple implementations allow the use of an automated approach to testing called Back-to-Back (B/B) Testing in which the outputs are compared to detect any discrepancies. This obviates the need to determine the correct response a priori, allowing automated execution of a large number of test cases. However, a specification defect may lead to similar errors in the multiple versions and the underlying fault may not be detected with a B/B testing approach. The use of diverse formal specifications is a proposed solution to this problem since defects in independently-written specifications are likely to be different. To examine these issues, an experiment was performed using the design diversity approach in the specification, design, implementation, and testing of distributed software. In the experiment, three diverse formal specifications were used to produce multiple independent implementations of a distributed communication protocol in Ada. Another important aspect of this study was the investigation of problems encountered in building complex concurrent processing systems in Ada. Many pitfalls were discovered in mapping the formal specifications into Ada implementations. In the experiment, the process of controlling human factors, collecting accurate and appropriate data, and drawing valid conclusions was a continuing challenge. [ABSTRACT FROM AUTHOR]
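Back-to-back testing as described in this abstract can be sketched in a few lines: run every version on each input and record the inputs where the outputs disagree. The harness and the three absolute-value versions below are hypothetical illustrations, not code from the experiment (which used Ada).

```python
def back_to_back(versions, inputs):
    """Return (input, outputs) pairs for every input where versions disagree.

    No oracle for the correct answer is needed; only agreement is checked.
    Outputs must be hashable so they can be compared through a set.
    """
    discrepancies = []
    for x in inputs:
        outputs = [version(x) for version in versions]
        if len(set(outputs)) > 1:
            discrepancies.append((x, outputs))
    return discrepancies


# Hypothetical versions of an absolute-value routine.
def abs_a(x):
    return abs(x)


def abs_b(x):
    return x if x > 0 else -x


def abs_c(x):
    return -x if x < -1 else x  # faulty boundary: wrong for x == -1


found = back_to_back([abs_a, abs_b, abs_c], range(-3, 4))
missed = back_to_back([abs_c, abs_c], range(-3, 4))
```

In this sketch `found` contains the single disagreement at x = -1. `missed` is empty even though both versions are wrong there, which is the blind spot the abstract attributes to faults shared across versions.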
This paper presents the results of an empirical study of software error detection using self checks and n-version voting. A total of 24 graduate students in computer science at the University of Virginia and the University of California, Irvine, were hired as programmers. Working independently, each first prepared a set of self checks using just the requirements specification of an aerospace application, and then each added self checks to an existing implementation of that specification. The modified programs were executed to measure the error-detection performance of the checks and to compare this with error detection using simple voting among multiple versions. The goal of this study was to learn more about the effectiveness of such checks. The analysis of the checks revealed that there are great differences in the ability of individual programmers to design effective checks. We found that some checks that might have been effective failed to detect an error because they were badly placed, and there were numerous instances of checks signaling nonexistent errors. In general, specification-based checks alone were not as effective as combining them with code-based checks. Using self checks, faults were identified that had not been detected previously by voting 28 versions of the program over a million randomly-generated inputs. This appeared to result from the fact that the self checks could examine the internal state of the executing program whereas voting examines only final results of computations. If internal states had to be identical in n-version voting systems, then there would be no reason to write multiple versions. The programs were executed on 100 000 new randomly-generated input cases in order to compare error detection by self checks and by 2-version and 3-version voting. Both self checks and voting techniques led to the identification of the same number of faults for this input, although the identified faults were not the same. Furthermore
Under a voting strategy in a fault-tolerant software system there is a difference between correctness and agreement. An independent n-version programming reliability model which distinguishes between correctness and agreement is proposed for treating small output spaces. We use an alternative voting strategy, viz., consensus voting, to treat cases when there can be agreement among incorrect outputs, a case which can occur with small output spaces. The consensus voting strategy automatically adapts the voting to various version reliability and output-space cardinality characteristics. The majority-voting strategy provides reliability which is a lower bound, and the 2-out-of-n voting strategy provides reliability which is an upper bound, on the reliability by consensus voting. The reciprocal of the cardinality of the output space is a lower bound on the average reliability of fault-tolerant system versions below which the system reliability begins to deteriorate as more versions are added.
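The difference between majority voting and consensus voting can be sketched as follows. The abstract does not spell out the consensus rule, so this sketch assumes the form in which it is commonly described: take the largest agreement group even when it falls short of a majority, and break ties between equally large groups at random. The function names are hypothetical.

```python
import random
from collections import Counter


def majority_vote(outputs):
    """Return the output backed by more than half the versions, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def consensus_vote(outputs, rng=random):
    """Return the output of the largest agreement group.

    Assumed rule: a unique largest group wins even without a majority;
    ties between equally large groups are broken at random.
    """
    counts = Counter(outputs)
    top = max(counts.values())
    candidates = [value for value, count in counts.items() if count == top]
    if len(candidates) == 1:
        return candidates[0]
    return rng.choice(candidates)


# Five versions, no majority: majority voting gives up, consensus picks 1.
sample = [1, 1, 2, 3, 4]
```

With `sample`, `majority_vote` returns `None` while `consensus_vote` returns `1`. Consensus voting can therefore answer in cases where majority voting cannot, but whether that answer is correct depends on the output-space size and version reliability discussed in the abstract.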
We have conducted a large-scale experiment in n-version programming. A total of 27 versions of a program were prepared independently from the same specification at two universities. The results of executing the versions revealed that the versions were individually extremely reliable but that the number of input cases in which more than one failed was substantially more than would be expected if they were statistically independent. After the versions had been executed, the failures of each version were examined and the associated faults located. In this paper we present an analysis of these faults. Our goal in undertaking this analysis was to understand better the nature of the faults. We found that in some cases the programmers made equivalent logical errors, indicating that some parts of the problem were simply more difficult than others. We also found cases in which apparently different logical errors yielded faults that caused statistically correlated failures, indicating that there are special cases in the input space that present difficulty in various parts of the solution. A formal model is presented to explain this phenomenon. It appears that minor differences in the software development environment, such as the use of different programming languages for the different versions, would not have a major impact in reducing the incidence of faults that cause correlated failures. [ABSTRACT FROM AUTHOR]
Above all, it is vital to recognize that completely guaranteed behavior is impossible and that there are inherent risks in relying on computer systems in critical environments. The unforeseen consequences are often the most disastrous [Neumann 1986]. Section 1 of this survey reviews the current state of the art of system reliability, safety, and fault tolerance. The emphasis is on the contribution of software to these areas. Section 2 reviews current approaches to software fault tolerance. It discusses why some of the assumptions underlying hardware fault tolerance do not hold for software. It argues that the current software fault tolerance techniques are more accurately thought of as delayed debugging than as fault tolerance. It goes on to show that in providing both backtracking and executable specifications, logic programming offers most of the tools currently used in software fault tolerance. Section 3 presents a generalization of the recovery block approach to software fault tolerance, called resourceful systems. Systems are resourceful if they are able to determine whether they have achieved their goals or, if not, to develop and carry out alternate plans. Section 3 develops an approach to designing resourceful systems based upon a functionally rich architecture and an explicit goal orientation.
An experiment is described for the determination of the overhead associated with n-version programming, a technique for achieving software reliability. Results are presented for the performance as a function of the number of versions in the voting process. Overhead is defined as additional algorithm execution time as required for the management of the voting process at each algorithm checkpoint. The experiment is conducted in a Unix C environment.
We have identified a difficulty in the implementation of n-version programming. The problem, which we call the Consistent Comparison Problem, arises for applications in which decisions are based on the results of comparisons of finite-precision numbers. We show that when versions make comparisons involving the results of finite-precision calculations, it is impossible to guarantee the consistency of their results. It is therefore possible that correct versions may arrive at completely different outputs for an application that does not apparently have multiple correct solutions. If this problem is not dealt with explicitly, an n-version system may be unable to reach a consensus even when none of its component versions fails. [ABSTRACT FROM AUTHOR]
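A small illustration of the Consistent Comparison Problem, assuming IEEE 754 double-precision arithmetic as used by Python floats. The two versions below are hypothetical and both defensible: they sum the same readings in different orders, then compare the total against the same limit. Because floating-point addition is not associative, they end up on opposite sides of the comparison.

```python
def version_a(readings, limit):
    """Sum the readings left to right, then compare against the limit."""
    total = 0.0
    for r in readings:
        total += r
    return "ALARM" if total > limit else "OK"


def version_b(readings, limit):
    """Sum the same readings right to left, then compare against the limit."""
    total = 0.0
    for r in reversed(readings):
        total += r
    return "ALARM" if total > limit else "OK"


# Hypothetical sensor readings whose exact sum equals the limit.
readings = [0.1, 0.2, 0.3]
limit = 0.6

decision_a = version_a(readings, limit)  # total is 0.6000000000000001
decision_b = version_b(readings, limit)  # total is 0.6
```

Neither version contains a fault, yet `decision_a` is `"ALARM"` and `decision_b` is `"OK"`. A voter that sees only these final decisions cannot tell this apart from a genuine failure.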
Multi-version software systems achieve fault tolerance through software redundancy. Diverse software versions are executed concurrently by a supervisory system that reports consensus results, allowing the results from erroneous versions to be masked by the majority. The Second Generation Experiment is a large scale empirical study of multi-version software systems engaging researchers at six sites. This paper presents UCLA's perspective of this experiment, its role in the preliminary analysis, and related research at the Dependable Computing and Fault Tolerant Systems Laboratory.