Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems comprised of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
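As a rough illustration of the basic idea only, the sketch below simulates generic heartbeat gossip: each node increments its own heartbeat every round, pushes its table to one random peer, and suspects any node whose entry has not advanced for `t_fail` rounds. It is not the protocol evaluated in the paper. The agreement phase, the hierarchical variant, and piggybacking are all omitted, and every name and parameter here (`Node`, `simulate`, `t_fail`, the round-based timing) is an assumption made for the example.

```python
import random


class Node:
    """One participant in a round-based heartbeat gossip (illustrative only)."""

    def __init__(self, node_id, peers, t_fail):
        self.node_id = node_id
        self.peers = peers          # ids of all other nodes
        self.t_fail = t_fail        # rounds of silence before suspecting a node
        self.alive = True
        # node id -> (highest heartbeat seen, local round when it last increased)
        self.table = {p: (0, 0) for p in peers}
        self.table[node_id] = (0, 0)

    def tick(self, now):
        """Advance this node's own heartbeat counter."""
        heartbeat, _ = self.table[self.node_id]
        self.table[self.node_id] = (heartbeat + 1, now)

    def merge(self, remote_table, now):
        """Adopt any heartbeat that is newer than the one held locally."""
        for nid, (heartbeat, _) in remote_table.items():
            if heartbeat > self.table[nid][0]:
                self.table[nid] = (heartbeat, now)

    def suspects(self, now):
        """Nodes whose heartbeat has not advanced for more than t_fail rounds."""
        return {nid for nid, (_, seen) in self.table.items()
                if nid != self.node_id and now - seen > self.t_fail}


def simulate(n_nodes=8, rounds=100, crash_at=10, crashed=3, t_fail=30, seed=0):
    """Run the gossip for `rounds` rounds with one fail-silent crash."""
    rng = random.Random(seed)
    ids = list(range(n_nodes))
    nodes = {i: Node(i, [j for j in ids if j != i], t_fail) for i in ids}
    for now in range(1, rounds + 1):
        if now == crash_at:
            nodes[crashed].alive = False   # fail-silent: stops sending and receiving
        for node in nodes.values():
            if not node.alive:
                continue
            node.tick(now)
            target = nodes[rng.choice(node.peers)]
            if target.alive:               # gossip sent to a crashed node is lost
                target.merge(dict(node.table), now)
    # Suspicion sets held by the surviving nodes at the end of the run
    return {i: n.suspects(rounds) for i, n in nodes.items() if n.alive}


if __name__ == "__main__":
    for node_id, suspected in simulate().items():
        print(node_id, sorted(suspected))
```

The choice of `t_fail` carries the usual trade-off: a small value detects a crash sooner but risks suspecting a live node whose heartbeat simply has not propagated yet, which is one reason a separate agreement step is needed in practice.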
There have been significant advances in methods for specifying and solving models that aim to predict the performance and dependability of computer systems and networks. At the same time, however, there have been dramatic increases in the complexity of the systems whose performance and dependability must be evaluated, and considerable increases in the expectations of analysts that use performance/dependability evaluation tools. This paper briefly reviews the progress that has been made in the development of performance/dependability evaluation tools, and argues that the next important step is the creation of modeling frameworks and software environments that support multi-level, multi-formalism modeling and multiple solution methods within a single integrated framework. In addition, this paper presents an overview of the Mobius project, which aims to provide a modeling framework and software environment that support multiple modeling formalisms, methods for model composition and connection, and a way to integrate multiple analytical/numerical- and simulation-based model solution methods. Finally, it suggests research that must take place to make this aim a reality, and thus facilitate the performance and dependability evaluation of complex computer systems and networks.
This paper describes the basis for and preliminary implementation of a new fault injector, called Loki, developed specifically for distributed systems. Loki addresses issues related to injecting correlated faults in distributed systems. In Loki, fault injection is performed based on a partial view of the global state of an application. In particular, facilities are provided to pass user-specified state information between nodes to provide a partial view of the global state in order to try to inject complex faults successfully. A post-runtime analysis, using an off-line clock synchronization and a bounding technique, is used to place events and injections on a single global time-line and determine whether the intended faults were properly injected. Finally, observations containing successful fault injections are used to estimate specified dependability measures. In addition to describing the details of our new approach, we present experimental results obtained from a preliminary implementation in order to illustrate Loki's ability to inject complex faults predictably.
An MPI library's implementation of broadcast communication can significantly affect the performance of applications built with that library. In order to choose between similar implementations or to evaluate available libraries, accurate measurements of broadcast performance are required. As we demonstrate, existing methods for measuring broadcast performance are either inaccurate or inadequate. Fortunately, we have designed an accurate method for measuring broadcast performance, even in a challenging grid environment. Measuring broadcast performance is not easy. Simply sending one broadcast after another allows them to proceed through the network concurrently, thus resulting in inaccurate per-broadcast timings. Existing methods either fail to eliminate this pipelining effect or eliminate it by introducing overheads that are as difficult to measure as the performance of the broadcast itself. This problem becomes even more challenging in grid environments. Latencies along different links can vary significantly. Thus, an algorithm's performance is difficult to predict from its communication pattern. Even when accurate prediction is possible, the pattern is often unknown. Our method introduces a measurable overhead to eliminate the pipelining effect, regardless of variations in link latencies.
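The pipelining effect can be illustrated with a deliberately simple toy model; it is not the measurement method the abstract proposes. The model assumes the root can start a new broadcast every `send_gap` time units and that each broadcast reaches all processes `latency` time units after it starts. Both parameters and both function names are hypothetical.

```python
def completion_times(n_broadcasts, send_gap, latency):
    """Completion time of each broadcast when they are issued back to back.

    Toy assumption: broadcast k starts at k * send_gap and has reached
    every process `latency` time units later, independent of the others.
    """
    return [k * send_gap + latency for k in range(n_broadcasts)]


def naive_per_broadcast(n_broadcasts, send_gap, latency):
    """Naive estimate: total elapsed time divided by the number of broadcasts."""
    return completion_times(n_broadcasts, send_gap, latency)[-1] / n_broadcasts


if __name__ == "__main__":
    n, gap, lat = 100, 1.0, 10.0
    print("true latency of one broadcast:", lat)
    print("naive per-broadcast estimate: ", naive_per_broadcast(n, gap, lat))
```

Under these assumptions the naive estimate tends toward `send_gap` as the number of broadcasts grows, because consecutive broadcasts overlap in the network. The actual time for one broadcast to complete is `latency`, which the naive average underestimates.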
Building dependable distributed systems from commercial off-the-shelf components is of growing practical importance. For both cost and production reasons, there is interest in approaches and architectures that facilitate building such systems. The AQuA architecture is one such approach; its goal is to provide adaptive fault tolerance to CORBA applications by replicating objects, providing a high-level method for applications to specify their desired dependability, and providing a dependability manager that attempts to reconfigure a system at runtime so that dependability requests are satisfied. This paper describes how dependability is provided in AQuA. In particular, it describes Proteus, the part of AQuA that dynamically manages replicated distributed objects to make them dependable. Given a dependability request, Proteus chooses a fault tolerance approach and reconfigures the system to try to meet the request. The infrastructure of Proteus is described in this paper, along with its use in implementing active replication and a simple dependability policy.
Building dependable distributed systems using ad hoc methods is a challenging task. Without proper support, an application programmer must face the daunting requirement of having to provide fault tolerance at the application level, in addition to dealing with the complexities of the distributed application itself. This approach requires a deep knowledge of fault tolerance on the part of the application designer, and has a high implementation cost. What is needed is a systematic approach to providing dependability to distributed applications. Proteus, part of the AQuA architecture, fills this need and provides facilities to make a standard distributed CORBA application dependable, with minimal changes to an application. Furthermore, it permits applications to specify, either directly or via the Quality Objects (QuO) infrastructure, the level of dependability they expect of a remote object, and will attempt to configure the system to achieve the requested dependability level. Our previous papers have focused on the architecture and implementation of Proteus. This paper describes how to construct dependable applications using the AQuA architecture, by describing the interface that a programmer is presented with and the graphical monitoring facilities that it provides.
The authors present a metacomputing application of multivariate, nonhierarchical statistical clustering to geographic environmental data from the 48 conterminous United States in order to produce maps of regions of ec...
The FALCON development environment was designed around three basic data representations: scalars, vectors, and dense matrices. Utilizing the FALCON interactive restructuring system, the environment has been enhanced to allow the identification of structures within sparse matrices, such as diagonal matrices or symmetric matrices, and the use of this information for improving performance of the generated code. In addition, the environment supports the modification of the representation of the data. Such modifications have been shown to provide significant performance improvements.