ISBN (Print): 0769518524
Grid computing relies on fragile partnerships. Clients with hundreds or even thousands of pending service requests must seek out and form temporary alliances with remote servers eager to satisfy them. Yet, despite the high quality and reliability of these servers and their software, unexpected events and behavior are common. Communication networks, power systems, operating systems, middleware, and operator intervention all conspire to attack even the most carefully arranged client-server interaction. To survive in such an imperfect world, customers of grid resources must be equipped with resilient client software that tolerates failures while aggressively representing their interests. Following our tradition of developing technology that harnesses the power of opportunistic resources, the Condor Project is actively engaged in developing the basic mechanisms for building dependable and effective grid computing clients. Guided by our experience and the practical needs of production users in disciplines as diverse as astronomy and sociology, the Project aims to equip users with powerful software that complements the reliability of the servers that they exploit. Our most visible product is the Condor-G job manager. Other research ventures, including the full Condor distributed system, offer valuable lessons in dependable client-side management. Dependability has been explored in a number of branches of computing, ranging from database systems to programming languages. The hard-earned lessons from these fields are also essential to grid computing. Fundamental concepts such as timeouts, logging, checkpoints, transactions, leases, and atomic operations must be employed and expressed in basic protocols and interfaces for CPU and I/O access. Without these techniques, clients and servers lose track of each other's state, leading to missed opportunities, wasted resources, incorrect results, and unnecessary failures. This principle is espoused in systems such as Condor-G and prot...
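Two of the primitives this abstract lists, logging and retries, can be illustrated with a minimal client wrapper. This is a hedged sketch, not Condor-G's actual interface; `ResilientClient`, `submit_fn`, and `SubmissionFailed` are invented names for illustration.

```python
class SubmissionFailed(Exception):
    """Raised when a job cannot be placed after all retries."""

class ResilientClient:
    """Minimal sketch of a dependable grid client: every submission
    attempt is logged before it is made, and failed attempts are
    retried. Illustrative names only, not a real Condor-G API."""

    def __init__(self, submit_fn, max_retries=3):
        self.submit_fn = submit_fn    # remote call; may raise on failure
        self.max_retries = max_retries
        self.log = []                 # durable record of attempts/outcomes

    def submit(self, job):
        for attempt in range(1, self.max_retries + 1):
            self.log.append(("attempt", job, attempt))  # log, then act
            try:
                result = self.submit_fn(job)
                self.log.append(("ok", job, attempt))
                return result
            except Exception as err:
                self.log.append(("fail", job, attempt, str(err)))
        raise SubmissionFailed(f"{job!r} failed after {self.max_retries} tries")
```

Because an attempt is logged before the remote call is made, a restarted client can distinguish "never submitted" from "submitted, outcome unknown", which is precisely the state that the abstract says clients and servers must not lose track of.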
ISBN (Print): 0769516866
The dQUOB system's conceptualization of data streams as a database, and its SQL interface to data streams, is an intuitive way for users to think about their data needs in a large-scale application containing hundreds if not thousands of data streams. Experience with dQUOB has shown the need for more aggressive memory management to achieve the scalability we desire. This paper addresses the problem with a two-fold solution. The first is replacement of the existing First Come First Served (FCFS) scheduling algorithm with an Earliest Job First (EJF) algorithm, which we demonstrate to yield better average service time. The second is an introspection algorithm that sets and adapts the sizes of join windows in response to knowledge acquired at runtime about event rates. In addition to the potential for significant improvements in memory utilization, the algorithm presented here also provides a means by which the user can reason about join window sizes. Wide-area measurements demonstrate the adaptive capability required by the introspection technique.
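The average-service-time claim can be illustrated with a small queueing calculation. The abstract does not spell out how EJF orders jobs; the sketch below assumes jobs are served in order of earliest expected finish, which for jobs queued at the same instant reduces to smallest-service-demand-first.

```python
def average_completion_time(service_times):
    """Average time until each job's own completion, for jobs served
    back-to-back in the given order (all assumed queued at time 0)."""
    clock, total = 0.0, 0.0
    for s in service_times:
        clock += s          # this job finishes after all earlier ones
        total += clock
    return total / len(service_times)

fcfs_order = [5.0, 1.0, 3.0]     # serve in arrival order
ejf_order = sorted(fcfs_order)   # earliest-finishing job first
```

For this toy queue, FCFS averages 20/3 time units per job while the reordered queue averages 14/3 — the kind of improvement in average service time that motivates replacing FCFS.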
In this paper we describe an infrastructure that provides increased reliability for three-tier applications, transparently, using commercial off-the-shelf application servers and database systems. In this infrastructure the application servers are actively replicated to protect the business-logic processing. Replicating the transaction coordinator renders the two-phase commit protocol non-blocking and thus avoids potentially long service disruptions caused by coordinator failure. A thin interpositioning library provides client-side automatic failover, so that clients know the outcome of their requests. The interaction between the application servers and the database servers is handled through replicated gateways that prevent duplicate requests from reaching the database servers. Aborted transactions, caused by process or communication faults, are automatically retried on the client's behalf.
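The client-side behaviour described here — automatic failover plus retry of aborted transactions — can be sketched as a thin proxy. The class and method names below are invented for illustration; they are not the paper's actual interpositioning library.

```python
class FailoverProxy:
    """Sketch of a client-side interpositioning layer: each request is
    tried against every replica in turn, and the whole round is retried
    on failure, so the caller always learns a definite outcome
    (a result or a final exception). Illustrative names only."""

    def __init__(self, replicas, max_rounds=2):
        self.replicas = list(replicas)  # callables standing in for servers
        self.max_rounds = max_rounds

    def invoke(self, request):
        last_err = None
        for _ in range(self.max_rounds):
            for replica in self.replicas:
                try:
                    return replica(request)   # first healthy replica wins
                except Exception as err:      # crashed server or aborted txn
                    last_err = err
        raise RuntimeError(f"all replicas failed: {last_err}")
```

In the real system the proxy must also tag each request with an identifier so the replicated gateways can suppress duplicates; that detail is omitted here.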
ISBN (Print): 076951300X
The Global Change Master Directory (GCMD) is an earth science repository that specifically tracks research data on global climatic change. The GCMD is migrating from a centralized architecture to a globally distributed, replicated, heterogeneous federated system. One of the greatest challenges facing database research is the integration of heterogeneous systems without compromising the local autonomy, reliability and transparency of the various databases that are participating in the integration. This paper discusses these challenges in the context of the design and implementation of the next version of the GCMD software (Version 8.0). The proposed system has been designed and developed using an object-oriented system architecture based on Java, RMI (Remote Method Invocation) and JDBC. This system enables other sources to be integrated into the GCMD system, with limited changes to the local system itself. This paper describes the components of the GCMD system and addresses the issues of heterogeneity, distribution and autonomy.
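The federation pattern the abstract describes — autonomous local sources joined through small adapters — can be sketched in a few lines. This is an illustrative Python stand-in for the Java/RMI/JDBC implementation; `FederatedDirectory` and its adapters are invented names.

```python
class FederatedDirectory:
    """Sketch of the mediator pattern: each local source keeps its own
    schema, and a small adapter function translates queries for it, so
    sources join the federation with minimal change to the local system.
    (Illustrative stand-in, not the GCMD Version 8.0 code.)"""

    def __init__(self):
        self.adapters = []            # one adapter per autonomous source

    def register(self, adapter):
        self.adapters.append(adapter)

    def search(self, keyword):
        results = []
        for adapter in self.adapters:  # fan out, preserving local autonomy
            results.extend(adapter(keyword))
        return sorted(set(results))    # merge and de-duplicate replicas
```

Local autonomy is preserved because each adapter, not the mediator, decides how a keyword query maps onto its own source.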
ISBN (Print): 0769513069
Critical infrastructures provide services upon which society depends heavily; these applications are themselves dependent on distributed information systems for all aspects of their operation, and so survivability of the information systems is an important issue. Fault tolerance is a mechanism by which survivability can be achieved in these information systems. We outline a specification-based approach to fault tolerance, called RAPTOR, that enables structuring of fault-tolerance specifications and an implementation partially synthesized from the formal specification. The RAPTOR approach uses three specifications describing the fault-tolerant system, the errors to be detected, and the actions to take to recover from those errors. System specification utilizes an object-oriented database to store the descriptions associated with these large, complex systems. The error detection and recovery specifications are defined using the formal specification notation Z. We also describe an implementation architecture and explore our solution with a case study.
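RAPTOR's three-part structure — a system description, errors to detect, and recovery actions — can be sketched as data plus predicate/action pairs. The Python below is an illustrative analogue only; the paper uses the Z notation and partial synthesis, not this code.

```python
class FaultToleranceSpec:
    """Toy analogue of RAPTOR's three specifications: the system state
    (as data), error-detection predicates, and the recovery action
    bound to each error. Illustrative names only."""

    def __init__(self, state):
        self.state = state     # system specification, here a plain dict
        self.rules = []        # (detect_predicate, recover_action) pairs

    def on_error(self, detect, recover):
        self.rules.append((detect, recover))

    def step(self):
        """Evaluate every detection predicate against the current state;
        run the recovery action for each one that fires."""
        fired = []
        for detect, recover in self.rules:
            if detect(self.state):
                recover(self.state)
                fired.append(detect.__name__)
        return fired
```

Keeping detection and recovery as separate, declarative pairs mirrors why the approach scales: new failure modes are added as rules rather than woven through application code.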
ISBN (Print): 0769513069
The authors revisit the problem of software fault tolerance in distributed systems. In particular, we propose an extension of a message-driven confidence-driven (MDCD) protocol we have developed for error containment and recovery in a particular type of distributed embedded system. More specifically, we augment the original MDCD protocol by introducing the method of "fine-grained confidence adjustment," which enables us to remove the architectural restrictions. The dynamic nature of the MDCD approach gives it a number of desirable characteristics. First, this approach does not impose any restrictions on interactions among application software components or require costly message-exchange-based process coordination/synchronization. Second, the algorithms allow redundancies to be applied only to low-confidence or critical interacting software components in a distributed system, permitting flexible realization of software fault tolerance. Finally, the dynamic error containment and recovery mechanisms are transparent to the application and ready to be implemented by generic middleware.
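The confidence-propagation idea at the heart of MDCD can be sketched with two nodes: consuming a message from a low-confidence sender marks the receiver's own state low-confidence, and a passed acceptance test restores it. This is a hedged toy model; the real protocol (checkpointing, fine-grained adjustment) is considerably richer.

```python
class MdcdNode:
    """Toy model of message-driven confidence-driven error containment.
    Confidence travels with messages: output produced while a node
    doubts its own state is itself suspect. Illustrative names only."""

    def __init__(self, name):
        self.name = name
        self.confident = True   # confidence in this node's current state
        self.inbox = []

    def send(self, receiver, payload):
        receiver.receive(self, payload)

    def receive(self, sender, payload):
        if not sender.confident:
            self.confident = False   # contain the potentially bad value
        self.inbox.append(payload)

    def pass_acceptance_test(self):
        self.confident = True        # state validated, confidence restored
```

Note that containment happens without any extra coordination messages: the receiver adjusts its own confidence locally, which is the property the abstract highlights.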