Management policies can be used to specify requirements about the desired behaviour of distributed systems. Violations of policies (faults) can then be detected, isolated, located, and corrected using a policy-driven fault management system. Other work in this area to date has focused on network-level faults. We believe that in a distributed system it is more appropriate to focus on faults at the application level. Furthermore, this work has been largely domain specific; a generic, structured approach to this problem is needed. Our work has focused on policy-driven fault management in distributed systems at the application level. In this paper, we define a generic architecture for policy-driven fault management, and present a prototype system based on this architecture. We also discuss experience to date using and experimenting with our prototype system.
Creating robust software requires not only careful specification and implementation, but also quantitative measurement. This paper describes Ballista exception handling testing of the High Level Architecture Run-Time Infrastructure (HLA RTI). The RTI is a standard distributed simulation system intended to provide completely robust exception handling, yet implementations have normalized robustness failure rates as high as 10%. Non-robust testing responses include exception handler crashes, segmentation violations, 'unknown' exceptions, and task hangs. Other issues include different robustness failure modes across ports to two operating systems and mandatory client machine rebooting after a particular RTI failure. Testing the RTI led to scalable extensions of the Ballista architecture for handling exception-based error reporting models, testing object-oriented software structures (including call-backs, pass by reference, and constructors), and operating in a state-rich, distributed system environment. These results demonstrate that robustness testing can provide useful feedback to high-quality software development processes, and can be applied to domains well beyond the previous work on testing operating systems.
ISBN (print): 0818608757
The authors present an election protocol that does not assume an underlying ring structure and that tolerates failures, including lost messages and network partitioning, during the execution of the protocol itself. The major problem to be solved is that when nodes cannot communicate with one another or messages are lost, a conflict in resolving the election will often arise. In the authors' approach, the conflict is detected by the cohorts (noncandidate participants in the election). Related election protocols are discussed, and the system model is described together with assumptions about the communication subsystem. The protocol and the lost-message situations are then examined.
ISBN (print): 0818619465
This paper presents a software modeling environment for estimating the performance of distributed database systems. This tool supports a simulation language, HGPSS, which comprises various simulation primitives, contains a collection of network modules, and allows for the collection of statistics. The paper provides an overview of the HGPSS environment, emphasizing its applicability to the modeling of distributed databases.
ISBN (print): 0819455547
This paper describes how parallel retrieval is implemented in the content-based visual information retrieval framework VizIR. Generally, two major use cases for parallelisation exist in visual retrieval systems: distributed querying and simultaneous multi-user querying. Distributed querying includes parallel query execution and querying multiple databases. Content-based querying is a two-step process: transformation of feature space to distance space using distance measures, and selection of result set elements from distance space. Parallel distance measurement is implemented by sharing example media and query parameters between querying threads. In VizIR, parallelisation is heavily based on caching strategies. Querying multiple distributed databases is already supported by standard relational database management systems. The most relevant issues here are error handling and minimisation of network bandwidth consumption. Moreover, we describe strategies for distributed similarity measurement and content-based indexing. Simultaneous multi-user querying raises problems such as caching of querying results and usage of relevance feedback and user preferences for query refinement. We propose a 'real' multi-user querying environment that allows users to interact in defining queries and browse through result sets simultaneously. The proposed approach opens an entirely new field of applications for visual information retrieval systems.
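The two-step querying process described in this abstract can be illustrated with a minimal Python sketch. It is not VizIR code: the function names, the Euclidean distance measure, the chunking scheme, and the thread pool are all assumptions chosen for illustration. The query example is shared read-only between worker threads, which loosely mirrors the idea of sharing example media and query parameters between querying threads.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def euclidean(a, b):
    """Hypothetical distance measure; a real system may use others."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def query(example, media, k, workers=4, chunk=2):
    """Return the names of the k media items closest to the example.

    media is a list of (name, feature_vector) pairs.
    """
    # Step 1: map feature space to distance space. Each worker thread
    # measures one chunk of the collection against the shared example.
    chunks = [media[i:i + chunk] for i in range(0, len(media), chunk)]

    def measure(part):
        return [(euclidean(example, feat), name) for name, feat in part]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        distances = [d for part in pool.map(measure, chunks) for d in part]

    # Step 2: select result set elements from distance space.
    return [name for _, name in sorted(distances)[:k]]


media = [("a", (0, 0)), ("b", (3, 4)), ("c", (1, 0)), ("d", (10, 10))]
print(query((0, 0), media, k=2))
```

Separating the two steps means the selection rule (here, the k smallest distances) can change without touching the parallel distance measurement.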
ISBN (print): 9781479955848
In order to guarantee data reliability in distributed storage systems, erasure codes are widely used for their desirable storage properties. Nevertheless, the codes have one drawback: an excessive amount of data is needed to repair a failure, resulting in both high bandwidth consumption in the network and high calculation pressure on the replacement node. To address the repair bandwidth problem, researchers derive the tradeoff between storage and repair traffic from network coding and propose regenerating codes. However, the constructions of regenerating codes complicate the systems as well as the recovery calculation. Hence, this paper proposes a distributed repair method based on general erasure codes to mitigate the burden of both recovery computation and network traffic. We observe that distributing recovery computation among helpers can spread out the whole calculation procedure and accelerate repair in practical systems. Furthermore, by combining this technique with network topology, we introduce a novel repair tree to minimize repair traffic. The repair tree is also derived from network coding. The performance of the repair tree is preliminarily analyzed and evaluated, which suggests that the storage-bandwidth bound of regenerating codes can be broken under this model.
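The idea of distributing recovery computation among helpers along a tree can be sketched as follows. This is only an illustration under strong assumptions, not the construction from the paper: it uses a single XOR parity block as the simplest erasure code, and the helper names and tree shape are hypothetical. Each helper combines its own block with the partial results of its children and forwards one block upward, so the replacement node receives one block per direct child rather than one block per helper.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def partial_sum(node, blocks, children):
    """XOR of the block stored at node with those of all its descendants.

    Each tree edge carries exactly one block toward the root.
    """
    partial = blocks[node]
    for child in children.get(node, ()):
        partial = xor_blocks(partial, partial_sum(child, blocks, children))
    return partial


def repair(direct_helpers, blocks, children, block_len):
    """Rebuild the lost block at the replacement node from partial results."""
    recovered = bytes(block_len)  # all-zero block is the XOR identity
    for helper in direct_helpers:
        recovered = xor_blocks(recovered, partial_sum(helper, blocks, children))
    return recovered


# Three data blocks and one XOR parity block; d1 is lost.
d0, d1, d2 = b"\x01\x02", b"\x0f\xf0", b"\x10\x20"
parity = xor_blocks(xor_blocks(d0, d1), d2)

# Hypothetical topology: helper "hp" aggregates for "h0" and "h2".
blocks = {"h0": d0, "h2": d2, "hp": parity}
children = {"hp": ["h0", "h2"]}

assert repair(["hp"], blocks, children, block_len=2) == d1
```

In this toy topology the replacement node downloads a single block instead of three. A general erasure code would replace XOR with linear combinations over a finite field, which this sketch does not attempt.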
ISBN (print): 0769520693
As software Distributed Shared Memory (DSM) systems become attractive on larger clusters, the focus of attention moves toward improving the reliability of such systems. In this paper, we propose a lightweight logging scheme, called remote logging, and a recovery protocol for home-based DSM. Remote logging stores coherence-related data in the volatile memory of a remote node. The logging overhead can be moderated with a high-speed system area network and user-level DMA operations supported by modern communication protocols. Remote logging tolerates multiple failures if the backup nodes of failed nodes are alive, which considerably improves the reliability of DSM. Experimental results show that our fault-tolerant DSM has low overhead compared to conventional stable logging and that it can be effectively recovered from some concurrent failures.
ISBN (print): 0818607491
The following topics are dealt with: real-time distributed programming systems; architecture and interconnection schemes; fault tolerance; reliability estimation and performance modeling; performance analysis; intercommunication protocols; operative systems; dynamic and distributed scheduling; task allocation and load balancing; real-time operating system for nuclear power plant computer; real-time juggling robot; and real-time direct kinematics on a VLSI chip. 30 papers were presented, all of which are published in full in the present proceedings. Abstracts of individual papers can be found under the classification codes in this or other issues.
The proceedings contains 27 papers from the 1996 IEEE Real-Time Technology and Applications Symposium. Topics discussed include case studies and applications of real time systems, database systems and concurrency control, software engineering, data communication systems, real time system development and analysis tools, formal methods and processing scheduling, and operating systems and distributed systems.
ISBN (print): 9781728198705
Modern stream processing systems need to process large volumes of data in real-time. Various stream processing frameworks have been developed and messaging systems are widely applied to transfer streaming data among different applications. As a distributed messaging system with growing popularity, Apache Kafka processes streaming data in small batches for efficiency. However, the robustness of Kafka's batching method against variable operating conditions is not known. In this paper we study the impact of the batch size on the performance of Kafka. Both configuration parameters, the spatial and temporal batch size, are considered. We build a Kafka testbed using Docker containers to analyze the distribution of Kafka's end-to-end latency. The experimental results indicate that evaluating the mean latency only is unreliable in the context of real-time systems. In the experiments where network faults are injected, we find that the batch size affects the message loss rate in the presence of an unstable network connection. However, allocating resources for message processing and delivery that will violate the reliability requirements implemented as latency constraints of a real-time system is inefficient. To address these challenges we propose a reactive batching strategy. We evaluate our batching strategy in both good and poor network conditions. The results show that the strategy is powerful enough to meet both latency and throughput constraints even when network conditions are variable.
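The interplay of a spatial and a temporal batch size can be sketched with a small Python class. This is neither Kafka's implementation nor the reactive strategy proposed in the paper; the class and parameter names are hypothetical. In Kafka's producer configuration the two limits most likely correspond to the `batch.size` and `linger.ms` settings, which is an assumption about the paper's terminology. Time is passed in explicitly so the behaviour is deterministic.

```python
class Batcher:
    """Collects messages and flushes on a size limit or a time limit."""

    def __init__(self, max_bytes, linger_s):
        self.max_bytes = max_bytes  # spatial batch size, in bytes
        self.linger_s = linger_s    # temporal batch size, in seconds
        self._pending = []
        self._size = 0
        self._first_ts = None       # arrival time of the oldest message

    def add(self, msg, now):
        """Queue a message; return a flushed batch, or None if none is due."""
        if self._first_ts is None:
            self._first_ts = now
        self._pending.append(msg)
        self._size += len(msg)
        return self.poll(now)

    def poll(self, now):
        """Flush if the batch is full or its oldest message waited too long."""
        if not self._pending:
            return None
        full = self._size >= self.max_bytes
        expired = now - self._first_ts >= self.linger_s
        if not (full or expired):
            return None
        batch, self._pending = self._pending, []
        self._size = 0
        self._first_ts = None
        return batch


b = Batcher(max_bytes=10, linger_s=0.5)
assert b.add(b"abcd", now=0.0) is None
assert b.add(b"efgh", now=0.1) is None
assert b.add(b"ij", now=0.2) == [b"abcd", b"efgh", b"ij"]  # size limit hit
assert b.add(b"x", now=1.0) is None
assert b.poll(now=1.5) == [b"x"]                           # time limit hit
```

A larger size limit tends to favour throughput, while a shorter time limit bounds how long a message can wait before it is sent, which is the kind of latency and throughput tradeoff the abstract studies.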