Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distri...
Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means by which failures can be detected in large distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems comprised of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications...
Gossip protocols and services provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. Extending the gossip protocol such that a system reaches consensus on detected faults can be performed via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. In this paper, the performance of gossip services employing flat and hierarchical schemes is analyzed on an experimental testbed in terms of consensus time, resource utilization and scalability. performance associated with a hierarchically arranged gossip scheme is analyzed with varying group sizes and is shown to scale well. Resource utilization of the gossip-style failure detection and consensus service is measured in terms of network bandwidth utilization and CPU utilization. Analytical models are developed for resource utilization and performance projections are made for large system sizes.
Functional dependencies (FDs) form a valuable ingredient for various data management tasks. However, existing methods can hardly discover practical and interpretable FDs, especially in large noisy real-life datasets. ...
详细信息
Recently, resource virtualisation has been proven effective for deploying large-scale IT-infrastructures, such as grids and clouds. However, many studies also indicate that the system's energy-efficiency will be r...
详细信息
Gossip protocols provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. However, in...
Gossip protocols provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. However, in order to be effective with application recovery and reconfiguration, these protocols require mechanisms by which failures can be detected with system-wide consensus in a scalable fashion. This paper presents three new gossip-style protocols supported by a novel algorithm to achieve consensus in scalable, heterogeneous clusters. The round-robin protocol improves on basic randomized gossiping by distributing gossip messages in a deterministic order that optimizes bandwidth consumption. Redundant gossiping is completely eliminated in the binary round-robin protocol, and the round-robin with sequence check protocol is a useful extension that yields efficient detection times without the need for system-specific optimization. The distributed consensus algorithm works with these gossip protocols to achieve agreement among the operable nodes in the cluster on the state of the system featuring either a flat or a layered design. The various protocols are simulated and evaluated in terms of consensus time and scalability using a high-fidelity, fault-injection model for distributed systems comprised of clusters of workstations connected by high-performance networks.
In recent times, Machine Learning has played an important role in developing novel advanced tools for threat detection and mitigation. Intrusion Detection, Misinformation, Malware, and Fraud Detection are just some ex...
详细信息
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and pr...
详细信息
ISBN:
(纸本)9781605589428
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is the use of checkpointing. By making use of a multi-cluster simulator, we study the impact of sub-optimal checkpoint intervals on overall application efficiency. By using a model of a 1926 node cluster and workload statistics from Los Alamos National laboratory to parameterize the simulator, we find that dramatically over-estimating the AMTTI has a fairly minor impact on application efficiency while potentially having a much more severe impact on user-centric performance metrics such a queueing delay. We compare and contrast these results with the trends predicted by an analytical model. Copyright 2010 ACM.
The computing power provided by highperformance and low cost PC-based clusters and Grid computing platforms are attractive and they are equal or superior to supercomputers and mainframes. In parallel, discussions on ...
详细信息
The emergence of pay-as-you-use compute clouds has enabled scientists to experiment with the latest processor architectures and accelerators. However, the lack of standardization in cloud computing, more specifically ...
详细信息
Scientific applications running on heterogeneous computing systems, which often have unpredictable behavior, enhance their performance by employing loop scheduling techniques as methods to avoid load imbalance through...
详细信息
暂无评论