The storage requirement for distributed tracing can be reduced significantly by sampling only the anomalous or interesting traces that occur rarely at runtime. In this paper, we introduce an unsupervised sampling pipeline for distributed tracing that ensures high sampling accuracy while reducing the storage requirement. The proposed method, SampleHST-X, extends our recent work SampleHST. It operates based on a budget which limits the percentage of traces to be sampled while adjusting the storage quota of normal and anomalous traces depending on the size of this budget. The sampling process relies on accurately defining clusters of normal and anomalous traces by leveraging the distribution of mass scores, which characterize the probability of observing different traces, obtained from a forest of Half Space Trees (HST). In our experiments, using traces from a cloud data center, SampleHST yields 2.3× to 9.5× better sampling performance. SampleHST-X further extends the SampleHST approach by incorporating a novel class of Half Space Trees, namely Approximate HST, that uses approximate counters to update the mass scores. These counters significantly reduce the space requirement for HST while the sampling performance remains similar. In addition to this extension, SampleHST-X includes a Family of Graph Spectral Distances (FGSD) based trace characterization component, which, in addition to point anomalies, enables it to sample traces with collective anomalies. For such trace
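To illustrate what an approximate counter trades away, here is a minimal sketch of the classic Morris counter, which stores only a small exponent instead of an exact count. The abstract does not say which approximate counter Approximate HST uses, so the choice of the Morris scheme, and the names `MorrisCounter` and `average_estimate`, are assumptions made purely for illustration.

```python
import random


class MorrisCounter:
    """Approximate counter that keeps only a small exponent (assumed scheme)."""

    def __init__(self, rng=None):
        self.exponent = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Bump the exponent with probability 2^-exponent, so large counts
        # are recorded with exponentially fewer state changes.
        if self.rng.random() < 2.0 ** -self.exponent:
            self.exponent += 1

    def estimate(self):
        # 2^exponent - 1 is an unbiased estimate of the number of increments.
        return 2 ** self.exponent - 1


def average_estimate(true_count, trials, seed=0):
    """Average the estimates of several independent counters."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counter = MorrisCounter(rng)
        for _ in range(true_count):
            counter.increment()
        total += counter.estimate()
    return total / trials
```

A single counter is noisy, but its state grows only logarithmically with the count, which is the kind of space saving the abstract describes for mass-score updates.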
Currently, microservices are trending as the most popular software application design architecture. Software organisations are also being targeted by more cyber-attacks every day and newer security measures are in high demand. One available measure is the application of anomaly detection, which is defined as the discovery of irregular or unusual activity that occurs to a greater or lesser degree than normal occurrences in a data series. In this paper, we continue existing work where various real-world cyber-attacks are executed against a running microservices application, and the application traffic is logged and returned in the form of distributed traces. A Diffusion Convolutional Recurrent Neural Network is used to model the set of distributed traces and learn the spatial and temporal dependencies of the application traffic. Subsequently, the model is used to make predictions for ongoing microservice activity and threshold-based anomaly detection is applied to detect irregular microservice activity indicating the presence of seeded cyber security attacks, or anomalies. The cyber-attacks used to evaluate this approach include a brute force attack, a batch registration of bot accounts and a distributed denial of service attack. © 2022 The Author(s). Published by Elsevier Ltd.
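The threshold step described above can be sketched independently of the forecasting model. The abstract does not state how the threshold is chosen, so the rule used here (mean plus `k` standard deviations of prediction errors on attack-free traffic) and the function names `fit_threshold` and `detect_anomalies` are assumptions for illustration only.

```python
from statistics import mean, stdev


def fit_threshold(baseline_errors, k=3.0):
    """Derive a threshold from prediction errors on normal traffic.

    Assumed rule: mean + k * sample standard deviation.
    """
    return mean(baseline_errors) + k * stdev(baseline_errors)


def detect_anomalies(observed, predicted, threshold):
    """Return the indices where the absolute prediction error exceeds the threshold."""
    return [
        i
        for i, (obs, pred) in enumerate(zip(observed, predicted))
        if abs(obs - pred) > threshold
    ]


# Hypothetical request counts per time step for one microservice.
threshold = fit_threshold([1.0, 2.0, 3.0])          # 2.0 + 3 * 1.0 = 5.0
flagged = detect_anomalies([10, 12, 30, 11], [11, 12, 12, 10], threshold)
```

With these hypothetical values only the third time step, where the observed traffic is far above the prediction, is flagged.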
Distributed tracing has been widely adopted in many microservice systems and plays an important role in monitoring and analyzing the system. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent topic of discussion, and several methods have been proposed in prior work. To attain higher-quality sampling outcomes, biased sampling has gained more attention than random sampling. Previous biased sampling methods primarily considered the importance of traces based on diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity for sampling is insufficient; system runtime state is another crucial factor that needs to be considered, especially in cases of system failures. In this study, we introduce TraStrainer, an online sampler that takes into account both system runtime state and trace diversity. TraStrainer employs an interpretable and automated encoding method to represent traces as vectors. Simultaneously, it adaptively determines sampling preferences by analyzing system runtime metrics. When sampling, it combines the results of system-bias and diversity-bias through a dynamic voting mechanism. Experimental results demonstrate that TraStrainer can achieve higher quality sampling results and significantly improve the performance of downstream root cause analysis (RCA) tasks. It has led to an average increase of 32.63% in Top-1 RCA accuracy compared to four baselines in two datasets.
Background: Cloud-native software systems often have a much more decentralized structure and many independently deployable and (horizontally) scalable components, making it more complicated to create a shared and consolidated picture of the overall decentralized system state. Today, observability is often understood as a triad of collecting and processing metrics, distributed tracing data, and logging. The result is often a complex observability system composed of three stovepipes whose data are difficult to correlate. Objective: This study analyzes whether these three historically emerged observability stovepipes of logs, metrics and distributed traces could be handled in a more integrated way and with a more straightforward instrumentation approach. Method: This study applied an action research methodology used mainly in industry-academia collaboration and common in software engineering. The research design utilized iterative action research cycles, including one long-term use case. Results: This study presents a unified logging library for Python and a unified logging architecture that uses the structured logging approach. The evaluation shows that several thousand events per minute are easily processable. Conclusions: The results indicate that a unification of the current observability triad is possible without the necessity to develop utterly new toolchains.
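The structured logging idea can be sketched with Python's standard `logging` and `json` modules. This is not the unified logging library presented in the study; the names `JsonFormatter`, `make_logger`, and `log_event`, and the field names `kind`, `trace_id`, and `duration_ms`, are hypothetical and only show how one event stream could carry log, metric, and trace information together.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge the structured fields attached by log_event, if any.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def make_logger(name, stream):
    """Create a logger that writes JSON events to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def log_event(logger, message, **fields):
    """Emit one structured event; fields may hold metric or trace data."""
    logger.info(message, extra={"fields": fields})


stream = io.StringIO()
logger = make_logger("demo", stream)
log_event(logger, "request handled", kind="metric", trace_id="abc", duration_ms=12)
event = json.loads(stream.getvalue())
```

Because every event is a flat JSON object, a downstream system can filter on `kind` or join on `trace_id` without needing separate pipelines for logs, metrics, and traces.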