The triangle counting problem in graph streams has been extensively studied in social network analysis, recommendation systems, user portraits and other fields. However, cloud computing based streamingalgorithms caus...
详细信息
The triangle counting problem in graph streams has been extensively studied in social network analysis, recommendation systems, user portraits and other fields. However, cloud computing based streamingalgorithms cause high bandwidth occupation and long transmission latency due to limited bandwidth of the cloud. Recently, edge computing is promising to overcome the issue of transmitting large-scale data for cloud computing. However, directly applying edge computing in streaming triangle counting will reduce the accuracy of the triangle count estimation, due to the limitation of local computing at the edge network. We term the cooperations between edge computing and cloud computing for streaming triangle counting as edge-cloud triangle counting in graph streams. In this paper, we first propose a streaming framework for edge-cloud triangle counting in graph streams. Then, we propose a streaming triangle counting algorithm called Trie-based Edge Compression (TbEC) by using the binary trie at the edge network that enables lossless compression and efficient transmission to the cloud. In addition, to extend our algorithms for triangle counting in multigraphs, we present a dual deduplication strategy collaboratively using the trie-based data structure and a Bloom Filter. Our experiments with real-world datasets show that TbEC is (a) Accurate: yielding up to 3.35x more accurate smaller estimation error than the state-of-the-art distributedstreaming algorithm, (b) Fast: yielding up to 10.59x faster than the state-of-the-art distributedstreaming algorithm, (c) Scalable: scaling linearly with the number of edges in the input graph stream.& COPY;2023 Elsevier B.V. All rights reserved.
Triangle counting is a fundamental problem in graph mining, essential for analyzing graph streams with arbitrary edge orders. However, exact counting becomes impractical due to the massive size of real-world graph str...
详细信息
ISBN:
(纸本)9798331530044;9798331530037
Triangle counting is a fundamental problem in graph mining, essential for analyzing graph streams with arbitrary edge orders. However, exact counting becomes impractical due to the massive size of real-world graph streams. To address this, approximate algorithms have been developed, but existing distributed streaming algorithms lack adaptability and struggle with edge deletions. In this article, we propose DTC, a novel family of single-pass distributed streaming algorithms for global and local triangle counting in fully dynamic graph streams. Our DTC-AR algorithm accurately estimates triangle counts without prior knowledge of graph size, leveraging multi-machine resources. Additionally, we introduce DTC-FD, an algorithm tailored for fully dynamic graph streams, incorporating edge insertions and deletions. Using Random Pairing and future edge insertion compensation, DTC-FD achieves unbiased and accurate approximations across multiple machines. Experimental results demonstrate significant improvements over baselines. DTC-AR achieves up to 2029.4 x and 27.1x more accuracy, while maintaining the best trade-off between accuracy and storage space. DTC-FD reduces estimation errors by up to 32.5 x and 19.3 x, scaling linearly with graph stream size. These findings highlight the effectiveness of our proposed algorithms in tackling triangle counting in real-world scenarios. The source code and datasets are released and available at https://***/Anonymousview/Real-Time-and-Accurate-distributed-Triangle-Counting-in-Fully-Dynamic-Graph-Streams.
Big data is flowing into every area of our life, professional and personal. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage and analyze, due to the ...
详细信息
ISBN:
(纸本)9781479942749
Big data is flowing into every area of our life, professional and personal. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage and analyze, due to the time and memory complexity. Velocity is one of the main properties of big data. In this demo, we present SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams. It provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Storm, S4, and Samza. SAMOA is written in Java and is available at http://*** under the Apache Software License version 2.0.
暂无评论