There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the work that does use the Hyperlink graph relies on distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 20 important graph problems. We also present the interfaces, optimizations, and graph processing techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly available as the Graph Based Benchmark Suite (GBBS).
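To make the programming model concrete, below is a minimal, sequential Python sketch of a frontier-based (vertexSubset/edgeMap-style) interface of the kind used by shared-memory graph processing systems such as GBBS. The function names and the tiny example graph are illustrative assumptions; a real implementation parallelizes the edge mapping and switches between sparse and dense frontier representations.

```python
# Minimal, sequential sketch of a frontier-based (vertexSubset / edgeMap style)
# interface; GBBS-like systems parallelize edge_map over the frontier and switch
# between sparse and dense traversal. Names here are illustrative only.

def edge_map(graph, frontier, update, cond):
    """Apply update(u, v) over edges (u, v) with u in the frontier and cond(v) True;
    return the set of destinations for which update succeeded (the next frontier)."""
    out = set()
    for u in frontier:
        for v in graph[u]:
            if cond(v) and update(u, v):
                out.add(v)
    return out

def bfs(graph, src):
    """Breadth-first search expressed with edge_map; returns a parent mapping."""
    parents = {v: None for v in graph}
    parents[src] = src

    def update(u, v):
        parents[v] = u
        return True

    frontier = {src}
    while frontier:
        frontier = edge_map(graph, frontier, update, cond=lambda v: parents[v] is None)
    return parents

# Tiny undirected example graph given as adjacency lists.
g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs(g, 0))
```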
ISBN:
(Print) 9781450357999
There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the work that does use the Hyperlink graph relies on distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallel algorithms for 13 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We provide a publicly available benchmark suite containing our implementations.
Graph coloring is widely used to parallelize scientific applications by identifying subsets of independent tasks that can be executed simultaneously. Graph coloring assigns colors to the vertices of a graph such that no adjacent vertices share the same color. The number of colors used corresponds to the number of parallel steps in a real-world end-application, so the total runtime of the graph coloring kernel adds to the overall parallel overhead of that end-application, while the number of vertices in each color class determines the number of independent concurrent tasks in each parallel step, and thus affects the amount of parallelism and hardware resource utilization during execution. In this work, we propose a high-performance graph coloring algorithm, named ColorTM, that leverages Hardware Transactional Memory (HTM) to detect coloring inconsistencies between adjacent vertices. ColorTM detects and resolves coloring inconsistencies between adjacent vertices with an eager approach to minimize data access costs, and implements a speculative synchronization scheme to minimize synchronization costs and increase parallelism. We extend our proposed algorithmic design into a balanced graph coloring algorithm, named BalColorTM, in which all color classes include almost the same number of vertices, achieving high parallelism and resource utilization in the execution of real-world end-applications. We evaluate ColorTM and BalColorTM using a wide variety of large real-world graphs with diverse characteristics. ColorTM and BalColorTM improve performance by 12.98x and 1.78x on average using 56 parallel threads compared to prior state-of-the-art approaches. Moreover, we study the impact of our proposed graph coloring algorithmic designs on a popular end-application, i.e., Community Detection, and demonstrate that ColorTM and BalColorTM can provide high performance improvements in real-world end-applications.
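The abstract above describes a detect-and-resolve coloring scheme. As a rough, sequential illustration of that general pattern (without HTM and without ColorTM's eager resolution), here is a Python sketch of speculative greedy coloring followed by a conflict-resolution pass; all names and the example graph are illustrative, not the paper's implementation.

```python
# Illustrative sketch of speculative greedy coloring with conflict detection and
# resolution (the pattern ColorTM accelerates with hardware transactional memory).
# This is a plain iterative rendering, not the HTM-based implementation.

def smallest_free_color(v, graph, color):
    used = {color[u] for u in graph[v] if color[u] is not None}
    c = 0
    while c in used:
        c += 1
    return c

def speculative_coloring(graph):
    color = {v: None for v in graph}
    worklist = list(graph)          # vertices still needing a final color
    while worklist:
        # "Speculative" phase: color every worklist vertex against a snapshot,
        # mimicking threads that read possibly stale neighbor colors in parallel.
        snapshot = dict(color)
        for v in worklist:
            color[v] = smallest_free_color(v, graph, snapshot)
        # Detect-and-resolve phase: for each conflicting edge, keep the smaller
        # endpoint's color and re-insert the larger endpoint for re-coloring.
        conflicts = set()
        for v in worklist:
            for u in graph[v]:
                if color[u] == color[v] and u < v:
                    conflicts.add(v)
        worklist = list(conflicts)
    return color

g = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(speculative_coloring(g))
```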
Influence maximization, the problem of identifying a subset of k influential seeds (vertices) in a network, is a classical problem in network science with numerous applications. The problem is NP-hard, but there exist efficient polynomial-time approximations. However, scaling these algorithms remains a daunting task due to the complexities associated with steps involving stochastic sampling and large-scale aggregations. In this paper, we present a new parallel distributed approximation algorithm for influence maximization with provable approximation guarantees. Our approach, which we call GreediRIS, leverages the RANDGREEDI framework, a state-of-the-art approach for distributed submodular optimization, to solve the step that computes a maximum k-cover. GreediRIS combines distributed and streaming models of computation, along with pruning techniques, to effectively address the communication bottlenecks of the algorithm. Experimental results on up to 512 nodes (32K cores) of the NERSC Perlmutter supercomputer show that GreediRIS achieves good strong scaling performance, preserves solution quality, and significantly outperforms other state-of-the-art distributed implementations. For instance, on 512 nodes, the most performant variant of GreediRIS achieves geometric mean speedups of 28.99x and 36.35x for two different diffusion models over a state-of-the-art parallel implementation. We also present a communication-optimized version of GreediRIS that further improves the speedups by two orders of magnitude.
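For context, the sketch below shows, in plain sequential Python, the two stages that GreediRIS distributes: sampling reverse-reachable (RR) sets and greedily solving the max k-cover over them. The independent-cascade sampler, the parameter values, and the tiny graph are assumptions made for illustration, not the paper's implementation.

```python
import random

# Sequential sketch of the RIS (reverse-influence-sampling) pipeline whose max
# k-cover step GreediRIS distributes via RandGreedi. Illustrative only.

def sample_rr_set(graph, p):
    """One RR set under an independent-cascade-style model on an undirected graph:
    a reverse BFS from a random vertex where each edge is kept with probability p."""
    start = random.choice(list(graph))
    visited = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for u in graph[v]:
            if u not in visited and random.random() < p:
                visited.add(u)
                stack.append(u)
    return visited

def greedy_max_cover(rr_sets, k):
    """Pick k seeds covering the most RR sets (the aggregation step discussed above)."""
    uncovered = set(range(len(rr_sets)))
    seeds = []
    for _ in range(k):
        counts = {}
        for i in uncovered:
            for v in rr_sets[i]:
                counts[v] = counts.get(v, 0) + 1
        if not counts:
            break
        best = max(counts, key=counts.get)
        seeds.append(best)
        uncovered = {i for i in uncovered if best not in rr_sets[i]}
    return seeds

g = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
rr = [sample_rr_set(g, p=0.3) for _ in range(200)]
print(greedy_max_cover(rr, k=2))
```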
We study the problem of approximating the distances in an undirected weighted graph G by the distances in trees, based on the notion of stretch. Focusing on decentralized models of computation such as the CONGEST, PRAM, and semi-streaming models, our main results are as follows: (1) We develop a simple randomized algorithm that constructs a spanning tree such that the expected stretch of every edge is O(log^3 n), where n is the number of nodes in G. If G is unweighted, then this algorithm can be implemented to run in O(hop(G)) rounds in the CONGEST model, where hop(G) is the hop-diameter of G; thus our algorithm is asymptotically optimal in this case. In the weighted case, the running time of the algorithm matches the currently best known bound for exact single-source shortest path (SSSP) computations, which despite recent progress is still separated from the lower bound of Ω(√n + hop(G)) by polynomial factors. A naive attempt to replace exact SSSP computations with approximate ones in order to improve the complexity in the weighted case encounters a fundamental challenge, as the underlying decomposition technique fails to work under distance approximation. (2) We overcome this obstacle by developing a technique termed blurry ball growing. This technique, in combination with a clever algorithmic idea of Miller, Peng, and Xu (SPAA 2013), allows us to obtain low-diameter graph decompositions with small edge-cutting probabilities based solely on approximate SSSP computations. (3) Using these decompositions, we in turn obtain metric tree embedding algorithms in the vein of the celebrated work of Bartal (FOCS 1996), whose computational complexity is optimal up to polylogarithmic factors not only in the CONGEST model but also in the PRAM and semi-streaming models. Our embeddings have the additional useful property that the tree can be mapped back to the original graph such that each edge is "used" only logarithmically many times.
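To illustrate the style of decomposition underlying these results, here is a small sequential Python sketch of an exponential-shift clustering in the spirit of Miller, Peng, and Xu, using exact Dijkstra; the blurry-ball-growing refinement that tolerates approximate SSSP is not shown, and all names, parameters, and the example graph are illustrative.

```python
import heapq
import math
import random

# Sequential sketch of exponential-shift clustering (Miller-Peng-Xu style):
# every vertex v draws a shift delta_v ~ Exp(beta), and each vertex joins the
# cluster of the center u minimizing dist(v, u) - delta_u. Illustrative only.

def mpx_decomposition(graph, beta, seed=0):
    rng = random.Random(seed)
    shift = {v: rng.expovariate(beta) for v in graph}
    cluster = {v: None for v in graph}
    pq = []
    for v in graph:
        # Starting vertex v at "time" -shift[v] simulates the shifted ball growing.
        heapq.heappush(pq, (-shift[v], v, v))
    while pq:
        d, v, center = heapq.heappop(pq)     # multi-source Dijkstra with shifted starts
        if cluster[v] is not None:
            continue
        cluster[v] = center
        for u, w in graph[v]:
            if cluster[u] is None:
                heapq.heappush(pq, (d + w, u, center))
    return cluster

# Weighted undirected graph as {vertex: [(neighbor, weight), ...]}.
g = {0: [(1, 1.0), (2, 4.0)], 1: [(0, 1.0), (2, 1.0)], 2: [(0, 4.0), (1, 1.0)], 3: []}
print(mpx_decomposition(g, beta=0.5))
```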
ISBN:
(Print) 9783031785405; 9783031785412
Community detection is the problem of finding naturally forming clusters in networks. It is an important problem in mining and analyzing social and other complex networks. Community detection can be used to analyze complex systems in the real world and has applications in many areas, including network science, data mining, and computational biology. Label propagation is a community detection method that is simpler and faster than other methods such as Louvain, InfoMap, and spectral-based approaches. Some real-world networks can be very large and have billions of nodes and edges. Sequential algorithms might not be suitable for dealing with such large networks. This paper presents distributed-memory and hybrid parallel community detection algorithms based on the label propagation method. We incorporated novel optimizations and communication schemes, leading to very efficient and scalable algorithms. We also discuss various load-balancing schemes and present their comparative performances. These algorithms have been implemented and evaluated using large high-performance computing systems. Our hybrid algorithm is scalable to thousands of processors and has the capability to process massive networks. This algorithm was able to detect communities in the Metaclust50 network, a massive network with 282 million nodes and 42 billion edges, in 654 s using 4096 processors.
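As a reference point for the kernel being parallelized, the following is a minimal sequential Python sketch of label propagation; the tie-breaking rule, iteration cap, and tiny example graph are illustrative choices, not the paper's distributed-memory implementation.

```python
import random
from collections import Counter

# Sequential sketch of label propagation: each vertex repeatedly adopts the most
# frequent label among its neighbors; the current label is kept on ties.

def label_propagation(graph, max_iters=100, seed=0):
    rng = random.Random(seed)
    labels = {v: v for v in graph}          # every vertex starts in its own community
    order = list(graph)
    for _ in range(max_iters):
        rng.shuffle(order)
        changed = False
        for v in order:
            if not graph[v]:
                continue
            counts = Counter(labels[u] for u in graph[v])
            best_count = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best_count]
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)
                changed = True
        if not changed:                      # converged: no label moved this sweep
            break
    return labels

g = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3]}
print(label_propagation(g))
```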
Cycles are one of the fundamental subgraph patterns, and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisation of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We theoretically show that our resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms that exploit edge- or vertex-level parallelism are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, one order of magnitude speedup compared to the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms increases as we use more CPU cores.
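For orientation, the sketch below enumerates the simple cycles of a small directed graph with an anchored DFS so that each cycle is reported exactly once from its smallest vertex. Johnson's and Read-Tarjan's algorithms, the ones parallelized in the paper, add the blocking and pruning machinery that makes enumeration output-sensitive; that machinery is omitted here, and the example graph is illustrative.

```python
# Simple-cycle enumeration by anchored DFS: every cycle is reported exactly once,
# starting from its smallest vertex. Not output-sensitive like Johnson/Read-Tarjan.

def simple_cycles(graph):
    """Enumerate simple cycles of a directed graph given as {v: [successors]}."""
    cycles = []

    def dfs(start, v, path, on_path):
        for u in graph.get(v, []):
            if u == start:
                cycles.append(path + [start])       # closed a cycle back to the anchor
            elif u > start and u not in on_path:    # only visit vertices above the anchor
                on_path.add(u)
                dfs(start, u, path + [u], on_path)
                on_path.remove(u)

    for start in sorted(graph):
        dfs(start, start, [start], {start})
    return cycles

g = {0: [1], 1: [2], 2: [0, 3], 3: [1]}
print(simple_cycles(g))   # [[0, 1, 2, 0], [1, 2, 3, 1]]
```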
ISBN:
(Print) 9798400706103
Probabilistic breadth-first traversals (BPTs) are used in many network science and graph machine learning applications. In this paper, we are motivated by the application of BPTs in stochastic diffusion-based graph problems such as influence maximization. These applications heavily rely on BPTs to implement a Monte-Carlo sampling step for their approximations. Given the large sampling complexity, the stochasticity of the diffusion process, and the inherent irregularity in real-world graph topologies, efficiently parallelizing these BPTs remains significantly challenging. In this paper, we present a new algorithm to fuse a massive number of concurrently executing BPTs with random starts on the input graph. Our algorithm is designed to fuse BPTs by combining separate probabilistic traversals into a unified frontier. To show the general applicability of the fused BPT technique, we have incorporated it into two state-of-the-art influence maximization parallel implementations (gIM and Ripples). Our experiments on up to 4K nodes of the OLCF Frontier supercomputer (32,768 GPUs and 196K CPU cores) show strong scaling behavior, and that fused BPTs can improve the performance of these implementations by up to 182.13x (avg. 75.15x) and 359.86x (avg. 135.17x) for gIM and Ripples, respectively.
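The following Python sketch illustrates the fusion idea described above: many probabilistic traversals share a single frontier, with a per-vertex bitmask recording which samples have reached it. The per-edge coin-flip diffusion, all names, and the tiny graph are assumptions for illustration; the GPU data layout used by gIM and Ripples is not reproduced.

```python
import random

# Fuse several probabilistic BFS traversals (BPTs) into one unified frontier by
# tracking, per vertex, a bitmask of which samples have reached it. Illustrative only.

def fused_bpts(graph, sources, p, seed=0):
    """sources[i] is the start vertex of sample i; returns reached[v] as a bitmask."""
    rng = random.Random(seed)
    reached = {v: 0 for v in graph}
    frontier = {}
    for i, s in enumerate(sources):
        reached[s] |= 1 << i
        frontier[s] = frontier.get(s, 0) | (1 << i)
    while frontier:
        next_frontier = {}
        for v, mask in frontier.items():
            for u in graph[v]:
                # Flip one coin per (edge, sample); keep only samples that are new at u.
                survive = 0
                for i in range(len(sources)):
                    if (mask >> i) & 1 and rng.random() < p:
                        survive |= 1 << i
                new = survive & ~reached[u]
                if new:
                    reached[u] |= new
                    next_frontier[u] = next_frontier.get(u, 0) | new
        frontier = next_frontier
    return reached

g = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(fused_bpts(g, sources=[0, 3, 1], p=0.5))
```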
ISBN:
(Print) 9798400704161
Dynamic trees are a well-studied and fundamental building block of dynamic graph algorithms dating back to the seminal work of Sleator and Tarjan [STOC'81, (1981), pp. 114-122]. The problem is to maintain a tree subject to online edge insertions and deletions while answering queries about the tree, such as the heaviest weight on a path. In the parallel batch-dynamic setting, the goal is to process batches of edge updates work-efficiently in low (polylog n) span. Two work-efficient algorithms are known: batch-parallel Euler Tour Trees by Tseng et al. [ALENEX'19, (2019), pp. 92-106] and parallel Rake-Compress (RC) Trees by Acar et al. [ESA'20, (2020), pp. 2:1-2:23]. Both, however, are randomized and work-efficient only in expectation. Several downstream results that use these data structures (and indeed, to the best of our knowledge, all known work-efficient parallel batch-dynamic graph algorithms) are therefore also randomized. In this work, we give the first deterministic work-efficient solution to the problem. Our algorithm maintains a parallel RC-Tree on n vertices subject to batches of k edge updates deterministically in worst-case O(k log(1 + n/k)) work and O(log n log log k) span on the Common-CRCW PRAM. We also show how to improve the span of the randomized algorithm from O(log n log* n) to O(log n). Lastly, as a result of our new deterministic algorithm, we also derandomize several downstream results that make use of parallel batch-dynamic trees, for which previously the only efficient solutions were randomized.
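To pin down the interface being maintained, here is a deliberately naive single-threaded Python reference for a batch-dynamic forest (batched links and cuts plus connectivity queries). It recomputes connectivity by graph search, so it is neither work-efficient nor low-span; it stands in only for the API that RC-trees or Euler tour trees implement efficiently, and all names are illustrative.

```python
# Naive reference for the batch-dynamic trees interface: batches of edge links and
# cuts on a forest, plus connectivity queries. Not work-efficient; illustration only.

class NaiveBatchDynamicForest:
    def __init__(self, n):
        self.adj = {v: set() for v in range(n)}

    def batch_link(self, edges):
        for u, v in edges:
            self.adj[u].add(v)
            self.adj[v].add(u)

    def batch_cut(self, edges):
        for u, v in edges:
            self.adj[u].discard(v)
            self.adj[v].discard(u)

    def connected(self, u, v):
        # DFS from u; a real dynamic-trees structure answers this in O(log n).
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for y in self.adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

f = NaiveBatchDynamicForest(5)
f.batch_link([(0, 1), (1, 2), (3, 4)])
f.batch_cut([(1, 2)])
print(f.connected(0, 2), f.connected(3, 4))   # False True
```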
We develop a distributed-memory parallel algorithm for performing batch updates on streaming graphs, where vertices and edges are continuously added or removed. Our algorithm leverages distributed sparse matrices as t...
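Although the abstract above is truncated, the batch-as-sparse-matrix formulation it refers to can be sketched on a single node with SciPy: edges live in a sparse adjacency matrix and a whole batch of insertions and deletions is applied as one sparse addition. The distributed partitioning of the matrix across ranks is not shown, and the example data is illustrative.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

# Single-node sketch of applying a batch of streaming edge updates to a graph stored
# as a sparse adjacency matrix; the paper distributes this across memory nodes.

n = 5
A = csr_matrix((np.ones(3), ([0, 1, 2], [1, 2, 3])), shape=(n, n))  # existing edges

# One batch of updates: +1 inserts an edge, -1 deletes one.
rows, cols, vals = [0, 3, 0], [4, 4, 1], [1, 1, -1]
delta = coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()

A = A + delta          # one sparse addition applies the whole batch
A.eliminate_zeros()    # drop deleted edges from the stored structure
print(sorted(zip(*A.nonzero())))
```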