There has been significant recent interest in parallelgraph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memor...
详细信息
There has been significant recent interest in parallelgraph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature report results on much smaller graphs, and the ones for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallelalgorithms for 20 important graph problems. We also present the interfaces, optimizations, and graph processing techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly-available as the graph Based Benchmark Suite (GBBS).
There has been significant recent interest in parallelgraph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memor...
详细信息
ISBN:
(纸本)9781450357999
There has been significant recent interest in parallelgraph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature report results on much smaller graphs, and the ones for the Hyperlink graph use distributed or external memory. Therefore, it is natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically-efficient parallel graph algorithms can scale to the largest publicly-available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically-efficient parallelalgorithms for 13 important graph problems. We also present the optimizations and techniques that we used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems that we consider, this is the first time they have been solved on graphs at this scale. We provide a publicly-available benchmark suite containing our implementations.
In this paper,a sequential algorithm computing the all vertex pair distance matrix D and the path matrix Pis *** a PRAM EREW model with p,1≤p≤n^2,processors,a parallel version of the sequential algorithm is *** meth...
详细信息
In this paper,a sequential algorithm computing the all vertex pair distance matrix D and the path matrix Pis *** a PRAM EREW model with p,1≤p≤n^2,processors,a parallel version of the sequential algorithm is *** method can also be used to get a parallel algorithm to compute transitive closure arrayof an undirected *** time complexify of the parallel algorithm is O(n^3/p).If D,P andare known,it is shown that the problems to find all connected components, to compute the diameter of an undirected graph,to determine the center of a directed graph and to search for a directed cycle with the minimum(maximum)length in a directed graph can all be solved in O(n^2/p^+ logp)time.
This paper describes the process used to extend the Boost graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graphalgorithms and supporting data structures, ...
详细信息
This paper describes the process used to extend the Boost graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graphalgorithms and supporting data structures, but it was not originally designed with parallelism in mind. In this paper, we revisit the abstractions comprising the BGL in the context of distributed-memory parallelism, lifting away the implicit requirements of sequential execution and a single shared address space. We illustrate our approach by describing the process as applied to one of the core algorithms in the BGL, breadth-first search. The result is a generic algorithm that is unchanged from the sequential algorithm, requiring only the introduction of external (distributed) data structures for parallel execution. More importantly, the generic implementation retains its interface and semantics, such that other distributed algorithms can be built upon it, just as algorithms are layered in the sequential case. By characterizing these extensions as well as the extension process, we develop general principles and patterns for using (and reusing) generic, object-oriented parallel software libraries. We demonstrate that the resulting algorithm implementations are both efficient and scalable with performance results for several algorithms.
Community detection is the problem of finding naturally forming clusters in networks. It is an important problem in mining and analyzing social and other complex networks. Community detection can be used to analyze co...
详细信息
ISBN:
(纸本)9783031785405;9783031785412
Community detection is the problem of finding naturally forming clusters in networks. It is an important problem in mining and analyzing social and other complex networks. Community detection can be used to analyze complex systems in the real world and has applications in many areas, including network science, data mining, and computational biology. Label propagation is a community detection method that is simpler and faster than other methods such as Louvain, InfoMap, and spectral-based approaches. Some real-world networks can be very large and have billions of nodes and edges. Sequential algorithms might not be suitable for dealing with such large networks. This paper presents distributed-memory and hybrid parallel community detection algorithms based on the label propagation method. We incorporated novel optimizations and communication schemes, leading to very efficient and scalable algorithms. We also discuss various load-balancing schemes and present their comparative performances. These algorithms have been implemented and evaluated using large high-performance computing systems. Our hybrid algorithm is scalable to thousands of processors and has the capability to process massive networks. This algorithm was able to detect communities in the Metaclust50 network, a massive network with 282 million nodes and 42 billion edges, in 654 s using 4096 processors.
This paper describes the process used to extend the Boost graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graphalgorithms and supporting data structures, ...
详细信息
This paper describes the process used to extend the Boost graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graphalgorithms and supporting data structures, but it was not originally designed with parallelism in mind. In this paper, we revisit the abstractions comprising the BGL in the context of distributed-memory parallelism, lifting away the implicit requirements of sequential execution and a single shared address space. We illustrate our approach by describing the process as applied to one of the core algorithms in the BGL, breadth-first search. The result is a generic algorithm that is unchanged from the sequential algorithm, requiring only the introduction of external (distributed) data structures for parallel execution. More importantly, the generic implementation retains its interface and semantics, such that other distributed algorithms can be built upon it, just as algorithms are layered in the sequential case. By characterizing these extensions as well as the extension process, we develop general principles and patterns for using (and reusing) generic, object-oriented parallel software libraries. We demonstrate that the resulting algorithm implementations are both efficient and scalable with performance results for several algorithms.
This paper analyzes what structural features of graph problems allow efficient parallelalgorithms. We survey some parallelalgorithms for typical problems on three kinds of graphs, outerplanar graphs, trapezoid graph...
详细信息
This paper analyzes what structural features of graph problems allow efficient parallelalgorithms. We survey some parallelalgorithms for typical problems on three kinds of graphs, outerplanar graphs, trapezoid graphs and in-tournament graphs. Our results on the shortest path problem, the longest path problem and the maximum flow problem on outerplanar graphs, the minimum-weight connected dominating set problem and the coloring problem on trapezoid graphs and Hamiltonian path and Hamiltonian cycle problem on in-tournament graphs are adopted as working examples.
This paper describes the process used to extend the Boost graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graphalgorithms and supporting data structures, ...
详细信息
ISBN:
(纸本)9781595930316
This paper describes the process used to extend the Boost graph Library (BGL) for parallel operation with distributed memory. The BGL consists of a rich set of generic graphalgorithms and supporting data structures, but it was not originally designed with parallelism in mind. In this paper, we revisit the abstractions comprising the BGL in the context of distributed-memory parallelism, lifting away the implicit requirements of sequential execution and a single shared address space. We illustrate our approach by describing the process as applied to one of the core algorithms in the BGL, breadth-first search. The result is a generic algorithm that is unchanged from the sequential algorithm, requiring only the introduction of external (distributed) data structures for parallel execution. More importantly, the generic implementation retains its interface and semantics, such that other distributed algorithms can be built upon it, just as algorithms are layered in the sequential case. By characterizing these extensions as well as the extension process, we develop general principles and patterns for using (and reusing) generic, object-oriented parallel software libraries. We demonstrate that the resulting algorithm implementations are both efficient and scalable with performance results for several algorithms.
Influence maximization-the problem of identifying a subset of k influential seeds (vertices) in a network- is a classical problem in network science with numerous applications. The problem is NP-hard, but there exist ...
详细信息
Influence maximization-the problem of identifying a subset of k influential seeds (vertices) in a network- is a classical problem in network science with numerous applications. The problem is NP-hard, but there exist efficient polynomial time approximations. However, scaling these algorithms still remain a daunting task due to the complexities associated with steps involving stochastic sampling and large-scale aggregations. In this paper, we present a new parallel distributed approximation algorithm for influence maximization with provable approximation guarantees. Our approach, which we call GreediRIS, leverages the RANDGREEDI framework-a state-of-the-art approach for distributed submodular optimization-for solving a step that computes a maximum k cover. GreediRIS combines distributed and streaming models of computations, along with pruning techniques, to effectively address the communication bottlenecks of the algorithm. Experimental results on up to 512 nodes (32K cores) of the NERSC Perlmutter supercomputer show that GreediRIS can achieve good strong scaling performance, preserve quality, and significantly outperform the other state-of-theart distributed implementations. For instance, on 512 nodes, the most performant variant of GreediRIS achieves geometric mean speedups of 28.99x and 36.35x for two different diffusion models, over a state-of-the-art parallel implementation. We also present a communication-optimized version of GreediRIS that further improves the speedups by two orders of magnitude.
graph coloring is widely used to parallelize scientific applications by identifying subsets of independent tasks that can be executed simultaneously. graph coloring assigns colors the vertices of a graph, such that no...
详细信息
graph coloring is widely used to parallelize scientific applications by identifying subsets of independent tasks that can be executed simultaneously. graph coloring assigns colors the vertices of a graph, such that no adjacent vertices have the same color. The number of colors used corresponds to the number of parallel steps in a real-world end-application. Therefore, the total runtime of the graph coloring kernel adds to the overall parallel overhead of the real-world end-application, whereas the number of the vertices of each color class determines the number of the independent concurrent tasks of each parallel step, thus affecting the amount of parallelism and hardware resource utilization in the execution of the real-world end-application. In this work, we propose a high-performance graph coloring algorithm, named ColorTM, that leverages Hardware Transactional Memory (HTM) to detect coloring inconsistencies between adjacent vertices. ColorTM detects and resolves coloring inconsistencies between adjacent vertices with an eager approach to minimize data access costs, and implements a speculative synchronization scheme to minimize synchronization costs and increase parallelism. We extend our proposed algorithmic design to propose a balanced graph coloring algorithm, named BalColorTM, with which all color classes include almost the same number of vertices to achieve high parallelism and resource utilization in the execution of the real-world endapplications. We evaluate ColorTM and BalColorTM using a wide variety of large real-world graphs with diverse characteristics. ColorTM and BalColorTM improve performance by 12.98x and 1.78x on average using 56 parallel threads compared to prior state-of-the-art approaches. Moreover, we study the impact of our proposed graph coloring algorithmic designs on a popular end-application, i.e., Community Detection, and demonstrate the ColorTM and BalColorTM can provide high-performance improvements in real-world end-applications acr
暂无评论