The notion of networks is inherent in the structure, function and behavior of the natural and engineered world that surround us. Consequently, graph models and methods have assumed a prominent role to play in this mod...
详细信息
ISBN:
(纸本)9783981926323
The notion of networks is inherent in the structure, function and behavior of the natural and engineered world that surround us. Consequently, graph models and methods have assumed a prominent role to play in this modern era of Big Data, and are taking a center stage in the discovery pipelines of various data-driven scientific domains. In this paper, we present a brief review of the state-of-the-art in parallelgraph analytics, particularly focusing on iterative graphalgorithms and their implementation on modern day multicore/manycore architectures. The class of iterative graphalgorithms covers a broad class of graph operations of varying complexities, from simpler routines such as Breadth-First Search (BFS), to polynomially-solvable problems such as shortest path computations, to NP-Hard problems such as community detection and graph coloring. We cover a set of common algorithmic abstractions used in implementing such iterative graphalgorithms, state the challenges around parallelization on contemporary parallel platforms (including commodity multicores and emerging manycore platforms), and describe a set of approaches that have led to efficient implementations. We also report on advances in manycore architectural frameworks that have found application in parallelgraph analytics. We conclude the paper identifying potential research directions, opportunities, and challenges that lay ahead in the path toward enabling graph analytics at exascale.
Community detection has become a fundamental operation in numerous graph-theoretic applications. It is used to reveal natural divisions that exist within real world networks without imposing prior size or cardinality ...
详细信息
ISBN:
(纸本)9781479941162
Community detection has become a fundamental operation in numerous graph-theoretic applications. It is used to reveal natural divisions that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed by Blondel et al. in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains (e.g., internet, citation, biological). Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number of iterations, while providing real speedups of up to 8x using 32 threads. In addition, our parallel implementation was able to exhibit weak scaling properties on up to 32 threads.
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, u...
详细信息
ISBN:
(纸本)9781450301787
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-level transport layers. However, most active message frameworks have low-level programming interfaces that require significant programming effort to use directly in applications and that also prevent optimization opportunities. In this paper we present AM++, a new user-level library for active messages based on generic programming techniques. Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures. Runtime optimizations, such as message combining and filtering, are also provided by the library, removing the need to implement that functionality at the application level. Evaluation of AM++ with distributed-memory graphalgorithms shows the usability benefits provided by these library features, as well as their performance advantages.
In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many import...
详细信息
ISBN:
(纸本)9781479959754
In an era when power constraints and data movement are proving to be significant barriers for the application of high-end computing, the Tilera many-core architecture offers a low-power platform exhibiting many important characteristics of future systems, including a large number of simple cores, a sophisticated network-on-chip, and fine-grained control over memory and caching policies. While this emerging architecture has been previously studied for structured compute-intensive kernels, benchmarking the platform for data-bound, irregular applications present significant challenges that have remained unexplored. Community detection is an advanced prototypical graph-theoretic operation with applications in numerous scientific domains including life sciences, cyber security, and power systems. In this work, we explore multiple design strategies toward developing a scalable tool for community detection on the Tilera platform. Using several memory layout and work scheduling techniques we demonstrate speedups of up to 47x on 36 cores of the Tilera TileGX36 platform over the best serial implementation, and also show results that have comparable quality and performance to mainstream x86 platforms. To the best of our knowledge this is the first work addressing graphalgorithms on the Tilera platform. This study demonstrates that through careful design space exploration, low-power many-core platforms like Tilera can be effectively exploited for graphalgorithms that embody all the essential characteristics of an irregular application.
The adoption of a programming language is positively influenced by the breadth of its software libraries. Chapel is a modern and relatively young parallel programming language. Consequently, not many domain-specific s...
详细信息
ISBN:
(纸本)9780769561493
The adoption of a programming language is positively influenced by the breadth of its software libraries. Chapel is a modern and relatively young parallel programming language. Consequently, not many domain-specific software libraries exists that are written for Chapel. graph processing is an important domain with many applications in cyber security, energy, social networking, and health. Implementing graphalgorithms in the language of linear algebra enables many advantages including rapid development, flexibility, high-performance, and scalability. graphBLAS initiative aims to standardize an interface for linear-algebraic primitives for graph computations. This paper presents initial experiences and findings of implementing a subset of important graphBLAS operations in Chapel. We analyzed the bottlenecks in both shared and distributed memory. We also provided alternative implementations whenever the default implementation lacked performance or scaling.
We introduce NetworKit, an open-source software package for analyzing the structure of large complex networks. Appropriate algorithmic solutions are required to handle increasingly common large graph data sets contain...
详细信息
We introduce NetworKit, an open-source software package for analyzing the structure of large complex networks. Appropriate algorithmic solutions are required to handle increasingly common large graph data sets containing up to billions of connections. We describe the methodology applied to develop scalable solutions to network analysis problems, including techniques like parallelization, heuristics for computationally expensive problems, efficient data structures, and modular software architecture. Our goal for the software is to package results of our algorithm engineering efforts and put them into the hands of domain experts. NetworKit is implemented as a hybrid combining the kernels written in C++ with a Python frontend, enabling integration into the Python ecosystem of tested tools for data analysis and scientific computing. The package provides a wide range of functionality (including common and novel analytics algorithms and graph generators) and does so via a convenient interface. In an experimental comparison with related software, NetworKit shows the best performance on a range of typical analysis tasks.
Enumerating simple cycles has important applications in computational biology, network science, and financial crime analysis. In this work, we focus on parallelising the state-of-the-art simple cycle enumeration algor...
详细信息
ISBN:
(纸本)9781450391467
Enumerating simple cycles has important applications in computational biology, network science, and financial crime analysis. In this work, we focus on parallelising the state-of-the-art simple cycle enumeration algorithms by Johnson and Read-Tarjan along with their applications to temporal graphs. To our knowledge, we are the first ones to parallelise these two algorithms in a fine-grained manner. We are also the first to demonstrate experimentally a linear performance scaling. Such a scaling is made possible by our decomposition of long sequential searches into fine-grained tasks, which are then dynamically scheduled across CPU cores, enabling an optimal load balancing. Furthermore, we show that coarse-grained parallel versions of the Johnson and the Read-Tarjan algorithms that exploit edge- or vertex-level parallelism are not scalable. On a cluster of four multi-core CPUs with 256 physical cores, our fine-grained parallelalgorithms are, on average, an order of magnitude faster than their coarse-grained parallel counterparts. The performance gap between the fine-grained and the coarse-grained parallelalgorithms widens as we use more CPU cores. When using all 256 CPU cores, our parallelalgorithms enumerate temporal cycles, on average, 260x faster than the serial algorithm of Kumar and Calders. Code repository: https://***/IBM/parallel-cycle-enumeration
Today's graphs used in domains such as machine learning or social network analysis may contain hundreds of billions of edges. Yet, they are not necessarily stored efficiently, and standard graph representations su...
详细信息
ISBN:
(纸本)9781450359863
Today's graphs used in domains such as machine learning or social network analysis may contain hundreds of billions of edges. Yet, they are not necessarily stored efficiently, and standard graph representations such as adjacency lists waste a significant number of bits while graph compression schemes such as Webgraph often require time-consuming decompression. To address this, we propose Log(graph): a graph representation that combines high compression ratios with very low-overhead decompression to enable cheaper and faster graph processing. The key idea is to encode a graph so that the parts of the representation approach or match the respective storage lower bounds. We call our approach "graph logarithmization" because these bounds are usually logarithmic. Our high-performance Log(graph) implementation based on modern bitwise operations and state-of-the-art succinct data structures achieves high compression ratios as well as performance. For example, compared to the tuned graph Algorithm Processing Benchmark Suite (GAPBS), it reduces graph sizes by 20-35% while matching GAPBS' performance or even delivering speedups due to reducing amounts of transferred data. It approaches the compression ratio of the established Webgraph compression library while enabling speedups of up to more than 2x. Log(graph) can improve the design of various graph processing engines or libraries on single NUMA nodes as well as distributed-memory systems.
graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffer...
详细信息
ISBN:
(纸本)9781450301190
graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graphalgorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
The proliferation of data in graph form calls for the development of scalable graphalgorithms that exploit parallel processing environments. One such problem is the computation of a graph's minimum spanning fores...
详细信息
The proliferation of data in graph form calls for the development of scalable graphalgorithms that exploit parallel processing environments. One such problem is the computation of a graph's minimum spanning forest (MSF). Past research has proposed several parallelalgorithms for this problem, yet none of them scales to large, high-density graphs. In this paper we propose a novel, scalable, parallel MSF algorithm for undirected weighted graphs. Our algorithm leverages Prim's algorithm in a parallel fashion, concurrently expanding several subsets of the computed MSF. Our effort focuses on minimizing the communication among different processors without constraining the local growth of a processor's computed subtree. In effect, we achieve a scalability that previous approaches lacked. We implement our algorithm in CUDA, running on a GPU and study its performance using real and synthetic, sparse as well as dense, structured and unstructured graph data. Our experimental study demonstrates that our algorithm outperforms the previous state-of-the-art GPU-based MSF algorithm, while being several order of magnitude faster than sequential CPU-based algorithms.
暂无评论