There is an ever-increasing need to explore large-scale graph data sets in computational sciences, social networks, and business analytics. However, due to their irregular and memory-intensive nature, graph applications are notorious for their poor performance on parallel computer systems. In this paper we propose a new hybrid MPI/Pthreads breadth-first search (BFS) algorithm that features (i) overlapping computation and communication by separating them into multiple threads, (ii) maximizing multi-threading parallelism on multi-core processors with massive numbers of threads to improve throughput, and (iii) exploiting pipeline parallelism using lock-free queues for asynchronous communication. By comparing it with a traditional MPI-only BFS algorithm, we learned several valuable lessons that help to understand and exploit parallelism in graph traversal applications. Experiments show that our algorithm is 1.9x faster than the MPI-only version and can process 1.45 billion edges per second on a 32-node SMP cluster. At large scale, our algorithm is 1.49x faster than the MPI-only BFS algorithm in the Combinatorial BLAS library on 6,144 cores.
ISBN: 9780769549712 (print)
We present techniques to process large scale-free graphs in distributed memory. Our aim is to scale to trillions of edges, and our research is targeted at leadership-class supercomputers and clusters with local non-volatile memory, e.g., NAND Flash. We apply an edge list partitioning technique designed to accommodate high-degree vertices (hubs), which create scaling challenges when processing scale-free graphs. In addition to partitioning hubs, we use ghost vertices to represent the hubs and so reduce communication hotspots. We present a scaling study with three important graph algorithms: Breadth-First Search (BFS), K-Core decomposition, and Triangle Counting. We also demonstrate scalability on BG/P Intrepid by comparing to the best known Graph500 results [1]. We show results on two clusters with local NVRAM storage that are capable of traversing trillion-edge scale-free graphs. By leveraging node-local NAND Flash, our approach can process thirty-two times larger datasets with only a 39% performance degradation in Traversed Edges Per Second (TEPS).
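For readers unfamiliar with K-Core decomposition, one of the three algorithms named above, the following is a minimal sequential sketch of the textbook peeling procedure. It is not the distributed, edge-list-partitioned implementation the abstract describes; the function name `core_numbers` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected graph given as (u, v) pairs.

    Illustrative sequential peeling only; the name and input format are
    assumptions, not taken from the paper.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        # The core number never decreases as peeling proceeds.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core
```

The linear scan for the minimum keeps the sketch short at the cost of quadratic time; a bucket queue would be the usual choice for large graphs.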
ISBN: 9781467347976 (print)
Agent-based simulation has become a key technique for modeling and simulating dynamic, complicated behaviors in social and behavioral sciences. As these simulations become more complex, they generate an increasingly large amount of data. Lacking the appropriate tools and support, social scientists find it difficult to interpret and analyze the results of these simulations. In this paper, we introduce the Aggregate Temporal Graph (ATG), a graph formulation that can be used to capture complex relationships between discrete simulation states in time. Using this formulation, we can assist social scientists in identifying critical simulation states by examining graph substructures. In particular, we define the concept of a Gateway and its inverse, a Terminal, which capture the relationships between pivotal states in the simulation and their inevitable outcomes. We propose two real-time computable algorithms to identify these relationships and provide a proof of correctness, complexity analysis, and empirical run-time analysis. We demonstrate the use of these algorithms on a large-scale social science simulation of political power and violence in present-day Thailand, and discuss broader applications of the ATG and associated algorithms in other domains such as analytic provenance.
ISBN: 9780769551098 (print)
Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) have been proposed to account for the importance of the nodes of a network. In this paper, we approach the problem of computing geometric centralities, such as closeness [1] and harmonic centrality [2], on very large graphs; traditionally this task requires an all-pairs shortest-path computation in the exact case, or a number of breadth-first traversals for approximated computations, but these techniques yield very weak statistical guarantees on highly disconnected graphs. We instead assume that the graph is accessed in a semi-streaming fashion, that is, that adjacency lists are scanned almost sequentially, and that a very small amount of memory (on the order of a dozen bytes) per node is available in core memory. We leverage the newly discovered algorithms based on HyperLogLog counters [3], making it possible to approximate a number of geometric centralities at very high speed and with high accuracy. While the application of similar algorithms for the approximation of closeness was attempted in the MapReduce [4] framework [5], our exploitation of HyperLogLog counters exponentially reduces the memory footprint, paving the way for in-core processing of networks with a hundred billion nodes using "just" 2 TiB of RAM. Moreover, the computations we describe are inherently parallelizable, and scale linearly with the number of available cores.
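As a point of reference for the exact case mentioned above, harmonic centrality can be computed with one breadth-first traversal per node. The sketch below does this for a small undirected graph; it is the expensive baseline, not the HyperLogLog-based approximation the abstract proposes. The function name `harmonic_centrality` and the adjacency-dictionary input are assumptions made for illustration.

```python
from collections import deque


def harmonic_centrality(adj):
    """Exact harmonic centrality for an undirected graph.

    adj maps each vertex to an iterable of its neighbours (hypothetical
    input format). The centrality of x is the sum of 1/d(x, y) over all
    y != x reachable from x; unreachable vertices contribute 0.
    """
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        total = 0.0
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    total += 1.0 / dist[w]
                    queue.append(w)
        result[source] = total
    return result
```

Because unreachable pairs simply add nothing to the sum, the measure stays well defined on disconnected graphs, which is the setting the abstract singles out as problematic for sampling-based estimates.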
ISBN: 9781479912926; 9781479912933 (print)
Breadth-first search (BFS) is one of the most important kernels in graph theory. The Graph500 benchmark measures the performance of a supercomputer performing a BFS in terms of traversed edges per second (TEPS). Previous studies have proposed hybrid approaches that combine a well-known top-down algorithm and an efficient bottom-up algorithm for large frontiers. This avoids some unnecessary searching of outgoing edges in the BFS traversal of a small-world graph, such as a Kronecker graph. In this paper, we describe a highly efficient BFS using column-wise partitioning of the adjacency list while carefully considering the non-uniform memory access (NUMA) architecture. We explicitly manage the way in which each working thread accesses a partial adjacency list in local memory during BFS traversal. Our implementation has achieved a processing rate of 11.15 billion edges per second on a 4-way Intel Xeon E5-4640 system for a scale-26 problem of a Kronecker graph with 2^26 vertices and 2^30 edges. Not all of the speedup techniques in this paper are limited to NUMA architecture systems. With our winning Green Graph500 submission of June 2013, we achieved 64.12 GTEPS per kilowatt hour on an ASUS Pad TF700T with an NVIDIA Tegra 3 mobile processor.
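The hybrid top-down/bottom-up idea referred to above can be sketched in a few lines for an undirected graph. This is a single-threaded illustration only, with no NUMA-aware partitioning; the function name `hybrid_bfs`, the `switch_fraction` parameter, and the simple frontier-size switching rule are assumptions, not the heuristic used in the paper or in earlier hybrid work.

```python
def hybrid_bfs(adj, source, switch_fraction=0.05):
    """Return a BFS parent map {vertex: parent} for an undirected graph.

    adj maps each vertex to an iterable of neighbours. The switching rule
    (frontier larger than switch_fraction * |V|) is a hypothetical
    simplification chosen for illustration.
    """
    n_vertices = len(adj)
    parent = {source: source}
    frontier = {source}
    while frontier:
        next_frontier = set()
        if len(frontier) > switch_fraction * n_vertices:
            # Bottom-up step: each unvisited vertex searches its neighbours
            # for any member of the frontier and stops at the first hit.
            for v in adj:
                if v not in parent:
                    for w in adj[v]:
                        if w in frontier:
                            parent[v] = w
                            next_frontier.add(v)
                            break
        else:
            # Top-down step: each frontier vertex scans its outgoing edges.
            for u in frontier:
                for w in adj[u]:
                    if w not in parent:
                        parent[w] = u
                        next_frontier.add(w)
        frontier = next_frontier
    return parent
```

The early `break` in the bottom-up step is where the saving comes from: once a vertex has found one parent in a large frontier, its remaining edges are never examined.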
ISBN: 9781450323789 (print)
Detecting strongly connected components (SCCs) in a directed graph is a fundamental graph analysis algorithm that is used in many science and engineering domains. Traditional approaches in parallel SCC detection, however, show limited performance and poor scaling behavior when applied to large real-world graph instances. In this paper, we investigate the shortcomings of the conventional approach and propose a series of extensions that consider the fundamental properties of real-world graphs, e.g. the small-world property. Our scalable implementation offers excellent performance on diverse, small-world graphs resulting in a 5.01x to 29.41x parallel speedup over the optimal sequential algorithm with 16 cores and 32 hardware threads.
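The abstract does not name the conventional approach it improves on. One widely used divide-and-conquer formulation for parallel SCC detection is forward-backward reachability, sketched below sequentially as an assumption about the kind of method meant, not as the authors' algorithm. The function names `reachable` and `fw_bw_scc` and the adjacency-dictionary input are hypothetical.

```python
def reachable(adj, start, allowed):
    """Vertices reachable from start using only vertices in allowed."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w in adj.get(u, ()):
            if w in allowed and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def fw_bw_scc(adj):
    """Return the SCCs of a directed graph as a list of vertex sets.

    adj maps each vertex to a list of successors (hypothetical format).
    """
    vertices = set(adj)
    for nbrs in adj.values():
        vertices.update(nbrs)
    # Reverse graph, used for the backward reachability search.
    radj = {v: [] for v in vertices}
    for u, nbrs in adj.items():
        for w in nbrs:
            radj[w].append(u)

    components = []
    work = [vertices]
    while work:
        part = work.pop()
        if not part:
            continue
        pivot = next(iter(part))
        fw = reachable(adj, pivot, part)
        bw = reachable(radj, pivot, part)
        scc = fw & bw  # the pivot's SCC is the intersection
        components.append(scc)
        # No SCC spans two of these sets, so they are independent
        # sub-problems that a parallel version could process concurrently.
        work.extend([fw - scc, bw - scc, part - fw - bw])
    return components
```

On small-world graphs a single giant SCC tends to absorb most vertices after the first pivot, leaving many tiny sub-problems, which is one plausible reason for the poor scaling on real-world instances that the abstract reports.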
ISBN: 9783642406423; 9783642406430 (print)
With the recent increasing popularity of social networking services like Facebook and Twitter, community structure has become a problem of considerable interest. Although there are more than a hundred algorithms that find communities in networks, only a few are able to detect overlapping communities, and an even smaller number of them follow an approach based on the evolution dynamics of these networks. Thus, we present FRINGE, an algorithm for the detection of overlapping communities in networks, which, based on the ideas of friendship and leadership, not only returns the overlapping communities detected, but also specifies their leading members. We describe the algorithm in detail and compare its results with those obtained by CFinder and iLCD for both synthetic and real-life networks. These results show that our proposal behaves well in networks with a clear social hierarchy, as seen in modern social networks.
Efficiently storing and processing massive graph data sets is a challenging problem as researchers seek to leverage "Big Data" to answer next-generation scientific questions. New techniques are required to process large scale-free graphs in shared, distributed, and external memory. This dissertation develops new techniques to parallelize the storage, computation, and communication for scale-free graphs with high-degree vertices. Our work facilitates the processing of large real-world graph datasets through the development of parallel algorithms and tools that scale to large computational and memory resources, overcoming challenges not addressed by existing techniques. Our aim is to scale to trillions of edges, and our research is targeted at leadership class supercomputers, clusters with local non-volatile memory, and shared memory systems. We present three novel techniques to address scaling challenges in processing large scale-free graphs. We apply an asynchronous graph traversal technique using prioritized visitor queues that is capable of tolerating data latencies to the external graph storage media and message passing communication. To accommodate large high-degree vertices, we present an edge list partitioning technique that evenly partitions graphs containing high-degree vertices. Finally, we propose a technique we call distributed delegates that distributes and parallelizes the storage, computation, and communication when processing high-degree vertices. The edges of high-degree vertices are distributed, providing additional opportunities for parallelism not present in existing methods. We apply our techniques to multiple graph algorithms: Breadth-First Search, Single Source Shortest Path, Connected Components, K-Core decomposition, Triangle Counting, and PageRank. Our experimental study of these algorithms demonstrates excellent scalability on supercomputers, clusters with non-volatile memory, and shared memory systems. Our study includes multi
ISBN: 9781450323789 (print)
Simulations of the critical Ising model by means of local update algorithms suffer from critical slowing down. One way to partially compensate for the influence of this phenomenon on the runtime of simulations is using increasingly faster and parallel computer hardware. Another approach is using algorithms that do not suffer from critical slowing down, such as cluster algorithms. This paper reports on the Swendsen-Wang multi-cluster algorithm on Intel Xeon Phi coprocessor 5110P, Nvidia Tesla M2090 GPU, and x86 multi-core CPU. We present shared memory versions of the said algorithm for the simulation of the two- and three-dimensional Ising model. We use a combination of local cluster search and global label reduction by means of atomic hardware primitives. Further, we describe an MPI version of the algorithm on Xeon Phi and CPU, respectively. Significant performance improvements over known implementations of the Swendsen-Wang algorithm are demonstrated.
A generalized split (k, l) partition is a vertex set partition into at most k independent sets and l cliques. We prove that the (2, 1) partitioned probe problem is in P whereas the (2, 2) partitioned probe is NP-compl...