In graph embedding, the connectivity information of a graph is used to represent each vertex as a point in a d-dimensional space. Unlike the original, irregular structural information, such a representation can be use...
详细信息
ISBN:
(纸本)9781450388160
In graph embedding, the connectivity information of a graph is used to represent each vertex as a point in a d-dimensional space. Unlike the original, irregular structural information, such a representation can be used for a multitude of machine learning tasks. Although the process is extremely useful in practice, it is indeed expensive and unfortunately, the graphs are becoming larger and harder to embed. Attempts at scaling up the process to larger graphs have been successful but often at a steep price in hardware requirements. We present Gosh, an approach for embedding graphs of arbitrary sizes on a single GPU with minimum constraints. Gosh utilizes a novel graph coarsening approach to compress the graph and minimize the work required for embedding, delivering high-quality embeddings at a fraction of the time compared to the state-of-the-art. In addition to this, it incorporates a decomposition schema that enables any arbitrarily large graph to be embedded using a single GPU with minimum constraints on the memory size. With these techniques, Gosh is able to embed a graph with over 65 million vertices and 1.8 billion edges in less than an hour on a single GPU and obtains a 93% AUCROC for link-prediction which can be increased to 95% by running the tool for 80 minutes.
Exploratory graph analytics helps maximize the informational value from a graph. However, increasing graph sizes makes it impossible for existing popular exploratory data analysis tools to handle dozens of terabytes o...
详细信息
ISBN:
(纸本)9781665423694
Exploratory graph analytics helps maximize the informational value from a graph. However, increasing graph sizes makes it impossible for existing popular exploratory data analysis tools to handle dozens of terabytes or even larger data sets in the memory of a common laptop/personal computer. Arkouda is a framework under early development that brings together the productivity of Python at the user-side with the high performance of Chapel at the server-side. In this paper, we present our initial work on overcoming the memory limit and high-performance computing coding roadblocks for high-level Python users to perform large graph analyses. Based on a simple and succinct graph data structure, a high-level Chapel-based graph algorithm, Breadth-First Search (BFS), is presented to show the scalable and parallelgraph algorithm development method in a productive way through Arkouda. The reverse Cuthill-McKee (RCM) algorithm is implemented in Chapel to relabel the vertices of a graph as a preprocessing step to improve the performance of BFS and one low-level BFS algorithm is also developed to compare with the performance of high-level method. Both synthetic graphs and typical graph benchmarks are used to evaluate the performance of the provided graphalgorithms. The experimental results show that, based on the proposed high-level algorithm framework, the performance of BFS can be improved significantly and easily by simply selecting suitable Chapel high-level data structures and parallel constructs. Our code is open source and available from GitHub (https://***/Bader-Research/arkouda).
There has been a growing interest in the graph-streaming setting where a continuous stream of graph updates is mixed with graph queries. In principle, purely-functional trees are an ideal fit for this setting as they ...
详细信息
ISBN:
(纸本)9781450367127
There has been a growing interest in the graph-streaming setting where a continuous stream of graph updates is mixed with graph queries. In principle, purely-functional trees are an ideal fit for this setting as they enable safe parallelism, lightweight snapshots, and strict serializability for queries. However, directly using them for graph processing leads to significant space overhead and poor cache locality. This paper presents C-trees, a compressed purely-functional search tree data structure that significantly improves on the space usage and locality of purely-functional trees. We design theoretically-efficient and practical algorithms for performing batch updates to C-trees, and also show that we can store massive dynamic real-world graphs using only a few bytes per edge, thereby achieving space usage close to that of the best static graph processing frameworks. To study the applicability of our data structure, we designed Aspen, a graph-streaming framework that extends the interface of Ligra with operations for updating graphs. We show that Aspen is faster than two state-of-the-art graph-streaming systems, Stinger and LLAMA, while requiring less memory, and is competitive in performance with the state-of-the-art static graph frameworks, Galois, GAP, and Ligra+. With Aspen, we are able to efficiently process the largest publicly-available graph with over two hundred billion edges in the graph-streaming setting using a single commodity multicore server with 1TB of memory.
Simple graphalgorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listin...
详细信息
ISBN:
(纸本)9781450385572
Simple graphalgorithms such as PageRank have been the target of numerous hardware accelerators. Yet, there also exist much more complex graph mining algorithms for problems such as clustering or maximal clique listing. These algorithms are memory-bound and thus could be accelerated by hardware techniques such as Processing-in-Memory (PIM). However, they also come with non-straightforward parallelism and complicated memory access patterns. In this work, we address this problem with a simple yet surprisingly powerful observation: operations on sets of vertices, such as intersection or union, form a large part of many complex graph mining algorithms, and can offer rich and simple parallelism at multiple levels. This observation drives our cross-layer design, in which we (1) expose set operations using a novel programming paradigm, (2) express and execute these operations efficiently with carefully designed set-centric ISA extensions called SISA, and (3) use PIM to accelerate SISA instructions. The key design idea is to alleviate the bandwidth needs of SISA instructions by mapping set operations to two types of PIM: in-DRAM bulk bitwise computing for bitvectors representing high-degree vertices, and near-memory logic layers for integer arrays representing low-degree vertices. Set-centric SISA-enhanced algorithms are efficient and outperform hand-tuned baselines, offering more than 10x speedup over the established Bron-Kerbosch algorithm for listing maximal cliques. We deliver more than 10 SISA set-centric algorithm formulations, illustrating SISA's wide applicability.
Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose...
详细信息
ISBN:
(纸本)9781450361378
Developing high-performance and energy-efficient algorithms for maximum matchings is becoming increasingly important in social network analysis, computational sciences, scheduling, and others. In this work, we propose the first maximum matching algorithm designed for FPGAs;it is energy-efficient and has provable guarantees on accuracy, performance, and storage utilization. To achieve this, we forego popular graph processing paradigms, such as vertex-centric programming, that often entail large communication costs. Instead, we propose a substream-centric approach, in which the input stream of data is divided into substreams processed independently to enable more parallelism while lowering communication costs. We base our work on the theory of streaming graphalgorithms and analyze 14 models and 28 algorithms. We use this analysis to provide theoretical underpinning that matches the physical constraints of FPGA platforms. Our algorithm delivers high performance (more than 4x speedup over tuned parallel CPU variants), low memory, high accuracy, and effective usage of FPGA resources. The substream-centric approach could easily be extended to other algorithms to offer low-power and high-performance graph processing on FPGAs.
We present an algorithm to compute the cycle structure of large directed graphs where each node has exactly one outgoing edge. Such graphs appear as state diagrams of finite state machines such as pseudo-random number...
详细信息
ISBN:
(纸本)9780769527840
We present an algorithm to compute the cycle structure of large directed graphs where each node has exactly one outgoing edge. Such graphs appear as state diagrams of finite state machines such as pseudo-random number generators in cryptography. The size of the graphs necessitates that the adjacency list is kept on hard disks. Our algorithm uses multiple processing units, so that a parallel storage system has to be employed to store the graph. We present experimental resultsfor randomly chosen graphs, andfor the graph of the A5/1 generator used in GSM mobile phones.
Hierarchical clustering methods are important in many data mining and pattern recognition tasks. In this paper we present an efficient coarse grained parallel algorithm for Single Link Clustering;a standard inter-clus...
详细信息
ISBN:
(纸本)3540290311
Hierarchical clustering methods are important in many data mining and pattern recognition tasks. In this paper we present an efficient coarse grained parallel algorithm for Single Link Clustering;a standard inter-cluster linkage metric. Our approach is to first describe algorithms for the Prefix Larger Integer Set and the Closest Larger Ancestor problems and then to show how these can be applied to solve the Single Link Clustering problem. In an extensive performance analysis an implementation of these algorithms on a Linux-based cluster has shown to scale well, exhibiting near linear relative speedup.
As the size of available data sets grows, so too does the demand for efficient parallelalgorithms that will yield the solution to complex combinatorial problems on graphs that may be too large to fit entirely in memo...
详细信息
ISBN:
(纸本)9781479956661
As the size of available data sets grows, so too does the demand for efficient parallelalgorithms that will yield the solution to complex combinatorial problems on graphs that may be too large to fit entirely in memory. In previous work, we have provided a set of out-of-core algorithms to solve one of the central examples of such a problem, maximum clique. In this paper, we review the algorithms and report on our ongoing work to use them as a starting point for an optimized, highly scalable implementation of a maximum clique solver.
The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this a...
详细信息
ISBN:
(纸本)9781509036820
The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations. In this paper we propose a parallel triangle counting algorithm for CUDA GPU. We describe the implementation details necessary to achieve high performance and present the experimental evaluation of our approach. The algorithm achieves 15 to 35 times speedup over our CPU implementation, and is capable of finding 8.8 billion triangles in a 180 million edges graph in 12 seconds on the Nvidia GeForce GTX 980 GPU.
Irregular computations on unstructured data are an important class of problems for parallel programming. graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe p...
详细信息
ISBN:
(纸本)9783662480960;9783662480953
Irregular computations on unstructured data are an important class of problems for parallel programming. graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically with less thread synchronization and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multicore and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.
暂无评论