ISBN (print): 9781665460224
This paper presents a performance-portable implementation of a kinetic plasma simulation code with C++ parallel algorithms that runs across multiple CPUs and GPUs. Relying on the language-standard parallelism stdpar and the proposed language-standard multi-dimensional array support mdspan, we demonstrate that a performance-portable implementation is possible without sacrificing readability or productivity. For a mini-application, we obtain good overall performance within 20% of the Kokkos version on Intel Icelake, NVIDIA V100, and A100 GPUs. Our conclusion is that stdpar can be a good candidate for developing performance-portable and productive code targeting Exascale-era platforms, assuming this approach becomes available on AMD and/or Intel GPUs in the future.
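To make the programming model concrete, the following is a minimal sketch of stdpar combined with mdspan, assuming a C++23 toolchain with <mdspan> and parallel-algorithm offload (e.g. nvc++ -stdpar=gpu); the array sizes and the five-point update are placeholder choices, not the paper's mini-application kernel.

```cpp
// Minimal sketch of stdpar + mdspan usage (assumes a C++23 toolchain with
// <mdspan> and parallel algorithm offload, e.g. nvc++ -stdpar=gpu).
// The field sizes and update rule are placeholders, not the paper's kernel.
#include <algorithm>
#include <cstddef>
#include <execution>
#include <mdspan>
#include <numeric>
#include <vector>

int main() {
  constexpr std::size_t nx = 512, ny = 512;
  std::vector<double> in(nx * ny, 1.0), out(nx * ny, 0.0);
  std::mdspan f_in(in.data(), nx, ny);
  std::mdspan f_out(out.data(), nx, ny);

  // Flat iteration space over the interior; each index is decoded into (i, j).
  std::vector<std::size_t> idx((nx - 2) * (ny - 2));
  std::iota(idx.begin(), idx.end(), 0);

  std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                [=](std::size_t k) {
                  const std::size_t i = 1 + k / (ny - 2);
                  const std::size_t j = 1 + k % (ny - 2);
                  // Simple 5-point average as a stand-in for the real kernel.
                  f_out[i, j] = 0.25 * (f_in[i - 1, j] + f_in[i + 1, j] +
                                        f_in[i, j - 1] + f_in[i, j + 1]);
                });
  return 0;
}
```

With nvc++, the same source can be compiled with -stdpar=multicore or -stdpar=gpu, which is the portability property the abstract evaluates.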
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
Subgraph matching has garnered increasing attention for its diverse real-world applications. Given the dynamic nature of real-world graphs, addressing evolving scenarios without incurring prohibitive overheads has been a focus of research. However, existing approaches to dynamic subgraph matching often proceed serially, retrieving incremental matches for each updated edge individually. This falls short when handling batch data updates, leading to a decrease in system throughput. Leveraging the parallel processing power of GPUs, whose massive number of cores execute simultaneously, has been widely recognized as a means of performance acceleration in various domains. Surprisingly, a systematic exploration of subgraph matching on batch-dynamic graphs, particularly on a GPU platform, remains untouched. In this paper, we bridge this gap by introducing an efficient framework, GAMMA (GPU-Accelerated Batch-Dynamic Subgraph Matching). Our approach features a DFS-based, warp-centric batch-dynamic subgraph matching algorithm. To ensure load balance in the DFS-based search, we propose warp-level work stealing via shared memory. Additionally, we introduce coalesced search to reduce redundant computations. Comprehensive experiments demonstrate the superior performance of GAMMA: compared to state-of-the-art algorithms, it achieves performance improvements of up to hundreds of times.
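To fix intuition for the search that the GPU algorithm parallelizes, here is a hedged, purely sequential sketch of DFS-based subgraph matching over a fixed matching order; the warp-centric execution, shared-memory work stealing, and coalesced search that GAMMA actually contributes are not modeled, and all type and function names are illustrative.

```cpp
// Simplified, sequential sketch of DFS-based subgraph matching (backtracking
// over a fixed matching order). GAMMA's warp-centric GPU search, work
// stealing, and coalesced search are not modeled here; names are illustrative.
#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency lists

bool connected(const Graph& g, int u, int v) {
  for (int w : g[u]) if (w == v) return true;
  return false;
}

// Extend a partial embedding of query vertices [0, depth) one level deeper.
void dfs(const Graph& q, const Graph& g, std::vector<int>& embed,
         std::vector<char>& used, long long& count) {
  const int depth = static_cast<int>(embed.size());
  if (depth == static_cast<int>(q.size())) { ++count; return; }
  for (int cand = 0; cand < static_cast<int>(g.size()); ++cand) {
    if (used[cand]) continue;
    bool ok = true;
    for (int qn : q[depth])            // check edges back to matched vertices
      if (qn < depth && !connected(g, embed[qn], cand)) { ok = false; break; }
    if (!ok) continue;
    used[cand] = 1; embed.push_back(cand);
    dfs(q, g, embed, used, count);
    embed.pop_back(); used[cand] = 0;  // backtrack
  }
}

int main() {
  Graph query = {{1, 2}, {0, 2}, {0, 1}};             // a triangle
  Graph data  = {{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}};
  std::vector<int> embed;
  std::vector<char> used(data.size(), 0);
  long long count = 0;
  dfs(query, data, embed, used, count);
  std::printf("matches: %lld\n", count);
  return 0;
}
```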
We examine the amount of preprocessing needed for answering certain on-line queries as fast as possible. We start with the following basic problem. Suppose we are given a semigroup $(S, \circ)$. Let $s_1, \ldots, s_n$ be element...
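Since the preview above is cut off, the following is only a hedged illustration of one simple point in the preprocessing/query trade-off for such semigroup range queries: a sparse table with O(n log n) preprocessing and O(log n) semigroup operations per query (the paper's results concern much tighter trade-offs; class and function names here are invented).

```cpp
// Sparse table for on-line semigroup range-composition queries:
// O(n log n) preprocessing, O(log n) semigroup operations per query.
// Illustrative only; not the trade-off achieved in the paper.
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

template <class S, class Op>
class RangeComposer {
 public:
  RangeComposer(std::vector<S> a, Op op) : op_(op), table_{std::move(a)} {
    const std::size_t n = table_[0].size();
    for (std::size_t k = 1; (1u << k) <= n; ++k) {
      table_.emplace_back(n - (1u << k) + 1);
      for (std::size_t i = 0; i + (1u << k) <= n; ++i)
        table_[k][i] = op_(table_[k - 1][i], table_[k - 1][i + (1u << (k - 1))]);
    }
  }
  // Composition a[l] ∘ a[l+1] ∘ ... ∘ a[r], for 0 <= l <= r < n.
  S query(std::size_t l, std::size_t r) const {
    std::size_t k = 0;
    while ((2u << k) <= r - l + 1) ++k;      // largest block fitting the range
    S acc = table_[k][l];
    l += (1u << k);
    while (l <= r) {                         // cover the rest with disjoint blocks
      k = 0;
      while ((2u << k) <= r - l + 1) ++k;
      acc = op_(acc, table_[k][l]);
      l += (1u << k);
    }
    return acc;
  }
 private:
  Op op_;
  std::vector<std::vector<S>> table_;  // table_[k][i] = a[i] ∘ ... ∘ a[i + 2^k - 1]
};

int main() {
  RangeComposer<int, std::plus<int>> rc({3, 1, 4, 1, 5, 9, 2, 6}, std::plus<int>{});
  std::printf("%d\n", rc.query(2, 6));  // 4 + 1 + 5 + 9 + 2 = 21
  return 0;
}
```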
ISBN (digital): 9781665455190
ISBN (print): 9781665455206
This paper presents new deterministic and distributed low-diameter decomposition algorithms for weighted graphs. In particular, we show that if one can efficiently compute approximate distances in a parallel or a distributed setting, one can also efficiently compute low-diameter decompositions. This consequently implies solutions to many fundamental distance-based problems using a polylogarithmic number of approximate distance computations. Our low-diameter decomposition generalizes and extends the line of work starting from [RG20] to weighted graphs in a very model-independent manner. Moreover, our clustering results have additional useful properties, including strong-diameter guarantees, separation properties, restricting cluster centers to specified terminals, and more. Applications include:
– The first near-linear work and polylogarithmic depth randomized and deterministic parallel algorithm for low-stretch spanning trees (LSST) with polylogarithmic stretch. Previously, the best parallel LSST algorithm required $m \cdot n^{o(1)}$ work and $n^{o(1)}$ depth and was inherently randomized. No deterministic LSST algorithm with truly sub-quadratic work and sub-linear depth was known.
– The first near-linear work and polylogarithmic depth deterministic algorithm for computing an $\ell_1$-embedding into polylogarithmic-dimensional space with polylogarithmic distortion. The best prior deterministic algorithms for $\ell_1$-embeddings either require large polynomial work or are inherently sequential.
Even when we apply our techniques to the classical problem of computing a ball-carving with strong-diameter $O(\log^{2}n)$ in an unweighted graph, our new clustering algorithm still leads to an improvement in round complexity from $O(\log^{10}n)$ rounds [CG21] to $O(\log^{4}n)$.
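For readers unfamiliar with the primitive, the sketch below implements the classical randomized exponential-shift clustering (in the style of Miller, Peng, and Xu) on an unweighted graph, which is one standard way to produce a low-diameter decomposition; it is shown only to make the object concrete and is not the paper's deterministic, approximate-distance-based construction.

```cpp
// Classical *randomized* exponential-shift clustering on an unweighted graph:
// every vertex draws delta_v ~ Exp(beta), and each vertex joins the cluster of
// the center minimizing dist(v, u) - delta_v (implemented as a multi-source
// Dijkstra with shifted start times). Not the paper's deterministic method.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <random>
#include <utility>
#include <vector>

std::vector<int> exp_shift_ldd(const std::vector<std::vector<int>>& adj,
                               double beta, unsigned seed = 42) {
  const int n = static_cast<int>(adj.size());
  std::mt19937 gen(seed);
  std::exponential_distribution<double> shift(beta);

  std::vector<double> delta(n);
  double max_shift = 0.0;
  for (int v = 0; v < n; ++v) {
    delta[v] = shift(gen);
    max_shift = std::max(max_shift, delta[v]);
  }

  std::vector<double> best(n);
  std::vector<int> cluster(n, -1);
  using Item = std::pair<double, int>;  // (shifted arrival time, vertex)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
  for (int v = 0; v < n; ++v) {
    best[v] = max_shift - delta[v];     // time at which v may start its own cluster
    pq.push({best[v], v});
  }

  // Multi-source Dijkstra over unit-weight edges: the earliest arrival claims a vertex.
  while (!pq.empty()) {
    auto [t, u] = pq.top();
    pq.pop();
    if (t > best[u]) continue;               // stale queue entry
    if (cluster[u] == -1) cluster[u] = u;    // nobody reached u earlier: new center
    for (int w : adj[u]) {
      if (t + 1.0 < best[w]) {
        best[w] = t + 1.0;
        cluster[w] = cluster[u];             // tentatively claimed by u's center
        pq.push({best[w], w});
      }
    }
  }
  return cluster;                            // cluster[v] = center of v's cluster
}

int main() {
  // Path graph 0-1-2-3-4-5.
  std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2, 4}, {3, 5}, {4}};
  std::vector<int> c = exp_shift_ldd(adj, 0.5);
  for (std::size_t v = 0; v < c.size(); ++v)
    std::printf("vertex %zu -> cluster %d\n", v, c[v]);
  return 0;
}
```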
ISBN (digital): 9798350387131
ISBN (print): 9798350387148
Dynamic graphs, characterized by rapid changes in their topological structure through the addition or deletion of edges or vertices, pose significant challenges for algorithm design. This paper presents a parallel algorithm for dynamic graphs in a batched setting. We employ a popular tree contraction mechanism to create a hierarchical representation of the input graph that allows us to identify localized areas of change, enabling a critical graph primitive such as the minimum spanning tree (MST) to be maintained without recomputation from scratch. We perform experiments demonstrating the application of our algorithm to real-world graphs where batch-dynamic algorithms on large trees are essential for incremental updates. Our experimental validation on GPUs shows that the proposed technique provides up to a 3.43x speedup over equivalent parallel implementations on shared-memory CPUs. Additionally, our method provides up to a 4.23x speedup over conventional parallel computation from scratch.
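The paper's GPU tree-contraction machinery does not fit in a short sketch, but the invariant it maintains can be illustrated with the classical single-edge replacement rule: a newly inserted edge (u, v, w) enters the MST only if w is smaller than the heaviest edge on the current tree path between u and v, which is then evicted. A naive sequential version with an O(n) path search is sketched below; all names are illustrative.

```cpp
// Sequential illustration of incremental MST maintenance: a new edge (u, v, w)
// replaces the heaviest edge on the tree path u..v if it is lighter.
// Naive O(n) path search; NOT the paper's GPU tree-contraction algorithm.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Half { int to; double w; };
using Tree = std::vector<std::vector<Half>>;   // MST as adjacency lists

// DFS from cur to target; on success, record the heaviest edge on the path.
static bool heaviest_on_path(const Tree& t, int cur, int target, int parent,
                             int& hu, int& hv, double& hw) {
  if (cur == target) return true;
  for (const Half& h : t[cur]) {
    if (h.to == parent) continue;
    if (heaviest_on_path(t, h.to, target, cur, hu, hv, hw)) {
      if (hu < 0 || h.w > hw) { hu = cur; hv = h.to; hw = h.w; }
      return true;
    }
  }
  return false;
}

static void remove_tree_edge(Tree& t, int u, int v) {
  auto drop = [&](int a, int b) {
    auto& lst = t[a];
    for (std::size_t i = 0; i < lst.size(); ++i)
      if (lst[i].to == b) { lst[i] = lst.back(); lst.pop_back(); return; }
  };
  drop(u, v); drop(v, u);
}

// Apply the replacement rule for one inserted graph edge.
static void insert_edge(Tree& t, int u, int v, double w) {
  int hu = -1, hv = -1;
  double hw = 0.0;
  if (!heaviest_on_path(t, u, v, -1, hu, hv, hw)) {
    t[u].push_back({v, w}); t[v].push_back({u, w});   // joins two components
  } else if (hu >= 0 && w < hw) {
    remove_tree_edge(t, hu, hv);                      // evict heaviest cycle edge
    t[u].push_back({v, w}); t[v].push_back({u, w});
  }
}

int main() {
  Tree t(4);
  insert_edge(t, 0, 1, 5); insert_edge(t, 1, 2, 2); insert_edge(t, 2, 3, 7);
  insert_edge(t, 0, 3, 3);                            // replaces the weight-7 edge 2-3
  for (int u = 0; u < 4; ++u)
    for (const Half& h : t[u])
      if (u < h.to) std::printf("(%d,%d,%g)\n", u, h.to, h.w);
  return 0;
}
```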
Finding cohesive subgraphs in a large graph has many important applications, such as community detection and biological network analysis. A clique is often too strict a cohesive structure, since communities or biological...
We present an O(1)-round fully-scalable deterministic massively parallel algorithm for computing the min-plus matrix multiplication of unit-Monge matrices. We use this to derive an O(log n)-round fully-scalable massive...
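For reference, the primitive in question is the min-plus (tropical) matrix product; the naive cubic sketch below only fixes the definition and does not exploit the unit-Monge structure that makes the O(1)-round MPC algorithm possible.

```cpp
// Naive O(n^3) min-plus ("tropical") product: C[i][j] = min_k A[i][k] + B[k][j].
// Reference definition only; the paper's contribution is computing this in O(1)
// fully-scalable MPC rounds for unit-Monge matrices, which this does not use.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

Matrix min_plus(const Matrix& a, const Matrix& b) {
  const std::size_t n = a.size(), m = b[0].size(), k = b.size();
  Matrix c(n, std::vector<long long>(m, std::numeric_limits<long long>::max() / 2));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t t = 0; t < k; ++t)
      for (std::size_t j = 0; j < m; ++j)
        c[i][j] = std::min(c[i][j], a[i][t] + b[t][j]);
  return c;
}

int main() {
  Matrix a = {{0, 2}, {3, 0}}, b = {{0, 1}, {4, 0}};
  Matrix c = min_plus(a, b);
  std::printf("%lld %lld\n%lld %lld\n", c[0][0], c[0][1], c[1][0], c[1][1]);
  return 0;
}
```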
ISBN (digital): 9798331540661
ISBN (print): 9798331540678
Text parallelization is a crucial aspect of natural language processing, aiming to enhance the efficiency of information retrieval and analysis. This project focuses on leveraging the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to achieve text parallelization. TF-IDF is a widely used technique for information retrieval and document similarity measurement. In this study, we propose a novel approach that harnesses the TF-IDF algorithm to identify and parallelize relevant sections of text, thereby improving the speed and scalability of text processing tasks. We present a comprehensive analysis of the proposed method, evaluating its effectiveness in comparison to traditional approaches. Our results demonstrate the potential of TF-IDF-based text parallelization in optimizing information extraction processes. This research contributes to the ongoing efforts in advancing text processing techniques, particularly in the context of large-scale document analysis.
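As a hedged illustration of the weighting scheme the project builds on, the sketch below computes per-document TF-IDF scores, tf(t, d) * log(N / df(t)), with the independent per-document scoring step expressed as a parallel std::transform; the tokenization and the project's actual parallelization strategy are not reproduced.

```cpp
// Minimal TF-IDF sketch: score(t, d) = tf(t, d) * log(N / df(t)), with the
// per-document scoring run through std::transform(std::execution::par, ...).
// Whitespace tokenization only; this is not the project's pipeline.
#include <algorithm>
#include <cmath>
#include <execution>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using Doc = std::vector<std::string>;  // a document as a bag of tokens

static Doc tokenize(const std::string& text) {
  std::istringstream in(text);
  Doc d;
  std::string tok;
  while (in >> tok) d.push_back(tok);
  return d;
}

int main() {
  std::vector<Doc> docs = {tokenize("gpu graph gpu"),
                           tokenize("graph matching"),
                           tokenize("text retrieval text text")};
  const double n_docs = static_cast<double>(docs.size());

  // Document frequency of each term (computed once, read-only afterwards).
  std::map<std::string, double> df;
  for (const Doc& d : docs) {
    std::map<std::string, int> seen;
    for (const std::string& t : d) if (seen[t]++ == 0) df[t] += 1.0;
  }

  // Per-document TF-IDF vectors, computed independently and in parallel.
  std::vector<std::map<std::string, double>> tfidf(docs.size());
  std::transform(std::execution::par, docs.begin(), docs.end(), tfidf.begin(),
                 [&](const Doc& d) {
                   std::map<std::string, double> scores;
                   for (const std::string& t : d) scores[t] += 1.0;       // raw tf
                   for (auto& [t, tf] : scores)
                     tf = (tf / d.size()) * std::log(n_docs / df.at(t));  // tf * idf
                   return scores;
                 });

  for (const auto& [t, s] : tfidf[2]) std::cout << t << ": " << s << "\n";
  return 0;
}
```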
ISBN (digital): 9798350387131
ISBN (print): 9798350387148
Graph coarsening is an important step for many multi-level algorithms, most notably graph partitioning. However, such methods often utilize an iterative approach, where a new coarser graph representation is explicitly constructed and retained in memory at each level of coarsening. These overheads can be prohibitive for processing massive datasets or in constrained-memory environments like GPUs. We develop a data structure (CM-Graph) for representing coarsened graphs, which can be used with any adjacency-based graph representation. The CM-Graph data structure uses a constant amount of memory, regardless of the desired level of coarsening. In addition, CM-Graph does not require modification to the existing graph representation, it offers several-fold memory savings in practice, and it can even accelerate graph coarsening, since coarser graph structures never need to be explicitly constructed. We further describe efficient GPU parallelizations of the CM-Graph subroutines for adjacency access, which can also be utilized in arbitrary graph computations without modification.
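The abstract does not give the CM-Graph layout, so the following is only a hedged sketch of the general idea of an implicit coarse view: the original CSR arrays stay untouched and a fine-to-coarse vertex map is used to enumerate coarse neighborhoods on the fly, so no coarser graph is materialized (this sketch does not reproduce CM-Graph's constant-memory bound or its GPU kernels).

```cpp
// Illustrative implicit coarse-graph view over an unmodified CSR graph:
// the only added state is a fine->coarse map plus its membership buckets.
// General idea only; not the real CM-Graph layout or its GPU kernels.
#include <cstdio>
#include <vector>

struct Csr {
  std::vector<int> row_ptr;  // size n+1
  std::vector<int> col_idx;  // size m
};

struct CoarseView {
  const Csr& g;                           // original graph, untouched
  std::vector<int> fine_to_coarse;        // per fine vertex
  std::vector<std::vector<int>> members;  // per coarse vertex: its fine vertices

  CoarseView(const Csr& graph, std::vector<int> f2c, int n_coarse)
      : g(graph), fine_to_coarse(std::move(f2c)), members(n_coarse) {
    for (int v = 0; v < (int)fine_to_coarse.size(); ++v)
      members[fine_to_coarse[v]].push_back(v);
  }

  // Visit the coarse neighbors of coarse vertex c; duplicates are filtered
  // with a caller-provided scratch array, so nothing is materialized.
  template <class F>
  void for_each_coarse_neighbor(int c, std::vector<char>& seen, F&& f) const {
    std::vector<int> touched;
    for (int v : members[c])
      for (int e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) {
        int cu = fine_to_coarse[g.col_idx[e]];
        if (cu == c || seen[cu]) continue;   // skip self-loops and repeats
        seen[cu] = 1;
        touched.push_back(cu);
        f(cu);
      }
    for (int cu : touched) seen[cu] = 0;     // reset scratch for reuse
  }
};

int main() {
  // Fine graph: 4-cycle 0-1-2-3-0; coarsen {0,1}->0 and {2,3}->1.
  Csr g{{0, 2, 4, 6, 8}, {1, 3, 0, 2, 1, 3, 2, 0}};
  CoarseView cv(g, {0, 0, 1, 1}, 2);
  std::vector<char> seen(2, 0);
  cv.for_each_coarse_neighbor(0, seen, [](int cu) { std::printf("neighbor %d\n", cu); });
  return 0;
}
```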
ISBN (digital): 9798350355543
ISBN (print): 9798350355550
Large-scale graphs with billions to trillions of vertices and edges require efficient parallel algorithms for common graph problems, one of which is single-source shortest paths (SSSP). Bulk-synchronous parallel algorithms such as ∆-stepping incur large synchronization costs at the scale of many nodes, so asynchronous approaches are needed for scalability. However, asynchronous approaches are susceptible to wasteful speculative execution. We introduce ACIC, a highly asynchronous approach modulated by continuous concurrent introspection and adaptation. Using message-driven concurrent reductions and broadcasts, task-based scheduling, and an adaptive aggregation library, we explore techniques such as evolving windows and the generation and prioritized flow of optimal updates (edge relaxations), aimed at reducing speculative loss without constraining parallelism. Our results, while preliminary, demonstrate the promise of these ideas, with the potential to impact a wider class of graph algorithms.
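For reference, a minimal sequential sketch of the ∆-stepping bucket structure mentioned above is given below: distances are settled bucket by bucket, light edges (weight at most ∆) are re-relaxed within a bucket, and heavy edges are deferred; ACIC's asynchronous, message-driven design is not modeled.

```cpp
// Minimal sequential ∆-stepping sketch: buckets of width delta, light edges
// (w <= delta) re-relaxed within a bucket, heavy edges relaxed once per bucket.
// Fixes the terminology (buckets, edge relaxations); not ACIC's design.
#include <cstddef>
#include <cstdio>
#include <limits>
#include <vector>

struct Arc { int to; double w; };
using Graph = std::vector<std::vector<Arc>>;

std::vector<double> delta_stepping(const Graph& g, int src, double delta) {
  const double inf = std::numeric_limits<double>::infinity();
  std::vector<double> dist(g.size(), inf);
  std::vector<std::vector<int>> buckets(1);
  auto relax = [&](int v, double d) {
    if (d >= dist[v]) return;
    dist[v] = d;
    std::size_t b = static_cast<std::size_t>(d / delta);
    if (b >= buckets.size()) buckets.resize(b + 1);
    buckets[b].push_back(v);               // (re)insert; stale entries filtered later
  };
  relax(src, 0.0);
  for (std::size_t b = 0; b < buckets.size(); ++b) {
    std::vector<int> settled;
    while (!buckets[b].empty()) {          // light-edge phase may refill bucket b
      std::vector<int> frontier;
      frontier.swap(buckets[b]);
      for (int u : frontier) {
        if (static_cast<std::size_t>(dist[u] / delta) != b) continue;  // stale
        settled.push_back(u);
        for (const Arc& a : g[u])
          if (a.w <= delta) relax(a.to, dist[u] + a.w);   // light edges now
      }
    }
    for (int u : settled)
      for (const Arc& a : g[u])
        if (a.w > delta) relax(a.to, dist[u] + a.w);      // heavy edges once
  }
  return dist;
}

int main() {
  Graph g(4);
  g[0] = {{1, 1.0}, {2, 4.0}};
  g[1] = {{2, 1.0}, {3, 5.0}};
  g[2] = {{3, 1.0}};
  std::printf("dist(0,3) = %g\n", delta_stepping(g, 0, 2.0)[3]);  // expected 3
  return 0;
}
```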