ISBN: (Print) 9798350342918
Sparse Matrix-Vector Multiplication (SpMV) is widely used in scientific computing, industrial simulation, and intelligent computing, and is a critical kernel in all of these applications. Because of the poor data locality, low cache utilization, and highly irregular branch patterns caused by sparse, random nonzero distributions, SpMV optimization has become one of the most challenging problems on modern high-performance processors. In this paper, we study the bottlenecks of SpMV on current out-of-order CPUs and propose a novel SpMV kernel, PrSpMV, that improves performance by pursuing high predictability. Specifically, we improve memory-access regularity and locality by creating serialized access patterns, so that data-prefetching efficiency and cache utilization are optimized. We also improve pipeline efficiency by creating regular branch patterns that make branch prediction more accurate. Experimental results show that with these optimizations, PrSpMV eliminates nearly all branch mispredictions. Moreover, it significantly reduces the average L2 cache miss rate from 57% to 20% by efficiently leveraging hardware prefetchers. With PrSpMV, a stride prefetcher achieves a 1.31x speedup and a dedicated irregular prefetcher a 1.40x speedup. Meanwhile, on commercial high-end Intel processors, it achieves a 1.32x speedup over state-of-the-art SpMV kernels.
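The abstract does not reproduce PrSpMV's code, but the bottlenecks it targets are easiest to see in the standard CSR kernel below: a minimal C sketch (the function name spmv_csr is our own), in which the gather x[col[j]] is the irregular, cache-unfriendly access and the data-dependent inner-loop bound is the source of the irregular branch patterns.

```c
#include <stddef.h>

/* Baseline CSR SpMV: y = A*x. The gather x[col[j]] jumps randomly
 * through x, giving poor locality and defeating stride prefetchers,
 * and the data-dependent trip count (row_ptr[i+1] - row_ptr[i]) of the
 * inner loop produces the irregular branch patterns the paper targets. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* irregular gather into x */
        y[i] = sum;
    }
}
```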
ISBN: (Print) 9798350364613; 9798350364606
We introduce a distributed-memory parallel algorithm for force-directed node embedding that places the vertices of a graph into a low-dimensional vector space based on the interplay of attraction among neighboring vertices and repulsion among distant vertices. We develop our algorithms using two sparse matrix operations, SDDMM and SpMM. We propose a configurable pull-push-based communication strategy that optimizes memory usage and data transfers based on the available computing resources, and we use asynchronous MPI communication to overlap communication and computation. Our algorithm scales up to 256 nodes on distributed supercomputers, surpassing the performance of state-of-the-art algorithms.
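For reference, here is a minimal shared-memory C sketch of the two building blocks on a CSR graph; the function names, the flat row-major layout of the embedding matrix E, and the use of the raw dot product as the per-edge value are our assumptions, and the paper's distributed, communication-overlapped kernels are not shown.

```c
#include <stddef.h>

/* SDDMM on a CSR graph: for every edge (i,j), compute the dot product
 * of embedding rows E[i] and E[j]. In force-directed embedding, such
 * per-edge similarities feed the attraction terms. */
void sddmm_csr(size_t n, size_t d, const size_t *row_ptr, const size_t *col,
               const double *E, double *edge_val)
{
    for (size_t i = 0; i < n; i++)
        for (size_t e = row_ptr[i]; e < row_ptr[i + 1]; e++) {
            double dot = 0.0;
            for (size_t k = 0; k < d; k++)
                dot += E[i * d + k] * E[col[e] * d + k];
            edge_val[e] = dot;
        }
}

/* SpMM: accumulate edge-weighted neighbor embeddings, G += A_w * E,
 * i.e. the per-vertex aggregation of attraction contributions.
 * G must be zero-initialized by the caller. */
void spmm_csr(size_t n, size_t d, const size_t *row_ptr, const size_t *col,
              const double *edge_val, const double *E, double *G)
{
    for (size_t i = 0; i < n; i++)
        for (size_t e = row_ptr[i]; e < row_ptr[i + 1]; e++)
            for (size_t k = 0; k < d; k++)
                G[i * d + k] += edge_val[e] * E[col[e] * d + k];
}
```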
In many large-scale simulations that depend on parallel processing to solve problems of scientific interest, the application time can be dominated by the time for the underlying sparse linear system solution. This thesis concerns the development of effective sparse solvers for distributed-memory multiprocessors using a hybrid of direct and iterative sparse solution methods. More specifically, we accelerate the convergence of an iterative solution method, the method of Conjugate Gradients (CG), using an incomplete Cholesky factor; the latter is an approximation to the sparse matrix factor used in a direct method. Our parallel incomplete factorization scheme can support a range of fill-in to provide flexible preconditioning that can meet the requirements of a variety of applications. We have also developed special techniques that allow the effective application of such preconditioners on distributed-memory multiprocessors; the relatively large latencies of interprocessor communication on such parallel computers make conventional schemes using parallel substitution extremely inefficient. The first part of the dissertation focuses on the design of a parallel tree-based left-looking drop-threshold incomplete Cholesky factorization scheme using extensions of techniques from direct methods. The second part concerns modifications to the incomplete Cholesky factor to enable its efficient application as a preconditioner; these modifications selectively replace certain triangular submatrices in the factor by their approximate inverses. We develop a 'Selective Inversion' (SI) scheme based on explicit inversion of selected submatrices and another variant using Selective Sparse Approximate Inversion (SSAI). The final part of the dissertation concerns latency-tolerant application of our ICT-SI and ICT-SSAI preconditioners by selectively using parallel matrix-vector multiplication instead of parallel substitution. We analyze the computation and communication costs.
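To see why selective inversion is latency-tolerant, compare the two preconditioner-application kernels sketched below in C. Both are our own simplifications, not the thesis's ICT-SI/ICT-SSAI code: forward substitution carries a serial dependence on earlier solution entries, whereas applying a precomputed inverse is just a matrix-vector product that parallelizes like SpMV.

```c
#include <stddef.h>

/* Forward substitution with a lower-triangular factor L in CSR, columns
 * sorted so the diagonal entry is last in each row: solves L*z = r.
 * Each z[i] depends on earlier entries of z, so a distributed version
 * needs a chain of small, latency-bound messages. */
void forward_solve_csr(size_t n, const size_t *row_ptr, const size_t *col,
                       const double *val, const double *r, double *z)
{
    for (size_t i = 0; i < n; i++) {
        size_t diag = row_ptr[i + 1] - 1;     /* last entry = diagonal */
        double s = r[i];
        for (size_t j = row_ptr[i]; j < diag; j++)
            s -= val[j] * z[col[j]];
        z[i] = s / val[diag];
    }
}

/* Selective inversion replaces such solves, for chosen submatrices, by a
 * precomputed (approximate) inverse applied as z = inv(L)*r: independent
 * row dot products with no serial dependence. (Dense sketch for clarity;
 * the SSAI variant keeps the approximate inverse sparse.) */
void apply_inverse(size_t n, const double *Linv, const double *r, double *z)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += Linv[i * n + j] * r[j];
        z[i] = s;
    }
}
```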
Graph Neural Networks (GNNs) have emerged as powerful tools for graph-based machine learning tasks, but their performance is often constrained by inefficient sparse operators and limited hardware utilization during multi-operator workflows. This paper presents GNNPilot, a holistic optimization framework that addresses these challenges through three key innovations. First, we introduce two packing strategies for gather operators, including neighbor packing for load balancing in sparser graphs, and bin packing with a new sparse format for enhanced data locality in denser graphs. Second, we propose dynamic parallelization methods and a novel row panel-based kernel fusion technique to optimize complex multi-operator GNN models. Third, we develop a lightweight sampling-based auto-tuning mechanism that adapts the framework’s optimization strategies to varying input characteristics. Built upon tensor expression-based intermediate representations, GNNPilot maintains the flexibility to optimize both popular and customized GNN models. Extensive experiments across diverse GNN models and graph datasets demonstrate that GNNPilot achieves substantial speedups over state-of-the-art implementations in both the performance of single operators and the efficiency of end-to-end inference. These results establish GNNPilot as an efficient and adaptive solution for accelerating GNN computations on modern GPU architectures.
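The abstract does not spell out the packing strategies, but neighbor packing for load balancing is commonly realized by splitting the edge list into chunks of equal nonzero count regardless of row boundaries. The C sketch below is our own host-side illustration of that idea, not GNNPilot's API; packs whose first edge falls mid-row would additionally need cross-pack merging of partial sums, which is omitted here.

```c
#include <stddef.h>

/* Neighbor packing sketch: partition the nnz edges of a CSR graph into
 * packs of `pack_nnz` edges each, so every worker (e.g. a GPU thread
 * block) receives the same amount of gather work even when row lengths
 * are highly skewed. For each pack we record its first edge and the row
 * that edge belongs to. The caller must size pack_edge/pack_row for
 * ceil(nnz / pack_nnz) + 1 entries. Returns the number of packs. */
size_t neighbor_pack(size_t n_rows, const size_t *row_ptr, size_t pack_nnz,
                     size_t *pack_edge, size_t *pack_row)
{
    size_t nnz = row_ptr[n_rows];
    size_t p = 0, r = 0;
    for (size_t e = 0; e < nnz; e += pack_nnz) {
        while (r < n_rows && row_ptr[r + 1] <= e)
            r++;                 /* advance to the row containing edge e */
        pack_edge[p] = e;
        pack_row[p]  = r;
        p++;
    }
    pack_edge[p] = nnz;          /* sentinel: end of the last pack */
    return p;
}
```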