ISBN: (Print) 9798350342918
Sparse Matrix-Vector Multiplication (SpMV) is widely used in scientific computing, industrial simulation, and intelligent computing, and is a critical kernel in all of these applications. Because of the poor data locality, low cache utilization, and highly irregular branch patterns caused by sparse, random nonzero distributions, SpMV optimization has become one of the most challenging problems on modern high-performance processors. In this paper, we study the bottlenecks of SpMV on current out-of-order CPUs and propose a novel SpMV kernel, PrSpMV, that improves performance by pursuing high predictability. Specifically, we improve memory-access regularity and locality by creating serialized access patterns, so that data-prefetching efficiency and cache utilization are optimized. We also improve pipeline efficiency by creating regular branch patterns that make branch prediction more accurate. Experimental results show that with these optimizations, PrSpMV eliminates nearly all branch mispredictions. Moreover, it significantly reduces the average L2 cache miss rate from 57% to 20% by efficiently leveraging hardware prefetchers. With PrSpMV, a stride prefetcher achieves a 1.31x speedup and a dedicated irregular prefetcher a 1.40x speedup. Meanwhile, on commercial high-end Intel processors, it achieves a 1.32x speedup over state-of-the-art SpMV kernels.
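The abstract does not reproduce PrSpMV's code, but the bottlenecks it targets are easiest to see in the standard CSR kernel below: a minimal C sketch (the function name spmv_csr is our own), in which the gather x[col[j]] is the irregular, cache-unfriendly access and the data-dependent inner-loop bound is the source of the irregular branch patterns.

```c
#include <stddef.h>

/* Baseline CSR SpMV: y = A*x. The gather x[col[j]] jumps randomly
 * through x, giving poor locality and defeating stride prefetchers,
 * and the data-dependent trip count (row_ptr[i+1] - row_ptr[i]) of the
 * inner loop produces the irregular branch patterns the paper targets. */
void spmv_csr(size_t n_rows, const size_t *row_ptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* irregular gather into x */
        y[i] = sum;
    }
}
```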
ISBN: (Print) 9798350364613; 9798350364606
We introduce a distributed-memory parallel algorithm for force-directed node embedding that places the vertices of a graph into a low-dimensional vector space based on the interplay of attraction among neighboring vertices and repulsion among distant vertices. We develop our algorithms using two sparse matrix operations, SDDMM and SpMM. We propose a configurable pull-push-based communication strategy that optimizes memory usage and data transfers based on the available computing resources, and we use asynchronous MPI communication to overlap communication and computation. Our algorithm scales up to 256 nodes on distributed supercomputers, surpassing the performance of state-of-the-art algorithms.
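For reference, here is a minimal shared-memory C sketch of the two building blocks on a CSR graph; the function names, the flat row-major layout of the embedding matrix E, and the use of the raw dot product as the per-edge value are our assumptions, and the paper's distributed, communication-overlapped kernels are not shown.

```c
#include <stddef.h>

/* SDDMM on a CSR graph: for every edge (i,j), compute the dot product
 * of embedding rows E[i] and E[j]. In force-directed embedding, such
 * per-edge similarities feed the attraction terms. */
void sddmm_csr(size_t n, size_t d, const size_t *row_ptr, const size_t *col,
               const double *E, double *edge_val)
{
    for (size_t i = 0; i < n; i++)
        for (size_t e = row_ptr[i]; e < row_ptr[i + 1]; e++) {
            double dot = 0.0;
            for (size_t k = 0; k < d; k++)
                dot += E[i * d + k] * E[col[e] * d + k];
            edge_val[e] = dot;
        }
}

/* SpMM: accumulate edge-weighted neighbor embeddings, G += A_w * E,
 * i.e. the per-vertex aggregation of attraction contributions.
 * G must be zero-initialized by the caller. */
void spmm_csr(size_t n, size_t d, const size_t *row_ptr, const size_t *col,
              const double *edge_val, const double *E, double *G)
{
    for (size_t i = 0; i < n; i++)
        for (size_t e = row_ptr[i]; e < row_ptr[i + 1]; e++)
            for (size_t k = 0; k < d; k++)
                G[i * d + k] += edge_val[e] * E[col[e] * d + k];
}
```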
In many large-scale simulations that depend on parallel processing to solve problems of scientific interest, the application time can be dominated by the time for the underlying sparse linear system solution. This thesis concerns the development of effective sparse solvers for distributed-memory multiprocessors using a hybrid of direct and iterative sparse solution methods. More specifically, we accelerate the convergence of an iterative solution method, the method of Conjugate Gradients (CG), using an incomplete Cholesky factor; the latter is an approximation to the sparse matrix factor used in a direct method. Our parallel incomplete factorization scheme can support a range of fill-in to provide flexible preconditioning that can meet the requirements of a variety of applications. We have also developed special techniques that allow the effective application of such preconditioners on distributed-memory multiprocessors; the relatively large latencies of interprocessor communication on such parallel computers make conventional schemes using parallel substitution extremely inefficient. The first part of the dissertation focuses on the design of a parallel tree-based left-looking drop-threshold incomplete Cholesky factorization scheme using extensions of techniques from direct methods. The second part concerns modifications to the incomplete Cholesky factor to enable its efficient application as a preconditioner; these modifications selectively replace certain triangular submatrices in the factor by their approximate inverses. We develop a 'Selective Inversion' (SI) scheme based on explicit inversion of selected submatrices and another variant using Selective Sparse Approximate Inversion (SSAI). The final part of the dissertation concerns latency-tolerant application of our ICT-SI and ICT-SSAI preconditioners by selectively using parallel matrix-vector multiplication instead of parallel substitution. We analyze the computation and communication costs.
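To see why selective inversion is latency-tolerant, compare the two preconditioner-application kernels sketched below in C. Both are our own simplifications, not the thesis's ICT-SI/ICT-SSAI code: forward substitution carries a serial dependence on earlier solution entries, whereas applying a precomputed inverse is just a matrix-vector product that parallelizes like SpMV.

```c
#include <stddef.h>

/* Forward substitution with a lower-triangular factor L in CSR, columns
 * sorted so the diagonal entry is last in each row: solves L*z = r.
 * Each z[i] depends on earlier entries of z, so a distributed version
 * needs a chain of small, latency-bound messages. */
void forward_solve_csr(size_t n, const size_t *row_ptr, const size_t *col,
                       const double *val, const double *r, double *z)
{
    for (size_t i = 0; i < n; i++) {
        size_t diag = row_ptr[i + 1] - 1;     /* last entry = diagonal */
        double s = r[i];
        for (size_t j = row_ptr[i]; j < diag; j++)
            s -= val[j] * z[col[j]];
        z[i] = s / val[diag];
    }
}

/* Selective inversion replaces such solves, for chosen submatrices, by a
 * precomputed (approximate) inverse applied as z = inv(L)*r: independent
 * row dot products with no serial dependence. (Dense sketch for clarity;
 * the SSAI variant keeps the approximate inverse sparse.) */
void apply_inverse(size_t n, const double *Linv, const double *r, double *z)
{
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += Linv[i * n + j] * r[j];
        z[i] = s;
    }
}
```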
Graph Neural Networks (GNNs) have emerged as powerful tools for graph-based machine learning tasks, but their performance is often constrained by inefficient sparse operators and limited hardware utilization during multi-operator workflows. This paper presents GNNPilot, a holistic optimization framework that addresses these challenges through three key innovations. First, we introduce two packing strategies for gather operators, including neighbor packing for load balancing in sparser graphs, and bin packing with a new sparse format for enhanced data locality in denser graphs. Second, we propose dynamic parallelization methods and a novel row panel-based kernel fusion technique to optimize complex multi-operator GNN models. Third, we develop a lightweight sampling-based auto-tuning mechanism that adapts the framework’s optimization strategies to varying input characteristics. Built upon tensor expression-based intermediate representations, GNNPilot maintains the flexibility to optimize both popular and customized GNN models. Extensive experiments across diverse GNN models and graph datasets demonstrate that GNNPilot achieves substantial speedups over state-of-the-art implementations in both the performance of single operators and the efficiency of end-to-end inference. These results establish GNNPilot as an efficient and adaptive solution for accelerating GNN computations on modern GPU architectures.
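The abstract does not spell out the packing strategies, but neighbor packing for load balancing is commonly realized by splitting the edge list into chunks of equal nonzero count regardless of row boundaries. The C sketch below is our own host-side illustration of that idea, not GNNPilot's API; packs whose first edge falls mid-row would additionally need cross-pack merging of partial sums, which is omitted here.

```c
#include <stddef.h>

/* Neighbor packing sketch: partition the nnz edges of a CSR graph into
 * packs of `pack_nnz` edges each, so every worker (e.g. a GPU thread
 * block) receives the same amount of gather work even when row lengths
 * are highly skewed. For each pack we record its first edge and the row
 * that edge belongs to. The caller must size pack_edge/pack_row for
 * ceil(nnz / pack_nnz) + 1 entries. Returns the number of packs. */
size_t neighbor_pack(size_t n_rows, const size_t *row_ptr, size_t pack_nnz,
                     size_t *pack_edge, size_t *pack_row)
{
    size_t nnz = row_ptr[n_rows];
    size_t p = 0, r = 0;
    for (size_t e = 0; e < nnz; e += pack_nnz) {
        while (r < n_rows && row_ptr[r + 1] <= e)
            r++;                 /* advance to the row containing edge e */
        pack_edge[p] = e;
        pack_row[p]  = r;
        p++;
    }
    pack_edge[p] = nnz;          /* sentinel: end of the last pack */
    return p;
}
```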