Search results - Inner Mongolia University Library

A novel ILU preconditioning method with a block structure suitable for SIMD vectorization

JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, 2023, vol. 419

Authors: Suzuki, Kengo; Fukaya, Takeshi; Iwashita, Takeshi. Affiliations: Hokkaido Univ, Grad Sch Informat Sci & Technol, Kita Ku W9 N14, Sapporo, Hokkaido 0600814, Japan; Hokkaido Univ, Informat Initiat Ctr, Kita Ku W5 N11, Sapporo, Hokkaido 0600811, Japan

Incomplete LU (ILU) preconditioning is typically used when an iterative solver is applied to an asymmetric system of linear equations. The fill-in selection policy significantly affects the ILU preconditioned iterative solver. In this study, by introducing a new block-based fill-in control method into ILU preconditioning, we propose a new ILU preconditioning method called ILUB preconditioning. In this method, the effect of the permitted fill-ins on the solver convergence is fully exploited because the resulting preconditioner matrix involves dense matrix blocks that can be processed efficiently using SIMD instructions. We implemented sequential and parallel ILUB preconditioned GMRES solvers and evaluated the solver performance in numerical tests. The numerical results demonstrated that ILUB preconditioning outperformed conventional ILU(0) preconditioning. (C) 2022 The Author(s). Published by Elsevier B.V.

Keywords: Iterative linear solver; Incomplete factorization preconditioning; SIMD vectorization; GMRES method; Multi-threading
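The abstract above uses conventional ILU(0) as its baseline. As a minimal sketch of that baseline only (not the ILUB method, whose block-based fill-in rule is not given in this record), the following pure-Python code performs Gaussian elimination but updates only positions where the original matrix is nonzero. The function names `ilu0` and `matmul` are hypothetical, the matrix is stored densely for readability, and nonzero pivots are assumed.

```python
def ilu0(A):
    """ILU(0) sketch: eliminate as in LU, but never create fill-in.

    Assumes a square matrix given as a list of rows and nonzero pivots.
    Returns (L, U) with unit-diagonal L.
    """
    n = len(A)
    # The sparsity pattern of A fixes which entries may ever be nonzero.
    pattern = [[A[i][j] != 0 for j in range(n)] for i in range(n)]
    F = [[float(v) for v in row] for row in A]  # factors are built in place
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            F[i][k] /= F[k][k]  # multiplier l_ik
            for j in range(k + 1, n):
                if pattern[i][j]:  # skip positions outside the pattern
                    F[i][j] -= F[i][k] * F[k][j]
    L = [[F[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[F[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U


def matmul(X, Y):
    """Plain dense product, used here only to check the factors."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


# A tridiagonal matrix produces no fill-in, so ILU(0) equals the exact LU here.
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
L, U = ilu0(A)
LU = matmul(L, U)
```

For matrices whose exact LU factors would contain fill-in, the product of the two factors only approximates the original matrix, which is why the result serves as a preconditioner rather than a direct solver.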

SIMD vectorization for the Lennard-Jones potential with AVX2 and AVX-512 instructions

COMPUTER PHYSICS COMMUNICATIONS, 2019, vol. 237, pp. 1-7

Authors: Watanabe, Hiroshi; Nakagawa, Koh M. Affiliation: Univ Tokyo, Inst Solid State Phys, Kashiwanoha 5-1-5, Kashiwa, Chiba 2778581, Japan

This work describes the SIMD vectorization of the force calculation of the Lennard-Jones potential with Intel AVX2 and AVX-512 instruction sets. Since the force-calculation kernel of the molecular dynamics method involves indirect access to memory, the data layout is one of the most important factors in vectorization. We find that the Array of Structures (AoS) with padding exhibits better performance than Structure of Arrays (SoA) with appropriate vectorization and optimizations. In particular, AoS with 512-bit width exhibits the best performance among the architectures. While the difference in performance between AoS and SoA is significant for the vectorization with AVX2, that with AVX-512 is minor. The effect of other optimization techniques, such as software pipelining together with vectorization, is also discussed. We present results for benchmarks on three CPU architectures: Intel Haswell (HSW), Knights Landing (KNL), and Skylake (SKL). The performance gains by vectorization are about 42% on HSW compared with the code optimized without vectorization. On KNL, the hand-vectorized codes exhibit 34% better performance than the codes vectorized automatically by the Intel compiler. On SKL, the code vectorized with AVX2 exhibits slightly better performance than that vectorized with AVX-512. (C) 2018 Elsevier B.V. All rights reserved.

Keywords: Molecular Dynamics Simulation; SIMD vectorization; AVX2; AVX-512; Xeon Phi
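The two data layouts compared in the abstract above can be sketched as follows. This is an illustration of the layouts only, not the authors' kernel: the helper names are hypothetical, and plain Python lists stand in for the aligned arrays a real AVX2 or AVX-512 code would use. The point it shows is that, with indirect access to particle `j`, padded AoS keeps all coordinates of that particle in one contiguous 4-wide slot, while SoA spreads them over three separate arrays.

```python
def to_soa(positions):
    """Structure of Arrays: one contiguous list per coordinate."""
    xs = [p[0] for p in positions]
    ys = [p[1] for p in positions]
    zs = [p[2] for p in positions]
    return xs, ys, zs


def to_aos_padded(positions):
    """Array of Structures with padding: x, y, z, 0.0 for each particle."""
    flat = []
    for x, y, z in positions:
        flat.extend((x, y, z, 0.0))  # the fourth value is padding
    return flat


def gather_soa(soa, j):
    """Fetch particle j: three reads from three different arrays."""
    xs, ys, zs = soa
    return xs[j], ys[j], zs[j]


def gather_aos(flat, j):
    """Fetch particle j: one contiguous slot of four values, padding dropped."""
    return tuple(flat[4 * j:4 * j + 3])


positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
soa = to_soa(positions)
aos = to_aos_padded(positions)
```

Both layouts return the same coordinates; they differ only in how many separate memory regions an indirect lookup touches.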

Algorithm 1039: Automatic Generators for a Family of Matrix Multiplication Routines with Apache TVM

ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2024, vol. 50, no. 1, pp. 1-34

Authors: Alaejos, Guillermo; Castello, Adrian; Alonso-Jorda, Pedro; Igual, Francisco D.; Martinez, Hector; Quintana-Orti, Enrique S. Affiliations: Univ Politecn Valencia, Valencia 46022, Spain; Univ Complutense Madrid, Madrid 28040, Spain; Univ Cordoba, Cordoba 14071, Spain

We explore the utilization of the Apache TVM open source framework to automatically generate a family of algorithms that follow the approach taken by popular linear algebra libraries, such as GotoBLAS2, BLIS, and OpenBLAS, to obtain high-performance blocked formulations of the general matrix multiplication (GEMM). In addition, we fully automate the generation process by also leveraging the Apache TVM framework to derive a complete variety of the processor-specific micro-kernels for GEMM. This is in contrast with the convention in high-performance libraries, which hand-encode a single micro-kernel per architecture using assembly code. Overall, the combination of our TVM-generated blocked algorithms and micro-kernels for GEMM (1) improves portability and maintainability and streamlines the software life cycle; (2) provides high flexibility to easily tailor and optimize the solution to different data types, processor architectures, and matrix operand shapes, yielding performance on a par (or even superior for specific matrix shapes) with that of hand-tuned libraries; and (3) features a small memory footprint.

Keywords: Portability and maintainability; software lifecycle; matrix multiplication; BLIS framework; Apache TVM; blocking; SIMD vectorization; high performance
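The "blocked formulation" of GEMM mentioned in the abstract above can be sketched in a few lines. This is a plain-Python illustration of cache blocking in the GotoBLAS/BLIS loop order (columns of C, then the shared dimension, then rows of C), not TVM-generated code. The block sizes `mc`, `kc`, `nc` and the function name are illustrative assumptions, and the innermost scalar loops stand in for the packed, vectorized micro-kernel a real library would call.

```python
def blocked_gemm(A, B, mc=2, kc=2, nc=2):
    """Compute C = A * B by iterating over blocks of size mc x kc x nc."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):            # block of columns of C and B
        for pc in range(0, k, kc):        # block of the shared dimension
            for ic in range(0, m, mc):    # block of rows of C and A
                # Stand-in for the micro-kernel: update one mc x nc tile of C.
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        s = C[i][j]
                        for p in range(pc, min(pc + kc, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = blocked_gemm(A, B)
```

Because each tile of C is accumulated across successive `pc` blocks, the result matches the unblocked product while each step touches only small sub-blocks of A and B.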

YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs

33rd ACM SIGPLAN International Conference on Compiler Construction (CC), 2024

Authors: Zhou, Cyrus; Hassman, Zack; Shah, Dhirpal; Richard, Vaughn; Li, Yanjing. Affiliation: Univ Chicago, Chicago, IL 60637, USA

ISBN: (print) 9798400705076

We address the challenges associated with deploying neural networks on CPUs, with a particular focus on minimizing inference time while maintaining accuracy. Our novel approach is to use the dataflow (i.e., computation order) of a neural network to explore data reuse opportunities using heuristic-guided analysis and a code generation framework, which enables exploration of various Single Instruction, Multiple Data (SIMD) implementations to achieve optimized neural network execution. Our results demonstrate that the dataflow that keeps outputs in SIMD registers while also maximizing both input and weight reuse consistently yields the best performance for a wide variety of inference workloads, achieving up to 3x speedup for 8-bit neural networks, and up to 4.8x speedup for binary neural networks, respectively, over the optimized implementations of neural networks today.

Keywords: code generation; compiler support; SIMD vectorization; CPU optimization; dataflow; neural network

TransLib: A Library to Explore Transprecision Floating-Point Arithmetic on Multi-Core IoT End-Nodes

Design, Automation and Test in Europe Conference and Exhibition (DATE)

Authors: Mirsalari, Seyed Ahmad; Tagliavini, Giuseppe; Rossi, Davide; Benini, Luca. Affiliations: Univ Bologna, Bologna, Italy; ETH Zurich, Switzerland

ISBN: (print) 9798350396249

Reduced-precision floating-point (FP) arithmetic is being widely adopted to reduce memory footprint and execution time on battery-powered Internet of Things (IoT) end-nodes. However, reduced-precision computations must meet end-to-end precision constraints to be acceptable at the application level. This work introduces TransLib, an open-source kernel library based on transprecision computing principles, which provides knobs to exploit different FP data types (i.e., float, float16, and bfloat16), also considering the trade-off between homogeneous and mixed-precision solutions. We demonstrate the capabilities of the proposed library on PULP, a 32-bit microcontroller (MCU) coupled with a parallel, programmable accelerator. On average, TransLib kernels achieve an IPC of 0.94 and a speed-up of 1.64x using 16-bit vectorization. The parallel variants achieve a speed-up of 1.97x, 3.91x, and 7.59x on 2, 4, and 8 cores, respectively. The memory footprint reduction is between 25% and 50%. Finally, we show that mixed-precision variants increase the accuracy by 30x at the cost of 2.09x execution time and 1.35x memory footprint compared to float16 vectorized.

Keywords: transprecision computing; IoT end-nodes; parallel programming; SIMD vectorization

Efficient Application of Hanging-Node Constraints for Matrix-Free High-Order FEM Computations on CPU and GPU

37th International Supercomputing Conference on High Performance Computing (ISC High Performance Computing)

Authors: Munch, Peter; Ljungkvist, Karl; Kronbichler, Martin. Affiliations: Helmholtz Zentrum Hereon, Geesthacht, Germany; Tech Univ Munich, Munich, Germany; Uppsala Univ, Uppsala, Sweden

ISBN: (print) 9783031073120; 9783031073113

This contribution presents an efficient algorithm for resolving hanging-node constraints on the fly for high-order finite-element computations on adaptively refined meshes, using matrix-free implementations. We concentrate on unstructured hex-dominated meshes and on multi-component elements with nodal Lagrange shape functions in at least one of their components. The application of general constraints is split up into two distinct operators, one specialized in the hanging-node part and a generic one for the remaining constraints, such as Dirichlet boundary conditions. The former implements in-face interpolations efficiently by a sequence of 1D interpolations with sum factorization according to the refinement configuration of the cell. We discuss ways to efficiently encode and decode such refinement configurations. Furthermore, we present distinct differences in the interpolation step on GPU and CPU, as well as compare different vectorization strategies for the latter. Experimental comparisons with a state-of-the-art algorithm that does not exploit the tensor-product structure show that, on CPUs, the additional costs of cells with hanging-node constraints can be reduced by a factor of 5-10 for a Laplace operator evaluation with high-order elements (k = 3) and affine meshes. For non-affine meshes, the costs for the application of hanging-node constraints can be completely hidden behind the memory transfer. The algorithm has been integrated into the open-source finite-element library ***.

Keywords: Adaptively refined meshes; Finite element methods; High order; Hanging-node constraints; Matrix-free operator evaluation; Node-level optimization; SIMD vectorization; Manycore optimizations

A simple and efficient storage format for SIMD-accelerated SpMV

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2021, vol. 24, no. 4, pp. 3431-3448

Authors: Bian, Haodong; Huang, Jianqiang; Dong, Runting; Guo, Yuluo; Liu, Lingbin; Huang, Dongqiang; Wang, Xiaoying. Affiliations: Qinghai Univ, Dept Comp Technol & Applicat, Xining, Peoples R China; Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China

SpMV (sparse matrix-vector multiplication) is an essential component in scientific computing and has attracted the attention of researchers at home and abroad. With the continuous growth of matrix data, efficient parallel SpMV algorithms have become an active research topic. The sparse matrix compression format is critical to computing performance: it saves storage space and, when matched to the processor architecture, lets the hardware reach its full performance. This paper proposes a new sparse matrix storage format, CSR2 (Compressed Sparse Row 2). It is a new single format suitable for processor platforms with SIMD (Single Instruction Multiple Data) vectorization. The CSR2 format is easy to implement and has a low conversion overhead. We compared the SpMV algorithm based on CSR2 with the most advanced single format, CSR5 (Compressed Sparse Row 5), and with Intel MKL (Intel Math Kernel Library) on the mainstream high-performance processor Intel Xeon E5-2670 v3 CPU. We chose 48 matrices as a benchmark suite. Experimental results show that CSR2 achieves a remarkable performance improvement over CSR5 and MKL. Compared to CSR5, CSR2 achieves an average speedup of 1.401x (up to 1.861x). Compared to MKL, CSR2 achieves an average speedup of 1.261x (up to 5.921x). In practice, for applications with multiple iterations, CSR2 offers low-overhead format conversion and high-throughput computing performance.

Keywords: SpMV; SIMD vectorization; CSR2; CSR5; MKL; Storage format
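For orientation, the standard CSR format that CSR2 and CSR5 are named after can be sketched as follows. This shows plain CSR only; the CSR2 layout itself is not described in this record and is not reproduced here. The function names are hypothetical, and the dense-to-CSR conversion exists only to keep the example self-contained.

```python
def dense_to_csr(A):
    """Convert a dense row-major matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))  # one past the last nonzero of this row
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x for a matrix stored in CSR."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # indirect read of x
        y.append(s)
    return y


A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
x = [1, 2, 3]
values, col_idx, row_ptr = dense_to_csr(A)
y = csr_spmv(values, col_idx, row_ptr, x)
```

The variable row lengths and the indirect read `x[col_idx[k]]` are what make plain CSR awkward for SIMD units, which is the usual motivation for SIMD-oriented variants.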

Automating Vectorized Distributed Graph Computation

Proceedings of the ACM on Management of Data, 2024, vol. 2, no. 6, pp. 1-27

Authors: Wenyue Zhao; Yang Cao; Peter Buneman; Jia Li; Nikos Ntarmos. Affiliations: University of Edinburgh, Edinburgh, UK; Edinburgh Research Center, Central Software Institute, Huawei, Edinburgh, UK

Multi-instance graph algorithms interleave the evaluation of multiple instances of the same algorithm with different inputs over the same graph. They have been shown to be significantly faster than traditional serial and batch evaluation, by sharing computation across instances. However, writing correct multi-instance algorithms is challenging. In this work, we describe AutoMI, a framework for automatically converting vertex-centric graph algorithms into their vectorized multi-instance versions. We also develop an algebraic characterization of algorithms that can benefit best from multi-instance computation with simpler and faster streamlined vectorization. This allows users to decide when to use such optimization and instruct AutoMI to make the best use of SIMD vectorization. Using 6 real-life graphs, we show that AutoMI-converted multi-instance algorithms are 9.6 to 29.5 times faster than serial evaluation, 7.1 to 26.4 times faster than batch evaluation, and are even 2.6 to 4.6 times faster than existing highly optimized handcrafted multi-instance algorithms without vectorization.

Keywords: algebraic characterization; auto vectorization; graph computation; SIMD vectorization
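The multi-instance idea in the abstract above, several runs of one algorithm sharing a single traversal of the same graph, can be illustrated with a hand-written multi-source BFS. This is a generic sketch, not AutoMI output: the function name is hypothetical, and a Python integer used as a bitmask stands in for a SIMD register, with bit `i` belonging to the BFS instance started from `sources[i]`.

```python
def multi_source_bfs(adj, sources):
    """Run one BFS per source over a shared traversal of the graph.

    adj is an adjacency list; returns dist with dist[i][v] equal to the hop
    count from sources[i] to v, or -1 if v is unreachable.
    """
    n = len(adj)
    seen = [0] * n       # bit i set: instance i has already reached the vertex
    frontier = [0] * n   # bit i set: vertex is in instance i's current frontier
    dist = [[-1] * n for _ in sources]
    for i, s in enumerate(sources):
        seen[s] |= 1 << i
        frontier[s] |= 1 << i
        dist[i][s] = 0
    level = 0
    while any(frontier):
        level += 1
        nxt = [0] * n
        for u in range(n):
            if frontier[u]:
                for v in adj[u]:
                    nxt[v] |= frontier[u]  # one edge visit serves all instances
        for v in range(n):
            new = nxt[v] & ~seen[v]        # instances reaching v for the first time
            nxt[v] = new
            seen[v] |= new
            i = 0
            while new:
                if new & 1:
                    dist[i][v] = level
                new >>= 1
                i += 1
        frontier = nxt
    return dist


# Path graph 0 - 1 - 2 - 3 with one BFS from each end.
adj = [[1], [0, 2], [1, 3], [2]]
dist = multi_source_bfs(adj, [0, 3])
```

Each edge is scanned once per level regardless of how many instances are active on it, which is the computation sharing the abstract refers to.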

VECTORIZATION OF A THREAD-PARALLEL JACOBI SINGULAR VALUE DECOMPOSITION METHOD

SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2023, vol. 45, no. 3, pp. C73-C100

Author: Novakovic, Vedran. Affiliation: Zagreb 10000, Croatia

The eigenvalue decomposition (EVD) of (a batch of) Hermitian matrices of order two has a role in many numerical algorithms, of which the one-sided Jacobi method for the singular value decomposition (SVD) is the prime example. In this paper the batched EVD is vectorized with a vector-friendly data layout and the AVX-512 SIMD instructions of Intel CPUs, alongside other key components of a real and a complex OpenMP-parallel Jacobi-type SVD method, inspired by the sequential xGESVJ routines from LAPACK. These vectorized building blocks should be portable to other platforms that support similar vector operations. Unconditional numerical reproducibility is guaranteed for the batched EVD, sequential or threaded, and for the column transformations, which are, like the scaled dot-products, presently sequential but can be threaded if nested parallelism is desired. No avoidable overflow of the results can occur with the proposed EVD or the whole SVD. The measured accuracy of the proposed EVD often surpasses that of the xLAEV2 routines from LAPACK. While the batched EVD outperforms the matching sequence of xLAEV2 calls, speedup of the parallel SVD is modest but can be improved and is already beneficial with enough threads. Regardless of their number, the proposed SVD method gives identical results but of a somewhat lower accuracy than xGESVJ.

Keywords: batched eigendecomposition of Hermitian matrices of order two; SIMD vectorization; singular value decomposition; parallel one-sided Jacobi-type SVD method
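The order-two EVD at the core of the abstract above can be written down for the real symmetric case with a single Jacobi rotation. This is the textbook formula only, an assumption-laden sketch rather than the paper's method: it handles one real matrix instead of a batch of Hermitian ones, uses scalar `math` calls rather than AVX-512, and takes no precautions against overflow. The function name `sym2x2_evd` is hypothetical.

```python
import math


def sym2x2_evd(a, b, c):
    """Diagonalize the real symmetric matrix [[a, b], [b, c]].

    Returns (cs, sn, lam1, lam2) such that V = [[cs, -sn], [sn, cs]]
    satisfies V^T A V = diag(lam1, lam2).
    """
    # The rotation angle solves tan(2 * theta) = 2b / (a - c).
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cs, sn = math.cos(theta), math.sin(theta)
    lam1 = a * cs * cs + 2.0 * b * cs * sn + c * sn * sn
    lam2 = a * sn * sn - 2.0 * b * cs * sn + c * cs * cs
    return cs, sn, lam1, lam2


# [[2, 1], [1, 2]] has eigenvalues 3 and 1.
cs, sn, lam1, lam2 = sym2x2_evd(2.0, 1.0, 2.0)
```

A batched version would apply the same arithmetic to many `(a, b, c)` triples at once, one per vector lane.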

EFFICIENT MATRIX-FREE HIGH-ORDER FINITE ELEMENT EVALUATION FOR SIMPLICIAL ELEMENTS

SIAM JOURNAL ON SCIENTIFIC COMPUTING, 2020, vol. 42, no. 3, pp. C97-C123

Authors: Moxey, David; Amici, Roman; Kirby, Mike. Affiliations: Univ Exeter, Coll Engn Math & Phys Sci, Exeter EX17 1EJ, Devon, England; Univ Utah, Sci Comp & Imaging Inst, Salt Lake City, UT 84112, USA

With the gap between processor clock speeds and memory bandwidth speeds continuing to increase, the use of arithmetically intense schemes, such as high-order finite element methods, continues to be of considerable interest. In particular, the use of matrix-free formulations of finite element operators for tensor-product elements of quadrilaterals in two dimensions and hexahedra in three dimensions, in combination with single-instruction multiple-data instruction sets, is a well-studied topic at present for the efficient implicit solution of elliptic equations. However, a considerable limiting factor for this approach is the use of meshes comprising only quadrilaterals or hexahedra, the creation of which is still an open problem within the mesh generation community. In this article, we study the efficiency of high-order finite element operators for the Helmholtz equation with a focus on extending this approach to unstructured meshes of triangles, tetrahedra, and prismatic elements using the spectral/hp element method and corresponding tensor-product bases for these element types. We show that although performance is naturally degraded when going from hexahedra to these simplicial elements, efficient implementations can still be obtained that are capable of attaining 50% to 70% of the peak floating-point performance of processors with both AVX2 and AVX512 instruction sets.

Keywords: SIMD vectorization; high-order finite elements; spectral/hp element method; high-performance computing
