General matrix multiplication (gemm) is a fundamental kernel in scientific computing and current frameworks for deep learning. Modern realisations of gemm are mostly written in C, on top of a small, highly tuned micro-kernel that is usually encoded in assembly. High performance realisations of gemm in linear algebra libraries generally include a single micro-kernel per architecture, usually implemented by an expert. In this paper, we explore two paths to automatically generate gemm micro-kernels, either using C++ templates with vector intrinsics or high-level Python scripts that directly produce assembly code. Both solutions can integrate high performance software techniques, such as loop unrolling and software pipelining, accommodate any data type, and easily generate micro-kernels of any requested dimension. The performance of these solutions is evaluated on three ARM-based cores and compared with state-of-the-art libraries for these processors: BLIS, OpenBLAS and ArmPL. The experimental results show that the auto-generation approach is highly competitive, mainly due to the possibility of adapting the micro-kernel to the problem dimensions.
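To illustrate the C++ template path described above, the following is a minimal sketch of a BLIS-style gemm micro-kernel written with ARM NEON intrinsics. The micro-tile height MR is a template parameter, so the compiler fully unrolls the register updates; the width is fixed to 4 single-precision elements to keep the example short. The packing layout (A in column-major MR-row panels, B in row-major 4-column panels) and all names are illustrative assumptions, not the paper's actual generator.

```cpp
// Sketch of a templated gemm micro-kernel: C[MR x 4] += A_panel * B_panel.
// Assumes AArch64 NEON and BLIS-style packed operands (an assumption for
// illustration only).
#include <arm_neon.h>

template <int MR>
void micro_kernel_f32(int kc, const float *A, const float *B,
                      float *C, int ldc) {
    float32x4_t c[MR];                        // one NEON register per row of C
    for (int i = 0; i < MR; ++i)
        c[i] = vld1q_f32(&C[i * ldc]);        // load the MR x 4 block of C

    for (int p = 0; p < kc; ++p) {            // one rank-1 update per iteration
        float32x4_t b = vld1q_f32(&B[p * 4]); // next row of packed B
        for (int i = 0; i < MR; ++i)
            // broadcast A(i, p) and fused multiply-add into row i of C;
            // the loop over i is unrolled at compile time because MR is
            // a template parameter
            c[i] = vfmaq_n_f32(c[i], b, A[p * MR + i]);
    }

    for (int i = 0; i < MR; ++i)
        vst1q_f32(&C[i * ldc], c[i]);         // write the block back
}

// Example instantiation of an 8 x 4 micro-tile.
template void micro_kernel_f32<8>(int, const float *, const float *,
                                  float *, int);
```

In this kind of design, changing the micro-tile dimensions only requires instantiating the template with different parameters, which is what makes adapting the micro-kernel to the problem dimensions cheap.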
ISBN (digital): 9781665451550
ISBN (print): 9781665451550
The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for the calculation of the convolution, namely, the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
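As a concrete reference for the lowering approach mentioned in the abstract, the sketch below unfolds an input image via im2col so that the convolution reduces to a single gemm with the filter matrix. The layout (NCHW input, stride 1, no padding) and all identifiers are assumptions made for illustration; they do not reproduce the paper's implementation.

```cpp
// Minimal im2col sketch: unfold one C x H x W input image into a
// (C*KH*KW) x (HO*WO) matrix, so that the convolution becomes
//   Y[OC x (HO*WO)] = F[OC x (C*KH*KW)] * im2col(input).
// Stride 1 and no padding are assumed to keep the example short.
#include <vector>

std::vector<float> im2col(const float *in, int C, int H, int W,
                          int KH, int KW) {
    const int HO = H - KH + 1, WO = W - KW + 1;
    std::vector<float> out(static_cast<size_t>(C) * KH * KW * HO * WO);
    int row = 0;
    for (int c = 0; c < C; ++c)
      for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw, ++row)          // one output row per
          for (int ho = 0; ho < HO; ++ho)               // (channel, kh, kw)
            for (int wo = 0; wo < WO; ++wo)
              out[static_cast<size_t>(row) * HO * WO + ho * WO + wo] =
                  in[(c * H + (ho + kh)) * W + (wo + kw)];
    return out;
}
```

The resulting matrix can then be fed to any optimised gemm, which is why the lowering approach benefits directly from tuned micro-kernels such as the one sketched earlier; the blocked direct and Winograd variants trade this simplicity for lower memory traffic or fewer arithmetic operations.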