Search Results - Inner Mongolia University Library

An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes

COMPUTER PHYSICS COMMUNICATIONS, 2017, Vol. 210, pp. 145-154

Authors: Vincenti, H.; Lobet, M.; Lehe, R.; Sasanka, R.; Vay, J.-L. Affiliations: Lawrence Berkeley Natl Lab, 1 Cyclotron Rd, Berkeley, CA 94720, USA; CEA, Lasers Interact & Dynam Lab (LIDyL), Gif Sur Yvette, France; Intel Corp, Hillsboro, OR 97124, USA

In current computer architectures, data movement (from die to network) is by far the most energy-consuming part of an algorithm (≈20 pJ/word on-die to 10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute node that will have a reduced clock speed to allow for efficient cooling. To compensate for the frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time-consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions that can significantly affect vectorization performance on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2, 256-bit-wide data registers). Results show a speed-up factor of ×2 to ×2.5 in double precision for particle shape factors of orders 1-3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512-bit register lengths (8 doubles/16 singles). Program summary: Program Title: vec_deposition; Program Files doi: http://***/10.17632/nh77fv9k8c.1; Licensing provisions: BSD 3-Clause; Programming language: Fortran 90; External routines/libraries: OpenMP > 4.0; Nature of problem: Exascale archite

Keywords: Particle-In-Cell method OpenMP SIMD vectorization AVX2 AVX512 Tiling Cache reuse Many-core architectures
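The abstract above mentions a data structure that avoids gather/scatter instructions but does not spell it out in this listing. The following is only a minimal scalar sketch of one way such a layout can look, reduced to 1D with linear (order 1) weights. The function name `deposit_cell_local`, the buffer name `rhocells`, and the two-stage split are assumptions made for illustration, not the published PICSAR routine.

```c
#include <stddef.h>

/* Hypothetical 1D illustration of cell-local charge deposition.
 * Stage 1: each particle adds its two linear weights to a buffer in which
 *          the two vertices of a cell are stored next to each other, so a
 *          particle touches one contiguous pair instead of scattered nodes.
 * Stage 2: the cell-local buffer is folded back onto the nodal grid.
 * Assumptions: 0 <= x[p] < ncells*dx, rhocells has 2*ncells zeroed entries,
 * rho has ncells+1 zeroed entries. */
void deposit_cell_local(const double *x, const double *q, size_t np,
                        double dx, size_t ncells,
                        double *rhocells, double *rho)
{
    for (size_t p = 0; p < np; ++p) {
        double s = x[p] / dx;          /* position in units of cells      */
        size_t c = (size_t)s;          /* cell index (x >= 0, so trunc)   */
        double w = s - (double)c;      /* fractional offset inside cell   */
        rhocells[2 * c]     += q[p] * (1.0 - w);  /* left vertex of cell  */
        rhocells[2 * c + 1] += q[p] * w;          /* right vertex of cell */
    }
    for (size_t c = 0; c < ncells; ++c) {
        rho[c]     += rhocells[2 * c];      /* left vertex is node c      */
        rho[c + 1] += rhocells[2 * c + 1];  /* right vertex is node c + 1 */
    }
}
```

With three unit charges at x = 0.25, 1.5 and 1.75 on a grid with dx = 1, the nodal result is 0.75, 1.0, 1.25, 0.0, and the total charge of 3 is preserved.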


Extending OpenMP SIMD Support for Target Specific Code and Application to ARM SVE


13th International Workshop on OpenMP (IWOMP)

Authors: Lee, Jinpil; Petrogalli, Francesco; Hunter, Graham; Sato, Mitsuhisa. Affiliations: RIKEN, Adv Inst Computat Sci, Kobe, Hyogo, Japan; ARM Ltd, Cambridge, England

ISBN: (print) 9783319655789; 9783319655772

Recent trends in processor design accommodate wide vector extensions. SIMD vectorization is more important than before to exploit the potential performance of the target architecture. The latest OpenMP specification provides new directives which help compilers produce better code for SIMD auto-vectorization. However, it is hard to optimize the SIMD code performance in OpenMP since the target SIMD code generation mostly relies on the compiler implementation. In this paper, we propose a new directive that specifies user-defined SIMD variants of functions used in SIMD loops. The compiler can then use the user-defined SIMD variants when it encounters OpenMP loops instead of auto-vectorized SIMD variants. The user can optimize the SIMD performance by implementing highly optimized SIMD code with intrinsic functions. The performance evaluation using an image composition kernel shows that the user can optimize SIMD code generation in an explicit way by using our approach. The user-defined function reduces the number of instructions by 70% compared with the auto-vectorized code generated from the serial code.

Keywords: OpenMP SIMD vectorization VLA programming - Vector Length Agnostic programming
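For context, the standard OpenMP constructs that the abstract above builds on can be sketched as follows. This shows only the existing `declare simd` and `simd` directives from OpenMP 4.0; the new directive proposed in the paper is not reproduced here. The blend formula and the names `blend` and `compose` are assumptions chosen to resemble an image composition kernel.

```c
#include <stddef.h>

/* Ask the compiler to generate a SIMD variant of this scalar function.
 * "notinbranch" states it is never called under a condition in the loop. */
#pragma omp declare simd notinbranch
static inline float blend(float fg, float bg, float alpha)
{
    /* Illustrative alpha blend of one channel value. */
    return alpha * fg + (1.0f - alpha) * bg;
}

/* Compose two images channel by channel; the loop is marked for SIMD
 * execution, so the compiler may call its auto-generated variant of blend(). */
void compose(const float *fg, const float *bg, const float *alpha,
             float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        out[i] = blend(fg[i], bg[i], alpha[i]);
}
```

Compilers without OpenMP enabled ignore the pragmas and produce an ordinary scalar loop, so the sketch stays correct either way; how good the vector code is depends on the compiler, which is the limitation the abstract points at.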


Multiple Pattern Matching for Network Security Applications: Acceleration through Vectorization


46th International Conference on Parallel Processing Workshops (ICPPW)

Authors: Stylianopoulos, Charalampos; Almgren, Magnus; Landsiedel, Olaf; Papatriantafilou, Marina. Affiliation: Chalmers Univ Technol, Gothenburg, Sweden

ISBN: (print) 9781538610428

Pattern matching is a key building block of Intrusion Detection Systems and firewalls, which are deployed nowadays on commodity systems from laptops to massive web servers in the cloud. In fact, pattern matching is one of their most computationally intensive parts and a bottleneck to their performance. In Network Intrusion Detection, for example, pattern matching algorithms handle thousands of patterns and contribute to more than 70% of the total running time of the system. In this paper, we introduce efficient algorithmic designs for multiple pattern matching which (a) ensure cache locality and (b) utilize modern SIMD instructions. We first identify properties of pattern matching that make it fit for vectorization and show how to use them in the algorithmic design. Second, we build on an earlier, cache-aware algorithmic design and we show how cache locality combined with SIMD gather instructions, introduced in 2013 to Intel's family of processors, can be applied to pattern matching. We evaluate our algorithmic design with open data sets of real-world network traffic: our results on two different platforms, Haswell and Xeon-Phi, show a speedup of 1.8x and 3.6x, respectively, over Direct Filter Classification (DFC), a recently proposed algorithm by Choi et al. for pattern matching exploiting cache locality, and a speedup of more than 2.3x over Aho-Corasick, a widely used algorithm in today's Intrusion Detection Systems.

Keywords: pattern matching SIMD vectorization gather


Dynamic SIMD Vector Lane Scheduling


International Supercomputing Conference (ISC High Performance)

Authors: Krzikalla, Olaf; Wende, Florian; Hoehnerbach, Markus. Affiliations: Tech Univ Dresden, Dresden, Germany; Zuse Inst Berlin, Germany; RWTH Univ Aachen, Germany

ISBN: (print) 9783319460796; 9783319460789

A classical technique to vectorize code that contains control flow is a control-flow to data-flow conversion. In that approach, statements are augmented with masks that denote whether a given vector lane participates in the statement's execution or idles. If the scheduling of work to vector lanes is performed statically, then some of the vector lanes will run idle in case of control flow divergences or varying work intensities across the loop iterations. With an increasing number of vector lanes, the likelihood of divergences or heavily unbalanced work assignments increases, and static scheduling leads to poor resource utilization. In this paper, we investigate different approaches to dynamic SIMD vector lane scheduling using the Mandelbrot set algorithm as a test case. To overcome the limitations of static scheduling, idle vector lanes are assigned work items dynamically, thereby minimizing per-lane idle cycles. Our evaluation on the Knights Corner and Knights Landing platforms shows that our approaches can lead to considerable performance gains over a static work assignment. By using the AVX-512 vector compress and expand instructions, we are able to further improve the scheduling.

Keywords: SIMD vectorization Dynamic scheduling Intel Xeon Phi
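The static, mask-based scheme that the abstract above takes as its baseline can be sketched in plain C by emulating the vector lanes with small arrays. This is an illustrative scalar emulation under assumed names (`LANES`, `mandel_static_masked`); it is not the authors' code and does not implement their dynamic scheduling. It only makes visible how many lane-steps are wasted when lanes finish at different times.

```c
#include <stddef.h>

#define LANES 8  /* hypothetical vector width, chosen for illustration */

/* Iterate LANES Mandelbrot points in lockstep with per-lane masks.
 * iters[l] receives the number of iterations lane l actually performed.
 * Returns the number of idle lane-steps: steps in which a lane was masked
 * off while at least one other lane was still working. */
unsigned mandel_static_masked(const float *cr, const float *ci,
                              unsigned max_iter, unsigned *iters)
{
    float zr[LANES] = {0.0f}, zi[LANES] = {0.0f};
    int active[LANES];
    unsigned idle = 0;

    for (size_t l = 0; l < LANES; ++l) { active[l] = 1; iters[l] = 0; }

    for (unsigned k = 0; k < max_iter; ++k) {
        int any = 0;
        for (size_t l = 0; l < LANES; ++l) {
            float r2 = zr[l] * zr[l], i2 = zi[l] * zi[l];
            /* The escape test becomes a mask update instead of a branch. */
            active[l] = active[l] && (r2 + i2 <= 4.0f);
            float nzr = r2 - i2 + cr[l];
            float nzi = 2.0f * zr[l] * zi[l] + ci[l];
            /* Masked update: inactive lanes keep their old values. */
            zr[l] = active[l] ? nzr : zr[l];
            zi[l] = active[l] ? nzi : zi[l];
            iters[l] += (unsigned)active[l];
            any |= active[l];
        }
        if (!any) break;               /* every lane has escaped */
        for (size_t l = 0; l < LANES; ++l)
            idle += (unsigned)!active[l];
    }
    return idle;
}
```

If one lane holds a point that never escapes and the others escape after one or two steps, most lane-steps are idle. A dynamic scheme as described in the abstract would refill those lanes with new points instead.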


A Basic Linear Algebra Compiler for Structured Matrices


14th International Symposium on Code Generation and Optimization (CGO)

Authors: Spampinato, Daniele G.; Pueschel, Markus. Affiliation: Swiss Fed Inst Technol, Dept Comp Sci, Zurich, Switzerland

ISBN: (print) 9781450337786

Many problems in science and engineering are in practice modeled and solved through matrix computations. Often, the matrices involved have structure such as symmetric or triangular, which reduces the operations count needed to perform the computation. For example, dense linear systems of equations are solved by first converting to triangular form and optimization problems may yield matrices with any kind of structure. The well-known BLAS (basic linear algebra sub-routine) interface provides a small set of structured matrix computations, chosen to serve a certain set of higher level functions (LAPACK). However, if a user encounters a computation or structure that is not supported, she loses the benefits of the structure and chooses a generic library. In this paper, we address this problem by providing a compiler that translates a given basic linear algebra computation on structured matrices into optimized C code, optionally vectorized with intrinsics. Our work combines prior work on the Spiral-like LGen compiler with techniques from polyhedral compilation to mathematically capture matrix structures. In the paper we consider upper/lower triangular and symmetric matrices but the approach is extensible to a much larger set including blocked structures. We run experiments on a modern Intel platform against the Intel MKL library and a baseline implementation showing competitive performance results for both BLAS and non-BLAS functionalities.

Keywords: Program synthesis Basic linear algebra Structured matrices DSL Tiling SIMD vectorization


An Empirical Study of Performance, Power Consumption, and Energy Cost of Erasure Code Computing for HPC Cloud Storage Systems


IEEE International Conference on Networking, Architecture and Storage (NAS 2015)

Authors: Chen, Hsing-bung; Grider, Gary; Inman, Jeff; Fields, Parks; Kuehn, Jeff Alan. Affiliation: Los Alamos Natl Lab, Los Alamos, NM 87545, USA

ISBN: (print) 9781467378918

Erasure code storage systems are becoming popular choices for cloud storage systems due to cost-effective storage space saving schemes and higher fault-resilience capabilities. Both erasure code encoding and decoding procedures involve heavy array, matrix, and table-lookup compute-intensive operations. Multi-core, many-core, and Streaming SIMD Extensions are implemented in modern CPU designs. In this paper, we study the power consumption and energy efficiency of erasure code computing using a traditional Intel x86 platform and an Intel Streaming SIMD Extensions platform. We use a breakdown power consumption analysis approach and conduct power studies of the erasure code encoding process on various storage devices. We present the impact of various storage devices on erasure code based storage systems in terms of processing time, power utilization, and energy cost. Finally, we conclude our studies and demonstrate that Intel x86's Streaming SIMD Extensions computing is a cost-effective and favorable choice for future power-efficient HPC cloud storage systems.

Keywords: Erasure code SIMD vectorization Power measurement Energy cost Power consumption Cloud storage


Soft-Output Demapper and Viterbi Decoder for Software-Defined Radio


Conference on Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments

Authors: Marcin, Darmetko. Affiliation: Warsaw Univ Technol, Inst Radioelect, Warsaw, Poland

ISBN: (print) 9781628413694

The Viterbi algorithm is commonly used in communication systems to decode convolutional codes. Soft-decision demapping can be used to further improve Viterbi decoder performance. This paper presents an implementation of soft-decision demapping and a Viterbi decoder for software-defined radio (SDR). Fast simplified algorithms for soft demapping of four modulations common in satellite communications systems (BPSK, QPSK, 8-PSK and 16-APSK) were implemented. To increase software processing speed, SIMD (single instruction, multiple data) instructions were used.

Keywords: log-likelihood ratio demapper Viterbi decoder software-defined radio demodulation phase shift keying SIMD vectorization
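As a small illustration of what soft demapping produces, here is the textbook log-likelihood ratio for the simplest of the four modulations, BPSK. The channel model (real AWGN with noise variance `sigma2`, equal bit priors, bit 0 mapped to +1 and bit 1 to -1) and the name `bpsk_llr` are assumptions for this sketch; the simplified algorithms used in the paper are not given in the abstract and are not reproduced here.

```c
#include <stddef.h>

/* Textbook BPSK soft demapper under the assumptions stated above:
 *   LLR(y) = ln( P(b=0 | y) / P(b=1 | y) ) = 2 * y / sigma2
 * A positive LLR means bit 0 is more likely; the magnitude is the
 * confidence passed on to a soft-input Viterbi decoder. */
void bpsk_llr(const float *y, float sigma2, float *llr, size_t n)
{
    const float scale = 2.0f / sigma2;
    /* Unit-stride loop with no branches or cross-iteration dependencies,
     * which is the shape that maps directly onto SIMD instructions. */
    for (size_t i = 0; i < n; ++i)
        llr[i] = scale * y[i];
}
```

For received samples 0.5 and -1.0 with `sigma2 = 0.5`, the outputs are 2.0 and -4.0: a moderately confident bit 0 followed by a more confident bit 1.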


Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly


ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2014, Vol. 11, No. 4, p. 57

Authors: Luporini, Fabio; Varbanescu, Ana Lucia; Rathgeber, Florian; Bercea, Gheorghe-Teodor; Ramanujam, J.; Ham, David A.; Kelly, Paul H. J. Affiliations: Imperial Coll London, Dept Comp, London, England; Univ Amsterdam, Inst Informat, NL-1012 WX Amsterdam, Netherlands; Imperial Coll London, Dept Math, London, England; Louisiana State Univ, Ctr Computat & Technol, Baton Rouge, LA 70803, USA; Louisiana State Univ, Sch Elect Engn & Comp Sci, Baton Rouge, LA 70803, USA

We study and systematically evaluate a class of composable code transformations that improve arithmetic intensity in local assembly operations, which represent a significant fraction of the execution time in finite element methods. Their performance optimization is indeed a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions, which vary among different problems, make it hard to determine an optimal sequence of successful transformations. Our investigation has resulted in the implementation of a compiler (called COFFEE) for local assembly kernels, fully integrated with a framework for developing finite element methods. The compiler manipulates abstract syntax trees generated from a domain-specific language by introducing domain-aware optimizations for instruction-level parallelism and register locality. Eventually, it produces C code including vector SIMD intrinsics. Experiments using a range of real-world finite element problems of increasing complexity show that significant performance improvement is achieved. The generality of the approach and the applicability of the proposed code transformations to other domains are also discussed.

Keywords: Design Performance Finite element integration local assembly compilers optimizations SIMD vectorization


A Basic Linear Algebra Compiler


Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Authors: Daniele G. Spampinato; Markus Püschel. Affiliation: Dept. of Computer Science, ETH Zurich

ISBN: (print) 9781450326704

Many applications in media processing, control, graphics, and other domains require efficient small-scale linear algebra computations. However, most existing high-performance libraries for linear algebra, such as ATLAS or Intel MKL, are more geared towards large-scale problems (matrix sizes in the hundreds and larger) and towards specific interfaces (e.g., BLAS). In this paper we present LGen: a compiler for small-scale, basic linear algebra computations. The input to LGen is a fixed-size linear algebra expression; the output is a corresponding C function optionally including intrinsics to efficiently use SIMD vector extensions. LGen generates code using two levels of mathematical domain-specific languages (DSLs). The DSLs are used to perform tiling, loop fusion, and vectorization at a high level of abstraction, before the final code is generated. In addition, search is used to select among alternative generated implementations. We show benchmarks of code generated by LGen against Intel MKL and IPP as well as against alternative generators, such as the C++ template-based Eigen and the BTO compiler. The achieved speed-up is typically about a factor of two to three.

Keywords: Program synthesis DSL SIMD vectorization Tiling Small matrices Basic linear algebra


Vectorization Past Dependent Branches Through Speculation


22nd International Conference on Parallel Architectures and Compilation Techniques (PACT)

Authors: Sujon, Majedul Haque; Whaley, R. Clint; Yi, Qing. Affiliation: Univ TX San Antonio, Dept Comp Sci, San Antonio, TX 78249, USA

ISBN: (print) 9781479910212

Modern architectures increasingly rely on SIMD vectorization to improve performance for floating-point-intensive scientific applications. However, existing compiler optimization techniques for automatic vectorization are inhibited by the presence of unknown control flow surrounding partially vectorizable computations. In this paper, we present a new approach, speculative vectorization, which speculates past dependent branches to aggressively vectorize computational paths that are expected to be taken frequently at runtime, while simply restarting the calculation using scalar instructions when the speculation fails. We have integrated our technique in an iterative optimizing compiler and have employed empirical tuning to select the profitable paths for speculation. When applied to optimize 9 floating-point benchmarks, our optimizing compiler has achieved up to 6.8X speedup for single precision and 3.4X for double precision kernels using AVX, while vectorizing some operations considered not vectorizable by prior techniques.

Keywords: SIMD vectorization speculation compiler optimization iterative compilation ATLAS iFKO
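The speculate-then-fall-back idea in the abstract above can be illustrated by hand on a search for the index of the largest absolute value. This is a manual scalar sketch, not output of the compiler described in the paper; the kernel choice, the block size `BLOCK`, and the name `iamax_speculative` are assumptions. The frequent path (no element in the block beats the current maximum) is checked with a branch-free loop that a compiler can vectorize, and the block is redone with scalar code only when that speculation fails.

```c
#include <stddef.h>

#define BLOCK 8  /* hypothetical number of elements checked per speculation */

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Return the index of the first element with the largest absolute value.
 * Returns 0 for an empty array. */
size_t iamax_speculative(const float *x, size_t n)
{
    size_t imax = 0;
    float maxv = n ? absf(x[0]) : 0.0f;
    size_t i = 0;

    for (; i + BLOCK <= n; i += BLOCK) {
        /* Speculative path: assume nothing in this block exceeds maxv.
         * The check has no data-dependent branch inside the loop. */
        int miss = 0;
        for (size_t l = 0; l < BLOCK; ++l)
            miss |= absf(x[i + l]) > maxv;
        if (!miss)
            continue;                  /* speculation held, block is done */

        /* Speculation failed: restart this block with the scalar code. */
        for (size_t l = 0; l < BLOCK; ++l) {
            float a = absf(x[i + l]);
            if (a > maxv) { maxv = a; imax = i + l; }
        }
    }
    for (; i < n; ++i) {               /* scalar remainder loop */
        float a = absf(x[i]);
        if (a > maxv) { maxv = a; imax = i; }
    }
    return imax;
}
```

The scheme pays off when updates to the maximum are rare, so that most blocks take only the branch-free check.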
