The Intel® Xeon Phi coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, an innovative processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting the SIMD vector units is one of the most important factors in achieving high performance for application code running on Intel® Xeon Phi coprocessors. In this paper, we present several practical SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel® MIC-specific alignment optimization, and small matrix transpose/multiplication 2-D vectorization, implemented in the Intel® C/C++ and Fortran production compilers for Intel® Xeon Phi coprocessors. A set of workloads from several application domains is employed to conduct a performance study of our SIMD vectorization techniques. The performance results show that we achieved up to a 12.5x performance gain on the Intel® Xeon Phi coprocessor.
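For context, the alignment optimization mentioned in this abstract refers to aligning data to the coprocessor's 64-byte (512-bit) vector width so the compiler can emit aligned vector loads and stores, while "less-than-full-vector" vectorization covers loop trip counts that are not a multiple of the vector length. The sketch below is a minimal illustration of these ideas, not the paper's implementation; the function and array names are hypothetical, and the `#pragma vector aligned` hint is guarded so it is only seen by the Intel compiler.

```c
/* Minimal sketch: 64-byte-aligned data so the compiler can use aligned
 * loads/stores on a 512-bit SIMD unit. N is deliberately not a multiple
 * of the 16-float vector width, so the vectorizer must also handle the
 * "less-than-full-vector" remainder. Names here are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000

void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
#ifdef __INTEL_COMPILER
    /* Tell the Intel vectorizer both pointers are suitably aligned. */
#pragma vector aligned
#endif
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    /* aligned_alloc (C11) requires the size to be a multiple of the alignment. */
    size_t bytes = ((N * sizeof(float) + 63) / 64) * 64;
    float *x = aligned_alloc(64, bytes);
    float *y = aligned_alloc(64, bytes);
    if (!x || !y)
        return 1;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(y, x, 3.0f, N);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */

    free(x);
    free(y);
    return 0;
}
```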
ISBN:
(Print) 9781467309745
SIMD vectorization has received significant attention in the past decade as an important method to accelerate scientific, media, and embedded applications on SIMD architectures such as Intel® SSE, AVX, and IBM AltiVec. However, most of this focus has been directed at loops, executing their iterations on multiple SIMD lanes concurrently by relying on program hints and compiler analysis. This paper presents a set of new C/C++ high-level vector extensions for SIMD programming, and the Intel® C++ product compiler that is extended to translate these vector extensions and produce optimized SIMD instruction sequences for vectorized functions and loops. For a function, our main idea is to vectorize the entire function for its callers instead of just vectorizing any loops inside the function. This poses the challenges of dealing with complicated control flow in the function body and of matching callers and callees for SIMD vector calls while vectorizing caller functions (or loops) and callee functions. Our compilation methods for automatically compiling these vector extensions are described. We present performance results for several non-trivial visual computing, computational, and simulation workloads that utilize the SIMD units through the vector extensions on Intel® multicore 128-bit SIMD processors, and we show that significant SIMD speedups (3.07x to 4.69x) are achieved over serial execution.
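The function-level vectorization this abstract describes was exposed through compiler-specific vector (elemental) function extensions; a closely related form was later standardized as the OpenMP `declare simd` construct. The sketch below uses that OpenMP form as a stand-in to show the idea: a scalar-looking callee is compiled into a SIMD variant, and a vectorized caller loop invokes it once per SIMD lane. The function names are hypothetical, and this is not the paper's syntax.

```c
/* Sketch of function vectorization via OpenMP `declare simd`, a
 * standardized analogue of the vector extensions described in the
 * abstract. Compile with OpenMP SIMD support (e.g. -fopenmp-simd). */
#include <math.h>
#include <stdio.h>

/* Ask the compiler to generate a SIMD variant of this function so a
 * vectorized caller loop can call it on whole vectors of arguments. */
#pragma omp declare simd
static float blend(float a, float b)
{
    /* Control flow inside the callee is turned into masking/blending
       in the generated vector variant. */
    return (a > b) ? sqrtf(a - b) : a + b;
}

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    /* Caller loop: the compiler vectorizes the loop and matches each
       call site to the SIMD variant of blend(). */
#pragma omp simd
    for (int i = 0; i < N; i++)
        c[i] = blend(a[i], b[i]);

    printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);
    return 0;
}
```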