ISBN: (print) 9781479910212
Modern architectures increasingly rely on SIMD vectorization to improve performance for floating-point-intensive scientific applications. However, existing compiler optimization techniques for automatic vectorization are inhibited by the presence of unknown control flow surrounding partially vectorizable computations. In this paper, we present a new approach, speculative vectorization, which speculates past dependent branches to aggressively vectorize computational paths that are expected to be taken frequently at runtime, and simply restarts the calculation using scalar instructions when the speculation fails. We have integrated our technique into an iterative optimizing compiler and have employed empirical tuning to select the profitable paths for speculation. When applied to nine floating-point benchmarks, our optimizing compiler achieved speedups of up to 6.8x for single-precision and 3.4x for double-precision kernels using AVX, while vectorizing some operations considered not vectorizable by prior techniques.
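The control structure described in this abstract (run the frequent path over a block of iterations, check whether the speculation held, and restart the block with scalar code if it did not) can be illustrated with a small hand-written sketch. This is not the paper's compiler transformation: the argmax kernel, the function names, and the block width `LANES` are assumptions chosen for illustration, and the "vector" check is written as a fixed-width, branch-free loop in plain C rather than with AVX intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block width: 8 floats, the capacity of one 256-bit AVX register. */
#define LANES 8

/* Scalar reference: index of the first maximum element of x[0..n-1]. */
size_t argmax_scalar(const float *x, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (x[i] > x[best])
            best = i;
    return best;
}

/* Hypothetical hand-speculated version of the same kernel.
 * Speculated (frequent) path: no element in the block beats the current
 * maximum, so the loop-carried update of `best` never happens. */
size_t argmax_speculative(const float *x, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    size_t i = 1;

    for (; i + LANES <= n; i += LANES) {
        /* Branch-free check over a whole block; a fixed-trip-count loop
         * like this can be mapped to a SIMD compare plus a reduction. */
        int failed = 0;
        for (size_t k = 0; k < LANES; ++k)
            failed |= (x[i + k] > x[best]);

        if (failed) {
            /* Speculation failed: restart this block with scalar code. */
            for (size_t k = 0; k < LANES; ++k)
                if (x[i + k] > x[best])
                    best = i + k;
        }
    }

    /* Scalar epilogue for the remaining elements. */
    for (; i < n; ++i)
        if (x[i] > x[best])
            best = i;
    return best;
}
```

When most blocks contain no new maximum, only the cheap block-wide check runs; the scalar restart preserves the exact result of the reference loop, including the choice of the first maximum on ties.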
Multicore accelerators are used today to supplement traditional superscalar processors in massively parallel computer nodes with extra floating-point computation power. This paper presents the parallelization, performance enhancement, and evaluation of the conjugate gradient (CG) linear equation solver with enhanced matrix multiplication on the Cell Broadband Engine accelerator. The paper also compares the CG performance results on the Cell with two CG implementations on a computer with two quad-core Xeon processors, one using OpenMP and the other OpenMPI. We also report the enhancements made to the CG code, in particular to matrix multiplication and synchronization, and a performance analysis of CG for heptadiagonal matrices on single and dual Cell Broadband Engine packages with 8 and 16 synergistic processing elements and on Xeon. We further report the communication and computation time breakdowns and the floating-point operations per second ratio. Our parallel CG solver is shown to scale well with data size, grid dimensionality, and number of cores. Copyright (C) 2011 John Wiley & Sons, Ltd.
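The matrix multiplication that dominates each CG iteration for a heptadiagonal matrix can be sketched as a matrix-vector product over seven stored diagonals. The abstract does not describe the storage format or the Cell-specific code, so everything here is an assumption for illustration: the function name `hepta_matvec`, the diagonal-by-diagonal (DIA) layout, and the serial loop structure.

```c
#include <assert.h>
#include <stddef.h>

#define NDIAG 7  /* a heptadiagonal matrix has seven nonzero diagonals */

/* y = A * x for an n-by-n matrix with NDIAG diagonals (assumed DIA layout):
 * diag[d * n + i] holds A[i][i + offset[d]]; entries whose column would
 * fall outside the matrix are never read. */
void hepta_matvec(ptrdiff_t n, const ptrdiff_t offset[NDIAG],
                  const double *diag, const double *x, double *y)
{
    assert(n > 0);
    for (ptrdiff_t i = 0; i < n; ++i)
        y[i] = 0.0;

    for (int d = 0; d < NDIAG; ++d) {
        ptrdiff_t off = offset[d];
        /* Rows i for which column i + off lies in [0, n). */
        ptrdiff_t lo = off < 0 ? -off : 0;
        ptrdiff_t hi = off > 0 ? n - off : n;
        /* Unit-stride, branch-free inner loop over one diagonal. */
        for (ptrdiff_t i = lo; i < hi; ++i)
            y[i] += diag[d * n + i] * x[i + off];
    }
}
```

Storing the matrix by diagonals keeps every inner loop contiguous in `diag`, `x`, and `y`, which is the kind of access pattern that suits both SIMD units and explicit data transfers; whether the paper uses this layout is not stated in the abstract.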
Emerging video-mining applications such as image and video retrieval and indexing will require real-time processing capabilities. A many-core architecture with 64 small, in-order, general-purpose cores as the accelerator can help meet the necessary performance goals and requirements. The key video-mining modules can achieve parallel speedups of 19x to 62x from 64 cores and get an extra 2.3x speedup from 128-bit SIMD vectorization on the proposed architecture.