ISBN: (Print) 9783319174730; 9783319174723
General-purpose graphics processing units (GPGPUs) can be used effectively to enhance the performance of many contemporary scientific applications. However, programming GPUs using machine-specific notations like CUDA or OpenCL can be complex and time consuming. In addition, the resulting programs are typically fine-tuned for a particular target device. A promising alternative is to program in a conventional, machine-independent notation extended with directives and to use compilers to generate GPU code automatically. Such compilers enable portability and increase programmer productivity and, if effective, impose little penalty on performance. This paper evaluates two such compilers, PGI and Cray. We first identify a collection of standard transformations that these compilers can apply. Then, we propose a sequence of manual transformations that programmers can apply to enable the generation of efficient GPU kernels. Lastly, using the Rodinia benchmark suite, we compare the performance of the code generated by the PGI and Cray compilers with that of hand-written CUDA code. Our evaluation shows that the code produced by the PGI and Cray compilers can perform well: for 6 of the 15 benchmarks that we evaluated, the compiler-generated code achieved over 85% of the performance of a hand-tuned CUDA version.
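To illustrate the "conventional and machine-independent notation extended with directives" that this abstract contrasts with CUDA, the following is a minimal sketch only: the abstract does not name the directive set, so an OpenACC-style pragma is assumed here (both the PGI and Cray compilers accept OpenACC), and the SAXPY function and variable names are illustrative rather than taken from the paper.

// Minimal sketch (not from the paper): a SAXPY loop in plain C++
// annotated with an OpenACC-style directive. On a compiler without
// OpenACC support the pragma is ignored and the loop runs on the CPU.
#include <cstdio>
#include <vector>

void saxpy(int n, float a, const float* x, float* y) {
    // Ask the compiler to generate a GPU kernel for this loop and to
    // manage the host/device transfers described by the data clauses.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    saxpy(n, 2.0f, x.data(), y.data());
    std::printf("y[0] = %f\n", y[0]);  // expect 4.0
    return 0;
}

The same source compiles unchanged for the CPU or, with a directive-aware compiler, for the GPU, which is the portability and productivity argument the paper evaluates.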
ISBN: (Print) 9781479960491
The range-based loop is a powerful construct due to its clear and concise syntax. By abstracting away the loop index, a range-based loop implies loop-level parallelism that is ready to be exploited. Despite this advantage in hidden parallelism and programmability, the magnitude of the performance gain from accelerating range-based loops on heterogeneous systems is still not well studied. This paper addresses this issue and makes three contributions. First, it presents a review showing the magnitude of the performance gain from CUDA/OpenCL code generated by ten existing auto-parallelizing compilers. Second, it reports a performance comparison between the acceleration of range-based and traditional loops on four workloads from the SHOC benchmark suite. Third, it discusses the performance limitations of using a directive-based compiler to accelerate range-based loops. The results show that transforming scientific workloads to exploit range-based loops is a challenge. The review shows that code generated by auto-parallelizing compilers achieved an average speedup of 37 +/- 23x over sequential CPU code, while the proposed range-based compiler achieved a higher speedup than this average (44.8 +/- 22x). The evaluation against four workloads from the highly tuned benchmark suite shows that range-based loop acceleration achieved on average 72% of the benchmark's performance. This highlights range-based loops as a promising target for auto-parallelizing compilers on heterogeneous systems.
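As an illustration of the loop-index abstraction this abstract refers to, the following is a minimal C++ sketch contrasting a traditional indexed loop with a range-based loop over the same data; it is not the paper's compiler or benchmark code, and the names are illustrative only.

// Minimal sketch (not from the paper): the two loop forms the study
// compares, applied to the same element-wise update.
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> data(1 << 20, 1.0f);

    // Traditional indexed loop: the index and its bounds are explicit,
    // and an auto-parallelizing compiler must reason about them.
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2.0f + 1.0f;
    }

    // Range-based loop: the index is abstracted away and each iteration
    // touches exactly one element, which is the hidden loop-level
    // parallelism the paper proposes to exploit on heterogeneous systems.
    for (float& v : data) {
        v = v * 2.0f + 1.0f;
    }

    std::printf("data[0] = %f\n", data[0]);  // 1*2+1 = 3, then 3*2+1 = 7
    return 0;
}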