ISBN: 9781665451857 (print)
Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To make the study fair, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered, which include Fortran DO CONCURRENT, as well as two variants of OpenACC, four variants of OpenMP (2 CPU and 2 GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible, and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
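As a rough illustration of what a memory bandwidth benchmark of this kind measures, the following Python sketch times a STREAM-style triad kernel and counts the bytes it moves. It is not the BabelStream Fortran code: the names `triad` and `measure_triad`, the array size, and the initial values are assumptions chosen for the example, and interpreter overhead dominates the timing, so only the accounting is meaningful.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def measure_triad(n=100_000, scalar=0.4, repeats=5):
    """Time the triad kernel and report the bytes it moves.

    All sizes and initial values are illustrative assumptions.
    """
    a = array("d", [0.0]) * n  # double-precision storage
    b = array("d", [0.2]) * n
    c = array("d", [0.1]) * n

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, scalar)
        best = min(best, time.perf_counter() - start)

    # Two arrays are read and one is written per element.
    bytes_moved = 3 * n * a.itemsize
    gb_per_s = bytes_moved / best / 1e9 if best > 0 else float("inf")
    return {"a": a, "seconds": best, "bytes": bytes_moved, "gb_per_s": gb_per_s}
```

Taking the best of several repeats, rather than the mean, is a common choice in bandwidth benchmarks because it filters out one-off disturbances such as cold caches.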
Using multiprocessor and multicore computers implies parallel programming. Tools for preparing parallel programs include parallel languages and libraries as well as parallelizing compilers and converters that can perform automatic parallelization. The basic approach to parallelism detection is the analysis of data dependencies and of the properties of program components, including data use and predicates. In this article, a suite of used-data and predicate sets for program components is proposed, and an algorithm for computing these sets is suggested. The algorithm is based on wave propagation on graphs with cycles and on labelling. This method allows complex program components to be analysed, improving data localization and thus enhancing the detection of data parallelism.
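The abstract does not spell out the algorithm, so the following Python sketch only illustrates the general idea of propagating sets over a graph with cycles until nothing changes. It assumes a simplified model in which the data used by a component is its own data plus the data used by every component it invokes; the function name `propagate_used_data` and the input format are hypothetical.

```python
from collections import deque


def propagate_used_data(calls, local_use):
    """Fixed-point propagation of used-data sets over a graph with cycles.

    calls:     component -> list of components it invokes (cycles allowed);
               every invoked component must also appear as a key.
    local_use: component -> set of data items it uses directly.
    """
    used = {n: set(local_use.get(n, ())) for n in calls}

    # Reverse edges: a change in a callee must reach its callers.
    callers = {n: [] for n in calls}
    for n, callees in calls.items():
        for m in callees:
            callers[m].append(n)

    work = deque(calls)
    while work:
        n = work.popleft()
        for caller in callers[n]:
            before = len(used[caller])
            used[caller] |= used[n]
            if len(used[caller]) != before:
                work.append(caller)  # the set grew, so propagate again
    return used
```

The sets only ever grow and are bounded by the total number of data items, so the loop terminates even when components invoke each other recursively.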
For many computational problems, memory throughput is a performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features such as cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or by changing the order of data structure traversal. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can easily be optimized for memory performance and can employ semi-automated parallelization or autotuning without altering their internal code. We also add an abstraction for autotuning that allows tuning parameters to be defined in one place and removes boilerplate code. The proposed solution was implemented as an extension of the Noarr library, which simplifies the layout-agnostic design of regular data structures. It is implemented entirely using C++ template metaprogramming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC and Intel DPC++. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.
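To make the notion of a layout- and traversal-agnostic algorithm concrete, here is a minimal Python sketch. It does not use or imitate the Noarr API; the helpers `row_major`, `col_major`, `traverse`, `pack` and `row_sums` are hypothetical names introduced for the example. The algorithm is written once against an index function and an iteration order, both of which can be swapped without touching its body.

```python
def row_major(rows, cols):
    """Map logical (i, j) to a flat offset, rows stored contiguously."""
    return lambda i, j: i * cols + j


def col_major(rows, cols):
    """Map logical (i, j) to a flat offset, columns stored contiguously."""
    return lambda i, j: j * rows + i


def traverse(rows, cols, order="ij"):
    """Yield all (i, j) pairs; `order` selects which index is outermost."""
    if order == "ij":
        for i in range(rows):
            for j in range(cols):
                yield i, j
    elif order == "ji":
        for j in range(cols):
            for i in range(rows):
                yield i, j
    else:
        raise ValueError("order must be 'ij' or 'ji'")


def pack(matrix, index):
    """Store a nested-list matrix into a flat buffer using `index`."""
    rows, cols = len(matrix), len(matrix[0])
    buffer = [0] * (rows * cols)
    for i, j in traverse(rows, cols):
        buffer[index(i, j)] = matrix[i][j]
    return buffer


def row_sums(buffer, index, traversal, rows):
    """The algorithm itself: unaware of layout and traversal order."""
    sums = [0] * rows
    for i, j in traversal:
        sums[i] += buffer[index(i, j)]
    return sums
```

Because `row_sums` never computes an offset or a loop nest itself, a tuner could try each layout and order combination and keep the fastest one, which is the kind of separation the abstract describes.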
ISBN: 9781450391832 (print)
Accelerated computing has increased the need to specialize how a program is parallelized depending on the target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism and sometimes more levels of parallelism than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that using the new OpenMP loop directive with an appropriate compiler decision process can achieve the same performance benefits as target-specific parallelization with the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. It yields up to 56x speedup and an average of 1.91x-1.79x speedup compared to the baseline performance (depending on the host system) on GPUs, and preserves CPU performance. In addition, our proposal requires 60% fewer parallelism directives.
Heterogeneous architectures are increasingly common in modern High-Performance Computing (HPC) systems. Achieving high performance on such heterogeneous systems requires new approaches to application development that are able to achieve the three Ps: Performance, Portability, and Productivity. In this paper, we provide an overview of the state of the art for developing high-performance, portable and productive multi-physics applications, with particular focus on the simulation of a plasma fusion reactor. Simulating such a complex system relies on both fluid- and particle-based simulations, and on coupling interfaces between these two domains. We also review the current state of the art in reasoning about the performance, portability and productivity of HPC applications.
The Controller model is a heterogeneous parallel programming model implemented as a library. It transparently manages the coordination, communication and kernel launching details on different heterogeneous computing devices. It exploits native or vendor-specific programming models and compilers, such as OpenMP, CUDA or OpenCL, thus enabling the potential performance obtained by using them. This work discusses the integration of FPGAs in the Controller model, using high-level synthesis tools and OpenCL. A new Controller backend for FPGAs is presented, based on a previous OpenCL backend for GPUs. We discuss new configuration parameters for FPGA kernels and key ideas to adapt the original OpenCL backend while maintaining the portability of the original model. We present an experimental study to compare performance and development effort metrics obtained with the Controller model, Intel oneAPI and reference codes directly programmed with OpenCL. The results show that using the Controller library has advantages and drawbacks compared with Intel oneAPI, while compared with OpenCL it greatly reduces the programming effort with negligible performance overhead.
ISBN: 9781665462020 (print)
With the advent of renewable energy, smart grids, and cutting-edge measurement technologies, modern power systems are becoming more complex. As a result, analyzing modern power systems requires more computational power. High Performance Computing (HPC) is the most viable option for meeting this demand. In India's power sector, the use of HPC is minimal. Hence, we introduce HPC-based power flow analysis, a widely used application for analyzing the system. The paper demonstrates the importance of HPC for power flow analysis. It also discusses a modified Gaussian elimination method that exploits the sparse nature of the Jacobian matrix to speed up the computation. Open Multi-Processing (OpenMP) is used to implement parallel computing. Parallel power flow analysis is simulated on the Central Processing Unit (CPU) and Graphics Processing Unit (GPU) nodes of C-DAC's PARAM Utkarsh supercomputer for various power system networks. The speedup obtained with HPC for the Polish 9241-bus network is 216.14 times over the sequential computation.
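The abstract does not describe how its Gaussian elimination is modified, so the sketch below only shows the generic way sparsity can reduce work: rows are stored as dictionaries of nonzero entries, and rows with nothing in the pivot column are skipped. It is a sequential Python illustration under those assumptions, not the paper's OpenMP implementation; the name `solve_sparse` and the row format are hypothetical.

```python
def solve_sparse(rows, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    rows: list of dicts, rows[i] = {column: value} holding nonzeros of row i.
    b:    right-hand side as a list of floats.
    """
    n = len(rows)
    rows = [dict(r) for r in rows]  # work on copies
    b = list(b)

    for k in range(n):
        # Partial pivoting: largest magnitude in column k at or below row k.
        p = max(range(k, n), key=lambda r: abs(rows[r].get(k, 0.0)))
        if rows[p].get(k, 0.0) == 0.0:
            raise ValueError("matrix is singular")
        rows[k], rows[p] = rows[p], rows[k]
        b[k], b[p] = b[p], b[k]

        pivot = rows[k][k]
        for r in range(k + 1, n):
            entry = rows[r].pop(k, 0.0)
            if entry == 0.0:
                continue  # sparsity: nothing to eliminate in this row
            factor = entry / pivot
            # Only the nonzeros of the pivot row cause updates.
            for col, val in rows[k].items():
                if col > k:
                    rows[r][col] = rows[r].get(col, 0.0) - factor * val
            b[r] -= factor * b[k]

    # Back substitution over the remaining upper-triangular nonzeros.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(val * x[col] for col, val in rows[k].items() if col > k)
        x[k] = (b[k] - s) / rows[k][k]
    return x
```

For a matrix with few nonzeros per row, both the elimination and the back substitution touch only stored entries, which is where the saving over a dense sweep comes from.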
ISBN: 9781665464956 (print)
Visual impairments are a global health issue with profound socioeconomic ramifications in both the developing and the developed world. There are ongoing research projects that aim to investigate the influence of light on the perception of low-vision individuals. But as of today, there is neither clear knowledge nor extensive data regarding the influence of light in low-vision situations. This research addresses these issues by introducing a methodology and a system to simulate visual impairments. A pipeline based on eye anatomy, coupled with real-time image processing algorithms, allows the dynamic simulation of low-vision-specific characteristics of selected impairments in mixed reality. A new approach based on massively parallelized processing, combined with efficient modeling of eye refractive errors, aims to improve the accuracy of the low-vision simulation.
Computational fluid dynamics (CFD) has emerged as a very important scientific and engineering research tool of the 21st century. At its core is the use of numerical methods and data structures to represent and predict the 'real-world' physics of a given fluid flow field. This is accomplished by applying the numerical form of the Navier-Stokes equations on modern computer platforms. The use and significance of CFD as a research and development tool has gained momentum in the field of Aerospace Engineering. CFD has been and can be used to understand fluid behavior over a wide range of flow conditions, ranging from simple to extreme. While it may be feasible to conduct simple fluid flow experiments to understand fluid flow fields, experiments designed to understand complex fluid fields are less feasible, difficult to set up and often very costly. Computational fluid dynamics has empowered today's scientists and engineers with the ability to represent the complex air conditions at high altitude, which is difficult to achieve with a physical experiment. Although there have been significant developments in CFD methods, there still remain several challenges. Among these is the fact that with current CFD methods it is difficult to predict transition to turbulence. The purpose of this research effort is to improve both the efficiency and accuracy of CFD tools. This will be accomplished by focusing on developing a robust and accurate numerical scheme that is capable of solving the Navier-Stokes equations under a wide variety of fluid flow fields. A well-established scheme, initially described and referred to as the Integro-Differential Scheme (IDS), is developed based on a unique combination of the differential and integral forms of the complete Navier-Stokes equations. In the IDS, the integral form of the Navier-Stokes equations is applied, based on assumptions, and used for explicit time marching. The IDS procedure confirms its predictive capability and supports its potential
ISBN: 9781728175683 (digital), 9781728175683 (print)
Parallel programming models (e.g., OpenMP) are increasingly used to improve the performance of real-time applications on modern processors. Nevertheless, these processors have complex architectures, which makes their timing behavior very difficult to understand. The main limitation of most existing works is that they apply static timing analysis to simpler models, or measurement-based analysis using traditional platforms (e.g., single core), or consider only sequential algorithms. How to provide an efficient configuration for the allocation of the parallel program to the computing units of the processor is still an open challenge. This paper studies the problem of performing timing analysis on complex multi-core platforms, pointing out a methodology to understand the applications' timing behavior and to guide the configuration of the platform. As an example, the paper uses an OpenMP-based program of the Heat benchmark on an NVIDIA Jetson AGX Xavier. The main objectives are to analyze the execution time of OpenMP tasks, specify the best configuration of OpenMP directives, identify critical tasks, and discuss the predictability of the system/application. A Linux perf-based measurement tool, which has been extended by our team, is applied to measure each task across multiple executions in terms of total CPU cycles, the number of cache accesses, and the number of cache misses at different cache levels, including L1, L2 and L3. The evaluation uses the performance metrics measured by our tool to study the predictability of the system/application.