Nowadays the use of hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, is key to solving computationally costly problems that require high performance computing. However, programming solutions for efficient deployment on these kinds of devices is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to study in depth the particular data that need to be computed at each moment, across different computing platforms, while also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage the communications and kernel launching details on hardware accelerators in a transparent way. This model also provides the possibility of defining and launching central processing unit kernels on multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model allows the programmer to simplify the selection of values for the several configuration parameters that must be set when a kernel is launched. This is done through a qualitative characterization process of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in development and porting costs, with significantly low overheads in execution times when compared to manually programmed and optimized solutions that directly use CUDA and OpenMP.
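The controller idea can be sketched very loosely as follows. This is a hypothetical Python illustration of the abstraction only; the actual library targets CUDA and OpenMP from C, and every name here (`Controller`, `register`, `launch`) is invented for this sketch, not the paper's API.

```python
# Hypothetical sketch: the programmer registers a kernel once and the
# controller hides device selection, data transfers, and launch details.
# All identifiers are illustrative, not the prototype library's API.

class Controller:
    def __init__(self, device="cpu"):
        self.device = device          # e.g. "cpu" (OpenMP) or "gpu" (CUDA)
        self.kernels = {}

    def register(self, name, fn):
        # Same abstraction and methodology for CPU and GPU kernels.
        self.kernels[name] = fn

    def launch(self, name, data, **config):
        # A real controller would move `data` to the device and choose
        # launch configuration parameters here; this sketch just calls
        # the kernel function directly.
        return self.kernels[name](data, **config)

ctrl = Controller(device="cpu")
ctrl.register("saxpy", lambda xs, a=2.0: [a * x for x in xs])
result = ctrl.launch("saxpy", [1.0, 2.0, 3.0], a=2.0)
```

The point of the design is that switching `device` should not change the calling code, only what the controller does internally.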
Distributed memory programming, typically through the MPI library, is the de facto standard for programming large-scale parallelism, with up to millions of individual processes. Its dominant paradigm of Single Program Multiple Data (SPMD) programming differs from threaded and multicore parallelism to the extent that students have a hard time switching models. In contrast to threaded programming, which allows a view of the execution with central control and a central repository of data, SPMD programming has a symmetric model where all processes are active all the time, none is privileged, and data is distributed. This model is counterintuitive to the novice parallel programmer, so care needs to be taken in how to instill the proper 'mental model'. Adoption of an incorrect mental model leads to broken or inefficient code. We identify problems with the currently common way of teaching MPI, and propose a structuring of MPI courses that is geared to explicitly reinforcing the symmetric model. Additionally, we advocate starting from realistic scenarios, rather than writing artificial code just to exercise newly learned routines. (C) 2018 Published by Elsevier Inc.
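The symmetric SPMD model described above can be illustrated with a small sketch. A real program would use MPI (e.g. mpi4py in Python); here the ranks are simulated in plain Python so the structure is visible: every rank runs the same function on its own slice of the data, and no process is central.

```python
# Minimal illustration of the symmetric SPMD model, with ranks emulated
# in plain Python (a real code would use MPI; names are illustrative).

def spmd_main(rank, size, n=16):
    # Every rank executes this same function. Data is distributed:
    # each rank owns only its slice of the index space [0, n).
    lo = rank * n // size
    hi = (rank + 1) * n // size
    return sum(i * i for i in range(lo, hi))

# An "allreduce": every rank contributes a partial result and every
# rank receives the total; there is no privileged master process.
size = 4
partials = [spmd_main(r, size) for r in range(size)]
total = sum(partials)
```

Note what is absent: no central loop handing out work and no central array holding all the data, which is exactly the mental-model shift the abstract argues courses should reinforce.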
The paper presents the design, implementation and tuning of a hybrid parallel OpenMP+CUDA code for computing the similarity between pairs of a large number of multidimensional vectors. The problem has a wide range of applications, and consequently its optimization is of high importance, especially on the currently widespread hybrid CPU+GPU systems targeted in the paper. The following are presented and tested for the computation over all vector pairs: tuning of a GPU kernel with consideration of memory coalescing and use of shared memory, minimization of GPU memory allocation costs, optimization of CPU-GPU communication in terms of the size of data sent, overlapping of CPU-GPU communication and kernel execution, concurrent kernel execution, and determination of the best sizes for data batches processed on CPUs and GPUs along with the best GPU grid sizes. It is shown that all codes scale in hybrid environments with various relative performances of compute devices, even when comparisons of different vector pairs take different amounts of time. Tests were performed on two high-performance hybrid systems: 2 x Intel Xeon E5-2640 CPUs + 2 x NVIDIA Tesla K20m cards, and 2 x latest-generation Intel Xeon E5-2620 v4 CPUs + an NVIDIA Pascal-generation GTX 1070 card. Results demonstrate the expected improvements and the optimizations beneficial for users incorporating such computations into their parallel codes run on similar systems.
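The core computation being accelerated is all-pairs similarity over a set of vectors. A CPU-only sketch of that computation, with the batching that a hybrid code would use to hand chunks of rows to CPU threads or GPU kernels, might look like this (cosine similarity and the batch size are illustrative choices, not the paper's tuned values):

```python
# All-pairs cosine similarity over a list of vectors, processed in
# batches of rows. In a hybrid OpenMP+CUDA code each batch would be
# dispatched to a CPU thread team or a GPU kernel; here it is serial.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def all_pairs_similarity(vectors, batch=2):
    n = len(vectors)
    sims = {}
    for start in range(0, n, batch):        # one batch = one work unit
        for i in range(start, min(start + batch, n)):
            for j in range(i + 1, n):       # each unordered pair once
                sims[(i, j)] = cosine(vectors[i], vectors[j])
    return sims

vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sims = all_pairs_similarity(vecs)
```

The batch size is exactly the kind of parameter the paper tunes: large batches amortize GPU launch and transfer costs, while small batches balance load when pair comparisons take different amounts of time.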
Accessibility is an important issue in transport geography, land planning, and many other related fields. Accessibility problems become computationally demanding when they involve high-resolution requirements. Using conventional methods, providing high-resolution accessibility analysis for real-time decision support remains a challenge. In this paper, we present a parallel processing model, named HiAccess, to solve high-resolution accessibility analysis problems in real time. One feature of HiAccess is a fast road network construction method, in which the road network topology is determined by traversing the original road nodes only once. The parallel strategies of HiAccess are fully optimized, with few repeated computations. Moreover, a simple, efficient, and highly effective map generalization method is proposed to reduce the computational load without loss of accuracy. The flexibility of HiAccess enables it to work well when applied to different accessibility analysis models. To further demonstrate the applicability of HiAccess, a case study of settlement site selection for poverty alleviation in Xiangxi, Central China, is carried out. The accessibility of jobs, health care, educational resources, and other public facilities is comprehensively analyzed for settlement site selection. HiAccess demonstrates striking performance, measuring the high-resolution (100 m x 100 m grid) accessibility of a city (in total over 250k grids, roads with 232k segments, and 40 facilities) in one second without preprocessing, while ArcGIS takes nearly an hour to achieve a less satisfactory result. In additional experiments, HiAccess is tested on much larger data sets with excellent performance.
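The two core steps the abstract names can be sketched in miniature: build the road-network topology in a single pass over the segments, then measure accessibility as shortest travel time from a facility. The sketch uses Dijkstra's algorithm and a three-node toy network; HiAccess's actual data structures, parallel strategies, and analysis models are not reproduced here.

```python
# Sketch: one-pass road-network construction, then accessibility as
# shortest travel time from a facility (Dijkstra). Data is illustrative.
import heapq
from collections import defaultdict

segments = [  # (node_a, node_b, travel_time)
    ("A", "B", 2.0), ("B", "C", 3.0), ("A", "C", 10.0),
]

# Single pass: each road segment is visited exactly once to build
# the adjacency lists that define the network topology.
adj = defaultdict(list)
for a, b, t in segments:
    adj[a].append((b, t))
    adj[b].append((a, t))

def accessibility(source):
    """Travel time from `source` (a facility) to every reachable node."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, t in adj[u]:
            nd = d + t
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

dist = accessibility("A")  # accessibility of a facility located at node A
```

In a full analysis this search runs from each of the facilities, and the per-grid results are combined into the accessibility surface; those searches are independent, which is what makes the problem parallelize well.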
Deployed through skeleton frameworks, structured parallelism yields a clear and consistent structure across platforms by distinctly decoupling computations from the structure of a parallel programme. Structured programming is a viable and effective means of providing the separation of concerns, as it subdivides a system into building blocks (modules, skids or components) that can be independently created and then used in different systems to drive multiple functionalities. Depending on its defined semantics, each building block wraps a unit of computing function, and the valid assembly of these building blocks forms a high-level structural parallel programming model. This paper proposes a grammar for building block components to execute computational functions on heterogeneous multi-core architectures. The grammar is validated against three different families of computing models: skeleton-based, general purpose, and domain-specific. In conjunction with the protocol, the grammar produces fully instrumented code for an application suite using the skeletal framework FastFlow.
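The "valid assembly of building blocks" idea can be illustrated with a toy composition sketch. The combinator names `seq` and `pipe` follow common algorithmic-skeleton terminology (as in frameworks like FastFlow) and are not the paper's actual grammar.

```python
# Toy sketch of skeleton-style composition: each building block wraps a
# unit of computing function, and valid assemblies form a program.
# Combinator names are generic skeleton vocabulary, not the paper's API.

def seq(f):
    # Wrap a plain function as a building block operating on a stream.
    return lambda stream: [f(x) for x in stream]

def pipe(*stages):
    # Valid assembly: the output stream of one stage feeds the next.
    def run(stream):
        for stage in stages:
            stream = stage(stream)
        return stream
    return run

# A two-stage pipeline: square each item, then increment it.
program = pipe(seq(lambda x: x * x), seq(lambda x: x + 1))
out = program([1, 2, 3])
```

The separation of concerns is visible even at this scale: the computations (the two lambdas) carry no knowledge of the structure (`pipe`), so either can be replaced independently.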
This paper presents an overview of the past, present and future of the OpenMP application programming interface (API). While the API originally specified a small set of directives that guided shared memory fork-join parallelization of loops and program sections, OpenMP now provides a richer set of directives that capture a wide range of parallelization strategies that are not strictly limited to shared memory. As we look toward the future of OpenMP, we immediately see further evolution of the support for that range of parallelization strategies and the addition of direct support for debugging and performance analysis tools. Looking beyond the next major release of the specification of the OpenMP API, we expect the specification eventually to include support for more parallelization strategies and to embrace closer integration into its Fortran, C and, in particular, C++ base languages, which will likely require the API to adopt additional programming abstractions.
We present the most recent release of our parallel implementation of the BFS and BC algorithms for the study of large-scale graphs. Although our reference platform is a high-end cluster of new-generation Nvidia GPUs and some of our optimizations are CUDA-specific, most of our ideas can be applied to other platforms offering multiple levels of parallelism. We exploit multi-level parallel processing through a hybrid programming paradigm that combines highly tuned CUDA kernels, for the computations performed by each node, with explicit data exchange through the Message Passing Interface (MPI), for the communications among nodes. The results of the numerical experiments show that the performance of our code is comparable to or better than other state-of-the-art solutions. For the BFS, for instance, we reach a peak performance of 200 GTEPS on a single GPU and 5.5 TTEPS on 1024 Pascal GPUs. We release our source codes both for reproducing the results and for facilitating their use as a building block for the implementation of other algorithms.
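The BFS variant that GPU and multi-GPU implementations typically parallelize is the level-synchronous, frontier-based form: each iteration expands the whole current frontier at once. A sequential sketch of that structure (with an illustrative five-vertex graph, not the paper's benchmark inputs):

```python
# Sequential sketch of level-synchronous (frontier-based) BFS, the form
# parallelized on GPUs. The graph below is a small illustrative example.
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

def bfs_levels(source):
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        # On a GPU, every vertex in `frontier` is expanded in parallel;
        # on a cluster, frontier pieces are exchanged between nodes
        # via MPI before the next iteration starts.
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:       # first visit fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

levels = bfs_levels(0)
```

The TEPS figures quoted in the abstract count traversed edges per second over exactly this kind of traversal, which is why frontier expansion and the inter-node frontier exchange are the optimization targets.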
A novel parallel technique that couples the lattice-Boltzmann method and a finite volume scheme for the prediction of concentration polarisation and pore blocking in an axisymmetric cross-flow membrane separation process is presented. The model uses the lattice-Boltzmann method to solve the incompressible Navier-Stokes equations for the hydrodynamics and the finite volume method to solve the convection-diffusion equation for the solute particles. Concentration polarisation is modelled for micro-particles by defining the diffusion coefficient as a function of particle concentration and shear rate. The model considers the effect of incompressible cake formation. The pore blocking phenomenon in filtration membrane fouling is predicted using the rate of particles arriving at the membrane surface. The simulation code is parallelised in two ways: Compute Unified Device Architecture (CUDA) is used for a cluster of graphics processing units (GPUs), and the Message Passing Interface (MPI) is utilised for a cluster of central processing units (CPUs), with various parallelisation techniques to optimise memory usage for higher performance. The proposed model is validated by comparison to analytical solutions and experimental results.
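The finite-volume half of the coupling solves a convection-diffusion equation for the solute concentration. A minimal one-dimensional sketch of one explicit finite-volume update follows; the paper's model is axisymmetric, takes its velocity field from the lattice-Boltzmann solver, and uses a concentration- and shear-dependent diffusion coefficient, none of which is reproduced here. The scheme below (first-order upwind convection, central diffusion, zero-gradient boundaries) and all values are illustrative.

```python
# Minimal 1-D explicit finite-volume step for dc/dt + u dc/dx = D d2c/dx2,
# written in flux form: c_i^{n+1} = c_i - dt/dx * (F_{i+1/2} - F_{i-1/2}).
# Illustrative only: first-order upwind convection (assumes u > 0),
# central diffusion, zero-gradient boundary cells.

def fvm_step(c, u, D, dx, dt):
    n = len(c)
    new = c[:]
    for i in range(n):
        cl = c[i - 1] if i > 0 else c[i]      # zero-gradient at boundaries
        cr = c[i + 1] if i < n - 1 else c[i]
        # Upwind convective fluxes through the left and right faces.
        f_left = u * cl
        f_right = u * c[i]
        # Central diffusive fluxes through the same faces.
        d_left = -D * (c[i] - cl) / dx
        d_right = -D * (cr - c[i]) / dx
        new[i] = c[i] - dt / dx * ((f_right + d_right) - (f_left + d_left))
    return new

c = [1.0, 0.0, 0.0, 0.0]                      # concentration pulse
c1 = fvm_step(c, u=1.0, D=0.1, dx=1.0, dt=0.1)
```

In the coupled code this update runs on the same grid partitioning as the lattice-Boltzmann step, which is what makes the GPU/MPI decomposition shared between the two solvers.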
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, the jobs in these frameworks are roughly defined and packaged as executable jars without any functionality being exposed or described. This means that deployed jobs are not natively composable and reusable for subsequent development. It also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the Hierarchically Distributed Data Matrix (HDM), which is a functional, strongly-typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution, integration and management of HDM applications on distributed infrastructures. Based on the functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that our optimizations can achieve improvements of 10 to 40 percent in job completion time for different types of applications when compared with the current state of the art, Apache Spark.
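The idea of a composable, lazily evaluated data representation whose dependency graph can be optimized before execution can be shown with a toy sketch. The class and method names are invented for illustration (HDM itself is a strongly-typed Scala/JVM framework); the optimization shown, fusing adjacent map operations into one traversal, is one representative of the kind of data-flow rewriting the abstract refers to.

```python
# Toy sketch: operations are recorded, not executed; compute() first
# optimizes the recorded graph (fusing adjacent maps), then runs it.
# Names are illustrative, not HDM's actual API.

class DataMatrix:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = list(ops)          # the functional dependency chain

    def map(self, f):
        return DataMatrix(self.data, self.ops + [("map", f)])

    def filter(self, p):
        return DataMatrix(self.data, self.ops + [("filter", p)])

    def compute(self):
        # Optimization pass: fuse runs of adjacent maps into a single
        # composed function, saving intermediate traversals.
        fused = []
        for kind, f in self.ops:
            if kind == "map" and fused and fused[-1][0] == "map":
                g = fused[-1][1]
                fused[-1] = ("map", lambda x, g=g, f=f: f(g(x)))
            else:
                fused.append((kind, f))
        out = self.data
        for kind, f in fused:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

job = (DataMatrix([1, 2, 3, 4])
       .map(lambda x: x + 1)
       .map(lambda x: x * 2)
       .filter(lambda x: x > 5))
result = job.compute()
```

Because each operation returns a new `DataMatrix` describing the job rather than running it, deployed jobs stay composable, which is precisely the property the abstract says packaged executable jars lack.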
ISBN (digital): 9781510628380
ISBN (print): 9781510628380
We introduce a generic analytic simulation and image reconstruction software platform for multi-pinhole (MPH) SPECT systems. The platform is capable of modeling common or sophisticated MPH designs as well as complex data acquisition schemes. Graphics processing unit (GPU) acceleration was utilized to achieve high-performance computing. Herein, we describe the software platform and provide verification studies of the simulation and image reconstruction software.