ISBN (print): 9781509036837
Although the OpenMP 4.0 standard has been available since 2013, support for GPUs has been absent until very recently, with only a handful of experimental compilers available. In this work we evaluate the performance of Cray's new NVIDIA GPU-targeting implementation of OpenMP 4.0 with the mini-apps TeaLeaf, CloverLeaf and BUDE. We successfully port each of the applications, using a simple and consistent design throughout, and achieve performance on an NVIDIA K20X that is comparable to Cray's OpenACC in all cases. BUDE, a compute-bound code, required 2.2x the runtime of an equivalently optimised CUDA code, which we believe is caused by an inflated frequency of control flow operations and less efficient arithmetic optimisation. Impressively, both TeaLeaf and CloverLeaf, which are memory-bandwidth-bound codes, required only 1.3x the runtime of hand-optimised CUDA implementations. Overall, we find that OpenMP 4.0 is a highly usable open standard capable of performant heterogeneous execution, making it a promising option for scientific application developers.
The Bellman operator constitutes the foundation of dynamic programming (DP). An alternative is the Gauss-Seidel operator, whose evaluation, unlike that of the Bellman operator, in which all states are processed at once, updates one state at a time while incorporating the interim results into the computation. The provably better convergence rate of DP methods based on the Gauss-Seidel operator comes at the price of an inherent sequentiality, which prevents the exploitation of modern multicore systems. In this work, we propose a new operator for DP, namely the mini-batch Bellman operator, which aims to balance the better convergence rate of methods based on the Gauss-Seidel operator against the parallelization capability offered by the Bellman operator. After introducing the new operator, we conduct a theoretical analysis that validates its fundamental properties. These properties allow one to deploy the new operator in the main DP schemes, such as value iteration and modified policy iteration. We compare the convergence of the DP algorithm based on the new operator with its earlier counterparts, shedding light on the algorithmic advantages of the new formulation and on the impact of the batch-size parameter on convergence. Finally, we conduct an extensive numerical evaluation of the newly introduced operator. In accordance with the theoretical derivations, the numerical results show the competitive performance of the proposed operator and its superior flexibility, which allows one to adapt the efficiency of its iterations to different structures of MDPs and hardware setups.
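One way to read the mini-batch idea in the abstract above is sketched below in Python. It is an illustration under assumptions, not the paper's definition: the function names, the contiguous partition of states into batches, and the toy three-state MDP are all hypothetical. Within a batch, every state is updated from the same snapshot of the value vector, so the batch could be processed in parallel; the batch's results are written back before the next batch starts. With `batch_size` equal to the number of states, the sweep reduces to the ordinary Bellman update, and with `batch_size=1` it reduces to a Gauss-Seidel sweep.

```python
def minibatch_sweep(V, P, R, gamma, batch_size):
    """One sweep over all states, processed in contiguous batches (assumed partition)."""
    n = len(V)
    V = list(V)  # work on a copy so the caller's vector is untouched
    for start in range(0, n, batch_size):
        batch = range(start, min(start + batch_size, n))
        # Every state in the batch reads the same snapshot of V,
        # so these updates are independent of one another.
        updated = [
            max(
                R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                for a in range(len(R[s]))
            )
            for s in batch
        ]
        # Write the interim results back before the next batch starts.
        for s, v in zip(batch, updated):
            V[s] = v
    return V


def value_iteration(P, R, gamma, batch_size, tol=1e-8, max_sweeps=10_000):
    """Repeat sweeps until the largest change in a sweep falls below tol."""
    V = [0.0] * len(R)
    for sweep in range(1, max_sweeps + 1):
        V_next = minibatch_sweep(V, P, R, gamma, batch_size)
        if max(abs(x - y) for x, y in zip(V, V_next)) < tol:
            return V_next, sweep
        V = V_next
    return V, max_sweeps


# Hypothetical toy MDP: 3 states, action 0 = stay, action 1 = try to move right.
N = 3
P = [
    [
        [1.0 if t == s else 0.0 for t in range(N)],
        [0.8 if t == (s + 1) % N else (0.2 if t == s else 0.0) for t in range(N)],
    ]
    for s in range(N)
]
R = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # only "stay" in state 2 pays a reward
GAMMA = 0.9

for b in (1, 2, N):
    values, sweeps = value_iteration(P, R, GAMMA, batch_size=b)
    print(f"batch_size={b}: {sweeps} sweeps, V={[round(v, 4) for v in values]}")
```

All batch sizes should reach the same fixed point; what changes is how soon fresh values are reused (small batches) versus how much independent work each step exposes (large batches).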
ISBN (print): 9781509015412
A hash function maps a message of arbitrary length to a shorter string of fixed length, called the message digest. Inevitably, many different messages are hashed to the same or similar digests; we call these collisions or partial collisions. Using multiple processors at the CUNY High Performance Computing Center, we locate partial collisions for MD5 and SHA-1 by brute-force parallel programming in C with the MPI library. The brute-force method of finding a second-preimage collision entails systematically computing all of the permutations, digests, and Hamming distances of the target preimage. We vary the size of the target strings and the number of processors allocated, and examine the effect these variables have on finding partial collisions. The results show that, for the same message space, the search time for partial collisions is roughly halved for each doubling of the number of processors, and that the longer the message, the better the partial collisions produced.
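The search described in the abstract above can be sketched in Python with the standard `hashlib` module. This is a minimal illustration under assumptions, not the authors' C/MPI code: the candidate space (all fixed-length lowercase strings), the round-robin split by `rank` and `size`, and the function names are hypothetical stand-ins. A partial collision here means a different message whose digest lies at a small Hamming distance from the target digest.

```python
import hashlib
import string
from itertools import product


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def search_share(target, length, rank=0, size=1, algo="md5",
                 alphabet=string.ascii_lowercase):
    """Scan this worker's share of the message space.

    Candidate i is handled by the worker with i % size == rank, mimicking
    how an MPI job might split the space (assumed scheme).
    Returns (distance, message) for the closest digest found, or None.
    """
    target_digest = hashlib.new(algo, target).digest()
    best = None
    for i, chars in enumerate(product(alphabet, repeat=length)):
        if i % size != rank:
            continue
        candidate = "".join(chars).encode()
        if candidate == target:
            continue  # the target itself is not a collision
        d = hamming_distance(hashlib.new(algo, candidate).digest(), target_digest)
        if best is None or d < best[0]:
            best = (d, candidate)
    return best


# Simulate 4 workers one after another and reduce their local minima.
target = b"mpi"
partials = [search_share(target, 3, rank=r, size=4) for r in range(4)]
distance, message = min(p for p in partials if p is not None)
print(f"closest MD5 digest to {target!r}: {message!r} at {distance} of 128 bits")
```

Because the workers do not communicate until the final reduction, doubling `size` halves each worker's share of candidates, which is consistent with the near-halving of search time reported above.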
ISBN (print): 9781509035328
Cyber-physical systems (CPSs) are embedded systems that are tightly integrated with their physical environment. The correctness of a CPS depends on the output of its computations and on the timeliness of completing the computations. This paper proposes the ForeC language for the deterministic parallel programming of CPS applications on multi-core execution platforms. ForeC's synchronous semantics is designed to greatly simplify the understanding and debugging of parallel programs. ForeC allows programmers to express many forms of parallel patterns while ensuring that programs are amenable to static timing analysis. One of ForeC's main innovations is its shared variable semantics, which provides thread isolation and deterministic thread communication. Through benchmarking, we demonstrate that ForeC can achieve better parallel performance than Esterel, a widely used synchronous language for concurrent safety-critical systems, and OpenMP, a popular desktop solution for parallel programming. We also demonstrate that the worst-case execution time of ForeC programs can be estimated precisely.
ISBN (print): 9781467386159
Automatic programming can be defined as developing software at a high level of abstraction. The definition is not precise because what is meant by automatic programming changes over time. The goal of automatic programming is for the programmer to set the specifications of a program and for the computer to generate the source code of that program. There exists a group of specification languages that vary in their properties; the Descartes specification language is known to be comprehensible and easily constructible. Descartes represents specifications by defining a system's inputs and outputs, as well as the relationship between them as functions. Descartes has been extended to support concurrent systems. These features made Descartes a good basis on which to build this research effort. This research effort studied automatic programming approaches and created a shortcut between specifications and implementation, with all its benefits. It created a way to transform Descartes specifications into C source code automatically. Automatic programming can apply to all fields of knowledge that can be automated; therefore, the scope of this research project was restricted to a few case studies that involve parallel programming.
ISBN (print): 9781509034390
Chip heat dissipation has put an end to clock speed increases. Multi-core clusters and heterogeneous platforms that include accelerators have recently become a main trend. Parallel programming paradigms span these diverse platforms: CUDA C, CUDA Fortran, OpenCL, OpenACC, OpenMP, MPI, pthread, MapReduce, and so on. Quantitative performance indexes help give a good picture of how parallel programming paradigms suit particular applications. This study employs two examples, the Pennes bioheat equations for simulating local hyperthermia that destroys tumor cells and the Navier-Stokes equations for simulating driven cavity flow at high Reynolds numbers, implemented with the parallel programming paradigms CUDA C, CUDA Fortran, OpenMP and MPI. Parallel programming in MPI for the Pennes bioheat equations shows super-linear speedup on NCHC (National Center for High-performance Computing) ALPS and runs significantly faster than the original author's version, whereas parallel programming in the CUDA C framework for the Navier-Stokes equations achieves a speedup of around 24 times on an NVIDIA C1060 GPU. We hope these results support useful suggestions.
ISBN (print): 9781509036837
This paper presents an experience of problem-based learning in a parallel programming course. The course covers the basics of parallel programming, from methodological and technological aspects to the analysis and design of parallel algorithms. The students work with an optimization problem in the field of parallel computing. The execution time and the energy consumption of a simplified master-slave scheme in a simplified heterogeneous system are optimized, so the task is treated as a bi-objective optimization problem, which is addressed with sequential, shared-memory, message-passing and hybrid parallel programming. In this way, the students follow the various parts of the course syllabus by working with a problem that combines topics studied in previous courses (green computing, computational systems architecture, optimization, heuristics), which contributes to a deeper understanding of these topics and motivates the introduction of new concepts.
HighP5 is a new high-level parallel programming language designed to help software developers to achieve three objectives simultaneously: programmer productivity, program portability, and superior program performance....
The development of an algebra-algorithmic methodology and tools for the automated design and generation of programs for graphics processing units is proposed. A particular feature of the proposed approach is the use of high-level specifications that are close to natural-language specifications, together with a method that ensures the syntactical correctness of the algorithms and programs being designed. The approach was implemented in a toolkit intended for interactively designing algorithm schemes and generating programs. The use of this toolkit is illustrated by the development of a parallel program in the field of meteorology.
ISBN (print): 9781315684895; 9781138028142
With the current prevalence of multi-core processors in SMP cluster architectures, mixed-mode programming, using both MPI and OpenMP in the same application, is becoming increasingly important. In this paper we discuss three methods for the parallelization of such algorithms, namely pure MPI parallelization, fine-grain hybrid MPI/OpenMP parallelization, and coarse-grain MPI/OpenMP parallelization. We propose a new hybrid parallel programming method based on the architecture hierarchy of SMP clusters. We design a hierarchical parallel algorithm for the N-body problem and compare its performance with that of the traditional hybrid parallel algorithm on the Dawning 5000A cluster. The results indicate that the hierarchical hybrid parallel algorithm has better scalability and speed.