Application development for modern high-performance systems with graphics processing units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. We present SkelCL, a high-level programming approach for systems with multiple GPUs, and its implementation as a library on top of OpenCL. SkelCL makes three main enhancements to the OpenCL standard: (1) memory management is simplified using parallel container data types (vectors and matrices); (2) an automatic data (re)distribution mechanism allows for implicit data movements between GPUs and ensures scalability when using multiple GPUs; (3) computations are conveniently expressed using parallel algorithmic patterns (skeletons). We demonstrate how SkelCL is used to implement parallel applications, and we report an experimental evaluation of our approach in terms of programming effort and performance.
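The skeleton idea described in this abstract can be sketched in ordinary Python. This is a conceptual illustration only: SkelCL itself is a C++ library on top of OpenCL, and the function names below are hypothetical, not SkelCL's real API.

```python
# Conceptual sketch of algorithmic skeletons (hypothetical names, not SkelCL's
# actual C++ API): a "map" skeleton applies a user function elementwise, and a
# "zip" skeleton combines two containers pairwise. A library like SkelCL would
# execute these on one or more GPUs instead of in a Python loop.

def map_skeleton(f, vector):
    """Apply f to every element; the library decides where and how to run it."""
    return [f(x) for x in vector]

def zip_skeleton(f, left, right):
    """Combine two equal-length vectors elementwise."""
    assert len(left) == len(right)
    return [f(a, b) for a, b in zip(left, right)]

# Example: SAXPY (y = a*x + y) expressed with skeletons instead of explicit loops.
a = 2.0
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
result = zip_skeleton(lambda xi, yi: a * xi + yi, x, y)
print(result)  # [12.0, 24.0, 36.0]
```

The point of the pattern is that the user supplies only the per-element function; memory transfers, kernel launches, and multi-GPU distribution remain the library's responsibility.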
This paper describes a technique for obtaining sums of floating point values that are independent of the order-of-operations, and thus attractive for use in global sums in massively parallel computations. The basic idea described here is to convert the floating point values into a representation using a set of long integers, with enough carry-bits to allow these integers to be summed across processors without need of carries at intermediate stages, before conversion of the final sum back to a real number. This approach is being used successfully in an earth system model, in which reproducibility of results is essential. Published by Elsevier B.V.
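A minimal sketch of the idea, assuming Python's arbitrary-precision integers in place of the paper's multi-word integer representation with explicit carry bits (the scale factor is an assumption for the example):

```python
# Sketch of order-independent summation via fixed-point integers.
# The paper splits each value across several long integers with spare carry
# bits so partial sums never overflow on fixed-width hardware; Python's
# arbitrary-precision ints let us illustrate the idea with a single integer.

SCALE_BITS = 40  # assumed precision: how many fractional bits to keep

def to_fixed(x):
    # Round the float to an integer multiple of 2**-SCALE_BITS.
    return round(x * (1 << SCALE_BITS))

def from_fixed(n):
    return n / (1 << SCALE_BITS)

def reproducible_sum(values):
    # Integer addition is associative and commutative, so the result is
    # identical for any summation order (unlike float addition).
    return from_fixed(sum(to_fixed(v) for v in values))

import random
vals = [0.1, 1e8, -1e8, 0.2, 0.3]
shuffled = vals[:]
random.shuffle(shuffled)
print(reproducible_sum(vals) == reproducible_sum(shuffled))  # True
```

Choosing the scale trades range against precision: larger `SCALE_BITS` keeps more fractional digits but leaves less headroom before the conversion step saturates on fixed-width hardware, which is exactly why the paper reserves explicit carry bits.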
Data dependence analysis is a very difficult task, mainly due to the limitations imposed by pointer aliasing, and by the overhead of dynamic data dependence analysis. Despite the huge effort to devise improved data de...
Although the graphics processing unit (GPU) was originally designed to accelerate image creation for output to a display, today's general-purpose GPU (GPGPU) computing offers unprecedented performance by offloading computing-intensive portions of the application to the GPGPU, while running the remainder of the code on the central processing unit (CPU). The highly parallel structure of a many-core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many "components", and multithreading can be used to implement their parallelism. However, effective implementation of multithreading on a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing) with the GPGPU. Then, we implement a multiobject interactive game console on an eight-core workstation using a multithreaded asynchronous model (MAM), a multithreaded synchronous model (MSM), and an MSM with data parallelism (MSMDP). According to the experimental results, speedups of about 61x and 5x are achieved by the GPGPU and MSMDP implementations, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance.
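The separable-convolution trick mentioned in this abstract can be illustrated with a small sketch. This is a pure-Python stand-in for the GPGPU kernel; the image and kernel values are assumed for the example.

```python
# Sketch: a "separable" 2-D convolution kernel factors into a row vector and a
# column vector, so two cheap 1-D passes replace one expensive 2-D pass:
# O(k) work per pixel instead of O(k*k). That regular, data-parallel structure
# is what makes the filter a good GPGPU candidate.

def conv1d(row, kernel):
    """1-D correlation with zero padding (fine for symmetric kernels)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + row + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(row))]

def separable_conv2d(image, kcol, krow):
    # Pass 1: filter every row with the horizontal kernel.
    rows = [conv1d(r, krow) for r in image]
    # Pass 2: filter every column with the vertical kernel.
    cols = list(map(list, zip(*rows)))          # transpose
    cols = [conv1d(c, kcol) for c in cols]
    return list(map(list, zip(*cols)))          # transpose back

# 3x3 box blur = [1/3, 1/3, 1/3] applied horizontally, then vertically.
img = [[0.0, 9.0, 0.0],
       [0.0, 9.0, 0.0],
       [0.0, 9.0, 0.0]]
k = [1 / 3, 1 / 3, 1 / 3]
out = separable_conv2d(img, k, k)
# out is [[2, 2, 2], [3, 3, 3], [2, 2, 2]] up to rounding:
# the edge rows see the zero padding, the middle row does not.
```

On a GPU, each output pixel of each pass is computed by an independent thread, which is the data parallelism the abstract exploits.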
ISBN (print): 9781605587349
We present Chorus, a high-level parallel programming model suitable for irregular, heap-manipulating applications like mesh refinement and epidemic simulations, and JChorus, an implementation of the model on top of Java. One goal of Chorus is to express the dynamic and instance-dependent patterns of memory access that are common in typical irregular applications. Its other focus is locality of effects: the property that in many of the same applications, typical imperative commands only affect small, local regions in the shared heap. Chorus addresses dynamism and locality through the unifying abstraction of an object assembly: a local region in a shared data structure equipped with a short-lived, speculative thread of control. The thread of control in an assembly can only access objects within the assembly. While objects can migrate from assembly to assembly, such migration is local, i.e., objects only move from one assembly to a neighboring one, and does not lead to aliasing. Programming primitives include a merge operation, by which an assembly merges with an adjacent assembly, and a split operation, which splits an assembly into smaller ones. Our abstractions are race- and deadlock-free, and inherently data-centric. We demonstrate that Chorus and JChorus allow natural programming of several important applications exhibiting irregular data-parallelism. We also present an implementation of JChorus based on a many-to-one mapping of assemblies to lower-level threads, and report on preliminary performance numbers.
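A toy sketch of the assembly abstraction with merge and split. The class and method names are hypothetical, not JChorus's real API, and speculation and threading are omitted; the sketch only shows the ownership discipline.

```python
# Toy illustration (hypothetical names, not JChorus's real API) of the
# "object assembly" abstraction: a local region of the heap whose thread of
# control may only touch objects inside it, growing via merge and shrinking
# via split.

class Assembly:
    def __init__(self, objects):
        self.objects = set(objects)   # the local region this thread may access

    def merge(self, neighbor):
        # Merging with an adjacent assembly transfers its objects wholesale,
        # so no object is ever reachable from two assemblies (no aliasing).
        merged = Assembly(self.objects | neighbor.objects)
        neighbor.objects.clear()
        self.objects.clear()
        return merged

    def split(self, subset):
        # Split off some of our objects into a new, smaller assembly.
        subset = set(subset) & self.objects
        self.objects -= subset
        return Assembly(subset)

a = Assembly({"n1", "n2"})
b = Assembly({"n3"})
ab = a.merge(b)            # a refinement step may need a larger region...
small = ab.split({"n3"})   # ...and releases what it no longer needs
print(sorted(ab.objects))     # ['n1', 'n2']
print(sorted(small.objects))  # ['n3']
```

Because every object belongs to exactly one assembly at a time, the per-assembly threads need no locks on individual objects, which is the source of the race- and deadlock-freedom claimed above.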
Tools for optimizing the performance of parallel programs on multi-architecture distributed computing systems are considered. A method is described for optimizing the embedding of parallel MPI programs into computing clusters with a hierarchical communication network structure. An adaptive approach to the delta optimization of restore points is proposed for effective fault-tolerant simulation on distributed computing systems.
Charm++ is a parallel programming system that evolved over the past 20 years to become a well-established system for programming parallel science and engineering applications, in addition to the combinatorial search a...
Current large-scale HPC systems consist of complex configurations with a huge number of potentially heterogeneous components. As the systems get larger, their behavior becomes more and more dynamic and unpredictable because of hardware and software reconfigurations due to fault recovery and power usage optimizations. Deep software hierarchies of large, complex system software and middleware components are required to operate such systems. Therefore, porting, adapting and tuning applications to today's complex systems is a complicated and time-consuming task. Sophisticated integrated performance measurement, analysis, and optimization capabilities are required to utilize such systems efficiently. This article summarizes the state of the art of scalable and portable parallel performance tools and the challenges these tools are facing on future extreme-scale and big-data systems.
A parallelized version of the 3-D multi-species transport model MT3DMS was developed and tested. Specifically, open multiprocessing (OpenMP) was utilized for communication between the processors. MT3DMS emulates solute transport by dividing the calculation into flow and transport steps. In this article, a new preconditioner, derived from Symmetric Successive Over-Relaxation (SSOR), was added to the generalized conjugate gradient solver. This preconditioner is well suited to the parallel architecture. A case study in the test field at TU Bergakademie Freiberg was used to produce the results and analyze the code's performance. It was observed that most of the running time is required for the advection and dispersion steps. As a result, the parallel version significantly decreases the running time of solute transport modeling. In addition, this work provides a first attempt to demonstrate the capability and versatility of MT3DMS5P to simulate solute transport in fractured gneiss rock. (C) 2014 Elsevier Ltd. All rights reserved.
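A sketch of how an SSOR preconditioner plugs into a preconditioned conjugate-gradient loop. This is a generic illustration, not the MT3DMS5P code; the matrix, right-hand side, and relaxation factor `w` are assumptions for the example.

```python
# Generic SSOR-preconditioned CG, not the MT3DMS5P implementation.
# For symmetric positive-definite A = L + D + L^T with relaxation factor w,
#   M = (w / (2 - w)) * (D/w + L) * D^{-1} * (D/w + L)^T,
# and applying M^{-1} costs one forward and one backward triangular sweep.

def ssor_apply(A, r, w=1.2):
    """Return z = M^{-1} r for the SSOR preconditioner of A."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):                      # forward sweep: (D/w + L) y = r
        s = r[i] - sum(A[i][j] * y[j] for j in range(i))
        y[i] = s * w / A[i][i]
    t = [(2 - w) / w * A[i][i] * y[i] for i in range(n)]
    z = [0.0] * n
    for i in reversed(range(n)):            # backward sweep: (D/w + U) z = t
        s = t[i] - sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s * w / A[i][i]
    return z

def pcg(A, b, tol=1e-10, max_iter=200):
    """Conjugate gradients with the SSOR preconditioner above."""
    n = len(A)
    x = [0.0] * n
    r = b[:]                                # residual of the zero initial guess
    z = ssor_apply(A, r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = ssor_apply(A, r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Small SPD example system (values assumed for illustration).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
```

In the transport model's sparse setting, each sweep visits only a row's few nonzeros, and the regular forward/backward structure is what makes the preconditioner comparatively friendly to a parallel architecture, as the abstract notes.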
A visit to the neighborhood PC retail store provides ample proof that we are in the multi-core era. The key differentiator among manufacturers today is the number of cores that they pack onto a single chip. The clock frequency of commodity processors has reached its limit, however, and is likely to stay below 4 GHz for years to come. As a result, adding cores is not synonymous with increasing computational power. To take full advantage of the performance enhancements offered by the new multi-core hardware, a corresponding shift must take place in the software infrastructure - a shift to parallel computing.