Application development for modern high-performance systems with graphics processing units (GPUs) currently relies on low-level programming approaches like CUDA and OpenCL, which leads to complex, lengthy and error-prone programs. We present SkelCL, a high-level programming approach for systems with multiple GPUs, and its implementation as a library on top of OpenCL. SkelCL makes three main enhancements to the OpenCL standard: (1) memory management is simplified using parallel container data types (vectors and matrices); (2) an automatic data (re)distribution mechanism allows for implicit data movements between GPUs and ensures scalability when using multiple GPUs; (3) computations are conveniently expressed using parallel algorithmic patterns (skeletons). We demonstrate how SkelCL is used to implement parallel applications, and we report an experimental evaluation of our approach in terms of programming effort and performance.
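The skeleton idea described in this abstract can be sketched in ordinary Python. This is a conceptual illustration only: SkelCL itself is a C++ library on top of OpenCL, and the function names below are hypothetical, not SkelCL's real API.

```python
# Conceptual sketch of algorithmic skeletons (hypothetical names, not SkelCL's
# actual C++ API): a "map" skeleton applies a user function elementwise, and a
# "zip" skeleton combines two containers pairwise. A library like SkelCL would
# execute these on one or more GPUs instead of in a Python loop.

def map_skeleton(f, vector):
    """Apply f to every element; the library decides where and how to run it."""
    return [f(x) for x in vector]

def zip_skeleton(f, left, right):
    """Combine two equal-length vectors elementwise."""
    assert len(left) == len(right)
    return [f(a, b) for a, b in zip(left, right)]

# Example: SAXPY (y = a*x + y) expressed with skeletons instead of explicit loops.
a = 2.0
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
result = zip_skeleton(lambda xi, yi: a * xi + yi, x, y)
print(result)  # [12.0, 24.0, 36.0]
```

The point of the pattern is that the user supplies only the per-element function; memory transfers, kernel launches, and multi-GPU distribution remain the library's responsibility.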
This paper describes a technique for obtaining sums of floating point values that are independent of the order-of-operations, and thus attractive for use in global sums in massively parallel computations. The basic idea described here is to convert the floating point values into a representation using a set of long integers, with enough carry-bits to allow these integers to be summed across processors without need of carries at intermediate stages, before conversion of the final sum back to a real number. This approach is being used successfully in an earth system model, in which reproducibility of results is essential. Published by Elsevier B.V.
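A minimal sketch of the idea, assuming Python's arbitrary-precision integers in place of the paper's multi-word integer representation with explicit carry bits (the scale factor is an assumption for the example):

```python
# Sketch of order-independent summation via fixed-point integers.
# The paper splits each value across several long integers with spare carry
# bits so partial sums never overflow on fixed-width hardware; Python's
# arbitrary-precision ints let us illustrate the idea with a single integer.

SCALE_BITS = 40  # assumed precision: how many fractional bits to keep

def to_fixed(x):
    # Round the float to an integer multiple of 2**-SCALE_BITS.
    return round(x * (1 << SCALE_BITS))

def from_fixed(n):
    return n / (1 << SCALE_BITS)

def reproducible_sum(values):
    # Integer addition is associative and commutative, so the result is
    # identical for any summation order (unlike float addition).
    return from_fixed(sum(to_fixed(v) for v in values))

import random
vals = [0.1, 1e8, -1e8, 0.2, 0.3]
shuffled = vals[:]
random.shuffle(shuffled)
print(reproducible_sum(vals) == reproducible_sum(shuffled))  # True
```

Choosing the scale trades range against precision: larger `SCALE_BITS` keeps more fractional digits but leaves less headroom before the conversion step saturates on fixed-width hardware, which is exactly why the paper reserves explicit carry bits.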
Data dependence analysis is a very difficult task, mainly due to the limitations imposed by pointer aliasing, and by the overhead of dynamic data dependence analysis. Despite the huge effort to devise improved data de...
Although the graphics processing unit (GPU) was originally designed to accelerate image creation for output to a display, today's general-purpose GPU (GPGPU) computing offers unprecedented performance by offloading computing-intensive portions of the application to the GPGPU, while running the remainder of the code on the central processing unit (CPU). The highly parallel structure of a many-core GPGPU can process large blocks of data faster using multithreaded concurrent processing. A game engine has many "components", and multithreading can be used to implement their parallelism. However, effective implementation of multithreading on a multicore processor has challenges, such as data and task parallelism. In this paper, we investigate the impact of using a GPGPU with a CPU to design high-performance game engines. First, we implement a separable convolution filter (heavily used in image processing) with the GPGPU. Then, we implement a multiobject interactive game console on an eight-core workstation using a multithreaded asynchronous model (MAM), a multithreaded synchronous model (MSM), and an MSM with data parallelism (MSMDP). According to the experimental results, speedups of about 61x and 5x are achieved by the GPGPU and MSMDP implementations, respectively. Therefore, GPGPU-assisted parallel computing has the potential to improve multithreaded game engine performance.
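The separable-convolution trick mentioned in this abstract can be illustrated with a small sketch. This is a pure-Python stand-in for the GPGPU kernel; the image and kernel values are assumed for the example.

```python
# Sketch: a "separable" 2-D convolution kernel factors into a row vector and a
# column vector, so two cheap 1-D passes replace one expensive 2-D pass:
# O(k) work per pixel instead of O(k*k). That regular, data-parallel structure
# is what makes the filter a good GPGPU candidate.

def conv1d(row, kernel):
    """1-D correlation with zero padding (fine for symmetric kernels)."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + row + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(row))]

def separable_conv2d(image, kcol, krow):
    # Pass 1: filter every row with the horizontal kernel.
    rows = [conv1d(r, krow) for r in image]
    # Pass 2: filter every column with the vertical kernel.
    cols = list(map(list, zip(*rows)))          # transpose
    cols = [conv1d(c, kcol) for c in cols]
    return list(map(list, zip(*cols)))          # transpose back

# 3x3 box blur = [1/3, 1/3, 1/3] applied horizontally, then vertically.
img = [[0.0, 9.0, 0.0],
       [0.0, 9.0, 0.0],
       [0.0, 9.0, 0.0]]
k = [1 / 3, 1 / 3, 1 / 3]
out = separable_conv2d(img, k, k)
# out is [[2, 2, 2], [3, 3, 3], [2, 2, 2]] up to rounding:
# the edge rows see the zero padding, the middle row does not.
```

On a GPU, each output pixel of each pass is computed by an independent thread, which is the data parallelism the abstract exploits.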
ISBN (print): 9781605587349
We present Chorus, a high-level parallel programming model suitable for irregular, heap-manipulating applications like mesh refinement and epidemic simulations, and JChorus, an implementation of the model on top of Java. One goal of Chorus is to express the dynamic and instance-dependent patterns of memory access that are common in typical irregular applications. Its other focus is locality of effects: the property that in many of the same applications, typical imperative commands only affect small, local regions in the shared heap. Chorus addresses dynamism and locality through the unifying abstraction of an object assembly: a local region in a shared data structure equipped with a short-lived, speculative thread of control. The thread of control in an assembly can only access objects within the assembly. While objects can migrate from assembly to assembly, such migration is local, i.e., objects only move from one assembly to a neighboring one, and does not lead to aliasing. Programming primitives include a merge operation, by which an assembly merges with an adjacent assembly, and a split operation, which splits an assembly into smaller ones. Our abstractions are race- and deadlock-free, and inherently data-centric. We demonstrate that Chorus and JChorus allow natural programming of several important applications exhibiting irregular data-parallelism. We also present an implementation of JChorus based on a many-to-one mapping of assemblies to lower-level threads, and report on preliminary performance numbers.
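A toy sketch of the assembly abstraction with merge and split. The class and method names are hypothetical, not JChorus's real API, and speculation and threading are omitted; the sketch only shows the ownership discipline.

```python
# Toy illustration (hypothetical names, not JChorus's real API) of the
# "object assembly" abstraction: a local region of the heap whose thread of
# control may only touch objects inside it, growing via merge and shrinking
# via split.

class Assembly:
    def __init__(self, objects):
        self.objects = set(objects)   # the local region this thread may access

    def merge(self, neighbor):
        # Merging with an adjacent assembly transfers its objects wholesale,
        # so no object is ever reachable from two assemblies (no aliasing).
        merged = Assembly(self.objects | neighbor.objects)
        neighbor.objects.clear()
        self.objects.clear()
        return merged

    def split(self, subset):
        # Split off some of our objects into a new, smaller assembly.
        subset = set(subset) & self.objects
        self.objects -= subset
        return Assembly(subset)

a = Assembly({"n1", "n2"})
b = Assembly({"n3"})
ab = a.merge(b)            # a refinement step may need a larger region...
small = ab.split({"n3"})   # ...and releases what it no longer needs
print(sorted(ab.objects))     # ['n1', 'n2']
print(sorted(small.objects))  # ['n3']
```

Because every object belongs to exactly one assembly at a time, the per-assembly threads need no locks on individual objects, which is the source of the race- and deadlock-freedom claimed above.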
Tools for optimizing the performance of parallel programs on multi-architecture distributed computing systems are considered. A method is described for optimizing the embedding of parallel MPI programs into computing clusters with a hierarchical communication network structure. An adaptive approach to the delta optimization of restore points is proposed for effective fault-tolerant simulation on distributed computing systems.
Charm++ is a parallel programming system that evolved over the past 20 years to become a well-established system for programming parallel science and engineering applications, in addition to the combinatorial search a...
Current large-scale HPC systems consist of complex configurations with a huge number of potentially heterogeneous components. As the systems get larger, their behavior becomes more and more dynamic and unpredictable because of hardware and software reconfigurations due to fault recovery and power usage optimizations. Deep software hierarchies of large, complex system software and middleware components are required to operate such systems. Therefore, porting, adapting and tuning applications to today's complex systems is a complicated and time-consuming task. Sophisticated integrated performance measurement, analysis, and optimization capabilities are required to utilize such systems efficiently. This article summarizes the state of the art of scalable and portable parallel performance tools and the challenges these tools are facing on future extreme-scale and big-data systems.
A parallelized version of the 3-D multi-species transport model MT3DMS was developed and tested. Specifically, open multiprocessing (OpenMP) was utilized for communication between the processors. MT3DMS emulates solute transport by dividing the calculation into flow and transport steps. In this article, a new preconditioner, derived from Symmetric Successive Over-Relaxation (SSOR), was added to the generalized conjugate gradient solver. This preconditioner is well suited to the parallel architecture. A case study in the test field at TU Bergakademie Freiberg was used to produce the results and analyze the code's performance. It was observed that most of the running time is required for the advection and dispersion steps. As a result, the parallel version significantly decreases the running time of solute transport modeling. In addition, this work provides a first attempt to demonstrate the capability and versatility of MT3DMS5P to simulate solute transport in fractured gneiss rock. (C) 2014 Elsevier Ltd. All rights reserved.
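A sketch of how an SSOR preconditioner plugs into a preconditioned conjugate-gradient loop. This is a generic illustration, not the MT3DMS5P code; the matrix, right-hand side, and relaxation factor `w` are assumptions for the example.

```python
# Generic SSOR-preconditioned CG, not the MT3DMS5P implementation.
# For symmetric positive-definite A = L + D + L^T with relaxation factor w,
#   M = (w / (2 - w)) * (D/w + L) * D^{-1} * (D/w + L)^T,
# and applying M^{-1} costs one forward and one backward triangular sweep.

def ssor_apply(A, r, w=1.2):
    """Return z = M^{-1} r for the SSOR preconditioner of A."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):                      # forward sweep: (D/w + L) y = r
        s = r[i] - sum(A[i][j] * y[j] for j in range(i))
        y[i] = s * w / A[i][i]
    t = [(2 - w) / w * A[i][i] * y[i] for i in range(n)]
    z = [0.0] * n
    for i in reversed(range(n)):            # backward sweep: (D/w + U) z = t
        s = t[i] - sum(A[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s * w / A[i][i]
    return z

def pcg(A, b, tol=1e-10, max_iter=200):
    """Conjugate gradients with the SSOR preconditioner above."""
    n = len(A)
    x = [0.0] * n
    r = b[:]                                # residual of the zero initial guess
    z = ssor_apply(A, r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = ssor_apply(A, r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Small SPD example system (values assumed for illustration).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
```

In the transport model's sparse setting, each sweep visits only a row's few nonzeros, and the regular forward/backward structure is what makes the preconditioner comparatively friendly to a parallel architecture, as the abstract notes.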
A visit to the neighborhood PC retail store provides ample proof that we are in the multi-core era. The key differentiator among manufacturers today is the number of cores that they pack onto a single chip. The clock frequency of commodity processors has reached its limit, however, and is likely to stay below 4 GHz for years to come. As a result, adding cores is not synonymous with increasing computational power. To take full advantage of the performance enhancements offered by the new multi-core hardware, a corresponding shift must take place in the software infrastructure - a shift to parallel computing.