ISBN:
(print) 9783642281440; 9783642281457
We introduce a variety of techniques for autotuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independently of the hardware architecture and attempt to select near-optimum parameters. We work towards a general framework for creating autotuned data-parallel algorithms, applying these techniques to common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal of building a general framework from these results. Our tuning strategy first identifies the computational patterns an algorithm exhibits, and then reduces our tuning model based on these observed patterns.
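The empirical-search idea behind this kind of autotuning can be sketched as timing a kernel under each candidate parameter setting and keeping the fastest. This is an illustrative sketch only; `tune` and `chunked_sum` are hypothetical stand-ins, not the paper's framework.

```python
import time

def tune(kernel, candidate_params, data):
    """Pick the parameter setting with the best measured runtime.

    `kernel` and `candidate_params` stand in for a data-parallel
    routine and its tunable settings (e.g. block or tile sizes).
    """
    best_param, best_time = None, float("inf")
    for p in candidate_params:
        start = time.perf_counter()
        kernel(data, p)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = p, elapsed
    return best_param

# Toy "kernel": sum the data in chunks of the given size.
def chunked_sum(data, chunk):
    return sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))

best = tune(chunked_sum, [64, 256, 1024], list(range(100_000)))
print(best)  # whichever candidate chunk size timed fastest on this machine
```

A real autotuner would additionally prune the search space using the observed computational patterns, as the abstract describes, rather than timing every candidate exhaustively.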
Commodity graphics hardware has seen incredible growth in terms of performance, programmability, and arithmetic precision. Even though these trends have been primarily driven by the entertainment industry, the price-to-performance ratio of graphics processors (GPUs) has attracted the attention of many within the high-performance computing community. While the performance of the GPU is well suited for computational science, the programming interface, and several hardware limitations, have prevented their wide adoption. In this paper we present Scout, a data-parallel programming language for graphics processors that hides the nuances of both the underlying hardware and supporting graphics software layers. In addition to general-purpose programming constructs, the language provides extensions for scientific visualization operations that support the exploration of existing or computed data sets. Published by Elsevier B.V.
Graphics Processing Units (GPUs) have become a competitive accelerator for applications outside the graphics domain, mainly driven by the improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, directly on the sequential code, thus speeding up the porting process. In this paper, we describe the hiCUDA directives as well as the design and implementation of a prototype compiler that translates a hiCUDA program to a CUDA program. Our compiler is able to support real-world applications that span multiple procedures and use dynamically allocated arrays. Experiments using nine CUDA benchmarks show that the simplicity hiCUDA provides comes at no expense to performance.
ISBN:
(print) 9781450300193
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
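The deferred-evaluation-plus-optimization idea described in this abstract can be sketched in miniature: record operations on a collection instead of executing them, then fuse the recorded plan into a single pass at run time. The names below are illustrative, not FlumeJava's actual API.

```python
# Minimal sketch of a deferred parallel collection with plan fusion,
# in the spirit of the abstract; `DeferredCollection` is hypothetical.
class DeferredCollection:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = list(ops)          # recorded operations, not yet executed

    def parallel_map(self, fn):
        # Record the operation, extending the execution plan lazily.
        return DeferredCollection(self._data, self._ops + [fn])

    def run(self):
        # "Optimize" the plan by fusing all recorded maps into one function,
        # so the data is traversed once instead of once per operation.
        def fused(x):
            for fn in self._ops:
                x = fn(x)
            return x
        return [fused(x) for x in self._data]

pipeline = (DeferredCollection([1, 2, 3])
            .parallel_map(lambda x: x * 10)
            .parallel_map(lambda x: x + 1))
print(pipeline.run())  # [11, 21, 31]
```

FlumeJava applies the same principle at a much larger scale, fusing whole MapReduce stages in the dataflow graph rather than per-element functions.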
Array redistribution is usually needed for more efficiently executing a data-parallel program on distributed memory multicomputers. To minimize the redistribution data transfer cost, processor mapping techniques were proposed to reduce the amount of redistributed data elements. These techniques demand that the beginning data elements on a processor not be redistributed in the redistribution. On the other hand, for satisfying practical computation needs, a programmer may require other data elements to be un-redistributed (localized) in the redistribution. In this paper, we propose a flexible processor mapping technique for the Block-Cyclic redistribution to allow the programmer to localize the required data elements in the redistribution. We also present an efficient redistribution method for the redistribution employing our proposed technique. The data transfer cost reduction and system performance improvement for the redistributions with data localization are analyzed and presented in our experimental results.
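The cost-reduction idea behind processor mapping can be illustrated numerically: when redistributing between two Block-Cyclic distributions, relabeling the target processors changes how many elements must actually move. This sketch brute-forces the best relabeling on a tiny example; it is not the paper's mapping algorithm.

```python
from itertools import permutations

def owner(i, b, P):
    # Owner of global element i under a BLOCK-CYCLIC(b) distribution on P processors.
    return (i // b) % P

def moved(n, b_src, b_dst, P, mapping):
    # Elements whose physical owner changes when going from BLOCK-CYCLIC(b_src)
    # to BLOCK-CYCLIC(b_dst), with `mapping` assigning each logical target
    # processor to a physical one (identity = no processor mapping).
    return sum(1 for i in range(n)
               if owner(i, b_src, P) != mapping[owner(i, b_dst, P)])

P, n = 4, 64
identity = tuple(range(P))
best = min(permutations(range(P)), key=lambda m: moved(n, 2, 4, P, m))
print(moved(n, 2, 4, P, identity), moved(n, 2, 4, P, best))  # 48 32
```

For this BLOCK-CYCLIC(2) to BLOCK-CYCLIC(4) example, a suitable processor mapping keeps a third of the otherwise-moved elements in place; the paper's contribution is letting the programmer choose which elements stay localized.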
Array redistribution is usually required for more efficiently executing a data-parallel program on distributed memory multi-computers. In performing array redistribution using synchronous communication mode, data communications among the processors should be properly arranged to avoid incurring higher data transfer cost. Some efficient communication scheduling methods for the Block-Cyclic redistribution have been proposed. On the other hand, the processor mapping technique can help reduce the data transfer cost of redistribution. To avoid degrading the benefit of data transfer cost reduction, it is necessary to construct optimal communication schedules for redistributions in which the processor mapping technique is applied. In this paper, we present a unified approach to constructing optimal communication schedules for Block-Cyclic redistribution with the processor mapping technique applied. The proposed method is founded on the processor mapping technique and constructs the required communication schedules more efficiently than other optimal scheduling methods.
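The scheduling problem the abstract refers to can be pictured as packing point-to-point messages into synchronous steps in which no processor sends or receives more than one message. The greedy packing below is a simplified illustration only; the paper constructs optimal schedules, which a greedy pass does not guarantee.

```python
def schedule(messages):
    # Greedily pack (src, dst) messages into communication steps where
    # every processor sends at most one message and receives at most one
    # message per step, approximating a contention-free schedule.
    steps = []
    for src, dst in messages:
        for step in steps:
            if all(s != src and d != dst for s, d in step):
                step.append((src, dst))
                break
        else:
            steps.append([(src, dst)])
    return steps

# Hypothetical redistribution messages among 3 processors:
msgs = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
plan = schedule(msgs)
print(len(plan))  # number of synchronous communication steps produced
```

An optimal scheduler would treat this as bipartite edge coloring of the message matrix, which is where greedy packing can fall short.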
Debuggers are crucial to understand the global execution behavior and intricate details of a program, to control the state of many processes, to present distributed information in a concise and clear way, to observe the execution behavior, and to detect and locate programming errors. In this paper we describe the design and implementation of SPiDER, an interactive source-level debugging system for both regular and irregular High Performance Fortran programs. SPiDER combines a base debugging system for message-passing programs with a high-level debugger that interfaces with an HPF compiler. SPiDER, in addition to conventional debugging functionality, allows a single process of a parallel program to be inspected or the entire program to be examined from a global point of view. A sophisticated visualization system has been developed and included in SPiDER to visualize data distributions, data-to-processor mapping relationships, and array values. SPiDER enables a programmer to dynamically change data distributions as well as array values. For arrays whose distribution can change during program execution, an animated replay displays the distribution sequence together with the associated source code location. Array values can be stored at individual execution points and compared against each other to examine execution behavior (e.g. convergence behavior of a numerical algorithm). Finally, SPiDER also offers limited support to evaluate the performance of parallel programs through a graphical load diagram. SPiDER has been fully implemented and is currently being used for the development of various real-world applications. Several experiments are presented that demonstrate the usefulness of SPiDER. (C) 2002 Published by Elsevier Science B.V.
Generating the local memory access sequence is a critical issue in distributed-memory implementations of data-parallel languages. In this paper, for arrays distributed block-cyclically on multiple processors, we introduce a novel approach to local memory access sequence generation using the theory of permutation. By compressing the active elements in a block into an integer, called the compress number, and exploiting the fact that there is a repeating pattern in the access sequence, we obtain the global block cycle. Then, we show that the local block cycle can be efficiently enumerated in closed form using the permutation of the global block cycle. After decompressing the compress numbers in the local block cycle, the local block patterns are restored and the local memory access sequence can be quickly generated. Unlike other works, our approach incurs no run-time overhead. (C) 2001 Elsevier Science B.V. All rights reserved.
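The problem this abstract addresses can be stated concretely with a naive reference implementation: walk the global access sequence and record the local index of every element a given processor owns under a block-cyclic distribution. The paper's contribution is avoiding exactly this walk by exploiting the periodicity; the sketch below is only the brute-force baseline.

```python
def local_indices(n, b, P, p, start, step):
    # Naive reference: walk the global accesses start, start+step, ... < n
    # and keep the local memory index of each element owned by processor p
    # under a BLOCK-CYCLIC(b) distribution on P processors.
    out = []
    for i in range(start, n, step):
        block, offset = divmod(i, b)
        if block % P == p:                         # element lives on processor p
            out.append((block // P) * b + offset)  # its local memory index
    return out

print(local_indices(n=32, b=4, P=2, p=0, start=1, step=3))  # [1, 6, 8, 11, 13]
```

Note the repeating pattern: the access sequence has period lcm(step, b*P) in global index space, which is what closed-form enumeration of the local block cycle exploits.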
This paper introduces the ideas that underlie the data-parallel language High Performance Fortran (HPF) and the new ideas in version 2 of HPF. It first reviews HPF's key language elements. It discusses the meaning ...
ISBN:
(print) 9780897918541
The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256-node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and number of particles.