ISBN:
(print) 9783642281440; 9783642281457
We introduce a variety of techniques for autotuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independently of the hardware architecture and attempt to select near-optimum parameters. We work towards a general framework for creating autotuned data-parallel algorithms, applying these techniques to common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal of building a general framework from these results. Our tuning strategy first identifies the computational patterns an algorithm exhibits, and then reduces our tuning model based on these observed patterns.
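The empirical-search idea behind this kind of autotuning can be sketched as timing a kernel under each candidate parameter setting and keeping the fastest. This is an illustrative sketch only; `tune` and `chunked_sum` are hypothetical stand-ins, not the paper's framework.

```python
import time

def tune(kernel, candidate_params, data):
    """Pick the parameter setting with the best measured runtime.

    `kernel` and `candidate_params` stand in for a data-parallel
    routine and its tunable settings (e.g. block or tile sizes).
    """
    best_param, best_time = None, float("inf")
    for p in candidate_params:
        start = time.perf_counter()
        kernel(data, p)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = p, elapsed
    return best_param

# Toy "kernel": sum the data in chunks of the given size.
def chunked_sum(data, chunk):
    return sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))

best = tune(chunked_sum, [64, 256, 1024], list(range(100_000)))
print(best)  # whichever candidate chunk size timed fastest on this machine
```

A real autotuner would additionally prune the search space using the observed computational patterns, as the abstract describes, rather than timing every candidate exhaustively.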
Commodity graphics hardware has seen incredible growth in terms of performance, programmability, and arithmetic precision. Even though these trends have been primarily driven by the entertainment industry, the price-to-performance ratio of graphics processors (GPUs) has attracted the attention of many within the high-performance computing community. While the performance of the GPU is well suited for computational science, the programming interface, and several hardware limitations, have prevented their wide adoption. In this paper we present Scout, a data-parallel programming language for graphics processors that hides the nuances of both the underlying hardware and supporting graphics software layers. In addition to general-purpose programming constructs, the language provides extensions for scientific visualization operations that support the exploration of existing or computed data sets. Published by Elsevier B.V.
Graphics Processing Units (GPUs) have become a competitive accelerator for applications outside the graphics domain, mainly driven by the improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, directly on the sequential code, thus speeding up the porting process. In this paper, we describe the hiCUDA directives as well as the design and implementation of a prototype compiler that translates a hiCUDA program to a CUDA program. Our compiler is able to support real-world applications that span multiple procedures and use dynamically allocated arrays. Experiments using nine CUDA benchmarks show that the simplicity hiCUDA provides comes at no expense to performance.
ISBN:
(print) 9781450300193
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
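The deferred-evaluation-plus-optimization idea described in this abstract can be sketched in miniature: record operations on a collection instead of executing them, then fuse the recorded plan into a single pass at run time. The names below are illustrative, not FlumeJava's actual API.

```python
# Minimal sketch of a deferred parallel collection with plan fusion,
# in the spirit of the abstract; `DeferredCollection` is hypothetical.
class DeferredCollection:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = list(ops)          # recorded operations, not yet executed

    def parallel_map(self, fn):
        # Record the operation, extending the execution plan lazily.
        return DeferredCollection(self._data, self._ops + [fn])

    def run(self):
        # "Optimize" the plan by fusing all recorded maps into one function,
        # so the data is traversed once instead of once per operation.
        def fused(x):
            for fn in self._ops:
                x = fn(x)
            return x
        return [fused(x) for x in self._data]

pipeline = (DeferredCollection([1, 2, 3])
            .parallel_map(lambda x: x * 10)
            .parallel_map(lambda x: x + 1))
print(pipeline.run())  # [11, 21, 31]
```

FlumeJava applies the same principle at a much larger scale, fusing whole MapReduce stages in the dataflow graph rather than per-element functions.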
Array redistribution is usually needed for more efficiently executing a data-parallel program on distributed memory multicomputers. To minimize the redistribution data transfer cost, processor mapping techniques were proposed to reduce the amount of redistributed data elements. These techniques demand that the beginning data elements on a processor not be redistributed in the redistribution. On the other hand, for satisfying practical computation needs, a programmer may require other data elements to be un-redistributed (localized) in the redistribution. In this paper, we propose a flexible processor mapping technique for the Block-Cyclic redistribution to allow the programmer to localize the required data elements in the redistribution. We also present an efficient redistribution method for the redistribution employing our proposed technique. The data transfer cost reduction and system performance improvement for the redistributions with data localization are analyzed and presented in our experimental results.
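The cost-reduction idea behind processor mapping can be illustrated numerically: when redistributing between two Block-Cyclic distributions, relabeling the target processors changes how many elements must actually move. This sketch brute-forces the best relabeling on a tiny example; it is not the paper's mapping algorithm.

```python
from itertools import permutations

def owner(i, b, P):
    # Owner of global element i under a BLOCK-CYCLIC(b) distribution on P processors.
    return (i // b) % P

def moved(n, b_src, b_dst, P, mapping):
    # Elements whose physical owner changes when going from BLOCK-CYCLIC(b_src)
    # to BLOCK-CYCLIC(b_dst), with `mapping` assigning each logical target
    # processor to a physical one (identity = no processor mapping).
    return sum(1 for i in range(n)
               if owner(i, b_src, P) != mapping[owner(i, b_dst, P)])

P, n = 4, 64
identity = tuple(range(P))
best = min(permutations(range(P)), key=lambda m: moved(n, 2, 4, P, m))
print(moved(n, 2, 4, P, identity), moved(n, 2, 4, P, best))  # 48 32
```

For this BLOCK-CYCLIC(2) to BLOCK-CYCLIC(4) example, a suitable processor mapping keeps a third of the otherwise-moved elements in place; the paper's contribution is letting the programmer choose which elements stay localized.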
Array redistribution is usually required for more efficiently executing a data-parallel program on distributed memory multi-computers. In performing array redistribution using synchronous communication mode, data communications among the processors should be properly arranged to avoid incurring higher data transfer cost. Some efficient communication scheduling methods for the Block-Cyclic redistribution have been proposed. On the other hand, the processor mapping technique can help reduce the data transfer cost of redistribution. To avoid degrading the benefit of data transfer cost reduction, it is necessary to construct optimal communication schedules for redistributions in which the processor mapping technique is applied. In this paper, we present a unified approach to constructing optimal communication schedules for Block-Cyclic redistribution with the processor mapping technique applied. The proposed method is founded on the processor mapping technique and constructs the required communication schedules more efficiently than other optimal scheduling methods.
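The scheduling problem the abstract refers to can be pictured as packing point-to-point messages into synchronous steps in which no processor sends or receives more than one message. The greedy packing below is a simplified illustration only; the paper constructs optimal schedules, which a greedy pass does not guarantee.

```python
def schedule(messages):
    # Greedily pack (src, dst) messages into communication steps where
    # every processor sends at most one message and receives at most one
    # message per step, approximating a contention-free schedule.
    steps = []
    for src, dst in messages:
        for step in steps:
            if all(s != src and d != dst for s, d in step):
                step.append((src, dst))
                break
        else:
            steps.append([(src, dst)])
    return steps

# Hypothetical redistribution messages among 3 processors:
msgs = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]
plan = schedule(msgs)
print(len(plan))  # number of synchronous communication steps produced
```

An optimal scheduler would treat this as bipartite edge coloring of the message matrix, which is where greedy packing can fall short.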
Debuggers are crucial to understand the global execution behavior and intricate details of a program, to control the state of many processes, to present distributed information in a concise and clear way, to observe the execution behavior, and to detect and locate programming errors. In this paper we describe the design and implementation of SPiDER, an interactive source-level debugging system for both regular and irregular High Performance Fortran programs. SPiDER combines a base debugging system for message-passing programs with a high-level debugger that interfaces with an HPF compiler. SPiDER, in addition to conventional debugging functionality, allows a single process of a parallel program to be inspected or the entire program to be examined from a global point of view. A sophisticated visualization system has been developed and included in SPiDER to visualize data distributions, data-to-processor mapping relationships, and array values. SPiDER enables a programmer to dynamically change data distributions as well as array values. For arrays whose distribution can change during program execution, an animated replay displays the distribution sequence together with the associated source code location. Array values can be stored at individual execution points and compared against each other to examine execution behavior (e.g. convergence behavior of a numerical algorithm). Finally, SPiDER also offers limited support to evaluate the performance of parallel programs through a graphical load diagram. SPiDER has been fully implemented and is currently being used for the development of various real-world applications. Several experiments are presented that demonstrate the usefulness of SPiDER. (C) 2002 Published by Elsevier Science B.V.
Generating the local memory access sequence is a critical issue in distributed-memory implementations of data-parallel languages. In this paper, for arrays distributed block-cyclically on multiple processors, we introduce a novel approach to local memory access sequence generation using the theory of permutation. By compressing the active elements in a block into an integer, called the compress number, and exploiting the fact that there is a repeating pattern in the access sequence, we obtain the global block cycle. Then, we show that the local block cycle can be efficiently enumerated in closed form using the permutation of the global block cycle. After decompressing the compress numbers in the local block cycle, the local block patterns are restored and the local memory access sequence can be quickly generated. Unlike other works, our approach incurs no run-time overhead. (C) 2001 Elsevier Science B.V. All rights reserved.
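The problem this abstract addresses can be stated concretely with a naive reference implementation: walk the global access sequence and record the local index of every element a given processor owns under a block-cyclic distribution. The paper's contribution is avoiding exactly this walk by exploiting the periodicity; the sketch below is only the brute-force baseline.

```python
def local_indices(n, b, P, p, start, step):
    # Naive reference: walk the global accesses start, start+step, ... < n
    # and keep the local memory index of each element owned by processor p
    # under a BLOCK-CYCLIC(b) distribution on P processors.
    out = []
    for i in range(start, n, step):
        block, offset = divmod(i, b)
        if block % P == p:                         # element lives on processor p
            out.append((block // P) * b + offset)  # its local memory index
    return out

print(local_indices(n=32, b=4, P=2, p=0, start=1, step=3))  # [1, 6, 8, 11, 13]
```

Note the repeating pattern: the access sequence has period lcm(step, b*P) in global index space, which is what closed-form enumeration of the local block cycle exploits.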
This paper introduces the ideas that underlie the data-parallel language High Performance Fortran (HPF) and the new ideas in version 2 of HPF. It first reviews HPF's key language elements. It discusses the meaning ...
ISBN:
(print) 9780897918541
The O(N) hierarchical N-body algorithms and massively parallel processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both efficiency and scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. The evaluation of the potential field of a system of 100 million particles takes 3 minutes and 15 minutes on a 256-node CM-5E, giving expected four and seven digits of accuracy, respectively. The speed of the code scales linearly with the number of processors and number of particles.