ISBN: (print) 9783642281440; 9783642281457
We introduce a variety of techniques for autotuning data-parallel algorithms on the GPU. Our techniques tune these algorithms independently of the hardware architecture and attempt to select near-optimal parameters. We work toward a general framework for creating autotuned data-parallel algorithms, applying these techniques to common algorithms with varying characteristics. Our contributions include tuning a set of algorithms with a variety of computational patterns, with the goal of building a general framework from these results. Our tuning strategy focuses first on identifying the computational patterns an algorithm exhibits, and then on reducing our tuning model based on these observed patterns.
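The abstract does not describe the search procedure itself, so the following is only a minimal sketch of the general idea: enumerate a parameter space, discard configurations that an observed pattern rules out, and keep the cheapest survivor. The names `autotune`, `param_space`, `prune`, and `cost` are hypothetical, and exhaustive search stands in for whatever strategy the authors actually use.

```python
import time
from itertools import product


def _time_once(kernel, params):
    """Wall-clock time of a single kernel invocation with the given parameters."""
    start = time.perf_counter()
    kernel(**params)
    return time.perf_counter() - start


def autotune(kernel, param_space, prune=None, cost=None, repeats=3):
    """Return (best_params, best_cost) over the Cartesian product of param_space.

    prune: optional predicate; configurations for which it returns True are
           skipped (a stand-in for reducing the tuning model).
    cost:  optional function params -> number; if omitted, the kernel is timed
           and the minimum over `repeats` runs is used.
    """
    names = list(param_space)
    best_params, best_cost = None, float("inf")
    for values in product(*(param_space[n] for n in names)):
        params = dict(zip(names, values))
        if prune is not None and prune(params):
            continue  # configuration ruled out before measuring
        if cost is not None:
            c = cost(params)
        else:
            c = min(_time_once(kernel, params) for _ in range(repeats))
        if c < best_cost:
            best_params, best_cost = params, c
    return best_params, best_cost
```

Passing an explicit `cost` function makes the search deterministic, which is convenient for testing; on real hardware the timing path would be used, and `prune` is where knowledge about an algorithm's computational pattern would shrink the space.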
ISBN: (print) 9781450300193
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
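Deferred evaluation of this kind can be illustrated with a small sketch. It is not FlumeJava's API: `LazyCollection` and its methods are hypothetical names, the example is written in Python rather than Java, and the only "optimization" shown is fusing all recorded element-wise operations into a single pass over the data.

```python
class LazyCollection:
    """Immutable collection that records operations instead of running them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)  # the recorded plan: (kind, function) pairs

    def map(self, fn):
        # Return a new collection with the operation appended to the plan.
        return LazyCollection(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._source, self._ops + (("filter", pred),))

    def plan(self):
        """Names of the deferred operations, in order."""
        return [kind for kind, _ in self._ops]

    def run(self):
        """Execute the whole plan in one fused pass over the source."""
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter rejected the item
                    keep = False
                    break
            if keep:
                out.append(item)
        return out
```

Nothing is computed until `run()` is called, so a chain such as `LazyCollection(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)` only builds a plan; a real system would rewrite that plan more aggressively before handing it to its execution primitives.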
ISBN: (print) 3540593934
A set of so-called cortical images, motivated by the function of simple cells in the primary visual cortex of mammals, is computed from each of two input images, and an image pyramid is constructed for each cortical image. The two sets of cortical image pyramids are matched synchronously, and an optimal mapping of one image onto the other is determined. The method was implemented in the data-parallel programming model on the Connection Machine CM-5 of the University of Groningen and applied to the problem of face recognition.
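The abstract does not say how the pyramids are built, so the sketch below assumes the simplest common construction: each level halves the resolution of the previous one by averaging 2x2 blocks. The function names are hypothetical, and images are plain lists of rows to keep the example dependency-free.

```python
def downsample(image):
    """Halve an image (list of rows) by averaging non-overlapping 2x2 blocks."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, min_size=1):
    """Return [full-resolution image, half, quarter, ...] down to min_size."""
    levels = [image]
    while (len(levels[-1]) // 2 >= min_size
           and len(levels[-1][0]) // 2 >= min_size):
        levels.append(downsample(levels[-1]))
    return levels
```

Matching would then typically start at the coarsest level and refine the mapping level by level; every pixel of a level can be processed independently, which is what makes the construction a natural fit for a data-parallel machine.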
ISBN: (print) 9780897918541
The O(N) hierarchical N-body algorithms and Massively Parallel Processors allow particle systems of 100 million particles or more to be simulated in acceptable time. We present a data-parallel implementation of Anderson's method and demonstrate both the efficiency and the scalability of the implementation on the Connection Machine CM-5/5E systems. The communication time for large particle systems amounts to about 10-25%, and the overall efficiency is about 35%. Evaluating the potential field of a system of 100 million particles on a 256-node CM-5E takes 3 minutes for an expected four digits of accuracy and 15 minutes for an expected seven digits. The speed of the code scales linearly with the number of processors and the number of particles.