ISBN (print): 9781450344937
Various existing optimization and memory consistency management techniques for GPU applications rely on the memory access patterns of kernels. However, they suffer from limited practicality because they require explicit user intervention to extract those patterns. This paper proposes an automatic memory-access-pattern analysis framework called MAPA. MAPA combines a source-level analysis technique derived from traditional symbolic analyses with a run-time pattern selection technique. Experimental results show that MAPA properly analyzes 116 real-world OpenCL kernels from the Rodinia and Parboil benchmark suites.
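To make the idea of kernel memory-access-pattern extraction concrete, the toy OpenCL C kernel below (hypothetical, not taken from the paper) has accesses that are affine in the work-item ID; a source-level symbolic analysis in the spirit of MAPA can summarize its read and write sets without executing the kernel.

```c
/* Hypothetical OpenCL C kernel with purely affine accesses.
   A source-level symbolic analysis can describe the read set of `in` as
   { s*gid + o : 0 <= gid < get_global_size(0) } and the write set of
   `out` as { gid : 0 <= gid < get_global_size(0) } without running it. */
__kernel void scale(__global const float *in,
                    __global float *out,
                    const int s, const int o)
{
    const int gid = get_global_id(0);
    out[gid] = 2.0f * in[s * gid + o];
}
```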
ISBN (print): 9781450319225
We present a simple yet effective technique for improving performance of lock-based code using the hardware lock elision (HLE) feature in Intel's upcoming Haswell processor. We also describe how to extend Haswell's HLE mechanism to achieve a similar effect to our lock elision scheme entirely in hardware.
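As background for the hardware lock elision (HLE) feature the abstract refers to, the sketch below shows the standard way to elide a spin lock on Haswell using GCC's HLE-annotated atomics (compile with -mhle on x86). It illustrates the base mechanism only; it is not the paper's improved elision scheme.

```c
#include <immintrin.h>      /* _mm_pause */

static volatile int lock = 0;

/* XACQUIRE-prefixed exchange: the core tries to run the critical section
   transactionally without actually writing the lock word; on abort it
   retries and acquires the lock for real. */
static void hle_lock(void)
{
    while (__atomic_exchange_n(&lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();         /* spin politely while the lock appears held */
}

/* XRELEASE-prefixed store ends the elided critical section. */
static void hle_unlock(void)
{
    __atomic_store_n(&lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```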
ISBN (print): 9781450340922
We present Tangram, a programming system for writing performance-portable programs. The language enables programmers to write computation and composition codelets, supported by tuning knobs and primitives for expressing data parallelism and work decomposition. The compiler and runtime use a set of techniques such as hierarchical composition, coarsening, data placement, tuning, and runtime selection based on input characteristics and micro-profiling. The resulting performance is competitive with optimized vendor libraries.
ISBN (print): 9781450340922
Memory performance is an essential factor in tapping the full potential of the massive parallelism of GPUs. It has motivated several recent efforts in GPU cache modeling. This paper presents a new data-centric way to model the performance of a system with heterogeneous memory resources. The new model is composable, meaning it can predict the performance difference between data placements from a single profiling run.
ISBN (print): 9781450344937
Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This paper proposes constructs for asynchronous multi-GPU programming, and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.
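For context, the kind of low-latency, asynchronous device-to-device communication that such a runtime builds on can be expressed directly with the CUDA runtime API, as in the hypothetical helpers below. This is plain CUDA host C, not Groute's actual interface; the device IDs, buffer names, and function names are illustrative.

```c
#include <stddef.h>
#include <cuda_runtime.h>

/* One-time setup: let GPUs 0 and 1 read and write each other's memory. */
void enable_peer_access(void)
{
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
}

/* Asynchronously push `bytes` bytes from a buffer on GPU 0 to a buffer on
   GPU 1.  The call returns immediately; the copy overlaps with kernels
   enqueued on other streams, which is the behavior bulk-synchronous
   models make hard to exploit. */
void async_push(void *dst_on_gpu1, const void *src_on_gpu0,
                size_t bytes, cudaStream_t stream)
{
    cudaMemcpyPeerAsync(dst_on_gpu1, 1, src_on_gpu0, 0, bytes, stream);
}
```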
ISBN (print): 9781450340922
Many time-dependent problems, such as molecular dynamics of protein folding, require a large number of time steps. The latencies and overheads of general-purpose clusters with accelerators are too high for such high-frequency iteration. We introduce an algorithmic model called Samsara parallel (or SP) which, unlike BSP, relies on asynchronous communication and can repeatedly return to earlier time steps to refine the precision of the computation. This also extends a line of research on parallel-in-time methods in computational chemistry and physics.
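For readers unfamiliar with the parallel-in-time line of work mentioned above, the sketch below implements the classical Parareal iteration (a well-known representative of that family, not the SP model itself) for the toy problem du/dt = -u: a cheap coarse sweep gives an initial guess, the fine solves over each time slice are independent and could run in parallel, and a serial correction revisits earlier time steps to refine precision.

```c
#include <stdio.h>

/* Model problem: du/dt = -u, exact solution exp(-t). */
static double f(double u) { return -u; }

/* Coarse propagator: one forward-Euler step over dt. */
static double G(double u, double dt) { return u + dt * f(u); }

/* Fine propagator: m small forward-Euler steps over dt. */
static double F(double u, double dt, int m) {
    double h = dt / m;
    for (int i = 0; i < m; i++) u += h * f(u);
    return u;
}

int main(void) {
    enum { N = 10 };                    /* number of coarse time slices */
    double dt = 0.1, U[N + 1], Uold[N + 1], Fv[N];

    /* Serial coarse sweep gives the initial guess. */
    U[0] = 1.0;
    for (int n = 0; n < N; n++) U[n + 1] = G(U[n], dt);

    for (int k = 0; k < 3; k++) {       /* Parareal iterations */
        for (int n = 0; n <= N; n++) Uold[n] = U[n];
        /* Fine solves over each slice are independent: in a parallel
           implementation each slice runs on its own processor. */
        for (int n = 0; n < N; n++) Fv[n] = F(Uold[n], dt, 100);
        /* Serial correction: U[n+1] = G(U[n]) + F(Uold[n]) - G(Uold[n]). */
        for (int n = 0; n < N; n++)
            U[n + 1] = G(U[n], dt) + Fv[n] - G(Uold[n], dt);
    }
    printf("Parareal u(1.0) = %f, exact = %f\n", U[N], 0.367879441);
    return 0;
}
```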
ISBN (print): 9781450392044
We present PARGEO, a multicore library for computational geometry algorithms. We describe two of the algorithms from PARGEO, convex hull and the smallest enclosing ball, and present a short evaluation of all implementations currently in PARGEO.
ISBN (print): 9781581135886
In programming high-performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are better suited to coarse-grained computation. However, currently only distributed address-space systems scale beyond the low hundreds of processors. In this paper we introduce a hybrid architecture that allows users to trade local memory usage for coherence communication, making larger-scale shared-memory architectures possible. We introduce a programming model and examine possible implementations of hardware mechanisms, evaluating some of the trade-offs inherent in each. Preliminary experiments on an application with particularly fine-grained communication requirements indicate that effective placement of directives can reduce coherence communication by more than a factor of 10 for 64 processors.