ISBN: (print) 9781450340922
We present Tangram, a programming system for writing performance-portable programs. The language enables programmers to write computation and composition codelets, supported by tuning knobs and primitives for expressing data parallelism and work decomposition. The compiler and runtime use a set of techniques such as hierarchical composition, coarsening, data placement, tuning, and runtime selection based on input characteristics and micro-profiling. The resulting performance is competitive with optimized vendor libraries.
ISBN: (print) 9781450301190
We describe two novel constructs for programming parallel machines with multi-level memory hierarchies: call-up, which allows a child task to invoke computation on its parent, and spawn, which spawns a dynamically determined number of parallel children until some termination condition in the parent is met. We show that together these constructs allow applications with irregular parallelism to be programmed in a straightforward manner, and that they complement and can be combined with constructs for expressing regular parallelism. We have implemented spawn and call-up in Sequoia, and we present an experimental evaluation on a number of irregular applications.
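As a rough illustration of the two constructs described in this abstract, the following is a minimal Python sketch, not Sequoia syntax and not the authors' implementation. The `Parent` class, its `call_up`, `done`, and `spawn` methods, the `child` function, and the wave size are all hypothetical names chosen for this example. Children run in a thread pool and update state owned by the parent (the call-up analogue), while the parent keeps launching waves of children until its own termination condition holds (the spawn analogue).

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Parent:
    """Hypothetical parent task that owns the shared state."""

    def __init__(self, target):
        self.target = target
        self.total = 0
        self.spawned = 0
        self._lock = threading.Lock()

    def call_up(self, value):
        # Analogue of call-up: a child invokes computation on parent state.
        with self._lock:
            self.total += value

    def done(self):
        # Termination condition, evaluated in the parent.
        return self.total >= self.target

    def spawn(self, child, wave_size=4):
        # Analogue of spawn: the number of children is not known up front;
        # waves are launched until the parent's condition is met.
        with ThreadPoolExecutor(max_workers=wave_size) as pool:
            while not self.done():
                ids = range(self.spawned, self.spawned + wave_size)
                self.spawned += wave_size
                # Wait for the whole wave before re-checking the condition.
                list(pool.map(lambda i: child(self, i), ids))


def child(parent, i):
    # Each child contributes the square of its index to the parent.
    parent.call_up(i * i)


parent = Parent(target=100)
parent.spawn(child)
print(parent.spawned, parent.total)
```

Launching children in fixed-size waves is an assumption made here to keep the sketch deterministic; the abstract does not say how Sequoia schedules spawned children.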
ISBN: (print) 9781450344937
Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This paper proposes constructs for asynchronous multi-GPU programming, and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on 8-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.
ISBN: (print) 9781605583976
The advent of new parallel architectures has increased the need for parallel optimizing compilers to assist developers in creating efficient code. OpenUH is a state-of-the-art optimizing compiler, but it performs only a limited set of optimizations for OpenMP programs because of its conservative assumptions about shared-memory programming. These limitations may prevent some OpenMP applications from being optimized to the same extent as their sequential counterparts. This paper describes our design and implementation of a parallel data flow framework, consisting of a parallel Control Flow Graph (PCFG) and a parallel SSA (PSSA) representation in OpenUH, to model data flow for OpenMP programs. This framework enables the OpenUH compiler to perform all classical scalar optimizations for OpenMP programs, in addition to OpenMP-specific optimizations.
ISBN: (print) 9781450319225
We present a simple yet effective technique for improving performance of lock-based code using the hardware lock elision (HLE) feature in Intel's upcoming Haswell processor. We also describe how to extend Haswell's HLE mechanism to achieve a similar effect to our lock elision scheme entirely in hardware.
ISBN: (print) 1595931899
When a Search Engine responds to your query, thousands of machines from around the world have cooperated to produce your result. With a global reach of hundreds of millions of users, Search Engines are arguably the most commonly used massively parallel computing systems on the planet. In this talk, we examine Web Search Engines as a case study of parallel programming in a practical context. We focus primarily on the practice of parallel programming, reviewing many ways in which parallel programming is used in a modern Search Engine. We also briefly discuss the principles of parallel programming, listing some of the principles that guide our use of parallelism and speculating a bit on how the mechanics of parallelism might be better automated in our context.
ISBN: (print) 9781450349826
Interaction with physical objects often imposes latency requirements on multi-core embedded systems. One consequence is the need for synchronisation algorithms that provide predictable latency in addition to high throughput. We present a synchronisation algorithm that needs at most 7 atomic memory operations per asynchronous critical section. Its performance is at least competitive with that of locks.
ISBN: (print) 9781450340922
Many time-dependent problems, such as molecular dynamics of protein folding, require a large number of time steps. The latencies and overheads of common-purpose clusters with accelerators are too large for high-frequency iteration. We introduce an algorithmic model called Samsara Parallel (or SP) which, unlike BSP, relies on asynchronous communication and can repeatedly return to earlier time steps to refine the precision of the computation. This also extends a line of research called Parallel-in-Time in computational chemistry and physics.