We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relati...
详细信息
ISBN:
(纸本)9781450344937
We describe a SIMD technique for drawing values from multiple discrete distributions, such as sampling from the random variables of a mixture model, that avoids computing a complete table of partial sums of the relative probabilities. A table of alternate ("butterfly-patterned") form is faster to compute, making better use of coalesced memory accesses;from this table, complete partial sums are computed on the fly during a binary search. Measurements using CUBA 7.5 on an NVIDIA Titan Black GPU show that this technique makes an entire machine-learning application that uses a Latent Dirichlet Allocation topic model with 1024 topics about about 13% faster (when using single-precision floating-point data) or about 35% faster (when using double precision floating-point data) than doing a straightforward matrix transposition after using coalesced accesses.
A variety of programming models exist to support large-scale, distributed memory, parallel computation. These programming models have historically targeted coarse-grained applications with natural locality such as tho...
详细信息
ISBN:
(纸本)9781450301190
A variety of programming models exist to support large-scale, distributed memory, parallel computation. These programming models have historically targeted coarse-grained applications with natural locality such as those found in a variety of scientific simulations of the physical world. Fine-grained, irregular, and unstructured applications such as those found in biology, social network analysis, and graph theory are less well supported. We propose Active Pebbles, a programming model which allows these applications to be expressed naturally;an accompanying execution model ensures performance and scalability.
Various existing optimization and memory consistency management techniques for GPU applications rely on memory access patterns of kernels. However, they suffer from poor practicality because they require explicit user...
详细信息
ISBN:
(纸本)9781450344937
Various existing optimization and memory consistency management techniques for GPU applications rely on memory access patterns of kernels. However, they suffer from poor practicality because they require explicit user interventions to extract kernel memory access patterns. This paper proposes an automatic memory-access-pattern analysis framework called MAPA. MAPA is based on a source-level analysis technique derived from traditional symbolic analyses and a run-time pattern selection technique. The experimental results show that MAPA properly analyzes 116 real-world OpenCL kernels from Rodinia and Parboil.
Object-oriented programming languages like Java provide only low-level constructs (e.g., starting a thread) to describe concurrency. High-level abstractions (e.g., thread pools) are merely provided as a library. As a ...
详细信息
ISBN:
(纸本)9781450311601
Object-oriented programming languages like Java provide only low-level constructs (e.g., starting a thread) to describe concurrency. High-level abstractions (e.g., thread pools) are merely provided as a library. As a result, a compiler is not aware of the high-level semantics of a parallel library and therefore misses important optimization opportunities. This paper presents a simple source language extension based on which a compiler can perform new optimizations that are particularly effective for parallel code.
This work proposes a low-overhead half-barrier pattern to schedule fine-grain parallel loops and considers its integration in the Intel OpenMP and Cilkplus schedulers. Experimental evaluation demonstrates that the sch...
详细信息
ISBN:
(纸本)9781450349826
This work proposes a low-overhead half-barrier pattern to schedule fine-grain parallel loops and considers its integration in the Intel OpenMP and Cilkplus schedulers. Experimental evaluation demonstrates that the scheduling overhead of our techniques is 43% lower than Intel OpenMP and 12.1 x lower than Cilk. We observe 22% speedup on 48 threads, with a peak of 2.8 x speedup.
We introduce our major ideas of a wait-free, linearizable, and disjoint-access parallel NCAS library, called RTNCAS. It focuses the construction of wait-free data structure operations (DSO) in real-time circumstances....
详细信息
ISBN:
(纸本)9781450301190
We introduce our major ideas of a wait-free, linearizable, and disjoint-access parallel NCAS library, called RTNCAS. It focuses the construction of wait-free data structure operations (DSO) in real-time circumstances. RTNCAS is able to conditionally swap multiple independent words (NCAS) in an atomic manner. It allows us, furthermore, to implement arbitrary DSO by means of their sequential specification.
We explore a programming approach for concurrency that synchronizes all accesses to shared memory by default. Synchronization takes place by ensuring that all program code runs inside atomic sections even if the progr...
详细信息
ISBN:
(纸本)9781450344937
We explore a programming approach for concurrency that synchronizes all accesses to shared memory by default. Synchronization takes place by ensuring that all program code runs inside atomic sections even if the program code has external side effects. Threads are mapped to atomic sections that a programmer must explicitly split to increase concurrency. A naive implementation of this approach incurs a large amount of overhead. We show how to reduce this overhead to make the approach suitable for realistic application programs on existing hardware. We present an implementation technique based on a special-purpose software transactional memory system. To reduce the overhead, the technique exploits properties of managed, object-oriented programming languages as well as intraprocedural static analyses and uses field-level granularity locking in combination with transactional I/O to provide good scaling properties. We implemented the synchronized-by-default (SBD) approach for the Java language and evaluate its performance for six programs from the DaCapo benchmark suite. The evaluation shows that, compared to explicit synchronization, the SBD approach has an overhead between 0.4% and 102% depending on the benchmark and the number of threads, with a mean (geom.) of 23.9%.
The Pilot library is a new method for programming MPI-enabled clusters in C, targeted at novice parallel programmers. Formal elements from Communicating Sequential Processes (CSP) are used to realize a process/channel...
详细信息
ISBN:
(纸本)9781605587080
The Pilot library is a new method for programming MPI-enabled clusters in C, targeted at novice parallel programmers. Formal elements from Communicating Sequential Processes (CSP) are used to realize a process/channel model of parallel computation that reduces opportunities for deadlock and other communication errors. This simple model, plus an application programming interface (API) styled after C's formatted I/O, are designed to make the library easy to learn. The Pilot library exists as a thin layer on top of any standard Message Passing Interface (MPI) implementation, preserving MPI's portability and efficiency, with little performance overhead arising as result of Pilot's additional features.
暂无评论