We present PARGEO, a multicore library for computational geometry algorithms. We describe two of the algorithms from PARGEO, convex hull and the smallest enclosing ball, and present a short evaluation of all implement...
详细信息
ISBN:
(纸本)9781450392044
We present PARGEO, a multicore library for computational geometry algorithms. We describe two of the algorithms from PARGEO, convex hull and the smallest enclosing ball, and present a short evaluation of all implementations currently in PARGEO.
Memory performance is one essential factor for tapping into the full potential of the massive parallelism of GPU. It has motivated some recent efforts in GPU cache modeling. This paper presents a new data-centric way ...
详细信息
ISBN:
(纸本)9781450340922
Memory performance is one essential factor for tapping into the full potential of the massive parallelism of GPU. It has motivated some recent efforts in GPU cache modeling. This paper presents a new data-centric way to model the performance of a system with heterogeneous memory resources. The new model is composable, meaning it can predict the performance difference due to placing data differently by profiling the execution just once.
In programming high performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are more suitable for coarse-grained computation. How...
详细信息
ISBN:
(纸本)9781581135886
In programming high performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are more suitable for coarse-grained computation. However, currently only distributed address-space systems scale beyond the low hundreds of processors. In this paper we introduce a hybrid architecture that allows users to trade off local memory usage for coherence communication, making possible larger-scale shared memory architectures. We introduce a programming model and examine possible implementations of hardware mechanisms, evaluating some of the trade-offs inherent in each. Preliminary experiments on an application with particularly fine-grained communication requirements indicate that effective placement of directives can reduce coherence communication by more than a factor of 10 for 64 processors.
In this tutorial participants learn how to build their own parallelprogramming language features by developing them as language extensions in the ableC [4] extensible C compiler framework. By implementing new paralle...
详细信息
ISBN:
(纸本)9781450362252
In this tutorial participants learn how to build their own parallelprogramming language features by developing them as language extensions in the ableC [4] extensible C compiler framework. By implementing new parallelprogramming abstractions as language extensions one can build on an existing host language and thus avoid re-implementing common language features such as the type checking and code generation of arithmetic expressions and control flow statements. Using ableC, one can build expressive language features that fit seamlessly into the C11 host language.
Given the sophistication of recent type systems, unification-based type-checking and inference can be a time-consuming phase of compilation-especially when union types are combined with subtyping. It is natural to con...
详细信息
ISBN:
(纸本)9781450340922
Given the sophistication of recent type systems, unification-based type-checking and inference can be a time-consuming phase of compilation-especially when union types are combined with subtyping. It is natural to consider improving performance through parallelism, but these algorithms are challenging to parallelize due to complicated control structure and difficulties representing data in a way that is both efficient and supports concurrency. We provide techniques that address these problems based on the LVish approach to deterministic-by-default parallelprogramming. We extend LVish with Saturating LVars, the first LVars implemented to release memory during the object's lifetime. Our design allows us to achieve a parallel speedup on worst-case (exponential) inputs of Hindley-Milner inference, and on the Typed Racket type-checking algorithm, which yields up an 8.46x parallel speedup on 14 cores for type-checking examples drawn from the Racket repository.
Our goal in this work was to identify and quantify the overheads of tracing parallel applications. We investigate several different sources of overhead related to tracing: trace instrumentation, periodic writing of tr...
详细信息
ISBN:
(纸本)9781595936028
Our goal in this work was to identify and quantify the overheads of tracing parallel applications. We investigate several different sources of overhead related to tracing: trace instrumentation, periodic writing of trace files to disk, differing trace buffer sizes, system changes, and increasing numbers of processors in the target application. We encountered overheads as large as 26.7% for writing the trace file to disk. We found that buffer sizes can make a difference in the overheads, and that differences in system software can also contribute to the level of the perturbation. Our results show that the overhead of instrumentation correlates strongly with the number of events, while the overhead of writing the trace buffer increases with increasing numbers of processors.
Boosted transactions offer an attractive method that enables programmers to create larger transactions that scale well and offer deadlock-free guarantees. However, as boosted transactions get larger, they become more ...
详细信息
ISBN:
(纸本)9781605583976
Boosted transactions offer an attractive method that enables programmers to create larger transactions that scale well and offer deadlock-free guarantees. However, as boosted transactions get larger, they become more susceptible to conflicts and aborts. We describe a linear-time algorithm to detect transactions that cannot make progress, which transactions need to be aborted, and when. The algorithm guarantees zero false positives with minimal aborts. Our proposals, as implemented in DSTM2, increase the transactional throughput of the system, often by more than 30%.
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallelprogramming on may-core architectures. We show that the NB-FEB primitive is universal, s...
详细信息
ISBN:
(纸本)9781605583976
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallelprogramming on may-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithm...
详细信息
ISBN:
(纸本)9781450349826
Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We present a new concurrent work queue, the Broker Queue, a highly efficient, linearizable queue for fine-granular work distribution on the GPU. We evaluate its usability and benefits in contrast to existing queuing algorithms. Our queue is up to one order of magnitude faster than non-blocking queues, and outperforms simpler queue designs that are unfit for fine-granular work distribution.
暂无评论