ISBN: 9781450311601 (print)
We present Dynamic Out-of-Order Java (DOJ), a dynamic parallelization approach. In DOJ, a developer annotates code blocks as tasks to decouple these blocks from the parent execution thread. The DOJ compiler then analyzes the code to generate heap examiners that ensure the parallel execution preserves the behavior of the original sequential program. Heap examiners dynamically extract heap dependences between code blocks and determine when it is safe to execute a code block. We have implemented DOJ and evaluated it on twelve benchmarks. We achieved an average compilation speedup of 31.15x over OoOJava and an average execution speedup of 12.73x over sequential versions of the benchmarks.
ISBN: 9781450311601 (print)
In this paper, we propose an OpenCL framework for heterogeneous CPU/GPU clusters, and show that the framework achieves both high performance and ease of programming. The framework gives the user the illusion of a single system. It allows the application to use multiple heterogeneous compute devices, such as multicore CPUs and GPUs, in a remote node as if they were in the local node. No communication API, such as the MPI library, is required in the application source. We implement the OpenCL framework and evaluate its performance on a heterogeneous CPU/GPU cluster that consists of one host node and nine compute nodes, using eleven OpenCL benchmark applications.
ISBN: 9781450311601 (print)
Graphs are the de facto data structures for many applications, and efficient graph processing is essential to application performance. GPUs have an order of magnitude higher computational power and memory bandwidth than CPUs and have been adopted to accelerate several common graph algorithms. However, it is difficult to write correct and efficient GPU programs, and even more difficult for graph processing because of the irregularity of graph structures. To address these difficulties, we propose a programming framework named Medusa to simplify graph processing on GPUs. Medusa offers a small set of APIs, with which developers can define their application logic by writing sequential code without awareness of GPU architectures. The Medusa runtime system automatically executes the developer-defined APIs in parallel on the GPU, with a series of graph-centric optimizations. This poster gives an overview of Medusa and presents some preliminary results.
ISBN: 9781450311601 (print)
We present CaCUDA, a GPGPU kernel abstraction and a parallel programming framework for developing highly efficient large-scale scientific applications using stencil computations on hybrid CPU/GPU architectures. CaCUDA is built upon the Cactus computational toolkit, an open-source problem-solving environment designed for scientists and engineers. Due to the flexibility and extensibility of the Cactus toolkit, the addition of a GPGPU programming framework required no changes to the Cactus infrastructure, guaranteeing that existing features and modules will continue to work without modification. CaCUDA was tested and benchmarked using a 3D CFD code based on a finite difference discretization of the Navier-Stokes equations.
ISBN: 9781450311601 (print)
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management, built on an efficient prefix sum, that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
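To illustrate how a prefix sum can drive fine-grained work assignment in a level-synchronous BFS, the following is a minimal sequential sketch in Python. It is not the authors' GPU implementation: the CSR input layout and the names `exclusive_scan` and `bfs_prefix_sum` are assumptions made for illustration. The scan over frontier degrees gives each frontier vertex a private, contiguous range of output slots, so on a GPU the gather step could run without atomics; each edge is examined once, which is consistent with O(|V|+|E|) work.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """Exclusive prefix sum: out[i] is the sum of counts[0..i-1]."""
    inclusive = list(accumulate(counts))
    return [0] + inclusive[:-1] if counts else []


def bfs_prefix_sum(row_offsets, col_indices, source):
    """Level-synchronous BFS over a CSR graph (hypothetical helper).

    Returns the BFS depth of every vertex, or -1 if it is unreachable.
    """
    num_vertices = len(row_offsets) - 1
    dist = [-1] * num_vertices
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Step 1: each frontier vertex reports how many edges it will expand.
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        # Step 2: the scan turns those counts into write offsets.
        offsets = exclusive_scan(degrees)
        total_edges = offsets[-1] + degrees[-1]
        edge_frontier = [0] * total_edges
        # Step 3: gather neighbors into disjoint slot ranges. Every (i, k)
        # pair writes a distinct slot, so the iterations are independent.
        for i, v in enumerate(frontier):
            start = row_offsets[v]
            for k in range(degrees[i]):
                edge_frontier[offsets[i] + k] = col_indices[start + k]
        # Step 4: drop already visited vertices and build the next frontier.
        # Done sequentially here; a parallel version could compact with a
        # second scan over validity flags.
        level += 1
        next_frontier = []
        for u in edge_frontier:
            if dist[u] == -1:
                dist[u] = level
                next_frontier.append(u)
        frontier = next_frontier
    return dist
```

For example, on an undirected graph with edges 0-1, 0-2, 1-3, 2-3, 3-4 and an isolated vertex 5, stored as `row_offsets = [0, 2, 4, 6, 9, 10, 10]` and `col_indices = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]`, calling `bfs_prefix_sum(row_offsets, col_indices, 0)` yields the depths `[0, 1, 1, 2, 3, -1]`.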
ISBN: 9780897919067 (print)
We present a tool for mesh-partitioning parallelization of numerical programs working iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is then run in SPMD mode on a parallel architecture with distributed memory. It is necessary to add calls to communication routines at a few carefully selected locations in the code. The tool presented here uses data-dependence information to mechanize the placement of these synchronizations. Additionally, we see that there is no unique solution for placing these synchronizations, and that performance depends on this choice.
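The mesh-partitioning method described above can be sketched in a few lines of Python. This is a hand-written illustration, not output of the tool: the neighbor-averaging kernel, the adjacency-list mesh, and the names `build_submesh`, `partitioned_sweeps`, and `sequential_sweeps` are all assumptions. Each sub-mesh keeps private copies of its owned nodes plus an overlap of ghost nodes, and the one synchronization placed at the end of each sweep refreshes the ghosts from their owners, standing in for the communication calls on a distributed-memory machine.

```python
def sequential_sweeps(adj, values, steps):
    """Reference kernel: replace each node value by its neighbors' average."""
    vals = list(values)
    for _ in range(steps):
        vals = [sum(vals[u] for u in adj[v]) / len(adj[v])
                for v in range(len(adj))]
    return vals


def build_submesh(adj, owned):
    """Owned nodes plus the overlap: neighbors owned by another sub-mesh."""
    owned_set = set(owned)
    ghosts = sorted({u for v in owned for u in adj[v]} - owned_set)
    return {"owned": list(owned), "ghosts": ghosts}


def partitioned_sweeps(adj, values, parts, steps):
    """Run the same kernel per sub-mesh, touching only local copies."""
    subs = [build_submesh(adj, p) for p in parts]
    # Private storage per sub-mesh, as on a distributed-memory machine.
    local = [{v: values[v] for v in s["owned"] + s["ghosts"]} for s in subs]
    owner = {v: i for i, s in enumerate(subs) for v in s["owned"]}
    for _ in range(steps):
        # Compute phase: every sub-mesh reads only its own local copies.
        updates = [
            {v: sum(loc[u] for u in adj[v]) / len(adj[v]) for v in s["owned"]}
            for s, loc in zip(subs, local)
        ]
        for loc, upd in zip(local, updates):
            loc.update(upd)
        # Synchronization: refresh the overlap from the owning sub-mesh.
        # Without this step the ghost values would be stale in the next sweep.
        for s, loc in zip(subs, local):
            for g in s["ghosts"]:
                loc[g] = local[owner[g]][g]
    result = [0.0] * len(adj)
    for s, loc in zip(subs, local):
        for v in s["owned"]:
            result[v] = loc[v]
    return result
```

Here the synchronization is placed by hand after the compute phase. Other placements can be valid too, for instance exchanging two layers of overlap every second sweep, which trades redundant computation for fewer messages; this is the kind of choice the abstract says affects performance.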
ISBN: 9781450311601 (print)
SIMD instructions have been common in CPUs for years. Using these instructions effectively requires not only vectorization of code, but also modifications to the data layout. However, automatic vectorization techniques are often not powerful enough and suffer from a restricted scope of applicability; hence, programmers often vectorize their programs manually by using intrinsics: compiler-known functions that directly expand to machine instructions. Intrinsics significantly decrease programmer productivity by enforcing a very error-prone and hard-to-read assembly-like programming style. Furthermore, intrinsics are not portable because they are tied to a specific instruction set. In this paper, we show how a C-like language can be extended to allow for portable and efficient SIMD programming. Our extension puts the programmer in total control over where and how control-flow vectorization is triggered. We present a type system and a formal semantics of our extension and prove the soundness of the type system. Using our prototype implementation IVL, which targets Intel's MIC architecture and the SSE instruction set, we show that the generated code is roughly on par with handwritten intrinsic code.
ISBN: 9781605587080 (print)
This poster is a case study on the application of a novel programming model, called Concurrent Collections (CnC), to the implementation of an asynchronous-parallel algorithm for computing the Cholesky factorization of dense matrices. In CnC, the programmer expresses her computation in terms of application-specific operations, partially ordered by semantic scheduling constraints. We demonstrate the performance potential of CnC in this poster by showing that our Cholesky implementation nearly matches or exceeds competing vendor-tuned codes and alternative programming models. We conclude that the CnC model is well-suited for expressing asynchronous-parallel algorithms on emerging multicore systems.