ISBN: 9780897915892 (print)
We present a solution to the reaching definitions problem for programs with explicit lexically specified parallel constructs, such as cobegin/coend or parallel-sections, both with and without explicit synchronization operations, such as Post, Wait or Advance. The reaching definitions information for sequential programs is used to solve many standard optimization problems. In parallel programs, this information can also be used to explicitly direct communication and data ownership. Although work has been done on analyzing parallel programs to detect data races, little work has been done on optimizing such programs. We show how the memory consistency model specified by an explicitly parallel programming language can influence the complexity of the reaching definitions problem. By selecting the "weakest" memory consistency semantics, we can efficiently solve the reaching definitions problem for correct programs.
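For readers unfamiliar with the underlying problem, the following is a minimal sketch of the textbook iterative reaching definitions analysis for a sequential control-flow graph. It is not the parallel algorithm of the paper above; it only illustrates the sequential baseline the abstract refers to. The function name `reaching_definitions`, the block names, and the definition labels `d1` to `d3` are all hypothetical.

```python
from collections import defaultdict


def reaching_definitions(blocks, succs):
    """Classic forward may-analysis on a sequential control-flow graph.

    blocks: dict mapping block name -> list of (def_id, variable) in order.
    succs:  dict mapping block name -> list of successor block names.
    Returns (IN, OUT), each a dict mapping block name -> set of def_ids.
    """
    # Collect every definition of each variable; needed for the KILL sets.
    defs_of = defaultdict(set)
    for stmts in blocks.values():
        for def_id, var in stmts:
            defs_of[var].add(def_id)

    gen, kill = {}, {}
    for name, stmts in blocks.items():
        g, k = set(), set()
        for def_id, var in stmts:
            others = defs_of[var] - {def_id}
            g = (g - others) | {def_id}  # a later definition supersedes earlier ones
            k |= others                  # every other definition of var is killed
        gen[name], kill[name] = g, k

    # Derive predecessor lists from the successor lists.
    preds = defaultdict(list)
    for b, targets in succs.items():
        for t in targets:
            preds[t].append(b)

    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    worklist = list(blocks)
    while worklist:
        b = worklist.pop()
        # IN[b] is the union of OUT over all predecessors.
        IN[b] = set().union(*(OUT[p] for p in preds[b]))
        new_out = gen[b] | (IN[b] - kill[b])
        if new_out != OUT[b]:
            OUT[b] = new_out
            worklist.extend(succs.get(b, []))  # successors must be revisited
    return IN, OUT


# Hypothetical diamond-shaped program: x is defined in "entry" and redefined
# in "then"; y is defined in "else"; both branches meet at "join".
example_blocks = {
    "entry": [("d1", "x")],
    "then": [("d2", "x")],
    "else": [("d3", "y")],
    "join": [],
}
example_succs = {"entry": ["then", "else"], "then": ["join"], "else": ["join"]}
reach_in, reach_out = reaching_definitions(example_blocks, example_succs)
```

In this example both definitions of `x` reach the join block, because `d1` survives along the "else" branch while `d2` replaces it along the "then" branch.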
ISBN: 9781450311601 (print)
Graphs are the de facto data structures for many applications, and efficient graph processing is essential for application performance. GPUs have an order of magnitude higher computational power and memory bandwidth compared to CPUs and have been adopted to accelerate several common graph algorithms. However, it is difficult to write correct and efficient GPU programs, and even more difficult for graph processing due to the irregularities of graph structures. To address those difficulties, we propose a programming framework named Medusa to simplify graph processing on GPUs. Medusa offers a small set of APIs, based on which developers can define their application logic by writing sequential code without awareness of GPU architectures. The Medusa runtime system automatically executes the developer-defined APIs in parallel on the GPU, with a series of graph-centric optimizations. This poster gives an overview of Medusa and presents some preliminary results.
ISBN: 9781581133462 (print)
This paper presents a new combined pointer and escape analysis for multithreaded programs. The algorithm uses a new abstraction called parallel interaction graphs to analyze the interactions between threads and extract precise points-to, escape, and action ordering information for objects accessed by multiple threads. The analysis is compositional, analyzing each method or thread once to extract a parameterized analysis result that can be specialized for use in any context. It is also capable of analyzing programs that use the unstructured form of multithreading present in languages such as Java and standard threads packages such as POSIX threads. We have implemented the analysis in the MIT Flex compiler for Java and used the extracted information to 1) verify that programs correctly use region-based allocation constructs, 2) eliminate dynamic checks associated with the use of regions, and 3) eliminate unnecessary synchronization. Our experimental results show that analyzing the interactions between threads significantly increases the effectiveness of the region analysis and region check elimination, but has little effect on synchronization elimination.
ISBN: 9781450311601 (print)
We present Dynamic Out-of-Order Java (DOJ), a dynamic parallelization approach. In DOJ, a developer annotates code blocks as tasks to decouple these blocks from the parent execution thread. The DOJ compiler then analyzes the code to generate heap examiners that ensure the parallel execution preserves the behavior of the original sequential program. Heap examiners dynamically extract heap dependences between code blocks and determine when it is safe to execute a code block. We have implemented DOJ and evaluated it on twelve benchmarks. We achieved an average compilation speedup of 31.15x over OoOJava and an average execution speedup of 12.73x over sequential versions of the benchmarks.
ISBN: 9781581135886 (print)
ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user-transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP and MPI implementations and in a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC and IBM RS6000 SP cluster.
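The idea of aggregating outgoing communication to coarsen parallelism can be illustrated with a small sketch. This is not ARMI's API or implementation: the class `AggregatingSender`, its methods, the `max_batch` parameter, and the `transport` callback are all hypothetical names chosen for illustration. The sketch assumes only that many small remote requests to the same destination can be buffered and delivered as one larger message.

```python
from collections import defaultdict


class AggregatingSender:
    """Buffers small remote requests per destination and sends them in batches."""

    def __init__(self, transport, max_batch=4):
        # transport is a callable(dest, batch) standing in for the real
        # message-passing layer; max_batch is the tunable aggregation factor.
        self.transport = transport
        self.max_batch = max_batch
        self.pending = defaultdict(list)

    def async_call(self, dest, method, *args):
        """Queue a request; send the batch once it reaches max_batch."""
        buf = self.pending[dest]
        buf.append((method, args))
        if len(buf) >= self.max_batch:
            self.flush(dest)

    def flush(self, dest=None):
        """Send pending requests for one destination, or for all if dest is None."""
        dests = [dest] if dest is not None else list(self.pending)
        for d in dests:
            batch = self.pending.pop(d, [])
            if batch:
                self.transport(d, batch)


# Usage: five fine-grain requests to node 1 become two messages.
sent = []
sender = AggregatingSender(lambda d, batch: sent.append((d, list(batch))), max_batch=3)
for i in range(5):
    sender.async_call(1, "increment", i)
sender.flush()
```

Raising `max_batch` trades latency for fewer, larger messages, which is the kind of per-platform tuning knob the abstract describes in general terms.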
We have modified the C language to support a programming model based on a shared address space with physically distributed memory. With this model users can write programs in which the nodes of a massively parallel processor can access remote memory without message passing. AC provides support for distributed arrays as well as pointers to distributed data. Simple array references and pointer dereferencing are sufficient to generate low-overhead remote reads and writes. We have implemented these ideas in a compiler based on the GNU C compiler and targeted at Cray Research's T3D. Initial performance measurements show that AC generates code for remote accesses which is considerably faster than that of the native compiler for structures up to about 16 words in size and virtually equivalent for larger transfers.
ISBN: 9781450311601 (print)
In this paper, we propose an OpenCL framework for heterogeneous CPU/GPU clusters, and show that the framework achieves both high performance and ease of programming. The framework provides an illusion of a single system for the user. It allows the application to utilize multiple heterogeneous compute devices, such as multicore CPUs and GPUs, in a remote node as if they were in a local node. No communication API, such as the MPI library, is required in the application source. We implement the OpenCL framework and evaluate its performance on a heterogeneous CPU/GPU cluster that consists of one host node and nine compute nodes using eleven OpenCL benchmark applications.
ISBN: 9780897919067 (print)
This paper presents a new parallel volume rendering algorithm and implementation, based on shear-warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed performance measurements on real machines and simulators to understand performance bottlenecks. This leads us to a new parallel implementation that substantially outperforms and out-scales the old one on a range of shared address space platforms, from bus-based centralized memory machines to hardware-coherent distributed memory machines to networks of computers connected by page-based shared virtual memory. The results demonstrate that real-time volume rendering is promising on general-purpose multiprocessors, and illustrate the utility of tool hierarchies in conjunction with algorithmic and application knowledge to understand memory system interactions and improve parallel algorithms.
ISBN: 9781605581170 (print)
The proceedings contain 25 papers. The topics discussed include: order-sorted dependency pairs; macros for context-free grammars; inferring precise polymorphic type dependencies in logic programs; a type system for safe memory management and its proof of correctness; programming with proofs and explicit contexts; towards execution time estimation in abstract machine-based languages; similarity-based reasoning in qualified logic programming; classifying integrity checking methods with regard to inconsistency tolerance; comprehending finite maps for algorithmic debugging of higher-order functional programs; parallel execution of multi-set constraint rewrite rules; a rewriting framework for the composition of access control policies; global difference constraint propagation for finite domain solvers; and dynamic variable elimination during propagation solving.