Our goal in this work was to identify and quantify the overheads of tracing parallel applications. We investigate several different sources of overhead related to tracing: trace instrumentation, periodic writing of tr...
详细信息
ISBN:
(纸本)9781595936028
Our goal in this work was to identify and quantify the overheads of tracing parallel applications. We investigate several different sources of overhead related to tracing: trace instrumentation, periodic writing of trace files to disk, differing trace buffer sizes, system changes, and increasing numbers of processors in the target application. We encountered overheads as large as 26.7% for writing the trace file to disk. We found that buffer sizes can make a difference in the overheads, and that differences in system software can also contribute to the level of the perturbation. Our results show that the overhead of instrumentation correlates strongly with the number of events, while the overhead of writing the trace buffer increases with increasing numbers of processors.
When a Search Engine responds to your query, thousands of machines from around the world have cooperated to produce your result. With a global reach of hundreds-of-millions of users, Search Engines are arguably the mo...
详细信息
ISBN:
(纸本)1595931899
When a Search Engine responds to your query, thousands of machines from around the world have cooperated to produce your result. With a global reach of hundreds-of-millions of users, Search Engines are arguably the most commonly used massively-parallel computing systems on the planet. In this talk, we examine Web Search Engines as a case study of parallelprogramming in a practical context. We focus primarily on the practice of parallelprogramming, reviewing many ways in which parallelprogramming is used in a modern Search Engine. We also discuss briefly the principles of parallelprogramming, listing some of the principles that guide our use of parallelism and speculating a bit on how the mechanics of parallelism might better be automated in our context.
The proceedings contains 14 papers from the conference on the proceedings of the acmsigplansymposium on principles and practice of parallelprogramming, PPOPP. Topics discussed include: reference idempotency analysi...
详细信息
The proceedings contains 14 papers from the conference on the proceedings of the acmsigplansymposium on principles and practice of parallelprogramming, PPOPP. Topics discussed include: reference idempotency analysis: a framework for optimizing speculative execution;pointer and escape analysis for multithread programs;language support for motion-order matrices;efficient load balancing for wide-area divide-and-conquer applications;scalable queue-based spin locks with timeout;contention ellimination by replication of sequential sections in distributed shared memory programs;and accurate data redistribution cost estimation in software distributes shared memory systems.
We present KUMQUAT, a system for automatically generating data-parallel implementations of UNIX shell commands and pipelines. The generated parallel versions split input streams, execute multiple instantiations of the...
详细信息
ISBN:
(纸本)9781450392044
We present KUMQUAT, a system for automatically generating data-parallel implementations of UNIX shell commands and pipelines. The generated parallel versions split input streams, execute multiple instantiations of the original pipeline commands to process the splits in parallel, then combine the resulting parallel outputs to produce the final output stream. KumQUAT automatically synthesizes the combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners. We present experimental results that show that these combiners enable the effective parallelization of our benchmark scripts.
Implicit parallelism with Ordered Transactions (IPOT) is all extension of sequential or explicitly parallelprogramming models to support speculative parallelization. The key idea is to specify opportunities for paral...
详细信息
ISBN:
(纸本)9781595936028
Implicit parallelism with Ordered Transactions (IPOT) is all extension of sequential or explicitly parallelprogramming models to support speculative parallelization. The key idea is to specify opportunities for parallelization in a sequential program using annotations similar to transactions. Unlike explicit parallelism, IPOT annotations do not require the absence of data dependence, since the parallelization relies oil runtime support for speculative execution. IPOT as a parallelprogramming model is determinate, i.e., program semantics are independent of the thread scheduling. For optimization, non-determinism can be introduced selectively. We describe the programming model of IPOT and an online tool that recommends boundaries of ordered transactions by observing a sequential execution. On three example HPC workloads we demonstrate that Our method is effective in identifying opportunities for fine-grain parallelization. Using the automated task recommendation tool, we were able to perform the parallelization of each program within a few hours.
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. The key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately e...
详细信息
ISBN:
(纸本)9781450368186
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. The key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately explored by multiple instances of the same decision procedure running in parallel. The decision procedures work on different partitions of the search space without cooperating, whence distribution follows effortlessly. Our experiments on a selection of complex multi-threaded programs show significant analysis speedups and scalability, and greater performance gains than with general-purpose parallel solvers.
We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a...
详细信息
ISBN:
(纸本)9781450340922
We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for general-purpose sparse linear algebra computations. Our CsrMV performs an equitable multi-partitioning of the input dataset, ensuring that no single thread can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large number of zerolength rows. This parallel decomposition requires neither offline preprocessing nor specialized/ancillary data formats. We evaluate our method on both CPU and GPU microarchitecture across an enormous corpus of diverse real world matrix datasets. We show that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of our method is substantially impervious to row-length heterogeneity.
In this tutorial participants learn how to build their own parallelprogramming language features by developing them as language extensions in the ableC [4] extensible C compiler framework. By implementing new paralle...
详细信息
ISBN:
(纸本)9781450362252
In this tutorial participants learn how to build their own parallelprogramming language features by developing them as language extensions in the ableC [4] extensible C compiler framework. By implementing new parallelprogramming abstractions as language extensions one can build on an existing host language and thus avoid re-implementing common language features such as the type checking and code generation of arithmetic expressions and control flow statements. Using ableC, one can build expressive language features that fit seamlessly into the C11 host language.
ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced ...
详细信息
ISBN:
(纸本)9781581135886
ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user transparent communication layer, We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP, MPI implementations and/or a combination thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC and IBM RS6000 SP cluster.
Standard cache-oblivious recursive divide-and-conquer algorithms for evaluating dynamic programming recurrences have optimal serial cache complexity but often have lower parallelism compared with iterative wavefront a...
详细信息
ISBN:
(纸本)9781450344937
Standard cache-oblivious recursive divide-and-conquer algorithms for evaluating dynamic programming recurrences have optimal serial cache complexity but often have lower parallelism compared with iterative wavefront algorithms due to artificial dependencies among subtasks. Very recently cache-oblivious recursive wavefront (COW) algorithms have been introduced which do not have any artificial dependencies. Though COW algorithms are based on fork-join primitives, they extensively use atomic operations, and as a result, performance guarantees provided by state-of-the-art schedulers for programs with fork-join primitives do not apply. In this work, we show how to systematically transform standard cache-oblivious recursive divide-and-conquer algorithms into recursive wavefront algorithms to achieve optimal parallel cache complexity and high parallelism under state-of-the-art schedulers for fork join programs. Unlike COW algorithms these new algorithms do not use atomic operations. Instead, they use closed-form formulas to compute at what time each recursive function must be launched in order to achieve high parallelism without losing cache performance. The resulting implementations are arguably much simpler than implementations of known COW algorithms.
暂无评论