Traditional debuggers are of limited value for modern scientific codes that manipulate large complex data structures. this paper discusses a novel debug-time assertion, called a "Statistical Assertion", that...
详细信息
ISBN:
(纸本)9781450311601
Traditional debuggers are of limited value for modern scientific codes that manipulate large complex data structures. this paper discusses a novel debug-time assertion, called a "Statistical Assertion", that allows a user to reason about large data structures, and the primitives are parallelised to provide an efficient solution. We present the design and implementation of statistical assertions, and illustrate the debugging technique with a molecular dynamics simulation. We evaluate the performance of the tool on a 12,000 cores Cray XE6.
the proceedings contain 19 papers. the topics discussed include: symbolic evaluation graphs and term rewriting a general methodology for analyzing logic programs;automatic synthesis of specifications for first order c...
ISBN:
(纸本)9781450315227
the proceedings contain 19 papers. the topics discussed include: symbolic evaluation graphs and term rewriting a general methodology for analyzing logic programs;automatic synthesis of specifications for first order curry programs;goal-directed execution of answer set programs;layered fixed point logic;a polynomial time ?-calculus with multithreading and side effects;modeling datalog fact assertion and retraction in linear logic;regular expression sub-matching using partial derivatives;functional semantics of parsing actions, and left recursion elimination as continuation passing;linear dependent types in a call-by-value scenario;transparent function types: clearing up opacity;exception handling for copyless messaging;declarative distributed advertisement system for iDTV: an industrial experience;task-oriented programming in a pure functional language;and a linear concurrent constraint approach for the automatic verification of access permissions.
the proceedings contain 39 papers. the topics discussed include: ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms;programmingthe memory hierarchy revisited: supporting ir...
ISBN:
(纸本)9781450301190
the proceedings contain 39 papers. the topics discussed include: ordered vs. unordered: a comparison of parallelism and work-efficiency in irregular algorithms;programmingthe memory hierarchy revisited: supporting irregular parallelism in sequoia;compact data structure and scalable algorithms for the sparse grid technique;a domain-specific approach to heterogeneous parallelism;Copperhead: compiling an embedded data parallel language;OoOJava: software out-of-order execution;SpiceC: scalable parallelism via implicit copying and explicit commit;inferring ownership transfer for efficient message passing;all-window profiling and composable models of cache sharing;ULCC: a user-level facility for optimizing shared cache performance on multicores;ScalaExtrap: trace-based communication extrapolation for SPMD programs;and GRace: a low-overhead mechanism for detecting data races in GPU programs.
Traditional debuggers are of limited value for modern scientific codes that manipulate large complex data structures. this paper discusses a novel debug-time assertion, called a "Statistical Assertion", that...
详细信息
Traditional debuggers are of limited value for modern scientific codes that manipulate large complex data structures. this paper discusses a novel debug-time assertion, called a "Statistical Assertion", that allows a user to reason about large data structures, and the primitives are parallelised to provide an efficient solution. We present the design and implementation of statistical assertions, and illustrate the debugging technique with a molecular dynamics simulation. We evaluate the performance of the tool on a 12,000 cores Cray XE6.
Domain-expert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallelprogramming to achieve it. Since such programmers ar...
详细信息
ISBN:
(纸本)9781450311601
Domain-expert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallelprogramming to achieve it. Since such programmers are rare, to maximize reuse of their work we propose encapsulating their strategies in mini-compilers for domain-specific embedded languages (DSELs) glued together by a common high-level host language familiar to productivity programmers. the nontrivial applications that use these DSELs perform up to 98% of peak attainable performance, and comparable to or better than existing hand-coded implementations. Our approach is unique in that each mini-compiler not only performs conventional compiler transformations and optimizations, but includes imperative procedural code that captures an efficiency expert's strategy for mapping a narrow domain onto a specific type of hardware. the result is source-and performance-portability for productivity programmers and parallel performance that rivals that of hand-coded efficiency-language implementations of the same applications. We describe a framework that supports our methodology and five implemented DSELs supporting common computation kernels. Our results demonstrate that for several interesting classes of problems, efficiency-level parallel performance can be achieved by packaging efficiency programmers' expertise in a reusable framework that is easy to use for both productivity programmers and efficiency programmers.
the proliferation of data in graph form calls for the development of scalable graph algorithms that exploit parallel processing environments. One such problem is the computation of a graph's minimum spanning fores...
详细信息
Existing concurrency platforms for dynamic multithreading do not provide repeatable parallel random-number generators. this paper proposes that a mechanism called pedigrees be built into the runtime system to enable e...
详细信息
ISBN:
(纸本)9781450311601
Existing concurrency platforms for dynamic multithreading do not provide repeatable parallel random-number generators. this paper proposes that a mechanism called pedigrees be built into the runtime system to enable efficient deterministic parallel random-number generation. Experiments withthe open-source MIT Cilk runtime system show that the overhead for maintaining pedigrees is negligible. Specifically, on a suite of 10 benchmarks, the relative overhead of Cilk with pedigrees to the original Cilk has a geometric mean of less than 1%. We persuaded Intel to modify its commercial C/C++ compiler, which provides the Cilk Plus concurrency platform, to include pedigrees, and we built a library implementation of a deterministic parallel random-number generator called DOTMIX that compresses the pedigree and then "RC6-mixes" the result. the statistical quality of DOTMIX is comparable to that of the popular Mersenne twister, but somewhat slower than a nondeterministic parallel version of this efficient and high-quality serial random-number generator. the cost of calling DOTMIX depends on the "spawn depth" of the invocation. For a naive Fibonacci calculation with n = 40 that calls DOTMIX in every node of the computation, this "price of determinism" is about a factor of 2.3 in running time over the nondeterministic Mersenne twister, but for more realistic applications with less intense use of random numbers - such as a maximal-independent-set algorithm, a practical samplesort program, and a Monte Carlo discrete-hedging application from QuantLib - the observed "price" was at most 2 1%, and sometimes much less. Moreover, even if overheads were several times greater, applications using DOTMIX should be amply fast for debugging purposes, which is a major reason for desiring repeatability.
the virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that...
详细信息
ISBN:
(纸本)9781450311601
the virtues of deterministic parallelism have been argued for decades and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. this has been referred to as internal determinism, and implies a sequential semantics-i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages including ease of reasoning about the code, ease of verifying correctness, ease of debugging, ease of defining invariants, ease of defining good coverage for testing, and ease of formally, informally and experimentally reasoning about performance. On the other hand one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural or special purpose and/or (ii) slower or less scalable. In this paper we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performan
the proliferation of data in graph form calls for the development of scalable graph algorithms that exploit parallel processing environments. One such problem is the computation of a graph's minimum spanning fores...
详细信息
the proliferation of data in graph form calls for the development of scalable graph algorithms that exploit parallel processing environments. One such problem is the computation of a graph's minimum spanning forest (MSF). Past research has proposed several parallel algorithms for this problem, yet none of them scales to large, high-density graphs. In this paper we propose a novel, scalable, parallel MSF algorithm for undirected weighted graphs. Our algorithm leverages Prim's algorithm in a parallel fashion, concurrently expanding several subsets of the computed MSF. Our effort focuses on minimizing the communication among different processors without constraining the local growth of a processor's computed subtree. In effect, we achieve a scalability that previous approaches lacked. We implement our algorithm in CUDA, running on a GPU and study its performance using real and synthetic, sparse as well as dense, structured and unstructured graph data. Our experimental study demonstrates that our algorithm outperforms the previous state-of-the-art GPU-based MSF algorithm, while being several order of magnitude faster than sequential CPU-based algorithms.
暂无评论