This paper considers the problem of formal verification of MPI programs operating under a fixed test harness for safety properties without building verification models. In our approach, we directly model-check the MPI/C source code, executing its interleavings with the help of a verification scheduler. Unfortunately, the total feasible number of interleavings is exponential, and impractical to examine even for our modest goals. Our earlier publications formalized and implemented a partial order reduction approach that avoided exploring equivalent interleavings, and presented a verification tool called ISP. This paper presents algorithmic and engineering innovations to ISP, including the use of OpenMP parallelization, that now enables it to handle practical MPI programs, including: (i) ParMETIS - a widely used hypergraph partitioner, and (ii) MADRE - a Memory Aware Data Re-distribution Engine, both developed outside our group. Over these benchmarks, ISP has automatically verified up to 14K lines of MPI/C code, producing error traces of deadlocks and assertion violations within seconds.
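A back-of-the-envelope sketch of why reduction is essential (a toy illustration, not ISP's actual algorithm): two processes issuing n and m independent operations admit C(n+m, n) interleavings, yet when every cross-process pair of operations commutes, all of those interleavings are equivalent and a partial-order-reduction scheduler needs to run only one representative.

```c
/* Number of interleavings of two independent operation sequences of
 * lengths n and m: the binomial coefficient C(n+m, n). When all
 * cross-process operation pairs commute, a POR scheduler explores
 * exactly 1 of these schedules instead of all of them. */
unsigned long long interleavings(int n, int m) {
    unsigned long long r = 1;
    for (int i = 1; i <= n; i++)
        r = r * (unsigned long long)(m + i) / i;  /* division is exact at each step */
    return r;
}
```

For two 10-operation processes this is already 184,756 schedules, versus the single representative explored under the reduction.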
ISBN (Print): 9781605583976
Assuming that the multicore revolution plays out the way the microprocessor industry expects, it seems that within a decade most programming will involve parallelism at some level. One needs to ask how this affects the way we teach computer science, or even how we have people think about computation. With regard to teaching there seem to be three basic choices: (1) we only train a small number of experts in parallel computation who develop a collection of libraries, and everyone else just uses them; (2) we leave our core curriculum pretty much as is, but add some advanced courses on parallelism or perhaps tack on a few lectures at the end of existing courses; or (3) we start teaching parallelism from the start and embed it throughout the curriculum, with the idea of getting students to think about parallelism as the most natural form of computation and sequential computation as a special case. This talk will examine some of the implications of the third option. It will argue that thinking about parallelism, when treated in an appropriate way, might be as easy as or easier than thinking sequentially. A key prerequisite, however, is to identify what the core ideas in parallelism are and how they might be layered and integrated with existing concepts. Another more difficult issue is how to cleanly integrate these ideas among courses. After all, much of the success of sequential computation follows from the concept of a random access machine and its ability to serve as a simple, albeit imperfect, interface between programming languages, algorithm analysis, and hardware design. The talk will go through an initial list of some core ideas in parallelism, and an approach to integrating these ideas between parallel algorithms, programming languages, and, to some extent, hardware. This requires, however, moving away from the concept of a machine model as an interface for thinking about computation.
ISBN (Print): 9781595939609
This paper proposes an optimization method of data saving for application-level checkpointing of MPI programs, based on live-variable analysis. We present the implementation of a source-to-source precompiler (CAC) that automates application-level checkpointing using this optimization method. Experiments show that CAC is capable of automating application-level checkpointing correctly and reducing checkpoint data effectively.
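The payoff of the analysis can be sketched in plain C. The variable names below are hypothetical, and the live-variable analysis itself (which a precompiler like CAC performs at compile time) is assumed rather than shown: suppose the analysis proves that only `iter` and `b` are read after the checkpoint location, so the dead `scratch` array need not be saved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical program state at a checkpoint location. Live-variable
 * analysis (assumed, not shown) says only `iter` and `b` are live. */
typedef struct {
    long   iter;
    double b[4];
    double scratch[1024];   /* dead after the checkpoint: never read again */
} state;

/* Naive checkpoint: save the entire state. */
size_t ckpt_full(const state *s, char *buf) {
    memcpy(buf, s, sizeof *s);
    return sizeof *s;
}

/* Optimized checkpoint: save only the live variables. */
size_t ckpt_live(const state *s, char *buf) {
    size_t off = 0;
    memcpy(buf + off, &s->iter, sizeof s->iter); off += sizeof s->iter;
    memcpy(buf + off, s->b, sizeof s->b);        off += sizeof s->b;
    return off;
}

/* Restore must read fields back in the same order they were saved. */
void restore_live(state *s, const char *buf) {
    size_t off = 0;
    memcpy(&s->iter, buf + off, sizeof s->iter); off += sizeof s->iter;
    memcpy(s->b, buf + off, sizeof s->b);
}
```

Here the optimized checkpoint is tens of bytes instead of kilobytes; on real MPI applications the dead data is typically large intermediate arrays.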
ISBN (Print): 9781595939609
As high-performance computing enters the mainstream, parallel programming mechanisms (including the Message Passing Interface, or MPI) must be supported in new environments such as C# and the Common Language Infrastructure (CLI). Making effective use of MPI with the CLI requires an interface that reflects the high-level object-oriented nature of C# and that also supports its programming idioms. However, for performance reasons, this high-level functionality must ultimately be mapped to low-level native MPI libraries. In addition to abstraction penalty concerns, avoiding unwanted overhead in this mapping process is significantly complicated by the safety and portability features of the CLI virtual machine, such as garbage collection and just-in-time compilation. In this paper, we describe our approach to using features of C# and the CLI (such as reflection, unsafe code regions, and run-time code generation) to realize an elegant, yet highly efficient, C# interface to MPI. Experimental results demonstrate that there is no appreciable overhead introduced by our approach when compared to the native MS-MPI library.
ISBN (Print): 9781595939609
Low-Density Parity-Check (LDPC) codes are powerful error correcting codes (ECC). They have recently been adopted by several data communication standards such as DVB-S2 and WiMax. LDPCs are represented by bipartite graphs, also called Tanner graphs, and their decoding demands very intensive computation. For that reason, VLSI dedicated architectures have been investigated and developed over the last few years. This paper proposes a new approach for LDPC decoding on graphics processing units (GPUs). Efficient data structures and a new algorithm are proposed to represent the Tanner graph and to perform LDPC decoding according to the stream-based computing model. GPUs were programmed to efficiently implement the proposed algorithms by applying data-parallel intensive computing. Experimental results show that GPUs perform LDPC decoding nearly three orders of magnitude faster than modern CPUs. Moreover, they lead to the conclusion that GPUs with their tremendous processing power can be considered as a consistent alternative to state-of-the-art hardware LDPC decoders.
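The Tanner-graph computation pattern can be conveyed with a hard-decision bit-flipping decoder over a toy parity-check matrix. This is a simplification on two counts: the paper's GPU kernels implement soft-decision message passing, and the small (7,4) matrix below is illustrative, not a standards-grade LDPC matrix.

```c
#define NCHK 3
#define NVAR 7

/* Toy (7,4) parity-check matrix: rows = check nodes, columns = variable
 * nodes; H[c][v] = 1 means an edge in the Tanner graph. */
static const int H[NCHK][NVAR] = {
    {1, 1, 1, 0, 1, 0, 0},
    {1, 1, 0, 1, 0, 1, 0},
    {1, 0, 1, 1, 0, 0, 1},
};

/* Gallager-style hard-decision bit flipping: each iteration computes the
 * syndrome, then flips every bit that participates in the maximum number
 * of unsatisfied checks. Returns 1 once all checks are satisfied. */
int bitflip_decode(int bits[NVAR], int max_iters) {
    for (int it = 0; it < max_iters; it++) {
        int syn[NCHK], unsatisfied = 0;
        for (int c = 0; c < NCHK; c++) {
            syn[c] = 0;
            for (int v = 0; v < NVAR; v++) syn[c] ^= H[c][v] & bits[v];
            unsatisfied += syn[c];
        }
        if (!unsatisfied) return 1;   /* valid codeword reached */

        /* Count unsatisfied checks per variable node (messages along edges). */
        int cnt[NVAR], bestcnt = 0;
        for (int v = 0; v < NVAR; v++) {
            cnt[v] = 0;
            for (int c = 0; c < NCHK; c++) cnt[v] += H[c][v] & syn[c];
            if (cnt[v] > bestcnt) bestcnt = cnt[v];
        }
        for (int v = 0; v < NVAR; v++)
            if (cnt[v] == bestcnt) bits[v] ^= 1;
    }
    return 0;   /* did not converge */
}
```

The check-node and variable-node loops are independent across nodes, which is exactly the structure that maps onto data-parallel GPU threads in the stream model.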
ISBN (Print): 9781595939609
Low-overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures with weakly to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218-based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.
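A minimal single-threaded sketch of FastForward's central idea: the slot contents themselves signal occupancy (NULL means empty), so the producer and consumer never read each other's index variable and the head/tail cache lines never ping-pong between cores. This is only the core invariant; the real queue additionally needs memory fences on weakly ordered machines and a "temporal slip" between producer and consumer, both omitted here.

```c
#include <stddef.h>

#define QSIZE 8   /* capacity; a power of two in practice */

typedef struct {
    void  *slot[QSIZE];  /* NULL = empty slot; non-NULL = pending item */
    size_t head;         /* consumer-private: next slot to read */
    size_t tail;         /* producer-private: next slot to write */
} ff_queue;

void ff_init(ff_queue *q) {
    for (size_t i = 0; i < QSIZE; i++) q->slot[i] = NULL;
    q->head = q->tail = 0;
}

/* Producer side. Returns 0 on success, -1 if the queue is full.
 * Fullness is detected from the slot itself, not from `head`. */
int ff_enqueue(ff_queue *q, void *item) {
    if (q->slot[q->tail] != NULL) return -1;  /* consumer hasn't drained it */
    q->slot[q->tail] = item;                  /* single store publishes item */
    q->tail = (q->tail + 1) % QSIZE;
    return 0;
}

/* Consumer side. Returns NULL if the queue is empty. */
void *ff_dequeue(ff_queue *q) {
    void *item = q->slot[q->head];
    if (item == NULL) return NULL;            /* emptiness from the slot too */
    q->slot[q->head] = NULL;                  /* single store frees the slot */
    q->head = (q->head + 1) % QSIZE;
    return item;
}
```

Because each side touches only its own index plus the slot array, the only shared cache traffic is the data itself, which is the property behind the reported ~28.5 ns operation times.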
ISBN (Print): 9781595939609
This paper investigates adding transactions with nested parallelism and nested transactions to a dynamically multithreaded parallel programming language that generates only series-parallel programs. We describe XConflict, a data structure that facilitates conflict detection for a software transactional memory system which supports transactions with nested parallelism and unbounded nesting depth. For languages that use a Cilk-like work-stealing scheduler, XConflict answers concurrent conflict queries in O(1) time and can be maintained efficiently. In particular, for a program with T1 work and a span (or critical-path length) of T∞, the running time on p processors of the program augmented with XConflict is only O(T1/p + pT∞). Using XConflict, we describe CWSTM, a runtime-system design for software transactional memory which supports transactions with nested parallelism and unbounded nesting depth of transactions. The CWSTM design provides transactional memory with eager updates, eager conflict detection, strong atomicity, and lazy cleanup on aborts. In the restricted case when no transactions abort and there are no concurrent readers, CWSTM executes a transactional computation on p processors also in time O(T1/p + pT∞). Although this bound holds only under rather optimistic assumptions, to our knowledge this result is the first theoretical performance bound on a TM system that supports transactions with nested parallelism which is independent of the maximum nesting depth of transactions.
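The query at the heart of conflict detection in a series-parallel computation can be sketched as follows: two memory accesses can be logically concurrent only if the least common ancestor of their strands in the SP parse tree is a P (parallel) node. The naive LCA walk below is O(depth) and purely illustrative; XConflict's contribution is auxiliary structures that answer the equivalent query in O(1) under a work-stealing execution.

```c
#include <stddef.h>

/* A node of the series-parallel (SP) parse tree: internal nodes are either
 * S (children execute one after another) or P (children may execute
 * concurrently); leaves are strands that perform memory accesses. */
typedef struct sp_node {
    const struct sp_node *parent;
    int is_parallel;              /* 1 = P node; 0 = S node or strand */
} sp_node;

static int sp_depth(const sp_node *n) {
    int d = 0;
    while (n->parent) { n = n->parent; d++; }
    return d;
}

static const sp_node *sp_lca(const sp_node *a, const sp_node *b) {
    int da = sp_depth(a), db = sp_depth(b);
    while (da > db) { a = a->parent; da--; }   /* level the deeper node */
    while (db > da) { b = b->parent; db--; }
    while (a != b)  { a = a->parent; b = b->parent; }
    return a;
}

/* Accesses from two strands may race only if the strands can run in
 * parallel, i.e. their LCA is a P node. */
int sp_may_run_in_parallel(const sp_node *a, const sp_node *b) {
    return sp_lca(a, b)->is_parallel;
}

/* Tiny demo: an S root with a P subtree over strands {a, b} and a serial
 * strand c. Returns 1 iff a||b is detected and a||c is correctly rejected. */
int sp_demo(void) {
    sp_node root = { NULL, 0 };            /* S node */
    sp_node p    = { &root, 1 };           /* P node */
    sp_node a = { &p, 0 }, b = { &p, 0 }, c = { &root, 0 };
    return sp_may_run_in_parallel(&a, &b) && !sp_may_run_in_parallel(&a, &c);
}
```

Layering transactions on top amounts to checking, in addition, whether the enclosing transactions of two conflicting accesses have committed, which is where the unbounded-nesting machinery of CWSTM comes in.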