ISBN (print): 9781538683859
Two concurrent accesses to a shared variable that are unordered by synchronization are said to be a data race if at least one access is a write. Data races cause shared memory parallel programs to behave unpredictably. This paper describes ROMP - a tool for detecting data races in executions of scalable parallel applications that employ OpenMP for node-level parallelism. The complexity of OpenMP, which includes primitives for managing data environments, SPMD and SIMD parallelism, work sharing, tasking, mutual exclusion, and ordering, presents a formidable challenge for data race detection. ROMP is a hybrid data race detector that tracks accesses, access orderings and mutual exclusion. Unlike other OpenMP race detectors, ROMP detects races with respect to concurrency rather than implementation threads. Experiments show that ROMP yields precise race reports for a broader set of OpenMP constructs than prior state-of-the-art race detectors.
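A minimal sketch (not taken from the paper) of the kind of OpenMP race ROMP is designed to flag: two logically concurrent accesses to the same variable, at least one of them a write, with no synchronization ordering them.

```c
/* Racy OpenMP loop: every iteration writes sum without synchronization. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum = 0;
    #pragma omp parallel for       /* data race: concurrent writes to sum */
    for (int i = 0; i < 1000; i++)
        sum += i;                  /* fix: add reduction(+:sum) or atomic */
    printf("%d\n", sum);
    return 0;
}
```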
ISBN (print): 9781538683859
Collective operations are used in MPI programs to express common communication patterns, collective computations, or synchronization. In many collectives, such as MPI_Allreduce, the intra-node component of the collective lies on the critical path, as the inter-node communication cannot start until the intra-node component has completed. With increasing core counts per node, intra-node optimizations that leverage shared memory become more important. In this paper, we focus on the performance benefit of optimizing intra-node collectives using POSIX shared memory for synchronization and data sharing. We implement several collectives using basic primitives or steps as building blocks. Key components of our implementation include a dedicated intra-node collectives layer, careful layout of the data structures, and optimizations that exploit the memory hierarchy to balance parallelism against the latencies of data movement. A comparison of our implementation on top of MPICH shows significant speedups with respect to the original MPICH implementation, MVAPICH, and OpenMPI.
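A rough sketch (not the paper's implementation) of how the intra-node component of an allreduce can be separated out: split MPI_COMM_WORLD into per-node shared-memory communicators, reduce within each node first, and let the node leaders perform the inter-node step.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int val = 1, node_sum = 0, total = 0;

    /* Per-node communicator: ranks that can share memory. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* Intra-node step: this is the part that lies on the critical path. */
    MPI_Reduce(&val, &node_sum, 1, MPI_INT, MPI_SUM, 0, node_comm);

    /* Inter-node step among node leaders (rank 0 of each node_comm). */
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   0, &leaders);
    if (leaders != MPI_COMM_NULL) {
        MPI_Allreduce(&node_sum, &total, 1, MPI_INT, MPI_SUM, leaders);
        MPI_Comm_free(&leaders);
    }

    /* Distribute the global result back within each node. */
    MPI_Bcast(&total, 1, MPI_INT, 0, node_comm);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```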
Integer sorting on multicores and GPUs can be realized by a variety of approaches that include variants of distribution-based methods such as radix-sort, comparison-oriented algorithms such as deterministic regular sa...
Due to the end of Moore's law clock-frequency scaling and of Dennard scaling, we are reaching crippling limits with our current von Neumann processor paradigms. Help is sought from both technology and architecture to innovate and engender new processing paradigms that can overcome those limitations and define the future of computing. New ideas and directions range from neuromorphic processors to analog computing, memristors, quantum computing, and nanophotonics. This talk will examine a number of these emerging directions, including work by the community and by our group, and evaluate some of the associated implications for the future of computing.
ISBN (print): 9781538683859
A performance vs. practicality trade-off exists between user-level threading techniques. The community has settled mostly on a black-and-white perspective: fully fledged threads assume that suspension is imminent and incur overheads when suspension does not take place, while run-to-completion threads are more lightweight but less practical since they cannot suspend. Gray areas exist, however, whereby threads can start with minimal capabilities and then be dynamically promoted to acquire additional capabilities when needed. This paper investigates the full spectrum of threading techniques from a performance vs. practicality trade-off perspective on modern multicore and many-core systems. Our results indicate that achieving the best trade-off depends strongly on the suspension likelihood: dynamic promotion is more appropriate when suspension is unlikely and represents a solid replacement for run-to-completion, thanks to its fewer programming constraints, while fully fledged threads remain the technique of choice when suspension likelihood is high.
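As a rough illustration (not the paper's runtime, and showing only the two endpoints rather than on-demand promotion), a task can run as a plain function call when it will never suspend, and be given its own stack and context (here via POSIX ucontext) when it must be able to yield; needs_suspend and task_fn are placeholders for the runtime's decision and workload.

```c
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t sched_ctx, task_ctx;
static int needs_suspend = 1;          /* assumption: set by the runtime */

static void task_fn(void) {
    puts("task: doing work");
    if (needs_suspend) {
        puts("task: suspending (fully fledged path)");
        swapcontext(&task_ctx, &sched_ctx);   /* yield back to scheduler */
    }
    puts("task: done");
}

int main(void) {
    if (!needs_suspend) {              /* run-to-completion: a plain call */
        task_fn();
        return 0;
    }
    /* Promotion: give the task its own stack so it can suspend. */
    char *stack = malloc(64 * 1024);
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = stack;
    task_ctx.uc_stack.ss_size = 64 * 1024;
    task_ctx.uc_link = &sched_ctx;
    makecontext(&task_ctx, task_fn, 0);
    swapcontext(&sched_ctx, &task_ctx); /* run until the task yields */
    puts("scheduler: task suspended, resuming it");
    swapcontext(&sched_ctx, &task_ctx); /* resume until completion */
    free(stack);
    return 0;
}
```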
Determining the genome of a new species remains one of the most crucial tasks in molecular biology. To that end, de novo sequence assembly draws on the vast amount of data provided by Next-Generation Sequencing technology. Genome assemblers therefore demand substantial computational resources, and parallel implementations of these assemblers are readily available. This paper presents a comparison of three well-known de novo genome assemblers: Velvet, ABySS, and SOAPdenovo, all of which use de Bruijn graphs and have a parallel implementation. We base our analysis on parallel execution time, scalability, quality of assembly, and sensitivity to the choice of a critical parameter (the k-mer size). We found that one of the tools clearly stands out, providing faster execution and better output quality. All assemblers are mildly sensitive to the choice of k-mer size, and all show limited scalability. We expect the findings of this paper to guide the development of new algorithms and tools for scalable parallel genome sequence assembly.
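For context, a minimal sketch (not from the paper) of the first step all three assemblers share: cutting reads into k-mers, which become the nodes of a de Bruijn graph with edges linking consecutive, overlapping k-mers. The read string and k value below are illustrative only.

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *read = "ACGTACGGT";    /* assumption: one short read */
    int k = 4;                         /* assumption: k-mer size */
    size_t n = strlen(read);
    for (size_t i = 0; i + k <= n; i++) {
        printf("%.*s", k, read + i);                 /* node: k-mer      */
        if (i + k < n)
            printf(" -> %.*s", k, read + i + 1);     /* edge to next one */
        printf("\n");
    }
    return 0;
}
```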
ISBN (print): 9781538683859
Non-volatile memory (NVM) provides a scalable solution to replace DRAM as main memory. Because of the relatively high latency and low bandwidth of NVM compared with DRAM, NVM is often paired with DRAM to build a heterogeneous main memory system (HMS). Deciding data placement on an NVM-based HMS is critical to enabling future NVM-based HPC. In this paper, we study task-parallel programs and introduce a runtime system to address the data placement problem on NVM-based HMS. Leveraging the semantics and execution mode of task-parallel programs, we efficiently characterize the memory access patterns of tasks and reduce data movement overhead. We also introduce a performance model to predict the performance of tasks under various data placements on the HMS. Evaluating with a set of HPC benchmarks, we show that our runtime system achieves higher performance than a conventional HMS-oblivious runtime (24% improvement on average) and two state-of-the-art HMS-aware solutions (16% and 11% improvement on average, respectively).
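A minimal sketch of placement-aware allocation on a DRAM+NVM node using the memkind library; this is not the runtime described in the paper, and is_bandwidth_bound() stands in for the paper's task characterization. It assumes the NVM is exposed to the OS as a KMEM DAX NUMA node.

```c
#include <memkind.h>
#include <stddef.h>

/* Place bandwidth-sensitive task data in DRAM, the rest in NVM. */
void *place_task_data(size_t bytes, int is_bandwidth_bound) {
    memkind_t kind = is_bandwidth_bound ? MEMKIND_DEFAULT    /* DRAM */
                                        : MEMKIND_DAX_KMEM;  /* NVM  */
    return memkind_malloc(kind, bytes);
    /* Release later with memkind_free(NULL, ptr); the kind is detected. */
}
```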
ISBN (print): 9781538655566; 9781538655559
Provides an abstract of the keynote presentation and may include a brief professional biography of the presenter. The complete presentation was not made available for publication as part of the conference proceedings.
ISBN (print): 9781538655566; 9781538655559
Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record.
ISBN (print): 9781450344937
Task-based programming offers an elegant way to express units of computation and the dependencies among them, making it easier to distribute the computational load evenly across multiple cores. However, this separation of problem decomposition and parallelism requires a sufficiently large input problem to achieve satisfactory efficiency on a given number of cores. Unfortunately, finding a good match between input size and core count usually requires significant experimentation, which is expensive and sometimes even impractical. In this paper, we propose an automated empirical method for finding the isoefficiency function of a task-based program, binding efficiency, core count, and input size in one analytical expression. This allows the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we find not only (i) the actual isoefficiency function but also (ii) the function one would obtain if the program execution were free of resource contention and (iii) an upper bound that could only be reached if the program were able to maintain its average parallelism throughout its execution. The differences between the three help to explain low efficiency and, in particular, to differentiate between resource contention and structural conflicts related to task dependencies or scheduling. The insights gained can be used to co-design programs and shared system resources.
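For reference, the textbook relationships behind isoefficiency analysis are sketched below; the paper fits an empirical variant of this, and the notation here is standard rather than the authors'.

```latex
% Parallel efficiency on p cores for input size n, and the isoefficiency
% function f(p): the smallest input that sustains a target efficiency E_0.
\[
  E(p, n) \;=\; \frac{T(1, n)}{p \, T(p, n)},
  \qquad
  f(p) \;=\; \min \{\, n \;:\; E(p, n) \ge E_0 \,\}.
\]
```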