this paper proposes an efficient parallel algorithm for an important class of dynamic programming problems that includes Viterbi, Needleman-Wunsch, Smith-Waterman, and Longest Common Subsequence. In dynamic programmin...
详细信息
Blue Gene/Q (BG/Q) is an early representative of increasing scale and thread count that will characterize future HPC systems: large counts of nodes, cores, and threads; and a rich programming environment with many deg...
详细信息
ISBN:
(纸本)9781450319225
Blue Gene/Q (BG/Q) is an early representative of increasing scale and thread count that will characterize future HPC systems: large counts of nodes, cores, and threads; and a rich programming environment with many degrees of freedom in parallel computing optimization. So it is both a challenge and an opportunity to it to accelerate the seismic imaging applications to the unprecedented levels that will significantly advance the technologies for the oil and gas industry. In this work we aim to address two important questions: how HPC systems with high levels of scale and thread count will perform in real applications; and how systems with many degrees of freedom in parallelprogramming can be calibrated to achieve optimal performance. Based on BG/Q's architecture features and RTM workload characteristics, we developed massive domain partition, MPI , and SIMD Our detailed deep analyses in various aspects of optimization also provide valuable experience and insights into how can be utilized to facilitate the advance of seismic imaging technologies. Our BG/Q RTM solution achieved a 14.93x speedup over the BG/P implementation. Our multi-level parallelism strategies for Reverse Time Migration (RTM) seismic imaging computing on BG/Q provides an example of how HPC systems like BG/Q can accelerate applications to a new level.
this paper proposes a new parallel execution model where programmers augment a sequential program with pieces of code called serializers that dynamically map computational operations into serialization sets of depende...
详细信息
ISBN:
(纸本)9781605583976
this paper proposes a new parallel execution model where programmers augment a sequential program with pieces of code called serializers that dynamically map computational operations into serialization sets of dependent operations. A runtime system executes operations in the same serialization set in program order, and may concurrently execute operations in different sets. Because serialization sets establish a logical ordering on all operations, the resulting parallel execution is predictable and deterministic. We describe the API and design of Prometheus, a C++ library that implements the serialization set abstraction through compile-time template instantiation and a runtime support library. We evaluate a set of parallel programs running on the x86_64 and SPARC-V9 instruction sets and study their performance on multi-core, symmetric multiprocessor, and ccNUMA parallel machines. By contrast with conventional parallel execution models, we find that Prometheus programs are significantly easier to write, test, and debug, and their parallel execution achieves comparable performance.
this paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based...
详细信息
ISBN:
(纸本)9781450332057
this paper proposes a novel SPMD programming model of OpenACC. Our model integrates the different granularities of parallelism from vector-level parallelism to node-level parallelism into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model whether they are in shared or distributed memory systems. We implement a prototype of our model and evaluate its performance with a GPU-based supercomputer using three benchmark applications.
State-of-the-art MPI libraries rely on locks to guarantee thread-safety. this discourages application developers from using multiple threads to perform MPI operations. In this paper, we propose a high performance, loc...
详细信息
ISBN:
(纸本)9781450326568
State-of-the-art MPI libraries rely on locks to guarantee thread-safety. this discourages application developers from using multiple threads to perform MPI operations. In this paper, we propose a high performance, lock-free multiendpoint MPI runtime, which can achieve up to 40% improvement for point-to-point operation and one representative collective operation with minimum or no modifications to the existing applications.
We introduce a new approach to automatically extract an idealized logical structure from a parallel execution trace. We use this structure to define intuitive metrics such as the lateness of a process involved in a pa...
详细信息
ISBN:
(纸本)9781450326568
We introduce a new approach to automatically extract an idealized logical structure from a parallel execution trace. We use this structure to define intuitive metrics such as the lateness of a process involved in a parallel execution. By analyzing and illustrating traces in terms of logical steps, we leverage a developer's understanding of the happened-before relations in a parallel program. this technique can uncover dependency chains, elucidate communication patterns, and highlight sources and propagation of delays, all of which may be obscured in a traditional trace visualization.
this paper evaluates features of graph coloring algorithms implemented on graphics processing units (GPUs), comparing coloring heuristics and thread decompositions. As compared to prior work on graph coloring for othe...
详细信息
ISBN:
(纸本)9781450301190
this paper evaluates features of graph coloring algorithms implemented on graphics processing units (GPUs), comparing coloring heuristics and thread decompositions. As compared to prior work on graph coloring for other parallel architectures, we find that the large number of cores and relatively high global memory bandwidth of a GPU lead to different strategies for the parallel implementation. Specifically, we find that a simple uniform block partitioning is very effective on GPUs and our parallel coloring heuristics lead to the same or fewer colors than prior approaches for distributed-memory cluster architecture. Our algorithm resolves many coloring conflicts across partitioned blocks on the GPU by iterating through the coloring process, before returning to the CPU to resolve remaining conflicts. Withthis approach we get as few color (if not fewer) than the best sequential graph coloring algorithm and performance is close to the fastest sequential graph coloring algorithms which have poor color quality.
Many low-level optimizations for NVIDIA GPU can only be implemented in native hardware assembly (SASS). However, programming in SASS is unproductive and not portable. To simplify low-level GPU programming, we present ...
详细信息
Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduli...
详细信息
ISBN:
(纸本)9781450368186
Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. this work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO). It demonstrates a set of reference tensor kernel implementations and some observations on Intel CPUs and NVIDIA GPUs. the full paper can be referred to at http://***/abs/2001.00660.
In this paper, we propose a straightforward solution to the problems of compositional parallelprogramming by using skeletons as the uniform mechanism for structured composition. In our approach parallel programs are ...
详细信息
In this paper, we propose a straightforward solution to the problems of compositional parallelprogramming by using skeletons as the uniform mechanism for structured composition. In our approach parallel programs are constructed by composing procedures in a conventional base language using a set of high-level, predefined, functional, parallel computational forms known as skeletons. the ability to compose skeletons provides us withthe essential tools for building further and more complex application-oriented skeletons specifying important aspects of parallel computation. Compared withthe process network based composition approach, such as PCN, the skeleton approach abstracts away the fine details of connecting communication ports to the higher level mechanism of making data distributions conform, thus avoiding the complexity of using lower level ports as the means of interaction. thus, the framework provides a natural integration of the compositional programming approach withthe data parallelprogramming paradigm.
暂无评论