In general, the performance of parallel graph processing is determined by three pairs of critical parameters, namely synchronous or asynchronous execution mode (Sync or Async), Push or Pull communication mechanism (Pu...
详细信息
ISBN:
(纸本)9781450362252
In general, the performance of parallel graph processing is determined by three pairs of critical parameters, namely synchronous or asynchronous execution mode (Sync or Async), Push or Pull communication mechanism (Push or Pull), and Data-driven or Topology-driven traversing scheme (DD or TD), which increases the complexity and sophistication of programming and system implementation of GPU. Existing graph-processing frameworks mainly use a single combination in the entire execution for a given application, but we have observed their variable and suboptimal performance. In this paper, we present SEP-Graph, a highly efficient software framework for graph-processing on GPU. the hybrid execution mode is automatically switched among three pairs of parameters, with an objective to achieve the shortest execution time in each iteration. We also apply a set of optimizations to SEP-Graph, considering the characteristics of graph algorithms and underlying GPU architectures. We show the effectiveness of SEP-Graph based on our intensive and comparative performance evaluation on NVIDIA 1080, P100, and V100 GPUs. Compared with existing and representative GPU graph-processing framework Groute and Gunrock, SEP-Graph can reduce execution time up to 45.8 times and 39.4 times.
this tutorial provides a hands-on introduction to quantum computing. It will feature the three pillars, architectures, programming, and algorithms/applications of quantum computing. Its focus is on the applicability o...
详细信息
ISBN:
(纸本)9781450362252
this tutorial provides a hands-on introduction to quantum computing. It will feature the three pillars, architectures, programming, and algorithms/applications of quantum computing. Its focus is on the applicability of problems to quantum computing from a practical point, with only the necessary foundational coverage of the physics and theoretical aspects to understand quantum computing. Simulation software will be utilized complemented by access to actual quantum computers to prototype problem solutions. this should develop a better understanding of how problems are transformed into quantum algorithms and what programming language support is best suited for a given application area. As a first of its kind, to the best of our knowledge, the tutorial includes hands-on programming experience with IBM Q and D-Wave hardware.
Traditional concurrent programming involves manipulating shared mutable state. Alternatives to this programming style are communicating sequential processes (CSP) [1] and actor [2] models, which share data via explici...
详细信息
this paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works on determinacy race detection have mostly focus...
详细信息
ISBN:
(纸本)9781450362252
this paper addresses the problem of provably efficient and practically good on-the-fly determinacy race detection in task parallel programs that use futures. Prior works on determinacy race detection have mostly focused on either task parallel programs that follow a series-parallel dependence structure or ones with unrestricted use of futures that generate arbitrary dependences. In this work, we consider a restricted use of futures and show that we can detect races more efficiently than with general use of futures. Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags targets programs that use futures in a restricted fashion and runs in time O(T-1 alpha(m, n)), where T-1 is the sequential running time of the program, a is the inverse Ackermann's function, m is the total number of memory accesses, n is the dynamic count of places at which parallelism is created. Since a is a very slowly growing function (upper bounded by 4 for all practical purposes), it can be treated as a close-to-constant overhead. MultiBags+ is an extension of MultiBags that target programs with general use of futures. It runs in time O((T-1 + k(2))alpha(m, n)) where T-1, alpha, m and n are defined as before, and k is the number of future operations in the computation. We implemented both algorithms and empirically demonstrate their efficiency.
Ensemble Kalman filter (EnKF) is one of the most important methods for data assimilation, which is widely applied to the reconstruction of observed historical data for providing initial conditions of numerical atmosph...
详细信息
ISBN:
(纸本)9781450362252
Ensemble Kalman filter (EnKF) is one of the most important methods for data assimilation, which is widely applied to the reconstruction of observed historical data for providing initial conditions of numerical atmospheric and oceanic models. Withthe improvement of data resolution and the increase in the amount of model data, the scalability of recent parallel implementations suffers from high overhead on data transfer. In this paper, we propose, S-EnKF: a scalable and distributed EnKF adaptation for modern clusters. With an in-depth analysis of new requirements brought forward by recent frameworks and limitations of current designs, we present a co-design of S-EnKF. For fully exploiting the resources available in modern parallel file systems, we design a concurrent access approach to accelerate the process of reading large amounts of background data. through a deeper investigation of the data dependence relations, we modify EnKF's workflow to maximize the overlap of file reading and local analysis with a new multi-stage computation approach. Furthermore, we push the envelope of performance further with aggressive co-design of auto-tuning through tradeoff between the benefit on runtime and the cost on processors based on classic cost models. the experimental evaluation of S-EnKF demonstrates nearly ideal strong scalability on up to 12,000 processors. the largest run sustains a performance of 3x-speedup compared with P-EnKF, which represents the state-of-art parallel implementation of EnKF.
the use of futures provides a flexible way to express parallelism and can generate arbitrary dependences among parallel subcomputations. the additional flexibility that futures provide comes with a cost, however. When...
详细信息
ISBN:
(纸本)9781450362252
the use of futures provides a flexible way to express parallelism and can generate arbitrary dependences among parallel subcomputations. the additional flexibility that futures provide comes with a cost, however. When scheduled using classic work stealing, a program with futures, compared to a program that uses only fork-join parallelism, can incur a much higher number of "deviations," a metric for evaluating the performance of parallel executions. All prior works assume a parsimonious work-stealing scheduler, however, where a worker thread (surrogate of a processor) steals work only when its local deque becomes empty. In this work, we investigate an alternative scheduling approach, called ProWS, where the workers perform proactive work stealing when handling future operations. We show that ProWS, for programs that use futures, can provide provably efficient execution time and equal or better bounds on the number of deviations compared to classic parsimonious work stealing. Given a computation with T-1 work and T-infinity span, ProWS executes the computation on P processors in expected time O(T-1/P + T-infinity lg P), with an additional lg P overhead on the span term compared to the parsimonious variant. For structured use of futures, where each future is single touch with no race on the future handle, the algorithm incurs O(PT infinity 2) deviations, matching that of the parsimonious variant. For general use of futures, the algorithm incurs O(m(k)T(infinity) + PT infinity lg P) deviations, where m(k) is the maximum number of future touches that are logically parallel. Compared to the bound for the parsimonious variant, O(kT(infinity) + PT infinity), with k being the total number of touches in the entire computation, this bound is better assuming m(k) = Omega(P lg P) and is smaller than k, which holds true for all the benchmarks we examined.
this work proposes Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO balances edges and vertices for graphs with a power-law degree distr...
详细信息
Recent studies have shown promising performance benefits of pipelined stencil applications. An important factor for the computing eficiency of such pipelines is the granularity of a task. We presents GOPipe, the first...
详细信息
暂无评论