ISBN (print): 9781450344449
Writing high-performance GPU implementations of graph algorithms can be challenging. In this paper, we argue that three optimizations, called throughput optimizations, are key to high performance for this application class. These optimizations describe a large implementation space, making it unrealistic for programmers to implement them by hand. To address this problem, we have implemented these optimizations in a compiler that produces CUDA code from an intermediate-level program representation called IrGL. Compared to state-of-the-art handwritten CUDA implementations of eight graph applications, code generated by the IrGL compiler is up to 5.95x faster (median 1.4x) for five applications and never more than 30% slower for the others. Throughput optimizations contribute an improvement of up to 4.16x (median 1.4x) to the performance of unoptimized IrGL code.
The Galois system can automatically parallelize irregular algorithms written in a serial programming model and execute them efficiently on nonuniform memory access (NUMA) machines. Experimental results for five complex irregular algorithms show that the system scales up to 420x on large NUMA systems at 512 threads.
ISBN (print): 9781450334686
We describe a system that uses automated planning to synthesize correct and efficient parallel graph programs from high-level algorithmic specifications. Automated planning allows us to use constraints to declaratively encode program transformations such as scheduling, implementation selection, and insertion of synchronization. Each plan emitted by the planner satisfies all constraints simultaneously, and corresponds to a composition of these transformations. In this way, we obtain an integrated compilation approach for a very challenging problem domain. We have used this system to synthesize parallel programs for four graph problems: triangle counting, maximal independent set computation, preflow-push maxflow, and connected components. Experiments on a variety of inputs show that the synthesized implementations perform competitively with hand-written, highly-tuned code.
ISBN (print): 9781479910212
Precise pointer analysis is a problem of interest to both the compiler and the program verification community. Flow-sensitivity is an important dimension of pointer analysis that affects the precision of the final result computed. Scaling flow-sensitive pointer analysis to millions of lines of code is a major challenge. Recently, staged flow-sensitive pointer analysis has been proposed, which exploits a sparse representation of program code created by a staged analysis. In this paper we formulate staged flow-sensitive pointer analysis as a graph-rewriting problem. Graph rewriting has already been used for flow-insensitive analysis; however, formulating flow-sensitive pointer analysis as a graph-rewriting problem adds additional challenges due to the nature of flow-sensitivity. We implement our parallel algorithm using Intel Threading Building Blocks and demonstrate considerable scaling (up to 2.6x) for 8 threads on a set of 10 benchmarks. Compared to the sequential implementation of staged flow-sensitive analysis, a single-threaded execution of our implementation performs better on 8 of the benchmarks.
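To make the graph-rewriting view concrete, the sketch below shows a worklist-driven, inclusion-based points-to solver for the simpler flow-insensitive setting that the abstract cites as prior work: load and store constraints rewrite the constraint graph by adding copy edges as points-to facts are discovered. The encoding and all names are illustrative; this is not the paper's staged, flow-sensitive algorithm or its Threading Building Blocks implementation.

```cpp
// Illustrative sketch: inclusion-based points-to propagation viewed as graph
// rewriting. Nodes are pointer variables and abstract objects; a copy edge
// p -> q means pts(p) is included in pts(q). Load (q = *p) and store (*p = q)
// constraints rewrite the graph by adding copy edges as pts(p) grows.
#include <cstdio>
#include <queue>
#include <set>
#include <vector>

struct ConstraintGraph {
    int n;
    std::vector<std::set<int>> pts;        // points-to sets (node ids of objects)
    std::vector<std::set<int>> copyEdges;  // copy edges: src -> dst
    std::vector<std::vector<int>> loads;   // loads[p]  holds q for each  q = *p
    std::vector<std::vector<int>> stores;  // stores[p] holds q for each *p = q

    explicit ConstraintGraph(int nodes)
        : n(nodes), pts(nodes), copyEdges(nodes), loads(nodes), stores(nodes) {}

    // Graph rewriting step: adding a copy edge re-activates its source.
    void addCopy(int src, int dst, std::queue<int>& worklist) {
        if (copyEdges[src].insert(dst).second) worklist.push(src);
    }

    void solve() {
        std::queue<int> worklist;
        for (int v = 0; v < n; ++v) worklist.push(v);
        while (!worklist.empty()) {
            int p = worklist.front(); worklist.pop();
            // Rewrite loads/stores through p into copy edges for each object o.
            for (int o : pts[p]) {
                for (int q : loads[p])  addCopy(o, q, worklist);
                for (int q : stores[p]) addCopy(q, o, worklist);
            }
            // Propagate pts(p) along outgoing copy edges.
            for (int q : copyEdges[p]) {
                bool grew = false;
                for (int o : pts[p]) grew |= pts[q].insert(o).second;
                if (grew) worklist.push(q);
            }
        }
    }
};

int main() {
    // Nodes: 0 = object a, 1 = object b, 2 = p, 3 = q, 4 = r.
    // Program: p = &a; q = &b; r = p; *p = q;  =>  pts(a) should contain b.
    ConstraintGraph g(5);
    g.pts[2].insert(0);
    g.pts[3].insert(1);
    g.copyEdges[2].insert(4);
    g.stores[2].push_back(3);
    g.solve();
    std::printf("pts(a) contains b: %s\n", g.pts[0].count(1) ? "yes" : "no");
}
```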
ISBN (print): 9781450319225
Betweenness centrality is an important metric in the study of social networks, and several algorithms for computing this metric exist in the literature. This paper makes three contributions. First, we show that the problem of computing betweenness centrality can be formulated abstractly in terms of a small set of operators that update the graph. Second, we show that existing parallel algorithms for computing betweenness centrality can be viewed as implementations of different schedules for these operators, permitting all these algorithms to be formulated in a single framework. Third, we derive a new asynchronous parallel algorithm for betweenness centrality that (i) works seamlessly for both weighted and unweighted graphs, (ii) can be applied to large graphs, and (iii) is able to extract large amounts of parallelism. We implemented this algorithm and compared it against a number of publicly available implementations of previous algorithms on two different multicore architectures. Our results show that the new algorithm is the best-performing one in most cases, particularly for large graphs and large thread counts, and is always competitive with other algorithms.
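For reference, the sketch below is the classical sequential Brandes computation for unweighted graphs, whose forward BFS and backward dependency-accumulation phases are what parallel formulations reorganize under different schedules. It is only an illustrative baseline, not the asynchronous algorithm described in the abstract.

```cpp
// Illustrative sketch: Brandes' sequential betweenness centrality for an
// unweighted graph given as adjacency lists.
#include <cstdio>
#include <queue>
#include <stack>
#include <vector>

std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<std::vector<int>> pred(n);  // shortest-path predecessors
        std::vector<long long> sigma(n, 0);     // number of shortest paths from s
        std::vector<int> dist(n, -1);
        std::vector<double> delta(n, 0.0);      // dependency of s on each vertex
        std::stack<int> order;                  // vertices in BFS finish order
        std::queue<int> q;
        sigma[s] = 1; dist[s] = 0; q.push(s);
        while (!q.empty()) {                    // forward BFS phase
            int v = q.front(); q.pop();
            order.push(v);
            for (int w : adj[v]) {
                if (dist[w] < 0) { dist[w] = dist[v] + 1; q.push(w); }
                if (dist[w] == dist[v] + 1) { sigma[w] += sigma[v]; pred[w].push_back(v); }
            }
        }
        while (!order.empty()) {                // backward accumulation phase
            int w = order.top(); order.pop();
            for (int v : pred[w])
                delta[v] += (double(sigma[v]) / double(sigma[w])) * (1.0 + delta[w]);
            if (w != s) bc[w] += delta[w];
        }
    }
    // For undirected graphs each pair is counted in both directions; halve if desired.
    return bc;
}

int main() {
    // Path graph 0-1-2-3: the two interior vertices each score 4.0 before halving.
    std::vector<std::vector<int>> adj = {{1}, {0, 2}, {1, 3}, {2}};
    std::vector<double> bc = betweenness(adj);
    for (int v = 0; v < 4; ++v) std::printf("bc[%d] = %.1f\n", v, bc[v]);
}
```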
ISBN (print): 9781450315616
Algorithms in new application areas like machine learning and network analysis use "irregular" data structures such as graphs, trees and sets. Writing efficient parallel code in these problem domains is very challenging because it requires the programmer to make many choices: a given problem can usually be solved by several algorithms, each algorithm may have many implementations, and the best choice of algorithm and implementation can depend not only on the characteristics of the parallel platform but also on properties of the input data such as the structure of the graph. One solution is to permit the application programmer to experiment with different algorithms and implementations without writing every variant from scratch. Auto-tuning to find the best variant is a more ambitious solution. These solutions require a system for automatically producing efficient parallel implementations from high-level specifications. Elixir, the system described in this paper, is the first step towards this ambitious goal. Application programmers write specifications that consist of an operator, which describes the computations to be performed, and a schedule for performing these computations. Elixir uses sophisticated inference techniques to produce efficient parallel code from such specifications. We used Elixir to automatically generate many parallel implementations for three irregular problems: breadth-first search, single-source shortest path, and betweenness centrality computation. Our experiments show that the best generated variants can be competitive with handwritten code for these problems from other research groups; for some inputs, they even outperform the handwritten versions.
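The operator/schedule separation can be illustrated by hand for single-source shortest path, one of the three problems mentioned: in the sketch below, edge relaxation is the operator and the worklist policy is the schedule, with a FIFO and a priority-ordered schedule plugged into the same solver. This is plain C++ written for illustration, not Elixir's specification language or its generated code.

```cpp
// Illustrative sketch of the operator/schedule separation for SSSP: the
// relaxation operator is fixed, while the schedule (worklist policy) varies.
#include <cstdio>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int dst; int weight; };
using Graph = std::vector<std::vector<Edge>>;
constexpr int INF = std::numeric_limits<int>::max();

struct FifoSchedule {                       // chaotic-relaxation flavour
    std::queue<int> q;
    void push(int v, int /*prio*/) { q.push(v); }
    int pop() { int v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

struct PrioritySchedule {                   // Dijkstra-like ordering; stale
    using Item = std::pair<int, int>;       // entries are harmless (lazy deletion)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> q;
    void push(int v, int prio) { q.push({prio, v}); }
    int pop() { int v = q.top().second; q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

template <typename Schedule>
std::vector<int> sssp(const Graph& g, int source) {
    std::vector<int> dist(g.size(), INF);
    Schedule worklist;                      // the schedule: decides processing order
    dist[source] = 0;
    worklist.push(source, 0);
    while (!worklist.empty()) {
        int v = worklist.pop();
        for (const Edge& e : g[v]) {        // the operator: relax edges out of v
            int cand = dist[v] + e.weight;
            if (cand < dist[e.dst]) {
                dist[e.dst] = cand;
                worklist.push(e.dst, cand); // re-schedule the affected neighbor
            }
        }
    }
    return dist;
}

int main() {
    Graph g(4);
    g[0] = {{1, 1}, {2, 4}};
    g[1] = {{2, 2}, {3, 6}};
    g[2] = {{3, 3}};
    std::vector<int> d1 = sssp<FifoSchedule>(g, 0);
    std::vector<int> d2 = sssp<PrioritySchedule>(g, 0);
    std::printf("dist[3]: FIFO schedule = %d, priority schedule = %d\n", d1[3], d2[3]);
}
```

Both schedules converge to the same distances; they differ only in how much wasted re-relaxation they perform, which is exactly the kind of trade-off a schedule specification exposes.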
ISBN (digital): 9783642311253; ISBN (print): 9783642311246
Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this problem. Our approach is based on a construct called the conflict graph, which (i) provides insight into the amount of parallelism that can be extracted from an irregular algorithm, and (ii) can be used to address the processor allocation problem for irregular algorithms. We show that this problem is related to a generalization of the unfriendly seating problem and, by extending Turán's theorem, we obtain a worst-case class of problems for optimistic parallelization, which we use to derive a lower bound on the exploitable parallelism. Finally, using some theoretically derived properties and some experimental facts, we design a quick and stable control strategy for solving the processor allocation problem heuristically.
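As background for the Turán connection (the classical bound, not the paper's extension): the activities that can safely execute in one parallel step form an independent set of the conflict graph G, so the available parallelism is governed by the independence number, which Turán's theorem bounds from below in terms of the average degree.

```latex
% Classical Turan-type lower bound on the independence number of the conflict
% graph G with n activities; alpha(G) is the largest set of activities that
% can run concurrently without conflicts.
\[
  \alpha(G) \;\ge\; \frac{n}{\bar{d} + 1},
  \qquad
  \bar{d} \;=\; \frac{2\,|E(G)|}{n}.
\]
```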
ISBN (print): 9781450304900
Computations on unstructured graphs are challenging to parallelize because dependences in the underlying algorithms are usually complex functions of runtime data values, thwarting static parallelization. One promising general-purpose parallelization strategy for these algorithms is optimistic parallelization. This paper identifies the optimization of optimistically parallelized graph programs as a new application area, and develops the first shape analysis for addressing this problem. Our shape analysis identifies failsafe points in the program after which the execution is guaranteed not to abort and backup copies of modified data are not needed; additionally, the analysis can be used to eliminate redundant conflict checking. It uses two key ideas: a novel top-down heap abstraction that controls state-space explosion, and a strategy for predicate discovery that exploits common patterns of data structure usage. We implemented the shape analysis in TVLA and used it to optimize benchmarks from the Lonestar suite. The optimized programs were executed on the Galois system. The analysis was successful in eliminating all costs related to rollback logging for our benchmarks. Additionally, it reduced the number of lock acquisitions by a factor ranging from 10x to 50x, depending on the application and the number of threads. These optimizations were effective in reducing the running times of the benchmarks by factors of 2x to 12x.
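A schematic of the failsafe-point idea, written directly rather than inferred by an analysis (the node structure and locking discipline below are illustrative, not the Galois API or the analyzed benchmarks): all conflict checks precede any mutation, so the point after the last successful lock acquisition is failsafe and the writes that follow need no backup copies.

```cpp
// Schematic sketch: a cautious operator whose conflict checks (per-node lock
// acquisitions) all happen before any mutation. Once the last lock is held,
// the activity can no longer abort, so no undo log entries are needed.
#include <mutex>

struct Node {
    std::mutex lock;   // stands in for conflict detection on this node
    int value = 0;
};

// Returns false if a conflict was detected; nothing was modified in that case,
// so "rollback" is simply releasing whatever locks were already held.
bool relax_edge(Node& src, Node& dst, int weight) {
    std::unique_lock<std::mutex> l1(src.lock, std::try_to_lock);
    if (!l1.owns_lock()) return false;      // conflict: abort, no undo needed
    std::unique_lock<std::mutex> l2(dst.lock, std::try_to_lock);
    if (!l2.owns_lock()) return false;      // conflict: abort, no undo needed
    // ---- failsafe point: all checks done, execution can no longer abort ----
    if (src.value + weight < dst.value)     // mutations after this point need
        dst.value = src.value + weight;     // no backup copies
    return true;
}

int main() {
    Node a, b;
    a.value = 0; b.value = 10;
    relax_edge(a, b, 3);                    // single-threaded use: b.value becomes 3
    return b.value == 3 ? 0 : 1;
}
```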