Semi-automatic parallelization provides abstractions that simplify the programming effort and allow the user to make decisions that cannot be made by tools. However, abstractions for general-purpose systems usually do...
In recent years, high-performance computing systems have been equipped with not only host processors but also accelerators, becoming more heterogeneous and larger in scale. The task parallel execution mo...
Efficient, scalable and productive parallel programming is a major challenge for exploiting the future multiprocessor SoC platforms. This article presents the MultiFlex programming environment which has been developed...
ISBN: (digital) 9781665423243
This paper introduces Taskflow to address the critical question of “How can we make it easier to implement and deploy parallel computer-aided design (CAD) algorithms on large heterogeneous nodes with high performance and simultaneous high productivity?” Parallelizing CAD is extremely challenging. Modern CAD applications exhibit unique computational patterns and user requirements that require careful decomposition to benefit from parallelism. Taskflow helps researchers and developers manage the implementation complexity of parallel algorithms by introducing a new high-level programming model supported by an efficient runtime. By capitalizing on emerging parallelism comprising manycore central processing units (CPUs), graphics processing units (GPUs), and custom accelerators, Taskflow enables CAD to achieve new performance and productivity milestones that were previously out of reach.
In the sciences, it is common to use the so-called "big operator" notation to express the iteration of a binary operator (the reducer) over a collection of values. Such a notation typically assumes that the ...
Parallel graph processing is central to analytical computer science applications, and GPUs have proven to be an ideal platform for parallel graph processing. Existing GPU graph processing frameworks present performance improvements but often neglect two issues: the unpredictability of a given input graph and the energy consumption of the graph processing. Our prototype software, EEGraph (Energy Efficiency of Graph processing), is a flexible system consisting of several graph processing algorithms with configurable parameters for vertex update synchronization, vertex activation, and memory management, along with a lightweight software-based GPU energy measurement scheme. We observe relationships between different configurations of our software, performance, and GPU energy for processing in-memory and out-of-memory graphs. The ideal parameters are discovered for specific input graphs by analyzing the observed relationships. We also present the utility of subgraph generation to predict the performance and energy consumption of complete graph configurations. EEGraph improves upon state-of-the-art GPU-based graph processing software by 2.08 times for performance and 1.60 times for GPU energy for processing in-memory graph datasets. Additionally, EEGraph improves upon the state-of-the-art by 3.30 times for performance and 1.63 times for GPU energy for processing large out-of-memory graph datasets.
Sisal 3.2 is a new input language of the system of functional programming (SFP), which is under development at the Institute of Informatics Systems in Novosibirsk as an interactive visual environment for supporting scientific parallel programming. This paper contains an overview of Sisal 3.2 and a description of its new features compared with previous versions of the SFP input language, such as multidimensional array support, new abstractions like parametric types and generalised procedures, more flexible user-defined reductions, improved interoperability with other programming languages, and the specification of several optimising source text annotations.
ISBN: (print) 9781450391467
Some recent papers showed that many sequential iterative algorithms can be directly parallelized by identifying the dependences between the input objects. This approach yields many simple and practical parallel algorithms, but achieving work-efficiency and high parallelism remains challenging. Work-efficiency means that the number of operations is asymptotically the same as the best sequential solution. This can be hard for certain problems where the number of dependences between objects is asymptotically more than the optimal sequential work, and we cannot even afford the cost to generate them. To achieve high parallelism, we want to process as many objects as possible in parallel. The goal is to achieve $\tilde{O}(D)$ span for a problem with the deepest dependence length D. We refer to this property as round-efficiency. This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so. To efficiently parallelize many sequential iterative algorithms, we propose the phase-parallel framework. The framework assigns a rank to each object and processes the objects based on the order of their ranks. All objects with the same rank can be processed in parallel. To enable work-efficiency and high parallelism, we use two types of general techniques. Type 1 algorithms aim to use range queries to extract all objects with the same rank to avoid evaluating all the dependences. We discuss activity selection and Dijkstra's algorithm using the Type 1 framework. Type 2 algorithms aim to wake up an object when the last object it depends on is finished. We discuss activity selection, longest increasing subsequence (LIS), greedy maximal independent set (MIS), and many other algorithms using the Type 2 framework. All of our algorithms are (nearly) work-efficient and round-efficient, and some of them (e.g., LIS) are the first to achieve both. Many of them improve the previous best bounds. Moreover,
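The rank-and-round structure described in this abstract can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names `phase_parallel_rounds` and `lis_dependences` are hypothetical, the inner loop over a round is sequential where a real implementation would run it in parallel, and the LIS helper materialises every dependence explicitly, which is exactly the cost the abstract says a work-efficient algorithm must avoid.

```python
from collections import defaultdict


def phase_parallel_rounds(deps):
    """Group objects into rounds; deps[x] lists the objects x depends on.

    Every object must appear as a key of deps. An object is "woken up"
    when the last object it depends on has finished, so all objects in
    one round are independent of each other.
    """
    remaining = {x: len(set(ds)) for x, ds in deps.items()}
    dependents = defaultdict(list)
    for x, ds in deps.items():
        for d in set(ds):
            dependents[d].append(x)

    frontier = sorted(x for x, r in remaining.items() if r == 0)
    rounds = []
    while frontier:
        rounds.append(frontier)
        woken = []
        for x in frontier:  # conceptually a parallel loop over one rank
            for y in dependents[x]:
                remaining[y] -= 1
                if remaining[y] == 0:
                    woken.append(y)
        frontier = sorted(woken)
    return rounds


def lis_dependences(values):
    """Illustrative only: index i depends on every earlier, smaller element."""
    return {i: [j for j in range(i) if values[j] < values[i]]
            for i in range(len(values))}


rounds = phase_parallel_rounds(lis_dependences([3, 1, 4, 1, 5, 9, 2, 6]))
print(rounds)  # [[0, 1, 3], [2, 6], [4], [5, 7]]
```

In this example the number of rounds (4) equals the length of a longest increasing subsequence, since an element's round is the length of the longest dependence chain ending at it.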
ISBN: (print) 9781665454445
Graph pattern matching is a fundamental task in many graph analytics and graph mining applications. As an NP-hard problem, it is often a performance bottleneck in these applications. Previous work has proposed using GPUs to accelerate the computation. However, we find that the existing GPU solutions fail to show a performance advantage over the state-of-the-art CPU implementation due to their subgraph-centric design. This work proposes a novel stack-based graph pattern matching system on GPU that avoids the synchronization and memory consumption issues of the previous subgraph-centric systems. We also propose a two-level work-stealing technique and a loop-unrolling technique to improve the inter-warp and intra-warp GPU resource utilization of our system. The experiments show that our system significantly advances the state-of-the-art for graph pattern matching on GPU.
ISBN: (print) 9781665454445
The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a sparse code. In this work, we propose a locality-based codelet mining (LCM) algorithm that efficiently searches for strided and partially strided regions in sparse matrix computations for vectorization. We also present a classification of partially strided codelets and a differentiation-based approach to generate codelets from memory accesses in the sparse computation. LCM is implemented as an inspector-executor framework called LCM I/E that generates vectorized code for the sparse matrix-vector multiplication (SpMV), sparse matrix times dense matrix (SpMM), and sparse triangular solver (SpTRSV). LCM I/E outperforms the MKL library with an average speedup of 1.67x, 4.1x, and 1.75x for SpMV, SpTRSV, and SpMM, respectively. It is also faster than the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel with an average speedup of 1.9x.
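To make the vectorization difficulty concrete, the sketch below shows a plain SpMV over the compressed sparse row (CSR) format, plus a helper that splits one row's column indices into unit-stride runs. This is not the LCM algorithm from the abstract; the function names `spmv_csr` and `unit_stride_runs` are hypothetical, and the helper only hints at the kind of strided region an inspector might look for.

```python
def spmv_csr(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            # x is read through indices[k]: an indirect, irregular access
            y[row] += data[k] * x[indices[k]]
    return y


def unit_stride_runs(cols):
    """Split sorted column indices into maximal (start, length) runs of
    consecutive columns; only such runs give contiguous loads from x."""
    runs = []
    for c in cols:
        if runs and c == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return [tuple(r) for r in runs]


# A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
indptr, indices, data = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1, 2, 3, 4, 5]
print(spmv_csr(indptr, indices, data, [1, 1, 1]))  # [3.0, 3.0, 9.0]
print(unit_stride_runs([0, 1, 2, 5, 7, 8]))        # [(0, 3), (5, 1), (7, 2)]
```

Runs of length 1 correspond to accesses that cannot be served by a single contiguous vector load, which is why purely strided approaches leave parts of a sparse code unvectorized.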