ISBN: 9781450351362 (print)
Given the increasing importance of efficient data-intensive computing, we find that modern processor designs are not well suited to the irregular memory access patterns found in these algorithms. This research focuses on mapping the compiler's instruction cost scheduling logic to hardware-managed concurrency controls in order to minimize pipeline stalls. In this manner, the hardware modules managing the low-latency thread concurrency can be directly understood by modern compilers. We introduce a thread context switching method that is managed directly via a set of hardware-based mechanisms coupled to the compiler instruction scheduler. As individual instructions from a thread execute, their respective costs are accumulated into a control register. Once the register reaches a predetermined saturation point, the thread is forced to context switch. We evaluate the performance benefits of our approach using a series of 24 benchmarks that exhibit performance acceleration of up to 14.6x.
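The cost-accumulation rule described in this abstract can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the authors' hardware design: the opcode names, the per-opcode costs in `COST`, the `SATURATION` value, the round-robin ready queue, and the choice to clear the register whenever a thread is scheduled are all hypothetical.

```python
from collections import deque

# Hypothetical per-opcode costs; the abstract does not give real values.
COST = {"alu": 1, "load": 4, "store": 3}
SATURATION = 8  # hypothetical saturation point of the control register


def run(threads, saturation=SATURATION):
    """Run threads round-robin, switching when accumulated cost saturates.

    `threads` maps a thread name to its list of opcodes. Returns the
    execution trace as (thread, opcode) pairs and the number of forced
    context switches.
    """
    ready = deque((name, deque(ops)) for name, ops in threads.items())
    trace, switches = [], 0
    while ready:
        name, ops = ready.popleft()
        register = 0  # assumption: register is cleared on every schedule
        while ops and register < saturation:
            op = ops.popleft()
            register += COST[op]  # accumulate the instruction's cost
            trace.append((name, op))
        if ops:  # work remains, so the thread was forced to yield
            switches += 1
            ready.append((name, ops))
    return trace, switches
```

With two threads where the first issues two loads back to back, the register saturates after the second load and the other thread runs before the first one resumes.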
This work optimizes tensor-times-dense matrix multiply (TTM) for general sparse and semi-sparse tensors on CPU and NVIDIA GPU platforms. TTM is a computational kernel in tensor methods-based data analytics and data mining applications, such as the popular Tucker decomposition. We first design an in-place sequential SPTTM to avoid the explicit data reorganization between a tensor and a matrix that the conventional approach requires. We further optimize SPTTM on NVIDIA GPU platforms. Five approaches including employing fine thread granularity, arranging coalesced memory access, rank blocking, and using fast GPU shared memory are developed for GPU-SPTTM. We also optimize semi-sparse tensor-times-dense matrix multiply (SSPTTM) to take advantage of the dense sub-structures inside the tensor. The optimized SPTTM and SSPTTM are applied to Tucker decomposition to improve its overall performance. Our sequential SPTTM is 3-120x faster than the SPTTM from the Tensor Toolbox library. GPU-SPTTM obtains 6-19x speedup on NVIDIA K40c and 23-67x speedup on NVIDIA P100 over CPU-SPTTM. Our GPU-SPTTM is 3.9x faster than the state-of-the-art GPU implementation. Our SSPTTM implementations outperform SPTTMs, which handle the input semi-sparse tensor in a general way, by up to 4.5x. Tucker decomposition achieves up to 3.2x speedup after applying the optimized TTMs. The code will be publicly released in the ParTI! library: https://***/hpcgarage/ParTI. (C) 2018 Elsevier Inc. All rights reserved.
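To make the kernel concrete, here is a minimal sequential sketch of a sparse tensor-times-dense-matrix product that works directly on coordinate (COO) nonzeros, without first unfolding the tensor into a matrix. It is not the ParTI! implementation. The function name `spttm`, the COO input layout, the convention that `U` has shape (I_mode, R), and the dictionary output format are assumptions made for illustration.

```python
def spttm(coords, vals, U, mode):
    """Multiply a COO sparse tensor by a dense matrix along `mode`.

    coords: list of index tuples, one per nonzero.
    vals:   matching list of nonzero values.
    U:      dense matrix as a list of rows, assumed shape (I_mode, R).
    Returns a semi-sparse result as a dict that maps each index tuple
    with `mode` removed to a dense fiber (list) of length R.
    """
    rank = len(U[0])
    out = {}
    for idx, x in zip(coords, vals):
        # Nonzeros that differ only in `mode` contribute to one fiber.
        key = idx[:mode] + idx[mode + 1:]
        fiber = out.setdefault(key, [0.0] * rank)
        row = U[idx[mode]]
        for r in range(rank):
            fiber[r] += x * row[r]
    return out
```

The output is dense along the multiplied mode and sparse elsewhere, which is the semi-sparse structure the abstract refers to.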
Nodes with multiple GPUs are becoming the platform of choice for high-performance computing. However, most applications are written using bulk-synchronous programming models, which may not be optimal for irregular algorithms that benefit from low-latency, asynchronous communication. This article proposes constructs for asynchronous multi-GPU programming and describes their implementation in a thin runtime environment called Groute. Groute also implements common collective operations and distributed work-lists, enabling the development of irregular applications without substantial programming effort. We demonstrate that this approach achieves state-of-the-art performance and exhibits strong scaling for a suite of irregular applications on eight-GPU and heterogeneous systems, yielding over 7x speedup for some algorithms.
ISBN: 9781424416936 (print)
Task pools can be used to achieve the dynamic load balancing that is required for an efficient parallel implementation of irregular applications. However, the performance strongly depends on a task pool implementation that is well suited for the specific application. This paper introduces an adaptive task pool implementation that enables a step-wise transition between the common strategies of central and distributed task pools. The influence of the task size on the parallel performance is investigated and it is shown that the adaptive implementation provides the flexibility to adapt to different situations. Performance results from benchmark programs and from an irregular application for anomalous diffusion simulation are presented to demonstrate the need for an adaptive strategy. It is shown that profiling information about the overhead of the task pool implementation can be used to determine an optimal task pool strategy.
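One way to picture a step-wise transition between central and distributed task pools is a pool whose number of shared queues is a tunable parameter. The sketch below is an assumption-laden illustration, not the paper's implementation: the class name `GroupedTaskPool`, the modulo mapping of workers to queues, and the steal order are all hypothetical.

```python
import threading
from collections import deque


class GroupedTaskPool:
    """Task pool with `num_queues` shared queues for `num_workers` workers.

    num_queues == 1 behaves like a central pool; num_queues == num_workers
    behaves like a distributed pool with stealing. Values in between give
    intermediate configurations.
    """

    def __init__(self, num_workers, num_queues):
        if not 1 <= num_queues <= num_workers:
            raise ValueError("need 1 <= num_queues <= num_workers")
        self.num_queues = num_queues
        self.queues = [deque() for _ in range(num_queues)]
        self.locks = [threading.Lock() for _ in range(num_queues)]

    def _home(self, worker):
        # Hypothetical mapping: workers share queues by worker id modulo.
        return worker % self.num_queues

    def put(self, worker, task):
        q = self._home(worker)
        with self.locks[q]:
            self.queues[q].append(task)

    def get(self, worker):
        """Return a task, preferring the home queue, or None if all are empty."""
        home = self._home(worker)
        for offset in range(self.num_queues):
            q = (home + offset) % self.num_queues
            with self.locks[q]:
                if self.queues[q]:
                    if offset == 0:
                        return self.queues[q].pop()  # newest task from home queue
                    return self.queues[q].popleft()  # oldest task when stealing
        return None
```

Fewer queues mean more lock contention but better balance; more queues reduce contention at the price of stealing, which is the trade-off an adaptive strategy would tune.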
ISBN: 9781450334686 (print)
We develop a new framework for analyzing recursive methods that perform traversals over trees, called tree dependence analysis. This analysis translates dependence analysis techniques for regular programs to the irregular space, identifying the structure of dependences within a recursive method that traverses trees. We develop a dependence test that exploits the dependence structure of such programs, and can prove that several locality- and parallelism-enhancing transformations are legal. In addition, we extend our analysis with a novel path-dependent, conditional analysis to refine the dependence test and prove the legality of transformations for a wider range of algorithms. We then use these analyses to show that several common algorithms that manipulate trees recursively are amenable to several locality- and parallelism-enhancing transformations. This work shows that classical dependence analysis techniques, which have largely been confined to nested loops over array data structures, can be extended and translated to work for complex, recursive programs that operate over pointer-based data structures.
ISBN: 9781605587080 (print)
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such programs may involve all (or large numbers) of the participating processes, the actual communication operations are usually sparse in nature. As a result, communication phases are typically expressed explicitly using point-to-point communication operations or collective operations. We define the dynamic sparse data-exchange (DSDE) problem and derive bounds in the well-known LogGP model. While current approaches work well with static applications, they run into limitations as modern applications grow in scale, and as the problems that are being solved become increasingly irregular and dynamic. To enable the compact and efficient expression of the communication phase, we develop suitable sparse communication protocols for irregular applications at large scale. We discuss different irregular applications and show the sparsity in the communication for real-world input data. We discuss the time and memory complexity of commonly used protocols for the DSDE problem and develop NBX, a novel fast algorithm with constant memory overhead for solving it. Algorithm NBX improves the runtime of a sparse data-exchange among 8,192 processors on BlueGene/P by a factor of 5.6. In an application study, we show improvements of up to a factor of 28.9 for a parallel breadth-first search on 8,192 BlueGene/P processors.
ISBN: 9781450306980 (print)
Irregular or pointer-based structures such as graphs and trees are commonly used in algorithms dealing with sparse data. Given their reliance on pointers, these algorithms are difficult to analyze, and the structure of their memory accesses is obfuscated, which makes the extraction of parallelism difficult. In this work, we present a framework that is capable of reasoning about the semantics of the dynamic data footprints of operations to determine their potential overlap. We leverage knowledge that the programmer has about the algorithm's access patterns but is currently unable to express. This knowledge allows our runtime either to make a parallelization decision or to throttle concurrency to improve performance in Software Transactional Memories (STMs) [6]. Our framework relies on programmer-supplied predicates that are evaluated at runtime and used to probabilistically assert certain properties about data footprints. We present simple abstractions and a low-overhead runtime to support our framework. We demonstrate our work by parallelizing a graph-coloring benchmark and by improving the transactional performance of benchmarks from the STAMP suite.
Branch-and-Bound (B&B) algorithms are well-known tree-based exploratory methods for solving NP-hard discrete optimization problems to optimality. The construction of the B&B tree and its exploration are performed using four operators: branching, bounding, selection and pruning. Such algorithms are irregular, which makes their parallel design and implementation on GPU accelerators challenging. Among the few existing related works, we have recently revisited the bounding operator on GPU. The reported results show that speedups of up to 100x can be obtained on recent GPU cards. In this paper, we address the GPU-based design and implementation of B&B algorithms considering the branching and pruning operators as well as the bounding one. The proposed template transforms the unpredictable and irregular workload associated with the explored B&B tree into regular data-parallel kernels optimized for the SIMD-based execution model of GPUs. Thread divergence and uncoalesced memory accesses are considered in the optimization process. The proposed approach has been evaluated on the Flow-Shop scheduling problem and compared to another GPU-based strategy and to a cluster of workstations (COWs) based approach. The reported results demonstrate the efficiency of the proposed approach over the two other ones. Speedups of up to 160x are obtained for large problem instances using an Nvidia Tesla C2050 hardware configuration.
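The four operators named in this abstract can be seen together in a small sequential B&B for the permutation flow-shop problem. This is a CPU-only sketch for illustration and does not reproduce the paper's GPU template. The function names, the best-first selection rule, and the simple last-machine lower bound are assumptions chosen for brevity.

```python
import heapq


def completion(times, perm):
    """Per-machine completion times after scheduling `perm` in order.

    times[j][m] is the processing time of job j on machine m.
    """
    done = [0] * len(times[0])
    for job in perm:
        prev = 0
        for m, p in enumerate(times[job]):
            # A job starts on machine m once it has left machine m-1
            # and the previous job has left machine m.
            prev = max(prev, done[m]) + p
            done[m] = prev
    return done


def bound(times, perm, remaining):
    """Lower bound: the last machine must still run every remaining job."""
    return completion(times, perm)[-1] + sum(times[j][-1] for j in remaining)


def flow_shop_bb(times):
    """Return (optimal makespan, one optimal job permutation)."""
    n = len(times)
    best_cost, best_perm = float("inf"), None
    heap = [(bound(times, (), range(n)), ())]
    while heap:
        lb, perm = heapq.heappop(heap)  # selection: smallest bound first
        if lb >= best_cost:  # pruning: cannot beat the incumbent
            continue
        remaining = [j for j in range(n) if j not in perm]
        if not remaining:  # complete schedule: bound equals makespan
            best_cost, best_perm = lb, perm
            continue
        for j in remaining:  # branching: append one unscheduled job
            child = perm + (j,)
            rest = [k for k in remaining if k != j]
            child_lb = bound(times, child, rest)  # bounding
            if child_lb < best_cost:
                heapq.heappush(heap, (child_lb, child))
    return best_cost, best_perm
```

The number of children per node and the depth at which pruning occurs both depend on the instance, which is the irregularity the abstract says must be reshaped into regular data-parallel kernels.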