We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is base...
详细信息
Stagnation of Moore's law has led to the increased adoption of parallel programming for enhancing performance of scientific applications. Frequently occurring code and design patterns in scientific applications ar...
详细信息
ISBN:
(纸本)9781665462082
Stagnation of Moore's law has led to the increased adoption of parallel programming for enhancing performance of scientific applications. Frequently occurring code and design patterns in scientific applications are often used for transforming serial code to parallel. But, identifying these patterns is not easy. To this end, we propose using Graph Neural Networks for modeling code flow graphs to identify patterns in such parallel code. Additionally, identifying the runtime parameters for best performing parallel code is also challenging. We propose a pattern-guided deep learning based tuning approach, to help identify the best runtime parameters for OpenMP loops. Overall, we aim to identify commonly occurring patterns in parallel loops and use these patterns to guide auto-tuning efforts. We validate our hypothesis on 20 different applications from Polybench, and STREAM benchmark suites. This deep learning-based approach can identify the considered patterns with an overall accuracy of 91%. We validate the usefulness of using patterns for auto-tuning on tuning the number of threads, scheduling policies and chunk size on a single socket system, and the thread count and affinity on a multi-socket machine. Our approach achieves geometric mean speedups of $1.1\times$ and $4.7\times$ respectively over default OpenMP configurations, compared to brute-force speedups of $1.27\times$ and $4.93\times$ respectively.
Collective operations are common features of parallel programming models that are frequently used in High-Performance (HPC) and machine/ deep learning (ML/ DL) applications. In strong scaling scenarios, collective ope...
详细信息
In this study, we introduce a methodology for automatically transforming user applications in the radar and communication domain written in $\boldsymbol{\mathrm{C}/\mathrm{C}++}$ based on dynamic profiling to a para...
详细信息
In this study, we introduce a methodology for automatically transforming user applications in the radar and communication domain written in $\boldsymbol{\mathrm{C}/\mathrm{C}++}$ based on dynamic profiling to a parallel representation targeted for a heterogeneous SoC. We present our approach for instrumenting the user application binary during the compilation process with barrier synchronization primitives that enable runtime system schedule and execute independent tasks concurrently over the available compute resources. We demonstrate the capabilities of our integrated compile time and runtime flow through task-level parallel and functionally correct execution of real-life applications. We perform validation of our integrated system by executing four distinct applications each carrying various degrees of task level parallelism over the Xeon-based multi-core homogeneous processor. We use the proposed compilation and code transformation methodology to re-target each application for execution on a heterogeneous SoC composed of three ARM cores and one FFT accelerator that is emulated on the Xilinx Zynq Ultra $\mathbf{Scale}+$ platform. We demonstrate our runtime's ability to process application binary, dispatch independent tasks over the available compute resources of the emulated SoC on the Zynq FPGA based on three different scheduling heuristics. Finally we demonstrate execution of each application individually with task level parallelism on the Zynq FPGA and execution of workload scenarios composed of multiple instances of the same application as well as mixture of two distinct applications to demonstrate ability to realize both application and task level parallel execution. Our integrated approach offers a path forward for application developers to take full advantage of the target SoC without requiring users to become hardware and parallel programming experts.
Graph pattern matching is a fundamental task in many graph analytics and graph mining applications. As an NP-hard problem, it is often a performance bottleneck in these applications. Previous work has proposed to use ...
详细信息
Graph pattern matching is a fundamental task in many graph analytics and graph mining applications. As an NP-hard problem, it is often a performance bottleneck in these applications. Previous work has proposed to use GPU to accelerate the computation. However, we find that the existing GPU solutions fail to show a performance advantage over the state-of-the-art CPU implementation due to their subgraph-centric design. This work proposes a novel stack-based graph pattern matching system on GPU that avoids the synchronization and memory consumption issues of the previous subgraph-centric systems. We also propose a two-level work-stealing and a loop-unrolling technique to improve the inter-warp and intra-warp GPU resource utilization of our system. The experiments show that our system significantly advances the state-of-the-art for graph pattern matching on GPU.
With the boom in the development of computer hardware and software, social media, IoT platforms, and communications, there has been an exponential growth in the volume of data produced around the world. Among these da...
详细信息
One of the various techniques employed for image feature extraction is the Local Neighborhood Difference Pattern, also called as LNDP. LNDP considers the relationship between neighbors of a central pixel with its adja...
详细信息
One of the various techniques employed for image feature extraction is the Local Neighborhood Difference Pattern, also called as LNDP. LNDP considers the relationship between neighbors of a central pixel with its adjacent pixels and transforms this mutual relationship of all the neighboring pixels into a binary pattern. It has proven to be a powerful and effective descriptor for texture analysis. A parallel implementation of LNDP using Compute Unified Device Architecture (CUDA) has been proposed in this paper. A speedup of about 1000 times has been achieved through a shared memory parallel implementation for large images. Thus, an efficacious and efficient implementation has resulted in an increased execution speed and reduced execution time.
As the hardware industry moves towards using specialized heterogeneous many-cores to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. This arti...
详细信息
The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a s...
详细信息
The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a sparse code. In this work, we propose a locality-based codelet mining (LCM) algorithm that efficiently searches for strided and partially strided regions in sparse matrix computations for vectorization. We also present a classification of partially strided codelets and a differentiation-based approach to generate codelets from memory accesses in the sparse computation. LCM is implemented as an inspector-executor framework called LCM I/E that generates vectorized code for the sparse matrix-vector multiplication (SpMV), sparse matrix times dense matrix (SpMM), and sparse triangular solver (SpTRSV). LCM I/E outperforms the MKL library with an average speedup of 1.67×, 4.1×, and 1.75× for SpMV, SpTRSV, and SpMM, respectively. It is also faster than the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel with an average speedup of 1.9×.
In high-performance computing, picking the right number of threads to gain a good speedup is important, as many OS-level parameters are influenced by even slight adjustments in thread count. These parameters are requi...
详细信息
暂无评论