In the contemporary landscape of computer architecture, the demand for efficient parallel programming persists, needing robust optimization techniques. Traditional optimizing compilers have historically been pivotal i...
ISBN: 9781728186139 (print)
Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program's structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01x to 3.58x, with a geometric mean of 1.22x.
ISBN: 9781665400268 (print)
Traffic monitoring and vehicle counting systems that use surveillance cameras employ several computer vision techniques, one of which is object tracking, which approximates the trajectory of the vehicle throughout the scene. However, a major challenge in processing videos from network camera feeds is their irregular and low frame rates, which degrade the performance of object tracking. In this paper, we present a concurrent implementation framework intended to increase the input network video frame rate.
ISBN: 9781665440660 (print)
Python has become a widely used programming language for research, not only for small one-off analyses, but also for complex application pipelines running at supercomputer scale. Modern parallel programming frameworks for Python present users with a more granular unit of management than traditional Unix processes and batch submissions: the Python function. We review the challenges involved in running native Python functions at scale, and present techniques for dynamically determining a minimal set of dependencies and for assembling a lightweight function monitor (LFM) that captures the software environment and manages resources at the granularity of single functions. We evaluate these techniques in a range of environments, from campus cluster to supercomputer, and show that our advanced dependency management planning and dynamic resource management methods provide superior performance and utilization relative to coarser-grained management approaches, achieving a several-fold decrease in execution time for several large Python applications.
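To make the idea of per-function dependency discovery concrete, the following is a minimal sketch and not the paper's method. It assumes that inspecting a function's bytecode and global namespace is an acceptable approximation. The helper `module_dependencies` and the function `example` are hypothetical names introduced only for this illustration. The sketch ignores nested functions and objects imported with `from ... import ...`.

```python
import dis
import math
import types


def module_dependencies(func):
    """Return the top-level module names a function appears to rely on.

    Hypothetical helper: it looks at import statements inside the function
    body and at global names that are bound to module objects.
    """
    deps = set()

    # Imports executed inside the function body show up as IMPORT_NAME.
    for instr in dis.get_instructions(func):
        if instr.opname == "IMPORT_NAME":
            deps.add(instr.argval.split(".")[0])

    # Global names used by the function that refer to imported modules.
    for name in func.__code__.co_names:
        value = func.__globals__.get(name)
        if isinstance(value, types.ModuleType):
            deps.add(value.__name__.split(".")[0])

    return deps


def example(x):
    """Illustrative function with one global and one local import."""
    import json
    return json.dumps(math.sqrt(x))


found = module_dependencies(example)
```

A real system would also need to resolve package versions and transitive dependencies; this sketch only lists the directly referenced modules.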
The message passing model, represented by MPI (Message Passing Interface), is the principal parallel programming tool for distributed computer systems. Most MPI programs contain collective communications, which involve all the processes of a parallel program. The efficiency of collective communications substantially affects the total execution time of a program. In this work, we consider the problem of designing adaptive algorithms for collective communications, using the example of barrier synchronization, one of the most common types of collective communication. We developed an adaptive barrier synchronization algorithm that suboptimally selects a barrier synchronization scheme in parallel MPI programs from among the Central Counter, Combining Tree, and Dissemination Barrier algorithms. The adaptive algorithm chooses the barrier algorithm with the minimal estimated execution time in the LogP model. The LogP model accounts for the performance of computational resources and of the interconnect for point-to-point communications. The proposed algorithm has been implemented for MPI. We present the results of experiments on cluster systems and analyse how the algorithm selection depends on the values of the LogP parameters. In particular, for fewer than 20 processes the adaptive algorithm selects Combining Tree, while for a larger number of processes it selects Dissemination Barrier. The developed algorithm reduces the average barrier synchronization time by 4% in comparison with the most common fixed barrier algorithms. (C) 2021 The Authors. Published by Elsevier B.V.
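The selection step described above can be sketched as picking the barrier with the smallest estimated cost. The abstract does not give the paper's cost formulas, so the three estimates below are simplified assumptions written for illustration only, and they are not expected to reproduce the reported crossover near 20 processes. The names `LogP`, `central_counter`, `combining_tree`, `dissemination`, and `select_barrier` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    L: float  # network latency of one message
    o: float  # send/receive overhead on a processor
    g: float  # minimum gap between consecutive messages
    P: int    # number of processes


def _p2p(m):
    # Assumed cost of one point-to-point message: send overhead + latency + receive overhead.
    return m.L + 2 * m.o


def _rounds(p):
    # ceil(log2(p)) computed with integers to avoid floating-point rounding.
    return (p - 1).bit_length()


def central_counter(m):
    # Assumption: P-1 arrivals are serialized at the root, then P-1 releases likewise.
    if m.P <= 1:
        return 0.0
    serialized = (m.P - 2) * max(m.g, m.o)
    return 2 * (_p2p(m) + serialized)


def combining_tree(m):
    # Assumption: binary tree, one extra gap per level for the second child,
    # traversed once upward (arrival) and once downward (release).
    return 2 * _rounds(m.P) * (_p2p(m) + max(m.g, m.o))


def dissemination(m):
    # Assumption: ceil(log2 P) rounds, one message sent and received per round.
    return _rounds(m.P) * _p2p(m)


def select_barrier(m):
    """Return (name of cheapest barrier, dict of all estimated costs)."""
    costs = {
        "central_counter": central_counter(m),
        "combining_tree": combining_tree(m),
        "dissemination": dissemination(m),
    }
    return min(costs, key=costs.get), costs


name, costs = select_barrier(LogP(L=5.0, o=1.0, g=2.0, P=16))
```

Replacing the three cost functions with measured or better-founded estimates changes the choice without changing the selection logic.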
ISBN: 9781665441056 (print)
SUNRAY-1D is a one-dimensional large signal code for analyzing the beam-wave interaction in helix traveling wave tubes (TWTs). In order to improve the performance of SUNRAY-1D, parallelization of a few of its modules has been initiated. A parallel implementation of the space charge force module using MPI (Message Passing Interface) has been successful. The key benefits achieved are increased accuracy and reduced computational time.
ISBN: 9781665443111 (print)
Cloud Warehouses have been expanding their computational resources to cover the growing offloading of tenants' applications. Currently, cloud nodes integrate heterogeneous resources, such as CPU and GPU, so they can exploit different types and levels of parallelism available in the applications. However, heterogeneous cloud nodes bring challenges to the software development process, since the programmer must be aware of each device's specifications and must analyze and distribute the code over the available devices. Even though OpenCL supports transparent programming on heterogeneous devices, easing the programmer's burden, the choice of target device is still the programmer's responsibility. Given that, this work proposes a framework for the execution of OpenCL applications on a multi-tenant CPU-GPU cloud environment, responsible for transparently scheduling the applications to the best available device, without any interaction from the programmer. The framework has the goal of optimizing resource provisioning and reducing makespan and energy consumption. Considering the execution of the PolyBench benchmark suite, the framework shows a makespan reduction of 3.4x and energy savings of 33% when compared to standalone GPU execution.
ISBN: 9781665423694 (print)
Writing efficient, scalable, and portable HPC synthetic aperture radar (SAR) applications is increasingly challenging due to the growing diversity and heterogeneity in distributed systems. Considerable developer and computational resources are often spent to port applications to new HPC platforms and architectures, which is both time consuming and expensive. Domain-specific languages have been shown to be highly productive in terms of development effort, but additionally achieving both scalable computational efficiency and platform portability remains challenging. The Halide programming language is both productive and efficient for dense data processing, supports common CPU architectures and heterogeneous resources like GPUs, and has previously been extended for distributed processing. We propose to use a distributed Halide implementation for scalable and heterogeneous HPC SAR processing. We implement a backprojection algorithm for SAR image reconstruction and demonstrate scalability on the OLCF Summit supercomputer up to 1,024 compute nodes (43,008 cores, each with 4 hardware threads) with a large 32,768x32,768 dataset, and up to 8 distributed GPUs with an 8,192x8,192 dataset. Our results show excellent scaling and portability to heterogeneous resources, and motivate additional improvements in Halide to better support distributed high-performance signal processing.
ISBN: 9783030816827; 9783030816810 (print)
High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach for parallel computing that enables the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OMP-opt). We benchmarked these libraries on a Parallel Ultra-Low Power (PULP) MCU, highlighting that hardware-specific optimizations improve OMP-base performance by up to 69%. At the same time, OMP-SPMD leads to an extra improvement of up to 178%.
ISBN: 9783030916077; 9783030916084 (print)
This paper presents a spell checker project based on Levenshtein distance and evaluates the system's performance on both parallel and sequential implementations. The following Levenshtein approaches are presented: Levenshtein Matrix Distance, Levenshtein Vector Distance, Levenshtein automaton (along with an optimised version), and Levenshtein trie. The performance evaluation is performed using three edit distances. Each edit distance is evaluated on a set of misspelt words, so the results are relevant for various cases. For this scenario, the Levenshtein trie, along with the Levenshtein automaton, performed the best in both sequential and parallel versions for a large number of misspelt words.
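For readers unfamiliar with the first two approaches named above, the following sketch shows the standard textbook forms of the full-matrix and two-row (vector) Levenshtein computations. It is not taken from the paper. The function names and the small `suggest` helper, including its default `max_distance`, are assumptions introduced for illustration.

```python
def levenshtein_matrix(a: str, b: str) -> int:
    """Edit distance using the full (len(a)+1) x (len(b)+1) table."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1]


def levenshtein_vector(a: str, b: str) -> int:
    """Same distance, keeping only the previous and current rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Hypothetical helper: dictionary words within max_distance of word, closest first."""
    scored = [(levenshtein_vector(word, w), w) for w in dictionary]
    return [w for dist, w in sorted(scored) if dist <= max_distance]
```

The vector form returns the same result while using memory proportional to the length of one word instead of the product of both lengths.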