The message-passing model, represented by MPI (Message Passing Interface), is the principal parallel programming tool for distributed computer systems. Most MPI programs contain collective communications, which involve all the processes of a parallel program. The efficiency of collective communications substantially affects total program execution time. In this work, we consider the design of adaptive algorithms for collective communications, using barrier synchronization, one of the most common types of collective communication, as an example. We developed an adaptive barrier synchronization algorithm that suboptimally selects the barrier scheme in parallel MPI programs from among the Central Counter, Combining Tree and Dissemination Barrier algorithms. The adaptive algorithm chooses the barrier algorithm with the minimal estimated execution time in the LogP model, which accounts for the performance of computational resources and of the interconnect for point-to-point communications. The proposed algorithm has been implemented for MPI. We present the results of experiments on cluster systems and analyse how the algorithm selection depends on the values of the LogP parameters. In particular, for fewer than 20 processes the adaptive algorithm selects Combining Tree, while for a larger number of processes it selects Dissemination Barrier. The developed algorithm reduces the average barrier synchronization time by 4% compared with the most common deterministic barrier algorithms. (C) 2021 The Authors. Published by Elsevier B.V.
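The selection step described in this abstract can be illustrated with a minimal sketch. The cost formulas below are simplified, textbook-style LogP estimates chosen for illustration only; they are assumptions and are not the formulas used in the paper. The names `barrier_costs` and `choose_barrier` and the `arity` parameter are hypothetical. Because the estimates are this coarse, the ranking they produce need not match the measured results reported above.

```python
def ceil_log(p, base=2):
    """Smallest d such that base**d >= p (number of rounds or tree levels)."""
    depth, reach = 0, 1
    while reach < p:
        reach *= base
        depth += 1
    return depth


def barrier_costs(p, L, o, g, arity=2):
    """Rough LogP time estimates for three barrier schemes (illustrative assumptions).

    p: number of processes, L: latency, o: per-message overhead, g: gap.
    """
    if p < 2:
        raise ValueError("a barrier needs at least two processes")
    step = L + 2 * o   # one message: send overhead + latency + receive overhead
    gap = max(g, o)    # minimum spacing between consecutive messages at one process
    return {
        # Root collects p-1 arrivals one after another, then sends p-1 releases.
        "central_counter": 2 * (step + (p - 2) * gap),
        # Each tree level gathers `arity` children; one pass up, one pass down.
        "combining_tree": 2 * ceil_log(p, arity) * (step + (arity - 1) * gap),
        # ceil(log2 p) rounds, one send and one receive per process per round.
        "dissemination": ceil_log(p, 2) * step,
    }


def choose_barrier(p, L, o, g, arity=2):
    """Return the name of the scheme with the smallest estimated time."""
    costs = barrier_costs(p, L, o, g, arity)
    return min(costs, key=costs.get)
```

In an adaptive implementation, `L`, `o` and `g` would be measured on the target cluster, and the chosen name would decide which barrier routine is called.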
ISBN: 9781665441056 (print)
SUNRAY-1D is a one-dimensional large-signal code for analyzing the beam-wave interaction in helix traveling-wave tubes (TWTs). To improve the performance of SUNRAY-1D, parallelization of a few of its modules has been initiated. A parallel implementation of the space charge force module using MPI (Message Passing Interface) has been successful. The key benefits achieved are increased accuracy and reduced computation time.
ISBN: 9781665443111 (print)
Cloud Warehouses have been expanding their computational resources to cover the growing offloading of tenants' applications. Currently, cloud nodes integrate heterogeneous resources, such as CPUs and GPUs, so they can exploit the different types and levels of parallelism available in the applications. However, heterogeneous cloud nodes bring challenges to the software development process, since the programmer must be aware of each device's specifications and must analyze and distribute the code over the available devices. Even though OpenCL supports transparent programming on heterogeneous devices, easing the programmer's burden, the choice of target device is still the programmer's responsibility. Given that, this work proposes a framework for the execution of OpenCL applications in a multi-tenant CPU-GPU cloud environment, which transparently schedules the applications to the best available device without any interaction from the programmer. The framework aims to optimize resource provisioning and to reduce makespan and energy consumption. For the PolyBench benchmark suite, the framework shows a makespan reduction of 3.4x and energy savings of 33% compared with standalone GPU execution.
ISBN: 9781665423694 (print)
Writing efficient, scalable, and portable HPC synthetic aperture radar (SAR) applications is increasingly challenging due to the growing diversity and heterogeneity in distributed systems. Considerable developer and computational resources are often spent to port applications to new HPC platforms and architectures, which is both time-consuming and expensive. Domain-specific languages have been shown to make development highly productive, but additionally achieving both scalable computational efficiency and platform portability remains challenging. The Halide programming language is both productive and efficient for dense data processing, supports common CPU architectures and heterogeneous resources like GPUs, and has previously been extended for distributed processing. We propose to use a distributed Halide implementation for scalable and heterogeneous HPC SAR processing. We implement a backprojection algorithm for SAR image reconstruction and demonstrate scalability on the OLCF Summit supercomputer up to 1,024 compute nodes (43,008 cores, each with 4 hardware threads) with a large 32,768x32,768 dataset, and up to 8 distributed GPUs with an 8,192x8,192 dataset. Our results show excellent scaling and portability to heterogeneous resources, and motivate additional improvements in Halide to better support distributed high-performance signal processing.
ISBN: 9783030916077; 9783030916084 (print)
This paper presents a spell checker project based on Levenshtein distance and evaluates the system's performance in both parallel and sequential implementations. The following Levenshtein-based approaches are presented: Levenshtein Matrix Distance, Levenshtein Vector Distance, the Levenshtein automaton (along with an optimised version), and the Levenshtein trie. The performance evaluation is performed using three edit distances. Each edit distance is evaluated on a set of misspelt words, so the results are relevant for various cases. For this scenario, the Levenshtein trie, along with the Levenshtein automaton, performed best in both the sequential and parallel versions for a large number of misspelt words.
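As a rough illustration of the row-vector formulation of Levenshtein distance, the sketch below keeps only two rows of the dynamic-programming table instead of the full matrix. It is a standard textbook version written under that assumption, not the paper's implementation; the function names `levenshtein` and `suggest` and the `max_distance` parameter are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the stored rows stay short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # deleting i characters of a yields the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, candidate), candidate) for candidate in dictionary)
    return [candidate for distance, candidate in scored if distance <= max_distance]
```

Each call to `levenshtein` is independent of the others, so the loop over the dictionary in `suggest` is the natural place to parallelise, for example by splitting the dictionary across worker processes.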
ISBN: 9783030816827; 9783030816810 (print)
High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach to parallel computing that enables the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OMP-opt). We benchmarked these libraries on a Parallel Ultra-Low Power (PULP) MCU, showing that hardware-specific optimizations improve OMP-base performance by up to 69%. At the same time, OMP-SPMD leads to an extra improvement of up to 178%.
ISBN: 9781728181042 (print)
Modern multi-core servers are powerful enough to process multi-gigabit live packet streams on the network data plane. However, in most cases network programmers must build their applications from scratch, implementing both the interfaces to the lower hardware level and the proper mechanisms for parallel programming. Data Stream Processing (DaSP) frameworks have recently emerged as promising approaches to overcome these issues and to let programmers focus on the logic of the application being developed. However, DaSP platforms are generally not designed for the networking domain, in terms of either performance or functions. In this paper, we selected the WindFlow DaSP framework and built suitable extensions to attach multiple (accelerated) packet data sources to it. We then implemented a simple monitoring application on top of WindFlow and carried out stress tests with synthetic and real traffic. The results show that performance scales linearly with the number of processing cores, so that the application was able to process the whole amount of live data at rates up to nearly 20 Gbps.
ISBN: 9781665404761 (print)
The article discusses a way to improve the efficiency of complex query execution in modern DBMSs. The method is based on the use of tree structures, key hash codes, and the possibility of optimization based on partitioning. As an additional aspect of optimization, a parallel mode of operation of the proposed method is described.
ISBN: 9781450382984 (print)
Inspired by earlier work on Augur, Vate is a probabilistic programming language for the construction of JVM-based probabilistic models with an object-oriented interface. As a compiled language, it is able to examine the dependency graph of the model to produce optimised code that can be dynamically targeted to different platforms. Using Gibbs sampling, Metropolis-Hastings and variable marginalisation, it can handle a range of model types and is able to efficiently infer values, estimate probabilities, and execute models.
ISBN: 9783030869601; 9783030869595 (print)
Despite the widespread use of agent-based models in ecological modeling and several other areas, modelers have been concerned about how time-consuming these models are. This paper presents a strategy to parallelize an agent-based model of the spatial distribution of biological species, operating in a multi-stage synchronous distributed-memory mode, as a way to obtain performance gains while reducing the need for synchronization. A multiprocessing implementation divides the environment (a rectangular grid corresponding to the study area) into stage-subsets, according to the number of defined or available processes. To ensure that no information is lost, each stage-subset is extended with an overlapping section from each of its neighbouring stage-subsets. The effect of the size of this overlap on the quality of the simulations is studied. The results seem to indicate that it is possible to establish an optimal trade-off between the level of redundancy and the synchronization frequency. The reported parallelization method was tested on a standalone multicore machine but may scale seamlessly to a computing cluster.
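The decomposition with overlap described in this abstract can be sketched as follows. This is a minimal illustration that assumes the grid is split into horizontal strips of rows; the abstract does not state the shape of the stage-subsets, and the function name `split_with_overlap` and its parameters are hypothetical.

```python
def split_with_overlap(n_rows, n_parts, overlap):
    """Split grid rows 0..n_rows-1 into n_parts strips, each widened by `overlap` rows.

    Returns one dict per strip with half-open (start, stop) row ranges:
    "core" is the part the process owns, "extended" adds the neighbours' rows.
    """
    if not 1 <= n_parts <= n_rows:
        raise ValueError("need between 1 and n_rows parts")
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    base, extra = divmod(n_rows, n_parts)  # the first `extra` strips get one more row
    parts = []
    start = 0
    for k in range(n_parts):
        stop = start + base + (1 if k < extra else 0)
        parts.append({
            "core": (start, stop),
            # Clamp at the grid border, where there is no neighbour to borrow from.
            "extended": (max(0, start - overlap), min(n_rows, stop + overlap)),
        })
        start = stop
    return parts
```

A larger `overlap` means more redundant work per process but lets agents move further before neighbouring strips have to exchange state, which is the redundancy versus synchronization trade-off the abstract refers to.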