Heterogeneous systems with several kinds of devices, such as multi-core CPUs, GPUs, and FPGAs, are now commonplace. Exploiting all these devices with device-oriented programming models, such as CUDA or OpenCL, requires expertise and knowledge about the underlying hardware to tailor the application to each specific device, thus degrading performance portability. Higher-level proposals simplify the programming of these devices, but their current implementations do not efficiently support problems that include frequent bursts of computation and communication, or input/output operations. In this work we present CtrlEvents, a new heterogeneous runtime solution which automatically overlaps computation and communication whenever possible, simplifying and improving the efficiency of data-dependency analysis and the coordination of both device computations and host tasks that include generic I/O operations. Our solution outperforms other state-of-the-art implementations in most situations, presenting a good balance between portability, programmability and efficiency. (c) 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Different high performance techniques, such as profiling, tracing, and instrumentation, have been used to tune and enhance the performance of parallel applications. However, these techniques do not show how to explore the potential of parallelism in a given application. Animating and visualizing the execution process of a sequential algorithm provide a thorough understanding of its usage and functionality. In this work, an interactive web-based educational animation tool was developed to assist users in analyzing sequential algorithms to detect parallel regions regardless of the used parallel programming model. The tool simplifies algorithms' learning, and helps students to analyze programs efficiently. Our statistical t-test study on a sample of students showed a significant improvement in their perception of the mechanism and parallelism of applications and an increase in their willingness to learn algorithms and parallel programming.
The article outlines a contemporary method for creating software for multi-processor computers. It describes the identification of parallelizable sequential code structures. Three structures were found and then carefully examined. The algorithms used to determine whether or not certain parts of code may be parallelized result from static analysis. The techniques demonstrate how, if possible, existing sequential structures might be transformed into parallel-running programs. A dynamic evaluation is also a part of our process, and it can be used to assess the efficiency of the parallel programs that are developed. As a tool for sequential programs, the algorithms have been implemented in C#. All proposed methods were discussed using a common benchmark.
ISBN (digital): 9798331523893
ISBN (print): 9798331523909
In the rapidly evolving landscape of heterogeneous computing, the efficiency of data movement between CPUs and GPUs can make or break system performance. Despite advancements in parallel processing, existing methods for managing data transfers—particularly in GPU offloading scenarios—suffer from significant inefficiencies. These inefficiencies are particularly evident in nucleation list precomputation for non-equilibrium solidification models, where redundant data movements and complex dynamic work-sharing in OpenMP lead to significant performance overhead. To tackle this issue, this paper proposes a novel solution that integrates the Location-Aware Heap Static Single Assignment (LASSA) algorithm into the compilation process. This approach identifies and eliminates redundant memory copy operations, optimizing data transfers and reducing overhead. The findings reveal a dramatic performance boost, with up to a 9.6-fold increase in efficiency. By addressing the specific challenges of nucleation list precomputation, this work provides valuable insights into optimizing data movement in heterogeneous computing environments, paving the way for enhanced performance in parallel programming models.
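The kind of redundant transfer described above can be illustrated by hand. The following is a minimal sketch, not the LASSA algorithm and not code from the paper: it assumes two consecutive offloaded loops over the same array, and the function name `scale_then_shift` is hypothetical. Without the enclosing `target data` region, each `target` construct would copy the array to the device and back; with it, the array is mapped once and stays resident between the two kernels. A compiler pass of the kind the abstract describes would aim to find such opportunities automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: two device kernels that touch the same array.
// If built without OpenMP support, the pragmas are ignored and the loops
// run serially on the host with the same result.
void scale_then_shift(std::vector<double>& v, double scale, double shift) {
    double* p = v.data();
    const long n = static_cast<long>(v.size());

    // Map the array once for the whole region: one host-to-device copy on
    // entry and one device-to-host copy on exit.
    #pragma omp target data map(tofrom: p[0:n])
    {
        // First kernel: the data is already present on the device.
        #pragma omp target teams distribute parallel for
        for (long i = 0; i < n; ++i) {
            p[i] *= scale;
        }

        // Second kernel: reuses the resident copy, so no transfer is
        // needed between the two kernels.
        #pragma omp target teams distribute parallel for
        for (long i = 0; i < n; ++i) {
            p[i] += shift;
        }
    }
}
```

Removing the `target data` line would leave the results unchanged but would add a device-to-host and a host-to-device copy between the two loops, which is the overhead the abstract refers to.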
The success of an efficient and effective aggregator-based residential demand response system in the smart grid relies on the day-ahead customer incentive pricing (CIP) and the load shifting protocols. An artificial neural network model is designed to generate the day-ahead CIP for the aggregator based on historical data. Load scheduling is proposed as a day-ahead optimization problem that is solved using a blocked sliding window technique using parallel computing. With the assumptions made, the proposed algorithm improved the aggregator performance by reducing the overall simulation time from 275 to 45 min and increasing the aggregator forecast profits and customer savings by 11.85% and 35.99% compared to the previous genetic algorithm-based approach.
Concealing secret information in an image so that any perceptible evidence of the image alteration is insignificant is known as image steganography. Image steganography can be implemented with either spatial or transform domain techniques. Spatial domain-based algorithms, generally the most widely used ones, embed the secret information in the least significant bit positions of the cover image pixels. This paper proposes a chaotic tent map-based bit embedding as a novel steganography algorithm with a multicore implementation. The main reason for using chaotic maps in image steganography is the sensitivity of these functions to initial conditions and control parameters. The computational complexity of the sequential least significant bit algorithm is known to be O(n), so the time complexity of the encryption/decryption algorithm is also a very important aspect. With the advantages offered by multicore processors, the proposed steganography algorithm can be explicitly parallelized using the OpenMP API. As a pre-embedding operation, the quality of the randomness of the chaotic number sequences is tested with a NIST cryptographic test suite. The quality of the stego image is validated with statistical parameters such as the structural similarity index (SSIM), mean square error (MSE) and peak signal-to-noise ratio (PSNR). Moreover, a multicore implementation of the algorithm with the OpenMP API, exploiting the data parallelism inherent in the algorithm, is also reported. The proposed parallel version of the technique has been tested on five test images for scalability analysis, and the results indicate a significant speedup compared to the sequential implementation.
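To make the idea of tent map-driven least significant bit embedding more concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not say how the chaotic sequence is used, so this sketch assumes the tent map output is thresholded into key bits that are XORed with the secret bits before they are written into the pixel LSBs. The function names (`tent_key_bits`, `embed_lsb`, `extract_lsb`) and the parameter choices are hypothetical. The key generation is a sequential recurrence, while the per-pixel embedding loop is data parallel and carries an OpenMP pragma (ignored if OpenMP is not enabled).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tent map: x -> mu*x if x < 0.5, otherwise mu*(1 - x).
// Each iterate is thresholded at 0.5 to give one key bit (an assumption).
// mu slightly below 2 avoids the floating-point collapse seen at mu == 2.
std::vector<std::uint8_t> tent_key_bits(double x0, double mu, std::size_t n) {
    assert(x0 > 0.0 && x0 < 1.0);
    std::vector<std::uint8_t> bits(n);
    double x = x0;
    for (std::size_t i = 0; i < n; ++i) {  // sequential: x depends on previous x
        x = (x < 0.5) ? mu * x : mu * (1.0 - x);
        bits[i] = (x >= 0.5) ? 1 : 0;
    }
    return bits;
}

// Write (secret bit XOR key bit) into the least significant bit of each pixel.
void embed_lsb(std::vector<std::uint8_t>& pixels,
               const std::vector<std::uint8_t>& secret_bits,
               const std::vector<std::uint8_t>& key) {
    assert(secret_bits.size() <= pixels.size());
    assert(secret_bits.size() <= key.size());
    const long n = static_cast<long>(secret_bits.size());
    #pragma omp parallel for  // iterations are independent of each other
    for (long i = 0; i < n; ++i) {
        const std::uint8_t b = (secret_bits[i] ^ key[i]) & 1u;
        pixels[i] = static_cast<std::uint8_t>((pixels[i] & 0xFEu) | b);
    }
}

// Recover the secret bits by reading each LSB and undoing the XOR.
std::vector<std::uint8_t> extract_lsb(const std::vector<std::uint8_t>& pixels,
                                      const std::vector<std::uint8_t>& key,
                                      std::size_t n) {
    assert(n <= pixels.size() && n <= key.size());
    std::vector<std::uint8_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>((pixels[i] & 1u) ^ key[i]);
    }
    return out;
}
```

Because only the lowest bit of each pixel changes, every pixel value moves by at most one intensity level, which is why metrics such as MSE and PSNR stay close to those of the cover image. The receiver needs the same initial value and control parameter to regenerate the key bits.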
ISBN (digital): 9798331524937
ISBN (print): 9798331524944
The C++ language continually evolves through formal specifications established by its standards committee, which proposes new features to keep C++ relevant as a programming language while improving usability, performance, and portability across platforms. With the addition of parallel Standard Template Library (STL) algorithms in C++17, programmers can now leverage parallel processing capabilities via vendor-neutral parallel execution policies. This study presents an adaptation of the NAS Parallel Benchmarks (NPB), a well-established suite of applications for evaluating parallel architectures, by porting its sequential C-style code to use C++ STL abstractions and performance-portable parallelism features. Our goals are to (1) assess the suitability of C++ STL for scientific applications like the ones in the NPB and (2) provide a comparative analysis of the performance and portability of the STL algorithms' parallel execution policies across different multicore architectures (x86 and AArch64). Results indicate that the performance of parallel STL algorithms is often close to that of optimized handwritten versions (OpenMP, Intel TBB, and FastFlow) on different architectures, with notable shortfalls. Across all NPB benchmarks, the STL algorithms' geometric mean shows sequential execution times that are between 3.76% and 6.9% higher, while parallel executions may reach a geometric mean of up to 21.21% higher execution time.
K-means is a popular clustering algorithm with significant applications in numerous scientific and engineering areas. One drawback of K-means is its inability to identify non-linearly separable clusters, which may lea...
ISBN (digital): 9798350361674
ISBN (print): 9798350361681
With the continuous increase in data size and model complexity, the computational workload has grown rapidly, posing a significant challenge to the capabilities of computer data processing and simulation calculations. Therefore, parallel programming based on multicore and cluster architectures has become one of the mainstream technologies to enhance program execution efficiency and numerical computation efficiency. The theoretical foundations of parallel program computing have been applied in various aspects of engineering applications and theoretical simulations. In this paper, a new parallel PID anti-integral saturation controller is designed for a second-order closed-loop control system of unmanned aerial vehicles (UAVs). It compares and analyzes the runtime, execution efficiency, and speedup ratio of parallel programs and general serial programs under the same scenario. The experimental results demonstrate that parallel computing significantly improves the simulation program's efficiency for PID controller anti-saturation control systems under identical scenarios, exhibiting a high speedup ratio on the existing computing platform. Additionally, this study consolidates common issues encountered in MATLAB parallel program design, offering valuable insights into overcoming challenges in this domain.
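For readers unfamiliar with integral anti-saturation (anti-windup), the sketch below shows one common variant, conditional integration with output clamping. It is an assumption-laden illustration, not the controller from the paper: the abstract does not give the control law, the original work uses MATLAB, and the type name `PidAntiWindup`, its fields, and the gains used in any example are hypothetical. The parallel execution aspect of the paper is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical PID controller with clamped output and conditional
// integration: the integral term is frozen while the output is saturated
// and the error would push it further into saturation.
struct PidAntiWindup {
    double kp, ki, kd;     // proportional, integral, derivative gains
    double u_min, u_max;   // actuator limits
    double integral;       // accumulated error (integral state)
    double prev_error;     // last error, used for the derivative term

    PidAntiWindup(double kp_, double ki_, double kd_, double lo, double hi)
        : kp(kp_), ki(ki_), kd(kd_), u_min(lo), u_max(hi),
          integral(0.0), prev_error(0.0) {}

    // Advance the controller by one time step and return the clamped output.
    double step(double error, double dt) {
        assert(dt > 0.0);
        const double derivative = (error - prev_error) / dt;
        prev_error = error;

        // Tentative integral update and the resulting unsaturated output.
        const double candidate = integral + error * dt;
        const double u_unsat = kp * error + ki * candidate + kd * derivative;
        const double u = std::min(std::max(u_unsat, u_min), u_max);

        // Accept the integral update unless it would deepen saturation.
        const bool winding_up = (u_unsat > u_max && error > 0.0) ||
                                (u_unsat < u_min && error < 0.0);
        if (!winding_up) {
            integral = candidate;
        }
        return u;
    }
};
```

Without the conditional update, a long period of saturation would let the integral grow without bound, and the output would stay pinned at the limit long after the error changes sign. Freezing the integral lets the controller leave saturation as soon as the error reverses.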
Fortran's prominence in scientific computing requires strategies to ensure both that legacy codes are efficient on high-performance computing systems, and that the language remains attractive for the development o...