ISBN (digital): 9798350386776
ISBN (print): 9798350386783
The popularity of multicore processors and the rise of High Performance Computing as a Service (HPCaaS) have made parallel programming essential to fully utilize the performance of multicore systems. OpenMP, a widely adopted shared-memory parallel programming model, is favored for its ease of use, yet assisting and accelerating the automation of its parallelization remains challenging. Although existing automation tools such as Cetus and DiscoPoP simplify parallelization, they still face limitations when dealing with complex data dependencies and control flows. Inspired by the success of deep learning in Natural Language Processing (NLP), this study adopts a Transformer-based model to tackle the problem of automatic parallelization with OpenMP directives. We propose a novel Transformer-based multimodal model, ParaMP, to improve the accuracy of OpenMP directive classification. ParaMP not only takes into account the sequential features of the code text but also incorporates structural features, enriching the model's input by representing the Abstract Syntax Trees (ASTs) corresponding to the code as binary trees. In addition, we built the BTCode dataset, which contains a large number of C/C++ code snippets and their corresponding simplified AST representations, to provide a basis for model training. Experimental evaluation shows that our model outperforms existing automated tools and models on key performance metrics such as F1 score and recall. By combining the sequential and structural features of code text, this study significantly improves the accuracy of OpenMP directive classification and offers valuable insight into applying deep learning techniques to programming tasks.
ISBN (digital): 9798350364606
ISBN (print): 9798350364613
GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. In order to achieve portability among a variety of multi/many core architectures, a popular choice for an application developer is to utilize directive-based parallel programming models, such as OpenMP. However, even with OpenMP, the developer must choose from among many strategies for exploiting a GPU or a CPU. This paper introduces a new graph-based program representation for optimization of OpenMP applications. The originality of this work lies in the augmentations of Abstract Syntax Trees (ASTs) and the introduction of edge weights to account for loop and condition information. We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of OpenMP code regions across CPUs and GPUs. Various transformations utilizing collapse and data transfer between the CPU and GPU are used to construct the dataset. The trained model is used to determine which transformation provides the best performance. Results indicate that our approach is effective, with normalized RMSE from as low as $4\times 10^{-3}$ to at most $1\times 10^{-2}$ in its runtime predictions.
ISBN (digital): 9798350364606
ISBN (print): 9798350364613
With the development of the GPU, parallel languages are widely used for developing modern parallel applications. Given its low energy cost and programmable hardware, the FPGA emerges as a promising candidate for running GPU applications. Executing applications written in GPU programming languages on FPGAs can therefore offer new opportunities in terms of performance and energy efficiency. However, the gap between GPU programming languages and hardware description languages (HDLs) poses a significant challenge for this transition. To overcome this problem, existing works have attempted to bridge the gap through high-level synthesis (HLS) or soft GPUs. In this paper, we examine how HLS and soft GPUs compile GPU languages for FPGAs by discussing the detailed compilation and execution flow of two representative works: the Intel FPGA SDK for OpenCL and Vortex. We also evaluate the coverage of both approaches and discuss methods for addressing the challenges each one faces, aiming to identify the new problems and opportunities each approach introduces.
ISBN (digital): 9798331531409
ISBN (print): 9798331531416
Not all data sharing patterns benefit from the write-invalidate strategy in multi-core systems. When handling serialized synchronizations, such as locks and barriers, the long delays and large amounts of cache coherence traffic caused by invalidations introduce performance bottlenecks. This paper introduces a privately non-cacheable strategy (PNCS), which favors shared data exhibiting write-once characteristics, such as locks used for synchronization among threads, by caching the corresponding data block only in the shared last-level cache (LLC) rather than in the private caches. Cooperating with the traditional MESI cache coherence protocol, PNCS forwards a shared lock to the next sharer waiting in the request queue at the LLC without incurring another round of LLC lookup. Simulation results show that PNCS accelerates acquisition of the variable by the requesters while cutting down invalidation traffic during a synchronization phase. Experiments with applications involving large-scale thread synchronization under parallel programming directives demonstrate that PNCS scales in multi-core systems. In the scenario of 64-thread lock synchronization, the average latency of contending requests is reduced to about 73% of that under "cache lock", a strict write-invalidate strategy proposed by Intel.
A compiler’s intermediate representation (IR) defines a program’s execution plan by encoding its instructions and their relative order. Compiler optimizations aim to replace a given execution plan (which instructions to execute and when) with a semantically-equivalent one that increases the program’s performance on the target architecture. Alternative representations of an IR, like the Program Dependence Graph (PDG), aid this process by capturing the minimum set of constraints that semantically-equivalent execution plans must satisfy. Parallel programming models like OpenMP extend a sequential execution plan with the possibility of running instructions in parallel, creating a parallel execution plan. Recently introduced parallel IRs, like TAPIR, explicitly encode a parallel execution plan. These new IRs finally make it possible for compilers to change the parallel execution plan expressed by programmers to better fit the target parallel architecture. Unfortunately, parallel IRs do not help compilers identify the set of parallel execution plans that preserve the original semantics. In other words, we still lack an alternative representation of parallel IRs that captures the minimum set of constraints parallel execution plans must satisfy to be semantically-equivalent. The PDG is not an ideal candidate for this task, as it was designed for sequential code; in more detail, this paper shows that the PDG over-constrains the optimization space when used for parallel code. We propose the Parallel Semantics Program Dependence Graph (PS-PDG) to precisely capture the salient program constraints that all semantically-equivalent parallel execution plans (and therefore parallel IRs) must satisfy. This paper defines the PS-PDG, justifies the necessity of each extension to the PDG, and demonstrates the increased optimization power of the PS-PDG over an existing PDG-based automatic-parallelizing compiler. Compilers can now rely on the PS-PDG to select d
ISBN (digital): 9798350364606
ISBN (print): 9798350364613
Strong-motion processing holds paramount importance in earthquake engineering and disaster risk management systems. By leveraging parallel loops and task-parallelism techniques, we address computational challenges posed by large-scale accelerographic datasets. Through experimentation with more than one million data points from six real-world seismic events, our approach achieved speedups of up to 2.9x, demonstrating the effectiveness of parallel programming in accelerating seismic data processing. Our findings highlight the significance of parallel programming techniques in advancing seismological research and enhancing earthquake mitigation strategies.
ISBN (digital): 9798350365610
ISBN (print): 9798350365627
Task-based execution frameworks, such as parallel programming libraries, computational workflow systems, and function-as-a-service platforms, enable the composition of distinct tasks into a single, unified application designed to achieve a computational goal and abstract the parallel and distributed execution of those tasks on arbitrary hardware. Research into these task executors has accelerated as computational sciences increasingly need to take advantage of parallel compute and/or heterogeneous hardware. However, the lack of evaluation standards makes it challenging to compare and contrast novel systems against existing implementations. Here, we introduce TaPS, the Task Performance Suite, to support continued research in distributed task executor frameworks. TaPS provides (1) a unified, modular interface for writing and evaluating applications using arbitrary execution frameworks and data management systems and (2) an initial set of reference synthetic and real-world science applications. We discuss how the design of TaPS supports the reliable evaluation of frameworks and demonstrate TaPS through a survey of benchmarks using the provided reference applications.
ISBN (digital): 9798350372977
ISBN (print): 9798350372984
The projection of LiDAR 3D point cloud data is one of the crucial steps in computer vision applications, involving several stages to achieve accurate final results, and many current studies leverage the computing capability of GPUs for it. This paper presents a comparative study of the implementation and testing of this process on single-core CPU, multi-core CPU, and GPU architectures. The computational efficiency of each platform is evaluated through a series of benchmarks, including data extraction, segmentation, and transformation tasks. Our analysis reveals the inherent parallelization benefits of GPUs in handling large-scale point cloud data, while also considering the accessibility of multi-core CPUs. A comparison between the NVIDIA RTX 3070 and NVIDIA RTX 4060 is also provided: the RTX 3070 showed roughly an 8x speedup over the RTX 4060, and the multi-core implementation outperformed the single-core one by up to 10x. Overall, these results show the benefits of multi-core and GPU acceleration for this application, with room for further improvement.
ISBN (print): 9780769551173
This paper introduces an aspect-oriented library aimed at supporting efficient execution of Java applications on multi-core systems. The library is coded in AspectJ and provides a set of parallel programming abstractions that mimics the OpenMP standard. It supports the migration of sequential Java code to multi-core machines with minor changes to the base code, intrinsically supports the sequential semantics of OpenMP, and provides improved integration with object-oriented mechanisms. The aspect-oriented nature of the library enables the encapsulation of parallelism-related code into well-defined modules, making the parallelisation and maintenance of large-scale Java applications more manageable. Furthermore, the library can be used with plain Java annotations and can easily be extended with application-specific mechanisms to tune application performance. The library has competitive performance in comparison with traditional parallel programming in Java and enhances programmability, since it allows parallelism-related code to be developed independently.
ISBN (digital): 9798350379945
ISBN (print): 9798350379952
Nowadays, in different areas of knowledge, the amount of information that needs to be processed is increasing, which is why many solutions for high-performance computing have been developed; these solutions depend on many factors, including the different architectures available. This work presents a method for configuring a low-cost HPC-based solution using the OpenMP and OpenMPI libraries. The processes necessary to implement programs that exploit these two parallel programming libraries are described. As a result, the study presents an application of the methodology to file compression, implemented with Huffman's algorithm; the results demonstrate the optimization achieved when working in parallel with the OpenMP and OpenMPI libraries, which makes it possible to use all processors available across different computer architectures. The study indicates the mode of use and application of the described methodology.