Parallel programming and distributed programming involve substantial amounts of boilerplate code for process management and data synchronisation. This leads to increased bug potential and often results in unintended non-deterministic program behaviour. Moreover, algorithmic details are mixed with technical details concerning parallelisation and distribution. Process calculi are formal models for parallel and distributed programming but often leave details open, causing a gap between formal model and implementation. We propose a fully deterministic process calculus for parallel and distributed programming and implement it as a domain-specific language in Haskell to address these problems. We eliminate boilerplate code by abstracting from the exact notion of parallelisation and encapsulating it in the implementation of our process combinators. Furthermore, we achieve correctness guarantees regarding process composition at compile time through Haskell's type system. Our result can be used as a high-level tool to implement parallel and distributed programs.
Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores: scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach combining CUDA and MPI, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to an MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.
ISBN (digital): 9798350356038
ISBN (print): 9798350356045
We provide a detailed evaluation of several parallel programming models, emphasizing both performance and energy efficiency in heterogeneous computing systems. The evaluation employs a diverse array of hardware, including Intel Xeon and AMD Epyc CPUs, along with NVIDIA GPUs featuring Pascal, Turing, and Ampere architectures, and an AMD GPU with Vega10 architecture. We utilize SYCL, OpenMP, CUDA, and HIP for implementing benchmarks in 11 varied application domains, offering a comprehensive perspective on the capabilities of these programming models in diverse computing environments.
Shape theory is a new approach to data types and programming based on the separation of a data type into its "shape" and "data" parts. Shape is common in parallel computing. This paper identifies areas where the explicit use of shape reduces the burden of programming a parallel computer, using examples from an implementation of Cholesky decomposition.
Many advanced real-time robot control systems use multiprocessor parallelism to provide the necessary computing power and low response time to external events. Multiprocessor parallelism requires the decomposition of the control software into parallel processes. A natural and efficient way to parallelize the control software is pipelining: data are transformed by the different stages of the pipeline, starting from the high-level user specification down to the low-level control signals. The pipeline reflects the hierarchical structure of the software and, at the same time, allows the use of true hardware parallelism. This approach works well in applications with a single direction of information flow. However, in systems with feedback loops, the pipeline delay causes correction signals to be computed on stale data. To solve this problem one could omit the buffers and use machine language, but then all advantages of concurrent high-level languages are lost. The solution proposed in the paper preserves the advantages of parallel asynchronous processes written in a high-level language. It is based on a decomposition of the global control strategy into a nested control structure, akin to human reflexes. The inner structure generates a fast, autonomous, but approximate response to external stimuli, while the outer structure is responsible for slower but more accurate behavior. The linearized entities of the inner loop can be updated in parallel at a much lower rate than the rate at which they are used.
ISBN (print): 0818686030
The paper reports the design of a runtime library for data-parallel programming on clusters of symmetric multiprocessors (SMP clusters). Our design algorithms exploit a hybrid methodology which maps directly onto the underlying hierarchical memory system of SMP clusters by combining two programming styles: threads (shared-memory programming) within an SMP node and message passing between SMP nodes. This hybrid approach has been used in the implementation of a library for collective communications. The prototype library is implemented on top of standard interfaces for threads (pthread) and message passing (MPI). Experimental results on a cluster of Sun UltraSparc-II workstations are reported.
A software environment called Parade (Parallel And Distributed Environment) for parallel programming is proposed. Its main objective is to make programming as easy as possible. The development, debugging, execution, monitoring and optimization of parallel programs have been addressed. Several aspects of parallel program execution, such as task assignment to processors, task synchronization and communication, are handled by the system automatically. Two types of parallel system architectures have been considered: tightly coupled (i.e. multiprocessor systems) and loosely coupled with distributed memory (i.e. multicomputer systems). A user-friendly visual interface, realized with OSF/Motif, is provided for all phases of parallel program development. A prototype of our environment is running on a Meiko Transputer-based system and on a network of Unix-based workstations.
In the paper, a functional parallel programming system for clusters and multicore computers is discussed. It includes a parallel programming language, program development tools, and tools for controlling parallel execution on the computer system. The central part of the system is the original compositional functional parallel programming language FPTL (Functional Parallel Typified Language).