Parallel programming is essential to utilize multi-core processors but remains challenging because it requires extensive knowledge of both software and hardware. Various automatic parallelization tools based on static...
Currently, processing large volumes of expanding data efficiently and consistently is a significant challenge. Traditional distributed-memory high-performance computers (HPC) based on the message-passing model struggle wi...
Stream processing applications deal with millions of data items continuously generated over time. Often, they must be processed in real time and scale performance, which requires the use of distributed parallel computing resources. In C/C++, the current state of the art for distributed architectures and High-Performance Computing is the Message Passing Interface (MPI). However, exploiting stream parallelism using MPI is complex and error-prone because it exposes many low-level details to the programmer. In this work, we introduce a new parallel programming abstraction for implementing distributed stream parallelism named DSParLib. Our abstraction over MPI simplifies parallel programming by providing pattern-based, building-block-oriented development to interconnect, model, and parallelize the data streams found in modern applications. Experiments conducted with five different stream processing applications and the representative PARSEC Ferret benchmark revealed that DSParLib is efficient and flexible. Compared with handwritten MPI programs, DSParLib also achieved similar or better performance, required less coding, and provided simpler abstractions for expressing parallelism.
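To make the low-level detail that such an abstraction hides more concrete, the sketch below hand-codes a minimal two-stage stream pipeline directly in MPI: rank 0 acts as the source and rank 1 as the sink. The ranks, tag, sentinel value, and per-item computation are assumptions chosen for this illustration; none of it is DSParLib's API.

```cpp
// Minimal hand-written MPI stream pipeline: rank 0 emits items, rank 1 consumes them.
// Ranks, tag, and the end-of-stream sentinel are assumptions made for this sketch.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int kDataTag = 0;
    const double kSentinel = -1.0;   // marks the end of the stream

    if (rank == 0) {                 // source stage: generate and send items
        for (int i = 0; i < 100; ++i) {
            double item = static_cast<double>(i);
            MPI_Send(&item, 1, MPI_DOUBLE, 1, kDataTag, MPI_COMM_WORLD);
        }
        MPI_Send(&kSentinel, 1, MPI_DOUBLE, 1, kDataTag, MPI_COMM_WORLD);
    } else if (rank == 1) {          // sink stage: receive until the sentinel arrives
        double sum = 0.0;
        while (true) {
            double item = 0.0;
            MPI_Recv(&item, 1, MPI_DOUBLE, 0, kDataTag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (item == kSentinel) break;
            sum += item;             // stand-in for the real per-item computation
        }
        std::printf("sink consumed the stream, sum = %f\n", sum);
    }

    MPI_Finalize();
    return 0;
}
```

Even this two-stage version forces the programmer to manage ranks, tags, and end-of-stream signaling by hand; adding more stages or replicated workers multiplies that bookkeeping, which is exactly the boilerplate a pattern-based, building-block abstraction is meant to hide.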
ISBN: 9798400717901 (print)
SYCL is a parallel programming language that enables heterogeneous computing on various devices. The SYCL CPU device [1] uses the CPU as a device to run SYCL kernels. While most SYCL concepts, such as the device memory model and the sub-group and work-group constructs, can be mapped onto GPU hardware, the CPU device lacks native support for them. These concepts therefore need to be emulated on the CPU device to ensure full hardware utilization and achieve the performance portability of SYCL programs. To facilitate task parallelism at the work-group level, the SYCL CPU device distributes the execution of SYCL work-groups to CPU threads, each of which has a restricted stack size. The SYCL device's memory model consists of three distinct memory regions: local memory is accessible by all the work-items in a single work-group, and private memory is accessible to a single work-item. The CPU device does not have dedicated hardware to support local and private memory, so they are emulated by allocating a block of memory for each of them on the stack. A stack overflow can occur when a kernel uses large private or local memory, as a thread's stack size cannot be changed after its creation. The probability of error is much higher on Windows, since the default stack size of the master thread is only 1 MB. To address this issue, the SYCL CPU device previously adopted a context-swapping approach that expands the stack size using low-level APIs provided by the operating system. The application master thread's stack size is 8 MB on Linux and 1 MB on Windows; the stack size of the other worker threads is set to 8 MB on a 64-bit system and 4 MB on a 32-bit system. When a work-group requires a stack larger than that of its executing thread, the SYCL CPU device runtime swaps the thread's context before execution. However, this method results in large-scale performance degradation on Windows because the swapping involves frequent and inefficient memory movement. Some SYCL workloads on Windows even hang with...
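As a concrete illustration of where this stack pressure comes from, below is a minimal SYCL 2020 sketch of a kernel that requests a sizeable per-work-group local-memory allocation; the device selection, sizes, and kernel body are assumptions made for this example, not code from the paper. On a CPU device that emulates local memory on the executing thread's stack, growing kLocalFloats toward the megabyte range is exactly the situation that would overflow a 1 MB default stack and trigger the fallback described above.

```cpp
// Minimal SYCL 2020 sketch: the per-work-group local memory requested here is what a
// CPU device has to emulate (e.g., on the executing thread's stack). Sizes are assumptions.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t kGroups      = 64;
    constexpr size_t kGroupSize   = 128;
    constexpr size_t kLocalFloats = 64 * 1024;   // 256 KB of local memory per work-group

    sycl::queue q{sycl::cpu_selector_v};         // target the SYCL CPU device

    std::vector<float> out(kGroups * kGroupSize, 0.0f);
    {
        sycl::buffer<float, 1> buf{out.data(), sycl::range<1>{out.size()}};

        q.submit([&](sycl::handler& h) {
            sycl::accessor acc{buf, h, sycl::write_only};
            // Local memory: shared by all work-items of one work-group.
            sycl::local_accessor<float, 1> scratch{sycl::range<1>{kLocalFloats}, h};

            h.parallel_for(
                sycl::nd_range<1>{sycl::range<1>{kGroups * kGroupSize},
                                  sycl::range<1>{kGroupSize}},
                [=](sycl::nd_item<1> it) {
                    const size_t lid = it.get_local_id(0);
                    scratch[lid] = static_cast<float>(lid);   // touch the local block
                    sycl::group_barrier(it.get_group());
                    acc[it.get_global_id(0)] = scratch[lid];
                });
        });
    } // buffer destructor waits for the kernel and copies results back to 'out'
    return 0;
}
```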
The performance bottleneck of math library functions has long been a common issue among major manufacturers. To overcome these bottlenecks, this paper proposes the Rlibm-OMP method, which combines the RLibm fast polyn...
We aim to classify acoustic events recorded by a fiber optic distributed acoustic sensor (DAS). We derived the information from probing the fiber with light pulses and analyzing the Rayleigh backscatter. We then processed these data with a pipeline of processing algorithms to form the input for our machine learning classification model. We put random matrix theory to the test to distinguish the acoustic event of interest from the noise. We conditioned the raw trace using moving-average and wavelet-based filtering algorithms to improve the signal-to-noise ratio. For the raw, low-pass, and wavelet-based filtered signals that we inject into a convolutional neural network, we rely on the magnitude of their complex coefficients to categorize the nature of the event. We also investigate Mel-Frequency Cepstral Coefficients (MFCCs) specific to the event as an input for the classifier and compare their performance to other signal representations. We run the experiments on the CNN for two-class and three-class classification using datasets from a DAS deployed for perimeter security and pipeline monitoring. We obtained the best results when using the MFCCs paired with wavelet denoising, achieving accuracies of 96.4% for the "event" class and 99.7% for the "no event" class in the two-class task. The three-class task yielded optimal accuracies of 83.3%, 81.3%, and 96.7% for the "digging," "walking," and "excavation" classes, respectively. Finally, training times are exceptionally long because the dataset is extensive and the model's architecture is complex. We therefore make efficient use of both the CPU and GPU with the Keras API's sequence data generator to maximize our machine's throughput. Compared with the serial implementation, we report an improvement of up to 4.87 times. (c) 2022 Society of Photo-Optical Instrumentation Engineers (SPIE)
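As a rough illustration of the trace-conditioning step mentioned above, the sketch below applies a simple centered moving-average filter to one raw trace; the window length and edge handling are assumptions for this example, and the wavelet-denoising, MFCC, and CNN stages of the actual pipeline are not shown.

```cpp
// Simple centered moving-average filter for one raw DAS trace (illustrative sketch).
// Window length and edge handling are assumptions, not the paper's exact parameters.
#include <vector>
#include <cstddef>
#include <algorithm>

std::vector<double> movingAverage(const std::vector<double>& trace, std::size_t window) {
    std::vector<double> smoothed(trace.size(), 0.0);
    const std::size_t half = window / 2;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const std::size_t lo = (i >= half) ? i - half : 0;               // clamp at the start
        const std::size_t hi = std::min(trace.size() - 1, i + half);     // clamp at the end
        double sum = 0.0;
        for (std::size_t j = lo; j <= hi; ++j) sum += trace[j];
        smoothed[i] = sum / static_cast<double>(hi - lo + 1);            // average over the window
    }
    return smoothed;
}
```

Smoothing of this kind raises the signal-to-noise ratio before the conditioned trace (or a representation derived from it, such as MFCCs) is handed to the classifier.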
Performance-portable programming frameworks provide abstractions for parallel execution to allow easily porting an application to multiple backend programming models, such as CUDA, HIP, and OpenMP. However, programs m...
This article examines popular classifiers, such as the Bagging Classifier, Nearest Neighbors Classifier, Boosting Classifier, and Support Vector Classifier, to determine which achieves the highest accuracy. The classifiers will be tested for a...
In this paper we evaluate multiple C++ shared memory programming models with respect to both ease of expression, and resulting performance. We do this by implementing the mathematical algorithm known as the 'power...
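The algorithm referenced here appears to be the power method (power iteration); under that assumption, below is a minimal sequential C++ sketch of it. The matrix representation, tolerance, and iteration cap are choices made for this illustration, not taken from the paper.

```cpp
// Sequential power method sketch: estimates the dominant eigenvalue magnitude and
// eigenvector of a dense matrix. Tolerance and iteration cap are assumptions.
#include <vector>
#include <cmath>
#include <cstddef>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major dense matrix

double powerMethod(const Mat& A, Vec& x, std::size_t maxIters = 1000, double tol = 1e-10) {
    // Normalize the (nonzero) start vector so |A x| approximates the dominant eigenvalue.
    double xnorm = 0.0;
    for (double v : x) xnorm += v * v;
    xnorm = std::sqrt(xnorm);
    for (double& v : x) v /= xnorm;

    double lambda = 0.0;
    for (std::size_t it = 0; it < maxIters; ++it) {
        // y = A * x
        Vec y(x.size(), 0.0);
        for (std::size_t i = 0; i < A.size(); ++i)
            for (std::size_t j = 0; j < x.size(); ++j)
                y[i] += A[i][j] * x[j];

        // Estimate the eigenvalue magnitude and renormalize.
        double norm = 0.0;
        for (double v : y) norm += v * v;
        norm = std::sqrt(norm);

        const double newLambda = norm;   // |A x| with |x| = 1
        for (std::size_t i = 0; i < y.size(); ++i) x[i] = y[i] / norm;

        if (std::fabs(newLambda - lambda) < tol) return newLambda;
        lambda = newLambda;
    }
    return lambda;
}
```

The matrix-vector product in the inner loops is the naturally parallel part that each shared-memory programming model would distribute across threads in its own way.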
Tuning parallel applications on multi-core architectures is an arduous task. Several studies have utilized auto-tuning for OpenMP applications via standardized user-facing features, namely number of threads, thread pl...