DaCe is a framework for Python that claims to provide massive speedups, reaching C-like speeds, compared to existing high-performance Python frameworks (e.g. Numba or Pythran). In this work, we take a closer look at reproducing the NPBench work. We use performance results to confirm that DaCe achieves higher performance than NumPy across a variety of NPBench benchmarks, and we provide reasons why DaCe is not truly as portable as it claims to be, although with a small adjustment it can run anywhere.
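For context, the sketch below shows the typical way a NumPy kernel is handed to DaCe via its @dace.program decorator; the kernel itself is illustrative and is not taken from NPBench.

```python
# Minimal sketch of accelerating NumPy-style code with DaCe.
# The kernel and array sizes are illustrative, not from NPBench.
import numpy as np
import dace

N = dace.symbol("N")  # symbolic size: one compiled program covers all square shapes

@dace.program
def add_kernel(A: dace.float64[N, N], B: dace.float64[N, N]):
    # Plain NumPy-style code; DaCe lowers it to an optimized dataflow graph
    # and generates C++ (or CUDA) code behind the scenes.
    return 2.0 * A + B

if __name__ == "__main__":
    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)
    C = add_kernel(A, B)  # first call JIT-compiles; later calls reuse the binary
    assert np.allclose(C, 2.0 * A + B)
```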
Large language models have attracted a lot of attention in the research community recently, especially with the introduction of practical tools such as ChatGPT and GitHub Copilot. Their ability to solve complex programming tasks has also been shown in several studies and commercial solutions, increasing the interest in using them for software development in different fields. High-performance computing is one such field, where parallel programming techniques have long been used to exploit the raw computing power of contemporary multicore and manycore processors. In this paper, we evaluate the ChatGPT and GitHub Copilot tools for OpenMP-based code parallelization using a proposed methodology. We used nine benchmark applications that represent typical parallel programming workloads and compared their OpenMP-based parallel solutions, produced manually and with ChatGPT and GitHub Copilot, in terms of obtained speedup, applied optimizations, and quality of the solution. ChatGPT 3.5 and GitHub Copilot installed with Visual Studio Code 1.88 were used. We conclude that both tools can produce correct parallel code in most cases. However, performance-wise, ChatGPT can match manually produced and optimized parallel code only in simpler cases, as it lacks a deeper understanding of the code and its context. The results are much better with GitHub Copilot, where much less effort is needed to obtain a correct and performant parallel solution.
Real-time data processing is a central aspect of particle physics experiments, with high requirements on computing resources. The LHCb (Large Hadron Collider beauty) experiment must cope with the 30 million proton-proton bunch collisions per second delivered by the Large Hadron Collider (LHC), producing 10⁹ particles/s. The large input data rate of 32 Tb/s needs to be processed in real time by the LHCb trigger system, which includes both reconstruction and selection algorithms to reduce the number of saved events. The trigger system is implemented in two stages and deployed in a custom data centre. We present Looking Forward, a high-throughput track-following algorithm designed for the first stage of the LHCb trigger and optimised for GPUs. The algorithm focuses on the reconstruction of particles traversing the whole LHCb detector and is developed to obtain the best physics performance while respecting the throughput limitations of the trigger. The physics and computing performance are discussed and validated with simulated samples.
We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32× to 76× speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340-terabyte matrix and an 11-exabyte sparse matrix of density 10⁻⁶.
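To make the batching/tiling idea concrete, here is a minimal CPU-only NumPy sketch of NMF that processes the input matrix in column batches. It is not the authors' NMFk code; the update rule (standard Lee-Seung multiplicative updates), the batch size, and the function name are assumptions for illustration only.

```python
# Batched multiplicative-update NMF sketch (CPU, NumPy). Only one column tile of X
# needs to be resident at a time, which is the essence of the out-of-memory strategy.
import numpy as np

def nmf_batched(X, k, n_iter=100, batch=1024, eps=1e-9):
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # H-update: columns are independent, so process X in column batches.
        WtW = W.T @ W
        for j in range(0, n, batch):
            Xb = X[:, j:j + batch]
            Hb = H[:, j:j + batch]
            H[:, j:j + batch] = Hb * (W.T @ Xb) / (WtW @ Hb + eps)
        # W-update: accumulate X H^T and H H^T over the same batches.
        XHt = np.zeros((m, k))
        HHt = np.zeros((k, k))
        for j in range(0, n, batch):
            Xb = X[:, j:j + batch]
            Hb = H[:, j:j + batch]
            XHt += Xb @ Hb.T
            HHt += Hb @ Hb.T
        W *= XHt / (W @ HHt + eps)
    return W, H
```

In the out-of-memory GPU setting described above, each column tile of X would instead be copied host-to-device on a CUDA stream so that the transfer overlaps with the compute of the previous tile.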
It is important to save space when storing generated data, and compression algorithms are used to achieve this. Stored data is compressed once but accessed many times to search it. For this reason, the biggest disadvantage of compressed data is that it needs to be decompressed before it can be used. This disadvantage can be eliminated by using a fast decompression algorithm or a compressed-search method that does not require decompression. Compressed search can be faster than conventional decompress-and-search methods, thanks to its smaller search space and the absence of decompression. In this article, CComp, a parallel semi-static word-based compression algorithm that supports compressed search, is presented. The purpose of CComp is to obtain faster search results while compressing and decompressing at the speed of other parallel compression algorithms. CComp performs these operations in parallel and has been compared to other parallel methods. As the results show, the compression ratios of CComp are on par with those of other word-based algorithms. In the compressed-search process, results were obtained approximately 7 times faster than with the Zstd algorithm, which previously gave the best results. With these results, CComp stands as a strong alternative among algorithms that support compressed search.
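The toy sketch below illustrates the general principle behind word-based semi-static compression with search on the compressed stream: a query word is mapped to its codeword once and matched against a much shorter coded sequence. It is not CComp itself, whose codeword format and parallel scheme are not described in the abstract.

```python
# Toy word-based semi-static compression with search directly on codewords.
# Illustrative only; CComp's actual encoding and parallelism are not reproduced here.
from collections import Counter

def build_model(text):
    words = text.split()
    # Semi-static: one pass over the data fixes the word -> code mapping.
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: i for i, w in enumerate(ranked)}

def compress(text, code):
    return [code[w] for w in text.split()]  # sequence of integer codewords

def search_compressed(stream, code, query):
    q = code.get(query)
    if q is None:
        return []  # the word never occurs in the data
    return [i for i, c in enumerate(stream) if c == q]

text = "to be or not to be"
code = build_model(text)
stream = compress(text, code)
print(search_compressed(stream, code, "be"))  # word positions: [1, 5]
```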
Particle trajectory and collision simulation is a critical step in the design and construction of novel particle accelerator components. However, it requires a huge computational effort which can slow down the design process. We started from a sequential simulation program which is used to study an event called Multipacting. Our work explains the physical problem that is simulated and the implications it can have on the behavior of the components. Then we analyze the original program's operation to find the best options for parallelization. We first developed a parallel version of the Multipacting simulation and were able to accelerate the execution up to ~35× with 48 or 56 cores. In the best cases, parallelization efficiency was maintained up to 16 cores (~95%) and the speed-up plateaus at around 40-48 cores. When this first parallelization effort was applied to multi-power simulations, we found that parallelism was severely limited, with a maximum speed-up of 20×. For this reason, we introduced a new method to improve the parallelization efficiency for this second use case. This method uses a shared processor pool for all simulations of electrons (OnePool). OnePool improved scalability by pushing ...
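The following sketch illustrates the shared-pool idea in plain Python multiprocessing: rather than giving each power-level simulation its own workers, all electron tasks are funnelled into one pool so that cores do not sit idle waiting for a single under-loaded simulation. The function and parameter names are hypothetical and stand in for the actual physics code.

```python
# Hedged sketch of a "OnePool"-style shared worker pool across multi-power simulations.
from multiprocessing import Pool

def trace_electron(task):
    power, seed = task
    # ... placeholder for the trajectory/impact computation of one electron ...
    return power, seed % 7

def run_multipower(power_levels, electrons_per_level, workers):
    # Flatten every simulation's electrons into one task list.
    tasks = [(p, e) for p in power_levels for e in range(electrons_per_level)]
    with Pool(processes=workers) as pool:
        # A single shared pool load-balances across all simulations at once.
        return pool.map(trace_electron, tasks, chunksize=64)

if __name__ == "__main__":
    out = run_multipower(power_levels=range(10), electrons_per_level=1000, workers=48)
    print(len(out), "electron traces computed")
```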
ISBN (print): 9781450326056
There are many different ways to write parallel programs. We illustrate a variety of relevant language paradigms by presenting implementations of the Game of Life, a simple simulation motivated by living organisms. Featured paradigms include shared memory, GPU acceleration, message passing, and Partitioned Global Address Space (PGAS).
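As a common reference point for the paradigms listed above, the following NumPy sketch computes one Game of Life generation on a shared array with periodic boundaries; the paper's own shared-memory, GPU, message-passing, and PGAS implementations are not reproduced here.

```python
# One Game of Life generation on a toroidal grid (NumPy, serial reference version).
import numpy as np

def step(grid):
    # Count the eight neighbours of every cell with periodic boundaries.
    nbrs = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))
    # A cell is alive next generation if it has 3 neighbours,
    # or 2 neighbours and is already alive.
    return ((nbrs == 3) | ((nbrs == 2) & (grid == 1))).astype(grid.dtype)

grid = np.random.randint(0, 2, (64, 64))
for _ in range(10):
    grid = step(grid)
print(grid.sum(), "live cells after 10 generations")
```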
Surface pressure measurement via pressure taps is an integral part of wind tunnel testing. Commonly, the pressure signal is transferred from the taps to a pressure measurement device through an appropriate system of tubing and possibly other components. Depending on its characteristics, the system distorts the dynamics of the signal and, when these dynamics are of interest, appropriate correction/calibration is necessary, taking into account its frequency response. In this context, a novel approach to tubing dynamic calibration is proposed here, using a single pressure measurement device instead of two or more, as is commonly done. The approach accounts for both the amplitude distortion and the phase shift of a selectable range of computer-generated dynamic signals, produced through a speaker. Apart from the innovative use of a single pressure sensor, e.g. in situations where a suitable multi-port pressure scanner is unavailable, the principal merits of the proposed procedure include its straightforward implementation whenever a component of the tubing system is altered (e.g. tube length, different pressure scanner) and the elimination of uncertainty stemming from differences between pressure sensors due to malfunction, inappropriate calibration, or inattentive maintenance. The procedure is successfully validated by applying it to two different types of input signal.
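Below is a minimal sketch of the kind of frequency-response correction implied above, assuming the tubing system's transfer function is estimated by dividing the spectrum of the measured signal by that of the known computer-generated reference; the variable names, the toy "tubing" model, and the regularisation floor are assumptions, not the paper's actual procedure.

```python
# Hedged sketch: estimate a complex transfer function H(f) from a known reference
# signal and its measurement through the tubing, then correct later measurements
# by dividing their spectra by H(f) (undoing amplitude distortion and phase shift).
import numpy as np

def estimate_transfer_function(reference, measured):
    return np.fft.rfft(measured) / np.fft.rfft(reference)

def correct(measured, H, floor=1e-3):
    spec = np.fft.rfft(measured)
    # Avoid amplifying noise at frequencies the tubing attenuates to ~zero.
    H_safe = np.where(np.abs(H) < floor, 1.0, H)
    return np.fft.irfft(spec / H_safe, n=len(measured))

rng = np.random.default_rng(0)
reference = rng.standard_normal(1024)          # known computer-generated signal
measured_ref = 0.6 * np.roll(reference, 5)     # toy "tubing": attenuation + delay
H = estimate_transfer_function(reference, measured_ref)
restored = correct(measured_ref, H)            # recovers the reference signal
print(np.allclose(restored, reference, atol=1e-8))
```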
The long time required to solve optimization problems has become a challenge for metaheuristic algorithms. Due to the independence of metaheuristic components, parallel processing is a good option for reducing the computational time and finding high-quality solutions close to the optimum at an acceptable cost. One such metaheuristic is the Sailfish Optimizer (SFO), which is inspired by a group of hunting sailfish. The SFO algorithm uses a simple method to provide a dynamic balance between the exploration and exploitation phases, creates swarm diversity, avoids local optima, and guarantees high convergence speed. It has been shown that the SFO algorithm outperforms various state-of-the-art metaheuristic algorithms on multimodal and high-dimensional benchmark functions and complicated real-world optimization problems in terms of accuracy and speed, in both CPU and GPU implementations. In this paper, to speed up this algorithm and increase its performance, we propose a reconfigurable hardware version of SFO implemented on a Field Programmable Gate Array (FPGA). The FPGA-based SFO can be a very good option in many applications with massive calculations. Due to the inherent parallelism and high computing capabilities of FPGAs, the SFO algorithm achieves optimal computational time despite the complexity of the optimization problems. We have compared the performance of the proposed FPGA-based SFO with its CPU and GPU implementations and with some other metaheuristic algorithms. The results show that the FPGA implementation is considerably faster than the CPU and GPU implementations. It also outperforms other FPGA-based metaheuristic algorithms in terms of execution time and convergence speed.
The Julia language is a free and actively developing scripting language released under the MIT license. Its goal is to ease the difficulty of parallel programming. Based on the language mechanisms of Julia, we constructed a use case of computi...