DaCe is a framework for Python that claims to provide massive speedups, reaching C-like speeds, compared to existing high-performance Python frameworks (e.g. Numba or Pythran). In this work, we take a closer look at reproducing the NPBench work. We use performance results to confirm that DaCe achieves higher performance than NumPy across a variety of NPBench benchmarks, and we provide reasons why DaCe is not truly as portable as it claims to be, although with a small adjustment it can run anywhere.
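For context, the sketch below shows the typical way a NumPy kernel is handed to DaCe via its @dace.program decorator; the kernel itself is illustrative and is not taken from NPBench.

```python
# Minimal sketch of accelerating NumPy-style code with DaCe.
# The kernel and array sizes are illustrative, not from NPBench.
import numpy as np
import dace

N = dace.symbol("N")  # symbolic size: one compiled program covers all square shapes

@dace.program
def add_kernel(A: dace.float64[N, N], B: dace.float64[N, N]):
    # Plain NumPy-style code; DaCe lowers it to an optimized dataflow graph
    # and generates C++ (or CUDA) code behind the scenes.
    return 2.0 * A + B

if __name__ == "__main__":
    A = np.random.rand(512, 512)
    B = np.random.rand(512, 512)
    C = add_kernel(A, B)  # first call JIT-compiles; later calls reuse the binary
    assert np.allclose(C, 2.0 * A + B)
```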
Large language models have attracted a lot of attention in the research community recently, especially with the introduction of practical tools such as ChatGPT and GitHub Copilot. Their ability to solve complex programming tasks has also been shown in several studies and commercial solutions, increasing the interest in using them for software development in different fields. High-performance computing is one such field, where parallel programming techniques have long been used to exploit the raw computing power of contemporary multicore and manycore processors. In this paper, we evaluate the ChatGPT and GitHub Copilot tools for OpenMP-based code parallelization using a proposed methodology. We used nine benchmark applications that represent typical parallel programming workloads and compared their OpenMP-based parallel solutions, produced manually and with ChatGPT and GitHub Copilot, in terms of obtained speedup, applied optimizations, and quality of the solution. ChatGPT 3.5 and GitHub Copilot installed with Visual Studio Code 1.88 were used. We conclude that both tools can produce correct parallel code in most cases. However, performance-wise, ChatGPT can match manually produced and optimized parallel code only in simpler cases, as it lacks a deeper understanding of the code and its context. The results are much better with GitHub Copilot, where much less effort is needed to obtain a correct and performant parallel solution.
Real-time data processing is a central aspect of particle physics experiments, with high requirements on computing resources. The LHCb (Large Hadron Collider beauty) experiment must cope with the 30 million proton-proton bunch collisions per second delivered by the Large Hadron Collider (LHC), producing 10⁹ particles/s. The large input data rate of 32 Tb/s needs to be processed in real time by the LHCb trigger system, which includes both reconstruction and selection algorithms to reduce the number of saved events. The trigger system is implemented in two stages and deployed in a custom data centre. We present Looking Forward, a high-throughput track-following algorithm designed for the first stage of the LHCb trigger and optimised for GPUs. The algorithm focuses on the reconstruction of particles traversing the whole LHCb detector and is developed to obtain the best physics performance while respecting the throughput limitations of the trigger. The physics and computing performance are discussed and validated with simulated samples.
We propose an efficient distributed out-of-memory implementation of the non-negative matrix factorization (NMF) algorithm for heterogeneous high-performance-computing systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/output latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32× to 76× speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340-terabyte matrix and an 11-exabyte sparse matrix of density 10⁻⁶.
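To make the batching/tiling idea concrete, here is a minimal CPU-only NumPy sketch of NMF that processes the input matrix in column batches. It is not the authors' NMFk code; the update rule (standard Lee-Seung multiplicative updates), the batch size, and the function name are assumptions for illustration only.

```python
# Batched multiplicative-update NMF sketch (CPU, NumPy). Only one column tile of X
# needs to be resident at a time, which is the essence of the out-of-memory strategy.
import numpy as np

def nmf_batched(X, k, n_iter=100, batch=1024, eps=1e-9):
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        # H-update: columns are independent, so process X in column batches.
        WtW = W.T @ W
        for j in range(0, n, batch):
            Xb = X[:, j:j + batch]
            Hb = H[:, j:j + batch]
            H[:, j:j + batch] = Hb * (W.T @ Xb) / (WtW @ Hb + eps)
        # W-update: accumulate X H^T and H H^T over the same batches.
        XHt = np.zeros((m, k))
        HHt = np.zeros((k, k))
        for j in range(0, n, batch):
            Xb = X[:, j:j + batch]
            Hb = H[:, j:j + batch]
            XHt += Xb @ Hb.T
            HHt += Hb @ Hb.T
        W *= XHt / (W @ HHt + eps)
    return W, H
```

In the out-of-memory GPU setting described above, each column tile of X would instead be copied host-to-device on a CUDA stream so that the transfer overlaps with the compute of the previous tile.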
It is important to save space when storing generated data, and compression algorithms are used to achieve this. Stored data is compressed once but accessed many times to search it. For this reason, the biggest disadvantage of compressed data is that it needs to be decompressed before it can be used. This disadvantage can be eliminated by using a fast decompression algorithm or a compressed-search method that does not require decompression. Compressed search can be faster than conventional decompress-and-search methods, thanks to its smaller search space and the absence of decompression. In this article, CComp, a parallel semi-static word-based compression algorithm that supports compressed search, is presented. The purpose of CComp is to obtain faster search results while compressing and decompressing at the speed of other parallel compression algorithms. CComp performs these operations in parallel and has been compared to other parallel methods. As the results show, the compression ratios of CComp are on par with those of other word-based algorithms. In the compressed-search process, results were obtained approximately 7 times faster than with the Zstd algorithm, which previously gave the best results. With these results, CComp stands as a strong alternative among algorithms that support compressed search.
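The toy sketch below illustrates the general principle behind word-based semi-static compression with search on the compressed stream: a query word is mapped to its codeword once and matched against a much shorter coded sequence. It is not CComp itself, whose codeword format and parallel scheme are not described in the abstract.

```python
# Toy word-based semi-static compression with search directly on codewords.
# Illustrative only; CComp's actual encoding and parallelism are not reproduced here.
from collections import Counter

def build_model(text):
    words = text.split()
    # Semi-static: one pass over the data fixes the word -> code mapping.
    ranked = [w for w, _ in Counter(words).most_common()]
    return {w: i for i, w in enumerate(ranked)}

def compress(text, code):
    return [code[w] for w in text.split()]  # sequence of integer codewords

def search_compressed(stream, code, query):
    q = code.get(query)
    if q is None:
        return []  # the word never occurs in the data
    return [i for i, c in enumerate(stream) if c == q]

text = "to be or not to be"
code = build_model(text)
stream = compress(text, code)
print(search_compressed(stream, code, "be"))  # word positions: [1, 5]
```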
Particle trajectory and collision simulation is a critical step in the design and construction of novel particle accelerator components. However, it requires a huge computational effort which can slow down the design process. We started from a sequential simulation program which is used to study an event called Multipacting. Our work explains the physical problem that is simulated and the implications it can have on the behavior of the components. Then we analyze the original program's operation to find the best options for parallelization. We first developed a parallel version of the Multipacting simulation and were able to accelerate the execution up to ~35× with 48 or 56 cores. In the best cases, parallelization efficiency was maintained up to 16 cores (~95%) and the speed-up plateaus at around 40-48 cores. When this first parallelization effort was applied to multi-power simulations, we found that parallelism was severely limited, with a maximum speed-up of 20×. For this reason, we introduced a new method to improve the parallelization efficiency for this second use case. This method uses a shared processor pool for all simulations of electrons (OnePool). OnePool improved scalability by pushing ...
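The following sketch illustrates the shared-pool idea in plain Python multiprocessing: rather than giving each power-level simulation its own workers, all electron tasks are funnelled into one pool so that cores do not sit idle waiting for a single under-loaded simulation. The function and parameter names are hypothetical and stand in for the actual physics code.

```python
# Hedged sketch of a "OnePool"-style shared worker pool across multi-power simulations.
from multiprocessing import Pool

def trace_electron(task):
    power, seed = task
    # ... placeholder for the trajectory/impact computation of one electron ...
    return power, seed % 7

def run_multipower(power_levels, electrons_per_level, workers):
    # Flatten every simulation's electrons into one task list.
    tasks = [(p, e) for p in power_levels for e in range(electrons_per_level)]
    with Pool(processes=workers) as pool:
        # A single shared pool load-balances across all simulations at once.
        return pool.map(trace_electron, tasks, chunksize=64)

if __name__ == "__main__":
    out = run_multipower(power_levels=range(10), electrons_per_level=1000, workers=48)
    print(len(out), "electron traces computed")
```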
ISBN (print): 9781450326056
There are many different ways to write parallel programs. We illustrate a variety of relevant language paradigms by presenting implementations of the Game of Life, a simple simulation motivated by living organisms. Featured paradigms include shared memory, GPU acceleration, message passing, and Partitioned Global Address Space (PGAS).
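As a common reference point for the paradigms listed above, the following NumPy sketch computes one Game of Life generation on a shared array with periodic boundaries; the paper's own shared-memory, GPU, message-passing, and PGAS implementations are not reproduced here.

```python
# One Game of Life generation on a toroidal grid (NumPy, serial reference version).
import numpy as np

def step(grid):
    # Count the eight neighbours of every cell with periodic boundaries.
    nbrs = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
               for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0))
    # A cell is alive next generation if it has 3 neighbours,
    # or 2 neighbours and is already alive.
    return ((nbrs == 3) | ((nbrs == 2) & (grid == 1))).astype(grid.dtype)

grid = np.random.randint(0, 2, (64, 64))
for _ in range(10):
    grid = step(grid)
print(grid.sum(), "live cells after 10 generations")
```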
Surface pressure measurement via pressure taps is an integral part of wind tunnel testing. Commonly, the pressure signal is transferred from the taps to a pressure measurement device through an appropriate system of tubing and possibly other components. Depending on its characteristics, the system distorts the dynamics of the signal and, when these dynamics are of interest, appropriate correction/calibration is necessary, taking into account its frequency response. In this context, a novel approach to tubing dynamic calibration is proposed here, using a single pressure measurement device instead of two or more, as is commonly done. The approach accounts for both the amplitude distortion and the phase shift of a selectable range of computer-generated dynamic signals, produced through a speaker. Apart from the innovative use of a single pressure sensor, e.g. in situations where a suitable multi-port pressure scanner is unavailable, the principal merits of the proposed procedure include its straightforward implementation whenever a component of the tubing system is altered (e.g. tube length, different pressure scanner) and the elimination of uncertainty stemming from differences between pressure sensors due to malfunction, inappropriate calibration, or inattentive maintenance. The procedure is successfully validated by applying it to two different types of input signal.
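Below is a minimal sketch of the kind of frequency-response correction implied above, assuming the tubing system's transfer function is estimated by dividing the spectrum of the measured signal by that of the known computer-generated reference; the variable names, the toy "tubing" model, and the regularisation floor are assumptions, not the paper's actual procedure.

```python
# Hedged sketch: estimate a complex transfer function H(f) from a known reference
# signal and its measurement through the tubing, then correct later measurements
# by dividing their spectra by H(f) (undoing amplitude distortion and phase shift).
import numpy as np

def estimate_transfer_function(reference, measured):
    return np.fft.rfft(measured) / np.fft.rfft(reference)

def correct(measured, H, floor=1e-3):
    spec = np.fft.rfft(measured)
    # Avoid amplifying noise at frequencies the tubing attenuates to ~zero.
    H_safe = np.where(np.abs(H) < floor, 1.0, H)
    return np.fft.irfft(spec / H_safe, n=len(measured))

rng = np.random.default_rng(0)
reference = rng.standard_normal(1024)          # known computer-generated signal
measured_ref = 0.6 * np.roll(reference, 5)     # toy "tubing": attenuation + delay
H = estimate_transfer_function(reference, measured_ref)
restored = correct(measured_ref, H)            # recovers the reference signal
print(np.allclose(restored, reference, atol=1e-8))
```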
The long time required to solve optimization problems has become a challenge for metaheuristic algorithms. Due to the independence of metaheuristic components, parallel processing is a good option for reducing the computational time and finding high-quality solutions close to the optimum at an acceptable cost. One such metaheuristic is the Sailfish Optimizer (SFO), which is inspired by a group of hunting sailfish. The SFO algorithm uses a simple method to provide a dynamic balance between the exploration and exploitation phases, creates swarm diversity, avoids local optima, and guarantees high convergence speed. It has been shown that the SFO algorithm outperforms various state-of-the-art metaheuristic algorithms on multimodal and high-dimensional benchmark functions and complicated real-world optimization problems in terms of accuracy and speed, in both CPU and GPU implementations. In this paper, to speed up this algorithm and increase its performance, we propose a reconfigurable hardware version of SFO implemented on a Field Programmable Gate Array (FPGA). The FPGA-based SFO can be a very good option in many applications with massive calculations. Due to the inherent parallelism and high computing capabilities of FPGAs, the SFO algorithm achieves optimal computational time despite the complexity of the optimization problems. We have compared the performance of the proposed FPGA-based SFO with its CPU and GPU implementations and with some other metaheuristic algorithms. The results show that the FPGA implementation is considerably faster than the CPU and GPU implementations. It also outperforms other FPGA-based metaheuristic algorithms in terms of execution time and convergence speed.
The Julia language is a free and actively developing scripting language released under the MIT license. Its goal is to ease the difficulty of parallel programming. Based on the language mechanisms of Julia, we constructed a use case of computi...