Scientific parallel programming has become mainstream in recent years with the introduction of high-performance graphics processing units (GPUs) that are specifically designed for numerical processing. In addition, freely available programming tools have made it possible for anyone who wants to leverage the processing power of GPUs to do so relatively easily. This article provides an introduction to parallel programming using GPUs, with numerical examples demonstrating the speedup that can be obtained in a microwave engineering problem. All programming tools used in the article can be obtained free of charge from online resources. This accessibility is a tremendous benefit to engineers, students, and enthusiasts.
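As a rough illustration of the kind of element-wise numerical kernel such an article accelerates, the sketch below offloads a simple field update to a GPU with OpenMP target directives in C++. The array names and the update coefficient are hypothetical, and OpenMP offload is only a stand-in for whichever free GPU toolchain the article actually uses; with a compiler that lacks offloading support, the loop simply runs on the CPU.

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> e(n, 1.0f), h(n, 2.0f);   // hypothetical field arrays
    const float c = 0.5f;                        // hypothetical update coefficient

    float* pe = e.data();
    float* ph = h.data();

    // Offload the element-wise update to the GPU (falls back to the host
    // if the compiler was not built with offloading support).
    #pragma omp target teams distribute parallel for map(tofrom: pe[0:n]) map(to: ph[0:n])
    for (int i = 0; i < n; ++i)
        pe[i] += c * ph[i];

    std::printf("e[0] = %f\n", pe[0]);
    return 0;
}
```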
An overview of developments in PARCS (Parallel Asynchronous Recursive Control Space) technology is provided. The concept of the control space is considered: a modeling apparatus with which the logical structure of the investigated problem (system) is described and in which dynamic changes to it are reflected. A PARCS model is proposed whose application allows flexible and unified adaptation to emerging programming technologies. PARCS extensions of the following programming languages and platforms are considered: PASCAL, C, FORTRAN, MODULA2, Java, CUDA, OpenCL, PYTHON, .NET, and GO/PYTHON.
The implementation of parallel applications is always a challenge and involves many distinct design decisions. The paper presents issues of parallel processing in .NET applications that work with popular Database Management Systems (DBMSes). Four design dilemmas are addressed: how efficient the auto-parallelism implemented in the .NET TPL library is, how popular DBMSes differ in serving parallel requests, what the optimal size of data chunks in the data-parallelism scheme is, and how TPL auto-parallelism behaves in public clouds. They are analyzed in the context of a typical, practical business case originating from IT solutions dedicated to energy-market participants. The paper presents the results of experiments conducted in controlled on-premises and cloud environments. The experiments allowed us to compare the performance of TPL auto-parallelism with a wide range of manually set numbers of worker threads. They also helped to evaluate four DBMSes, Oracle, MySQL, PostgreSQL, and MSSQL, in the scenario of serving parallel queries. Finally, they showed the impact of data chunk sizes on overall performance.
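The paper's experiments are built on the C#/.NET TPL, which is not reproduced here; as a rough analog in C++, the sketch below shows the data-parallelism scheme the paper studies: the input is split into chunks of a configurable size (one of the four dilemmas) and each chunk is handed to its own task. The record type, the per-chunk work, and the chunk size are all hypothetical stand-ins for the paper's database/business logic.

```cpp
#include <future>
#include <numeric>
#include <vector>
#include <algorithm>
#include <cstdio>

// Process one chunk of records; the "work" here is a stand-in for the
// per-record processing the paper performs against a DBMS.
static double process_chunk(const std::vector<double>& data, std::size_t begin, std::size_t end) {
    return std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
}

int main() {
    std::vector<double> records(1'000'000, 1.0);   // hypothetical input data
    const std::size_t chunk = 100'000;             // chunk size: the tuning knob studied in the paper

    std::vector<std::future<double>> tasks;
    for (std::size_t i = 0; i < records.size(); i += chunk) {
        const std::size_t end = std::min(i + chunk, records.size());
        tasks.push_back(std::async(std::launch::async, process_chunk,
                                   std::cref(records), i, end));
    }

    double total = 0.0;
    for (auto& t : tasks) total += t.get();
    std::printf("total = %f\n", total);
    return 0;
}
```

Smaller chunks expose more parallelism but add scheduling overhead, which is exactly the trade-off the paper measures against auto-parallelism.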
Over the past three decades, the numerical manifold method (NMM) has attracted many researchers from the geotechnical community because it unifies the solutions of continuous and discontinuous problems in the same framework. However, due to the lack of ready-made preprocessing tools, the development of the three-dimensional NMM (3DNMM) is still limited. A practical strategy to generate the discretized models for a 3DNMM analysis is proposed. In the proposed strategy, regular hexahedral meshes are uniformly deployed to construct the mathematical cover system. The physical meshes, including the joints, material interfaces, and problem-domain boundaries, are adopted to cut the mathematical cover system into the physical cover system and manifold elements (MEs). To improve the efficiency of the proposed strategy, the Intel Threading Building Blocks (TBB) parallel library for CPU parallelization is employed. Typical examples are adopted to validate the proposed strategy. The results show that the proposed strategy can effectively generate the discretized 3D models of some geotechnical problems for 3DNMM analysis. The proposed strategy deserves further investigation.
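The abstract names Intel TBB as the CPU parallelization layer for the model-generation step. The sketch below shows that pattern only in outline, assuming a hypothetical cut_cell() routine that intersects one hexahedral mathematical cell with the physical meshes; the actual 3DNMM data structures are not reproduced here.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

struct ManifoldElement { int cell_id = -1; /* hypothetical cutting result */ };

// Hypothetical stand-in for intersecting one hexahedral mathematical cell
// with the joints, material interfaces, and problem-domain boundaries.
static ManifoldElement cut_cell(std::size_t cell_id) {
    return ManifoldElement{static_cast<int>(cell_id)};
}

int main() {
    const std::size_t num_cells = 100000;          // illustrative mesh size
    std::vector<ManifoldElement> elements(num_cells);

    // Each cell is cut independently of the others, so the loop parallelizes cleanly.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, num_cells),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              elements[i] = cut_cell(i);
                      });

    std::printf("generated %zu manifold elements\n", elements.size());
    return 0;
}
```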
ISBN (print): 9798350364613; 9798350364606
A performance-portable application can run on a variety of different hardware platforms, achieving an acceptable level of performance without requiring significant rewriting for each platform. Several performance-portable programming models are now suitable for high-performance scientific application development, including OpenMP and Kokkos. Chapel is a parallel programming language that supports the productive development of high-performance scientific applications and has recently added support for GPU architectures through native code generation. Using three mini-apps (BabelStream, miniBUDE, and TeaLeaf), we evaluate the Chapel language's performance portability across various CPU and GPU platforms. In our evaluation, we replicate and build on previous studies of performance portability using mini-apps, comparing Chapel against OpenMP, Kokkos, and the vendor programming models CUDA and HIP. We find that Chapel achieves comparable performance portability to OpenMP and Kokkos and identify several implementation issues that limit Chapel's performance portability on certain platforms.
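For a sense of what a mini-app such as BabelStream actually times, the sketch below shows a triad-style streaming kernel in C++ with OpenMP, one of the baselines Chapel is compared against in the study. The array size and scalar are illustrative, and this is not code from the paper or from BabelStream itself.

```cpp
#include <vector>
#include <cstdio>

int main() {
    const long long n = 1 << 22;              // illustrative array size
    const double scalar = 0.4;
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);

    // BabelStream-style "triad" kernel: bandwidth-bound and trivially parallel,
    // which is exactly the behavior such mini-apps measure across programming models.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];

    std::printf("a[0] = %f\n", a[0]);
    return 0;
}
```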
ISBN (print): 9798400717932
Remote Memory Access (RMA) programming models enable processes running on a distributed-memory computer to access and manipulate the memory of other processes directly. Such one-sided communication has the benefit that the receiving process is not actively involved in the communication compared to the classical two-sided message-passing model. The three programming models MPI RMA, OpenSHMEM, and GASPI provide such a communication scheme. However, RMA models require the developer to synchronize the accesses with corresponding API calls correctly. Concurrent modifications of the same (remote) memory location due to wrong or missing synchronization lead to data races. Such data races are undefined behavior and may result in non-deterministic failures of the program execution. This paper presents RMASanitizer, an on-the-fly race detector for MPI RMA, OpenSHMEM, and GASPI applications. It relies on a generalized race detection model independent of the concrete RMA programming model. RMASanitizer combines a dynamic on-the-fly analysis with a static analysis at compile-time that detects and instruments only relevant memory accesses. It is implemented as part of the MPI correctness checking framework MUST which we extended with support for OpenSHMEM and GASPI. We show that RMASanitizer can detect races in MPI RMA, OpenSHMEM, and GASPI applications with an accuracy of over 95 percent by running it on the data race benchmark suite RMARaceBench. On proxy applications, the slowdown for the execution with up to 700 processes ranges from 1.1x to 30x, depending on the application, showing that our tool is applicable in practice.
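To make the synchronization problem concrete, here is a minimal MPI RMA example (not taken from the paper): a fence epoch brackets a one-sided MPI_Put, and dropping either fence while still accessing the window memory would produce exactly the kind of data race a tool like RMASanitizer is built to detect.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = 0;                       // window memory exposed to remote puts
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               // open the access/exposure epoch
    if (rank == 0 && size > 1) {
        int value = 42;
        // One-sided write into rank 1's window; rank 1 issues no matching receive.
        MPI_Put(&value, 1, MPI_INT, /*target_rank=*/1, /*target_disp=*/0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);               // close the epoch; without this fence,
                                         // reading `local` below would race with the put
    if (rank == 1)
        std::printf("rank 1 received %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```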
ISBN (print): 9783031800832; 9783031800849
The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers sustain the relentless progress of these architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two representative computational plasma physics applications relevant to magnetically confined nuclear-fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools to achieve performance portability and decent "out-of-the-box" performance, even on the very latest hardware platforms. For our applications, Kokkos provided performance portability across the broadest range of hardware architectures from different vendors.
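Both applications in the study rely on Kokkos (and OpenMP) for portability. The sketch below is not taken from either code; it only illustrates the basic Kokkos pattern such codes depend on: a single parallel_for whose body is compiled for whichever backend (CPU threads, CUDA, HIP) Kokkos was configured with at build time. The particle-push kernel, names, and sizes are hypothetical.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;                       // illustrative number of particles
        Kokkos::View<double*> x("x", n), v("v", n);  // device-resident state
        const double dt = 1.0e-3;                    // illustrative time step

        // One explicit push step; the same source runs on CPU threads or a GPU,
        // depending on the Kokkos backend selected at build time.
        Kokkos::parallel_for("push", n, KOKKOS_LAMBDA(const int i) {
            x(i) += dt * v(i);
        });
        Kokkos::fence();
        std::printf("advanced %d particles\n", n);
    }
    Kokkos::finalize();
    return 0;
}
```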
ISBN (print): 9783031631092; 9783031631108
Hate speech is a threat to democratic values because it incites discrimination, which international law prohibits. To limit the harmful effects of this scourge, scientists often integrate into social network platforms models built with deep learning algorithms that detect and react automatically to messages of a hateful nature. One particularity of these algorithms is that they become more effective as the amount of data used grows. However, sequential execution of these algorithms on large amounts of data can take a very long time. In this paper we first compared three variants of Recurrent Neural Networks (RNNs) for detecting hate messages. We showed that the Long Short-Term Memory (LSTM) network provides better metric performance but requires a longer execution time than the Gated Recurrent Unit (GRU) and the standard RNN. To obtain both good metric performance and reduced execution time, we implemented the training algorithms in parallel. We proposed a parallel implementation based on an implicit aggregation strategy, in contrast to the existing approach, which is based on an explicit aggregation function. The experimental results on an 8-core machine at 2.20 GHz show that better results are obtained with the parallelization strategy we propose. For the parallel implementation of an LSTM using a dataset obtained from Kaggle, we obtained an f-measure of 0.70 and a speedup of 2.2 with our approach, compared to an f-measure of 0.65 and a speedup of 2.19 with an explicit aggregation strategy between workers.
ISBN (print): 9798350326598; 9798350326581
Existing tiled manycore architectures propose to convert abundant silicon resources into general-purpose parallel processors with unmatched computational density and programmability. However, as we approach 100K cores in one chip, conventional manycore architectures struggle to navigate three key axes: scalability, programmability, and density. Many manycores sacrifice programmability for density, or scalability for programmability. In this paper, we explore HammerBlade, which simultaneously achieves scalability, programmability, and density. HammerBlade is a fully open-source RISC-V manycore architecture, which has been silicon-validated with a 2048-core ASIC implementation using a 14/16nm process. We evaluate the system using a suite of parallel benchmarks that captures a broad spectrum of computation and communication patterns.
ISBN (print): 9798350364613; 9798350364606
New algorithms for embedding graphs have reduced the asymptotic complexity of finding low-dimensional representations. One-Hot Graph Encoder Embedding (GEE) uses a single, linear pass over edges and produces an embedding that converges asymptotically to the spectral embedding. The scaling and performance benefits of this approach have been limited by a serial implementation in an interpreted language. We refactor GEE into a parallel program in the Ligra graph engine that maps functions over the edges of the graph and uses lock-free atomic instructions to prevent data races. On a graph with 1.86 edges, this results in a 500 times speedup over the original implementation and a 17 times speedup over a just-in-time compiled version.
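The abstract's key implementation detail is the per-edge update guarded by lock-free atomics. The sketch below is a hypothetical, non-Ligra illustration of that idea in C++: one-hot-style counts accumulated per (vertex, cluster) pair in a single pass over the edges, with a CAS-based atomic add standing in for the atomic instructions Ligra's edgeMap relies on. The tiny graph, labels, and omitted normalization are purely illustrative; this is not the authors' code.

```cpp
#include <atomic>
#include <vector>
#include <cstdio>

struct Edge { int u, v; };

// Atomically add `val` to a double using a compare-and-swap loop,
// mirroring the lock-free updates used to avoid races on shared rows.
static void atomic_add(std::atomic<double>& target, double val) {
    double old = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(old, old + val, std::memory_order_relaxed))
        ;  // `old` is refreshed on failure; retry until the add lands
}

int main() {
    const int n = 4, k = 2;                               // vertices, clusters (illustrative)
    std::vector<Edge> edges = {{0, 1}, {1, 2}, {2, 3}};   // tiny example graph
    std::vector<int> label = {0, 1, 1, 0};                // per-vertex cluster labels
    std::vector<std::atomic<double>> embed(n * k);        // n x k embedding, flattened
    for (auto& cell : embed) cell.store(0.0);

    // Single pass over edges; each edge contributes to the rows of both endpoints.
    // With a parallel loop (e.g. Ligra's edgeMap or an OpenMP for), the atomic
    // adds prevent data races when two edges touch the same vertex concurrently.
    #pragma omp parallel for
    for (long i = 0; i < (long)edges.size(); ++i) {
        const Edge& e = edges[i];
        atomic_add(embed[e.u * k + label[e.v]], 1.0);
        atomic_add(embed[e.v * k + label[e.u]], 1.0);
    }

    std::printf("embed[0][%d] = %f\n", label[1], embed[0 * k + label[1]].load());
    return 0;
}
```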