This article presents the design and optimization of GPU kernels for numerical integration, as applied in its standard form in finite-element codes. The optimization process employs autotuning, with the main emphasis on the placement of variables in shared memory or registers. OpenCL and a first-order finite-element method (FEM) approximation are selected for the code design, but the techniques are also applicable to the CUDA programming model and to other types of finite-element discretizations (including discontinuous Galerkin and isogeometric). The autotuning optimization is performed for four example graphics processors, and the results are discussed.
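The abstract does not include code, but the autotuning loop it describes can be sketched at the host level. The sketch below is a generic variant selector in C++, not the authors' OpenCL tuner; `time_variant` and `autotune` are illustrative names, and in a real FEM autotuner each variant would be the same integration kernel compiled with a different shared-memory/register placement of its variables.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Wall-clock time of one variant over `reps` runs.
static double time_variant(const std::function<void()>& run, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) run();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Benchmark every named variant and return the index of the fastest one.
int autotune(const std::vector<std::pair<std::string, std::function<void()>>>& variants,
             int reps = 20) {
    int best = 0;
    double best_t = -1.0;
    for (size_t i = 0; i < variants.size(); ++i) {
        double t = time_variant(variants[i].second, reps);
        if (best_t < 0.0 || t < best_t) { best_t = t; best = static_cast<int>(i); }
    }
    return best;
}
```

On a GPU, the candidate set would additionally be pruned by hardware limits (shared-memory size, register file per thread), which is why the tuning result differs between the four tested processors.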
Increasing attention has been given to providing service-level objectives (SLOs) in stream-processing applications, due to performance and energy requirements and the need to limit resource usage while improving system utilization. Since current and next-generation computing systems intrinsically offer parallel architectures, software must naturally exploit this parallelism. Implementing and meeting SLOs in existing applications is not a trivial task for application programmers, since the software development process, besides parallelism exploitation, requires implementing autonomic algorithms or strategies. This is a system-oriented programming approach and requires managing multiple knobs and sensors (e.g., the number of threads to use, the clock frequency of the cores) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLOs in the application's source code, abstracting from the programmer all details of the self-adaptive system implementation. The application programmer specifies which parts of the code to parallelize and the related SLOs to be enforced. To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies that enforce, at runtime, the user-expressed objectives. The experiments show promising results, with simpler, effective, and efficient SLO implementations for real-world applications.
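A minimal sketch of the kind of self-adaptive strategy such a compiler might generate, assuming a throughput SLO and a toy service whose throughput scales linearly with thread count; `adapt_threads` and `settle` are hypothetical names, not the paper's API.

```cpp
// One control step: add a thread while the measured throughput misses the
// SLO, remove one when we overshoot it by more than 20% (hysteresis keeps
// the controller from oscillating around the target).
int adapt_threads(int current, double measured, double slo, int max_threads) {
    if (measured < slo && current < max_threads) return current + 1;
    if (measured > 1.2 * slo && current > 1) return current - 1;
    return current;
}

// Run the feedback loop against a toy service whose throughput is
// n_threads * per_thread_rate, and report where the knob settles.
int settle(double per_thread_rate, double slo, int max_threads, int steps) {
    int n = 1;
    for (int i = 0; i < steps; ++i)
        n = adapt_threads(n, n * per_thread_rate, slo, max_threads);
    return n;
}
```

The generated strategies in the paper manage more knobs than thread count (e.g., core clock frequency), but they follow the same sense-decide-act loop.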
The categorisation of network packets according to multiple parameters, such as sender and receiver addresses, is called packet classification. Packet classification lies at the core of Software-Defined Networking (SDN)-based network applications. Due to the increasing speed of network traffic, there is an urgent need for packet classification at higher speeds. Although it is possible to accelerate packet classification algorithms through hardware implementation, this solution imposes high costs and offers limited development capacity. On the other hand, current software methods for this problem are relatively slow. A practical solution is to parallelise packet classification on multi-core processors. In this study, the Thread, Parallel Patterns Library (PPL), Open Multi-Processing (OpenMP), and Threading Building Blocks (TBB) libraries are examined and used to parallelise three packet classification algorithms: tuple space search, tuple pruning search, and hierarchical tree. According to the results, the type of algorithm and the rulesets may influence the performance of the parallelisation libraries. In general, the TBB-based method shows the best performance, owing to its work-stealing mechanism, and can accelerate the classification process by up to 8.3 times on a system with a quad-core processor.
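To make the tuple space search concrete, here is a simplified C++ sketch: rules are grouped by their (source, destination) prefix-length tuple into one exact-match table each, and a packet batch is split across plain `std::thread` workers. This is an illustration, not the paper's code; it uses standard-library threads rather than TBB/PPL/OpenMP, and the rule format is reduced to two address fields.

```cpp
#include <cstdint>
#include <map>
#include <thread>
#include <utility>
#include <vector>

struct Rule { uint32_t src, dst; int srcLen, dstLen; int id; };

static uint32_t maskBits(uint32_t v, int len) {
    return len == 0 ? 0 : v & (~0u << (32 - len));
}

// Tuple space: one exact-match table per (srcLen, dstLen) combination.
struct Classifier {
    std::map<std::pair<int,int>, std::map<std::pair<uint32_t,uint32_t>, int>> tuples;
    void add(const Rule& r) {
        tuples[{r.srcLen, r.dstLen}]
              [{maskBits(r.src, r.srcLen), maskBits(r.dst, r.dstLen)}] = r.id;
    }
    // Probe every tuple with the packet's masked header; lowest rule id
    // (highest priority) wins.
    int classify(uint32_t src, uint32_t dst) const {
        int best = -1;
        for (const auto& tp : tuples) {
            const auto& lens = tp.first;
            auto it = tp.second.find({maskBits(src, lens.first),
                                      maskBits(dst, lens.second)});
            if (it != tp.second.end() && (best < 0 || it->second < best))
                best = it->second;
        }
        return best;
    }
};

// Classify a batch of packets, striding the batch across worker threads.
std::vector<int> classify_parallel(const Classifier& c,
                                   const std::vector<std::pair<uint32_t,uint32_t>>& pkts,
                                   int nthreads) {
    std::vector<int> out(pkts.size());
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (size_t i = t; i < pkts.size(); i += nthreads)
                out[i] = c.classify(pkts[i].first, pkts[i].second);
        });
    for (auto& w : workers) w.join();
    return out;
}
```

A work-stealing runtime such as TBB replaces the fixed stride with dynamic load balancing, which is what gives it the edge on skewed rulesets.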
Building on a significant amount of current research that examines the idea of platform-portable parallel code across different types of processor families, this work focuses on two sets of related questions. First, u...
In this article, we address an efficient solver for the Maxwell eigenvalue problem in lossy cavity resonators. The curl-curl equation for the electric field is discretized using curved tetrahedral incomplete quadratic finite elements, resulting in a nonlinear eigenvalue formulation. The eigenvalue problem is solved efficiently using a contour integral method (CIM). This method enables accurate computation of all eigenvalues within a predefined region and is implemented in a highly parallelized framework to enhance the performance of the algorithm. Numerical results are presented to demonstrate the accuracy and efficiency of the proposed method.
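The abstract does not reproduce the CIM formulation, but one standard variant (Beyn's contour integral algorithm) for a nonlinear eigenvalue problem T(λ)v = 0 can be sketched as follows; the contour Γ, the probing matrix, and the quadrature rule are assumptions of this sketch, not details taken from the paper.

```latex
% Moments of the resolvent over a closed contour \Gamma enclosing the
% sought eigenvalues, probed with a (random) rectangular matrix \hat{V}:
A_0 = \frac{1}{2\pi i} \oint_{\Gamma} T(z)^{-1} \hat{V}\, dz, \qquad
A_1 = \frac{1}{2\pi i} \oint_{\Gamma} z\, T(z)^{-1} \hat{V}\, dz .

% A rank-revealing SVD of A_0 reduces the problem to a small linear one:
A_0 = V \Sigma W^{H}, \qquad B = V^{H} A_1 W \Sigma^{-1},

% whose eigenvalues are the eigenvalues of T(\cdot) inside \Gamma.
% The contour integrals are evaluated by quadrature, e.g. the trapezoidal
% rule on a circle z_k = c + r\, e^{2\pi i k / N}.
```

The N linear solves T(z_k)^{-1}V̂ at the quadrature nodes are mutually independent, which is what makes the method well suited to the highly parallelized framework the abstract mentions.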
Monte Carlo (MC) is known to be the most accurate dose calculation method. However, MC suffers from high computational cost, as a large number of particles must be simulated to achieve the desired statistical uncertainty. Parallelizing the simulation across many GPU threads reduces the time required to reach the desired uncertainty in an MC simulation. In this article, we present DOSXYZgpu, a GPU implementation of the EGSnrc code written in CUDA Fortran. This work builds on EGSnrc/DOSXYZnrc, a well-validated code that is popular among medical physicists. To transport particles between two consecutive interactions, we developed an algorithm that handles several thousand histories per warp. The DOSXYZgpu implementation is evaluated against the original sequential EGSnrc/DOSXYZnrc. A maximum speedup of 205 times is achieved while the statistical uncertainty of the simulation is preserved. A t-test indicates that, for more than 95% of the voxels, there is no significant difference between the results obtained from the GPU and the CPU.
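The warp-level history batching of DOSXYZgpu is CUDA-specific, but the core idea — each parallel worker simulating an independent slice of the histories with its own RNG stream — can be sketched on the CPU. The slab-transmission model below is an illustrative stand-in for real photon transport, not EGSnrc physics; its analytic answer, exp(−μ·depth), makes the estimate easy to check.

```cpp
#include <cmath>
#include <random>
#include <thread>
#include <vector>

// Each worker simulates its own batch of photon histories with an
// independent, deterministically seeded RNG, mirroring how GPU threads
// each own a slice of the total history count.
double transmission_fraction(double mu, double depth,
                             long histories_per_thread, int nthreads) {
    std::vector<long> transmitted(nthreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            std::mt19937 rng(1234u + t);  // per-worker seed -> independent streams
            std::uniform_real_distribution<double> u(0.0, 1.0);
            long hits = 0;
            for (long i = 0; i < histories_per_thread; ++i) {
                double s = -std::log(1.0 - u(rng)) / mu;  // sampled free path
                if (s > depth) ++hits;                    // photon escapes the slab
            }
            transmitted[t] = hits;
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (long h : transmitted) total += h;
    return double(total) / (double(histories_per_thread) * nthreads);
}
```

Because the per-worker tallies are summed only at the end, adding workers multiplies the history count (and divides the time to a target uncertainty) without changing the estimator.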
In the modern filmmaking industry, image matting is one of the common tasks in video special effects and a necessary intermediate step in computer vision. It pulls the foreground object out of the background of an image by estimating the alpha values. However, matting high-resolution images can be significantly slow, because the computation is complex and proportional to the size of the unknown region. To improve performance, we parallelized an existing sequential alpha-matting code with OpenMP for execution on multicore servers. We present and discuss the algorithm and the experimental results from the perspective of the parallel application developer. The development takes little effort, and the results show a significant performance improvement for the entire program.
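As an illustration of why matting parallelizes well, the compositing equation I = αF + (1−α)B gives a per-pixel alpha estimate that is independent across pixels. The sketch below is a generic OpenMP loop under a simplified single-channel model with known F and B, not the paper's matting algorithm (real matting must also estimate F and B in the unknown region).

```cpp
#include <algorithm>
#include <vector>

// Per-pixel alpha from the grayscale compositing equation
// I = alpha*F + (1-alpha)*B  =>  alpha = (I-B)/(F-B), clamped to [0,1].
std::vector<double> estimate_alpha(const std::vector<double>& I,
                                   const std::vector<double>& F,
                                   const std::vector<double>& B) {
    std::vector<double> alpha(I.size(), 0.0);
    // Pixels are independent, so the loop parallelizes trivially;
    // without -fopenmp the pragma is ignored and the code runs serially.
    #pragma omp parallel for
    for (long i = 0; i < (long)I.size(); ++i) {
        double d = F[i] - B[i];
        double a = (d != 0.0) ? (I[i] - B[i]) / d : 0.0;
        alpha[i] = std::max(0.0, std::min(1.0, a));
    }
    return alpha;
}
```

This embarrassingly parallel structure is why an OpenMP port of an existing sequential matting code needs little development effort.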
We propose a teaching resource that uses HardKernel boards to build an MPI server with 256 cores. Although this system has relatively low performance, the aim is to provide access to hundreds of cores for carrying out scalability analyses, while achieving a good trade-off between performance, price, and energy consumption. Here, we give details of the implementation of this system at both the hardware and software levels. We also explain how it was used to teach parallel programming in a university degree course, and discuss the teachers' and students' feedback on using this new system.
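A typical exercise on such a machine is a scalability analysis, comparing measured speedups against Amdahl's law. A small helper for the analytic side of that comparison (illustrative names, not part of the described course material):

```cpp
// Amdahl's-law speedup for a program with parallel fraction f on p cores:
// S(p) = 1 / ((1 - f) + f / p), and parallel efficiency E(p) = S(p) / p.
double amdahl_speedup(double f, int p) { return 1.0 / ((1.0 - f) + f / p); }
double efficiency(double f, int p) { return amdahl_speedup(f, p) / p; }
```

For example, with f = 0.9 even 256 cores yield a speedup below 10, which is exactly the kind of prediction students can confront with measurements on the 256-core cluster.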
ISBN:
(print) 9781665490207
The development of directive-based parallel programming models such as OpenACC has significantly reduced the cost of using accelerators such as GPUs. In this study, the sparse matrix-vector product (SpMV), often the most computationally expensive part of physics-based simulations, was accelerated by GPU porting using OpenACC. Further speed-up was achieved by introducing the element-by-element (EBE) method in SpMV, an algorithm suitable for GPU architectures because it requires a large amount of computation but little memory access. In a comparison on one compute node of the supercomputer ABCI, using GPUs resulted in a 22-fold speedup over the CPU-only case even with the typical SpMV algorithm, and an additional 3.4-fold speedup when using the EBE method. This analysis was then applied to a seismic response analysis considering soil liquefaction, where using GPUs resulted in a 42-fold speedup compared to using only CPUs.
OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and acce...