Two-dimensional digital image correlation (2D-DIC) is an experimental technique used to measure the in-plane displacement of a test specimen. Real-time measurement of full-field displacement data is challenging due to the enormous computational load of the algorithm. To improve computational speed, recent research has focused on parallelization across subsets within image pairs using a graphics processing unit (GPU). However, alternative GPU-based parallelization approaches that improve the performance of this algorithm by reordering the data processing have not been explored. To address this research gap, our method exploits parallelism within each subset as well as across subsets for every computation step in an iteration cycle. A heterogeneous (CPU-GPU) framework, combined with a pyramid-based initial-value estimation for subsets (performed in parallel), is proposed in this work. The precompute steps of the proposed framework are implemented on the CPU, whereas the main iterative steps are realized on the GPU. It is demonstrated that the overall computational speed of the proposed heterogeneous framework improves by approximately 9× compared to a sequential CPU-based implementation for a pair of gray-scale images with a resolution of 588×2,048 pixels. As an important milestone, the feasibility of measuring deformations in real time (≤1 s) is established in this study.
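The two levels of parallelism described in this abstract can be sketched as a CUDA kernel in which each thread block handles one subset (across-subset parallelism) and each thread handles one pixel of that subset (within-subset parallelism). All names, sizes, and the reduction below are illustrative assumptions, not the paper's actual code:

```cuda
// Illustrative sketch only (names are assumptions): blocks map to subsets,
// threads map to subset pixels, and shared memory holds partial correlation
// sums for a tree reduction.
#define SUBSET 16                     // subset edge length (power of two so
#define NPIX (SUBSET * SUBSET)        // the reduction below stays simple)

__global__ void correlation_sums(const float *ref, const float *def,
                                 const int *sx, const int *sy,
                                 int width, float *sums /* one per subset */)
{
    int s = blockIdx.x;               // subset index (across subsets)
    int t = threadIdx.x;              // pixel index within the subset
    __shared__ float fg[NPIX];

    int px = sx[s] + t % SUBSET;
    int py = sy[s] + t / SUBSET;
    // Product of reference and deformed intensities at this subset pixel
    // (sub-pixel interpolation of the deformed image is omitted here).
    fg[t] = ref[py * width + px] * def[py * width + px];
    __syncthreads();

    // Tree reduction over the subset's pixels in shared memory.
    for (int stride = NPIX / 2; stride > 0; stride /= 2) {
        if (t < stride) fg[t] += fg[t + stride];
        __syncthreads();
    }
    if (t == 0) sums[s] = fg[0];      // one correlation sum per subset
}
// Launch: correlation_sums<<<numSubsets, NPIX>>>(...);
```

In an actual iterative DIC solver, a kernel of this shape would be invoked once per computation step of the iteration cycle, which is the structure the heterogeneous framework parallelizes.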
This article presents the design and optimization of GPU kernels for numerical integration, as it is applied in its standard form in finite-element codes. The optimization process employs autotuning, with the main emphasis on the placement of variables in shared memory or registers. OpenCL and the first-order finite-element method (FEM) approximation are selected for the code design, but the techniques are also applicable to the CUDA programming model and to other types of finite-element discretizations (including discontinuous Galerkin and isogeometric). The autotuning optimization is performed for four example graphics processors, and the obtained results are discussed.
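The kind of autotuning decision this abstract describes (shared memory versus registers) can be illustrated by compiling the same integration loop in two variants selected by a preprocessor flag. The kernel below is a hypothetical CUDA analogue of the paper's OpenCL kernels, not the paper's actual code:

```cuda
// Hypothetical autotuning example (not the article's code): per-element
// quadrature data is kept either in shared memory or in registers, and the
// autotuner picks whichever variant runs faster on a given GPU.
#define NGAUSS 4   // quadrature points per element (first-order FEM example)

__constant__ float c_weight[NGAUSS];  // quadrature weights

__global__ void integrate_elements(const float *jac_det, float *result,
                                   int n_elements)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (e >= n_elements) return;

#ifdef TUNE_SHARED
    // Variant A: stage this element's Jacobian determinants in shared memory.
    __shared__ float jac_s[256 * NGAUSS];           // 256 = block size
    float *jac = &jac_s[threadIdx.x * NGAUSS];
    for (int q = 0; q < NGAUSS; ++q)
        jac[q] = jac_det[e * NGAUSS + q];
#else
    // Variant B: keep them in registers (a small fixed-size local array).
    float jac[NGAUSS];
    for (int q = 0; q < NGAUSS; ++q)
        jac[q] = jac_det[e * NGAUSS + q];
#endif

    float acc = 0.0f;                 // integrand f == 1 for brevity
    for (int q = 0; q < NGAUSS; ++q)
        acc += c_weight[q] * jac[q];
    result[e] = acc;
}
```

An autotuner would benchmark both builds (with and without -DTUNE_SHARED) on each target GPU and keep the faster one, which is the essence of the placement optimization the article studies.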
ISBN:
(Print) 9781479980062
Unified Memory is an emerging technology supported by CUDA 6.x. Before CUDA 6.x, the CUDA programming model relied on programmers to explicitly manage data transfers between the CPU and GPU, which increases programming complexity. CUDA 6.x introduces a new technology, called Unified Memory, that provides a programming model in which the CPU and GPU memory spaces form a single coherent memory (which can be viewed as one common address space). The system manages data access between the CPU and GPU without explicit memory-copy functions. This paper evaluates the Unified Memory technology through different applications on different GPUs to show users how to use the Unified Memory technology of CUDA 6.x efficiently. The applications include the Diffusion3D benchmark, the Parboil benchmark suite, and matrix multiplication from the CUDA SDK samples. We converted these applications to corresponding Unified Memory versions and compared them with the originals. We selected the NVIDIA Kepler K40 and the Jetson TK1, which represent the latest GPUs with the Kepler architecture and the first NVIDIA mobile platform with a Kepler GPU, respectively. This paper shows that the Unified Memory versions cause a 10% performance loss on average. Furthermore, we used the NVIDIA Visual Profiler to investigate the cause of the performance loss introduced by the Unified Memory technology.
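The difference between the two programming models this paper compares can be sketched as follows; the kernel, sizes, and variable names are illustrative examples, not taken from the evaluated benchmarks:

```cuda
// Minimal sketch of explicit memory management vs. Unified Memory.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // --- Explicit model (pre-CUDA 6.x): separate buffers plus copies ---
    float *h = (float *)malloc(bytes), *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d); free(h);

    // --- Unified Memory (CUDA 6.x): one pointer valid on CPU and GPU ---
    float *u;
    cudaMallocManaged(&u, bytes);
    for (int i = 0; i < n; ++i) u[i] = 1.0f;   // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();                    // required before CPU access
    printf("%f\n", u[0]);                       // CPU reads directly
    cudaFree(u);
    return 0;
}
```

The managed version removes both cudaMemcpy calls, which is the programming-complexity reduction the paper describes; the performance cost it measures comes from the system-managed page migration that replaces those explicit copies.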
ISBN:
(Print) 9781605584980
In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state-of-the-art high-level synthesis tool AutoPilot from AutoESL to efficiently map the parallelism exposed in CUDA kernels onto reconfigurable devices. The use of the CUDA programming model offers the advantage of a common programming interface for exploiting parallelism on two very different types of accelerators: FPGAs and GPUs. Moreover, by leveraging the advanced synthesis capabilities of AutoPilot, we enable efficient exploitation of FPGA configurability for application-specific acceleration. Our flow is based on a compilation process that transforms the SPMD CUDA thread blocks into high-concurrency AutoPilot-C code. We provide an overview of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the generated multi-core accelerators.
ISBN:
(Print) 9781479914203
Following the advances in remote sensing technology in the last decade, the horizontal and vertical scan resolutions for digital terrains have reached the order of a meter and a decimeter, respectively. At these resolutions, descriptions of real terrains require very large storage spaces. Efficient storage, transfer, retrieval, and manipulation of such large amounts of data require an efficient compression method. This paper presents a method for fast lossy and lossless compression of regular height fields, which are a commonly used representation for surfaces scanned at regular intervals along two axes. The method is suitable for SIMD parallel implementation and thus inherently suited to modern GPU architectures, which significantly outperform modern CPUs in computation speed and are already present in home computers. The method allows independent decompression of individual data points as well as progressive decompression. Even in the case of lossy decompression, the decompressed surface is inherently seamless. The method's efficiency was confirmed through a CUDA implementation of the compression and decompression algorithms and through its application in a terrain visualization system.
ISBN:
(Print) 9781538672242
Synthetic Aperture Radar (SAR) imaging technology is widely used in fields such as remote sensing observation and navigation positioning; however, SAR imaging involves large data volumes and long processing times. Based on the Compute Unified Device Architecture (CUDA) programming model, the range-Doppler (R-D) SAR imaging algorithm is designed and implemented with parallel optimization on a CPU-GPU heterogeneous platform and tested on a Tesla K20 GPU. The tests show that the efficiency of the core steps of the R-D algorithm is greatly improved.
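As a rough illustration of how one core step of the R-D algorithm (range compression) maps onto CUDA, the sketch below performs a batched FFT, a frequency-domain matched-filter multiply, and an inverse FFT using the cuFFT library. Function and variable names here are assumptions; the paper's actual implementation details are not shown:

```cuda
#include <cufft.h>

// Frequency-domain matched filter: one thread per sample. The filter has
// one coefficient per range bin, shared by every azimuth line.
__global__ void apply_matched_filter(cufftComplex *data,
                                     const cufftComplex *filt,
                                     int n_range, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total) return;
    cufftComplex a = data[i], b = filt[i % n_range];
    data[i].x = a.x * b.x - a.y * b.y;   // complex multiplication
    data[i].y = a.x * b.y + a.y * b.x;
}

// Range compression for the whole echo matrix: FFT -> multiply -> IFFT.
void range_compress(cufftComplex *d_echo, const cufftComplex *d_filt,
                    int n_range, int n_azimuth)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n_range, CUFFT_C2C, n_azimuth);   // batched 1-D FFTs
    cufftExecC2C(plan, d_echo, d_echo, CUFFT_FORWARD);

    int total = n_range * n_azimuth;
    apply_matched_filter<<<(total + 255) / 256, 256>>>(d_echo, d_filt,
                                                       n_range, total);

    cufftExecC2C(plan, d_echo, d_echo, CUFFT_INVERSE);   // unnormalized
    cufftDestroy(plan);
}
```

Azimuth compression in the R-D algorithm has the same FFT-multiply-IFFT shape along the other axis, which is why the whole pipeline parallelizes well on a CPU-GPU heterogeneous platform.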
ISBN:
(Print) 9781479953646
In large-scale chip multicores, last-level cache management and the core interconnection network play important roles in performance and power consumption. In such multicores, mesh interconnects are widely used owing to their scalability and design simplicity. Because the interconnection network occupies significant area and consumes a significant fraction of system power, a bufferless network is an appealing alternative design for reducing power consumption and hardware cost. We have designed and implemented a simulator for the distributed cache management of a large chip multicore in which the cores are connected by a bufferless interconnection network. We have also redesigned and implemented DDGSim, a GPU-compatible parallel version of the same simulator, using the CUDA programming model. We have simulated target chip multicores with up to 43,000 cores and achieved up to 25 times speedup on an NVIDIA GeForce GTX 690 GPU over serial simulation.