Two-dimensional digital image correlation (2D-DIC) is an experimental technique used to measure the in-plane displacement of a test specimen. Real-time measurement of full-field displacement data is challenging due to the enormous computational load of the algorithm. To improve computational speed, recent research has focused on parallelization across subsets within image pairs using a graphics processing unit (GPU). However, alternative GPU-based parallelization approaches that improve the performance of this algorithm by reordering the data processing have not been explored. To address this research gap, our method exploits parallelism within each subset as well as across subsets for every computation step in an iteration cycle. A heterogeneous (CPU-GPU) framework, combined with a pyramid-based initial-value estimation for subsets (performed in parallel), is proposed in this work. The precompute steps of the proposed framework are implemented on the CPU, whereas the main iterative steps are realized on the GPU. It is demonstrated that the overall computational speed of the proposed heterogeneous framework improves by approximately 9× compared to a sequential CPU-based implementation for a pair of gray-scale images with a resolution of 588×2,048 pixels. As an important milestone, the feasibility of measuring deformations in real time (≤1 s) is established in this study.
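The two levels of parallelism described in this abstract can be sketched as a CUDA kernel in which each thread block handles one subset (across-subset parallelism) and each thread handles one pixel of that subset (within-subset parallelism). All names, sizes, and the reduction below are illustrative assumptions, not the paper's actual code:

```cuda
// Illustrative sketch only (names are assumptions): blocks map to subsets,
// threads map to subset pixels, and shared memory holds partial correlation
// sums for a tree reduction.
#define SUBSET 16                     // subset edge length (power of two so
#define NPIX (SUBSET * SUBSET)        // the reduction below stays simple)

__global__ void correlation_sums(const float *ref, const float *def,
                                 const int *sx, const int *sy,
                                 int width, float *sums /* one per subset */)
{
    int s = blockIdx.x;               // subset index (across subsets)
    int t = threadIdx.x;              // pixel index within the subset
    __shared__ float fg[NPIX];

    int px = sx[s] + t % SUBSET;
    int py = sy[s] + t / SUBSET;
    // Product of reference and deformed intensities at this subset pixel
    // (sub-pixel interpolation of the deformed image is omitted here).
    fg[t] = ref[py * width + px] * def[py * width + px];
    __syncthreads();

    // Tree reduction over the subset's pixels in shared memory.
    for (int stride = NPIX / 2; stride > 0; stride /= 2) {
        if (t < stride) fg[t] += fg[t + stride];
        __syncthreads();
    }
    if (t == 0) sums[s] = fg[0];      // one correlation sum per subset
}
// Launch: correlation_sums<<<numSubsets, NPIX>>>(...);
```

In an actual iterative DIC solver, a kernel of this shape would be invoked once per computation step of the iteration cycle, which is the structure the heterogeneous framework parallelizes.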
This article presents the design and optimization of GPU kernels for numerical integration, as it is applied in its standard form in finite-element codes. The optimization process employs autotuning, with the main emphasis on the placement of variables in shared memory or registers. OpenCL and the first-order finite-element method (FEM) approximation are selected for the code design, but the techniques are also applicable to the CUDA programming model and to other types of finite-element discretizations (including discontinuous Galerkin and isogeometric). The autotuning optimization is performed for four example graphics processors, and the obtained results are discussed.
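The kind of autotuning decision this abstract describes (shared memory versus registers) can be illustrated by compiling the same integration loop in two variants selected by a preprocessor flag. The kernel below is a hypothetical CUDA analogue of the paper's OpenCL kernels, not the paper's actual code:

```cuda
// Hypothetical autotuning example (not the article's code): per-element
// quadrature data is kept either in shared memory or in registers, and the
// autotuner picks whichever variant runs faster on a given GPU.
#define NGAUSS 4   // quadrature points per element (first-order FEM example)

__constant__ float c_weight[NGAUSS];  // quadrature weights

__global__ void integrate_elements(const float *jac_det, float *result,
                                   int n_elements)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (e >= n_elements) return;

#ifdef TUNE_SHARED
    // Variant A: stage this element's Jacobian determinants in shared memory.
    __shared__ float jac_s[256 * NGAUSS];           // 256 = block size
    float *jac = &jac_s[threadIdx.x * NGAUSS];
    for (int q = 0; q < NGAUSS; ++q)
        jac[q] = jac_det[e * NGAUSS + q];
#else
    // Variant B: keep them in registers (a small fixed-size local array).
    float jac[NGAUSS];
    for (int q = 0; q < NGAUSS; ++q)
        jac[q] = jac_det[e * NGAUSS + q];
#endif

    float acc = 0.0f;                 // integrand f == 1 for brevity
    for (int q = 0; q < NGAUSS; ++q)
        acc += c_weight[q] * jac[q];
    result[e] = acc;
}
```

An autotuner would benchmark both builds (with and without -DTUNE_SHARED) on each target GPU and keep the faster one, which is the essence of the placement optimization the article studies.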
ISBN:
(Print) 9781479980062
Unified Memory is an emerging technology supported by CUDA 6.x. Before CUDA 6.x, the CUDA programming model relied on programmers to explicitly manage data transfers between the CPU and GPU, which increases programming complexity. CUDA 6.x introduces a new technology, called Unified Memory, that provides a programming model in which the CPU and GPU memory spaces form a single coherent memory (which can be viewed as one common address space). The system manages data access between the CPU and GPU without explicit memory-copy functions. This paper evaluates the Unified Memory technology through different applications on different GPUs to show users how to use the Unified Memory technology of CUDA 6.x efficiently. The applications include the Diffusion3D benchmark, the Parboil benchmark suite, and matrix multiplication from the CUDA SDK samples. We converted these applications to corresponding Unified Memory versions and compared them with the originals. We selected the NVIDIA Kepler K40 and the Jetson TK1, which represent the latest GPUs with the Kepler architecture and the first NVIDIA mobile platform with a Kepler GPU, respectively. This paper shows that the Unified Memory versions cause a 10% performance loss on average. Furthermore, we used the NVIDIA Visual Profiler to investigate the cause of the performance loss introduced by the Unified Memory technology.
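The difference between the two programming models this paper compares can be sketched as follows; the kernel, sizes, and variable names are illustrative examples, not taken from the evaluated benchmarks:

```cuda
// Minimal sketch of explicit memory management vs. Unified Memory.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void scale(float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // --- Explicit model (pre-CUDA 6.x): separate buffers plus copies ---
    float *h = (float *)malloc(bytes), *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d); free(h);

    // --- Unified Memory (CUDA 6.x): one pointer valid on CPU and GPU ---
    float *u;
    cudaMallocManaged(&u, bytes);
    for (int i = 0; i < n; ++i) u[i] = 1.0f;   // CPU writes directly
    scale<<<(n + 255) / 256, 256>>>(u, n);
    cudaDeviceSynchronize();                    // required before CPU access
    printf("%f\n", u[0]);                       // CPU reads directly
    cudaFree(u);
    return 0;
}
```

The managed version removes both cudaMemcpy calls, which is the programming-complexity reduction the paper describes; the performance cost it measures comes from the system-managed page migration that replaces those explicit copies.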
ISBN:
(Print) 9781605584980
In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state-of-the-art high-level synthesis tool AutoPilot from AutoESL to efficiently map the parallelism exposed in CUDA kernels onto reconfigurable devices. The use of the CUDA programming model offers the advantage of a common programming interface for exploiting parallelism on two very different types of accelerators: FPGAs and GPUs. Moreover, by leveraging the advanced synthesis capabilities of AutoPilot, we enable efficient exploitation of FPGA configurability for application-specific acceleration. Our flow is based on a compilation process that transforms the SPMD CUDA thread blocks into high-concurrency AutoPilot-C code. We provide an overview of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the generated multi-core accelerators.
ISBN:
(Print) 9781479914203
Following the advances in remote sensing technology in the last decade, the horizontal and vertical scan resolutions for digital terrains have reached the order of a meter and a decimeter, respectively. At these resolutions, descriptions of real terrains require very large storage spaces. Efficient storage, transfer, retrieval, and manipulation of such large amounts of data require an efficient compression method. This paper presents a method for fast lossy and lossless compression of regular height fields, which are a commonly used representation for surfaces scanned at regular intervals along two axes. The method is suitable for SIMD parallel implementation and thus inherently suited to modern GPU architectures, which significantly outperform modern CPUs in computation speed and are already present in home computers. The method allows independent decompression of individual data points as well as progressive decompression. Even in the case of lossy decompression, the decompressed surface is inherently seamless. The method's efficiency was confirmed through a CUDA implementation of the compression and decompression algorithms and through its application in a terrain visualization system.
ISBN:
(Print) 9781538672242
Synthetic Aperture Radar (SAR) imaging technology is widely used in fields such as remote sensing observation and navigation positioning; however, SAR imaging involves large data volumes and long processing times. Based on the Compute Unified Device Architecture (CUDA) programming model, the range-Doppler (R-D) SAR imaging algorithm is designed and implemented with parallel optimization on a CPU-GPU heterogeneous platform and tested on a Tesla K20 GPU. The tests show that the efficiency of the core steps of the R-D algorithm is greatly improved.
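As a rough illustration of how one core step of the R-D algorithm (range compression) maps onto CUDA, the sketch below performs a batched FFT, a frequency-domain matched-filter multiply, and an inverse FFT using the cuFFT library. Function and variable names here are assumptions; the paper's actual implementation details are not shown:

```cuda
#include <cufft.h>

// Frequency-domain matched filter: one thread per sample. The filter has
// one coefficient per range bin, shared by every azimuth line.
__global__ void apply_matched_filter(cufftComplex *data,
                                     const cufftComplex *filt,
                                     int n_range, int total)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total) return;
    cufftComplex a = data[i], b = filt[i % n_range];
    data[i].x = a.x * b.x - a.y * b.y;   // complex multiplication
    data[i].y = a.x * b.y + a.y * b.x;
}

// Range compression for the whole echo matrix: FFT -> multiply -> IFFT.
void range_compress(cufftComplex *d_echo, const cufftComplex *d_filt,
                    int n_range, int n_azimuth)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n_range, CUFFT_C2C, n_azimuth);   // batched 1-D FFTs
    cufftExecC2C(plan, d_echo, d_echo, CUFFT_FORWARD);

    int total = n_range * n_azimuth;
    apply_matched_filter<<<(total + 255) / 256, 256>>>(d_echo, d_filt,
                                                       n_range, total);

    cufftExecC2C(plan, d_echo, d_echo, CUFFT_INVERSE);   // unnormalized
    cufftDestroy(plan);
}
```

Azimuth compression in the R-D algorithm has the same FFT-multiply-IFFT shape along the other axis, which is why the whole pipeline parallelizes well on a CPU-GPU heterogeneous platform.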
ISBN:
(Print) 9781479953646
In large-scale chip multicores, last-level cache management and the core interconnection network play important roles in performance and power consumption. In such multicores, mesh interconnects are widely used owing to their scalability and design simplicity. Because the interconnection network occupies significant area and consumes a significant fraction of system power, a bufferless network is an appealing alternative design for reducing power consumption and hardware cost. We have designed and implemented a simulator for the distributed cache management of a large chip multicore in which the cores are connected by a bufferless interconnection network. We have also redesigned and implemented DDGSim, a GPU-compatible parallel version of the same simulator, using the CUDA programming model. We have simulated target chip multicores with up to 43,000 cores and achieved up to 25 times speedup on an NVIDIA GeForce GTX 690 GPU over serial simulation.