ISBN:
(print) 9781665458412
Object detection and tracking are essential tasks in many computer vision applications. One of the most popular tracking algorithms is the particle filter, which is widely used for real-time object tracking in live video streams. While very popular, the particle filter algorithm suffers from increased computational runtimes for high-resolution frames and large numbers of particles. In this paper, we investigate the use of CUDA programming as a method to parallelize portions of the particle filter algorithm in order to speed up its execution time on compute systems equipped with NVIDIA GPUs. Experiments comparing a sequential CPU version, as the base case, with the CUDA-parallelized version demonstrate a speed-up of up to 7.5x for 3840x2160 video resolution and 9216 particles on a computer equipped with an NVIDIA Tesla K40c GPU.
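The per-particle structure that makes the algorithm a good CUDA target can be seen in a minimal CPU reference of one bootstrap particle-filter iteration (a 1-D toy state, not the paper's implementation; the predict and weighting loops are independent per particle and would map to one GPU thread each):

```python
import math
import random

def pf_step(particles, measurement, motion_std=1.0, meas_std=1.0, rng=random):
    """One bootstrap particle-filter iteration for a 1-D toy state."""
    n = len(particles)
    # Predict: propagate each particle through a noisy motion model (parallel per particle).
    pred = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement under each particle (parallel per particle).
    w = [math.exp(-0.5 * ((p - measurement) / meas_std) ** 2) for p in pred]
    total = sum(w)
    w = [x / total for x in w]
    # Systematic resampling: duplicate high-weight particles, drop low-weight ones.
    cum, acc = [], 0.0
    for x in w:
        acc += x
        cum.append(acc)
    start = rng.random() / n
    out, j = [], 0
    for i in range(n):
        u = start + i / n
        while j < n - 1 and cum[j] < u:
            j += 1
        out.append(pred[j])
    return out
```

Running this repeatedly against a fixed measurement pulls the particle cloud toward the measured state; the resampling pass is the harder step to parallelize, which is why GPU particle filters get the largest gains from the predict/weight stages.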
In this study, a GPU-accelerated improved mixed Lagrangian-Eulerian (IMLE) method is proposed to solve the three-dimensional incompressible Navier-Stokes equations. To improve prediction accuracy, the proposed IMLE method approximates the total derivative term in the Lagrangian sense, while the spatial derivative terms are approximated on Eulerian coordinates. Transfer of data from Lagrangian particles to Eulerian grids is carried out accurately by adopting the moving least squares (MLS) interpolation method. The velocity-pressure decoupling issue is overcome by adopting a pressure-free projection method, in which the pressure field is calculated by solving a pressure Poisson equation (PPE). Note that MLS interpolation is time consuming, since it is a pointwise scheme in which a local matrix equation must be solved at each grid point. In addition, the discretized PPE forms a large sparse matrix that is computationally intensive to solve using the conjugate gradient (CG) method. We therefore resort to CUDA and OpenMP programming to accelerate the computation. In this study, the multi-GPU code runs up to 27 times faster than the multi-threaded CPU version. (C) 2019 Elsevier Ltd. All rights reserved.
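The CG solve at the heart of the PPE step can be sketched in a few lines; this is a generic matrix-free conjugate gradient on a 1-D Poisson operator (illustrative only, far smaller than the paper's 3-D problem), and every step inside the loop — a matvec, dot products, vector updates — is exactly the kind of kernel a CUDA/OpenMP port parallelizes:

```python
def laplacian_1d(x):
    """Matrix-free 1-D Poisson operator with Dirichlet BCs: (Ax)_i = 2x_i - x_{i-1} - x_{i+1}."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def conjugate_gradient(matvec, b, tol=1e-12, max_iter=1000):
    """Plain (unpreconditioned) CG for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]   # initial residual
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_next = sum(ri * ri for ri in r)
        if rs_next < tol:
            break
        beta = rs_next / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_next
    return x
```

On the GPU the dot products become parallel reductions and the axpy updates become embarrassingly parallel element-wise kernels, which is where the reported speedup comes from.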
ISBN:
(print) 9798331540920; 9783907144107
In recent literature, it has been shown that the number of steps in a sequential quadratic programming algorithm for a non-linear model predictive control (NMPC) problem can be greatly reduced by a parallel shooting method. The efficiency of such a parallel shooting method further depends on how the algorithm is implemented on parallel computing platforms such as graphics processing units (GPUs). The GPU implementation should consider the degree of parallelism necessary for higher time efficiency, as well as the hardware resource consumption and limitations of the GPU for a given problem size. In this paper, we present a multilevel parallel GPU implementation of sequential quadratic programming and an alternating direction method of multipliers (ADMM) solver. First, we introduce a GPU implementation enabling parallel computation of many quadratic programs (QPs) via functional parallelism. Next, we parallelize each QP solver using data parallelism over basic linear matrix operations. We show that the proposed GPU implementation scales well with the degree of parallelism in the parallel shooting method. Further, we show how a GPU implementation can be configured for a given problem size while avoiding resource overprovisioning.
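The ADMM iteration for a single QP is short enough to sketch. Below is a generic box-constrained QP — minimize 0.5 x'Px + q'x subject to lo <= x <= u — solved by ADMM in 2-D with an explicit 2x2 inverse for the x-update (a toy stand-in, not the paper's solver; in the GPU setting the linear solve and the element-wise z/dual updates are the data-parallel pieces, and many such QPs run concurrently via functional parallelism):

```python
def admm_box_qp(P, q, lo, hi, rho=1.0, iters=200):
    """ADMM for min 0.5 x'Px + q'x  s.t.  lo <= x <= hi (elementwise), 2-D toy case."""
    # x-update solves (P + rho*I) x = rho*(z - u) - q; here via explicit 2x2 inverse.
    a, b = P[0][0] + rho, P[0][1]
    c, d = P[1][0], P[1][1] + rho
    det = a * d - b * c
    z = [0.0, 0.0]   # consensus variable (kept feasible)
    u = [0.0, 0.0]   # scaled dual variable
    for _ in range(iters):
        r0 = rho * (z[0] - u[0]) - q[0]
        r1 = rho * (z[1] - u[1]) - q[1]
        x = [(d * r0 - b * r1) / det, (-c * r0 + a * r1) / det]
        # z-update: projection onto the box (element-wise, trivially parallel).
        z = [min(max(x[i] + u[i], lo[i]), hi[i]) for i in range(2)]
        # dual update (element-wise, trivially parallel).
        u = [u[i] + x[i] - z[i] for i in range(2)]
    return z
```

For P = 2I and q = (-2, -10) on the unit box, the unconstrained optimum (1, 5) gets clipped and ADMM converges to (1, 1).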
ISBN:
(print) 9798331528539; 9798331528546
Dehazing algorithms have been developed in response to the need for effectively and instantaneously removing atmospheric turbidities such as mist, haze, and fog from media. The removal of haze from an image or video enables the extraction of additional details from the scene. This paper presents the development of a real-time, memory-optimized dehazing system that utilizes digital image processing techniques and NVIDIA's CUDA architecture for efficient parallel computing. The methodology incorporates a quad-tree search algorithm for efficient atmospheric light estimation, which significantly enhances dehazing accuracy. Advanced contrast enhancement techniques are employed for transmission estimation to ensure clarity and visibility in dehazed images, addressing challenges such as non-uniform illumination and varying haze densities. For image restoration, the system dynamically clips values to improve brightness and minimize information loss. The transmission map's artifacts are then smoothed using a Gaussian filter, resulting in more reliable dehazing. Extensive testing on a variety of publicly accessible datasets demonstrated that the proposed model achieved comparable accuracy to numerous existing techniques, while also restoring high-quality dehazed images and videos. The system achieves a running time of 35 ms per image and up to 10 ms per frame for video sequences. Performance was objectively assessed using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), yielding a PSNR of 28.75 and an SSIM of 0.80.
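A common form of the quad-tree atmospheric light search can be sketched as follows (a minimal grayscale version under assumed details — quadrant scoring by mean intensity and a brightest-pixel readout — since the abstract does not spell these out):

```python
def atmospheric_light(img, min_size=2):
    """Quad-tree atmospheric light search on a grayscale image (list of equal rows):
    repeatedly descend into the quadrant with the highest mean intensity,
    then return the brightest pixel of the final block."""
    y0, x0, h, w = 0, 0, len(img), len(img[0])
    while h > min_size and w > min_size:
        hh, hw = h // 2, w // 2
        best = None
        for oy, ox in ((y0, x0), (y0, x0 + hw), (y0 + hh, x0), (y0 + hh, x0 + hw)):
            mean = (sum(img[y][x]
                        for y in range(oy, oy + hh)
                        for x in range(ox, ox + hw)) / (hh * hw))
            if best is None or mean > best[0]:
                best = (mean, oy, ox)
        _, y0, x0 = best      # descend into the brightest quadrant
        h, w = hh, hw
    return max(img[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w))
```

Because each level only averages four sub-blocks, the search touches O(log n) levels rather than scanning every pixel repeatedly, which is what makes it cheap enough for the real-time budget quoted above.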
ISBN:
(print) 9783031444685; 9783031444692
In this work, we tackle the problem of estimating the security of iterated symmetric ciphers in an efficient manner, with tests that do not require a deep analysis of the internal structure of the cipher. This is particularly useful during the design phase of these ciphers, especially for quickly testing several combinations of possible parameters defining several cipher design variants. We consider a popular statistical test that allows us to determine the probability of flipping each cipher output bit, given a small variation in the input of the cipher. From these probabilities, one can compute three measurable metrics related to the well-known full diffusion, avalanche and strict avalanche criteria. This highly parallelizable testing process scales linearly with the number of samples, i.e., cipher inputs, to be evaluated and the number of design variants to be tested. However, the number of design variants might grow exponentially with respect to some parameters, and the high cost of central processing units (CPUs) makes them a poor candidate for this kind of parallelization. As a main contribution, we propose a framework, ACE-HoT, to parallelize the testing process using multiple graphics processing units (GPUs). Our implementation does not perform any intermediate CPU-GPU data transfers. The diffusion and avalanche criteria can be seen as an application of discrete first-order derivatives. As a secondary contribution, we generalize these criteria to their higher-order versions. Our generalization requires an exponentially larger number of samples in order to compute sufficiently accurate probabilities. As a case study, we apply ACE-HoT to most of the finalists of the National Institute of Standards and Technology (NIST) lightweight standardization process, with a special focus on the winner ASCON.
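The first-order statistical test described above — flip one input bit, count which output bits change — can be sketched directly. Here SHA-256 truncated to 16 bits stands in for a cipher round function (an assumption for illustration; it is not one of the ciphers studied, just a primitive with known-good avalanche behaviour). Each (sample, input-bit) pair is independent, which is the parallelism ACE-HoT exploits on GPUs:

```python
import hashlib
import random

def f(x):
    # Stand-in 16-bit primitive: truncated SHA-256 of the input.
    return int.from_bytes(hashlib.sha256(x.to_bytes(2, "big")).digest()[:2], "big")

def avalanche_probs(func, nbits, samples, rng):
    """probs[i][j] = estimated Pr[output bit j flips | input bit i flips]."""
    counts = [[0] * nbits for _ in range(nbits)]
    for _ in range(samples):
        x = rng.getrandbits(nbits)
        y = func(x)
        for i in range(nbits):                # each flip is independent work
            diff = y ^ func(x ^ (1 << i))
            for j in range(nbits):
                counts[i][j] += (diff >> j) & 1
    return [[c / samples for c in row] for row in counts]
```

Full diffusion requires every probability to be nonzero, the avalanche criterion asks for an average near 0.5, and the strict avalanche criterion asks for every individual probability near 0.5 — all read off directly from this matrix.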
The integration of video data computation and inference is a cornerstone for the evolution of multimodal artificial intelligence (MAI). The extensive adoption and optimization of CNN-based frameworks have significantly improved the accuracy of video inference, yet they present substantial challenges for real-time and large-scale computational demands. Existing research primarily utilizes the temporal similarity between video frames to reduce redundant computation, but mostly overlooks the spatial similarity within the frames themselves. Hence, we propose STVAI, a scalable and efficient method that leverages both spatial and temporal similarities to accelerate video inference. This approach uses a parallel region merging strategy, which maintains inference accuracy and enhances the sparsity of the computation matrix. Moreover, we have optimized the computation of sparse convolutions by utilizing Tensor Cores, which accelerate dense convolution computations based on the sparsity of the tiles. Experimental results demonstrate that STVAI achieves a stable acceleration of 1.25x over cuDNN implementations, with only a 5% decrease in prediction accuracy. STVAI can achieve accelerations of up to 1.53x, surpassing existing methods. Our method can be directly applied to various CNN architectures for video inference tasks without the need for retraining the model.
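The core idea of temporal tile sparsity — only recompute the tiles that actually changed between frames — can be illustrated with a small sketch (not STVAI itself; tile size and threshold are hypothetical parameters):

```python
def changed_tiles(prev, cur, tile=4, thresh=8):
    """Mark which tile-sized blocks of `cur` differ from `prev` by more than
    `thresh` in any pixel; only those tiles would need re-convolution."""
    h, w = len(cur), len(cur[0])
    mask = []
    for ty in range(0, h, tile):
        row = []
        for tx in range(0, w, tile):
            diff = max(abs(cur[y][x] - prev[y][x])
                       for y in range(ty, min(ty + tile, h))
                       for x in range(tx, min(tx + tile, w)))
            row.append(diff > thresh)
        mask.append(row)
    return mask
```

The resulting boolean mask is the sparsity pattern: the denser the unchanged regions, the more convolution work can be skipped, and grouping the surviving tiles (the paper's region merging) keeps the remaining work in dense blocks that map well onto Tensor Cores.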
ISBN:
(print) 9783031097263
The island model is one technique for tackling complex and critical difficulties of evolutionary algorithms. This paper designs a two-replacement policy and a warp-based island mapping mechanism in TRPIM, with a ring topology, implemented on NVIDIA GPUs using CUDA. Each thread in a warp-based island executes the same instruction sequence in parallel, eliminating thread divergence. The two-replacement policy replaces worse individuals with better ones both asynchronously and synchronously, reducing waiting time. We conduct experiments on the knapsack problem to verify the effectiveness of the warp-based island mapping mechanism and the two-replacement policy in TRPIM. The results show that the proposed TRPIM improves speedup and solution quality in the GPU version compared to the CPU.
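A sequential sketch of the island model with ring migration shows the structure TRPIM maps onto warps (this uses a OneMax toy fitness rather than the paper's knapsack, and a simple replace-worst migration standing in for the two-replacement policy):

```python
import random

def evolve_islands(n_islands=4, pop=10, genome=20, gens=40, rng=None):
    """Island-model GA on a ring: each island evolves independently, then
    the best of island k replaces the worst of island k+1 each generation."""
    rng = rng or random.Random(0)
    fit = lambda ind: sum(ind)   # OneMax toy fitness (maximize number of 1-bits)
    islands = [[[rng.randint(0, 1) for _ in range(genome)] for _ in range(pop)]
               for _ in range(n_islands)]
    for _ in range(gens):
        for isl in islands:      # per-island evolution: one warp each in TRPIM
            new = []
            for _ in range(pop):
                a = max(rng.sample(isl, 3), key=fit)   # tournament selection
                b = max(rng.sample(isl, 3), key=fit)
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
                child = [1 - bit if rng.random() < 0.02 else bit for bit in child]
                new.append(child)
            isl[:] = new
        # Ring migration: best of each island replaces the worst of its successor.
        bests = [max(isl, key=fit) for isl in islands]
        for k, isl in enumerate(islands):
            worst = min(range(pop), key=lambda i: fit(isl[i]))
            isl[worst] = list(bests[(k - 1) % n_islands])
    return max((max(isl, key=fit) for isl in islands), key=fit)
```

In the warp-based mapping, every thread of a warp runs the same selection/crossover/mutation sequence for its island, so there is no intra-warp branch divergence; the migration step is where the synchronous/asynchronous replacement policies differ.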
ISBN:
(digital) 9781665459570
ISBN:
(print) 9781665459570
SM4 is a symmetric key algorithm developed by the China National Cryptographic Authority. In this paper, a parallel implementation of the SM4 block cipher, commonly used in China, was performed on GPU. The SM4 block cipher has an implementation that uses an 8-bit Sbox table and an implementation that uses a 32-bit T-table. Measuring the performance of the two table implementations, the T-table implementation performed approximately 0.75x as fast as the Sbox table implementation. Additionally, SM4 was implemented to use shared memory for better performance, yielding a performance improvement of approximately 1.06x to 1.19x when using shared memory in the Sbox table implementation.
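The Sbox-vs-T-table distinction is a table-lookup trade-off: a T-table folds the bytewise Sbox and the linear transform into four 32-bit lookups. The sketch below shows the construction using SM4's published linear transform L but a hypothetical stand-in Sbox (the real SM4 Sbox is a fixed 256-entry table, omitted here); because L is XOR-linear, the two paths agree:

```python
def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def L(b):
    # SM4 linear diffusion transform: L(B) = B ^ (B<<<2) ^ (B<<<10) ^ (B<<<18) ^ (B<<<24)
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

# Hypothetical 8-bit Sbox stand-in for illustration only.
SBOX = [(i * 167 + 13) & 0xFF for i in range(256)]

# Four precomputed T-tables, one per byte position:
# L(Sbox applied bytewise) becomes four lookups and three XORs.
T = [[L(SBOX[b] << (24 - 8 * pos)) for b in range(256)] for pos in range(4)]

def tau_then_L(x):
    """Reference path: apply Sbox to each byte, then the linear transform."""
    s = 0
    for pos in range(4):
        s |= SBOX[(x >> (24 - 8 * pos)) & 0xFF] << (24 - 8 * pos)
    return L(s)

def t_table(x):
    """Table path: four 32-bit lookups, three XORs."""
    return (T[0][(x >> 24) & 0xFF] ^ T[1][(x >> 16) & 0xFF]
            ^ T[2][(x >> 8) & 0xFF] ^ T[3][x & 0xFF])
```

On a GPU the trade-off the paper measures follows from memory footprint: the 8-bit Sbox (256 bytes) fits easily in shared memory with few bank conflicts, whereas the four 1 KiB T-tables consume more shared memory per block, which is consistent with the T-table variant losing on this hardware.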
General-purpose graphics processing unit (GPU) computing has emerged as a leading parallel computing paradigm, offering significant performance gains in various domains such as scientific computing and deep learning. However, GPU programs are susceptible to numerical bugs, which can lead to incorrect results or crashes. These bugs are difficult to detect, debug, and fix due to their dependence on specific input values or types and the absence of reliable error-checking mechanisms and oracles. Additionally, the unique programming conventions of GPUs complicate identifying the root causes of bugs, while fixing them requires domain-specific knowledge of GPU computing and numerical libraries. Therefore, understanding the characteristics of GPU numerical bugs (GPU-NBs) is crucial for developing effective tools. In this paper, we conduct a comprehensive study of GPU-NBs by analyzing 397 real-world bug samples from GitHub. We identify common root causes, symptoms, input patterns, and test oracles that trigger these bugs, and the strategies used to fix them. We also present GPU-NBDetect, a preliminary tool designed to detect numerical bugs across six distinct bug categories. GPU-NBDetect detected a total of 226 bugs across 186 mathematical functions in four libraries, with 60 of the bugs confirmed by developers. Our findings lay the groundwork for developing detection and prevention techniques for GPU-NBs and offer insights for building more effective debugging and auto-repair tools.
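One classic root cause in this family is floating-point non-associativity: a GPU parallel reduction sums in a tree order, while a CPU reference sums sequentially, and the two can legitimately disagree. A tiny CPU illustration (the pairwise sum mimics the reduction order a GPU kernel typically uses):

```python
def sum_sequential(xs):
    """Left-to-right accumulation, the order a naive CPU loop uses."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def sum_pairwise(xs):
    """Tree reduction, the order a GPU parallel reduction typically uses."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])

# Same inputs, different rounding along each path:
# (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3) in IEEE 754 doubles.
vals = [0.1, 0.2, 0.3]
```

A test oracle that compares GPU output bit-for-bit against a CPU reference will flag this as a "bug" even in correct code, which is part of why the paper stresses that reliable oracles for GPU-NBs are hard to build.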
This paper presents a GPU-accelerated implementation of an image encryption algorithm. The algorithm uses the concepts of a modified XOR cipher to encrypt and decrypt the images, with an encryption pad generated using the shared secret key and some initialization vectors. It uses a genetically optimized pseudo-random generator that outputs a stream of random bytes of the specified length. The proposed algorithm is subjected to a number of theoretical, experimental, and mathematical analyses to examine its performance and security against a number of possible attacks, using the following metrics: histogram analysis, correlation analysis, information entropy analysis, NPCR, and UACI. The performance analysis shows an average speedup ratio of 3.489 for encryption and 4.055 for decryption of the GPU parallel implementation over the serial implementation. The algorithm aims to provide better performance benchmarks, which can significantly improve the experience in relevant use cases, such as real-time media applications.
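The pad-based XOR scheme has a simple skeleton, sketched below with SHA-256 in counter mode standing in for the paper's genetically optimized pseudo-random generator (an assumption for illustration; the key/IV derivation details are hypothetical). Because XOR is its own inverse, encryption and decryption are the same operation, and every byte is independent — which is what makes the cipher embarrassingly parallel on a GPU:

```python
import hashlib

def keystream(key: bytes, iv: bytes, n: int) -> bytes:
    """Derive an n-byte pad from key and IV.
    SHA-256 in counter mode is a stand-in for the paper's PRNG."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def xor_cipher(data: bytes, key: bytes, iv: bytes) -> bytes:
    """Encrypt or decrypt: XOR each byte with the pad (same op both ways)."""
    pad = keystream(key, iv, len(data))
    return bytes(d ^ p for d, p in zip(data, pad))
```

In a CUDA port, each thread would handle one byte (or word) of the XOR, and pad blocks can be generated independently per counter value, matching the reported encryption/decryption speedups.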