ISBN:
(print) 9781665458412
Object detection and tracking are essential tasks in many computer vision applications. One of the most popular tracking algorithms is the particle filter, which is widely used for real-time object tracking in live video streams. While very popular, the particle filter algorithm suffers from increased computational runtimes for high-resolution frames and large numbers of particles. In this paper, we investigate the use of CUDA programming as a method to parallelize portions of the particle filter algorithm in order to speed up its execution time on compute systems equipped with NVIDIA GPUs. Experiments comparing a sequential CPU version, as the base case, with the CUDA-parallelized version demonstrate a speed-up of up to 7.5x for 3840x2160 video resolution and 9216 particles on a computer equipped with an NVIDIA Tesla K40c GPU.
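The per-particle structure that makes the algorithm a good CUDA target can be seen in a minimal CPU reference of one bootstrap particle-filter iteration (a 1-D toy state, not the paper's implementation; the predict and weighting loops are independent per particle and would map to one GPU thread each):

```python
import math
import random

def pf_step(particles, measurement, motion_std=1.0, meas_std=1.0, rng=random):
    """One bootstrap particle-filter iteration for a 1-D toy state."""
    n = len(particles)
    # Predict: propagate each particle through a noisy motion model (parallel per particle).
    pred = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement under each particle (parallel per particle).
    w = [math.exp(-0.5 * ((p - measurement) / meas_std) ** 2) for p in pred]
    total = sum(w)
    w = [x / total for x in w]
    # Systematic resampling: duplicate high-weight particles, drop low-weight ones.
    cum, acc = [], 0.0
    for x in w:
        acc += x
        cum.append(acc)
    start = rng.random() / n
    out, j = [], 0
    for i in range(n):
        u = start + i / n
        while j < n - 1 and cum[j] < u:
            j += 1
        out.append(pred[j])
    return out
```

Running this repeatedly against a fixed measurement pulls the particle cloud toward the measured state; the resampling pass is the harder step to parallelize, which is why GPU particle filters get the largest gains from the predict/weight stages.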
In this study, a GPU-accelerated improved mixed Lagrangian-Eulerian (IMLE) method is proposed to solve the three-dimensional incompressible Navier-Stokes equations. To improve prediction accuracy, the proposed IMLE method approximates the total derivative term in the Lagrangian sense, while the spatial derivative terms are approximated on Eulerian coordinates. Transfer of data from Lagrangian particles to Eulerian grids is carried out accurately by adopting the moving least squares (MLS) interpolation method. The velocity-pressure decoupling issue is overcome by adopting a pressure-free projection method, in which the pressure field is calculated by solving a pressure Poisson equation (PPE). Note that MLS interpolation is time consuming, since it is a pointwise scheme in which a local matrix equation must be solved at each grid point. In addition, the discretized PPE forms a large sparse matrix that is computationally intensive to solve using the conjugate gradient (CG) method. We therefore resort to CUDA and OpenMP programming to accelerate the computation. In this study, the multi-GPU code runs up to 27 times faster than the multi-threaded CPU version. (C) 2019 Elsevier Ltd. All rights reserved.
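The CG solve at the heart of the PPE step can be sketched in a few lines; this is a generic matrix-free conjugate gradient on a 1-D Poisson operator (illustrative only, far smaller than the paper's 3-D problem), and every step inside the loop — a matvec, dot products, vector updates — is exactly the kind of kernel a CUDA/OpenMP port parallelizes:

```python
def laplacian_1d(x):
    """Matrix-free 1-D Poisson operator with Dirichlet BCs: (Ax)_i = 2x_i - x_{i-1} - x_{i+1}."""
    n = len(x)
    return [2.0 * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def conjugate_gradient(matvec, b, tol=1e-12, max_iter=1000):
    """Plain (unpreconditioned) CG for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]   # initial residual
    p = list(r)
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_next = sum(ri * ri for ri in r)
        if rs_next < tol:
            break
        beta = rs_next / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_next
    return x
```

On the GPU the dot products become parallel reductions and the axpy updates become embarrassingly parallel element-wise kernels, which is where the reported speedup comes from.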
ISBN:
(print) 9798331540920; 9783907144107
In recent literature, it has been shown that the number of steps in a sequential quadratic programming algorithm for a non-linear model predictive control (NMPC) problem can be greatly reduced by a parallel shooting method. The efficiency of such a parallel shooting method further depends on how the algorithm is implemented on parallel computing platforms such as graphics processing units (GPUs). The GPU implementation should consider the degree of parallelism necessary for higher time efficiency, as well as the hardware resource consumption and limitations of the GPU for a given problem size. In this paper, we present a multilevel parallel GPU implementation of sequential quadratic programming and an alternating direction method of multipliers (ADMM) solver. First, we introduce a GPU implementation enabling parallel computation of many quadratic programs (QPs) via functional parallelism. Next, we parallelize each QP solver using data parallelism over basic linear matrix operations. We show that the proposed GPU implementation scales well with the degree of parallelism in the parallel shooting method. Further, we show how a GPU implementation can be configured for a given problem size while avoiding resource overprovisioning.
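The ADMM iteration for a single QP is short enough to sketch. Below is a generic box-constrained QP — minimize 0.5 x'Px + q'x subject to lo <= x <= u — solved by ADMM in 2-D with an explicit 2x2 inverse for the x-update (a toy stand-in, not the paper's solver; in the GPU setting the linear solve and the element-wise z/dual updates are the data-parallel pieces, and many such QPs run concurrently via functional parallelism):

```python
def admm_box_qp(P, q, lo, hi, rho=1.0, iters=200):
    """ADMM for min 0.5 x'Px + q'x  s.t.  lo <= x <= hi (elementwise), 2-D toy case."""
    # x-update solves (P + rho*I) x = rho*(z - u) - q; here via explicit 2x2 inverse.
    a, b = P[0][0] + rho, P[0][1]
    c, d = P[1][0], P[1][1] + rho
    det = a * d - b * c
    z = [0.0, 0.0]   # consensus variable (kept feasible)
    u = [0.0, 0.0]   # scaled dual variable
    for _ in range(iters):
        r0 = rho * (z[0] - u[0]) - q[0]
        r1 = rho * (z[1] - u[1]) - q[1]
        x = [(d * r0 - b * r1) / det, (-c * r0 + a * r1) / det]
        # z-update: projection onto the box (element-wise, trivially parallel).
        z = [min(max(x[i] + u[i], lo[i]), hi[i]) for i in range(2)]
        # dual update (element-wise, trivially parallel).
        u = [u[i] + x[i] - z[i] for i in range(2)]
    return z
```

For P = 2I and q = (-2, -10) on the unit box, the unconstrained optimum (1, 5) gets clipped and ADMM converges to (1, 1).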
ISBN:
(print) 9798331528539; 9798331528546
Dehazing algorithms have been developed in response to the need for effectively and instantaneously removing atmospheric turbidities such as mist, haze, and fog from media. The removal of haze from an image or video enables the extraction of additional details from the scene. This paper presents the development of a real-time, memory-optimized dehazing system that utilizes digital image processing techniques and NVIDIA's CUDA architecture for efficient parallel computing. The methodology incorporates a quad-tree search algorithm for efficient atmospheric light estimation, which significantly enhances dehazing accuracy. Advanced contrast enhancement techniques are employed for transmission estimation to ensure clarity and visibility in dehazed images, addressing challenges such as non-uniform illumination and varying haze densities. For image restoration, the system dynamically clips values to improve brightness and minimize information loss. The transmission map's artifacts are then smoothed using a Gaussian filter, resulting in more reliable dehazing. Extensive testing on a variety of publicly accessible datasets demonstrated that the proposed model achieved comparable accuracy to numerous existing techniques, while also restoring high-quality dehazed images and videos. The system achieves a running time of 35 ms per image and up to 10 ms per frame for video sequences. Performance was objectively assessed using the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), yielding a PSNR of 28.75 and an SSIM of 0.80.
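A common form of the quad-tree atmospheric light search can be sketched as follows (a minimal grayscale version under assumed details — quadrant scoring by mean intensity and a brightest-pixel readout — since the abstract does not spell these out):

```python
def atmospheric_light(img, min_size=2):
    """Quad-tree atmospheric light search on a grayscale image (list of equal rows):
    repeatedly descend into the quadrant with the highest mean intensity,
    then return the brightest pixel of the final block."""
    y0, x0, h, w = 0, 0, len(img), len(img[0])
    while h > min_size and w > min_size:
        hh, hw = h // 2, w // 2
        best = None
        for oy, ox in ((y0, x0), (y0, x0 + hw), (y0 + hh, x0), (y0 + hh, x0 + hw)):
            mean = (sum(img[y][x]
                        for y in range(oy, oy + hh)
                        for x in range(ox, ox + hw)) / (hh * hw))
            if best is None or mean > best[0]:
                best = (mean, oy, ox)
        _, y0, x0 = best      # descend into the brightest quadrant
        h, w = hh, hw
    return max(img[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w))
```

Because each level only averages four sub-blocks, the search touches O(log n) levels rather than scanning every pixel repeatedly, which is what makes it cheap enough for the real-time budget quoted above.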
ISBN:
(print) 9783031444685; 9783031444692
In this work, we tackle the problem of estimating the security of iterated symmetric ciphers in an efficient manner, with tests that do not require a deep analysis of the internal structure of the cipher. This is particularly useful during the design phase of these ciphers, especially for quickly testing several combinations of possible parameters defining several cipher design variants. We consider a popular statistical test that allows us to determine the probability of flipping each cipher output bit, given a small variation in the input of the cipher. From these probabilities, one can compute three measurable metrics related to the well-known full diffusion, avalanche and strict avalanche criteria. This highly parallelizable testing process scales linearly with the number of samples, i.e., cipher inputs, to be evaluated and the number of design variants to be tested. However, the number of design variants might grow exponentially with respect to some parameters, and the high cost of central processing units (CPUs) makes them a poor candidate for this kind of parallelization. As a main contribution, we propose a framework, ACE-HoT, to parallelize the testing process using multiple graphics processing units (GPUs). Our implementation does not perform any intermediate CPU-GPU data transfers. The diffusion and avalanche criteria can be seen as an application of discrete first-order derivatives. As a secondary contribution, we generalize these criteria to their higher-order versions. Our generalization requires an exponentially larger number of samples in order to compute sufficiently accurate probabilities. As a case study, we apply ACE-HoT to most of the finalists of the National Institute of Standards and Technology (NIST) lightweight standardization process, with a special focus on the winner ASCON.
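The first-order statistical test described above — flip one input bit, count which output bits change — can be sketched directly. Here SHA-256 truncated to 16 bits stands in for a cipher round function (an assumption for illustration; it is not one of the ciphers studied, just a primitive with known-good avalanche behaviour). Each (sample, input-bit) pair is independent, which is the parallelism ACE-HoT exploits on GPUs:

```python
import hashlib
import random

def f(x):
    # Stand-in 16-bit primitive: truncated SHA-256 of the input.
    return int.from_bytes(hashlib.sha256(x.to_bytes(2, "big")).digest()[:2], "big")

def avalanche_probs(func, nbits, samples, rng):
    """probs[i][j] = estimated Pr[output bit j flips | input bit i flips]."""
    counts = [[0] * nbits for _ in range(nbits)]
    for _ in range(samples):
        x = rng.getrandbits(nbits)
        y = func(x)
        for i in range(nbits):                # each flip is independent work
            diff = y ^ func(x ^ (1 << i))
            for j in range(nbits):
                counts[i][j] += (diff >> j) & 1
    return [[c / samples for c in row] for row in counts]
```

Full diffusion requires every probability to be nonzero, the avalanche criterion asks for an average near 0.5, and the strict avalanche criterion asks for every individual probability near 0.5 — all read off directly from this matrix.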
The integration of video data computation and inference is a cornerstone for the evolution of multimodal artificial intelligence (MAI). The extensive adoption and optimization of CNN-based frameworks have significantly improved the accuracy of video inference, yet they present substantial challenges for real-time and large-scale computational demands. Existing research primarily utilizes the temporal similarity between video frames to reduce redundant computation, but mostly overlooks the spatial similarity within the frames themselves. Hence, we propose STVAI, a scalable and efficient method that leverages both spatial and temporal similarities to accelerate video inference. This approach uses a parallel region merging strategy, which maintains inference accuracy and enhances the sparsity of the computation matrix. Moreover, we have optimized the computation of sparse convolutions by utilizing Tensor Cores, which accelerate dense convolution computations based on the sparsity of the tiles. Experimental results demonstrate that STVAI achieves a stable acceleration of 1.25x over cuDNN implementations, with only a 5% decrease in prediction accuracy. STVAI can achieve accelerations of up to 1.53x, surpassing existing methods. Our method can be directly applied to various CNN architectures for video inference tasks without the need for retraining the model.
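The core idea of temporal tile sparsity — only recompute the tiles that actually changed between frames — can be illustrated with a small sketch (not STVAI itself; tile size and threshold are hypothetical parameters):

```python
def changed_tiles(prev, cur, tile=4, thresh=8):
    """Mark which tile-sized blocks of `cur` differ from `prev` by more than
    `thresh` in any pixel; only those tiles would need re-convolution."""
    h, w = len(cur), len(cur[0])
    mask = []
    for ty in range(0, h, tile):
        row = []
        for tx in range(0, w, tile):
            diff = max(abs(cur[y][x] - prev[y][x])
                       for y in range(ty, min(ty + tile, h))
                       for x in range(tx, min(tx + tile, w)))
            row.append(diff > thresh)
        mask.append(row)
    return mask
```

The resulting boolean mask is the sparsity pattern: the denser the unchanged regions, the more convolution work can be skipped, and grouping the surviving tiles (the paper's region merging) keeps the remaining work in dense blocks that map well onto Tensor Cores.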
ISBN:
(print) 9783031097263
The island model is one technique for tackling complex and critical difficulties of evolutionary algorithms. This paper designs a two-replacement policy and a warp-based island mapping mechanism in TRPIM, with a ring topology, implemented on NVIDIA GPUs using CUDA. Each thread in a warp-based island executes the same instruction sequence in parallel, eliminating thread divergence. The two-replacement policy replaces worse individuals with better ones both asynchronously and synchronously, reducing waiting time. We conduct experiments on the knapsack problem to verify the effectiveness of the warp-based island mapping mechanism and the two-replacement policy in TRPIM. The results show that the proposed TRPIM improves speedup and solution quality in the GPU version compared to the CPU.
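A sequential sketch of the island model with ring migration shows the structure TRPIM maps onto warps (this uses a OneMax toy fitness rather than the paper's knapsack, and a simple replace-worst migration standing in for the two-replacement policy):

```python
import random

def evolve_islands(n_islands=4, pop=10, genome=20, gens=40, rng=None):
    """Island-model GA on a ring: each island evolves independently, then
    the best of island k replaces the worst of island k+1 each generation."""
    rng = rng or random.Random(0)
    fit = lambda ind: sum(ind)   # OneMax toy fitness (maximize number of 1-bits)
    islands = [[[rng.randint(0, 1) for _ in range(genome)] for _ in range(pop)]
               for _ in range(n_islands)]
    for _ in range(gens):
        for isl in islands:      # per-island evolution: one warp each in TRPIM
            new = []
            for _ in range(pop):
                a = max(rng.sample(isl, 3), key=fit)   # tournament selection
                b = max(rng.sample(isl, 3), key=fit)
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
                child = [1 - bit if rng.random() < 0.02 else bit for bit in child]
                new.append(child)
            isl[:] = new
        # Ring migration: best of each island replaces the worst of its successor.
        bests = [max(isl, key=fit) for isl in islands]
        for k, isl in enumerate(islands):
            worst = min(range(pop), key=lambda i: fit(isl[i]))
            isl[worst] = list(bests[(k - 1) % n_islands])
    return max((max(isl, key=fit) for isl in islands), key=fit)
```

In the warp-based mapping, every thread of a warp runs the same selection/crossover/mutation sequence for its island, so there is no intra-warp branch divergence; the migration step is where the synchronous/asynchronous replacement policies differ.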
ISBN:
(digital) 9781665459570
ISBN:
(print) 9781665459570
SM4 is a symmetric key algorithm developed by the China National Cryptographic Authority. In this paper, a parallel implementation of the SM4 block cipher, commonly used in China, was performed on GPU. The SM4 block cipher has an implementation that uses an 8-bit Sbox table and an implementation that uses a 32-bit T-table. Measuring the performance of the two table implementations, the T-table implementation performed approximately 0.75x as fast as the Sbox table implementation. Additionally, SM4 was implemented to use shared memory for better performance, yielding a performance improvement of approximately 1.06x to 1.19x when using shared memory in the Sbox table implementation.
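The Sbox-vs-T-table distinction is a table-lookup trade-off: a T-table folds the bytewise Sbox and the linear transform into four 32-bit lookups. The sketch below shows the construction using SM4's published linear transform L but a hypothetical stand-in Sbox (the real SM4 Sbox is a fixed 256-entry table, omitted here); because L is XOR-linear, the two paths agree:

```python
def rotl32(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def L(b):
    # SM4 linear diffusion transform: L(B) = B ^ (B<<<2) ^ (B<<<10) ^ (B<<<18) ^ (B<<<24)
    return b ^ rotl32(b, 2) ^ rotl32(b, 10) ^ rotl32(b, 18) ^ rotl32(b, 24)

# Hypothetical 8-bit Sbox stand-in for illustration only.
SBOX = [(i * 167 + 13) & 0xFF for i in range(256)]

# Four precomputed T-tables, one per byte position:
# L(Sbox applied bytewise) becomes four lookups and three XORs.
T = [[L(SBOX[b] << (24 - 8 * pos)) for b in range(256)] for pos in range(4)]

def tau_then_L(x):
    """Reference path: apply Sbox to each byte, then the linear transform."""
    s = 0
    for pos in range(4):
        s |= SBOX[(x >> (24 - 8 * pos)) & 0xFF] << (24 - 8 * pos)
    return L(s)

def t_table(x):
    """Table path: four 32-bit lookups, three XORs."""
    return (T[0][(x >> 24) & 0xFF] ^ T[1][(x >> 16) & 0xFF]
            ^ T[2][(x >> 8) & 0xFF] ^ T[3][x & 0xFF])
```

On a GPU the trade-off the paper measures follows from memory footprint: the 8-bit Sbox (256 bytes) fits easily in shared memory with few bank conflicts, whereas the four 1 KiB T-tables consume more shared memory per block, which is consistent with the T-table variant losing on this hardware.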
General-purpose graphics processing unit (GPU) computing has emerged as a leading parallel computing paradigm, offering significant performance gains in various domains such as scientific computing and deep learning. However, GPU programs are susceptible to numerical bugs, which can lead to incorrect results or crashes. These bugs are difficult to detect, debug, and fix due to their dependence on specific input values or types and the absence of reliable error-checking mechanisms and oracles. Additionally, the unique programming conventions of GPUs complicate identifying the root causes of bugs, while fixing them requires domain-specific knowledge of GPU computing and numerical libraries. Therefore, understanding the characteristics of GPU numerical bugs (GPU-NBs) is crucial for developing effective tools. In this paper, we conduct a comprehensive study of GPU-NBs by analyzing 397 real-world bug samples from GitHub. We identify common root causes, symptoms, input patterns, and test oracles that trigger these bugs, and the strategies used to fix them. We also present GPU-NBDetect, a preliminary tool designed to detect numerical bugs across six distinct bug categories. GPU-NBDetect detected a total of 226 bugs across 186 mathematical functions in four libraries, with 60 of the bugs confirmed by developers. Our findings lay the groundwork for developing detection and prevention techniques for GPU-NBs and offer insights for building more effective debugging and auto-repair tools.
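One classic root cause in this family is floating-point non-associativity: a GPU parallel reduction sums in a tree order, while a CPU reference sums sequentially, and the two can legitimately disagree. A tiny CPU illustration (the pairwise sum mimics the reduction order a GPU kernel typically uses):

```python
def sum_sequential(xs):
    """Left-to-right accumulation, the order a naive CPU loop uses."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def sum_pairwise(xs):
    """Tree reduction, the order a GPU parallel reduction typically uses."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])

# Same inputs, different rounding along each path:
# (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3) in IEEE 754 doubles.
vals = [0.1, 0.2, 0.3]
```

A test oracle that compares GPU output bit-for-bit against a CPU reference will flag this as a "bug" even in correct code, which is part of why the paper stresses that reliable oracles for GPU-NBs are hard to build.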
This paper presents a GPU-accelerated implementation of an image encryption algorithm. The algorithm uses the concepts of a modified XOR cipher to encrypt and decrypt the images, with an encryption pad generated using the shared secret key and some initialization vectors. It uses a genetically optimized pseudo-random generator that outputs a stream of random bytes of the specified length. The proposed algorithm is subjected to a number of theoretical, experimental, and mathematical analyses to examine its performance and security against a number of possible attacks, using the following metrics: histogram analysis, correlation analysis, information entropy analysis, NPCR, and UACI. The performance analysis shows an average speedup ratio of 3.489 for encryption and 4.055 for decryption of the GPU parallel implementation over the serial implementation. The algorithm aims to provide better performance benchmarks, which can significantly improve the experience in relevant use cases, such as real-time media applications.
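The pad-based XOR scheme has a simple skeleton, sketched below with SHA-256 in counter mode standing in for the paper's genetically optimized pseudo-random generator (an assumption for illustration; the key/IV derivation details are hypothetical). Because XOR is its own inverse, encryption and decryption are the same operation, and every byte is independent — which is what makes the cipher embarrassingly parallel on a GPU:

```python
import hashlib

def keystream(key: bytes, iv: bytes, n: int) -> bytes:
    """Derive an n-byte pad from key and IV.
    SHA-256 in counter mode is a stand-in for the paper's PRNG."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])

def xor_cipher(data: bytes, key: bytes, iv: bytes) -> bytes:
    """Encrypt or decrypt: XOR each byte with the pad (same op both ways)."""
    pad = keystream(key, iv, len(data))
    return bytes(d ^ p for d, p in zip(data, pad))
```

In a CUDA port, each thread would handle one byte (or word) of the XOR, and pad blocks can be generated independently per counter value, matching the reported encryption/decryption speedups.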