This paper proposes a real-time video encryption strategy based on a multi-round confusion-diffusion architecture and heterogeneous parallel computing. It leverages the powerful computing capacity of the Central Processing Unit (CPU) and the high parallel capability of the Graphics Processing Unit (GPU) to perform byte generation, confusion, and diffusion operations concurrently, thereby enhancing computational efficiency. Statistical and security analyses demonstrate that the proposed method exhibits exceptional statistical properties and provides resistance against different types of attacks. Encryption speed evaluation shows that it achieves latency-free 768x768 30 FPS video encryption on an Intel Xeon Gold 6226R and an NVIDIA GeForce RTX 3090, with an average encryption time of 25.12 ms, despite performing seven rounds of confusion and six rounds of diffusion on each frame. Additionally, the proposed strategy is adopted to implement a drone-oriented secure video communication system, achieving latency-free 256x256 29 FPS video encryption on an NVIDIA Jetson Xavier NX (NVIDIA Carmel ARM CPU and Volta GPU).
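The abstract gives no implementation details, so the following is only a minimal CUDA sketch of the division of labor it describes: keystream bytes generated on the host while a device kernel applies a round to the frame. The single XOR round, key schedule, and frame layout are placeholder assumptions, not the authors' cipher.

```cuda
// Minimal sketch: host-side byte generation plus one device-side diffusion pass
// on a single video frame. The mt19937 keystream and the single XOR round are
// placeholders for the paper's multi-round confusion-diffusion cipher.
#include <cuda_runtime.h>
#include <cstdint>
#include <random>
#include <vector>

__global__ void diffuse_xor(uint8_t* frame, const uint8_t* keystream, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) frame[i] ^= keystream[i];                  // one toy diffusion round
}

int main() {
    const size_t n = 768 * 768 * 3;                       // one RGB frame (assumed layout)
    std::vector<uint8_t> frame(n, 0x42), keystream(n);

    std::mt19937 rng(12345);                              // host-side "byte generation"
    for (auto& b : keystream) b = static_cast<uint8_t>(rng());

    uint8_t *d_frame, *d_keys;
    cudaMalloc((void**)&d_frame, n);
    cudaMalloc((void**)&d_keys, n);
    cudaMemcpy(d_frame, frame.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_keys, keystream.data(), n, cudaMemcpyHostToDevice);

    diffuse_xor<<<(unsigned)((n + 255) / 256), 256>>>(d_frame, d_keys, n);
    cudaMemcpy(frame.data(), d_frame, n, cudaMemcpyDeviceToHost);

    cudaFree(d_frame);
    cudaFree(d_keys);
    return 0;
}
```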
Parallel computing techniques have been introduced into digital image correlation (DIC) in recent years, leading to a surge in computation speed. Graphics processing unit (GPU)-based parallel computing has shown a striking effect in accelerating iterative subpixel DIC compared with CPU-based parallel computing. In this paper, the performance of the two kinds of parallel computing techniques is compared for the previously proposed path-independent DIC method, in which the initial guess for the inverse compositional Gauss-Newton (IC-GN) algorithm at each point of interest (POI) is estimated through the fast Fourier transform-based cross-correlation (FFT-CC) algorithm. Based on the performance evaluation, a heterogeneous parallel computing (HPC) model with a hybrid mode of parallelism is proposed to combine the computing power of the GPU and the multicore CPU. A trial-computation scheme is developed to optimize the configuration of the HPC model on a specific computer. The proposed HPC model shows excellent performance on a mid-range desktop computer for real-time subpixel DIC at a high resolution of more than 10,000 POIs per frame.
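As a rough illustration of the trial-computation idea used to configure the HPC model — time a small batch of POIs on each device, then split the remaining POIs in proportion to the measured throughput — a host-side sketch follows. process_pois_cpu and process_pois_gpu are hypothetical stand-ins for the FFT-CC + IC-GN pipeline, not the paper's code.

```cuda
// Host-side sketch of a trial-computation split between CPU and GPU paths.
#include <chrono>
#include <cstdio>

// Hypothetical placeholders: a real implementation would run the multicore
// IC-GN loop on the CPU and the FFT-CC + IC-GN CUDA kernels on the GPU.
static double sink = 0.0;
void process_pois_cpu(int first, int count) { for (int i = 0; i < count * 4000; ++i) sink += first + i; }
void process_pois_gpu(int first, int count) { for (int i = 0; i < count * 1000; ++i) sink += first + i; }

static double time_batch(void (*f)(int, int), int first, int count) {
    auto t0 = std::chrono::steady_clock::now();
    f(first, count);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

void process_frame(int total_pois) {
    const int trial = 256;                                   // small trial batch per device
    double t_cpu = time_batch(process_pois_cpu, 0, trial);
    double t_gpu = time_batch(process_pois_gpu, trial, trial);

    // Split the rest so both devices finish at roughly the same time:
    // GPU share = (1/t_gpu) / (1/t_cpu + 1/t_gpu) = t_cpu / (t_cpu + t_gpu).
    int remaining = total_pois - 2 * trial;
    int gpu_count = static_cast<int>(remaining * t_cpu / (t_cpu + t_gpu));

    process_pois_gpu(2 * trial, gpu_count);                  // could run asynchronously
    process_pois_cpu(2 * trial + gpu_count, remaining - gpu_count);
    std::printf("GPU handled %d of %d POIs\n", gpu_count + trial, total_pois);
}

int main() { process_frame(10000); return 0; }
```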
Hydrological model calibration has been a hot topic for decades. The shuffled complex evolution method developed at the University of Arizona (SCE-UA) has proved to be an effective and robust optimization approach. However, its computational efficiency deteriorates significantly as the amount of hydrometeorological data increases. In recent years, the rise of heterogeneous parallel computing has brought hope for the acceleration of hydrological model calibration. This study proposed a parallel SCE-UA method and applied it to the calibration of a watershed rainfall-runoff model, the Xinanjiang model. The parallel method was implemented on heterogeneous computing systems using OpenMP and CUDA. Performance testing and sensitivity analysis were carried out to verify its correctness and efficiency. Comparison results indicated that the heterogeneous parallel computing-accelerated SCE-UA converged much more quickly than the original serial version and possessed satisfactory accuracy and stability for the task of fast hydrological model calibration.
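The calibration code itself is not given in the abstract; as a hedged sketch of the part that parallelizes most naturally with OpenMP — scoring every parameter set of the SCE-UA population concurrently — one might write the following, where xinanjiang_rmse is a hypothetical placeholder for running the Xinanjiang model against observed discharge.

```cuda
// Sketch of the embarrassingly parallel part of SCE-UA calibration: evaluating
// the objective function of many parameter sets at once with OpenMP. The real
// method described in the abstract also offloads work to the GPU with CUDA.
#include <omp.h>
#include <cmath>
#include <vector>

double xinanjiang_rmse(const std::vector<double>& params) {
    // Placeholder objective: distance from an arbitrary point in parameter space.
    double s = 0.0;
    for (double p : params) s += (p - 0.5) * (p - 0.5);
    return std::sqrt(s);
}

void evaluate_population(const std::vector<std::vector<double>>& population,
                         std::vector<double>& fitness) {
    fitness.resize(population.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(population.size()); ++i)
        fitness[i] = xinanjiang_rmse(population[i]);
}

int main() {
    // Toy population: 1000 candidate sets of 15 parameters each.
    std::vector<std::vector<double>> population(1000, std::vector<double>(15, 0.3));
    std::vector<double> fitness;
    evaluate_population(population, fitness);
    return 0;
}
```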
ISBN (print): 9783319747187; 9783319747170
Many years have passed since the invention of the first microprocessor. Technological development in CPU construction has primarily been driven by increasing device performance, miniaturisation, and the reduction of manufacturing costs. The well-known Moore's Law, which speaks of doubling the number of transistors on a chip at regular intervals (going hand in hand with a reduction of manufacturing costs), has proved to work well over the years (the initially assumed period of eighteen months has been slightly extended to two years). Due to technological constraints, such a trend cannot be everlasting; a slowdown can already be observed. Limitations on the minimum size of the individual components (transistors) and on the total power draw of a system have forced a change in the direction of technological development. Instead of boosting the clock of a processor, it was decided to multiply the number of cores in a chip. Thanks to the clustering of processor cores in a single chip that utilise a fast shared cache memory, a considerable performance boost can still be observed.
Atmospheric aerosol particles have a significant impact on radiation, climate, and human health, with their size and shape being fundamental physical parameters for atmospheric change research. Due to the widespread effects and applications of aerosol particles, the direct measurement of aerosol size and shape has become crucial. Nevertheless, several challenges persist in aerosol measurement instruments, including limited resolution, complex operation, poor synchronization, and inaccurate inversion methods. Therefore, we developed a new scientific instrument and a corresponding intelligent image-interaction system, named the fast atmospheric aerosol size and shape imaging instrument (FASI). The instrument is designed for transmission imaging and contains a light source, imaging chamber, microscope objective, tube lens, extension tube, camera, etc. Before operation, the FASI calibrates the background field, pixel size, characteristic gray value (CGV), and depth of field (DOF) based on image processing. During intelligent interaction, the FASI extracts aerosol particles by image denoising and edge detection, and then uses our proposed defocus and duplicate particle detection algorithms for secondary screening of aerosols. Aerosol size and shape parameters are measured in parallel by the central processing unit (CPU) and the graphics processing unit (GPU) using heterogeneous computation. Polystyrene latex (PSL) calculations and quantitative experiments indicate that the FASI can accurately detect 0.5–20 μm aerosol particles. In particular, the FASI measures aerosol particles supplied by an aerosol generator, dryer, and neutralizer, demonstrating that the aerosol size distribution range of oil solutions (0.5–3.5 μm) is narrower than that of aqueous solutions (0.5–7.5 μm). For all samples, 92.12% of aerosols have an aspect ratio (AR) exceeding 1, and the shapes of these nonspherical aerosols vary greatly from one another. The evaluations of computational efficiency ...
Urban wind flow simulation, based on numerical methods, serves as a powerful tool for understanding the intricate interactions between urban structures and atmospheric conditions. The Lattice Boltzmann method (LBM) stands out as a popular choice for simulating urban wind flow. However, traditional LBM approaches face limitations in terms of scalability on large parallel computing systems and their ability to support high-resolution wind flow simulations across vast megacities spanning hundreds of square kilometers. In response to these challenges, we introduce THLB (Tianhe lattice Boltzmann), a purpose-built LBM simulator tailored for large-scale urban wind flow simulations. THLB streamlines the preprocessing of extensive simulation data through an innovative scheme that automatically identifies flow regions along irregular boundaries. Additionally, THLB integrates a novel processing pipeline and employs parallel optimization techniques, enhancing scalability and performance for large-scale LBM simulations. Our assessment of THLB involves conducting wind flow simulations within a megacity covering an area of 50 km × 40 km at an impressive one-meter simulation resolution, featuring 150,000 buildings. This simulation represents the most extensive urban wind flow analysis to date, comprising over two trillion simulation lattices. We gauge THLB's performance on the Tianhe new-generation supercomputer, harnessing more than 155 million heterogeneous cores. Our experimental results demonstrate exceptional performance and scalability, achieving a peak computation throughput of 24,553.43 Giga Lattice Updates Per Second (GLUPS), setting a new state-of-the-art benchmark for LBM simulations. Despite the inherent challenges of large-scale LBM simulations, our approach showcases robust scalability, delivering 90.48% and 69.91% weak and strong scaling efficiency, respectively.
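THLB's source is not part of this abstract; purely to illustrate the per-lattice work an LBM simulator parallelizes, a standard D2Q9 BGK collision kernel in CUDA is sketched below. Streaming, boundary handling, and THLB's irregular-boundary preprocessing are all omitted.

```cuda
// Standard D2Q9 BGK collision step, one thread per lattice node.
#include <cuda_runtime.h>
#include <vector>

__constant__ float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                            1.f/36, 1.f/36, 1.f/36, 1.f/36};
__constant__ int   cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int   cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

// f is stored as 9 contiguous planes of n cells each (structure-of-arrays).
__global__ void bgk_collide(float* f, int n, float tau) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float fi[9], rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < 9; ++i) {
        fi[i] = f[i * n + idx];
        rho += fi[i];
        ux  += fi[i] * cx[i];
        uy  += fi[i] * cy[i];
    }
    ux /= rho;  uy /= rho;

    float usq = ux * ux + uy * uy;
    for (int i = 0; i < 9; ++i) {
        float cu  = cx[i] * ux + cy[i] * uy;
        float feq = w[i] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * usq);
        f[i * n + idx] = fi[i] - (fi[i] - feq) / tau;      // BGK relaxation
    }
}

int main() {
    const int n = 256 * 256;                               // toy domain size
    float* f;
    cudaMalloc((void**)&f, 9 * n * sizeof(float));
    // Initialize to the rest-state equilibrium (rho = 1, u = 0).
    float w_h[9] = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/36, 1.f/36, 1.f/36, 1.f/36};
    std::vector<float> init(9 * n);
    for (int i = 0; i < 9; ++i)
        for (int j = 0; j < n; ++j) init[i * n + j] = w_h[i];
    cudaMemcpy(f, init.data(), 9 * n * sizeof(float), cudaMemcpyHostToDevice);
    bgk_collide<<<(n + 255) / 256, 256>>>(f, n, 0.6f);
    cudaDeviceSynchronize();
    cudaFree(f);
    return 0;
}
```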
In this work, a novel GPU-accelerated heterogeneous method for the automated multilevel substructuring method (HAMLS) is presented for handling large finite element models in structural dynamics. Different parallel modes based on nodes, subtrees, and eigenpairs are developed in the solution steps of AMLS to achieve a heterogeneous strategy. First, a new data management method is designed for the model transformation phase to eliminate the determinacy race in the parallel strategy over the separator tree. Considering the distribution characteristics of the nodes in the separator tree and the dependencies of node tasks, a load-balancing heterogeneous parallel strategy is designed to take full advantage of hosts and devices. By developing an adaptive batch-processing scheme for solving eigenvectors during the back transformation phase, the overhead of launching kernels, as well as the GPU memory requirements, can be reduced by several orders of magnitude. Several numerical examples are employed to validate the efficiency and practicality of the novel GPU-accelerated heterogeneous strategy. The results demonstrate that the computational efficiency of the novel strategy using one GPU can increase to 3.0x that of the original parallel AMLS method when 16 CPU threads are used.
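The adaptive batch-processing scheme is specific to the authors' code; as a generic, hedged sketch of the underlying idea — fusing many small, independent back-transformation products into a single GPU call instead of launching one kernel each — cuBLAS's strided batched GEMM can be used. All dimensions and data below are placeholders.

```cuda
// Sketch: batching many small, independent matrix products (e.g. per-substructure
// back transformations) into one strided-batched cuBLAS call.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int m = 64, k = 64, n = 16, batch = 512;         // assumed per-substructure sizes
    size_t szA = (size_t)m * k, szB = (size_t)k * n, szC = (size_t)m * n;

    std::vector<float> hA(szA * batch, 0.01f), hB(szB * batch, 0.02f);
    float *A, *B, *C;
    cudaMalloc((void**)&A, szA * batch * sizeof(float));
    cudaMalloc((void**)&B, szB * batch * sizeof(float));
    cudaMalloc((void**)&C, szC * batch * sizeof(float));
    cudaMemcpy(A, hA.data(), szA * batch * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(B, hB.data(), szB * batch * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.f, beta = 0.f;
    // C_i = A_i * B_i for all i in one call; the strides select each matrix in the batch.
    cublasSgemmStridedBatched(h, CUBLAS_OP_N, CUBLAS_OP_N,
                              m, n, k, &alpha,
                              A, m, szA,
                              B, k, szB, &beta,
                              C, m, szC, batch);
    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```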
Many fields of scientific simulation, such as chemistry and condensed matter physics, are increasingly eschewing dense tensor contraction in favor of sparse tensor contraction. In this work, we focus on binary sparse tensor contraction (SpTC), which poses the challenges of index matching and accumulation. To address these difficulties, we present GSpTC, an efficient element-wise SpTC framework for CPU-GPU heterogeneous systems. GSpTC first introduces a fine-grained partitioning strategy based on element-wise tensor contraction. By analyzing and selecting appropriate dimension partitioning strategies, we can efficiently utilize the multi-threading parallelism on GPUs and optimize the overall performance of GSpTC. In particular, GSpTC leverages multi-threading parallelism on GPUs for the contraction and merging phases, which greatly accelerates the computation in sparse tensor contractions. Furthermore, GSpTC employs a parallel pipeline to hide the data transfer time between the host and the device, further enhancing its performance. As a result, GSpTC achieves an average performance improvement of 267% over the previous state-of-the-art framework Sparta.
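GSpTC's contraction and merging kernels depend on its sparse data structures; as a generic sketch of the pipeline idea the abstract mentions — overlapping host-to-device transfers of one chunk with computation on another — a double-buffered CUDA-streams loop looks roughly like this. The kernel is a trivial placeholder.

```cuda
// Generic double-buffered pipeline: while one stream computes on a chunk, the
// other copies the next chunk to the device, hiding transfer time.
#include <cuda_runtime.h>

__global__ void process_chunk(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;                  // placeholder work
}

int main() {
    const int chunk = 1 << 20, chunks = 8;
    float* h;                                              // pinned memory enables async copies
    cudaMallocHost((void**)&h, (size_t)chunk * chunks * sizeof(float));
    for (int i = 0; i < chunk * chunks; ++i) h[i] = 1.0f;

    float* d[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {
        cudaMalloc((void**)&d[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                                     // alternate buffers/streams
        cudaMemcpyAsync(d[b], h + (size_t)c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process_chunk<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk);
        cudaMemcpyAsync(h + (size_t)c * chunk, d[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h);
    return 0;
}
```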
We analyze the performance portability of the skeleton-based, single-source, multi-backend high-level programming framework SkePU across multiple CPU-GPU heterogeneous systems. Thereby, we provide a systematic characterization of the application efficiency of SkePU-generated code in comparison to equivalent hand-written code in lower-level parallel programming models such as OpenMP and CUDA. For this purpose, we contribute ports of the STREAM benchmark suite and of part of the NAS Parallel Benchmark suite to SkePU. We show that for STREAM and the EP benchmark, SkePU regularly scores efficiency values above 80%, and that, in particular on CPU systems, SkePU can outperform hand-written code.
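For context on what a hand-written baseline looks like, a CUDA version of STREAM's triad kernel (a = b + scalar * c) is only a few lines; this is a generic sketch, not the actual code ported to SkePU in the paper.

```cuda
// Hand-written CUDA version of the STREAM triad kernel, the kind of low-level
// baseline that skeleton-generated code is typically measured against.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void triad(double* a, const double* b, const double* c,
                      double scalar, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + scalar * c[i];
}

int main() {
    const size_t n = 1 << 24;
    double *a, *b, *c;
    cudaMalloc((void**)&a, n * sizeof(double));
    cudaMalloc((void**)&b, n * sizeof(double));
    cudaMalloc((void**)&c, n * sizeof(double));
    // (initialization of b and c omitted for brevity)
    triad<<<(unsigned)((n + 255) / 256), 256>>>(a, b, c, 3.0, n);
    cudaDeviceSynchronize();
    std::printf("triad done on %zu elements\n", n);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```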
The improvement of computational effectiveness is a vital issue in the field of large-scale finite element analysis. Performance is fundamentally determined by the efficiency of solving the sparse linear systems of equations arising in the implicit finite element method. This paper presents a direct linear solver based on heterogeneous hybrid parallel computing on CPUs and GPUs, which can efficiently utilize the computing resources of multiple devices to achieve a performance improvement. Initially, we partition the elimination tree into several subtrees to accomplish the task decomposition. Based on this, we build a dynamic programming model to balance the computational load across the various devices. Then, we develop a numerical factorization strategy for the CPUs that combines node parallelism and tree parallelism. In addition, efficient numerical factorization is achieved on the GPU through batch processing and by maximizing the overlap between computation and data transfers. Numerical experiments show that, compared with MKL PARDISO, the performance of numerical factorization can be improved by up to 10 times using CPU and dual-path GPU hybrid computation, and the simulation time can be reduced by one-third for the multi-condition analysis of a body-in-white model and by 20% for large-scale nonlinear finite element deformation analysis.
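The paper balances load with a dynamic programming model over the elimination tree; as a toy host-side illustration of the general idea — assigning subtrees to devices so that estimated factorization costs end up balanced — a greedy sketch with made-up cost numbers follows (the greedy rule is a stand-in, not the authors' method).

```cuda
// Toy load-balancing illustration: greedily assign elimination-tree subtrees
// (represented only by cost estimates) to the device with the least work so far.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical per-subtree factorization cost estimates (e.g. flop counts).
    std::vector<double> subtree_cost = {9.1, 7.4, 6.3, 3.2, 2.8, 1.5, 1.1, 0.6};
    std::sort(subtree_cost.rbegin(), subtree_cost.rend());  // largest first

    const int devices = 3;                                   // e.g. CPU plus two GPU paths
    std::vector<double> load(devices, 0.0);
    std::vector<std::vector<int>> plan(devices);

    for (int i = 0; i < (int)subtree_cost.size(); ++i) {
        int dev = (int)(std::min_element(load.begin(), load.end()) - load.begin());
        load[dev] += subtree_cost[i];
        plan[dev].push_back(i);
    }
    for (int d = 0; d < devices; ++d)
        std::printf("device %d: %zu subtrees, load %.1f\n", d, plan[d].size(), load[d]);
    return 0;
}
```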