Multidimensional aggregation is one of the most important computational building blocks, and hence also a potential performance bottleneck, in online analytical processing (OLAP). In order to deliver fast query responses ...
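As a concrete illustration of the operation this abstract refers to (not code from the paper), the sketch below performs a multidimensional aggregation in plain C++: fact rows are grouped by several dimension attributes and a measure is summed per group. The Row fields, dimension names, and map-based layout are hypothetical.

```cpp
// Minimal sketch: SUM(revenue) GROUP BY year, region, product over toy data.
#include <iostream>
#include <map>
#include <tuple>
#include <vector>

struct Row {
    int year;        // dimension 1
    int region;      // dimension 2
    int product;     // dimension 3
    double revenue;  // measure to aggregate
};

int main() {
    std::vector<Row> facts = {
        {2023, 1, 7, 120.0}, {2023, 1, 7, 80.0}, {2024, 2, 7, 55.5}};

    // Group by the three dimension attributes and sum the measure.
    std::map<std::tuple<int, int, int>, double> cube;
    for (const Row& r : facts)
        cube[{r.year, r.region, r.product}] += r.revenue;

    for (const auto& [key, sum] : cube)
        std::cout << std::get<0>(key) << ' ' << std::get<1>(key) << ' '
                  << std::get<2>(key) << " -> " << sum << '\n';
}
```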
ISBN (print): 9781450375313
Image processing has high computational requirements. As many papers (including [Frisk 2010; N. Zhang and Wang 2010; Yang et al. 2008]) have indicated, a number of image processing operations can be optimized by heavy parallelization of the computation. Currently, one of the best options for parallelized image processing is general-purpose computation on graphics processing units (GPGPU). In the area of GPGPU, a rather wide range of APIs is available, and finding an appropriate choice for a project is challenging. To provide guidance on selecting the right API in an image processing context, four GPU programming models were compared according to their platform independence, usability, and performance. The four investigated APIs are CUDA, OpenCL 1.2, Vulkan, and SYCL. To gather information on the usability and performance metrics, a test project was created that implements a number of image processing tasks using the four GPU programming models. The test project was designed as a library in which the four investigated APIs can be swapped at compile time and new operations can be easily added. The implemented reference algorithms form a pipeline for processing images from polarization cameras. The targeted platforms were mid- to high-end desktop PCs as well as embedded single-board platforms ranging from Odroid to Nvidia Jetson. On all devices, either MSVC 19.x or GCC 7+ was used as the host compiler. CUDA tests were compiled with NVCC, and SYCL was represented by ComputeCpp and hipSYCL depending on the device. The direct comparison of the APIs carried out for this work culminates in a decision matrix to aid organizations in selecting an offloading API. It shows, for example, that CUDA and SYCL are comparable in terms of development cost, support for modern C++, and ease of integration into existing code bases. When considering modern standard adoption and GPU portability, Vulkan has advantages, while OpenCL and SYCL can run on CPUs and ...
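The abstract describes a library in which the offloading API can be swapped at compile time. The sketch below shows one common way such a design can be expressed in C++; it is an assumption for illustration, not the paper's actual code. The macro names IMGLIB_BACKEND_CUDA / IMGLIB_BACKEND_OPENCL and the function to_grayscale are hypothetical, and the CPU fallback keeps the sketch self-contained.

```cpp
// Minimal sketch of compile-time backend selection for an image library.
#include <cstddef>
#include <vector>

namespace imglib {

// Common operation signature shared by all backends: interleaved RGB bytes
// to a single-channel grayscale image.
using Image = std::vector<unsigned char>;

#if defined(IMGLIB_BACKEND_CUDA)
Image to_grayscale(const Image& rgb, std::size_t w, std::size_t h);  // CUDA kernel launch (declaration only in this sketch)
#elif defined(IMGLIB_BACKEND_OPENCL)
Image to_grayscale(const Image& rgb, std::size_t w, std::size_t h);  // OpenCL queue + kernel (declaration only in this sketch)
#else
// CPU fallback so the sketch compiles and runs anywhere.
inline Image to_grayscale(const Image& rgb, std::size_t w, std::size_t h) {
    Image gray(w * h);
    for (std::size_t i = 0; i < w * h; ++i) {
        unsigned r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
        gray[i] = static_cast<unsigned char>((r * 299u + g * 587u + b * 114u) / 1000u);
    }
    return gray;
}
#endif

}  // namespace imglib

int main() {
    imglib::Image rgb(3 * 4 * 4, 128);  // 4x4 dummy image, all channels = 128
    imglib::Image gray = imglib::to_grayscale(rgb, 4, 4);
    return gray[0] == 128 ? 0 : 1;
}
```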
ISBN (print): 9781450320177
As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key to the performance of these applications. With the recent shift to using GPUs for general-purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system. In this paper we present a high-performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produces a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, respectively, than sorting on one GPU. Copyright 2013 ACM.
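To make the pivot-selection idea concrete, the following host-side C++ sketch (an illustration under stated assumptions, not the paper's CUDA implementation or inter-GPU communication) splits two equally sized, already sorted partitions so that each side keeps exactly half of the combined data and every element on the left is less than or equal to every element on the right. The two halves can then be merged independently, which is what keeps data evenly distributed across GPUs after each merge step.

```cpp
// Minimal sketch of an even-split pivot search over two sorted runs.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <vector>

// Returns i such that a[0..i) and b[0..n-i) together form the lower half.
std::size_t select_pivot(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t n = a.size();            // assume a.size() == b.size()
    std::size_t lo = 0, hi = n;
    while (true) {
        std::size_t i = (lo + hi) / 2, j = n - i;
        if (i > 0 && j < n && a[i - 1] > b[j])      hi = i - 1;  // took too many from a
        else if (j > 0 && i < n && b[j - 1] > a[i]) lo = i + 1;  // took too few from a
        else return i;                                           // valid even split found
    }
}

int main() {
    std::vector<int> g0 = {1, 4, 9, 12}, g1 = {2, 3, 10, 11};  // per-GPU sorted runs
    std::size_t i = select_pivot(g0, g1), j = g0.size() - i;

    std::vector<int> low, high;  // what GPU 0 / GPU 1 would hold after the merge step
    std::merge(g0.begin(), g0.begin() + i, g1.begin(), g1.begin() + j, std::back_inserter(low));
    std::merge(g0.begin() + i, g0.end(), g1.begin() + j, g1.end(), std::back_inserter(high));

    for (int x : low)  std::cout << x << ' ';
    std::cout << "| ";
    for (int x : high) std::cout << x << ' ';
    std::cout << '\n';   // prints: 1 2 3 4 | 9 10 11 12
}
```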
Graphics processing units (GPUs) have been adopted by major cloud vendors, as GPUs provide orders-of-magnitude speedups for computation-intensive data-parallel applications. In the cloud, efficiently sharing GPU resour...
Correlation Power Analysis (CPA) is a type of power-analysis-based side-channel attack that can be used to derive the secret key of encryption algorithms including DES (Data Encryption Standard) and AES (Advanced Encryption Standard). A typical CPA attack on unprotected AES is performed by analysing a few thousand power traces, which requires about an hour of computational time on a general-purpose CPU. Due to the severity of this situation, a large number of researchers work on countermeasures to such attacks. Verifying that a proposed countermeasure works well requires performing the CPA attack on about 1.5 million power traces. Such processing, even for a single verification attempt on commodity hardware, would run for several days, making the verification process infeasible. Modern graphics processing units (GPUs) support thousands of lightweight threads, making them ideal for parallelizable algorithms like CPA. While a GPU costs less than a high-performance multicore server, its performance on this algorithm is many times better than that of the multicore server. We present an algorithm and its implementation on a GPU for CPA on 128-bit AES that executes 1300x faster than a single-threaded CPU and more than 60x faster than a 32-threaded multicore server. We show that an attack that would take hours on the multicore server takes less than a minute on a much more cost-effective GPU.
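The computational core being parallelized in such work is a Pearson correlation between a hypothetical leakage value per trace and the measured power samples, evaluated for every key-byte guess. The C++ sketch below illustrates that core on toy data; it is not the paper's GPU implementation. The leakage model is simplified here to the Hamming weight of plaintext XOR guess, whereas a full CPA attack on AES would model the S-box output and scan every time sample of millions of traces.

```cpp
// Minimal CPA core sketch: correlate a simplified leakage model against
// one power sample per trace for all 256 key-byte guesses (C++20 for popcount).
#include <bit>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Pearson correlation coefficient between two equally sized samples.
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    double n = static_cast<double>(x.size());
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double num = n * sxy - sx * sy;
    double den = std::sqrt((n * sxx - sx * sx) * (n * syy - sy * sy));
    return den == 0 ? 0 : num / den;
}

int main() {
    // Toy data: plaintext bytes and one power sample per trace (a real attack
    // uses thousands to millions of traces, each with many time samples).
    std::vector<std::uint8_t> pt = {0x13, 0xa7, 0x5c, 0xe2, 0x01, 0x9f, 0x44, 0xb8};
    std::vector<double> power   = {2, 6, 4, 5, 2, 7, 3, 5};

    int best_guess = 0; double best_r = -1;
    for (int guess = 0; guess < 256; ++guess) {
        std::vector<double> hyp;
        for (std::uint8_t p : pt)  // simplified leakage model: HW(pt ^ guess)
            hyp.push_back(std::popcount(static_cast<unsigned>(p ^ guess)));
        double r = std::abs(pearson(hyp, power));
        if (r > best_r) { best_r = r; best_guess = guess; }
    }
    std::cout << "best key-byte guess: 0x" << std::hex << best_guess
              << " (|r| = " << best_r << ")\n";
}
```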