Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mitigated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of magnitude. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes, which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
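As a rough illustration of the tape mechanism described above, the following Python sketch records the partial derivatives of each elementary operation on a tape during the forward pass and backpropagates adjoints over it in reverse; the Tape and Var classes and the grad helper are hypothetical stand-ins for illustration, not the authors' AAD software or its vectorized intrinsics.

```python
class Tape:
    """Records elementary operations so adjoints can be backpropagated."""
    def __init__(self):
        self.nodes = []  # each entry: list of (parent_index, partial_derivative)

    def push(self, parents):
        self.nodes.append(parents)
        return len(self.nodes) - 1

class Var:
    """Scalar variable whose operations are recorded on a shared tape."""
    def __init__(self, value, tape, index=None):
        self.value = value
        self.tape = tape
        self.index = tape.push([]) if index is None else index

    def __mul__(self, other):
        idx = self.tape.push([(self.index, other.value), (other.index, self.value)])
        return Var(self.value * other.value, self.tape, idx)

    def __add__(self, other):
        idx = self.tape.push([(self.index, 1.0), (other.index, 1.0)])
        return Var(self.value + other.value, self.tape, idx)

def grad(output, tape, inputs):
    """Reverse pass: propagate adjoints from the output back along the tape."""
    adj = [0.0] * len(tape.nodes)
    adj[output.index] = 1.0
    for i in reversed(range(len(tape.nodes))):
        for parent, partial in tape.nodes[i]:
            adj[parent] += adj[i] * partial
    return [adj[v.index] for v in inputs]

# Usage: gradient of f(x, y) = x*y + x at (3, 4)
tape = Tape()
x, y = Var(3.0, tape), Var(4.0, tape)
f = x * y + x
print(f.value, grad(f, tape, [x, y]))  # 15.0, [5.0, 3.0]
```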
Graphics Processing Units (GPUs) are increasingly used for general-purpose applications because of their low price, energy efficiency and enormous computing power. Considering the importance of GPU applications, it is vital that the behaviour of GPU programs can be specified and proven correct formally. This paper presents a logic to verify GPU kernels written in OpenCL, a platform-independent low-level programming language. The logic can be used to prove both data-race-freedom and functional correctness of kernels. The verification is modular, based on ideas from permission-based separation logic. We present the logic and its soundness proof, and then discuss tool support and illustrate its use on a complex example kernel. (C) 2014 Elsevier B.V. All rights reserved.
Emerging many-core processors, like CUDA-capable NVIDIA GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory of graphics devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performance. Whenever possible, global memory loads and stores should be coalescent and aligned, but the propagation phase in LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using low-latency on-chip shared memory. Instead of this, our CUDA implementation of LBM follows carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the maximal global memory throughput on NVIDIA's GT200. We show that, as a consequence, highly efficient implementations of LBM on GPUs are possible, even for complex models. (C) 2010 Elsevier Ltd. All rights reserved.
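For readers unfamiliar with the propagation phase mentioned above, the NumPy sketch below shows the data movement it performs on a periodic lattice; the D2Q9 layout and the use of np.roll are illustrative assumptions only and do not reflect the paper's global-memory transfer schemes.

```python
import numpy as np

# D2Q9 lattice velocities (the 3D case in the paper is analogous).
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

def stream(f):
    """Propagation phase: shift each distribution f[q] along its velocity c[q].

    np.roll stands in for the carefully scheduled global-memory transfers
    discussed in the abstract; on a GPU, the shift along the fastest-varying
    axis is what causes the misaligned accesses the authors' schemes avoid.
    """
    return np.stack([np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
                     for q in range(len(c))])

f = np.random.rand(9, 64, 64)   # 9 distributions on a 64x64 periodic lattice
f = stream(f)
```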
The ChainMail algorithm is a physically based deformation algorithm that has been successfully used in virtual surgery simulators, where time is a critical factor. In this paper, we present a parallel algorithm based on ChainMail, and its efficient implementation, which reduces the time required to compute deformations over large medical 3D datasets by means of modern GPU capabilities. We also present a 3D blocking scheme that reduces the number of unnecessary processing threads. For this purpose, this paper describes a new parallel boolean reduction scheme, used to efficiently decide which blocks are computed. Finally, through an extensive analysis, we show the performance improvement achieved by our implementation of the proposed algorithm and by the use of the proposed blocking scheme, due to the high spatial and temporal locality of our approach.
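A minimal sketch of the blocking idea, assuming a cubic block size and a one-block dilation so that displacements can propagate into neighbouring blocks; the active_blocks helper and its parameters are hypothetical and do not reproduce the paper's reduction scheme.

```python
import numpy as np

def active_blocks(moved, block=8):
    """Boolean reduction: a block needs processing only if any voxel in it
    (or in an adjacent block, since ChainMail propagates to neighbours) moved."""
    nz, ny, nx = moved.shape
    b = moved.reshape(nz // block, block, ny // block, block, nx // block, block)
    flags = b.any(axis=(1, 3, 5))            # one flag per block
    # dilate by one block so the displacement can reach neighbouring blocks
    grown = flags.copy()
    for axis in range(3):
        grown |= np.roll(flags, 1, axis) | np.roll(flags, -1, axis)
    return grown

moved = np.zeros((64, 64, 64), dtype=bool)
moved[10, 20, 30] = True                     # a single displaced voxel
print(active_blocks(moved).sum(), "of", (64 // 8) ** 3, "blocks scheduled")
```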
In this paper we present a streaming compression scheme for gigantic point sets including per-point normals. This scheme extends our previous Duodecim approach [21] in two ways. First, we show how to use this approach for the compression and rendering of high-resolution iso-surfaces in volumetric data sets. Second, we use deferred shading of point primitives to considerably improve rendering quality. Iso-surface reconstruction is performed in a hexagonal close packing (HCP) grid, into which the initial data set is resampled. Normals are resampled from the initial domain using volumetric gradients. By incremental encoding, only slightly more than 3 bits per surface point and 5 bits per surface normal are required at high fidelity. The compressed data stream can be decoded on the graphics processing unit (GPU). Decoded point positions are saved in graphics memory, and they are then used on the GPU again to render point primitives. In this way, high-quality gigantic data sets can be rendered directly from their compressed representation in local GPU memory at interactive frame rates (see Fig. 1).
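A rough sketch of the incremental-encoding idea, assuming a plain cubic grid in place of the paper's HCP grid and a naive delta code in place of its entropy coding; the printed number is only indicative of how sorting and delta-encoding occupied cell indices shrinks the per-point cost.

```python
import numpy as np

def delta_bits(points, grid=1024):
    """Quantise points to a grid, sort the linear cell indices and
    delta-encode them; nearby surface points yield small deltas, so each
    occupied cell needs only a few bits before any entropy coding."""
    q = np.clip((points * grid).astype(np.int64), 0, grid - 1)
    lin = np.unique(q[:, 0] * grid * grid + q[:, 1] * grid + q[:, 2])
    deltas = np.diff(lin, prepend=lin[:1])
    bits = np.maximum(np.ceil(np.log2(deltas + 1)), 1)
    return bits.mean()

# thin, surface-like slab of samples as a crude stand-in for an iso-surface
pts = np.random.rand(200_000, 3) * np.array([1.0, 1.0, 0.05])
print(f"{delta_bits(pts):.1f} bits per occupied cell before entropy coding")
```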
The adjoint method is a useful tool for finding gradients of design objectives with respect to system parameters for fluid dynamics simulations. But the utility of this method is hampered by the difficulty in writing an efficient implementation for the adjoint flow solver, especially one that scales to thousands of cores. This paper demonstrates a Python library, called adFVM, that can be used to construct an explicit unsteady flow solver and derive the corresponding discrete adjoint flow solver using automatic differentiation (AD). The library uses a two-level computational graph method for representing the structure of both solvers. The library translates this structure into a sequence of optimized kernels, significantly reducing its execution time and memory footprint. Kernels can be generated for heterogeneous architectures including distributed memory, shared memory and accelerator-based systems. The library is used to write a finite volume based compressible flow solver. A wall-clock time comparison between different flow solvers and adjoint flow solvers built using this library and state-of-the-art graph-based AD libraries is presented on a turbomachinery flow problem. Performance analysis of the flow solvers is carried out for CPUs and GPUs. Results of strong and weak scaling of the flow solver and its adjoint are demonstrated on subsonic flow in a periodic box. (C) 2018 Elsevier B.V. All rights reserved.
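The discrete adjoint that adFVM derives automatically can be written by hand for a toy explicit solver, which may help readers see what the reverse sweep computes; the scalar decay model f = -p*x, the objective 0.5*x_N^2 and the finite-difference check below are assumptions chosen purely for illustration.

```python
def forward(x0, p, dt, nsteps):
    """Explicit update x_{n+1} = x_n + dt * f(x_n, p) with f = -p * x."""
    xs = [x0]
    for _ in range(nsteps):
        xs.append(xs[-1] + dt * (-p * xs[-1]))
    return xs

def adjoint_gradient(xs, p, dt):
    """Discrete adjoint of J = 0.5 * x_N^2: sweep the stored states backwards."""
    lam = xs[-1]                       # dJ/dx_N
    dJdp = 0.0
    for x in reversed(xs[:-1]):
        dJdp += lam * dt * (-x)        # df/dp at this step is -x
        lam = lam * (1.0 + dt * (-p))  # df/dx at this step is -p
    return dJdp

xs = forward(1.0, p=0.7, dt=0.01, nsteps=100)
g = adjoint_gradient(xs, 0.7, 0.01)
eps = 1e-6                             # central finite-difference check
fd = (0.5 * forward(1.0, 0.7 + eps, 0.01, 100)[-1] ** 2
      - 0.5 * forward(1.0, 0.7 - eps, 0.01, 100)[-1] ** 2) / (2 * eps)
print(g, fd)                           # the two values should agree closely
```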
Denial-of-service (DoS) and distributed DoS (DDoS) attacks are among the major threats to cyber-security, and client puzzles, which demand that a client perform computationally expensive operations before being granted services by a server, are a well-known countermeasure against them. However, an attacker can inflate its DoS/DDoS capability with fast puzzle-solving software and/or built-in graphics processing unit (GPU) hardware, significantly weakening the effectiveness of client puzzles. In this paper, we study how to prevent DoS/DDoS attackers from inflating their puzzle-solving capabilities. To this end, we introduce a new client puzzle referred to as a software puzzle. Unlike existing client puzzle schemes, which publish their puzzle algorithms in advance, a puzzle algorithm in the present software puzzle scheme is randomly generated only after a client request is received at the server side, and the algorithm is generated such that: 1) an attacker is unable to prepare an implementation to solve the puzzle in advance, and 2) the attacker needs considerable effort to translate central processing unit (CPU) puzzle software into its functionally equivalent GPU version, such that the translation cannot be done in real time. Moreover, we show how to implement software puzzles in the generic server-browser model.
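For context, a conventional hash-reversal client puzzle, whose fixed and published algorithm is exactly what GPU solvers can accelerate and what the proposed software puzzle avoids, can be sketched as follows; the difficulty parameter and helper names are illustrative.

```python
import hashlib
import itertools
import os

def make_puzzle(difficulty_bits=20):
    """Server side: a fresh nonce; the client must find x such that
    SHA-256(nonce || x) has `difficulty_bits` leading zero bits."""
    return os.urandom(16), difficulty_bits

def solve(nonce, difficulty_bits):
    """Client side: brute-force search costing roughly 2**difficulty_bits hashes."""
    target = 1 << (256 - difficulty_bits)
    for x in itertools.count():
        h = hashlib.sha256(nonce + x.to_bytes(8, "big")).digest()
        if int.from_bytes(h, "big") < target:
            return x

def verify(nonce, difficulty_bits, x):
    """Server side: a single hash verifies the client's work."""
    h = hashlib.sha256(nonce + x.to_bytes(8, "big")).digest()
    return int.from_bytes(h, "big") < (1 << (256 - difficulty_bits))

nonce, k = make_puzzle(16)             # 16 bits keeps the demo fast
x = solve(nonce, k)
print(verify(nonce, k, x))             # True
```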
In this paper, we consider the implementation of a thermal flow solver based on the lattice Boltzmann method (LBM) for graphics processing units (GPUs). We first describe the hybrid thermal LBM model implemented, and give a concise review of the CUDA technology. The specific issues that arise with LBM on GPUs are outlined. We propose an approach for efficient handling of the thermal part. Performance is close to optimum and is significantly better than that of comparable CPU solvers. We validate our code by simulating the differentially heated cubic cavity (DHC). The computed results for steady flow patterns are in good agreement with previously published ones. Finally, we use our solver to study the phenomenology of transitional flows in the DHC. (C) 2011 Elsevier Ltd. All rights reserved.
Finding the shortest path between any two nodes in a graph, known as the All-Pairs Shortest Paths (APSP) problem, is fundamental to many data analysis tasks, such as supply chains in logistics, routing protocols in IoT networks involving consumer electronics, and data analysis for social networking apps and Google Maps used by the general public on their smartphones. In this work, we present a novel approach to solve the APSP problem on multicore and GPU systems. In our approach, a graph is first pre-processed by partitioning it into sub-graphs. Then, each sub-graph is processed in parallel using any existing shortest path algorithm, such as the Floyd-Warshall algorithm or Dijkstra's algorithm. Finally, the distance results of the individual sub-graphs are aggregated to obtain the APSP distances for the entire graph. OpenMP and CUDA are used to implement the parallelization on multicore CPUs and GPUs, respectively. We conduct extensive experiments with both synthetic and real-world graphs on the JADE (Joint Academic Data Science Endeavour) cluster at the University of Oxford, which is part of the Tier-2 high performance computing facilities in the U.K. In the experiments, we compare our methods with three existing APSP algorithms in the literature: n-Dijkstra, ParAPSP and SuperFW. The results show that our methods outperform the existing algorithms, achieving a speedup of up to 8.3x over Dijkstra.
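The Floyd-Warshall algorithm named above as one of the per-sub-graph building blocks can be written compactly with NumPy; this is the textbook dense formulation, not the authors' partition-and-aggregate method.

```python
import numpy as np

def floyd_warshall(adj):
    """All-pairs shortest paths on a dense adjacency matrix (np.inf = no edge)."""
    d = adj.copy()
    n = d.shape[0]
    for k in range(n):
        # relax every pair (i, j) through intermediate vertex k
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

inf = np.inf
adj = np.array([[0,   3,   inf, 7],
                [8,   0,   2,   inf],
                [5,   inf, 0,   1],
                [2,   inf, inf, 0]], dtype=float)
print(floyd_warshall(adj))
```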
The skeletonization of binary images is a common task in many image processing and machine learning applications. Some of these applications require very fast image processing. We propose novel techniques for efficient 2D and 3D thinning of binary images using GPU processors. The algorithms use bit-encoded binary images to process multiple points simultaneously in each thread. The simpleness of a point is determined based on Boolean algebra using only bitwise logical operators. This avoids computationally expensive decoding and encoding steps and allows for additional parallelization. The 2D algorithm is evaluated using a data set of handwritten character images. It required an average computation time of 3.53 ns for 32 × 32 pixels and 0.25 ms for 1024 × 1024 pixels. This is 52-18,380 times faster than a multi-threaded border-parallel algorithm. The 3D algorithm was evaluated based on clinical images of the human vasculature and required computation times of 0.27 ms for 128 × 128 × 128 voxels and 20.32 ms for 512 × 512 × 512 voxels, which is 32-46 times faster than the compared border-sequential algorithm using the same GPU processor. The proposed techniques enable efficient real-time 2D and 3D skeletonization of binary images, which could improve the performance of many existing machine learning applications.
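The bit-encoding idea, in which one bitwise operation updates many pixels at once, can be demonstrated with a simple bit-parallel erosion pass; the erosion rule below is a stand-in for the paper's thinning conditions, which test the simpleness of points rather than plain neighbourhood occupancy.

```python
def erode_rows(rows, width):
    """One bit-parallel erosion pass over an image whose rows are packed into
    integers (bit j of rows[i] = pixel (i, j)); each bitwise operation updates
    up to `width` pixels at once, the idea behind the GPU thinning kernels."""
    mask = (1 << width) - 1
    out = []
    for i, r in enumerate(rows):
        above = rows[i - 1] if i > 0 else 0
        below = rows[i + 1] if i + 1 < len(rows) else 0
        # keep a pixel only if it and its 4-neighbours are all set
        keep = r & (r << 1) & (r >> 1) & above & below
        out.append(keep & mask)
    return out

img = ["0011100",
       "0111110",
       "0111110",
       "0011100"]
rows = [int(s[::-1], 2) for s in img]        # bit j holds column j
for r in erode_rows(rows, 7):
    print(format(r, "07b")[::-1])
```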