Many scientific problems such as classifier training or medical image reconstruction can be expressed as minimization of differentiable real-valued cost functions and solved with iterative gradient-based methods. Adjoint algorithmic differentiation (AAD) enables automated computation of gradients of such cost functions implemented as computer programs. To backpropagate adjoint derivatives, excessive memory is potentially required to store the intermediate partial derivatives on a dedicated data structure, referred to as the "tape". Parallelization is difficult because threads need to synchronize their accesses during taping and backpropagation. This situation is aggravated for many-core architectures, such as Graphics Processing Units (GPUs), because of the large number of light-weight threads and the limited memory size in general as well as per thread. We show how these limitations can be mitigated if the cost function is expressed using GPU-accelerated vector and matrix operations which are recognized as intrinsic functions by our AAD software. We compare this approach with naive and vectorized implementations for CPUs. We use four increasingly complex cost functions to evaluate the performance with respect to memory consumption and gradient computation times. Using vectorization, CPU and GPU memory consumption could be substantially reduced compared to the naive reference implementation, in some cases even by an order of magnitude. The vectorization allowed usage of optimized parallel libraries during forward and reverse passes, which resulted in high speedups for the vectorized CPU version compared to the naive reference implementation. The GPU version achieved an additional speedup of 7.5 ± 4.4, showing that the processing power of GPUs can be utilized for AAD using this concept. Furthermore, we show how this software can be systematically extended for more complex problems such as nonlinear absorption reconstruction for fluorescence-mediated tomography.
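As a rough illustration of the tape mechanism described above, the following Python sketch records the partial derivatives of each elementary operation on a tape during the forward pass and backpropagates adjoints over it in reverse; the Tape and Var classes and the grad helper are hypothetical stand-ins for illustration, not the authors' AAD software or its vectorized intrinsics.

```python
class Tape:
    """Records elementary operations so adjoints can be backpropagated."""
    def __init__(self):
        self.nodes = []  # each entry: list of (parent_index, partial_derivative)

    def push(self, parents):
        self.nodes.append(parents)
        return len(self.nodes) - 1

class Var:
    """Scalar variable whose operations are recorded on a shared tape."""
    def __init__(self, value, tape, index=None):
        self.value = value
        self.tape = tape
        self.index = tape.push([]) if index is None else index

    def __mul__(self, other):
        idx = self.tape.push([(self.index, other.value), (other.index, self.value)])
        return Var(self.value * other.value, self.tape, idx)

    def __add__(self, other):
        idx = self.tape.push([(self.index, 1.0), (other.index, 1.0)])
        return Var(self.value + other.value, self.tape, idx)

def grad(output, tape, inputs):
    """Reverse pass: propagate adjoints from the output back along the tape."""
    adj = [0.0] * len(tape.nodes)
    adj[output.index] = 1.0
    for i in reversed(range(len(tape.nodes))):
        for parent, partial in tape.nodes[i]:
            adj[parent] += adj[i] * partial
    return [adj[v.index] for v in inputs]

# Usage: gradient of f(x, y) = x*y + x at (3, 4)
tape = Tape()
x, y = Var(3.0, tape), Var(4.0, tape)
f = x * y + x
print(f.value, grad(f, tape, [x, y]))  # 15.0, [5.0, 3.0]
```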
Graphics Processing Units (GPUs) are increasingly used for general-purpose applications because of their low price, energy efficiency and enormous computing power. Considering the importance of GPU applications, it is vital that the behaviour of GPU programs can be specified and proven correct formally. This paper presents a logic to verify GPU kernels written in OpenCL, a platform-independent low-level programming language. The logic can be used to prove both data-race-freedom and functional correctness of kernels. The verification is modular, based on ideas from permission-based separation logic. We present the logic and its soundness proof, and then discuss tool support and illustrate its use on a complex example kernel. (C) 2014 Elsevier B.V. All rights reserved.
Emerging many-core processors, like CUDA-capable NVIDIA GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory of graphics devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performance. Whenever possible, global memory loads and stores should be coalescent and aligned, but the propagation phase in LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using low-latency on-chip shared memory. Instead of this, our CUDA implementation of LBM follows carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the maximal global memory throughput on NVIDIA's GT200. We show that, as a consequence, highly efficient implementations of LBM on GPUs are possible, even for complex models. (C) 2010 Elsevier Ltd. All rights reserved.
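For readers unfamiliar with the propagation phase mentioned above, the NumPy sketch below shows the data movement it performs on a periodic lattice; the D2Q9 layout and the use of np.roll are illustrative assumptions only and do not reflect the paper's global-memory transfer schemes.

```python
import numpy as np

# D2Q9 lattice velocities (the 3D case in the paper is analogous).
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])

def stream(f):
    """Propagation phase: shift each distribution f[q] along its velocity c[q].

    np.roll stands in for the carefully scheduled global-memory transfers
    discussed in the abstract; on a GPU, the shift along the fastest-varying
    axis is what causes the misaligned accesses the authors' schemes avoid.
    """
    return np.stack([np.roll(f[q], shift=tuple(c[q]), axis=(0, 1))
                     for q in range(len(c))])

f = np.random.rand(9, 64, 64)   # 9 distributions on a 64x64 periodic lattice
f = stream(f)
```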
The ChainMail algorithm is a physically based deformation algorithm that has been successfully used in virtual surgery simulators, where time is a critical factor. In this paper, we present a parallel algorithm based on ChainMail, and its efficient implementation, which reduces the time required to compute deformations over large medical 3D datasets by means of modern GPU capabilities. We also present a 3D blocking scheme that reduces the number of unnecessary processing threads. For this purpose, this paper describes a new parallel boolean reduction scheme, used to efficiently decide which blocks are computed. Finally, through an extensive analysis, we show the performance improvement achieved by our implementation of the proposed algorithm and by the use of the proposed blocking scheme, due to the high spatial and temporal locality of our approach.
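A minimal sketch of the blocking idea, assuming a cubic block size and a one-block dilation so that displacements can propagate into neighbouring blocks; the active_blocks helper and its parameters are hypothetical and do not reproduce the paper's reduction scheme.

```python
import numpy as np

def active_blocks(moved, block=8):
    """Boolean reduction: a block needs processing only if any voxel in it
    (or in an adjacent block, since ChainMail propagates to neighbours) moved."""
    nz, ny, nx = moved.shape
    b = moved.reshape(nz // block, block, ny // block, block, nx // block, block)
    flags = b.any(axis=(1, 3, 5))            # one flag per block
    # dilate by one block so the displacement can reach neighbouring blocks
    grown = flags.copy()
    for axis in range(3):
        grown |= np.roll(flags, 1, axis) | np.roll(flags, -1, axis)
    return grown

moved = np.zeros((64, 64, 64), dtype=bool)
moved[10, 20, 30] = True                     # a single displaced voxel
print(active_blocks(moved).sum(), "of", (64 // 8) ** 3, "blocks scheduled")
```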
In this paper we present a streaming compression scheme for gigantic point sets including per-point normals. This scheme extends our previous Duodecim approach [21] in two ways. First, we show how to use this approach for the compression and rendering of high-resolution iso-surfaces in volumetric data sets. Second, we use deferred shading of point primitives to considerably improve rendering quality. Iso-surface reconstruction is performed in a hexagonal close packing (HCP) grid, into which the initial data set is resampled. Normals are resampled from the initial domain using volumetric gradients. By incremental encoding, only slightly more than 3 bits per surface point and 5 bits per surface normal are required at high fidelity. The compressed data stream can be decoded on the graphics processing unit (GPU). Decoded point positions are saved in graphics memory, and they are then used on the GPU again to render point primitives. In this way, high-quality gigantic data sets can be rendered directly from their compressed representation in local GPU memory at interactive frame rates (see Fig. 1).
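A rough sketch of the incremental-encoding idea, assuming a plain cubic grid in place of the paper's HCP grid and a naive delta code in place of its entropy coding; the printed number is only indicative of how sorting and delta-encoding occupied cell indices shrinks the per-point cost.

```python
import numpy as np

def delta_bits(points, grid=1024):
    """Quantise points to a grid, sort the linear cell indices and
    delta-encode them; nearby surface points yield small deltas, so each
    occupied cell needs only a few bits before any entropy coding."""
    q = np.clip((points * grid).astype(np.int64), 0, grid - 1)
    lin = np.unique(q[:, 0] * grid * grid + q[:, 1] * grid + q[:, 2])
    deltas = np.diff(lin, prepend=lin[:1])
    bits = np.maximum(np.ceil(np.log2(deltas + 1)), 1)
    return bits.mean()

# thin, surface-like slab of samples as a crude stand-in for an iso-surface
pts = np.random.rand(200_000, 3) * np.array([1.0, 1.0, 0.05])
print(f"{delta_bits(pts):.1f} bits per occupied cell before entropy coding")
```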
The adjoint method is a useful tool for finding gradients of design objectives with respect to system parameters for fluid dynamics simulations. But the utility of this method is hampered by the difficulty in writing an efficient implementation for the adjoint flow solver, especially one that scales to thousands of cores. This paper demonstrates a Python library, called adFVM, that can be used to construct an explicit unsteady flow solver and derive the corresponding discrete adjoint flow solver using automatic differentiation (AD). The library uses a two-level computational graph method for representing the structure of both solvers. The library translates this structure into a sequence of optimized kernels, significantly reducing its execution time and memory footprint. Kernels can be generated for heterogeneous architectures including distributed memory, shared memory and accelerator-based systems. The library is used to write a finite volume based compressible flow solver. A wall-clock time comparison between different flow solvers and adjoint flow solvers built using this library and state-of-the-art graph-based AD libraries is presented on a turbomachinery flow problem. Performance analysis of the flow solvers is carried out for CPUs and GPUs. Results of strong and weak scaling of the flow solver and its adjoint are demonstrated on subsonic flow in a periodic box. (C) 2018 Elsevier B.V. All rights reserved.
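The discrete adjoint that adFVM derives automatically can be written by hand for a toy explicit solver, which may help readers see what the reverse sweep computes; the scalar decay model f = -p*x, the objective 0.5*x_N^2 and the finite-difference check below are assumptions chosen purely for illustration.

```python
def forward(x0, p, dt, nsteps):
    """Explicit update x_{n+1} = x_n + dt * f(x_n, p) with f = -p * x."""
    xs = [x0]
    for _ in range(nsteps):
        xs.append(xs[-1] + dt * (-p * xs[-1]))
    return xs

def adjoint_gradient(xs, p, dt):
    """Discrete adjoint of J = 0.5 * x_N^2: sweep the stored states backwards."""
    lam = xs[-1]                       # dJ/dx_N
    dJdp = 0.0
    for x in reversed(xs[:-1]):
        dJdp += lam * dt * (-x)        # df/dp at this step is -x
        lam = lam * (1.0 + dt * (-p))  # df/dx at this step is -p
    return dJdp

xs = forward(1.0, p=0.7, dt=0.01, nsteps=100)
g = adjoint_gradient(xs, 0.7, 0.01)
eps = 1e-6                             # central finite-difference check
fd = (0.5 * forward(1.0, 0.7 + eps, 0.01, 100)[-1] ** 2
      - 0.5 * forward(1.0, 0.7 - eps, 0.01, 100)[-1] ** 2) / (2 * eps)
print(g, fd)                           # the two values should agree closely
```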
Denial-of-service (DoS) and distributed DoS (DDoS) attacks are among the major threats to cyber-security, and client puzzles, which demand that a client perform computationally expensive operations before being granted services by a server, are a well-known countermeasure against them. However, an attacker can inflate its DoS/DDoS capability with fast puzzle-solving software and/or built-in graphics processing unit (GPU) hardware, significantly weakening the effectiveness of client puzzles. In this paper, we study how to prevent DoS/DDoS attackers from inflating their puzzle-solving capabilities. To this end, we introduce a new client puzzle referred to as a software puzzle. Unlike existing client puzzle schemes, which publish their puzzle algorithms in advance, a puzzle algorithm in the present software puzzle scheme is randomly generated only after a client request is received at the server side, and the algorithm is generated such that: 1) an attacker is unable to prepare an implementation to solve the puzzle in advance, and 2) the attacker needs considerable effort to translate central processing unit (CPU) puzzle software into its functionally equivalent GPU version, such that the translation cannot be done in real time. Moreover, we show how to implement software puzzles in the generic server-browser model.
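For context, a conventional hash-reversal client puzzle, whose fixed and published algorithm is exactly what GPU solvers can accelerate and what the proposed software puzzle avoids, can be sketched as follows; the difficulty parameter and helper names are illustrative.

```python
import hashlib
import itertools
import os

def make_puzzle(difficulty_bits=20):
    """Server side: a fresh nonce; the client must find x such that
    SHA-256(nonce || x) has `difficulty_bits` leading zero bits."""
    return os.urandom(16), difficulty_bits

def solve(nonce, difficulty_bits):
    """Client side: brute-force search costing roughly 2**difficulty_bits hashes."""
    target = 1 << (256 - difficulty_bits)
    for x in itertools.count():
        h = hashlib.sha256(nonce + x.to_bytes(8, "big")).digest()
        if int.from_bytes(h, "big") < target:
            return x

def verify(nonce, difficulty_bits, x):
    """Server side: a single hash verifies the client's work."""
    h = hashlib.sha256(nonce + x.to_bytes(8, "big")).digest()
    return int.from_bytes(h, "big") < (1 << (256 - difficulty_bits))

nonce, k = make_puzzle(16)             # 16 bits keeps the demo fast
x = solve(nonce, k)
print(verify(nonce, k, x))             # True
```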
In this paper, we consider the implementation of a thermal flow solver based on the lattice Boltzmann method (LBM) for graphics processing units (GPUs). We first describe the hybrid thermal LBM model implemented, and give a concise review of the CUDA technology. The specific issues that arise with LBM on GPUs are outlined. We propose an approach for efficient handling of the thermal part. Performance is close to optimum and is significantly better than that of comparable CPU solvers. We validate our code by simulating the differentially heated cubic cavity (DHC). The computed results for steady flow patterns are in good agreement with previously published ones. Finally, we use our solver to study the phenomenology of transitional flows in the DHC. (C) 2011 Elsevier Ltd. All rights reserved.
Finding the shortest path between any two nodes in a graph, known as the All-Pairs Shortest Paths (APSP) problem, is fundamental to many data analysis tasks, such as supply chains in logistics, routing protocols in IoT networks involving consumer electronics, and data analysis for social networking apps and Google Maps used by the general public on their smartphones. In this work, we present a novel approach to solve the APSP problem on multicore and GPU systems. In our approach, a graph is first pre-processed by partitioning it into sub-graphs. Then, each sub-graph is processed in parallel using any existing shortest path algorithm, such as the Floyd-Warshall algorithm or Dijkstra's algorithm. Finally, the distance results of the individual sub-graphs are aggregated to obtain the APSP distances for the entire graph. OpenMP and CUDA are used to implement the parallelization on multicore CPUs and GPUs, respectively. We conduct extensive experiments with both synthetic and real-world graphs on the JADE (Joint Academic Data Science Endeavour) cluster at the University of Oxford, which is part of the Tier-2 high performance computing facilities in the U.K. In the experiments, we compare our methods with three existing APSP algorithms in the literature: n-Dijkstra, ParAPSP and SuperFW. The results show that our methods outperform the existing algorithms, achieving a speedup of up to 8.3x over Dijkstra.
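The Floyd-Warshall algorithm named above as one of the per-sub-graph building blocks can be written compactly with NumPy; this is the textbook dense formulation, not the authors' partition-and-aggregate method.

```python
import numpy as np

def floyd_warshall(adj):
    """All-pairs shortest paths on a dense adjacency matrix (np.inf = no edge)."""
    d = adj.copy()
    n = d.shape[0]
    for k in range(n):
        # relax every pair (i, j) through intermediate vertex k
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

inf = np.inf
adj = np.array([[0,   3,   inf, 7],
                [8,   0,   2,   inf],
                [5,   inf, 0,   1],
                [2,   inf, inf, 0]], dtype=float)
print(floyd_warshall(adj))
```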
The skeletonization of binary images is a common task in many image processing and machine learning applications. Some of these applications require very fast image processing. We propose novel techniques for efficient 2D and 3D thinning of binary images using GPU processors. The algorithms use bit-encoded binary images to process multiple points simultaneously in each thread. The simpleness of a point is determined based on Boolean algebra using only bitwise logical operators. This avoids computationally expensive decoding and encoding steps and allows for additional parallelization. The 2D algorithm is evaluated using a data set of handwritten character images. It required an average computation time of 3.53 ns for 32 × 32 pixels and 0.25 ms for 1024 × 1024 pixels. This is 52-18,380 times faster than a multi-threaded border-parallel algorithm. The 3D algorithm was evaluated based on clinical images of the human vasculature and required computation times of 0.27 ms for 128 × 128 × 128 voxels and 20.32 ms for 512 × 512 × 512 voxels, which is 32-46 times faster than the compared border-sequential algorithm using the same GPU processor. The proposed techniques enable efficient real-time 2D and 3D skeletonization of binary images, which could improve the performance of many existing machine learning applications.
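The bit-encoding idea, in which one bitwise operation updates many pixels at once, can be demonstrated with a simple bit-parallel erosion pass; the erosion rule below is a stand-in for the paper's thinning conditions, which test the simpleness of points rather than plain neighbourhood occupancy.

```python
def erode_rows(rows, width):
    """One bit-parallel erosion pass over an image whose rows are packed into
    integers (bit j of rows[i] = pixel (i, j)); each bitwise operation updates
    up to `width` pixels at once, the idea behind the GPU thinning kernels."""
    mask = (1 << width) - 1
    out = []
    for i, r in enumerate(rows):
        above = rows[i - 1] if i > 0 else 0
        below = rows[i + 1] if i + 1 < len(rows) else 0
        # keep a pixel only if it and its 4-neighbours are all set
        keep = r & (r << 1) & (r >> 1) & above & below
        out.append(keep & mask)
    return out

img = ["0011100",
       "0111110",
       "0111110",
       "0011100"]
rows = [int(s[::-1], 2) for s in img]        # bit j holds column j
for r in erode_rows(rows, 7):
    print(format(r, "07b")[::-1])
```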