Search Results - Inner Mongolia University Library

Three Applications of GPU Computing in Neuroscience

COMPUTING IN SCIENCE & ENGINEERING, 2012, vol. 14, no. 3, pp. 40-47

Authors: Baladron, Javier; Fasoli, Diego; Faugeras, Olivier. Affiliation: INRIA, NeuroMathComp Grp, NeuroMathComp Project Team, Sophia Antipolis, France

Three scenarios outlined here show the benefits of using a computer system with multiple GPUs in theoretical neuroscience. In each instance, it's clear that the GPU speedup considerably helps answer a scientific o...

Keywords: Graphics Processing Units; GPU Computing; Neuroscience; Computer System; Computational Modeling; Mathematical Model; Numerical Models; Brain Modeling; Visualization; Computational Neuroscience; parallel and vector implementations; GPUs; Scientific Computing

Parallel cryptographic arithmetic using a redundant Montgomery representation

IEEE TRANSACTIONS ON COMPUTERS, 2004, vol. 53, no. 11, pp. 1474-1482

Authors: Page, D; Smart, NP. Affiliation: Univ Bristol, Dept Comp Sci, Bristol BS8 1UB, Avon, England

We describe how using a redundant Montgomery representation allows for high-performance SIMD-based implementations of RSA and elliptic curve cryptography. This is in addition to the known benefits of immunity from timing attacks afforded by the use of such a representation. We present some preliminary implementation timings using the SSE2 instruction set on a Pentium 4 processor and show that an SIMD parallel implementation of RSA can be around twice as fast as traditional sequential code. This is especially useful given the larger 2,048 bit RSA keys which are now being proposed for standard security levels. Finally, we remark on other application areas that improve the security of our work in the context of side-channel analysis while maintaining high performance.

Keywords: public key cryptosystems; algorithm design and analysis; parallel and vector implementations; performance measures
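
As a rough illustration of the idea named in the abstract above, the following Python sketch shows one well-known form of redundant Montgomery multiplication: operands are kept in the range [0, 2n) instead of [0, n), and with R = 2^k > 4n the result stays below 2n without the data-dependent final subtraction. This is a scalar sketch under those assumptions, not the authors' SSE2 implementation; the function names `mont_setup` and `mont_mul_redundant` and the toy modulus are hypothetical.

```python
# Sketch of Montgomery multiplication on a redundant range [0, 2n).
# Function names and the toy modulus are illustrative assumptions.

def mont_setup(n: int, k: int):
    """Return (R, n_prime) for an odd modulus n and R = 2**k with R > 4n."""
    r = 1 << k
    if n % 2 == 0 or r <= 4 * n:
        raise ValueError("need odd n and R = 2**k > 4n")
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime == -1 (mod R); Python 3.8+
    return r, n_prime


def mont_mul_redundant(a: int, b: int, n: int, k: int, n_prime: int) -> int:
    """Return a value congruent to a*b*R^-1 (mod n).

    For a, b in [0, 2n) and R > 4n the result is again in [0, 2n),
    so no final conditional subtraction is performed.
    """
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    return (t + m * n) >> k


# Toy usage: multiply 57 and 88 modulo 101.
n, k = 101, 9                      # R = 512 > 4 * 101
r, n_prime = mont_setup(n, k)
x, y = 57, 88
xm, ym = (x * r) % n, (y * r) % n  # convert to Montgomery form
zm = mont_mul_redundant(xm, ym, n, k, n_prime)
z = mont_mul_redundant(zm, 1, n, k, n_prime) % n  # leave Montgomery form
```

The bound follows from t + m*n < 4n^2 + R*n, so the quotient by R is below 4n^2/R + n < 2n. Removing the conditional subtraction is what makes the running time independent of the operand values, which is the timing-attack benefit the abstract refers to.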

An Optimized FFT-Based Direct Poisson Solver on CUDA GPUs

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2014, vol. 25, no. 3, pp. 550-559

Authors: Wu, Jing; JaJa, Joseph; Balaras, Elias. Affiliations: Univ Maryland, Dept Elect & Comp Engn, Inst Adv Comp Studies, College Pk, MD 20742, USA; George Washington Univ, Dept Mech & Aerosp Engn, Acad Ctr 720F, Washington, DC 20052, USA

A highly multithreaded FFT-based direct Poisson solver that makes effective use of the capabilities of the current NVIDIA graphics processing units (GPUs) is presented. Our algorithms carefully manage the multiple layers of the memory hierarchy of the GPUs such that almost all the global memory accesses are coalesced into 128-byte device memory transactions, and all computations are carried out directly on the registers. A new strategy to interleave the FFT computation along each dimension with other computations is used to minimize the total number of accesses to the 3D grid. We illustrate the performance of our algorithms on the NVIDIA Tesla and Fermi architectures for a wide range of grid sizes, up to the largest size that can fit on the device memory (512 x 512 x 512 on the Tesla C1060/C2050 and 512 x 256 x 256 on the GeForce GTX 280/480). We achieve up to 140 GFLOPS and a bandwidth of 70 GB/s on the Tesla C1060, and up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. The performance of our algorithms is superior to what can be achieved using the CUDA FFT library in combination with well-known parallel algorithms for solving tridiagonal linear systems of equations.

Keywords: Fast-Fourier transforms; parallel and vector implementations; elliptic equations

SLEEF: A Portable Vectorized Library of C Standard Mathematical Functions

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, vol. 31, no. 6, pp. 1316-1327

Authors: Shibata, Naoki; Petrogalli, Francesco. Affiliations: Nara Inst Sci & Technol, Grad Sch Informat Sci, Nara 6300192, Japan; ARM110, Cambridge CB1 9NJ, England

In this article, we present techniques used to implement our portable vectorized library of C standard mathematical functions written entirely in C language. In order to make the library portable while maintaining good performance, intrinsic functions of vector extensions are abstracted by inline functions or preprocessor macros. We implemented the functions so that they can use sub-features of vector extensions such as fused multiply-add, mask registers, and extraction of mantissa. In order to make computation with SIMD instructions efficient, the library only uses a small number of conditional branches, and all the computation paths are vectorized. We devised a variation of the Payne-Hanek argument reduction for trigonometric functions and a floating point remainder, both of which are suitable for vector computation. We compare the performance of our library to Intel SVML.

Keywords: parallel and vector implementations; SIMD processors; elementary functions; floating-point arithmetic

An Optimized Cell BE Special Function Library Generated by Coconut

IEEE TRANSACTIONS ON COMPUTERS, 2009, vol. 58, no. 8, pp. 1126-1138

Authors: Anand, Christopher Kumar; Kahl, Wolfram. Affiliation: McMaster Univ, Dept Comp & Software, ITB202, Hamilton, ON L8S 4K1, Canada

Coconut, a tool for developing high-assurance, high-performance kernels for scientific computing, contains an extensible domain-specific language (DSL) embedded in Haskell. The DSL supports interactive prototyping and unit testing, simplifying the process of designing efficient implementations of common patterns. Unscheduled C and scheduled assembly language output are supported. Using the patterns, even nonexpert users can write efficient function implementations, leveraging special hardware features. A production-quality library of elementary functions for the Cell BE SPU compute engines has been developed. Coconut-generated and -scheduled vector functions were more than four times faster than commercially distributed functions written in C with intrinsics (a nicer syntax for in-line assembly), wrapped in loops and scheduled by spuxlc. All Coconut functions were faster, but the difference was larger for hard-to-approximate functions for which register-level SIMD lookups made a bigger difference. Other helpful features in the language include facilities for translating interval and polynomial descriptions between GHCi, a Haskell interpreter used to prototype in the DSL, and Maple, used for exploration and minimax polynomial generation. This makes it easier to match mathematical properties of the functions with efficient calculational patterns in the SPU ISA. By using single, literate source files, the resulting functions are remarkably readable.

Keywords: Special function approximations; parallel and vector implementations; code generation; specialized application languages; SIMD processors; applicative (functional) programming

Adaptive Particle Swarm Optimization with Heterogeneous Multicore Parallelism

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, vol. 28, no. 10, pp. 2784-2793

Authors: Wachowiak, Mark P.; Timson, Mitchell C.; DuVal, David J. Affiliation: Nipissing Univ, Dept Comp Sci & Math, North Bay, ON P1B 8L7, Canada

Much progress has recently been made in global optimization, with particular attention devoted to robust nature-inspired stochastic methods for difficult, high-dimensional problems. This paper presents a computational study of an adaptation of one such method, particle swarm optimization (PSO), which is analyzed for parallelization on readily-available heterogeneous parallel computational hardware: specifically, multicore technologies accelerated by graphics processing units (GPUs), as well as Intel Xeon Phi co-processors accelerated with vectorization. In this heterogeneous approach, computationally-intensive, task-parallel components are performed with multicore parallelism and data-parallel elements are executed via co-processing (GPUs or vectorization). A computationally intensive adaptive PSO technique is parallelized according to this schema. In experiments with two high-dimensional and complex functions, large speedups can be obtained. Thus, a heterogeneous approach mitigates the time complexity of PSO adaptations, suggesting that other time-intensive stochastic methods can also benefit from the techniques proposed here.

Keywords: Applications; parallel and vector implementations; optimization; stochastic programming; unconstrained optimization

Numerical engineering: design of PDE black-box solvers

MATHEMATICS AND COMPUTERS IN SIMULATION, 2000, vol. 54, no. 4-5, pp. 269-277

Authors: Schönauer, W. Affiliation: Univ Karlsruhe, Rech Zentrum, D-76128 Karlsruhe, Germany

The design of PDE black-box solvers (for nonlinear systems of elliptic and parabolic PDEs) needs many compromises between efficiency and robustness which we call 'Numerical Engineering'. The requirements for a black-box solver are formulated and the way how to meet them is presented, guided by many years of practical experience in the design of the program packages FIDISOL/CADSOL, VECFEM and LINSOL. The basic approach to the new finite difference element method (FDEM) program package, an FDM on an unstructured FEM grid, is discussed. The common feature of all these methods is the error equation that allows a transparent balancing of all errors. The discretization errors are estimated from difference formulae of different consistency orders. The error balancing must include the iterative solution of the large and sparse linear systems by the LINSOL program package. The real challenge is the parallelization on distributed memory parallel computers which is solved by corresponding data structures with optimal communication patterns and redistribution after each grid refinement cycle. (C) 2000 IMACS. Published by Elsevier Science B.V. All rights reserved.

Keywords: G4 mathematical software; algorithm design and analysis; efficiency; parallel and vector implementations; reliability and robustness

Parallel implementation of the 2D discrete wavelet transform on Graphics Processing Units: Filter Bank versus Lifting

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, vol. 19, no. 3, pp. 299-310

Authors: Tenllado, Christian; Setoain, Javier; Prieto, Manuel; Pinuel, Luis; Tirado, Francisco. Affiliation: Univ Complutense Madrid, Fac Ciencias Fis, Dept Comp Architecture, ArTeCS Grp, E-28040 Madrid, Spain

The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.

Keywords: graphics processors; parallel processing; parallel algorithms; parallel and vector implementations; wavelets and fractals; SIMD processors; optimization; parallel discrete wavelet transform; lifting; filter bank; GPU; stream processors
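
To make the two schemes compared in the abstract above concrete, here is a minimal sketch of one level of a 1D discrete wavelet transform computed both ways. It assumes the unnormalized Haar wavelet purely for brevity; the paper's filters, its 2D transform and its GPU mapping are not reproduced, and the function names are hypothetical.

```python
# One level of an unnormalized Haar DWT, computed two ways.
# The Haar wavelet and the function names are illustrative assumptions.

def haar_filter_bank(x):
    """Filter bank scheme: low-pass and high-pass filter, then downsample by 2."""
    approx = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    detail = [x[i + 1] - x[i] for i in range(0, len(x), 2)]
    return approx, detail


def haar_lifting(x):
    """Lifting scheme: split into even/odd samples, then predict and update."""
    even, odd = list(x[0::2]), list(x[1::2])
    detail = [o - e for e, o in zip(even, odd)]         # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update step
    return approx, detail


signal = [2.0, 4.0, 6.0, 10.0]  # even length is assumed
fb = haar_filter_bank(signal)
ls = haar_lifting(signal)
```

Both functions produce the same coefficients. The lifting version reuses the detail values to form the approximation, which saves arithmetic but introduces a dependency between the two steps, whereas the filter bank computes each output independently from the input samples.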

Accelerating Matrix Operations with Improved Deeply Pipelined Vector Reduction

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2012, vol. 23, no. 2, pp. 202-210

Authors: Tai, Yi-Gang; Lo, Chia-Tien Dan; Psarris, Kleanthis. Affiliations: Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249, USA; So Polytech State Univ, Dept Comp Sci & Software Engn, Marietta, GA 30060, USA

Many scientific or engineering applications involve matrix operations, in which reduction of vectors is a common operation. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data elements cause data hazards. To tackle this problem, we propose a new reduction method with low latency and high pipeline utilization. The performance of the proposed design is evaluated for both single data set and multiple data set scenarios. Further, QR decomposition is used to demonstrate how the proposed method can accelerate its execution. We implement the design on an FPGA and compare its results to other methods.

Keywords: Reconfigurable hardware; pipeline processors; parallel algorithms; parallel and vector implementations; algorithm design and analysis

Optimized FFT computations on heterogeneous platforms with application to the Poisson equation

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2014, vol. 74, no. 8, pp. 2745-2756

Authors: Wu, Jing; Jaja, Joseph. Affiliations: Univ Maryland, Dept Elect & Comp Engn, College Pk, MD 20742, USA; Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742, USA

We develop optimized multi-dimensional FFT implementations on CPU-GPU heterogeneous platforms for the case when the input is too large to fit on the GPU global memory, and use the resulting techniques to develop a fast Poisson solver. The solver involves memory bound computations for which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device-memory, and the executions of the GPU kernels are almost completely overlapped with the PCI data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 145 GFLOPS for the three periodic boundary conditions (single precision version), and over 105 GFLOPS for the two periodic, one Neumann boundary conditions (single precision version). The effective bidirectional PCIe bus bandwidth achieved is 9-10 GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D data PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver. (C) 2014 Elsevier Inc. All rights reserved.

Keywords: Fast Fourier transforms; parallel and vector implementations; CUDA; GPU; Poisson equations
