In this article, we present techniques used to implement our portable vectorized library of C standard mathematical functions, written entirely in the C language. In order to make the library portable while maintaining good performance, intrinsic functions of vector extensions are abstracted by inline functions or preprocessor macros. We implemented the functions so that they can use sub-features of vector extensions such as fused multiply-add, mask registers, and extraction of the mantissa. To make computation with SIMD instructions efficient, the library uses only a small number of conditional branches, and all of the computation paths are vectorized. We devised a variation of the Payne-Hanek argument reduction for trigonometric functions and a floating-point remainder, both of which are suitable for vector computation. We compare the performance of our library to Intel SVML.
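As a concrete illustration of the abstraction described above, the sketch below hides a fused multiply-add intrinsic behind an inline function with a scalar fallback. The names (vdouble, vfma, poly_step) are illustrative placeholders, not the library's actual API.

/* Minimal sketch: wrap a vector-extension intrinsic in an inline function
 * so the same math code builds against AVX2+FMA or plain C. */
#include <math.h>

#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
typedef __m256d vdouble;                  /* 4 double-precision lanes */
static inline vdouble vfma(vdouble a, vdouble b, vdouble c) {
    return _mm256_fmadd_pd(a, b, c);      /* a*b + c with a single rounding */
}
#else
typedef double vdouble;                   /* scalar fallback, 1 lane */
static inline vdouble vfma(vdouble a, vdouble b, vdouble c) {
    return fma(a, b, c);                  /* C99 fused multiply-add */
}
#endif

/* One Horner step of a polynomial evaluation, written once against the
 * abstraction and reusable by every vectorized function. */
static inline vdouble poly_step(vdouble acc, vdouble x, vdouble coeff) {
    return vfma(acc, x, coeff);           /* acc*x + coeff */
}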
Much progress has recently been made in global optimization, with particular attention devoted to robust nature-inspired stochastic methods for difficult, high-dimensional problems. This paper presents a computational study of an adaptation of one such method, particle swarm optimization (PSO), which is analyzed for parallelization on readily available heterogeneous parallel computational hardware: specifically, multicore technologies accelerated by graphics processing units (GPUs), as well as Intel Xeon Phi co-processors accelerated with vectorization. In this heterogeneous approach, computationally intensive, task-parallel components are performed with multicore parallelism and data-parallel elements are executed via co-processing (GPUs or vectorization). A computationally intensive adaptive PSO technique is parallelized according to this schema. In experiments with two high-dimensional and complex functions, large speedups are obtained. Thus, a heterogeneous approach mitigates the time complexity of PSO adaptations, suggesting that other time-intensive stochastic methods can also benefit from the techniques proposed here.
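A condensed sketch of this schema is shown below: the batch fitness evaluation is offloaded as the data-parallel part, while the per-particle updates run task-parallel across CPU cores. The function evaluate_on_device() is a hypothetical wrapper for the offloaded step, and the update is the textbook PSO rule rather than the paper's adaptive variant.

/* Sketch of the heterogeneous split: data-parallel fitness evaluation is
 * offloaded (GPU or vectorized co-processor); the particle update loop is
 * task-parallel across CPU cores with OpenMP. pbest/gbest bookkeeping from
 * the new fitness values is omitted for brevity. */
#include <stdlib.h>

#define N_PARTICLES 4096
#define DIM 1000

/* Hypothetical offloaded batch evaluation, not the paper's code. */
extern void evaluate_on_device(const double *x, double *fitness, int n, int dim);

void pso_step(double *x, double *v, const double *pbest, const double *gbest,
              double *fitness, double w, double c1, double c2)
{
    /* Data-parallel part: evaluate all particles on the co-processor. */
    evaluate_on_device(x, fitness, N_PARTICLES, DIM);

    /* Task-parallel part: velocity/position update on the CPU cores. */
    #pragma omp parallel for
    for (int i = 0; i < N_PARTICLES; i++) {
        for (int d = 0; d < DIM; d++) {
            /* rand() used for brevity; a real implementation needs a
             * thread-safe per-thread RNG. */
            double r1 = (double)rand() / RAND_MAX;
            double r2 = (double)rand() / RAND_MAX;
            int k = i * DIM + d;
            v[k] = w * v[k] + c1 * r1 * (pbest[k] - x[k])
                            + c2 * r2 * (gbest[d] - x[k]);
            x[k] += v[k];
        }
    }
}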
Matrix factorization is known to be an effective method for recommender systems that are given only the ratings from users to items. Currently, the stochastic gradient (SG) method is one of the most popular algorithms for matrix factorization. However, as a sequential approach, SG is difficult to parallelize for handling web-scale problems. In this article, we develop a fast parallel SG method, FPSG, for shared memory systems. By dramatically reducing the cache-miss rate and carefully addressing the load balance of threads, FPSG is more efficient than state-of-the-art parallel algorithms for matrix factorization.
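For context, the basic SG update that such methods parallelize looks like the sketch below (the standard squared-error update with L2 regularization, not FPSG's cache-aware scheduling itself):

/* One SG update for an observed rating r_ui: nudge the latent factors
 * p_u and q_i along the gradient of the regularized squared error. */
static void sg_update(double *p_u, double *q_i, double r_ui,
                      int k, double lr, double lambda)
{
    double pred = 0.0;
    for (int f = 0; f < k; f++)
        pred += p_u[f] * q_i[f];          /* current prediction p_u . q_i */

    double err = r_ui - pred;             /* residual for this rating */

    for (int f = 0; f < k; f++) {
        double pf = p_u[f];
        p_u[f] += lr * (err * q_i[f] - lambda * pf);
        q_i[f] += lr * (err * pf     - lambda * q_i[f]);
    }
}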
A highly multithreaded FFT-based direct Poisson solver that makes effective use of the capabilities of the current NVIDIA graphics processing units (GPUs) is presented. Our algorithms carefully manage the multiple layers of the memory hierarchy of the GPUs such that almost all the global memory accesses are coalesced into 128-byte device memory transactions, and all computations are carried out directly on the registers. A new strategy to interleave the FFT computation along each dimension with other computations is used to minimize the total number of accesses to the 3D grid. We illustrate the performance of our algorithms on the NVIDIA Tesla and Fermi architectures for a wide range of grid sizes, up to the largest size that can fit on the device memory (512 x 512 x 512 on the Tesla C1060/C2050 and 512 x 256 x 256 on the GeForce GTX 280/480). We achieve up to 140 GFLOPS and a bandwidth of 70 GB/s on the Tesla C1060, and up to 375 GFLOPS with a bandwidth of 120 GB/s on the GTX 480. The performance of our algorithms is superior to what can be achieved using the CUDA FFT library in combination with well-known parallel algorithms for solving tridiagonal linear systems of equations.
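For reference, the spectral step at the core of such an FFT-based direct solver can be written, for the standard second-order finite-difference Laplacian with periodic boundaries, as follows (textbook form rather than the paper's exact formulation; non-periodic directions use cosine/sine transforms instead):

\hat{u}_{\mathbf{k}} = \frac{\hat{f}_{\mathbf{k}}}{\lambda_{k_x} + \lambda_{k_y} + \lambda_{k_z}}, \qquad \lambda_{k} = \frac{2\cos(2\pi k/N) - 2}{h^2}, \qquad \mathbf{k} \neq \mathbf{0},

where N and h are the grid size and spacing of the corresponding dimension, \hat{f} is the 3D FFT of the right-hand side, and u is recovered by an inverse 3D FFT after fixing the zero mode (e.g. \hat{u}_{\mathbf{0}} = 0).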
We develop optimized multi-dimensional FFT implementations on CPU-GPU heterogeneous platforms for the case when the input is too large to fit on the GPU global memory, and use the resulting techniques to develop a fast Poisson solver. The solver involves memory-bound computations for which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device memory, and the executions of the GPU kernels are almost completely overlapped with the PCI data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 145 GFLOPS for the three periodic boundary conditions (single precision version), and over 105 GFLOPS for the two periodic, one Neumann boundary conditions (single precision version). The effective bidirectional PCIe bus bandwidth achieved is 9-10 GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D data PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver.
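The overlap idea can be pictured with a small host-side sketch using CUDA streams; process_chunk_on_gpu() is a hypothetical placeholder for the solver's kernels, and the fixed two-buffer chunking is illustrative rather than the paper's actual decomposition.

/* Host-side sketch: overlap PCIe transfers with kernel execution by
 * ping-ponging chunks across two CUDA streams and two device buffers.
 * h_in must be pinned host memory (cudaMallocHost / cudaHostRegister)
 * for the asynchronous copies to overlap with kernels. */
#include <cuda_runtime.h>
#include <stddef.h>

/* Hypothetical wrapper that enqueues the solver's kernels on stream s. */
extern void process_chunk_on_gpu(float *d_chunk, size_t n, cudaStream_t s);

void stream_pipeline(const float *h_in, float *d_buf[2],
                     size_t n_chunks, size_t chunk_elems)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (size_t c = 0; c < n_chunks; c++) {
        int b = (int)(c % 2);             /* alternate buffer and stream */
        /* Enqueue the H2D copy of chunk c asynchronously ... */
        cudaMemcpyAsync(d_buf[b], h_in + c * chunk_elems,
                        chunk_elems * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        /* ... and its kernels on the same stream, so they start as soon as
         * the copy finishes while the other stream's copy is in flight. */
        process_chunk_on_gpu(d_buf[b], chunk_elems, s[b]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}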
Two parallel computer paradigms available today are multi-core accelerators such as the Sony, Toshiba and IBM Cell or graphics processing units (GPUs), and massively parallel message-passing machines such as the IBM Blue Gene (BG). The solution of systems of linear equations is one of the most CPU-intensive steps in engineering and simulation applications and can greatly benefit from the multitude of processing cores and vectorisation on today's parallel computers. We parallelise the conjugate gradient (CG) linear equation solver on the Cell Broadband Engine and the IBM Blue Gene/L machine. We perform a scalability analysis of CG on both machines, across 1, 8 and 16 synergistic processing elements on the Cell and 1-32 cores on BG, with heptadiagonal matrices. The results indicate that the multi-core Cell system outperforms the massively parallel BG system by three to four times, owing to the Cell's higher communication bandwidth and accelerated vector processing capability.
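The kernel both ports parallelize is the textbook unpreconditioned CG iteration, sketched below; the dense matrix-vector product is a stand-in for brevity (the paper uses heptadiagonal matrices), and the loop structure is the standard algorithm rather than either port's code.

/* Textbook CG for A*x = b. The matvec, dot products and vector updates
 * are the pieces distributed across SPEs or Blue Gene nodes. */
#include <math.h>

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* y = A*x for a dense n x n matrix (stand-in for the banded kernel). */
static void matvec(const double *A, const double *x, double *y, int n) {
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

void cg(const double *A, const double *b, double *x,
        double *r, double *p, double *Ap,       /* caller-provided work arrays */
        int n, int max_iter, double tol)
{
    matvec(A, x, Ap, n);
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rs_old = dot(r, r, n);

    for (int it = 0; it < max_iter && sqrt(rs_old) > tol; it++) {
        matvec(A, p, Ap, n);
        double alpha = rs_old / dot(p, Ap, n);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rs_new = dot(r, r, n);
        double beta = rs_new / rs_old;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rs_old = rs_new;
    }
}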
ISBN (print): 9780769549712
We develop an optimized FFT-based Poisson solver on a CPU-GPU heterogeneous platform for the case when the input is too large to fit on the GPU global memory. The solver involves memory-bound computations such as the 3D FFT, in which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device memory, and the executions of the GPU kernels are almost completely overlapped with the PCI data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 50 GFLOPS for the three periodic boundary conditions, and over 40 GFLOPS for the two periodic, one Neumann boundary conditions. The PCIe bus bandwidth achieved is over 5 GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver.
Three scenarios outlined here show the benefits of using a computer system with multiple GPUs in theoretical neuroscience. In each instance, it's clear that the GPU speedup considerably helps answer a scientific or technological question.
Many scientific or engineering applications involve matrix operations, in which reduction of vectors is a common operation. If the core operator of the reduction is deeply pipelined, which is usually the case, dependencies between the input data elements cause data hazards. To tackle this problem, we propose a new reduction method with low latency and high pipeline utilization. The performance of the proposed design is evaluated for both single-data-set and multiple-data-set scenarios. Further, QR decomposition is used to demonstrate how the proposed method can accelerate its execution. We implement the design on an FPGA and compare its results to those of other methods.
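The hazard in question has a familiar software analogue, sketched below: a naive running sum chains every addition through a single accumulator, so a deeply pipelined adder stalls, whereas several independent partial sums keep the pipeline busy. This illustrates the problem and the generic multi-accumulator remedy, not the paper's FPGA design.

/* Sum reduction with 4 independent dependency chains to hide the latency
 * of a pipelined adder; the partial sums are combined at the end. */
static double reduce_sum(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {      /* independent accumulators */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];    /* remainder */
    return (s0 + s1) + (s2 + s3);     /* small adder tree to combine */
}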
Coconut, a tool for developing high-assurance, high-performance kernels for scientific computing, contains an extensible domain-specific language (DSL) embedded in Haskell. The DSL supports interactive prototyping and unit testing, simplifying the process of designing efficient implementations of common patterns. Unscheduled C and scheduled assembly language output are supported. Using the patterns, even nonexpert users can write efficient function implementations, leveraging special hardware features. A production-quality library of elementary functions for the Cell BE SPU compute engines has been developed. Coconut-generated and -scheduled vector functions were more than four times faster than commercially distributed functions written in C with intrinsics (a nicer syntax for in-line assembly), wrapped in loops and scheduled by spuxlc. All Coconut functions were faster, but the difference was larger for hard-to-approximate functions, for which register-level SIMD lookups made a bigger difference. Other helpful features in the language include facilities for translating interval and polynomial descriptions between GHCi, a Haskell interpreter used to prototype in the DSL, and Maple, used for exploration and minimax polynomial generation. This makes it easier to match mathematical properties of the functions with efficient calculational patterns in the SPU ISA. Because each function lives in a single, literate source file, the resulting functions are remarkably readable.