ISBN: (print) 9781467384889
The main difficulty in implementing modern image coding systems on a GPU is that the algorithms employed in the core of the coding scheme are inherently sequential. We recently proposed bitplane image coding with parallel coefficient processing (BPC-PaCo), a coding scheme that, contrary to most systems, permits the processing of multiple coefficients of the image in parallel. This enables the use of SIMD computing, which makes the scheme well suited to a GPU implementation. This paper introduces and evaluates the GPU implementation of BPC-PaCo employing two different strategies that trade off computational throughput and compression efficiency. The proposed implementation is compared to the best CPU and GPU implementations of JPEG2000, the state-of-the-art image compression standard. Experimental results indicate that BPC-PaCo achieves a computational throughput an order of magnitude higher than that of those implementations, with a small reduction in coding efficiency.
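For readers unfamiliar with the term, the general idea of a bitplane can be sketched as follows. This is only an illustration of how integer coefficients decompose into bitplanes, not the BPC-PaCo coder itself; the function name `bitplanes` and the choice to ignore signs are assumptions made for the example.

```python
def bitplanes(coefficients: list[int]) -> list[list[int]]:
    """Split coefficient magnitudes into bitplanes, most significant plane first.

    Illustrative only: signs are ignored and no entropy coding is performed.
    """
    magnitudes = [abs(c) for c in coefficients]
    # Number of planes needed to represent the largest magnitude.
    top = max(magnitudes, default=0).bit_length()
    # Within one plane, each coefficient's bit is extracted independently.
    return [[(m >> p) & 1 for m in magnitudes] for p in range(top - 1, -1, -1)]


if __name__ == "__main__":
    for plane in bitplanes([5, -3, 0, 6]):
        print(plane)
```

Extracting one plane touches every coefficient independently, which is easy to parallelize. The sequential part that the abstract refers to lies in the coding steps that follow, which this sketch does not model.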
We present new algorithms for the k-mismatches version of approximate string matching. Our algorithms use SIMD (Single Instruction Multiple Data) instruction set extensions, particularly AVX2 and AVX-512 instructions. Our approach extends an earlier algorithm for exact string matching with SSE2 and AVX2. In addition, we modify this exact string matching algorithm to work with AVX-512. We demonstrate the competitiveness of our solutions through practical experiments. Our algorithms outperform earlier algorithms for both exact and approximate string matching on various benchmark data sets.
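To make the problem statement concrete, here is a minimal scalar baseline for k-mismatches matching. It is not the SIMD algorithm the abstract describes; the function name `k_mismatch_positions` is hypothetical, and the sketch only shows what a correct answer looks like.

```python
def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return every start index where pattern occurs in text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for offset in range(m):
            if text[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(k_mismatch_positions("abcabd", "abd", 1))
```

The inner comparison loop is the part a SIMD implementation would be expected to replace with wide vector comparisons covering many characters per instruction.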
Fast multipole methods (FMMs) based on the oscillatory Helmholtz kernel can reduce the cost of solving N-body problems arising from boundary integral equations (BIEs) in acoustics or electromagnetics. However, their cost strongly increases in the high-frequency regime. This paper introduces a new directional FMM for oscillatory kernels (defmm: directional equispaced interpolation-based fmm), whose precomputation and application are FFT-accelerated due to polynomial interpolations on equispaced grids. We demonstrate the consistency of our FFT approach and show how symmetries can be exploited in the Fourier domain. We also describe the algorithmic design of defmm, well-suited for the BIE nonuniform particle distributions, and present performance optimizations on one CPU core. Finally, we exhibit important performance gains on all test cases for defmm over a state-of-the-art FMM library for oscillatory kernels.
The fast compression of images is a requisite in many applications like TV production, teleconferencing, or digital cinema. Many of the algorithms employed in current image compression standards are inherently sequential. High-performance implementations of such algorithms often require specialized hardware like field-programmable gate arrays. Graphics Processing Units (GPUs) do not commonly achieve high performance on these algorithms because the algorithms do not exhibit fine-grained parallelism. Our previous work introduced a new core algorithm for wavelet-based image coding systems, called bitplane coding with parallel coefficient processing (BPC-PaCo), which is tailored for massively parallel architectures. This paper introduces the first high-performance, GPU-based implementation of BPC-PaCo. A detailed analysis of the algorithm aids its implementation on the GPU. The main insights behind the proposed codec are an efficient thread-to-data mapping, smart memory management, and the use of efficient cooperation mechanisms to enable inter-thread communication. Experimental results indicate that the proposed implementation meets the requirements for high-resolution (4K) digital cinema in real time, yielding speedups of 30x with respect to the fastest implementations of current compression standards. In addition, a power consumption evaluation shows that our implementation consumes 40x less energy for equivalent performance than state-of-the-art methods.
ISBN: (print) 9781424416936
One of the tools most commonly used by computational biologists is some form of sequence alignment. Heuristic alignment algorithms such as BLAST [1] and FASTA [2], which were developed for speed and return multiple results, are not a total replacement for the more rigorous but slower algorithms like Smith-Waterman [3]. The different techniques complement one another. A heuristic can filter dissimilar sequences from a large database such as GenBank [4], and the Smith-Waterman algorithm then performs a more detailed, in-depth alignment in a way not adequately handled by heuristic methods. An associative parallel Smith-Waterman algorithm has been improved and further parallelized. Comparisons across different algorithms, different types of file input, and different input sizes have been performed and are reported here. The newly developed associative algorithm reduces the running time for rigorous pairwise local sequence alignment.
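As background, the standard sequential Smith-Waterman recurrence can be sketched as below. This is not the associative parallel algorithm the abstract reports. The linear gap penalty and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration, and only the best local alignment score is returned, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    cols = len(b) + 1
    prev = [0] * cols  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACGT"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on the same anti-diagonal can be computed independently. This dependency structure is what parallel variants of the algorithm typically exploit.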
The continuing development of smaller electronic devices into the nanoelectronic regime offers great possibilities for the construction of highly parallel computers. This paper describes work designed to discover the best ways to take advantage of this opportunity. Simulated results are presented which indicate that improvements in clock rates of two orders of magnitude, and in packing density of three orders of magnitude, over the best current systems, should be attainable. These results apply to the class of data-parallel computers, and their attainment demands modifications to the design which are also described. Evaluation of the requirements of alternative classes of parallel architecture is currently under way, together with a study of the vitally important area of fault-tolerance.