ISBN (print): 9781467396042
In 5G communications, to meet the higher throughput requirements of Turbo decoding, the Quadratic Permutation Polynomial (QPP) interleaver is adopted for contention-free parallel memory access. However, owing to an intrinsic property of the QPP interleaver, contention-free memory access is achieved only when the codeword length is divisible by the degree of parallelism (P), which makes the selection of P rather inflexible, especially when P is high. In this paper, we present a novel contention-free solution by introducing dual memory access and parallel window dynamic alignment. The proposed method provides more choices of contention-free parallelism. It offers throughput improvements for high-parallelism (P > 64) decoders with 3GPP LTE-A configurations, and also enables more selections of P for QPP-interleaved Turbo decoders.
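The QPP mapping and the divisibility constraint above can be checked directly. Below is a minimal sketch (Python, not taken from the paper): `qpp_interleave` evaluates pi(x) = (f1*x + f2*x^2) mod K, and `is_contention_free` verifies that P parallel windows of size W = K/P never access the same memory bank in the same cycle; K = 40, f1 = 3, f2 = 10 are the standard 3GPP LTE parameters for the shortest block size.

```python
def qpp_interleave(K, f1, f2):
    """QPP interleaver: pi(x) = (f1*x + f2*x^2) mod K."""
    return [(f1 * x + f2 * x * x) % K for x in range(K)]

def is_contention_free(pi, P):
    """Check that P parallel windows never collide on a memory bank.
    K must be divisible by P; the bank of address a is a // W,
    with window size W = K // P."""
    K = len(pi)
    if K % P:
        return False
    W = K // P
    for j in range(W):
        # At step j, window t reads interleaved address pi[j + t*W];
        # contention-free means the P banks touched are all distinct.
        banks = {pi[j + t * W] // W for t in range(P)}
        if len(banks) != P:
            return False
    return True

# LTE K=40 uses f1=3, f2=10 (3GPP TS 36.212 interleaver table)
pi = qpp_interleave(40, 3, 10)
```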
A new efficient design of second-order spectral-null (2-OSN) codes is presented. The new codes are obtained by applying the technique used to design parallel decoding balanced (i.e., 1-OSN) codes to the random walk met...
In this paper, we propose and present implementation results of a high-speed turbo decoding algorithm. The latency caused by (de)interleaving and iterative decoding in a conventional maximum a posteriori turbo decoder can be dramatically reduced with the proposed design. The latency reduction comes from the combination of the radix-4, center-to-top, parallel decoding, and early-stop algorithms. This reduced latency enables the use of the turbo decoder as a forward error correction scheme in real-time wireless communication services. The proposed scheme results in a slight degradation in bit error rate performance for large block sizes, because the effective interleaver size in a radix-4 implementation is halved relative to the conventional method. To demonstrate the latency reduction, we implemented the proposed scheme on a field-programmable gate array and compared its decoding speed with that of a conventional decoder. The results show at least a fivefold improvement for a single iteration of turbo decoding.
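The early-stop idea can be sketched as follows (a hypothetical illustration, not the paper's implementation): iteration halts as soon as the hard decisions of two successive iterations agree, where `llr_update` is a stand-in for one full MAP iteration over both constituent decoders.

```python
def iterative_decode(llr_update, llr0, max_iters=8):
    """Early-stop iterative decoding sketch. `llr_update` is a
    hypothetical callback standing in for one full MAP iteration;
    decoding stops when hard decisions stop changing."""
    llr = llr0
    prev = None
    for it in range(1, max_iters + 1):
        llr = llr_update(llr)
        hard = [1 if v >= 0 else 0 for v in llr]  # hard decision per bit
        if hard == prev:
            return hard, it  # early stop: decisions converged
        prev = hard
    return prev, max_iters
```

In a real decoder the stopping rule is often combined with a CRC check; sign agreement alone is the simplest variant.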
ISBN (print): 9781479923427
Nowadays, mobile devices are capable of displaying video at up to HD resolution. In this paper, we propose two acceleration strategies for an Audio Video coding Standard (AVS) software decoder on a multi-core ARM NEON platform. First, data-level parallelism is exploited to make effective use of NEON's SIMD capability, and key modules are redesigned to be SIMD-friendly. Second, macroblock-level wavefront parallelism is designed around the decoding dependencies among macroblocks to utilize the processing capability of multiple cores. Experimental results show that AVS (IEEE 1857) HD video streams can be decoded in real time by applying the two proposed acceleration strategies.
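The macroblock-level wavefront can be sketched as follows (an illustrative Python sketch, not the paper's code): assuming each macroblock depends on its left neighbor and on the row above up to the top-right neighbor, all macroblocks on the diagonal d = x + 2y are mutually independent and can be decoded in parallel.

```python
def wavefront_schedule(cols, rows):
    """Group macroblocks into wavefronts: MB (x, y) can start once
    its left neighbor (x-1, y) and top-right neighbor (x+1, y-1)
    are done, so every MB on diagonal d = x + 2*y is independent."""
    fronts = {}
    for y in range(rows):
        for x in range(cols):
            fronts.setdefault(x + 2 * y, []).append((x, y))
    # Return wavefronts in dependency order; each inner list can be
    # dispatched to the cores concurrently.
    return [fronts[d] for d in sorted(fronts)]
```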
ISBN (print): 0780378407
In this paper, a new parallel Turbo encoding and decoding technique is introduced. In this technique, a long information data frame is first divided into sub-blocks, which are then encoded with trellis termination and decoded by multiple parallel SISO modules. It is shown that, at the cost of a slight increase in hardware complexity and a slight loss in transmission efficiency due to the extra terminating bits appended, the proposed scheme can effectively reduce the decoding delay while achieving noticeably better error performance than the regular schemes, especially in high-code-rate situations.
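The sub-block partitioning step can be sketched as follows (illustrative only; the tail positions are shown as zero placeholders, whereas real trellis-termination bits depend on the encoder state at the end of each sub-block):

```python
def split_with_termination(frame, n_sub, m=2):
    """Split a frame into n_sub sub-blocks, each extended by m tail
    positions for trellis termination (m = encoder memory; zeros here
    are placeholders for the state-dependent tail bits). Each
    terminated sub-block can then be handled by its own SISO module;
    the efficiency loss is n_sub * m extra tail symbols per frame."""
    assert len(frame) % n_sub == 0
    L = len(frame) // n_sub
    return [frame[i * L:(i + 1) * L] + [0] * m for i in range(n_sub)]
```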
ISBN (print): 9781479927654
This paper presents a new hybrid parallelization method for a High Efficiency Video Coding (HEVC) decoder. The proposed method groups the HEVC decoding modules into entropy decoding, pixel decoding, and in-loop filtering parts for optimal parallelization, considering the characteristics of each part. The method employs a coding tree unit (CTU)-level 2D wavefront for the pixel decoding part. To decrease the delay between entropy decoding and pixel decoding, task-level parallelism (TLP) is additionally employed across the two parts. For the HEVC deblocking filter, CTU-level data-level parallelism (DLP) with equally partitioned CTUs is proposed. In addition, CTU row-level DLP for the sample adaptive offset (SAO) filter is proposed to achieve maximum parallel performance and to minimize the overhead of organizing a backup buffer. The experimental results show that the proposed parallel deblocking filter achieves a speed-up of up to 5.4x, and the parallel SAO approach up to 3.7x, on a multi-core platform. Furthermore, the proposed parallel HEVC decoder shows a speed-up of 2.9x with 6 threads, without any encoder-side parallel tools such as wavefront parallel processing (WPP) coding or picture partitioning with tiles and slice segments.
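The equal CTU partitioning used for the parallel deblocking filter can be sketched as follows (an illustrative sketch; the function name and the chunking policy are assumptions, not taken from the paper):

```python
def partition_ctus(n_ctus, n_threads):
    """Equally partition CTU indices across threads for data-level
    parallel deblocking: each thread filters one contiguous,
    independent chunk, with chunk sizes differing by at most one."""
    base, extra = divmod(n_ctus, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks
```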
ISBN (print): 9781479947607
Existing scheduling schemes for decoding multiple H.264/AVC streams on multi-core systems are largely limited by ineffective use of the multi-core architecture. Among the reasons are inefficient load balancing, in which common load metrics (e.g., tasks, frames, bytes) fail to correctly reflect the processing load at each core; the lack of scalability of scheduling algorithms on large-scale multi-cores; and bottlenecks at the schedulers during multi-stream decoding. In this paper, we propose a scalable adaptive Highest Random Weight (HA-HRW) hash scheduler for distributed shared-memory multi-core architectures that considers: 1) the memory access and core/cache topology of the multi-core architecture; 2) an appropriate processing-time load metric to enforce true load balancing; 3) hierarchical parallel scheduling to decode multiple streams simultaneously; and 4) the locality characteristics of candidate processing units, limiting the search to neighboring cores to enable scalable scheduling. We implement and evaluate our approach on a 32-core SGI server with realistic workloads. Compared with existing schemes, our scheme achieves higher throughput, better load balancing, better CPU utilization, and no jitter problem. Our scheme scales with the number of cores and streams, as its time complexity is O(1).
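The core Highest Random Weight (rendezvous) hashing step can be sketched as follows (illustrative; the hash choice and the `neighborhood` argument standing in for the paper's locality restriction are assumptions):

```python
import hashlib

def hrw_select(cores, stream_id, neighborhood=None):
    """Highest Random Weight (rendezvous) hashing: each (core, stream)
    pair gets a deterministic pseudo-random weight, and the stream is
    scheduled on the core with the highest weight. Restricting the
    candidates to a neighborhood of nearby cores keeps the search
    local, in the spirit of the paper's HA-HRW variant."""
    candidates = neighborhood if neighborhood is not None else cores

    def weight(core):
        h = hashlib.sha256(f"{core}:{stream_id}".encode()).hexdigest()
        return int(h, 16)

    return max(candidates, key=weight)
```

A useful property of HRW hashing is minimal disruption: if a core is removed, only the streams mapped to that core are reassigned, while all other assignments stay put.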
ISBN (print): 9781509028610
In this paper, we propose a parallel block-based Viterbi decoder (PBVD) on the graphics processing unit (GPU) platform for the decoding of convolutional codes. The decoding procedure is simplified and parallelized, and the structure of the trellis is exploited to reduce the metric computation. Based on the compute unified device architecture (CUDA), two kernels with different degrees of parallelism are designed to map the two decoding phases. Moreover, optimal data-structure designs for several kinds of intermediate information are presented to improve the efficiency of internal memory transactions. Experimental results demonstrate that the proposed decoder achieves high throughputs of 598 Mbps on an NVIDIA GTX580 and 1802 Mbps on a GTX980 for the 64-state convolutional code, a 1.5x speedup over the fastest existing GPU implementations.
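The add-compare-select (ACS) recursion that such GPU kernels parallelize per trellis state can be illustrated with a small CPU-side Viterbi decoder (an illustrative Python sketch for a 4-state, rate-1/2 code with generators 7 and 5 octal, not the paper's 64-state GPU implementation):

```python
def viterbi_decode(bits, K=3, polys=(0o7, 0o5)):
    """Hard-decision Viterbi decoding of a rate-1/2 convolutional
    code (constraint length K, generator polynomials in `polys`).
    The inner add-compare-select loop is the per-state computation
    that a GPU decoder runs in parallel across all trellis states."""
    n_states = 1 << (K - 1)

    def outputs(state, bit):
        # Shift register: newest bit is the MSB, state holds the rest.
        reg = (bit << (K - 1)) | state
        return [bin(reg & p).count("1") & 1 for p in polys]

    INF = float("inf")
    metrics = [0] + [INF] * (n_states - 1)  # start from state 0
    paths = [[] for _ in range(n_states)]
    for i in range(0, len(bits), 2):
        sym = bits[i:i + 2]  # one received rate-1/2 symbol
        new_m = [INF] * n_states
        new_p = [None] * n_states
        for s in range(n_states):
            if metrics[s] == INF:
                continue
            for b in (0, 1):
                nxt = ((b << (K - 1)) | s) >> 1
                dist = sum(o != r for o, r in zip(outputs(s, b), sym))
                m = metrics[s] + dist       # add
                if m < new_m[nxt]:          # compare-select
                    new_m[nxt] = m
                    new_p[nxt] = paths[s] + [b]
        metrics, paths = new_m, new_p
    # Return the survivor path of the best-metric final state.
    return paths[min(range(n_states), key=lambda s: metrics[s])]
```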