In this paper, we propose and present implementation results of a high-speed turbo decoding algorithm. With the proposed design, the latency caused by (de)interleaving and iterative decoding in a conventional maximum a posteriori turbo decoder can be dramatically reduced. The latency reduction stems from combining the radix-4, center-to-top, parallel decoding, and early-stop algorithms. This reduced latency enables the use of the turbo decoder as a forward error correction scheme in real-time wireless communication services. The proposed scheme incurs a slight degradation in bit error rate performance for large block sizes, because the effective interleaver size in a radix-4 implementation is halved relative to the conventional method. To demonstrate the latency reduction, we implemented the proposed scheme on a field-programmable gate array and compared its decoding speed with that of a conventional decoder. The results show at least a fivefold improvement for a single iteration of turbo decoding.
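One common form of the early-stop test the abstract mentions is hard-decision agreement: iterations halt once the decoded bits stop changing between passes. The sketch below is illustrative only (the function names and the stopping rule are assumptions, not taken from the paper):

```python
def hard_decisions(llrs):
    # Map each log-likelihood ratio to a bit: non-negative LLR -> 0, negative -> 1.
    return tuple(0 if l >= 0 else 1 for l in llrs)

def decode_with_early_stop(iterate, llrs, max_iters=8):
    """Run turbo iterations, stopping early when the hard decisions
    agree between two consecutive iterations (assumed convergence).
    `iterate` is a user-supplied function performing one full
    turbo iteration on the LLR vector."""
    prev = None
    for it in range(1, max_iters + 1):
        llrs = iterate(llrs)
        cur = hard_decisions(llrs)
        if cur == prev:            # decisions stable -> stop iterating
            return cur, it
        prev = cur
    return prev, max_iters
```

With a toy `iterate` that only grows LLR magnitudes, the decisions stabilize immediately and the loop exits after the second pass, which is the latency saving the early-stop criterion buys in the well-conditioned case.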
Despite continuous improvement in the computational power of multi-/many-core processors, their memory access performance has not improved commensurately, so the overall performance of recent processors is often limited by the delay of off-chip memory accesses. Low-delay data compression for the last-level cache (LLC) can improve processor performance because compression increases the effective LLC capacity and thus reduces the number of off-chip memory accesses. This paper proposes a novel data compression method suited to high-speed parallel decoding in the LLC. Since cache line data often exhibit periodicity of certain lengths, such as 32- or 64-bit instructions, 32-bit integers, and 64-bit floating-point numbers, an information word is encoded as a base pattern plus a differential pattern between the original word and the base pattern. Evaluation on a GPU simulator shows that the compression ratio of the proposed coding is comparable to LZSS coding and X-Match Pro, and superior to other conventional compression algorithms for cache memories. This paper also presents an experimental decoder designed for ASIC; synthesis results show that the decoder can decompress a 32-byte cache line in four clock cycles. IPC evaluation on the GPU simulator shows that, for several benchmark programs, the proposed coding achieves a higher IPC than the conventional BΔI coding, with a maximum IPC improvement of 20%.
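The base-plus-differential idea can be sketched in a few lines. This is a generic base+delta scheme in the spirit of the abstract, not the paper's actual format: word width, delta width, and base selection are all assumptions here.

```python
def compress_line(words):
    """Sketch of base+delta encoding for a cache line of integer words:
    the first word is taken as the base pattern and each word is stored
    as a small signed difference from it. Returns None when any delta
    does not fit in one signed byte (line stays uncompressed)."""
    base = words[0]
    deltas = [w - base for w in words]
    if all(-128 <= d <= 127 for d in deltas):
        return base, bytes(d & 0xFF for d in deltas)
    return None  # incompressible with this base/delta width

def decompress_line(base, delta_bytes):
    # Every output word depends only on the base and its own delta,
    # so a hardware decoder can reconstruct all words in parallel --
    # the property that enables the few-cycle decompression the
    # abstract targets.
    return [base + (d - 256 if d > 127 else d) for d in delta_bytes]
```

The key design point is that decompression is embarrassingly parallel across words, unlike dictionary schemes such as LZSS whose output positions depend on earlier ones.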
A simple scheme was proposed by Knuth to generate binary balanced codewords from any information word. However, this method is limited in that its redundancy is twice that of the full sets of balanced codes. The gap between the redundancy of Knuth's algorithm and that of the full sets of balanced codes is considerable, and this paper attempts to reduce it. Furthermore, many constructions assume that full balancing can be performed without showing the steps; full balancing refers to balancing the encoded information together with the prefix. We propose an efficient full balancing scheme that uses neither lookup tables nor enumerative coding.
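For readers unfamiliar with Knuth's construction, its core step is simple: for any even-length binary word there is an index k such that inverting the first k bits balances the word, and only k needs to be communicated (via a prefix, whose encoding is where the redundancy discussed above arises). A minimal sketch of that step:

```python
def knuth_balance(bits):
    """Sketch of Knuth's balancing step: invert the first k bits of an
    even-length word; some k in [0, n] always yields weight n/2.
    Encoding k into a (balanced) prefix is the separate step whose
    cost the paper addresses."""
    n = len(bits)
    for k in range(n + 1):
        w = [b ^ 1 for b in bits[:k]] + bits[k:]
        if sum(w) == n // 2:
            return k, w
    raise ValueError("word length must be even")

def knuth_restore(k, w):
    # The receiver simply re-inverts the first k bits.
    return [b ^ 1 for b in w[:k]] + w[k:]
```

The existence of a valid k follows because inverting one more bit changes the weight by exactly ±1, so the weight sweeps continuously from w to n−w and must pass through n/2.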
ISBN (print): 9781479923427
Nowadays, mobile devices are capable of displaying video at up to HD resolution. In this paper, we propose two acceleration strategies for an Audio Video coding Standard (AVS) software decoder on a multi-core ARM NEON platform. First, data-level parallelism is exploited to make effective use of the SIMD capability of NEON, and key modules are redesigned to be SIMD-friendly. Second, macroblock-level wavefront parallelism is designed around the decoding dependencies among macroblocks to utilize the processing capability of multiple cores. Experimental results show that an AVS (IEEE 1857) HD video stream can be decoded in real time when the two proposed acceleration strategies are applied.
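Macroblock wavefront parallelism follows from the standard intra/deblocking dependency pattern: MB (x, y) needs its left neighbour (x−1, y) and the row above completed through (x+1, y−1). A common consequence, sketched below under that assumption (the paper's exact dependency set may differ), is that macroblocks with equal x + 2y are mutually independent:

```python
def wavefront_order(cols, rows):
    """Sketch of macroblock-level wavefront scheduling: under the
    assumed dependencies (left and top-right neighbours), all MBs on
    the anti-diagonal d = x + 2*y are independent and can be decoded
    concurrently on different cores."""
    waves = {}
    for y in range(rows):
        for x in range(cols):
            waves.setdefault(x + 2 * y, []).append((x, y))
    return [waves[d] for d in sorted(waves)]
```

Each returned wave is a batch that can be dispatched to the core pool; a row starts as soon as the row above is two macroblocks ahead.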
ISBN (print): 0780378407
In this paper, a new parallel Turbo encoding and decoding technique is introduced. In this technique, a long information data frame is first divided into sub-blocks, which are then encoded with trellis termination and decoded by multiple parallel SISO modules. It is shown that, at the cost of a slight increase in hardware complexity and a slight loss in transmission efficiency due to the extra terminating bits appended, the proposed scheme can effectively reduce the decoding delay while achieving noticeably better error performance than the regular schemes, especially at high code rates.
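The partition step can be sketched as follows. This is a simplified model, not the paper's encoder: trellis termination is represented as appending tail bits (a memory-m convolutional constituent needs m of them per sub-block), which is exactly the source of the transmission-efficiency loss mentioned above.

```python
def split_with_termination(frame, num_blocks, term_bits=3):
    """Sketch of the sub-block partition: the frame is split into
    near-equal sub-blocks and each is terminated independently
    (modelled here as appending `term_bits` zero tail bits), so the
    sub-blocks can be decoded by parallel SISO modules."""
    n = len(frame)
    size = -(-n // num_blocks)  # ceiling division
    return [frame[i:i + size] + [0] * term_bits
            for i in range(0, n, size)]
```

The overhead is num_blocks × term_bits extra bits per frame, i.e. it grows with the degree of parallelism, which is why the scheme trades a small rate loss for an almost proportional cut in decoding delay.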
ISBN (print): 9781479927654
This paper presents a new hybrid parallelization method for a High Efficiency Video Coding (HEVC) decoder. The proposed method groups the HEVC decoding modules into entropy decoding, pixel decoding, and in-loop filtering parts, parallelizing each part according to its characteristics. For the pixel decoding part, the method employs a coding tree unit (CTU)-level 2D wavefront. To reduce the delay between entropy decoding and pixel decoding, task-level parallelism (TLP) is additionally employed across the two parts. For the HEVC deblocking filter, CTU-level data-level parallelism (DLP) with equally partitioned CTUs is proposed. In addition, CTU row-level DLP for sample adaptive offset (SAO) is proposed to maximize parallel performance and to minimize the overhead of maintaining a backup buffer. The experimental results show that, on the multi-core platform, the proposed parallel deblocking filter achieves a speed-up of up to 5.4x and the parallel SAO approach a speed-up of up to 3.7x. Furthermore, the proposed parallel HEVC decoder shows a speed-up of 2.9x with 6 threads, without any encoder-side parallel tools such as wavefront parallel processing (WPP) coding or picture partitioning with tiles and slice segments.
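"Equally partitioned CTUs" for DLP amounts to splitting the CTU range into near-equal contiguous chunks, one per thread. The sketch below shows one plausible partitioning (an assumption; the paper's exact assignment policy is not stated in the abstract):

```python
def partition_ctus(num_ctus, num_threads):
    """Sketch of equally partitioned CTU-level DLP: split the CTU
    index range into near-equal contiguous [start, end) ranges, one
    per thread, distributing the remainder one CTU at a time so no
    thread gets more than one extra CTU."""
    base, extra = divmod(num_ctus, num_threads)
    ranges, start = [], 0
    for t in range(num_threads):
        end = start + base + (1 if t < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

Equal-sized static ranges keep scheduling overhead near zero, which matters for a filter stage whose per-CTU work is comparatively uniform.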
ISBN (print): 9781479947607
Existing scheduling schemes for decoding multiple H.264/AVC streams on multi-core processors are largely limited by ineffective use of the multi-core architecture. Among the reasons are inefficient load balancing, in which common load metrics (e.g. tasks, frames, bytes) fail to reflect the actual processing load at each core; scheduling algorithms that do not scale to large core counts; and bottlenecks at the schedulers during multi-stream decoding. In this paper, we propose a scalable adaptive Highest Random Weight (HA-HRW) hash scheduler for distributed shared-memory multi-core architectures that considers: 1) the memory access and core/cache topology of the multi-core architecture; 2) a processing-time load metric that enforces true load balancing; 3) hierarchical parallel scheduling to decode multiple streams simultaneously; 4) the locality characteristics of candidate processing units, limiting the search to neighboring cores to keep scheduling scalable. We implement and evaluate our approach on a 32-core SGI server with a realistic workload. Compared with existing schemes, our scheme achieves higher throughput, better load balancing, better CPU utilization, and no jitter problem. Our scheme scales with core count and stream count, as its time complexity is O(1).
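The Highest Random Weight (rendezvous) hashing at the core of such a scheduler is compact: each candidate core is scored with a hash of (key, core) and the top score wins. The sketch below shows plain HRW without the paper's adaptive weighting (an intentional simplification); restricting the candidate list to a neighbourhood of cores is what bounds the per-decision cost:

```python
import hashlib

def hrw_pick(key, cores):
    """Sketch of Highest Random Weight (rendezvous) hashing: score
    every candidate core with a hash of (key, core) and select the
    highest score. The choice is deterministic per key and minimally
    disrupted when a non-selected candidate disappears."""
    def weight(core):
        h = hashlib.sha256(f"{key}:{core}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return max(cores, key=weight)
```

Minimal disruption is the property that makes HRW attractive for stream-to-core assignment: removing a core only reassigns the streams that were mapped to it, leaving all other assignments intact.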
ISBN (print): 9781509028610
In this paper, we propose a parallel block-based Viterbi decoder (PBVD) on the graphics processing unit (GPU) platform for decoding convolutional codes. The decoding procedure is simplified and parallelized, and the structure of the trellis is exploited to reduce the metric computation. Based on the compute unified device architecture (CUDA), two kernels with different degrees of parallelism are designed to map the two decoding phases. Moreover, optimized data structures for several kinds of intermediate information are presented to improve the efficiency of internal memory transactions. Experimental results demonstrate that the proposed decoder achieves throughputs of 598 Mbps on an NVIDIA GTX580 and 1802 Mbps on a GTX980 for the 64-state convolutional code, a 1.5x speedup over the fastest existing GPU implementations.
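The per-trellis-step kernel in any Viterbi decoder is add-compare-select (ACS), and its GPU-friendliness comes from every state being updatable independently. The sketch below is a generic ACS step in plain Python (the data layout and names are illustrative, not the paper's CUDA structures):

```python
def acs_step(path_metrics, transitions):
    """Sketch of one add-compare-select (ACS) trellis step.
    transitions[s] lists (prev_state, branch_metric) pairs feeding
    state s. Each state's update reads only shared old metrics and
    writes its own slot, so all states can run as parallel threads."""
    new_metrics, survivors = [], []
    for preds in transitions:
        # Add path + branch metrics, compare candidates, select the best.
        metric, best_prev = min((path_metrics[p] + bm, p) for p, bm in preds)
        new_metrics.append(metric)
        survivors.append(best_prev)  # stored for traceback
    return new_metrics, survivors
```

On a GPU, the outer loop becomes the thread index; the survivor array written each step is what the traceback phase (the paper's second kernel) later consumes.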