ISBN: 0819429163 (print)
CORDIC-based IIR digital filters are orthogonal filters whose internal computations consist of orthogonal transformations. These filters possess desirable properties for VLSI implementations such as regularity, local connection, low sensitivity to finite word-length implementation, and elimination of limit cycles. Recently, fine-grain pipelined CORDIC-based IIR digital filter architectures have been developed which can perform the filtering operations at arbitrarily high sample rates at the cost of a linear increase in hardware complexity. These pipelined architectures consist of only Givens rotations and a few additions, which can be mapped onto CORDIC arithmetic based processors. However, in practical applications, implementations of Givens rotations using traditional CORDIC arithmetic are quite expensive. For example, for 16-bit accuracy, using a floating-point data format with a 16-bit mantissa and a 5-bit exponent, one Givens rotation requires approximately 20 pairs of shift-add operations. In this paper, we propose an efficient implementation of pipelined CORDIC-based IIR digital filters based on fast orthonormal μ-rotations. Using this method, the Givens rotations are approximated by angles corresponding to orthonormal μ-rotations, which are based on the idea of CORDIC and can perform a rotation with a minimal number of shift-add operations. We present various methods of construction for such orthonormal μ-rotations. A significant reduction (over 76%) in the number of required shift-add operations is achieved. All types of fast rotations can be implemented as a cascade of only four basic types of shift-add stages. These stages can be executed on a modified floating-point CORDIC architecture, making the pipelined filter highly suitable for VLSI implementations.
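For orientation, the conventional CORDIC rotation that the abstract describes as expensive can be sketched as below. This is a minimal floating-point emulation of the textbook rotation-mode algorithm, not the paper's fast μ-rotation method; the function name `cordic_rotate` and the default of 16 iterations are assumptions chosen for illustration. Each loop iteration corresponds to one pair of shift-add operations.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians using CORDIC micro-rotations.

    Textbook rotation mode; converges for |angle| up to roughly 1.74 rad.
    Multiplying by 2**-i stands in for a hardware right shift by i bits.
    """
    # Elementary angles atan(2^-i) and the accumulated scale factor.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # One pair of shift-add operations per micro-rotation.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]

    # Compensate for the constant CORDIC gain.
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, math.pi / 6)` should return values close to `(cos(pi/6), sin(pi/6))`; the residual error shrinks as `iterations` grows, which is why the shift-add count scales with the target accuracy.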
Three pipelined multiprocessor implementations of adaptive lattice filters are examined. A performance analysis is done for each multiprocessor system, and expressions for approximate computation time and speedup are derived. Each system is shown to be flexible with respect to the different algorithms and the different filter sizes it can implement.
This paper proposes two efficient architectures for hardware implementation of the Advanced Encryption Standard (AES) algorithm. The composite field arithmetic for implementing the SubBytes (S-box) and InvSubBytes (Inverse S-box) transformations, investigated by several authors, is used as the basis for deriving the proposed architectures. The first architecture is based on an optimized S-box followed by a bit-wise implementation of MixColumns and AddRoundKey for encryption, and on an optimized Inverse S-box followed by a bit-wise implementation of InvMixColumns and AddMixRoundKey for decryption. The proposed S-box and Inverse S-box used in this architecture are designed as a cascade of three blocks. In the second proposed architecture, block III of the proposed S-box is combined with the MixColumns and AddRoundKey transformations, forming an integrated unit for encryption. An integrated unit for decryption, combining block III of the proposed InvSubBytes with InvMixColumns and AddMixRoundKey, is formed along similar lines. The delays of the proposed architectures for VLSI implementation are found to be the shortest compared to state-of-the-art implementations of AES operating in non-feedback mode. Iterative and fully unrolled sub-pipelined designs, including the key schedule, are implemented using FPGA and ASIC. The proposed designs are efficient in terms of the Kgates per Gbit/s ratio compared with a few recent state-of-the-art ASIC (0.18-μm CMOS standard cell) based designs, and in terms of throughput per area (TPA) for FPGA implementations.
ISBN: 9791092279115 (print)
High Efficiency Video Coding (HEVC), the recently developed international video compression standard, has 50% better video compression efficiency than the H.264 video compression standard at the expense of significantly increased computational complexity. The HEVC Inverse Discrete Cosine Transform (IDCT) algorithm accounts for 11% of the computational complexity of an HEVC video encoder. Recently, commercial and academic high-level synthesis (HLS) tools have started to be used successfully for FPGA implementations of digital signal processing algorithms. Therefore, in this paper, the first FPGA implementations of the HEVC 2D IDCT algorithm using HLS tools in the literature are proposed. The proposed HEVC IDCT hardware is implemented on Xilinx FPGAs using three HLS tools: Xilinx Vivado HLS, LegUp, and MATLAB Simulink HDL Coder. Using HLS tools significantly reduced the FPGA development time, and the resulting FPGA implementations achieved real-time performance. Therefore, HLS tools can be used for FPGA implementation of an HEVC video encoder.
In this paper, the parallel implementations of two well-known linear state-space filtering algorithms, namely the Kalman and the Lainiotis filters, in MIMD machines are studied from a computational standpoint. The analysis assumes both time invariant and time varying system models and uses precedence graphs and critical paths. The parallelism efficiency of the implementations is also defined and studied. Results indicate that these algorithms can be implemented in parallel using a comparatively small number of processors. Furthermore, the efficiency of the parallel implementations can be very high or very low, depending on the state and measurement vector dimensions.
Matrix decomposition of the channel matrix in the form of QR decomposition (QRD) is needed for advanced multiple-input multiple-output (MIMO) demapping algorithms such as the sphere decoder. Due to the computation-intensive nature of the QRD, its implementation has to be highly efficient. Flexibility in several forms, e.g. support for different algorithms, reusability of wireless implementations, portability, etc., is highly sought in wireless devices. The contradictory nature of flexibility and efficiency requires tradeoffs to be made between them in system development. In this paper, we have analyzed such tradeoffs by implementing two minimum mean squared error sorted QRD algorithms. The algorithms have been implemented using four different methods with varying degrees of reusability and in five different forms of portability. The performance of the implementations is evaluated using the real-time constraints from the LTE standard. For all the implementations, modular equations for accurately estimating the execution time are derived.
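As a rough illustration of what an MMSE sorted QRD computes, the sketch below applies modified Gram-Schmidt with smallest-norm column sorting to a channel matrix extended by a scaled identity. The abstract does not say which two algorithms were implemented, so the Gram-Schmidt formulation, the real-valued arithmetic, and the names `mmse_extend` and `sorted_qr` are all assumptions made for this example. The input is assumed to have full column rank.

```python
import math

def mmse_extend(H, sigma):
    """Stack sigma * I under H (m x n, list of rows) for the MMSE formulation."""
    n = len(H[0])
    identity = [[sigma if r == c else 0.0 for c in range(n)] for r in range(n)]
    return [row[:] for row in H] + identity

def sorted_qr(H):
    """Sorted QR decomposition via modified Gram-Schmidt.

    Returns (Q, R, perm) such that column c of Q*R equals column perm[c] of H.
    At each step the remaining column with the smallest norm is processed next.
    """
    m, n = len(H), len(H[0])
    Q = [row[:] for row in H]
    R = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for i in range(n):
        # Select the remaining column with the smallest squared norm.
        norms = [sum(Q[r][c] ** 2 for r in range(m)) for c in range(i, n)]
        k = i + norms.index(min(norms))
        # Swap columns i and k in Q, R, and the permutation record.
        for row in Q:
            row[i], row[k] = row[k], row[i]
        for row in R:
            row[i], row[k] = row[k], row[i]
        perm[i], perm[k] = perm[k], perm[i]
        # Normalize column i, then remove its component from later columns.
        R[i][i] = math.sqrt(sum(Q[r][i] ** 2 for r in range(m)))
        for r in range(m):
            Q[r][i] /= R[i][i]
        for j in range(i + 1, n):
            R[i][j] = sum(Q[r][i] * Q[r][j] for r in range(m))
            for r in range(m):
                Q[r][j] -= R[i][j] * Q[r][i]
    return Q, R, perm
```

A call such as `sorted_qr(mmse_extend(H, sigma))` returns an upper-triangular `R` together with the detection order in `perm`. A practical implementation would use complex arithmetic and might prefer Givens or Householder rotations for numerical robustness.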
ISBN: 0780370805 (print)
This paper describes two new matrix transform algorithms for the Max-Log-MAP decoding of turbo codes. In the proposed algorithms, the successive decoding procedures carried out in the conventional Max-Log-MAP algorithm are performed in parallel, and well formulated into a set of simple and regular matrix operations, which can therefore considerably speed up the decoding operations and reduce the computational complexity. The matrix Max-Log-MAP algorithms also maintain the advantage of the general logarithmic MAP like algorithms in avoiding complex numerical representation problems. They particularly facilitate the implementations of the logarithmic MAP like algorithms in special-purpose parallel processing VLSI hardware architectures. The matrix algorithms also allow simple implementations by using shift registers. The proposed implementation architectures for the matrix Max-Log-MAP decoding can effectively reduce the memory capacity and simplify the data accesses and transfers required by the conventional Max-Log-MAP as well as MAP algorithms.
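The approximation underlying Max-Log-MAP can be sketched as follows. The exact Jacobian logarithm log(e^a + e^b) equals max(a, b) plus a correction term, and Max-Log-MAP drops the correction; one forward-recursion step then becomes a (max, +) vector-matrix product. This is a generic illustration and not the paper's specific matrix transform; the function names and the dense branch-metric layout `gamma[prev][next]` are assumptions.

```python
import math

NEG_INF = float("-inf")  # metric of a transition that does not exist

def max_star(a, b):
    """Exact Jacobian logarithm: log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a, b):
    """Max-Log approximation: the correction term is dropped."""
    return max(a, b)

def max_plus_step(alpha, gamma):
    """One forward-recursion step as a (max, +) vector-matrix product.

    alpha[p] is the metric of previous state p; gamma[p][s] is the branch
    metric from state p to state s (NEG_INF when there is no branch).
    """
    num_next = len(gamma[0])
    return [
        max(alpha[p] + gamma[p][s] for p in range(len(alpha)))
        for s in range(num_next)
    ]
```

Because every output entry of `max_plus_step` depends only on the previous metric vector, the entries can be evaluated in parallel, which is the kind of regularity the abstract points to for hardware implementations.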
ISBN: 0780393333 (print)
This paper proposes a new parallel implementation of some rotation-based adaptive filters [1]. These filters are characterized by robust behavior with respect to input signal correlation [2] and by good numerical properties. Moreover, their implementations have reduced complexity. The circuits based on these block-diagonal adaptive algorithms use fewer computing cells than the systolic circuit of the QR-RLS algorithm. Nevertheless, these new low-complexity architectures no longer have a pipeline structure.
Cryptographic substitution boxes (S-boxes) are an integral part of modern block ciphers like the Advanced Encryption Standard (AES). There exists a rich literature devoted to the efficient implementation of cryptographic S-boxes, wherein hardware designs for FPGAs and standard cells have received particular attention. In this paper we present a comprehensive study of different standard-cell implementations of the AES S-box with respect to timing (i.e. critical path), silicon area, power consumption, and combinations of these cost metrics. We examine implementations which exploit the mathematical properties of the AES S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. Our results show that the timing, area, and power properties of the different S-box realizations can vary by up to almost an order of magnitude. In terms of area and area-delay product, the best choices are implementations which calculate the S-box output. On the other hand, the hardware look-up solutions are characterized by the shortest critical path. The dedicated low-power implementations not only reduce power consumption by a large degree, but also show good timing properties and offer the best power-delay and power-area products.
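The distinction between calculating the S-box output and looking it up can be illustrated in software. The sketch below computes the AES S-box from its definition, a multiplicative inverse in GF(2^8) followed by an affine transformation, and then builds the equivalent 256-entry table. It is a plain reference computation and does not correspond to any of the standard-cell designs compared in the paper; the helper names are chosen for this example.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce back into 8 bits
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse computed as a^254; 0 maps to 0 as AES requires."""
    result, power, exponent = 1, a, 254
    while exponent:
        if exponent & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        exponent >>= 1
    return result

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def aes_sbox(byte):
    """Calculated S-box: field inversion followed by the AES affine map."""
    b = gf_inv(byte)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

# Look-up variant: precompute all 256 outputs once, then index the table.
SBOX_TABLE = [aes_sbox(i) for i in range(256)]
```

In hardware the same trade-off appears as combinational logic for the field arithmetic versus a ROM-like table, which is the comparison the abstract reports in terms of area, delay, and power.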
Emerging Software Defined Radio (SDR) baseband platforms are based on multiple processors with massive parallelism. Although the computational power of these platforms would theoretically enable SDR solutions with advanced wireless signal processing, existing work still implements rather basic algorithms. For instance, current Multiple-Input Multiple-Output (MIMO) detector implementations are typically based on simple linear hard-output detection and not on advanced near-Maximum Likelihood (ML) soft-output detection. However, only the latter makes it possible to exploit the full potential of MIMO technology. In this work, we explore the feasibility of advanced soft-output near-ML MIMO detectors on massively parallel processors. Although such detectors are considered to be very challenging due to their high computational complexity, we combine architecture-friendly algorithm design, application-specific instructions, and instruction-level/data-level parallelism explorations to make SDR solutions feasible. We show that, by applying the proposed combination of techniques, it is possible to obtain SDR implementations which can deliver data rates that are sufficient for future wireless systems. For example, a 2 x 4 Coarse Grain Array (CGA) processor with 16-way Single Instruction Multiple Data (SIMD) can deliver 192/368 Mbps throughput for 2 x 2 64/16-QAM transmissions. Finally, we estimate the area and power consumption of the programmable solution and compare it against a traditional Application Specific Integrated Circuit (ASIC) approach. This enables us to draw conclusions from the cost perspective.