With each new generation of wireless communication, multiple-input multiple-output (MIMO) architectures have proven to be a solution to the higher data rates required by a growing number of mobile users. A low-cost system with effective functionality yields better returns, which motivates the hybrid beamforming architecture that combines digital chains with large-scale antenna arrays. The programmable metasurface (PM) is a promising antenna array concept that realizes reconfigurable beamforming, and it is attractive as an antenna array architecture because of its low cost and low power consumption. However, experimental work combining PMs with modern wireless communication systems has not been pursued intensively, and the signal processing and system implementation challenges remain open. In this article, a PM hybrid MIMO beamforming system addressing these topics is presented. First, a hybrid beamforming channel estimation algorithm adapted for PMs is created by merging analog beam training with digital interleaved orthogonal frequency division multiplexing. Next, data transmissions based on the estimated channel state information use various signal recovery methods to examine the channel estimation accuracy and system feasibility. To analyze the proposed system in practice, these aspects are implemented in system-level experiments using three PMs operating at 28 GHz for downlink wireless communication in both single-user and multiuser scenarios. The results demonstrate the functionality of the PM hybrid MIMO beamforming system.
Two important issues in systolic array designs are addressed: How is fault tolerance provided in systolic arrays to enhance the yield of wafer-scale integration implementations? And, how are efficient systolic arrays with two levels of pipelining designed? (The first level refers to the pipelined organization of the array at the cellular level, and the second refers to the pipelined functional units inside the cells.) The fault-tolerant scheme proposed replaces defective cells with clocked delays. This has the distinct characteristic that data can flow through the array with faulty cells at the original clock speed. It is shown that both the defective cells under this fault-tolerant scheme and the second-level pipeline stages can simply be modeled as additional delays in the data paths of “generic” systolic designs. The mathematical notion of a cut is introduced to solve the problem of how to allow for these extra delays while preserving the correctness of the original systolic array designs. The results obtained by applying these techniques are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of failures (with the addition of very little hardware) while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. However, adding delays to systolic arrays with cycles typically induces a significant decrease in throughput. In response to this, a new class of systolic algorithms has been derived in which the data cycle around a ring of processing cells. The systolic ring architecture has the property that its performance degrades gracefully as cells fail. Use of the cut theory and ring architectures for arrays with feedback gives effective fault-tolerant and two-level pipelining schemes for most systolic arrays. As a side effect of developing the ring architecture approach, several new systolic al
Polar codes are a class of linear block codes that provably achieve channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario in 5th-generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works targeting improved throughput for successive-cancellation (SC) decoding of polar codes are semi-parallel implementations that exploit special maximum-likelihood (ML) nodes. In this work, we present a new fast simplified SC (Fast-SSC) decoder architecture. Compared to a baseline Fast-SSC decoder, our solution reduces the memory requirements. We achieve this through more efficient memory utilization, which also makes it possible to execute multiple operations in a single clock cycle. Finally, we propose new special node merging techniques that improve the throughput further, and detail a new Fast-SSC-based decoder architecture to support merged operations. The proposed decoder reduces the operation sequence requirement by up to 39%, which reduces the number of time steps needed to decode a codeword by 35%. ASIC implementation results with 65 nm TSMC technology show that the proposed decoder has a throughput improvement of up to 31% compared to previous Fast-SSC decoder architectures.
The paper describes two approaches suitable for a field-programmable gate-array (FPGA) implementation of fast Walsh-Hadamard transforms. These transforms are important in many signal-processing applications, including speech compression, filtering and coding. Two novel architectures for the fast Hadamard transform, one using a systolic architecture and one using distributed arithmetic techniques, are presented. The first approach uses the Baugh-Wooley multiplication algorithm for a systolic architecture implementation. The second approach is based on both a distributed arithmetic ROM and accumulator structure, and a sparse matrix-factorisation technique. Implementations of the algorithms on a Xilinx FPGA board are described. The distributed arithmetic approach exhibits better performance than the systolic architecture approach.
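For readers unfamiliar with the transform named in this abstract, the following is a minimal software sketch of a fast Walsh-Hadamard transform in natural (Hadamard) order. It is not the paper's systolic or distributed-arithmetic architecture; the function name `fwht` and the unnormalized scaling are assumptions made for illustration. It only shows the butterfly structure (additions and subtractions, no multiplications) that corresponds to the sparse factorisation of the Hadamard matrix.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform, natural (Hadamard) order.

    Hypothetical helper for illustration; the input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    # Each pass applies one sparse factor of the Hadamard matrix:
    # butterflies between elements that are h positions apart.
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

Applying `fwht` twice returns the input scaled by its length, which is a convenient sanity check for any implementation of the transform.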
ISBN: (print) 9798350308600
Fully homomorphic encryption (FHE) offers the ability to perform computations directly on encrypted data by encoding numerical vectors onto mathematical structures. However, the adoption of FHE is hindered by substantial overheads that make it impractical for many applications. Number theoretic transforms (NTTs) are a key optimization technique for FHE because they accelerate vector convolutions. Towards practical usage of FHE, we propose to use SPIRAL, a code generator renowned for generating efficient linear transform implementations, to generate high-performance NTT code for vector architectures. We identify suitable NTT algorithms and translate the dataflow graphs of those algorithms into SPIRAL's internal mathematical representations. We then implement the entire workflow required for generating efficient vectorized NTT code. In this work, we target the Ring Processing Unit (RPU), a multi-tile long vector accelerator designed for FHE computations. On average, the SPIRAL-generated NTT kernel achieves a 1.7x speedup over naive implementations on the RPU, showcasing the effectiveness of our approach towards maximizing performance for NTT computations on vector architectures.
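To make the statement that NTTs accelerate vector convolutions concrete, here is a small sketch of a radix-2 NTT and the cyclic convolution built on it. It is not SPIRAL-generated code and does not model the RPU; the function names, the toy modulus 17, and the root 2 (a primitive 8th root of unity modulo 17) are assumptions chosen for illustration. FHE schemes typically use much larger moduli and often a negacyclic variant, which this sketch does not cover. It relies on `pow(x, -1, mod)`, available in Python 3.8 and later.

```python
def ntt(a, root, mod):
    """Recursive radix-2 NTT: evaluates the polynomial a at powers of root (mod mod).

    Assumes len(a) is a power of two and root is a primitive len(a)-th root of unity.
    """
    n = len(a)
    if n == 1:
        return list(a)
    root_sq = root * root % mod
    even = ntt(a[0::2], root_sq, mod)
    odd = ntt(a[1::2], root_sq, mod)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % mod
        out[k] = (even[k] + t) % mod           # butterfly, upper half
        out[k + n // 2] = (even[k] - t) % mod  # butterfly, lower half
        w = w * root % mod
    return out


def intt(a, root, mod):
    """Inverse NTT: forward transform with the inverse root, then scale by 1/n."""
    inv_n = pow(len(a), -1, mod)
    return [x * inv_n % mod for x in ntt(a, pow(root, -1, mod), mod)]


def cyclic_convolution(a, b, root, mod):
    """Cyclic convolution via pointwise multiplication in the NTT domain."""
    fa, fb = ntt(a, root, mod), ntt(b, root, mod)
    return intt([x * y % mod for x, y in zip(fa, fb)], root, mod)
```

With these toy parameters, `cyclic_convolution(a, b, 2, 17)` handles length-8 inputs; the transform replaces the quadratic number of multiplications of a direct convolution with a number proportional to n log n.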
ISBN: (print) 0780332598
This paper presents a new technique which allows interactive optimization of video compression algorithms using massively parallel computers such as the CRAY T3D. This work aims to exploit as much as possible the parallel nature of digital image processing algorithms to obtain almost real-time computing with the flexibility of a software implementation. Thanks to this low computation time, interactive tools have been developed which allow easy and fast visual evaluation of image quality. This leads to a significant productivity gain when developing new video compression techniques. Our approach has been validated on advanced region-based video compression algorithms. The interactive facilities offered by the proposed technique permit accurate optimization of the algorithm parameters in a few minutes, where several days were previously needed. Depending on the complexity of the compression algorithms, 8-12 images are compressed, decompressed and visualized per second.
Welcome to the sixth installment of the Design and Implementation (D&I) Series. The D&I Series was created with a specific goal of addressing the needs of the Communications Society's industry members by publishing articles that encourage knowledge transfer between industry engineers and scientists.
Perceptual grouping is a key step in vision to organize image data into structural hypotheses to be used for high level analysis. In this paper, we propose data allocation and load balancing strategies which reduce th...
Artificial Intelligence has emerged as a transformative technology, revolutionizing numerous industries by enabling advanced automation, predictive analytics, and decision-making capabilities. For that Artificial Inte...
ISBN: (print) 9781617385469
The increasing level of circuit integration is enabling the use of more complex digital signal processing algorithms in modern applications. The fast Fourier transform (FFT) is one of the most widely used signal processing algorithms. There is no single hardware implementation that fits all needs, and in fact higher-performance FFTs are used in applications as it becomes feasible to do so. This paper presents a family of register-based architectures that can be obtained from a highly parameterizable C++ specification. The size, numerical precision, radix algorithm and parallelism of both computation and I/O transfers are parameterized in the specification. Optimized RTL targeted to a specific ASIC or FPGA technology can be generated from that specification using a high-level synthesis (HLS) flow. The generated RTL is then verified against the original C++ specification using an automated verification environment.
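As a software reference point for the algorithm this abstract builds on, the sketch below shows an iterative radix-2 decimation-in-time FFT. It is not the paper's C++ specification or its register-based architecture; the function name `fft_radix2` is an assumption, and the transform size is taken from the input length rather than from a separate parameter. Precision, radix and parallelism, which the paper parameterizes, are fixed here (double-precision complex, radix 2, sequential).

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 decimation-in-time FFT (forward transform, unnormalized).

    Hypothetical helper for illustration; len(x) must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [complex(v) for v in x]

    # Reorder the input into bit-reversed index order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            out[i], out[j] = out[j], out[i]

    # log2(n) stages of butterflies with growing span.
    size = 2
    while size <= n:
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(size // 2):
                u = out[start + k]
                t = w * out[start + k + size // 2]
                out[start + k] = u + t
                out[start + k + size // 2] = u - t
                w *= w_step
        size *= 2
    return out
```

A floating-point model like this is the kind of golden reference a generated fixed-point design could be compared against, with a tolerance that depends on the chosen numerical precision.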