ISBN: 9781424442959 (print)
In this paper, we examine several algorithms suitable for the hardware implementation of the discrete Fourier transform (DFT) with non-power-of-two problem sizes. We incorporate these algorithms into Spiral, a tool capable of automatically generating the corresponding hardware implementations. We discuss how each algorithm can be used to generate different types of hardware structures, and we demonstrate that our tool is able to produce hardware implementations of non-power-of-two sized DFTs over a wide range of cost/performance tradeoff points.
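The abstract does not name the algorithms it studies. As an illustration only, the sketch below uses one well-known option for composite non-power-of-two sizes, a recursive mixed-radix Cooley-Tukey decomposition, written in software rather than as a hardware structure. The function names (`dft_naive`, `smallest_factor`, `dft_mixed_radix`) are hypothetical and do not come from Spiral.

```python
import cmath

def dft_naive(x):
    """Direct O(n^2) DFT, used as the base case for prime sizes."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def smallest_factor(n):
    """Smallest factor of n greater than 1, or n itself if n is prime (or < 2)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n

def dft_mixed_radix(x):
    """Decimation-in-time DFT for any length n = p * m (assumed illustration)."""
    n = len(x)
    p = smallest_factor(n)
    if p == n:                      # prime (or trivial) size: no split possible
        return dft_naive(x)
    m = n // p
    # p interleaved sub-sequences, each of length m, transformed recursively
    subs = [dft_mixed_radix(x[r::p]) for r in range(p)]
    # recombine with twiddle factors W_n^(r*k)
    return [sum(subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n)
                for r in range(p))
            for k in range(n)]
```

For example, a length-12 input is split into two length-6 transforms, each of which is split into two length-3 transforms computed directly.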
Precise time interval measurement is required for a number of applications including clock stability analysis, time-of-flight measurements, and particle physics. Commercial time interval measurement devices can achieve picosecond resolution but are expensive, especially for multichannel applications. In previous research, the US Army Combat Capabilities Development Command Army Research Laboratory demonstrated 10-ns resolution on 10 channels using a low-cost field-programmable gate array (FPGA) suitable for pulse-per-second monitoring. This technical note details the design of an interface box for this FPGA device, enabling practical time interval measurement with a variety of input signals. The purpose of this note is twofold: 1) to document the interface box to allow for easy use and future modifications and 2) to provide a reference to facilitate the construction of other interface units, including design advice and lessons learned.
Battery technology has been the bottleneck in the development of electric vehicles. To track the battery state in real time, it is increasingly important to design a battery management system (BMS) that can monitor and adjust that state as it changes. State of charge (SOC) is an important parameter describing the charge and discharge capacity of the battery. Knowing it accurately is essential for getting full performance from the battery system, improving battery safety, preventing overcharge and over-discharge, and prolonging battery life. The BMS should therefore be able to estimate the battery's SOC accurately in real time. This paper estimates the SOC with the OCV-AH method and implements the method on a field-programmable gate array (FPGA), covering both the hardware and the software aspects.
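A minimal software sketch of the general OCV-AH idea, under stated assumptions: the initial SOC is read from an open-circuit-voltage (OCV) lookup table by linear interpolation, and it is then updated by ampere-hour (AH) integration of the measured current. The table values, the sign convention (positive current means discharge), and the function names are hypothetical; the paper's actual FPGA design is not described in the abstract.

```python
def soc_from_ocv(ocv, table):
    """Interpolate SOC (0..1) from a sorted list of (voltage, soc) pairs."""
    if ocv <= table[0][0]:
        return table[0][1]
    if ocv >= table[-1][0]:
        return table[-1][1]
    for (v0, s0), (v1, s1) in zip(table, table[1:]):
        if v0 <= ocv <= v1:
            return s0 + (s1 - s0) * (ocv - v0) / (v1 - v0)

def ah_update(soc, current_a, dt_s, capacity_ah):
    """One ampere-hour integration step; positive current discharges the cell."""
    soc -= current_a * dt_s / 3600.0 / capacity_ah
    return min(1.0, max(0.0, soc))   # clamp to the physical range

# Hypothetical OCV curve for illustration only
OCV_TABLE = [(3.0, 0.0), (3.6, 0.5), (4.2, 1.0)]
```

Usage would be a single call to `soc_from_ocv` after the battery has rested, followed by one `ah_update` call per current sample.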
An intermediate-frequency digital receiver based on an FPGA is introduced in the paper. The scheme of system realization is proposed, and the design of every hardware circuit is described in detail. The main function modules in the FPGA are described carefully, and the implementation result of every module is given. The correctness of every main function module is verified by ***. Without changing the hardware platform, the system can realize different functions and technical specifications by changing the software program, which gives it high generality and practical value.
ISBN: 9781450394178 (print)
Streaming applications make up an important portion of the workloads that FPGAs may accelerate but suffer from inefficient data movement. The inefficiency stems from copying data indirectly into the FPGA DRAM rather than directly into its on-chip memory, substantially diminishing the end-to-end speedup, especially for small workloads (hundreds of kilobytes). AMD Xilinx's Host Memory IP (HMI) aims to address the data movement problem by exposing to the developer a High-Level Synthesis (HLS) interface that moves the data from the host directly to the FPGA's on-chip memory. However, using HMI purely for its interface without additional code changes incurred a 3.3x slowdown in comparison with the current programming model. The slowdown mainly originates from OpenCL call overhead and the kernel control logic unnecessarily switching states. To overcome these issues, we propose Host Memory Library (HMLib), an efficient HLS-based library that facilitates data transfer on behalf of the user. HMLib not only optimizes the runtime stack for efficient data transfer, but also provides HLS-compatible and user-friendly interfaces. We demonstrate HMLib's effectiveness for streaming applications (Deflate compression and CRC32) with improvements of up to 36.2x over OpenCL-DDR and up to 79.5x over raw HMI for small-scale data while maintaining little-to-no performance loss for large-scale inputs. We plan to open source our work in the future.
ISBN: 9781450361378 (print)
Recent advances in artificial intelligence (AI) technology have led to a new round of systolic structure innovation. Many AI accelerators employ a systolic structure to realize the core large-scale matrix-vector multiplication for high-performance processing, which has a complexity of $O(n^2)$ for a matrix of size $n\times n$ and is therefore difficult to implement on a field-programmable gate array (FPGA) platform. To overcome this drawback, in this paper, we propose a super systolization strategy to implement the core circulant matrix-vector multiplication as a systolic structure with subquadratic space complexity. The work is carried out in two interdependent stages: (i) a novel matrix-vector multiplication algorithm based on the Toeplitz matrix-vector product (TMVP) approach is proposed to obtain subquadratic space complexity; (ii) a series of optimization techniques is introduced to map the proposed algorithm onto the desired systolic structure. Finally, a detailed complexity analysis and comparison are conducted to demonstrate the efficiency of the proposed strategy. The proposed strategy is highly efficient and can be extended to many neural-network-based hardware implementation platforms.
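To illustrate why a TMVP formulation can be subquadratic, the sketch below applies the standard two-way TMVP split: a size-$n$ product is replaced by three size-$n/2$ products, giving roughly $O(n^{\log_2 3})$ multiplications. This is an assumed, software-only illustration of the general technique, not the paper's algorithm or its systolic mapping; the function names and the diagonal-vector encoding of the Toeplitz matrix are choices made here.

```python
def tmvp(t, v):
    """Toeplitz matrix-vector product.

    t holds the 2n-1 diagonals, with T[i][j] = t[i - j + n - 1]; v has length n.
    """
    n = len(v)
    if n % 2 == 1 or n <= 2:        # base case: schoolbook product
        return [sum(t[i - j + n - 1] * v[j] for j in range(n)) for i in range(n)]
    m = n // 2
    # T = [[T1, T0], [T2, T1]] in block form, each block Toeplitz of size m
    t0, t1, t2 = t[0:2 * m - 1], t[m:3 * m - 1], t[2 * m:4 * m - 1]
    v0, v1 = v[:m], v[m:]
    p0 = tmvp([a - b for a, b in zip(t0, t1)], v1)    # (T0 - T1) V1
    p1 = tmvp([a - b for a, b in zip(t2, t1)], v0)    # (T2 - T1) V0
    p2 = tmvp(t1, [a + b for a, b in zip(v0, v1)])    # T1 (V0 + V1)
    return [a + c for a, c in zip(p0, p2)] + [b + c for b, c in zip(p1, p2)]

def circulant_matvec(c, v):
    """Product C v for the circulant matrix C[i][j] = c[(i - j) mod n]."""
    n = len(c)
    t = [c[d % n] for d in range(-(n - 1), n)]   # circulant is a special Toeplitz
    return tmvp(t, v)
```

The recombination uses only additions, which is what makes the three half-size products sufficient: the top half is $T_1V_0 + T_0V_1 = P_2 + P_0$ and the bottom half is $T_2V_0 + T_1V_1 = P_2 + P_1$.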
ISBN: 9781450333153 (print)
A mixed-grained reconfigurable computing platform targeting multiple-standard video decoding is proposed in this paper. The platform integrates eight coarse-grained Reconfigurable Processing Units (RPUs), each consisting of 16×16 multi-functional Processing Elements (PEs) and implemented in TSMC 65 nm technology, and two Altera Stratix IV EP4SE820 FPGAs. By exploiting dynamic reconfiguration of the RPUs and static reconfiguration of the FPGAs, the proposed platform achieves scalable performance and cost trade-offs to support a variety of video coding standards, including H.264, MPEG-2, AVS and HEVC. Two types of platform configuration are tested in this work. One configuration uses two RPUs and targets multiple-standard high-definition (HD) video decoding, while the other uses only one RPU, which runs at a lower frequency and targets standard-definition (SD) decoding. The HD configuration can decode 1920×1080 H.264 video streams at 30 frames per second (fps) at 200 MHz and 1920×1080 HEVC video streams at 30 fps at 236 MHz. It achieves a 25% performance gain over an industrial coarse-grained reconfigurable processor for H.264 decoding, and a 3.85× performance boost over the Intel i5 general-purpose CPU for HEVC decoding.
In this paper, an improved decoding algorithm for non-binary low-density parity-check (NB-LDPC) codes is proposed that has low decoding complexity and is suitable for field-programmable gate array (FPGA) implementation. The algorithm, a mixed logarithmic-domain FFT-BP decoding algorithm (Mixed Log-FFT-BP), addresses the high complexity of the existing decoding algorithms for non-binary LDPC codes. It combines the traditional Log-BP algorithm with the FFT-BP algorithm and simplifies the update of the check nodes in the iterative decoding process. A large number of convolution operations are converted into multiplication operations in the frequency domain by using the FFT and the IFFT. The multiplication of the original FFT-BP algorithm is converted into addition and look-up table operations in the logarithmic domain. The logarithm of the probability information is solved directly, so that decoding can be done in the logarithmic domain, which saves the computation of the log-likelihood ratio and in turn reduces the complexity. The results show that, under the additive white Gaussian noise channel, when the bit error rate is 10, compared with the BP, Log-BP, and FFT-BP algorithms, the performance of the Mixed Log-FFT-BP algorithm is not decreased, and all of them remain within a range of 0.1-0.2 dB.
ISBN: 9781509018987 (print)
This paper presents an optical transceiver for synchronous optical networking (SONET) that can serve both asymmetric digital subscriber line (ADSL) and optical packet switching (OPS), with packet processing handled by deficit round robin (DRR) and an RS232 interface. To resolve clock jitter, both the cycle decision and the reset function are used to synchronize the clock waveform. The proposed DRR performs packet processing with low delay and low loss. Moreover, the RS232 interface, which is integrated on the field-programmable gate array (FPGA) board, is adopted because it is easy to implement. The data to be processed is queued with DRR and sent from the RS232 port on the FPGA board (the transmitter) to the electrical/optical (E/O) converter. After passing through the optical fiber, the packet from the transmitter reaches the O/E converter and is then received at the RS232 port on another FPGA board. The received electrical packet is shown on the seven-segment display of the FPGA board to verify the transceiver function for SONET. The proposed architecture is designed in the Verilog hardware description language (Verilog HDL). According to the measured results, the data transfer rate is 115,200 bps with an FPGA operating frequency of 50 MHz and a fiber distance of 5 km.
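The abstract gives no details of its DRR design. As a generic illustration, the sketch below implements the textbook deficit round robin discipline in software: each non-empty queue earns a fixed quantum per round and may send packets as long as its accumulated deficit covers the packet size. The function name, the packet sizes, and the quantum are hypothetical, and the paper's Verilog implementation may differ.

```python
from collections import deque

def drr_schedule(queues, quantum):
    """Return the service order as (queue_index, packet_size) tuples.

    queues: list of lists of packet sizes, head of each queue first.
    quantum: credit added to a non-empty queue on each visit (must be > 0).
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    qs = [deque(q) for q in queues]
    deficit = [0] * len(qs)
    order = []
    while any(qs):                       # loop until every queue is drained
        for i, q in enumerate(qs):
            if not q:
                continue
            deficit[i] += quantum
            # send head-of-line packets while the deficit covers them
            while q and q[0] <= deficit[i]:
                size = q.popleft()
                deficit[i] -= size
                order.append((i, size))
            if not q:
                deficit[i] = 0           # an emptied queue keeps no credit
    return order
```

With queues `[[300, 300], [500], [100, 100, 100]]` and a quantum of 300, the 500-unit packet waits one round to accumulate enough credit, while the small packets in the third queue are all sent in the first round.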
ISBN: 9798400708268 (print)
To meet the increased demand for video display on small and medium-sized embedded devices, the design uses a universal asynchronous receiver/transmitter (UART) for data transmission and the video graphics array (VGA) standard video interface for image display. Serial communication implemented on a field-programmable gate array (FPGA) controls the VGA image display. The top-level VGA display module is decomposed into sub-modules following a modular design approach, and a hardware description language is used to build the circuit of each functional module. Logic function simulation is carried out with the ModelSim simulation software, and the final board-level verification is performed on a DE2-115.