Despite FPGAs rapidly evolving to support the requirements of the most demanding emerging applications, their high static power consumption, concentrated within the routing resources, still presents a major hurdle for low-power applications. Augmenting FPGAs with power-gating ability is a promising way to address this obstacle. However, the main challenge when implementing power gating is choosing the clusters of resources in a way that allows the most power-saving opportunities. In this paper, we take advantage of machine learning approaches, such as K-means clustering, to propose efficient algorithms for creating power-gating clusters of FPGA routing resources. In the first group of proposed algorithms, we employ K-means clustering and exploit the utilization pattern of routing resources. In the second group of algorithms, we enhance the power-gating efficiency by minimizing the power overhead introduced by power-gating logic and by taking into account the size of routing multiplexers, which influences the power-gating efficiency. Finally, we enhance and further develop the baseline FPGA routing algorithm to be aware of, and take advantage of, power-gating opportunities. The experimental results on the Titan benchmark suite and the latest Intel Stratix-IV FPGA architecture in VTR 8.0 show that our approaches achieve an improvement of about 70%, on average, in reducing the FPGA static power consumption over the best power-gating approaches proposed in previous studies.
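The clustering idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: the binary utilization matrix (rows are hypothetical routing multiplexers, columns are hypothetical benchmark designs), the choice of k = 2, the deterministic initialisation, and the helper name `gateable_pairs` are all assumptions made for illustration. The sketch groups resources with similar usage patterns and counts how often a whole cluster is idle and could therefore be gated.

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; centroids start at the first k distinct points."""
    centroids = []
    for p in points:
        if list(p) not in centroids:
            centroids.append(list(p))
        if len(centroids) == k:
            break
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def gateable_pairs(usage, labels, k):
    """Count (cluster, design) pairs in which no member of the cluster is used."""
    count = 0
    for c in range(k):
        members = [row for row, lab in zip(usage, labels) if lab == c]
        for design in range(len(usage[0])):
            if members and not any(row[design] for row in members):
                count += 1
    return count


# Hypothetical data: usage[r][d] == 1 if routing resource r is used by design d.
usage = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
labels, _ = kmeans(usage, k=2)
print(labels)
print(gateable_pairs(usage, labels, 2))
```

Resources that are idle in the same designs end up in the same cluster, so a single gating switch per cluster is off more often than it would be for an arbitrary grouping.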
The Square Kilometre Array Low is a next-generation radio telescope, consisting of 512 antenna stations spread over 65 km, to be built in Western Australia. The correlator and beamformer (CBF) design is central to the telescope signal processing. The CBF receives 6 terabits per second (Tbps) of station data continuously and processes it in real time with a compute load of 2 peta-operations per second (Pops). The correlator calculates up to 22 million cross products between all pairs of stations, whereas the beamformers (BFs) coherently sum station data to form more than 500 beams. The output of the correlator is up to 7 Tbps, and that of the BF 2 Tbps. The design philosophy, called "Atomic COTS," is based on commercial off-the-shelf (COTS) hardware. Data routing is implemented in network switches programmed using the Programming Protocol-Independent Packet Processors (P4) language, and the signal processing occurs in COTS field-programmable gate array (FPGA) cards. The P4 language allows routing to be determined from the metadata in the Ethernet packets from the stations; that is, metadata describing the contents of the packet determines the routing. Each FPGA card takes in a fraction of the overall bandwidth for all stations and then implements the processing needed to generate complete science data products. Generation of complete science products in a single FPGA is referred to here as Atomic processing. A Tango distributed control system configures the multitude of processing modes and maintains the overall health of the CBF system hardware. The resulting 6 Tbps in and 9 Tbps out, 2 Pops Atomic COTS network-attached accelerator occupies five racks and consumes 60 kW. © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 International License.
In this paper, we introduce an FPGA-based processor for elliptic curve cryptography on Koblitz curves. The processor specifically targets applications requiring very high speed. The processor is optimized for performing scalar multiplications, which are the basic operations of every elliptic curve cryptosystem, on only one specific Koblitz curve; support for other curves is achieved by reconfiguring the FPGA. We combine efficient methods from various recent papers into a very efficient processor architecture. The processor includes carefully designed processing units dedicated to different parts of the scalar multiplication in order to increase performance. The computation is pipelined, providing simultaneous processing of up to three scalar multiplications. We provide experimental results on an Altera Stratix II FPGA demonstrating that the processor computes a single scalar multiplication on average in 11.71 µs and achieves a throughput of 235,550 scalar multiplications per second on NIST K-163. © 2010 Elsevier B.V. All rights reserved.
Quarter-pixel accuracy and variable block size significantly enhance the compression performance of the MPEG-4 AVC/H.264 video compression standard over its predecessors, but also significantly increase computation requirements. First, a digital signal processor (DSP)-based solution that achieves real-time integer motion estimation is proposed. Fractional-pixel refinement is too computationally intensive to be processed efficiently on a software-based processor. To address this restriction, a flexible and low-complexity VLSI subpixel refinement coprocessor is designed. Thanks to an improved datapath, a high throughput is achieved with few logic resources. Finally, a heterogeneous (DSP and field-programmable gate array) solution to handle real-time motion estimation with variable block size and fractional-pixel accuracy for high-definition video is studied. This solution, combining programmability and efficiency, achieves motion estimation of 720p sequences at up to 60 fps.
The problem of finite state machine (FSM) encoding for low power in field-programmable gate arrays (FPGAs) is addressed. In this technology, one-hot encoding is typically recommended for large FSMs and binary encoding for small FSMs. A partitioned encoding approach is proposed which uses a combination of both binary encoding and zero-one-hot encoding with intermediate code size. Experimental results demonstrate that the proposed encoding approach can produce significant power savings.
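To make the code-size trade-off in the abstract above concrete, the following sketch compares the three standard encodings and one possible partitioned scheme. The partitioned layout shown here (a binary field selecting a partition, followed by a zero-one-hot field selecting the state inside it) is an assumption chosen for illustration and is not necessarily the scheme proposed in the paper; all function names are hypothetical.

```python
def binary_bits(n):
    """Flip-flops needed to binary-encode n states."""
    return max(1, (n - 1).bit_length())


def binary_code(i, n):
    return format(i, f"0{binary_bits(n)}b")


def one_hot_code(i, n):
    """n bits, exactly one of them set."""
    return "0" * (n - 1 - i) + "1" + "0" * i


def zero_one_hot_code(i, n):
    """n - 1 bits: state 0 is all zeros, every other state sets a single bit."""
    width = n - 1
    if i == 0:
        return "0" * width
    return "0" * (width - i) + "1" + "0" * (i - 1)


def partitioned_code(i, n, parts):
    """Assumed scheme: binary partition index followed by zero-one-hot offset."""
    size = -(-n // parts)  # ceiling division: states per partition
    group, offset = divmod(i, size)
    return binary_code(group, parts) + zero_one_hot_code(offset, size)


n = 16
codes = [partitioned_code(i, n, parts=4) for i in range(n)]
print("binary width:     ", len(binary_code(0, n)))
print("one-hot width:    ", len(one_hot_code(0, n)))
print("partitioned width:", len(codes[0]))
print(codes[5])
```

For 16 states split into 4 partitions, the assumed scheme uses 5 bits per state, which sits between the 4 bits of binary encoding and the 16 bits of one-hot encoding.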
The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The Processing Elements (PEs) used in our algorithms are modular so that it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on the Cray XD1, a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of the XD1.
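The interplay between problem size and on-chip memory mentioned in the abstract above can be sketched with a generic blocked matrix multiplication. This is a standard software technique swapped in for illustration, not one of the paper's three algorithms and not a model of its linear PE array; the `block` parameter is a hypothetical stand-in for the tile size that would fit in on-chip memory.

```python
def blocked_matmul(A, B, block):
    """Multiply A (n x m) by B (m x p) one block x block tile at a time."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, p, block):
            for k0 in range(0, m, block):
                # Only one tile each of A, B and C is touched in this inner
                # region, which is what bounds the working-set size.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, m)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + block, p)):
                            C[i][j] += a * B[k][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_matmul(A, B, block=1))
```

The result does not depend on `block`; only the order of the accumulations and the size of the working set change, which is why such a parameter can be tuned to the available memory without affecting correctness.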
In modern high-power electrical drives, the efficiency of the system is a crucial constraint. Moreover, the efficiency of power converters plays a fundamental role in modern applications that also require limited weight, such as electric vehicles and novel more-electric aircraft. The reduction of losses pushes for systems with a dc bus and a high number of dc/ac converters distributed throughout the vehicle, without the burden of an overly expensive data processing system. The purpose of this article is to contribute to reducing losses by proposing an innovative selective harmonic mitigation method based on the identification of the working areas where the reference harmonics present lower amplitudes. In particular, the main objective is to find a new way to calculate the control angles in real-time operation without solving nonlinear equations, whose resolution would require expensive controllers. Through a very simple approach, the polynomial equations that drive the control angles were identified for a three-phase five-level cascaded H-bridge inverter and implemented in a digital system for real-time operation with a low computational cost. A comparison between the simulated and experimental behavior is presented. In the last part of this article, a real electric machine is driven by considering the appropriate working areas, and the current harmonics are also evaluated.
Using a field-programmable gate array (FPGA) development board, a digital signal processor (DSP) builder, and the phase-to-amplitude conversion principle, a low-cost system for measuring the amplitude-to-amplitude (AM/AM) and amplitude-to-phase (AM/PM) distortion curves of radio frequency (RF) power amplifiers (PAs) is presented. The state of the art based on the measurements and preliminary studies of AM/AM and AM/PM distortion curves is discussed. A full digital control of the test bed simulated/emulated in Matlab/Simulink is introduced to recalculate the known AM/AM and AM/PM measurements stored as a look-up table (LUT). Finally, the low-cost system comprises the memory polynomial model (MPM) that involves the nonlinearity order and memory effects of real PAs. © 2015 Elsevier B.V. All rights reserved.
In this paper, we investigate the impact of two pulse shapes on the performance of a real-time free-space optical (FSO) communication link. The two candidate pulse shapes are the square-root raised cosine (SRRC) pulse and the Xia pulse, which are tested as the basis functions for multi-band carrier-less amplitude and phase modulation. We first develop a real-time system based on a Xilinx Zynq ZCU102 system-on-chip platform utilising a high-resolution analogue-to-digital converter. We then use it to generate multi-band carrier-less amplitude and phase modulation formats and test the error vector magnitude whilst varying parameters. We emulate the fog environment utilising neutral density filters and evaluate the error performance of the link under increasingly poor visibility conditions. We show that, contrary to previous reports, the SRRC pulse shape offers superior performance over the first-order Xia pulse in the FSO environment operating at data rates exceeding 1 Gb/s.
Video processing algorithms are computationally intensive and place stringent requirements on performance and on the efficiency of memory bandwidth and capacity. As such, efficient hardware acceleration is indispensable for fast video processing systems. In this paper, we propose a resource- and power-optimized FPGA-based configurable architecture for video object detection that integrates noise estimation, Mixture-of-Gaussian background modeling, motion detection, and thresholding. Due to the large number of background modeling parameters, we propose a novel Gaussian parameter compression technique suitable for resource- and power-constrained embedded video systems. The proposed architecture is simulated, synthesized, and verified for its functionality, accuracy, and performance on a Virtex-5 FPGA-based embedded platform by directly interfacing to a digital video input. Intentional exploitation of heterogeneous resources in FPGAs and advanced design techniques such as heavy pipelining and data parallelism yield real-time processing of HD-1080p video streams at 30 frames per second. Objective and subjective comparisons against existing hardware-based methods show that the proposed architecture obtains orders-of-magnitude performance improvements while utilizing minimal hardware resources. This work is an early attempt to map a complete video surveillance system onto a stand-alone, resource-constrained, FPGA-based smart camera.