ISBN: (print) 9783981926347
The shift from centralized cloud to edge computing demands hardware systems with data processing capability at ultra-low power. Reconfigurable solutions such as field-programmable gate arrays (FPGAs) offer high flexibility in terms of hardware implementation and are thus popular in many edge computing systems. However, breaking through the energy wall of FPGAs is a challenge, as low-power operation often requires compromising performance. In this paper, we study a low-power, high-performance FPGA architecture exploiting Resistive Random Access Memory (RRAM) technology. To perform a comprehensive analysis, we introduce a novel design flow that rapidly prototypes FPGA fabrics from which accurate area, delay, and power results can be obtained. Based on full-chip layouts and SPICE simulations, we show that RRAM-based FPGAs can improve area/delay/power by up to 8%/22%/16% compared to SRAM-based counterparts at nominal voltage. Even when operated at a near-Vt supply, the proposed RRAM-based FPGA can improve the Energy-Delay Product by about 2x without any delay overhead when compared to an SRAM-based FPGA. In addition, Monte Carlo simulations show that the proposed RRAM-based FPGA architecture stays robust under different CMOS process corners as well as under a 30% RRAM resistance standard deviation.
In this paper, two real-time architectures of medium access techniques for future generations of wireline and wireless communication systems are presented. One architecture is based on the discrete cosine transform (DCT), while the second implements a filter-bank multi-carrier (FBMC) system. A comparative analysis in terms of resource consumption, performance, and precision is shown. The comparison considers a floating-point model, a fixed-point model, and experimental tests. These models make it possible to evaluate the effect of fixed-point precision on the implementation and, in turn, to verify the correctness of the developed architecture. The simulation models and the experimental tests have been carried out in different practical environments to support further analysis. The two proposed architectures have been implemented on a field-programmable gate array (FPGA) device. Furthermore, the architectures have been included as advanced peripherals in a system-on-chip, which also integrates a soft microprocessor to monitor the whole system and manage the data transfers. As a communication scenario, the proposed architectures have been particularized to operate in real time while meeting all timing requirements defined by a broadband power line communications standard. For that case, the system has achieved the desired transmission rate of 62.5 Ms/s at the converters, with mean squared errors at the output for an ideal channel below 3 × 10^-5 for both the DCT and FBMC approaches, whereas each transmitter/receiver requires around 50% of the DSP cells available in the Xilinx XC6VLX240T FPGA, the most demanded resource in the device.
Online arithmetic operators reduce resource utilization and interconnection complexity and also provide pipelining at the digit level. Multiplierless constant coefficient multiplication using the shift-and-add technique is widely used in digital signal processing applications. This paper proposes a novel bit-serial adaptation of the parallel shift-and-add algorithm to online arithmetic. The proposed multipliers use right shifts instead of the traditional left shifts, resulting in causal online implementations. Graph-based and hybrid algorithms are developed for estimating the distance of a constant from a set of constants in terms of the number of additions, and for synthesizing online multiple constant multipliers under area and online delay constraints. The computational complexity of the algorithms is determined. Results of implementation on randomly generated constant sets and FIR filter instances show substantial improvements in the number of operations required using the distance heuristic. Further, it is shown that the proposed techniques and algorithms result in significant improvements in resource utilization, logic depth, and clock frequency compared to parallel and digit-serial algorithms.
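To make the shift-and-add idea concrete, the following is a minimal Python sketch of multiplierless multiplication by a positive integer constant. It is not the paper's online (right-shift, digit-serial) multiplier or its graph-based synthesis; it uses the conventional left-shift form with canonical signed digit (CSD) recoding, which is an assumption chosen here for illustration. The function names `csd_digits` and `shift_add_multiply` are hypothetical.

```python
def csd_digits(c: int) -> list[int]:
    """Recode a positive constant into signed digits (-1, 0, +1).

    Digits are returned least significant first, so that
    c == sum(d << i for i, d in enumerate(digits)).
    """
    digits = []
    while c != 0:
        if c & 1:
            # +1 when c % 4 == 1, -1 when c % 4 == 3 (avoids adjacent nonzeros)
            d = 2 - (c & 3)
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x: int, c: int) -> tuple[int, int]:
    """Multiply x by a positive constant c using shifts, adds, and subtracts only.

    Returns (product, number of add/subtract operations used).
    """
    acc = 0
    terms = 0
    for i, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << i
            terms += 1
        elif d == -1:
            acc -= x << i
            terms += 1
    # Combining k shifted terms takes k - 1 additions or subtractions.
    return acc, max(terms - 1, 0)
```

For example, `shift_add_multiply(5, 7)` evaluates `(5 << 3) - 5` and returns `(35, 1)`: one subtraction replaces the two additions that the plain binary form `4 + 2 + 1` would need.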
Due to the increasing demands of onboard sensor and autonomous processing, one of the principal needs and challenges for future spacecraft is onboard computing. Space computers must provide high performance and reliability (which are often at odds), using limited resources (power, size, weight, and cost), in an extremely harsh environment (due to radiation, temperature, vacuum, and vibration). As spacecraft shrink in size while assuming a growing role in science and defense missions, the challenges for space computing become particularly acute. For example, processing capabilities on CubeSats (a smaller class of SmallSats) have been extremely limited to date, often featuring microcontrollers with performance and reliability barely sufficient to operate the vehicle, let alone support various sensor and autonomous applications. This article surveys the challenges and opportunities of onboard computers for small satellites (SmallSats) and focuses upon new concepts, methods, and technologies that are revolutionizing their capabilities, in terms of two guiding themes: hybrid computing and reconfigurable computing. These innovations are of particular need and value to CubeSats and other SmallSats. With new technologies, such as the CHREC Space Processor (CSP), we demonstrate how system designers can exploit hybrid and reconfigurable computing on SmallSats to harness these advantages for a variety of purposes, and we highlight several recent missions by NASA and industry that feature these principles and technologies.
Although considerable effort has been put into the application of Non-Volatile Memories (NVMs) in field-programmable gate arrays (FPGAs), previously suggested designs are not mature enough to replace their state-of-the-art SRAM-based counterparts, mainly due to inefficient building blocks and/or the overhead of the programming structure, which can impair their potential benefits. In this paper, we present a Resistive Random Access Memory (RRAM)-based FPGA architecture employing efficient Switch Box (SB) and Look-Up Table (LUT) designs, with programming circuitry integrated in both, which yields area- and power-efficient programmable components while avoiding performance overhead in these blocks. In addition, we present an efficient scheme to load the configuration bitstream into the memory elements, which makes the configuration time comparable to that of SRAM-based FPGAs. We also investigate the correct functionality and reliability of the programming structure subject to fluctuations in the attributes of RRAM cells. Results obtained using the Versatile Place and Route (VTR) tool with the measured characteristics of the proposed blocks demonstrate that the average area and delay of the proposed FPGA architecture are 59.4% and 20.1% less than those of conventional SRAM-based FPGAs. Compared with a recent RRAM-based architecture, the proposed architecture improves area and power by 49.7% and 33.8%, respectively, while keeping the delay intact.
The computation of the electromagnetic transients in a power transformer with nonlinear material using the finite element method (FEM) is so dense that the traditional nonlinear solver employing the Newton-Raphson method can hardly execute in real time. In this paper, we emulate the finite-element computation of electromagnetic transients of a transformer in real time for the first time. The transmission line modeling (TLM) method employed in the FEM successfully decouples the nonlinear elements from the linear network so that the nonlinearities can be solved individually, which is well suited to parallel processing. The parallelism of the TLM-FE solution is explored and realized on a field-programmable gate array with deep data pipelining, and the implementation can execute in real time and provide detailed field information of the transformer during the transients. The proposed noniterative field-circuit coupling enables the transformer to interface with an external network, and a comparison with commercial FEM software confirms the accuracy and computational efficiency of the real-time FE model.
This paper presents i) an equivalent model of the half-bridge modular multilevel converter (HB-MMC) which is suitable for real-time applications, ii) a hybrid central-processing unit/field-programmable gate array (CPU/FPGA)-based architecture for real-time simulation of electromagnetic transients of systems which include HB-MMC, and iii) a novel arrangement for sorting results, referred to as the "sub-module (SM) rank list", which tackles the bottleneck for parallel implementation of the MMC arm model solver on the FPGA. The Adams-Bashforth (AB) method is used for numerical integration of the HB-SM capacitor model. The second-order AB method provides a constant admittance matrix of the HB-MMC and, thus, reduces the computational burden while offering the same accuracy as the widely used Trapezoidal method. The CPU/FPGA-based architecture is optimized to obtain maximum parallelism of the HB-MMC model implementation, adopting a standard, single-precision, floating-point computational engine. The proposed sorting arrangement is independent of the utilized sorting algorithm, and its application to the odd-even bubble sorting scheme is presented in this paper. The proposed architecture offers a simulation time-step of 825 ns while including the sorting module as the SM capacitor voltage-balancing control unit. This enables accurate analysis of MMC controls based on either software-in-the-loop or hardware-in-the-loop approaches. Performance and accuracy of the MMC model and the hybrid CPU/FPGA-based architecture are evaluated based on a set of case studies on a 401-level HB-MMC-based HVDC station and verified against offline simulation results in the PSCAD/EMTDC environment.
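The odd-even bubble sorting scheme mentioned above can be sketched in a few lines of Python. This is only a software illustration of the generic odd-even transposition sort applied to sub-module capacitor voltages; it does not reproduce the paper's "SM rank list" arrangement or its FPGA implementation. The function name `odd_even_sort_indices` and the sample voltages are hypothetical.

```python
def odd_even_sort_indices(voltages):
    """Return sub-module indices ordered by ascending capacitor voltage.

    Uses odd-even transposition sort: phases alternate between comparing
    pairs starting at even positions and pairs starting at odd positions.
    All comparisons within one phase touch disjoint pairs, so hardware can
    evaluate them in parallel; n phases are enough to sort n items.
    """
    n = len(voltages)
    order = list(range(n))
    for phase in range(n):
        start = phase % 2  # even phase: pairs (0,1),(2,3)...; odd phase: (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            if voltages[order[i]] > voltages[order[i + 1]]:
                order[i], order[i + 1] = order[i + 1], order[i]
    return order


# Hypothetical per-unit capacitor voltages of four sub-modules.
ranked = odd_even_sort_indices([1.02, 0.98, 1.05, 0.99])
```

Here `ranked` is `[1, 3, 0, 2]`, that is, sub-module 1 has the lowest voltage and sub-module 2 the highest. A voltage-balancing controller would typically pick sub-modules from one end of such a list depending on the arm current direction, though that selection logic is outside this sketch.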
Intel recently introduced the Heterogeneous Architecture Research Platform (HARP). In this platform, the Central Processing Unit and a Field-Programmable Gate Array are connected through a high-bandwidth, low-latency interconnect and both share DRAM. For this platform, Open Computing Language (OpenCL), a High-Level Synthesis (HLS) language, is made available. By making use of HLS, a faster design cycle can be achieved compared to programming in a traditional hardware description language. This, however, comes at the cost of having less control over the hardware implementation. We investigate how OpenCL can be applied to implement a real-time guided image filter on the HARP platform. In the first phase, the performance-critical parameters of the OpenCL programming model are defined using several specialized benchmarks. In the second phase, the guided image filter algorithm is implemented using the insights gained in the first phase. Both a floating-point and a fixed-point implementation were developed for this algorithm, based on a sliding window implementation. This resulted in a maximum floating-point performance of 135 GFLOPS, a maximum fixed-point performance of 430 GOPS, and a throughput of HD color images at 74 frames per second.
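As a rough illustration of what a guided filter computes, the sketch below applies the standard guided filter equations to a one-dimensional signal in plain Python. It is a simplification made for this example: the paper targets 2D images in OpenCL on an FPGA, and nothing here reflects its kernels, fixed-point format, or window size. The names `box_mean` and `guided_filter_1d`, the clipped-border handling, and the parameters `r` and `eps` are assumptions.

```python
def box_mean(x, r):
    """Mean over a sliding window of radius r, clipped at the borders."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)  # running sums make each window O(1)
    out = []
    for i in range(n):
        lo = max(0, i - r)
        hi = min(n, i + r + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def guided_filter_1d(guide, src, r, eps):
    """Filter src using guide: q = mean(a) * guide + mean(b).

    Per window: a = cov(guide, src) / (var(guide) + eps),
                b = mean(src) - a * mean(guide).
    eps > 0 controls smoothing and avoids division by zero.
    """
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    corr_ip = box_mean([g * s for g, s in zip(guide, src)], r)
    corr_ii = box_mean([g * g for g in guide], r)

    a, b = [], []
    for mi, mp, cip, cii in zip(mean_i, mean_p, corr_ip, corr_ii):
        var_i = cii - mi * mi
        cov_ip = cip - mi * mp
        ak = cov_ip / (var_i + eps)
        a.append(ak)
        b.append(mp - ak * mi)

    mean_a = box_mean(a, r)
    mean_b = box_mean(b, r)
    return [ma * g + mb for ma, mb, g in zip(mean_a, mean_b, guide)]
```

Every stage is a window mean or an element-wise operation, which is why a sliding-window hardware pipeline suits this filter: each output sample depends only on a fixed neighborhood of inputs.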
The exponentially increasing performance of chip multiprocessors (CMPs) predicted by Moore's Law is no longer due to the increasing clock rate of a single CPU core, but on account of the increase of core counts in the CMP. More transistors are integrated within the same footprint area as the technology node shrinks to deliver higher performance. However, this is accompanied by higher power dissipation that usually exceeds the coping capability of inexpensive cooling techniques. This Power Wall prevents the chip from running at full speed with all the devices powered on. This is known as the dark silicon problem. Another major bottleneck in CMP development is the imbalance between the CPU clock rate and memory access speed. This Memory Wall keeps the CPU from fully utilizing its compute power. To address both the Power and Memory Walls, we propose a monolithic 3D hybrid architecture that consists of a multi-core CPU tier, a fine-grain dynamically reconfigurable (FDR) field-programmable gate array (FPGA) tier, and multiple resistive RAM (RRAM) tiers. The FDR tier is used as an accelerator. It uses the concept of temporal logic folding to localize on-chip communication. The RRAM tiers are connected to the CPU and FDR tiers through an efficient memory interface that takes advantage of the tremendous bandwidth available from monolithic inter-tier vias and hides the latency of large data transfers. We evaluate the architecture on two types of benchmarks: compute-intensive and memory-intensive. We show that the architecture reduces both power and energy significantly at a better performance for both types of applications. Compared to the baseline, our architecture achieves an average of 43.1x and 2.5x speedup on compute-intensive and memory-intensive benchmarks, respectively. The power and energy consumption are reduced by 5.0x and 40.5x, respectively, for compute-intensive applications, and 2.0x and 4.2x, respectively, for memory-intensive applications. This translates to
As Moore's law meets bottlenecks, the demand for heterogeneous parallel processing systems is increasing. Field-programmable gate arrays (FPGAs) are becoming more efficient acceleration devices due to their powerful processing performance, and the CPU + FPGA architecture under the OpenCL framework has become the trend in heterogeneous parallel processing systems. This study focuses on the optimisation of the pulse compression algorithm on an FPGA based on OpenCL, which plays an important role in modern radar signal processing systems. Using a double cache for ping-pong storage of data between the matched filter and the inverse fast Fourier transform (IFFT), an optimised pipelined processing method is proposed and verified on an Arria 10 GX1150 FPGA with two groups of 2 GB DDR3; the results show that the proposed method achieves a 2.89x performance improvement over the conventional implementation.
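The matched-filter-then-IFFT structure of pulse compression can be sketched as follows. This is a plain Python illustration under stated assumptions, not the paper's OpenCL design: a naive O(n^2) DFT stands in for the FFT/IFFT so the example needs only the standard library, and the ping-pong buffering and pipelining are not modeled. The names `dft` and `pulse_compress`, the chirp parameters, and the 20-sample delay are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT/IFFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def pulse_compress(received, reference):
    """Frequency-domain matched filter.

    Computes IDFT(DFT(received) * conj(DFT(reference))), which is the
    circular cross-correlation of the received signal with the reference
    pulse. The reference is zero-padded to the received length.
    """
    n = len(received)
    ref = list(reference) + [0j] * (n - len(reference))
    spectrum_rx = dft(received)
    spectrum_ref = dft(ref)
    product = [r * h.conjugate() for r, h in zip(spectrum_rx, spectrum_ref)]
    return dft(product, inverse=True)


# Hypothetical scenario: a 16-sample linear FM chirp echoed at a 20-sample delay.
chirp = [cmath.exp(1j * cmath.pi * t * t / 16) for t in range(16)]
rx = [0j] * 64
rx[20:36] = chirp
compressed = pulse_compress(rx, chirp)
peak_index = max(range(len(compressed)), key=lambda m: abs(compressed[m]))
```

In this noise-free setup the compressed output peaks at the echo delay, so `peak_index` is 20: the 16-sample pulse is concentrated into a single dominant sample, which is the purpose of pulse compression.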