The Self-Organizing Map (SOM) algorithm is a clustering algorithm used in a wide variety of application domains. Over the last few decades, it has been accelerated using various hardware architectures, including FPGAs, CPUs, and GPUs. This publication presents a High-Level Synthesis-based implementation that utilizes multiple processing elements to realize a high-performance system architecture. An extensive design space exploration was conducted to evaluate the performance range of the architecture, using vector dimensions from 8 to 512 and map sizes from 16x16 to 512x512. The evaluation was performed on two different AMD/Xilinx UltraScale+ FPGA systems: the VCU128 PCIe-based accelerator card and the ZCU106 stand-alone evaluation kit. The results show that, for a given vector dimension, performance scales nearly linearly as the map size increases. In addition, the energy efficiency of both FPGAs was analyzed, revealing that while the ZCU106 offers less raw compute power, it requires up to 4x less power and, depending on the configuration, can be 2x more energy efficient than the VCU128. One of the main reasons is that it does not require a dedicated host system but uses its internal ARM cores. Finally, a comparison against state-of-the-art SOM implementations was conducted. The proposed design achieves speed-ups of up to 458.7x, 1,630.4x, and 4.9x compared to other CPU, GPU, and FPGA realizations, respectively.
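For readers unfamiliar with the algorithm being accelerated, the following is a minimal software sketch of one online SOM training step: a best-matching-unit (BMU) search followed by a neighbourhood-weighted update. It assumes the textbook formulation with squared Euclidean distance and a Gaussian neighbourhood; the abstract does not state which variant, distance metric, or neighbourhood function the hardware uses, and the names `make_map`, `best_matching_unit`, and `train_step` are illustrative only.

```python
import math
import random


def make_map(rows, cols, dim, seed=0):
    """Create a rows x cols map of randomly initialised weight vectors."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]


def best_matching_unit(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            # Squared Euclidean distance (assumed metric).
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_step(weights, x, lr, sigma):
    """One online update: pull neurons near the BMU toward the input x."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            # Gaussian neighbourhood on the map grid (assumed function).
            h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
            for i in range(len(w)):
                w[i] += lr * h * (x[i] - w[i])
    return br, bc
```

The two nested loops over the map are independent per neuron, which is presumably what makes the workload attractive for multiple processing elements; how the published design actually partitions the map is not described in the abstract.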
Power-electronics modeling and simulation of three-phase IGBT full-bridge inverter circuits are used extensively in the field of motor drives. The accuracy and computational efficiency of these models have a direct impact on the dependability of the motor control system. Most earlier research focused solely on the static on/off behavior of IGBTs, disregarding the transient processes that occur when three-phase IGBT full-bridge inverter circuits are switched at high frequencies. This limits the accuracy of real-time circuit simulation. Therefore, this paper proposes and builds a field-programmable gate array (FPGA)-based two-stage (steady-state and transient) model of the three-phase IGBT full-bridge inverter circuit that captures both the static and transient characteristics of the insulated gate bipolar transistor (IGBT) elements in the circuit. Depending on whether the switching states of the six IGBTs in the inverter circuit change, the simulation process is split into a steady-state phase and a transient phase. In the steady-state phase, which uses a large step size, the circuit is discretized using the binary L/C approach. In the transient phase, the transient process is divided into several time intervals with small step sizes. Real-time simulation waveforms are generated by solving the circuit's transient waveforms at small step sizes with a multistage fitting method and interleaving the result with the steady-state phase. Finally, to demonstrate the accuracy of the proposed circuit model, the simulation results of the FPGA-based two-stage model are compared with those of the conventional ideal model through waveform comparison and data analysis.
The fractional Fourier transform (FrFT) is a useful mathematical tool for signal and image processing. In some applications, the eigendecomposition-based discrete FrFT (DFrFT) is suitable due to its orthogonality, additivity, reversibility, and approximation of the continuous FrFT. Although recent studies have introduced reduced-arithmetic-complexity algorithms for DFrFT computation, which are attractive for real-time and low-power practical scenarios, reliable hardware architectures for them remain a gap in the literature. In this paper, we present two hardware architectures based on these algorithms to obtain the N-point DFrFT (N = 4L, L is a positive integer). We validate and compare the performance of these architectures using field-programmable gate array implementations co-designed with an embedded hard processor unit. In particular, we carry out computer experiments in which synthesis, error, and latency analyses are performed, and we consider an application related to compact signal representation.
Non-orthogonal multiple access (NOMA) has been widely regarded as the most promising technique for achieving high spectral efficiency in optical communication systems. However, the practical implementation of power-domain NOMA faces challenges related to inter-user interference and decoding complexity, limiting its multiplexing capability to a pair of users. In this paper, we experimentally demonstrate a hybrid multiple access scheme in a four-user underwater wireless optical communication (UWOC) system. Specifically, power-domain NOMA is employed to multiplex two users within a user pair (UP), while time division multiple access (TDMA) is utilized to separate the UPs. To validate the efficacy of the hybrid multiple access technique, robust watertight transceivers are designed and deployed in a 10-m outdoor pool. A calculation method based on the channel condition is first introduced to keep the received power within a proper range when users in the UP are randomly positioned. In addition, an adaptive stochastic gradient descent-based proportional-integral-derivative (SGD-PID) algorithm is proposed for practical scenarios where determining the system and channel parameters is difficult. Experimental results show that the proposed adaptive power control schemes can effectively enhance system performance under the different channel conditions experienced by users. The UWOC system achieves a data rate of 30 Mbps for each user while maintaining bit error rates (BERs) below the forward error correction (FEC) threshold. The results highlight the remarkable potential of the hybrid multiple access scheme along with our proposed adaptive power control algorithm.
Editor's notes: Side-channel attacks, like those exploiting power and timing, are generally thought to require physical access to a system. This research challenges that idea by demonstrating how such attacks can be carried out on remote systems without physical access. It emphasizes the need to rethink how modern shared-FPGA systems are designed, prioritizing security as a core consideration. -Jeyavijayan Rajendran, Texas A&M University, USA
Open Radio Access Networks (Open-RANs) require cost- and energy-efficient solutions to facilitate their large-scale deployment. A significant concern in multiple-input multiple-output (MIMO) systems employing traditional linear processing is the substantial number of radio frequency (RF) chains at the base station (BS), which is required to ensure accurate decoding of the spatially multiplexed streams. Recently, however, practical non-linear approaches, which facilitate near-optimal parallelizable tree searches, have been successfully implemented on actual systems and have demonstrated the capability to considerably reduce the required RF chains without affecting user performance. Similar to QR decomposition (QRD), which is used to perform channel inversion in linear systems, these non-linear approaches employ a sorted QRD (SQRD) to curtail the search complexity. However, the SQRD can be a significant bottleneck for general software-based non-linear solutions, preventing them from fully exploiting their gains. To address the latency limitations of SQRD, this work presents a high-throughput hardware accelerator based on reformulating the underlying modified Gram-Schmidt (MGS) process to extract more parallelism than previous designs. Implementations of the proposed architecture demonstrate at least 2-fold improvements in achievable throughput and processing latency over existing 4 x 4 and 8 x 8 field-programmable gate array (FPGA) implementations and can be scaled up to 16 x 16 MIMO systems. Furthermore, the proposed accelerator was integrated with a software framework that can considerably offload the processing burden for a higher number of streams under strict latency conditions.
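As background for the SQRD step mentioned above, the following is a minimal sketch of a sorted QR decomposition built on the sequential modified Gram-Schmidt process. It assumes the common textbook ordering rule (at each step, pick the remaining column with the smallest residual norm), a real-valued channel matrix with full column rank, and a hypothetical function name `sorted_qrd_mgs`. It is the baseline sequential algorithm, not the reformulated, more parallel variant that the abstract reports.

```python
import math


def sorted_qrd_mgs(H):
    """Sorted QRD of a real m x n matrix given as a list of rows.

    Returns (Q, R, perm) with Q as a list of rows, R upper triangular,
    and perm such that column c of Q*R equals column perm[c] of H.
    Assumes H has full column rank (no zero residual norms).
    """
    m, n = len(H), len(H[0])
    # Work on columns: cols[c] is the c-th column of H.
    cols = [[float(H[r][c]) for r in range(m)] for c in range(n)]
    R = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for i in range(n):
        # Sorting step: choose the remaining column with minimum norm.
        norms = [sum(v * v for v in cols[k]) for k in range(i, n)]
        k = i + norms.index(min(norms))
        cols[i], cols[k] = cols[k], cols[i]
        perm[i], perm[k] = perm[k], perm[i]
        for row in R:
            row[i], row[k] = row[k], row[i]
        # Normalise the chosen column.
        R[i][i] = math.sqrt(sum(v * v for v in cols[i]))
        cols[i] = [v / R[i][i] for v in cols[i]]
        # MGS: remove the new direction from all remaining columns.
        for j in range(i + 1, n):
            R[i][j] = sum(a * b for a, b in zip(cols[i], cols[j]))
            cols[j] = [b - R[i][j] * a for a, b in zip(cols[i], cols[j])]
    Q = [[cols[c][r] for c in range(n)] for r in range(m)]
    return Q, R, perm
```

In this sequential form, iteration i cannot start until iteration i-1 has updated every remaining column, which illustrates the kind of dependency that can make the step a latency bottleneck.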
Stochastic Computing (SC) has recently gained traction due to its inherent error tolerance and extremely simple arithmetic hardware, making it particularly effective for accelerating modern applications such as neural networks on resource-constrained devices. Traditional SC architectures often adopt binary computing principles, relying on dedicated hardware, i.e., AND gates for multiplication and multiplexers (MUXs) for addition. However, SC's mathematical foundation enables the fusion of complex operations into remarkably simple hardware. Several SC studies have demonstrated the potential of MUX-based architectures to perform multiply-and-accumulate (MAC) operations, but existing designs face correlation complications, scaling problems, and limited application scope. This paper introduces an auxiliary logic block to address the complexity of MUX select inputs, significantly enhancing the scalability of MUX-based MAC operations. The proposed approach has been validated through SC image processing tasks, including grayscale conversion and Sobel edge detection, achieving up to 75% reduction in hardware resource utilization on field-programmable gate arrays (FPGAs) and up to 96% improvement in computational accuracy compared to traditional AND/XNOR-based SC multipliers.
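The traditional SC primitives named above can be illustrated with a small software model. This sketch assumes unipolar encoding (a value p in [0, 1] is the probability of a 1 in the bitstream) and independent input streams; the helper names are illustrative. It models only the conventional AND-gate multiplier and MUX scaled adder, not the auxiliary logic block proposed in the paper.

```python
import random


def to_stream(p, length, rng):
    """Encode probability p as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def value(stream):
    """Decode a unipolar bitstream: the fraction of ones."""
    return sum(stream) / len(stream)


def sc_multiply(a, b):
    """AND gate: for independent streams, P(out=1) = P(a=1) * P(b=1)."""
    return [x & y for x, y in zip(a, b)]


def sc_scaled_add(a, b, select):
    """2:1 MUX: with P(select=1) = 0.5, the output encodes (a + b) / 2."""
    return [y if s else x for x, y, s in zip(a, b, select)]


if __name__ == "__main__":
    rng = random.Random(1)
    n = 20000
    a, b = to_stream(0.5, n, rng), to_stream(0.4, n, rng)
    sel = to_stream(0.5, n, rng)
    print("product ~ 0.20:", value(sc_multiply(a, b)))
    print("scaled sum ~ 0.45:", value(sc_scaled_add(a, b, sel)))
```

The results are estimates whose error shrinks with stream length, and they degrade if the input streams are correlated, which is the correlation issue the abstract refers to.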
In this article, we demonstrate a real-time, multi-user downlink underwater wireless optical communication (UWOC) system for practical applications implemented on field programmable gate arrays (FPGAs). The performance of the established UWOC system is experimentally investigated under diverse channel conditions. The system utilizes arrayed light emitting diodes (LEDs) as the transmitter and employs optical superimposition-based non-orthogonal multiple access (NOMA). The feasibility of employing the arrayed LEDs as the transmitter under different channel conditions is validated by simulating the optical intensity distributions of the LED groups. Both the simulation and experimental results reveal that the bit error rate (BER) of user 1 (the higher-power user) decreases with increasing power allocation ratio (PAR), while the BER of user 2 (the lower-power user) increases with increasing PAR. Uniform post-equalization is employed to extend the bandwidth of each LED group. In addition, a metal-oxide-semiconductor field-effect transistor (MOSFET)-based driver circuit and a remaining carrier sweep-out circuit unit (CSCU) are proposed to enable independent control of the different LED groups and to enhance the response speed of the LEDs; together they outperform both a bias-tee and a single MOSFET-based driver circuit. Experimental results also indicate that NOMA exhibits superior spectral efficiency compared to conventional time division multiple access (TDMA). A PAR of approximately 2:1 is appropriate to ensure both users operate within the forward error correction (FEC) threshold. Based on the proposed schemes, the real-time UWOC system achieves a data rate of 40 Mbps for both users with BER below the FEC limit under different channel conditions. The results highlight the significant potential of the designed UWOC system to effectively meet diverse real-time, multi-user UWOC application requirements.
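The superposition idea behind two-user power-domain NOMA, and why user 2 needs the stronger signal removed first, can be sketched in a few lines. This is a noise-free toy model under several assumptions that are not taken from the abstract: symbols are bipolar (+1/-1, i.e., with any optical DC bias already removed), PAR is treated as an amplitude ratio, and decoding uses successive interference cancellation (SIC). The function names are hypothetical.

```python
def superimpose(bits1, bits2, par=2.0):
    """Superimpose two users' bits with amplitude ratio par:1 (assumed)."""
    a1 = par / (par + 1.0)   # higher-power user 1
    a2 = 1.0 / (par + 1.0)   # lower-power user 2
    return [a1 * (2 * b1 - 1) + a2 * (2 * b2 - 1)
            for b1, b2 in zip(bits1, bits2)]


def sic_decode(rx, par=2.0):
    """Decode user 1 directly, subtract it, then decode user 2."""
    a1 = par / (par + 1.0)
    # User 1 dominates the sum, so its bit is the sign of the sample.
    bits1 = [1 if s >= 0 else 0 for s in rx]
    # Remove user 1's contribution to expose user 2's symbol.
    residual = [s - a1 * (2 * b - 1) for s, b in zip(rx, bits1)]
    bits2 = [1 if s >= 0 else 0 for s in residual]
    return bits1, bits2


if __name__ == "__main__":
    u1 = [1, 0, 1, 1, 0, 0]
    u2 = [0, 0, 1, 0, 1, 1]
    print(sic_decode(superimpose(u1, u2)))
```

In this model a larger PAR widens user 1's decision margin and shrinks user 2's, which is consistent with the opposing BER trends the abstract reports for the two users.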
Homomorphic encryption is an important technology for protecting data privacy, and the performance of modular multiplication directly affects the efficiency of homomorphic encryption. Currently, there are numerous FPGA-based acceleration techniques targeting modular multiplication. However, many of these implementations require substantial hardware resources or suffer from resource imbalance, which lowers throughput. Therefore, we present a novel FPGA-based implementation of Montgomery modular multiplication aimed at addressing these challenges. Our design chooses the radix bit width and word size based on the bit width of the digital signal processing (DSP) blocks rather than the conventional powers of two. We aim to instantiate more modular multipliers using limited resources while minimizing latency. We also introduce a novel DSP cascade structure, called parallel grouping cascade DSP, which reduces the number of clock cycles of the internal multipliers. To balance the ratio of lookup table (LUT) and DSP usage, we also replace some DSPs with multipliers implemented in LUTs. Our results, implemented on a Xilinx Virtex-7 field-programmable gate array (FPGA), demonstrate more than 27% improvement in throughput for 1024-bit modular multiplication and more than 70% improvement for 2048-bit modular multiplication compared to the best previous state-of-the-art references.
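For context, Montgomery modular multiplication replaces the division by the modulus with shifts and masks by working in a scaled residue domain. The sketch below is the textbook full-width reduction (REDC) on Python integers, assuming an odd modulus n and R = 2**k > n; the function names are illustrative. It does not model the radix, DSP-width word size, or cascade structure described in the abstract.

```python
def montgomery_setup(n, k):
    """Return n' = -n^{-1} mod R for odd n and R = 2**k > n."""
    r = 1 << k
    return (-pow(n, -1, r)) % r  # modular inverse needs Python 3.8+


def redc(t, n, k, n_prime):
    """Montgomery reduction: t * R^{-1} mod n, for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u


def mont_mul(a, b, n, k):
    """Compute a * b mod n via the Montgomery domain (0 <= a, b < n)."""
    n_prime = montgomery_setup(n, k)
    r2 = pow(1 << k, 2, n)                    # R^2 mod n
    a_bar = redc(a * r2, n, k, n_prime)       # a * R mod n
    b_bar = redc(b * r2, n, k, n_prime)       # b * R mod n
    prod_bar = redc(a_bar * b_bar, n, k, n_prime)  # a * b * R mod n
    return redc(prod_bar, n, k, n_prime)      # back to a * b mod n


if __name__ == "__main__":
    n, k = 97, 8
    print(mont_mul(45, 76, n, k), (45 * 76) % n)
```

Converting into and out of the Montgomery domain costs extra reductions, so the method pays off when many multiplications are chained, as in modular exponentiation; hardware designs typically split `redc` into word-sized steps instead of the single wide multiplication used here.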
The Pietra-Ricci index detector (PRIDe) has been recently proposed as one of the simplest techniques for centralized, data-fusion cooperative spectrum sensing, attaining robustness against time-varying signal and noise levels, constant false alarm rate, and high detection power. In this paper, we propose the design and implementation of the PRIDe detector, targeting field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) solutions. Novel approaches are proposed for computing the PRIDe test statistic, including the absolute value of complex quantities, the complex multiplier-accumulator, and the spectrum occupancy decision. The absolute value operation, which dominates the computational cost of the PRIDe test statistic, uses the coordinate rotation digital computer (CORDIC) algorithm as a low-latency and resource-efficient option. Register transfer level (RTL) and Monte Carlo simulations show that the resulting ultra-low-latency PRIDe detector architectures incur no performance loss with respect to floating-point simulations. One of the two proposed ASIC design versions of the PRIDe sensor occupies 34.9% less area than the most area-efficient sensor reported in the literature, whereas the other is 5.7x faster than the fastest state-of-the-art sensor. In a nutshell, the proposed detector architecture delivers the highest area and power efficiencies, considering the scaled values of the area-time product (ATP) and power-delay product (PDP) metrics, in comparison to implementations reported to date.
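To show how CORDIC can produce the absolute value of a complex number using only shift-and-add style steps, the following is a floating-point model of vectoring-mode CORDIC. It is a sketch under stated assumptions: a hardware version would use fixed-point arithmetic with real bit shifts and a precomputed gain constant, the iteration count of 16 is arbitrary, and `cordic_abs` is a hypothetical name. The paper's actual architecture is not described in the abstract.

```python
import math


def cordic_abs(re, im, iterations=16):
    """Approximate |re + j*im| with vectoring-mode CORDIC."""
    # Fold into the right half-plane so the rotations converge.
    x, y = abs(re), im
    for i in range(iterations):
        shift = 2.0 ** -i  # a right shift by i bits in hardware
        if y >= 0:
            # Rotate clockwise to drive y toward zero.
            x, y = x + y * shift, y - x * shift
        else:
            # Rotate counter-clockwise.
            x, y = x - y * shift, y + x * shift
    # Each micro-rotation stretches the vector by sqrt(1 + 2^(-2i)).
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    return x / gain


if __name__ == "__main__":
    print(cordic_abs(3.0, 4.0), abs(complex(3.0, 4.0)))
```

Because the gain depends only on the iteration count, it can be a constant in hardware, or folded into a downstream threshold so that no multiplier is needed for the correction.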