ISBN (print): 9781450394178
Algorithms for mobile networking are increasingly being moved from centralized servers towards the edge in order to decrease latency and improve the user experience. While much of this work has traditionally been done using ASICs, 6G emphasizes the adaptability of algorithms to specific user scenarios, which motivates broader adoption of FPGAs. In this paper, we propose the FPGA-based Weightless Intrusion Warden (FWIW), a novel solution for detecting anomalous network traffic on edge devices. While prior work in this domain is based on conventional deep neural networks (DNNs), FWIW incorporates a weightless neural network (WNN), a table-lookup-based model which learns sophisticated nonlinear behaviors. This allows FWIW to achieve accuracy far superior to prior FPGA-based work at a very small fraction of the model footprint, enabling deployment on small, low-cost devices. FWIW achieves a prediction accuracy of 98.5% on the UNSW-NB15 dataset with a total model parameter size of just 192 bytes, reducing error by 7.9× and model size by 262× vs. LogicNets, the best prior edge-optimized implementation. Implemented on a Xilinx Virtex UltraScale+ FPGA, FWIW demonstrates a 59× reduction in LUT usage with a 1.6× increase in throughput. The accuracy of FWIW comes within 0.6% of the best-reported result in the literature (Edge-Detect), a model several orders of magnitude larger. Our results make it clear that WNNs are worth exploring in the emerging domain of edge networking, and suggest that FPGAs are capable of providing the extreme throughput needed.
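To make the table-lookup idea concrete, here is a minimal sketch of a WiSARD-style weightless neural network, the model family the abstract refers to. All sizes, seeds, and the two-class setup are illustrative assumptions, not the FWIW architecture itself.

```python
# Minimal WiSARD-style weightless neural network (WNN) sketch.
# Training writes 1s into RAM tables addressed by tuples of input bits;
# inference just counts table hits -- no multiplications, no weights.
import random

class Discriminator:
    """One per class: a set of RAM tables indexed by tuples of input bits."""
    def __init__(self, n_bits, tuple_size, seed=0):
        rng = random.Random(seed)
        order = list(range(n_bits))
        rng.shuffle(order)                        # random input-to-tuple mapping
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, n_bits, tuple_size)]
        self.rams = [set() for _ in self.tuples]  # sparse RAM: store seen addresses

    def _addresses(self, bits):
        for idx, tup in enumerate(self.tuples):
            yield idx, tuple(bits[i] for i in tup)

    def train(self, bits):
        for idx, addr in self._addresses(bits):
            self.rams[idx].add(addr)              # write a 1 at this address

    def score(self, bits):
        return sum(addr in self.rams[idx]         # count matching table lookups
                   for idx, addr in self._addresses(bits))

# Two-class anomaly detector: pick the class whose discriminator scores highest.
# (Same seed so both discriminators share one input mapping.)
normal = Discriminator(8, 2, seed=1)
attack = Discriminator(8, 2, seed=1)
normal.train([0, 0, 0, 0, 1, 1, 1, 1])
attack.train([1, 1, 1, 1, 0, 0, 0, 0])
sample = [0, 0, 0, 0, 1, 1, 1, 1]
pred = "normal" if normal.score(sample) >= attack.score(sample) else "attack"
# pred -> "normal"
```

Because inference is pure table lookup, each discriminator maps naturally onto FPGA LUTs, which is the property the paper exploits.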
ISBN (print): 9781450361378
Interactive intelligent services (e.g., smart web search) are becoming essential datacenter workloads. They rely on data-intensive artificial intelligence (AI) algorithms that cannot use batch computation due to their tight latency constraints. Since off-chip data accesses have higher latency and energy consumption than on-chip accesses, a persistent AI approach with the entire model stored in on-chip memory is becoming the new norm for real-time AI. This approach is the cornerstone of Microsoft's Brainwave FPGA-based AI cloud and was recently added to Nvidia's cuDNN library. In this work, we implement, optimize, and evaluate a Brainwave-like neural processing unit (NPU) on a large Stratix 10 FPGA. We benchmark it against a large Nvidia Volta GPU running cuDNN persistent AI kernels. Across real-time persistent RNN, GRU, and LSTM workloads, we show that Stratix 10 offers ~3× (FP32) and ~10× (INT8) better latency than the GPU (FP32), which uses only ~6% of its peak throughput. Then, we propose TensorRAM, an ASIC chiplet for persistent AI that is 2.5D-integrated with an FPGA in the same package. TensorRAM enhances the on-chip memory capacity and bandwidth, with enough multi-precision INT8/4/2/1 throughput to match that bandwidth. Multiple TensorRAMs can be integrated with Stratix 10. Our evaluation shows that a small 32 mm² TensorRAM on 10 nm offers 64 MB of SRAM with 32 TB/s on-chiplet bandwidth and 64 TOP/s (INT8). A small Stratix 10 with a TensorRAM (INT8) offers 16× better latency and 34× better energy efficiency compared to the GPU (FP32). Overall, Stratix 10 with TensorRAM offers compelling and scalable persistent AI solutions.
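A back-of-the-envelope calculation shows why on-chip weights dominate real-time latency. A batch-1 matrix-vector pass must stream every weight once, so the step is memory-bound. The 64 MB and 32 TB/s figures come from the abstract; the off-chip bandwidth is an assumed typical DDR4 value used only for contrast.

```python
# Memory-bound latency model for one persistent-RNN time step:
# every weight byte is read once, so latency ~= weight_bytes / bandwidth.
def stream_latency_us(weight_bytes, bandwidth_bytes_per_s):
    """Microseconds to stream a weight set once at the given bandwidth."""
    return weight_bytes / bandwidth_bytes_per_s * 1e6

MB, GB, TB = 2**20, 10**9, 10**12
on_chiplet = stream_latency_us(64 * MB, 32 * TB)   # TensorRAM figures: ~2.1 us
off_chip   = stream_latency_us(64 * MB, 100 * GB)  # assumed DDR4-class: ~671 us
```

Two orders of magnitude separate the two, which is the quantitative case for keeping the entire model resident on-chip.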
ISBN (print): 0780389204
A highly pipelined JPEG2000 encoder is implemented with low memory usage: dual buffers store the wavelet coefficients, and pre-rate allocation is used to reduce on-chip RAM. A pipelined, parallel architecture is used in the discrete wavelet transform (DWT), bit-plane encoder (BPE), and arithmetic encoder (AE) to increase coding speed, and a byte representation of the rate-distortion (RD) slope simplifies the search for the threshold value used to truncate passes in post-compression rate-distortion optimization (PCRD) in Tier-2. Packet formation, clock distribution, and the asynchronous interface are also presented. The encoder is verified on an FPGA platform with the following performance: the tile size is up to 256 × 256 with 32 × 32 code blocks; the input sampling rate is up to 45 Msamples/s with Tier-1 running at a 100 MHz clock; and the PSNR difference between images compressed by the encoder and by JasPer is less than 0.5 dB at a rate of 0.4 bits per sample (bps). The synthesized design uses about 109 K equivalent gates and 862 Kb of on-chip RAM.
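For readers unfamiliar with the DWT stage of such a pipeline, here is a one-level 1-D Haar transform, the simplest relative of the lifting DWT a JPEG2000 encoder implements in hardware. JPEG2000 itself uses the 5/3 or 9/7 filters; Haar is used here only to show the averaging/differencing shape of the transform.

```python
# One-level 1-D Haar DWT: each pair of samples becomes one low-pass
# (average) coefficient and one high-pass (difference) coefficient.
def haar_dwt_1d(x):
    """Return (approximation, detail) coefficients for an even-length signal."""
    assert len(x) % 2 == 0, "signal length must be even"
    approx = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return approx, detail

a, d = haar_dwt_1d([9, 7, 3, 5])   # a == [8.0, 4.0], d == [1.0, -1.0]
```

In the encoder, a 2-D version of this transform is applied row-wise and column-wise per tile, which is why the dual coefficient buffers mentioned above are needed between pipeline stages.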
In order to develop high-performance computer systems efficiently, environments for evaluating architectural ideas are required. Software environments such as simulators are very flexible and thus often used. On the other hand, if the target hardware is complex and large, it is very hard to finish the simulation in practical time because of software's slow simulation speed. Thus, we develop a hardware environment for efficient evaluation of computer systems. We propose and develop an IBM PC-compatible SoC on an FPGA on which hardware developers can evaluate their custom architectures. The SoC has an x86 soft-core processor which can run general-purpose operating systems. By making the proposed system run on FPGAs of the two major vendors, i.e. Xilinx and Altera, we believe that it can be widely adopted. Besides, the SoC can be used for learning about computer systems because of its open-source policy. In this paper, we detail the design and implementation of the proposed SoC and verify that it accurately runs some applications. As a case study to demonstrate the usability of the SoC for computer research, we implement two types of L2 caches in Verilog HDL and evaluate their performance by running the SPEC CPU2000 INT benchmark suite. Additionally, we discuss how the SoC can be used for computer education.
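The kind of comparison the case study performs in hardware can be sketched in software form: measuring hit rates of a direct-mapped versus a 2-way set-associative L2 on an address trace. The cache sizes and the trace below are illustrative assumptions, not the paper's configuration or results.

```python
# Toy set-associative cache simulator with LRU replacement.
from collections import OrderedDict

def hit_rate(trace, n_sets, ways, block=64):
    """Fraction of accesses in `trace` that hit a (n_sets x ways) cache."""
    sets = [OrderedDict() for _ in range(n_sets)]   # insertion order = LRU order
    hits = 0
    for addr in trace:
        tag, s = divmod(addr // block, n_sets)      # split block address
        if tag in sets[s]:
            hits += 1
            sets[s].move_to_end(tag)                # refresh LRU position
        else:
            if len(sets[s]) >= ways:
                sets[s].popitem(last=False)         # evict least-recently-used
            sets[s][tag] = True
    return hits / len(trace)

# Two blocks that conflict in the direct-mapped cache but coexist in the 2-way one.
trace = [0x0000, 0x10000, 0x0000, 0x10000] * 4
dm  = hit_rate(trace, n_sets=256, ways=1)   # conflict miss on every access: 0.0
two = hit_rate(trace, n_sets=128, ways=2)   # both lines fit after warm-up: 0.875
```

Running such a trace on the FPGA SoC instead of in a simulator is exactly where the paper's hardware environment pays off in evaluation time.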
ISBN (print): 0769507190
Given a programmable chip (such as a WSI systolic array or a field-programmable gate array (FPGA)) made of identical configurable logic blocks (cells), the problem of programming the interconnect resources (consisting of switches) has been well studied in the literature. This process can be used for fault tolerance, by logically reconfiguring the fault-free cells of the array into a new array, as well as to customize the FPGA by configuring it to perform the desired functions. Once the desired logical configuration has been achieved (as in the presence of faulty cells), the (programmable) switches in the interconnect resources of the array must be programmed to implement the target topology on the physical array. In this paper, we study the problem of minimizing the programming time (or cost) required for implementing this step. We show that current techniques are not likely to lead to cost-optimal polynomial-time algorithms, because the underlying covering problems are NP-complete in the strong sense. The NP-completeness result is also extended to grid arrays.
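As a loose illustration of the covering flavor of the hardness result, here is the textbook greedy approximation for classic set cover. This is only the standard ln(n)-approximation for the generic problem, not the paper's switch-programming model or its reduction.

```python
# Greedy set cover: repeatedly pick the subset covering the most
# still-uncovered elements. NP-hardness of the exact problem is what
# motivates settling for approximations like this one.
def greedy_set_cover(universe, subsets):
    """Return a list of subsets whose union covers `universe`."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda s: len(uncovered & s))
        if not uncovered & best:
            raise ValueError("universe not coverable by the given subsets")
        chosen.append(best)
        uncovered -= best
    return chosen

cover = greedy_set_cover({1, 2, 3, 4, 5},
                         [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}])
# greedy picks {1, 2, 3} first, then {4, 5}
```

Strong NP-completeness, as shown in the paper, rules out even pseudo-polynomial exact algorithms for the underlying covering problems unless P = NP.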
ISBN (print): 0780366069
The purpose of this work is to describe the implementation of a controller, based on neural networks and using a microcontroller of the PIC family, for tracking the point of maximum power transfer in energy systems that use photovoltaic panels. A neural network is used to determine, at each instant, the output voltage of a DC-DC boost converter connected to the solar panels so as to obtain the maximum power transfer from these panels. This implementation avoids the need for a high-performance computer: since the neural network has already been trained, it can be deployed in a dedicated system. The maximum power transfer of the system is obtained by adjusting the duty cycle of the DC-DC boost converter. A three-phase inverter was also implemented, with a new control circuit to generate the PWM signal.
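The voltage-to-duty-cycle step described above can be sketched using the ideal steady-state boost-converter relation V_out = V_in / (1 - D). The `nn_target_voltage` stub below stands in for the trained network's prediction; its inputs and outputs are illustrative assumptions, not the paper's trained model.

```python
# Hedged sketch: turning a neural-network voltage target into a boost
# converter PWM duty cycle via the ideal relation V_out = V_in / (1 - D).
def nn_target_voltage(irradiance_w_m2, temperature_c):
    """Placeholder for the trained network's prediction (illustrative values)."""
    return 48.0 if irradiance_w_m2 > 500 else 36.0

def boost_duty_cycle(v_in, v_out):
    """Ideal continuous-conduction boost converter: D = 1 - V_in / V_out."""
    if not 0 < v_in < v_out:
        raise ValueError("boost mode requires 0 < V_in < V_out")
    return 1.0 - v_in / v_out

v_panel = 24.0                               # measured panel voltage
v_target = nn_target_voltage(800.0, 25.0)    # NN target: run the output at 48 V
duty = boost_duty_cycle(v_panel, v_target)   # -> 0.5
```

In the embedded setting this loop runs on the PIC: sample the panel, query the (pre-trained, lookup-cheap) network, and update the PWM register, with no heavyweight computation at runtime.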