Intel's Xeon roadmap includes package-integrated FPGAs in every new generation. In this talk, we will dissect why this is such a powerful combination at this time of great change in datacenter workloads. We will s...
ISBN: (print) 9781450318877
The proceedings contain 29 papers. The topics discussed include: accelerating subsequence similarity search based on dynamic time warping distance with FPGA; video-rate stereo matching using Markov random field TRW-S inference on a hybrid CPU+FPGA computing platform; fully-functional FPGA prototype with fine-grain programmable body biasing; sensing nanosecond-scale voltage attacks and natural transients in FPGAs; word-length optimization beyond straight line code; embedding-based placement of processing element networks on FPGAs for physical model simulation; a remote memory access infrastructure for global address space programming models in FPGAs; architecture support for custom instructions with memory operations; high throughput and programmable online traffic classifier on FPGA; and indirect connection aware attraction for FPGA clustering.
ISBN: (print) 9781450338561
In recent years, Convolutional Neural Network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computationally intensive and resource-consuming, and thus are hard to integrate into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA accelerators for CNNs. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for Image-Net large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computation-centric and Fully-Connected layers are memory-centric. Then a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNNs are proposed to improve bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on a Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the Convolutional layers and of the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which outperforms previous approaches significantly.
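The abstract does not spell out how the dynamic-precision quantization works, so the following is only a minimal sketch of one common reading: each layer uses signed fixed-point numbers of a given bit width, and the number of fractional bits is chosen per layer to minimize the total rounding and clipping error. The function names, the candidate range, and the example weights are all hypothetical and are not taken from the paper.

```python
def quantize(values, bit_width, frac_len):
    """Round values to signed fixed point with `frac_len` fractional bits.

    Values outside the representable range are clipped (saturated).
    """
    scale = 2.0 ** frac_len
    lo = -(1 << (bit_width - 1))
    hi = (1 << (bit_width - 1)) - 1
    out = []
    for v in values:
        q = int(round(v * scale))      # nearest integer code
        q = max(lo, min(hi, q))        # saturate to the bit width
        out.append(q / scale)          # back to a real value
    return out


def best_frac_len(values, bit_width, candidates=range(-8, 17)):
    """Pick the fractional length with the smallest total absolute error.

    The candidate range is an assumption made for this sketch.
    """
    def error(frac_len):
        quantized = quantize(values, bit_width, frac_len)
        return sum(abs(v - q) for v, q in zip(values, quantized))

    return min(candidates, key=error)


# Hypothetical weights for two layers with very different dynamic ranges.
layers = {
    "conv_small": [0.011, -0.024, 0.007, 0.019],
    "fc_large": [3.2, -5.9, 1.4, 7.7],
}
chosen = {name: best_frac_len(w, bit_width=8) for name, w in layers.items()}
```

In this sketch the layer with the smaller dynamic range ends up with more fractional bits than the layer with the larger range, which is the reason for choosing the precision per layer instead of fixing one format for the whole network.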
ISBN: (print) 9781450345354
Multi-FPGA platforms are a popular choice today for complex system prototyping because they offer high execution speed, low cost, and real-world testing experience. However, the performance of multi-FPGA based systems is severely affected by the widening logic-to-I/O gap in FPGAs. To address this performance issue, we propose an exploration and optimization flow for multi-FPGA based prototyping that gives an end-to-end experience, from benchmark generation to optimized inter-FPGA routing. Using the generic tools of the flow, ten large benchmarks are generated. Then, through a generic, novel inter-FPGA routing environment, we explore how varying the number of FPGAs and the number of inter-FPGA tracks affects the performance of a target design. For performance exploration and optimization, five different FPGA boards are used, with the number of FPGAs per board varying from two to six. Moreover, four different inter-FPGA track combinations are used for each board. Experimental results reveal that multi-FPGA boards whose inter-FPGA tracks correspond optimally to the cut net requirements of the benchmarks under consideration give the best frequency results. Furthermore, a frequency comparison between the boards shows that the board with six FPGAs gives, on average, the best frequency results. Finally, we also perform a frequency-price analysis, which shows that the board with four FPGAs gives a better frequency-price tradeoff than the other FPGA boards under consideration.
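The abstract does not define its frequency-price metric. As a rough illustration only, the sketch below assumes the tradeoff is measured as achieved frequency divided by board price, and shows how the fastest board and the best-value board can differ. Every number in it is made up for illustration and is not a result from the paper.

```python
# Hypothetical boards: (number of FPGAs, frequency in MHz, price in arbitrary units).
# These figures are invented for illustration and are NOT results from the paper.
boards = [
    (2, 20.0, 100.0),
    (4, 36.0, 150.0),
    (6, 40.0, 250.0),
]


def frequency_per_price(board):
    """Assumed tradeoff metric: achieved frequency per unit of price."""
    _, freq_mhz, price = board
    return freq_mhz / price


# The board with the highest raw frequency ...
fastest = max(boards, key=lambda b: b[1])
# ... need not be the one with the best frequency per unit of price.
best_value = max(boards, key=frequency_per_price)
```

With these made-up inputs, the six-FPGA board is fastest while the four-FPGA board has the highest frequency per unit of price, mirroring the qualitative pattern the abstract reports.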
ISBN: (print) 9781450338561
The Zynq-7000 All Programmable SoC and the new Zynq UltraScale+ MPSoC provide proven alternatives to traditional domain-specific application SoCs and enable extensive system-level differentiation, integration and flexibility through hardware, software and I/O programmability. The SDSoC Development Environment is a heterogeneous design environment for implementing embedded systems using the Zynq SoC and MPSoC. It enables the broader community of embedded software developers to leverage the power of hardware and software programmable devices, entirely from a higher level of abstraction. The SDSoC environment provides a greatly simplified embedded C/C++ application programming experience, including an easy-to-use Eclipse IDE and a comprehensive development platform. SDSoC includes a full-system optimizing C/C++ compiler, system-level profiling and hardware/software event tracing, automated software acceleration in programmable logic, automated generation of SW-HW connectivity, and integration with libraries to speed programming. The SDSoC compiler transforms programs into complete hardware/software systems based on a user-specified target platform and the functions within the program selected for compilation into programmable hardware logic. Hardware accelerators communicate with the CPU and external memory through an automatically generated, application-specific data motion network comprised of DMAs, interconnects and other standard IP blocks. The SDSoC Environment also provides flows for customer and third-party developers to enable their platforms and integrate RTL IPs as C-callable libraries. It builds upon customer-proven design tools from Xilinx, including Vivado Design Suite, Vivado High-Level Synthesis and SDK. In this presentation, we will introduce the motivation and basic concepts behind SDSoC, describe its capabilities and the user flow, and provide a brief demonstration of the tool using an example.
ISBN: (print) 9781450341851
In the last ten years, many new applications have emerged, such as AI, big data and cloud. Though the workloads of these applications are very diverse, they all demand huge data center resources. In contrast, silicon technology is advancing more and more slowly because Moore's law is coming to an end. Consequently, data centers built from commodity hardware cannot provide enough cost-efficiency and power-efficiency. To meet the growing resource needs of emerging applications, data centers are becoming larger and larger, consuming huge amounts of power and incurring high hardware costs. From the business perspective, the slow development of hardware technology limits the value creation of emerging applications. We, Baidu, the largest search engine in China, have faced this challenge for several years. We find that the number of servers increases much faster than the scale of the business, and this is common among internet companies, because the iteration of general-purpose processors is becoming slower and slower. For example, Intel announced earlier this year that its Tick-Tock production strategy was out of date. This problem drives us to look for new methods to boost business. From an internet company's perspective, building new chips or new architectures based on the characteristics of its applications makes sense. This method can break the limitations of commodity chips and commodity hardware. According to academic and industry experience, a domain-specific architecture can achieve much better performance and power efficiency than a general architecture. Consequently, we are exploring new architectures to extend Moore's law. In this paper, we present our work on exploring new architectures for the data center. Data center resources include storage, memory, computing and networking. Hence, we focus on these four areas. Firstly, we implemented SDF for large-scale distributed storage systems. SDF aims at a low-cost and high-performance flash storage system. Secondly, we implemented SDA for dee