ISBN: 9781424410590 (print)
This paper presents and discusses the implementation of a barotropic operator from the Parallel Ocean Program (POP), an ocean model simulation, on the SRC-6 MAP. Although many high-end reconfigurable machines that let users implement applications in a programming language are now available, little implementation experience has been accumulated for practical applications. In this paper, several implementation techniques, each accompanied by modifications to the original application source code, are empirically evaluated and analyzed. The results show that appropriate use of internal memory and streaming DMA allows 100 MHz FPGAs to achieve performance comparable to that of GHz processors.
ISBN: 9781424419609 (print)
It is often desirable to change the logic and/or the connections within an FPGA design on the fly, without the benefit of a workstation or vendor CAD software. This paper presents a dynamic router for Xilinx FPGAs, designed to run on stand-alone embedded systems. Using information obtained from Xilinx's XDL tool, a compact routing database for the Virtex-II/IIP/4 devices is built that requires only 96 KB of storage. A channel routing algorithm is used because of its deterministic execution time and because all routing resources in the channel are available. Sample channels are routed with the router, and the results are compared with those of the Xilinx PAR tool. Improvements of several orders of magnitude in both execution time and memory usage are observed.
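The abstract above does not say which channel routing algorithm the router uses. As a minimal illustration of the general idea, the sketch below uses a left-edge-style greedy track assignment, a classic channel-routing technique swapped in here purely as an example. The net names, column coordinates, and the function `left_edge_route` are hypothetical, and vertical constraints are ignored.

```python
def left_edge_route(nets):
    """Assign each net (name, left, right) to a horizontal track.

    Nets are visited in order of their left edge and placed on the first
    track whose last net ends before the new net begins. Vertical
    constraints are ignored (an assumption made to keep the sketch short).
    """
    tracks = []       # tracks[t] = rightmost column occupied on track t
    assignment = {}   # net name -> track index
    for name, left, right in sorted(nets, key=lambda n: n[1]):
        for t, last_right in enumerate(tracks):
            if left > last_right:          # no overlap with this track
                tracks[t] = right
                assignment[name] = t
                break
        else:                              # every track overlaps: open a new one
            tracks.append(right)
            assignment[name] = len(tracks) - 1
    return assignment


# Hypothetical nets given as (name, leftmost column, rightmost column).
nets = [("a", 0, 3), ("b", 1, 5), ("c", 4, 7), ("d", 6, 9)]
routing = left_edge_route(nets)
print(routing)
```

The work per net is bounded by the number of tracks, so the running time depends only on the problem size and not on search order or backtracking. This is one way a channel router can offer the deterministic execution time the abstract mentions.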
ISBN: 9789090304281 (print)
In this paper, we present the first multilevel implementation of the Harris-Stephens corner detector and the ORB feature extractor running on FPGA hardware, for computer vision and robotics applications. ORB is a fundamental component of many robotics applications and requires significant computation. The design has been validated both in behavioural simulation and in implementation on an Arria V FPGA connected to a desktop PC via PCI-Express. A Linux kernel-mode driver and userspace library allow integration of the acceleration hardware into C++ programs. The device has significantly higher throughput than a CPU implementation (150 MPixel/s vs 27 MPixel/s) and a GPU implementation (40 MPixel/s), with much lower power draw (5.3 W vs 145 W). This throughput is equivalent to 72 fps at 1920 x 1080 or 488 fps at 640 x 480.
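The frame rates quoted above follow directly from the pixel throughput. A short check, assuming 1 MPixel means 10^6 pixels and rounding down to whole frames (the function name `frames_per_second` is illustrative):

```python
def frames_per_second(pixels_per_second, width, height):
    """Whole frames per second sustained at a given pixel throughput."""
    return pixels_per_second // (width * height)


THROUGHPUT = 150_000_000  # 150 MPixel/s, taking 1 MPixel = 10**6 pixels

fps_1080p = frames_per_second(THROUGHPUT, 1920, 1080)
fps_vga = frames_per_second(THROUGHPUT, 640, 480)
print(fps_1080p, fps_vga)  # prints "72 488"
```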
ISBN: 9781424438914 (print)
This work introduces an FPGA implementation for vessel-tree extraction on retinal images. The retinal vessel tree can be used in disease diagnosis, e.g. for diabetes, or in person authentication. In such cases, a portable, high-performance device may be needed. The FPGA implementation discussed here, although application-oriented, features a fully programmable SIMD architecture that allows low-level image processing algorithms to be realized efficiently. It is mapped onto a Spartan 3 with 90 processing elements. The design uses 1.4 MB of on-chip memory, which stores 8 gray images of 144 x 160 px. The working frequency is 53 MHz, allowing a 3 x 3 convolution in less than 110 µs.
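The timing figures above can be turned into a rough cycle budget. The sketch below assumes the image is split evenly across the 90 processing elements, which the abstract does not state. It treats 110 µs as an upper bound, so the result is a ceiling rather than a measured cost.

```python
CLOCK_MHZ = 53          # working frequency from the abstract
CONV_TIME_US = 110      # upper bound on the 3 x 3 convolution time
WIDTH, HEIGHT = 144, 160
NUM_PES = 90

# MHz multiplied by microseconds gives clock cycles directly.
cycle_budget = CLOCK_MHZ * CONV_TIME_US

# Assumption: pixels are distributed evenly over the processing elements.
pixels = WIDTH * HEIGHT
pixels_per_pe = pixels // NUM_PES

cycles_per_pixel = cycle_budget / pixels_per_pe
print(cycle_budget, pixels_per_pe, round(cycles_per_pixel, 1))
```

Under that assumption each processing element handles 256 pixels and has fewer than about 23 cycles for each 3 x 3 output pixel.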
ISBN: 9782839918442 (print)
TCP/IP is widely used both on the Internet and in data centers. The protocol makes very few assumptions about the underlying network and provides useful guarantees such as reliable transmission, in-order delivery, and flow control. The price for this functionality is complexity, latency, and computational overhead, which is especially pronounced in software implementations. While this is acceptable for Internet communication, the overhead is too high in data centers. In this paper, we explore how to optimize a TCP/IP stack running on an FPGA for data center applications with an emphasis on data processing (e.g., key-value stores). Using a key-value store and a low-latency consensus protocol implemented on an FPGA as examples of the requirements that arise in data centers, we provide an extensive analysis of the overheads of TCP/IP and of the solutions that can be adopted to minimize them. The proposed optimized TCP/IP stack minimizes tail latencies (a key metric in distributed data processing) and is implemented efficiently enough to share the FPGA with application logic.
ISBN: 9781424419609 (print)
This work presents a programmable, configurable motion estimation processor for the H.264 video coding standard, capable of handling the processing requirements of high-definition (HD) video and suitable for FPGA implementation. The programmable aspect of the processor follows the ASIP (Application-Specific Instruction-set Processor) approach, with an instruction set targeted at accelerating block-matching motion estimation algorithms. Configurability refers to the ability to optimize the microarchitecture for the selected algorithm and performance requirements by varying the number and type of execution units at compile time.
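For readers unfamiliar with block-matching motion estimation, the kind of algorithm such an instruction set targets, here is a minimal software sketch of an exhaustive (full) search using the sum of absolute differences (SAD). It does not reflect the processor's actual instruction set or search strategy, which the abstract does not describe. The function names, frame contents, and block size are illustrative assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Try every displacement within +/- search_range and return
    (best_cost, dx, dy); candidates outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Hypothetical 8 x 8 frames: `cur` is `ref` shifted by 2 columns and 1 row.
ref = [[7 * x + 13 * y for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0
        for x in range(8)] for y in range(8)]

best = full_search(cur, ref, bx=2, by=2, n=4, search_range=2)
print(best)
```

The inner SAD loop dominates the cost. In hardware, this is the kind of operation that can be spread across execution units whose number and type are chosen at compile time.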
ISBN: 9781424419609 (print)
The poor scalability of current mesh-based FPGA interconnection networks is impeding attempts to build next-generation FPGAs with larger logic capacity. A few alternative interconnection network architectures have been proposed for future FPGAs, but they still pose several design challenges that need to be addressed. In this paper we propose sFPGA, a scalable FPGA architecture that is a hybrid between hierarchical interconnection and Network-on-Chip. The logic resources in sFPGA are organized into an array of logic tiles. The tiles are connected by a hierarchical network of switches, which route data packets over the network. In addition, we propose a design flow for sFPGA that integrates seamlessly with current design flows. We validate the sFPGA design flow through a case study on our emulation prototype.
ISBN: 9781424410590 (print)
We demonstrate a hybrid reconfigurable cluster-on-chip architecture with a cross-platform Message Passing Interface (MPI), a cross-platform parallel image processing library, and a sample application. We describe the system, the network architecture, and the implementations of the MPI library and the parallel image processing library. We validate the performance, scalability, and suitability of MPI as a software interface for enabling cross-platform application parallelism on reconfigurable hybrid cluster-on-chip systems and desktop cluster systems. The presented results are promising, showing the suitability, scalability, and performance of parallelising image processing algorithms with a cross-platform MPI implementation.
ISBN: 9782839918442 (print)
Recurrent neural networks (RNNs) provide state-of-the-art accuracy for analytics on sequential datasets (e.g., language modeling). This paper studies a state-of-the-art RNN variant, the Gated Recurrent Unit (GRU). We first propose a memoization optimization that avoids 3 of the 6 dense matrix-vector multiplications (SGEMVs) that make up the majority of the computation in a GRU. Then, we study the opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, a GPU, and a multi-core CPU. Results show that the FPGA provides superior performance/Watt over the CPU and GPU because its on-chip BRAMs, hard DSPs, and reconfigurable fabric allow fine-grained parallelism to be extracted efficiently from the small/medium-sized matrices used by the GRU. Moreover, newer FPGAs with more DSPs, more on-chip BRAMs, and higher frequency have the potential to narrow the FPGA-ASIC efficiency gap.
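The abstract does not say which three of the six matrix-vector products are memoized. One plausible reading, sketched below as an assumption and not as the authors' method, is that the three input-side products can be cached per token when inputs come from a finite vocabulary, leaving only the three hidden-state products to compute at each step. The class name, weight layout, update convention, and toy values are all hypothetical.

```python
import math


def matvec(m, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class MemoizedGRU:
    """GRU cell that caches the input-side products W_z x, W_r x, W_h x
    per token id (assumed memoization scheme, see the lead-in)."""

    def __init__(self, W, U, embeddings):
        self.W, self.U, self.emb = W, U, embeddings  # W, U keyed by "z", "r", "h"
        self.cache = {}
        self.input_matvecs = 0  # number of input-side products actually computed

    def _input_products(self, token):
        if token not in self.cache:
            x = self.emb[token]
            self.cache[token] = {g: matvec(self.W[g], x) for g in "zrh"}
            self.input_matvecs += 3
        return self.cache[token]

    def step(self, token, h):
        wx = self._input_products(token)
        z = [sigmoid(a + b) for a, b in zip(wx["z"], matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(wx["r"], matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(wx["h"], matvec(self.U["h"], rh))]
        # One common update convention; some formulations swap z and (1 - z).
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# Toy 2-unit cell with a 2-token vocabulary (all values are made up).
W = {g: [[0.1, -0.2], [0.3, 0.4]] for g in "zrh"}
U = {g: [[0.2, 0.1], [-0.1, 0.3]] for g in "zrh"}
gru = MemoizedGRU(W, U, embeddings={0: [1.0, 0.0], 1: [0.0, 1.0]})

h = [0.0, 0.0]
for token in [0, 1, 0, 0]:
    h = gru.step(token, h)
print(gru.input_matvecs)  # 6 instead of 12: repeated tokens hit the cache
```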
ISBN: 9781728199023 (print)
A nonvolatile FPGA using atom-switch crossbars is implemented in 28 nm CMOS. The depopulated atom-switch crossbar with a double-gate layout achieves a 75% area saving. The routability degradation caused by the depopulation is mitigated by a modified routing architecture that combines mixed segment lengths with thinned-out connection block populations. To our knowledge, the novel FPGA provides the largest logic capacity, 171k lookup tables (LUTs), among 3D-FPGAs based on monolithically integrated nonvolatile switches and memory. The operating frequency and dynamic power are significantly improved compared to conventional atom-switch FPGAs.