ISBN: 9781424438914 (print)
This paper presents an FPGA implementation of a low-cost 8-bit reconfigurable processor core for media processing applications. The core is optimized to provide all basic arithmetic and logic functions required by media processing and other domains, and to be easily integrable into a 2D array. The paper investigates the feasibility of the core as a potential soft processing architecture for FPGA platforms. The core was synthesized on the entire Virtex FPGA family to evaluate its overall performance, scalability and portability. A special feature of the proposed architecture is its simple programming model, which allows low-level programming. Throughput results for popular benchmarks coded using the programming model and a cycle-accurate simulator are presented.
Nowadays, FPGAs are integrated in high-performance computing systems, servers, or even used as accelerators in System-on-Chip (SoC) platforms. Since the execution is performed in hardware, FPGA gives much higher perfo...
ISBN: 9781538685174 (digital)
ISBN: 9781538685174 (print)
Networked embedded systems have seen tremendous growth, with many more complex critical and non-critical systems exchanging information over networks of various types. At each node, information is processed by the network stack before the application sees the data. Large portions of the stack are in software, resulting in significant and non-deterministic delays. While hybrid compute platforms like the Xilinx Zynq can accelerate processing tasks through offloading to programmable logic, the delays incurred due to connectivity can significantly impact overall application latency. In this paper, we present a smart network interface approach for the Xilinx Zynq platform based on datapath extensions within the otherwise standard Ethernet interface. We show that this approach improves computation offload latency by 24-27% and throughput by 37% for a complex computational kernel.
ISBN: 9782839918442 (print)
Sharing multi-cycle hardware blocks like the DSP48E1 primitive in Xilinx FPGAs can result in significant resource savings, but complicates scheduling. For high throughput, DSP blocks must be pipelined, which results in a high initiation interval (II) for resource-shared implementations. In this paper, we propose a resource reduction technique that minimises DSP block usage while also offering improved II over traditional approaches. This is integrated in a high-level tool which takes datapath descriptions in C and generates synthesisable Verilog RTL with different levels of resource sharing. We demonstrate significantly improved throughput compared to traditional resource sharing while achieving resource reduction compared to resource-unconstrained and HLS implementations. The approach explores an otherwise infeasible design space between resource-unconstrained and traditional resource sharing methods.
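To make the trade-off between sharing and initiation interval concrete, the sketch below computes the textbook resource-constrained lower bound on II. It is not the technique proposed in the paper: it assumes every shared unit is fully pipelined and accepts one new operation per cycle, and the function name `min_initiation_interval` is hypothetical.

```python
from math import ceil


def min_initiation_interval(num_ops: int, num_units: int) -> int:
    """Lower bound on II when num_ops operations share num_units units.

    Assumption: each unit is fully pipelined and can start one new
    operation per clock cycle, so a unit serves at most II operations
    of one iteration before the next iteration may start.
    """
    if num_ops < 0 or num_units <= 0:
        raise ValueError("num_ops must be >= 0 and num_units must be > 0")
    return max(1, ceil(num_ops / num_units))


if __name__ == "__main__":
    # A datapath with 8 multiply operations mapped onto 1 to 8 shared units.
    for units in (1, 2, 4, 8):
        print(f"{units} unit(s): II >= {min_initiation_interval(8, units)}")
```

Under these assumptions, halving the number of units roughly doubles the bound on II, which is the gap between the fully shared and the resource-unconstrained extremes that the abstract describes.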
ISBN: 9782839918442 (print)
Image features are broadly used in embedded computer vision applications, from object detection and tracking to motion estimation and 3D reconstruction. Efficient feature extraction and description are crucial due to the real-time requirements of such applications over a constant stream of input data. High-speed computation typically comes at the cost of high power dissipation, yet embedded systems are often highly power constrained, making discovery of power-aware solutions especially critical for these systems. In this paper, we present a power and performance evaluation of three low-cost feature detection and description algorithms implemented on various embedded systems (embedded CPUs, GPUs and FPGAs). We show that FPGAs in particular offer attractive solutions for both performance and power, and describe several design techniques used to accelerate feature extraction and description algorithms on low-cost Zynq SoC FPGAs.
ISBN: 9781424438914 (print)
The interconnection networks used by current fine-grain FPGAs are not scalable to very large array sizes. To address this issue, we apply the GALS (Globally Asynchronous and Locally Synchronous) paradigm to build scalable FPGAs. The logic resources are divided into locally synchronous tiles, with asynchronous communication among different tiles. To route the asynchronous communications, we build a serial network-on-chip. Targeting streaming applications, we propose a design flow that maps user applications to our new FPGA architecture. To validate our architecture and design flow, we build an emulation prototype and develop a JPEG baseline encoder as the case study. We have successfully demonstrated the concept and predict a maximum frequency of 224 MHz for designs mapped to the sFPGA2 architecture.
ISBN: 9781424419609 (print)
This paper presents a fast and scalable method of computing signal toggle rate in FPGA-based circuits. Our technique is a vectorless estimation technique, which can be used in a CAD tool to identify the parts of the circuit that can benefit from power optimization. A key advantage of our approach is its ability to efficiently account for spatial correlation of related logic cones, which is accomplished using a novel XOR-based decomposition. In addition, our approach uses post-routing circuit delays to account for glitches in a logic circuit. The proposed approach was tested on 14 MCNC benchmark circuits compiled for the Altera Stratix II devices. The results indicate that our method improves on the vectorless estimation technique available in the latest version of Altera's Quartus II commercial CAD tool, reducing the average error by 37% and the standard deviation by 59%.
ISBN: 9781538685174 (digital)
ISBN: 9781538685174 (print)
We present our latest FPGA acceleration card, NFB-200G2QL, which is specifically designed to enable traffic processing at 200 Gbps. Unique high-speed DMA engines in the FPGA together with highly optimized Linux drivers enable data transfer through PCIe interfaces with minimal CPU overhead. Captured traffic can be independently distributed between individual cores of two physical CPUs (NUMA nodes) without using QPI. As a result, wire-speed packet capture to the host memory from two fully saturated 100 Gbps Ethernet interfaces (QSFP28+ cages) is achieved, and various network monitoring applications can utilize the power of the latest FPGAs and CPUs for data processing. This is especially useful when both directions of a single 100GbE link are monitored. The live demonstration shows how packets are received from two 100 Gbps Ethernet links at wire speed and captured to the host memory at 200 Gbps without loss. The opposite direction of communication is also shown, i.e. how packets are transmitted from the host memory and fully saturate the two 100GbE network interfaces. Achieved speeds are demonstrated by counters and gauges showing generated, received/transmitted and captured packets. We also show statistics of CPU load during packet capture/transmission for different packet lengths.
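Why packet length matters for wire-speed capture can be seen from a back-of-the-envelope calculation. The sketch below is illustrative and not taken from the demonstration: it assumes standard Ethernet framing, where each frame costs an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), and that the frame length includes the FCS. The function name `wire_rate_pps` and the chosen frame lengths are hypothetical.

```python
# Per-frame overhead on the wire: 7 B preamble + 1 B SFD + 12 B inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def wire_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum packets per second on a saturated Ethernet link.

    frame_bytes is the frame length including the FCS (assumption);
    link_bps is the nominal line rate in bits per second.
    """
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


if __name__ == "__main__":
    # Two 100 Gbps links, illustrative frame lengths only.
    for length in (64, 128, 512, 1518):
        per_link = wire_rate_pps(100e9, length)
        print(f"{length:5d} B: {2 * per_link / 1e6:8.1f} Mpps over two links")
```

With minimum-size 64-byte frames, one 100 Gbps link carries about 148.8 million packets per second, so the per-packet cost in the DMA path and drivers dominates at short packet lengths.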
ISBN: 9781424410590 (print)
QR decomposition, especially by means of the Householder transformation, is often used to solve least squares problems. A matrix to be decomposed with this method is usually very large, often too large to fit into the main memory of a workstation, let alone the internal memory of a current FPGA. Efficient out-of-core algorithms have been developed to address the factorization of large matrices. This paper describes the application of variants of Householder QR decomposition on FPGA-based systems. More specifically, issues in applying out-of-core algorithms to the relatively small internal memory architecture of FPGAs are investigated.
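For readers unfamiliar with the method, the following is a minimal in-memory sketch of Householder QR in plain Python. It is the textbook algorithm for a small dense matrix, not the out-of-core or FPGA variants studied in the paper, and the function name `householder_qr` is hypothetical.

```python
import math


def householder_qr(a):
    """Return (q, r) with a = q * r, for a given as a list of m rows, m >= n.

    Each step k builds a reflection H = I - 2 v v^T / (v^T v) that zeroes
    column k below the diagonal; r accumulates H_k ... H_1 a and q
    accumulates H_1 ... H_k.
    """
    m, n = len(a), len(a[0])
    r = [[float(x) for x in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for k in range(min(m - 1, n)):
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero, nothing to reflect
        # Pick the sign that avoids cancellation when forming v.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        # r <- H r, touching only rows k..m-1.
        for j in range(n):
            f = 2.0 * sum(v[i] * r[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                r[k + i][j] -= f * v[i]
        # q <- q H, touching only columns k..m-1.
        for i in range(m):
            f = 2.0 * sum(q[i][k + j] * v[j] for j in range(m - k)) / vtv
            for j in range(m - k):
                q[i][k + j] -= f * v[j]
    return q, r


if __name__ == "__main__":
    q, r = householder_qr([[12, -51, 4], [6, 167, -68], [-4, 24, -41]])
    for row in r:
        print(["%.3f" % t for t in row])
```

Each reflection updates every remaining column of the matrix, which hints at why the working set is hard to fit into small on-chip memories and why blocked, out-of-core variants are of interest.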
ISBN: 9781424419609 (print)
In this paper we present the design and implementation of an FPGA-based floating-point adder with three inputs. The design is based on a 5-stage pipeline in order to distribute the critical paths and to maximize performance. We examine the data dependencies to minimize the number of pipeline stages and to reduce the resource allocation. Our design is parameterisable in order to cope with different floating-point formats, including the standard IEEE 754 formats and custom configurations. With the single-precision, 32-bit floating-point format, the proposed design can operate at 143 MHz on a Xilinx Virtex2Pro XC2VP30-7.