Details
ISBN (print): 9782839918442
We demonstrate a novel FPGA-based accelerator architecture that can tackle a range of standard computer vision (CV) problems, with scalable performance and attractive speedups. The architecture relies on multiple pipelined processing elements (PEs) that can be configured to support various belief propagation (BP) settings for different CV tasks. Inside each PE, an innovative implementation of Jump Flooding for efficient computation of BP solves the core configurability challenge. A novel block-parallel memory interface supports parallelization by distributing BP inference workloads across the PEs. Experimental results demonstrate that our accelerator achieves scalable performance with 11-41x speedup over standard sequential CPU implementations across a subset of well-known Middlebury and OpenGM benchmarks, with no compromise in quality of inference results. To the best of our knowledge, this is the first FPGA hardware implementation of BP capable of running a range of standard CV benchmarks with significant speedups.
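The Jump Flooding technique mentioned in the abstract originates in GPU computing for Voronoi diagrams and distance transforms. The following is a minimal pure-Python sketch of the classic Jump Flooding Algorithm (JFA), not the paper's BP-specific variant: each pass, every cell inspects neighbours at offsets of +/-step in each axis and adopts the closest known seed, with step halving from roughly the grid size down to 1.

```python
def jump_flood_voronoi(width, height, seeds):
    """Classic Jump Flooding Algorithm (JFA) for an approximate Voronoi
    diagram: each cell ends up holding the index of its nearest seed.
    This is the textbook JFA, not the paper's BP-specific adaptation."""
    # owner[y][x] = index of nearest known seed, or None if unreached
    owner = [[None] * width for _ in range(height)]
    for i, (sx, sy) in enumerate(seeds):
        owner[sy][sx] = i

    def dist2(x, y, i):
        sx, sy = seeds[i]
        return (x - sx) ** 2 + (y - sy) ** 2

    step = 1 << (max(width, height) - 1).bit_length()  # start near grid size
    while step >= 1:
        new = [row[:] for row in owner]
        for y in range(height):
            for x in range(width):
                # Inspect the 9 cells at offsets of +/-step in each axis.
                for dy in (-step, 0, step):
                    for dx in (-step, 0, step):
                        nx, ny = x + dx, y + dy
                        if 0 <= nx < width and 0 <= ny < height:
                            cand = owner[ny][nx]
                            if cand is not None and (
                                new[y][x] is None
                                or dist2(x, y, cand) < dist2(x, y, new[y][x])
                            ):
                                new[y][x] = cand
        owner = new
        step //= 2
    return owner
```

Because each cell's update depends only on the previous pass, all cells can be processed in parallel, which is what makes the scheme attractive for pipelined FPGA processing elements.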
Details
ISBN (print): 9781424438914
Capacity of FPGAs has grown significantly, leading to increased complexity of designs targeting these chips. Traditional FPGA design methodology using HDLs is no longer sufficient and new methodologies are being sought. An attractive possibility is to use streaming languages. Streaming languages group data into streams, which are processed by computational nodes called kernels. They are suitable for implementation in FPGAs because they expose parallelism, which can be exploited by implementing the application in FPGA logic. Designers can express their designs in a streaming language and target FPGAs without needing a detailed understanding of digital logic design. In this paper we show how the Brook streaming language can be used to simplify design for FPGAs, while providing reasonable performance compared to other methodologies. We show that throughput of streaming applications can be increased through automatic kernel replication. Using our compiler, the FPGA designer can trade off FPGA area and performance by changing the amount of kernel replication. We describe the details of our compiler and present performance and area of a set of benchmarks. We found that throughput scales well with increased replication for most applications.
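The kernel-replication idea can be modelled in a few lines: a stateless kernel is copied n times, the input stream is dealt round-robin across the copies (which run concurrently in FPGA logic), and outputs are re-interleaved. This is an illustrative sketch, not the Brook compiler's actual API.

```python
def replicate_kernel(kernel, n_replicas):
    """Model of automatic kernel replication: the input stream is dealt
    round-robin to n_replicas copies of a stateless kernel, and outputs
    are collected back in order. Names are illustrative only."""
    def replicated(stream):
        # Partition the stream across replicas (round-robin).
        lanes = [stream[i::n_replicas] for i in range(n_replicas)]
        # Each replica processes its lane independently
        # (concurrently, when mapped to FPGA logic).
        results = [[kernel(x) for x in lane] for lane in lanes]
        # Re-interleave to restore the original stream order.
        out = [None] * len(stream)
        for i, lane in enumerate(results):
            out[i::n_replicas] = lane
        return out
    return replicated

# Example: four replicas of a squaring kernel.
square4 = replicate_kernel(lambda x: x * x, 4)
```

The area/throughput trade-off the abstract describes corresponds to choosing n_replicas: each extra copy consumes more FPGA logic but, for stateless kernels, multiplies peak throughput.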
Details
ISBN (print): 9782839918442
Random sampling based path planning algorithms have shown their high efficiency in robotics, navigation and related fields. The Rapidly-Exploring Random Trees (RRT) algorithm is the typical method and works well in a variety of applications. Due to the sub-optimality of the original RRT, the more recent algorithm known as RRT* significantly improves solution optimality by adding a "cost review" procedure. However, the original RRT already suffers a bottleneck of complicated iterations, and this becomes worse in RRT*. This paper presents a hardware architecture for RRT* that fully exploits the parallel potential of the algorithm. Unlike the sequential execution in software, the "exploration" and "review" steps are identified as independent processes and executed in parallel. For the complicated operation of inserting vertices, a pipelined Kd-tree constructor is designed to rapidly rebuild the tree when a new vertex is generated. Furthermore, to speed up near-neighbor and nearest-neighbor searching, the vertices are stored in separate Kd-trees so that the search processes can be carried out concurrently in each tree. This work explores a feasible, power-efficient RRT* hardware architecture on FPGAs, compared against a PC implementation.
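The "exploration" and "cost review" steps the abstract parallelizes can be seen in a single software RRT* iteration. The sketch below (2D, Euclidean cost) is an illustration of the standard algorithm, not the paper's hardware datapath: steer toward a sample, choose the cheapest parent among near neighbours, then rewire neighbours through the new node.

```python
import math

def rrt_star_step(nodes, parents, costs, sample, step=0.5, radius=1.0):
    """One RRT* iteration in 2D (illustrative sketch of the standard
    algorithm, not the paper's FPGA datapath)."""
    d = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    # "Exploration": steer from the nearest node toward the sample.
    nearest = min(range(len(nodes)), key=lambda i: d(nodes[i], sample))
    dist = d(nodes[nearest], sample)
    t = min(1.0, step / dist) if dist > 0 else 0.0
    new = (nodes[nearest][0] + t * (sample[0] - nodes[nearest][0]),
           nodes[nearest][1] + t * (sample[1] - nodes[nearest][1]))
    # "Cost review": pick the lowest-cost parent among near neighbours.
    near = [i for i in range(len(nodes)) if d(nodes[i], new) <= radius]
    parent = min(near, key=lambda i: costs[i] + d(nodes[i], new))
    nodes.append(new)
    parents.append(parent)
    costs.append(costs[parent] + d(nodes[parent], new))
    # Rewire: route near neighbours through the new node if cheaper.
    for i in near:
        c = costs[-1] + d(new, nodes[i])
        if c < costs[i]:
            parents[i], costs[i] = len(nodes) - 1, c
    return new
```

The near-neighbour scan and the rewiring loop touch disjoint data from the steering step, which is the independence the paper exploits by running "exploration" and "review" in parallel hardware.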
Details
ISBN (print): 9781467381239
Field-Programmable Gate Arrays (FPGAs) are becoming pervasive in various kinds of computationally demanding applications. Working in a tightly coupled processor-coprocessor architecture, FPGAs are often anticipated to accelerate multiple fine-grained or coarse-grained tasks simultaneously. Single-context FPGAs are commonly used in such systems. With the recent development of emerging memory technologies, multi-context FPGAs that support dynamic reconfiguration with high-density non-volatile memories become feasible. Compared to single-context FPGAs, multi-context FPGAs are able to accelerate significantly more tasks with only moderate area and power overhead. However, the best way to utilize the computation capacity advantage of multi-context FPGAs for hardware task mapping remains an interesting and unexplored problem. In this paper, we first propose the framework of a processor-coprocessor architecture with a multi-context FPGA as the coprocessor for multiple-task acceleration. Under the framework, a hybrid placement strategy based on genetic and greedy algorithms is proposed to efficiently place a set of tasks onto the multi-context FPGA to achieve the best logic capacity utilization. Experiments on real and synthetic benchmarks demonstrate the efficiency of the proposed algorithm compared with other general approaches.
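The greedy half of such a hybrid placer is essentially multi-way bin packing: each context is a bin of logic capacity, and tasks are placed largest-first into the context with the most free capacity. The sketch below illustrates that general idea only; the paper's actual hybrid genetic/greedy algorithm and its cost model are not reproduced here.

```python
def greedy_place(task_areas, n_contexts, capacity):
    """Greedy largest-first placement of tasks onto FPGA contexts
    (a sketch of the general bin-packing idea, not the paper's exact
    hybrid genetic/greedy algorithm)."""
    free = [capacity] * n_contexts          # remaining logic per context
    placement = {}
    for task, area in sorted(task_areas.items(), key=lambda kv: -kv[1]):
        # Put each task into the context with the most free capacity.
        ctx = max(range(n_contexts), key=lambda c: free[c])
        if free[ctx] < area:
            raise ValueError(f"task {task} does not fit in any context")
        placement[task] = ctx
        free[ctx] -= area
    return placement, free
```

A genetic algorithm would then perturb such greedy placements (e.g. by swapping tasks between contexts) to escape the local optima a purely greedy pass can get stuck in.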
Details
ISBN (print): 9781424419609
Today, quasi-Monte Carlo (QMC) methods are widely used in finance to price derivative securities. The QMC approach is popular because for many types of derivatives it yields an estimate of the price, to a given accuracy, faster than other competitive approaches, like Monte Carlo (MC) methods. The calculation of the large number of underlying asset pathways consumes a significant portion of the overall run-time and energy of modern QMC derivative pricing simulations. Therefore, we present an FPGA-based accelerator for the calculation of asset pathways suitable for use in the QMC pricing of several types of derivative securities. Although this implementation uses constructs (recursive algorithms and double-precision floating point) not normally associated with successful FPGA computing, we demonstrate performance in excess of 50x that of a 3 GHz multi-core processor.
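To make the "asset pathway" workload concrete, here is a toy sketch of one QMC path under geometric Brownian motion: a base-2 van der Corput low-discrepancy point is mapped through the inverse normal CDF to drive each step. This is an illustration of the general QMC pathway idea only (a proper multi-dimensional QMC construction, and the paper's recursive formulation, are not reproduced).

```python
import math
from statistics import NormalDist

def van_der_corput(n, base=2):
    """n-th element of the van der Corput low-discrepancy sequence."""
    q, bk = 0.0, 1.0 / base
    while n > 0:
        n, r = divmod(n, base)
        q += r * bk
        bk /= base
    return q

def qmc_gbm_path(s0, mu, sigma, dt, n_steps, path_index):
    """One quasi-Monte Carlo asset pathway under geometric Brownian
    motion (toy sketch; parameter names are illustrative)."""
    inv = NormalDist().inv_cdf
    path = [s0]
    for k in range(n_steps):
        # Low-discrepancy point -> standard normal increment.
        u = van_der_corput(path_index * n_steps + k + 1)
        z = inv(min(max(u, 1e-12), 1 - 1e-12))  # clamp away from 0 and 1
        s = path[-1] * math.exp((mu - 0.5 * sigma ** 2) * dt
                                + sigma * math.sqrt(dt) * z)
        path.append(s)
    return path
```

The per-step recursion (each price depends on the previous one) is exactly the kind of recursive, double-precision structure the abstract notes is unusual, yet still profitable, to map onto an FPGA pipeline.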
Details
ISBN (print): 9781424438914
Many applications in image processing have high inherent parallelism. FPGAs have shown very high performance in spite of their low operational frequency by fully extracting this parallelism. In recent microprocessors, it also becomes possible to exploit the parallelism using multi-cores that support improved SIMD instructions, though programmers have to use them explicitly to achieve high performance. Recent GPUs support a large number of cores, and have a potential for high performance in many applications. However, the cores are grouped, and data transfer between the groups is very limited. Programming tools for FPGAs, SIMD instructions on CPUs and the large number of cores on GPUs have been developed, but it is still difficult to achieve high performance on these platforms. In this paper, we compare the performance of FPGA, GPU and CPU using three applications in image processing: two-dimensional filters, stereo vision and k-means clustering, and make clear which platform is faster under which conditions.
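Of the three benchmarks, k-means clustering shows the data-parallel structure most plainly: every point's nearest-center assignment is independent, so the inner loop maps naturally onto FPGA pipelines, SIMD lanes, or GPU threads. Below is a plain pure-Python reference of Lloyd's k-means on 2D points (a generic reference implementation, not the benchmark code from the paper).

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on 2D points: the per-point assignment step
    is the data-parallel kernel that FPGA/GPU/SIMD implementations
    accelerate. Deterministic init (first k points) for simplicity."""
    centers = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: independent per point -> fully parallelizable.
        for p in points:
            j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                          + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # Update step: recompute each centroid (a parallel reduction).
        for c, pts in enumerate(clusters):
            if pts:
                centers[c] = (sum(p[0] for p in pts) / len(pts),
                              sum(p[1] for p in pts) / len(pts))
    return centers
```

The update step is a reduction, which is where the platforms differ most: FPGAs can build custom accumulator trees, while GPUs must reduce within and then across core groups, the limited inter-group communication the abstract mentions.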
Details
ISBN (print): 9781424419609
Post-beamforming second order Volterra filter (SOVF) was previously introduced for decomposing the pulse-echo ultrasonic radio-frequency (RF) signal into its linear and quadratic components. Using singular value decomposition (SVD), an optimal minimum-norm least squares algorithm for deriving the coefficients of the linear and quadratic kernels of the SOVF was developed and verified. The "separable" implementation algorithm of a SOVF based on the eigenvalue decomposition (EVD) of the quadratic kernel was introduced and verified. In this paper, the "separable" version of a second order Volterra filter is implemented in a Xilinx Virtex-E FPGA. Parallel operation, efficient use of instructions per task, and the data streaming capability of the FPGA are identified. This implementation should allow for real-time quadratic filtering on commercial ultrasound scanners.
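The "separable" trick rests on a standard identity: a symmetric quadratic kernel Q has an eigendecomposition Q = sum_i lambda_i h_i h_i^T, so the quadratic output x^T Q x equals sum_i lambda_i (h_i^T x)^2, i.e. a bank of linear FIR branches, each squared and scaled. The sketch below verifies that identity numerically (an illustration of the separable structure, not the paper's FPGA code).

```python
def quadratic_direct(Q, window):
    """Full quadratic kernel: sum over i, j of Q[i][j] * x[i] * x[j]."""
    n = len(Q)
    return sum(Q[i][j] * window[i] * window[j]
               for i in range(n) for j in range(n))

def quadratic_separable(eigvals, eigvecs, window):
    """'Separable' form: each eigenvector is a linear FIR branch whose
    output is squared and scaled by its eigenvalue, then summed."""
    out = 0.0
    for lam, h in zip(eigvals, eigvecs):
        branch = sum(hk * xk for hk, xk in zip(h, window))  # linear filter
        out += lam * branch * branch                        # square & scale
    return out
```

On hardware, each branch is an independent FIR pipeline followed by one multiplier, which is far cheaper than the dense O(n^2) multiply of the direct quadratic form.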
Details
ISBN (print): 9781424438914
This paper advocates the use of 3D integration technology to stack a DRAM on top of an FPGA. The DRAM will store future FPGA contexts. A configuration is read from the DRAM into a latch array on the DRAM layer while the FPGA executes; the new configuration is loaded from the latch array into the FPGA in 60 ns (5 cycles). The latency between reconfigurations, 8.42 µs, is dominated by the time to read data from the DRAM into the latch array. We estimate that the DRAM can cache 289 FPGA contexts.
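The quoted figures can be sanity-checked with simple arithmetic: 60 ns over 5 cycles implies a 12 ns configuration-load cycle (roughly an 83 MHz clock), and subtracting the 60 ns load from the 8.42 µs total shows essentially all of the latency is the DRAM-to-latch-array read, as the abstract says.

```python
# Back-of-envelope check on the timing figures quoted above.
load_ns = 60.0                          # latch array -> FPGA load time
cycles = 5
cycle_ns = load_ns / cycles             # 12 ns per cycle (~83 MHz)
latency_us = 8.42                       # total time between reconfigurations
dram_read_us = latency_us - load_ns / 1000.0   # time spent on the DRAM read
print(cycle_ns, round(dram_read_us, 2))        # 12.0, 8.36
```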
Details
ISBN (print): 9781424419609
Due to continuous improvements in the resources available on FPGAs, it is becoming increasingly possible to accelerate floating point algorithms. The solution of a system of linear equations forms the basis of many problems in engineering and science, but its calculation is highly time consuming. The minimum residual (MINRES) algorithm is one method to solve this problem, and is highly effective provided the matrix exhibits certain characteristics. This paper examines an IEEE 754 single precision floating point implementation of the MINRES algorithm on an FPGA. It demonstrates that through parallelisation and heavy pipelining of all floating point components it is possible to achieve a sustained performance of up to 53 GFLOPS on the Virtex5-330T. This compares favourably to other hardware implementations of floating point matrix inversion algorithms, and corresponds to an improvement of nearly an order of magnitude compared to a software implementation.
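MINRES minimises the residual norm over a Krylov subspace for symmetric matrices. As a compact stand-in, the sketch below implements the closely related Conjugate Residual iteration for a symmetric system (same residual-minimising idea; this is not the paper's pipelined Lanczos-based MINRES implementation). Its inner products and matrix-vector products are the operations the paper pipelines in floating point.

```python
def conjugate_residual(A, b, iters=50, tol=1e-10):
    """Conjugate Residual iteration for a symmetric matrix A: a minimal
    relative of MINRES (both minimise the residual over a Krylov
    subspace). Dense pure-Python reference, not the FPGA datapath."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n))
                        for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                 # residual b - A x for x = 0
    p = r[:]
    Ar = matvec(r)
    Ap = Ar[:]
    rAr = dot(r, Ar)
    for _ in range(iters):
        alpha = rAr / dot(Ap, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) < tol * tol:
            break
        Ar = matvec(r)
        rAr_new = dot(r, Ar)
        beta = rAr_new / rAr
        rAr = rAr_new
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        Ap = [ari + beta * api for ari, api in zip(Ar, Ap)]
    return x
```

Every step is dominated by one matrix-vector product plus a few dot products and vector updates, all of which pipeline well, which is why iterative Krylov solvers are attractive targets for FPGA floating-point acceleration.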
Details
ISBN (print): 9781424419609
This paper describes the integration of a thermally assisted switching magnetic random access memory (TAS-MRAM) in FPGA design. The non-volatility of the latter is achieved through the use of magnetic tunneling junctions (MTJ) in the MRAM cell. A thermally assisted switching scheme is used to write data in the MTJ device, which helps to reduce power consumption during write operation in comparison to the writing scheme in a classical MTJ device. Moreover, the non-volatility of such a design should reduce both power consumption and the configuration time required at each power-up of the circuit in comparison to classical SRAM-based FPGAs. A real time reconfigurable (RTR) micro-FPGA using TAS-MRAM allows dynamic reconfiguration mechanisms, while featuring a simple design architecture.