ISBN: 9781424419609 (print)
We present a novel approach to accelerating the Haar-classifier-based face detection algorithm on an FPGA. With a highly pipelined microarchitecture that exploits the abundant parallel arithmetic units in the FPGA, we achieve real-time face detection with a very high detection rate and few false positives. Moreover, our approach adapts to the resources available on the FPGA chip. This work also provides insight into using FPGAs to accelerate non-systolic vision algorithms. Our implementation is realized on a HiTech Global PCIe card that contains a Xilinx XC5VLX110T FPGA chip.
A dynamically self-reconfigurable Master-Slaves MPSoC architecture framework is introduced which can be fully embedded into a single FPGA device. The Master core can request a Configuration Manager module to add or remove a slave core at runtime. If the request can be satisfied, self-reconfiguration commences, implemented by a pipeline of lightweight specialized blocks. The M-S architecture uses a simple, general token-based bus control mechanism that is reconfiguration-aware. All system modules have been described in synthesizable VHDL. A first system prototype has been built and validated on the affordable XUP XC2VP30 board. Even with a CRC check of the bitstreams, dynamic reconfiguration can proceed at the maximum speed supported by the Xilinx ICAP interface. The reconfiguration support logic consumes as few as 1012 slices on the Virtex-II Pro FPGA.
Wavefront algorithms, such as the Smith-Waterman algorithm, are commonly used in bioinformatics for exact local and global sequence alignment. These algorithms are highly computationally intensive and are therefore excellent candidates for FPGA-based code acceleration. However, there is no standard form of these algorithms: they are used in a wide variety of situations with various constraints. It is therefore not practical to have a standard kernel that can be mapped to an FPGA, hence the importance of being able to compile such codes from a high-level language. ROCCC is a C-to-VHDL compiler that optimizes and parallelizes the most frequently executed kernel loops in applications such as multimedia, scientific, and high-performance computing. In this paper we describe the transformations performed by ROCCC, which transformed the kernel of the Smith-Waterman algorithm into a hardware systolic array that is mapped onto the FPGA on the SGI Altix RASC blade. We report a throughput increase of over 3,000x relative to a 2.8 GHz Xeon.
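To make the wavefront structure concrete, the following is a minimal software sketch of the Smith-Waterman recurrence evaluated one anti-diagonal at a time. It is not ROCCC input or output. The function name, the scoring values (`match`, `mismatch`, `gap`), and the linear gap penalty are assumptions chosen for illustration. Cells on the same anti-diagonal depend only on the two previous anti-diagonals, which is the property a systolic array can exploit by computing them in parallel.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score of a and b (linear gap penalty).

    Scoring parameters are illustrative assumptions, not values from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    # Sweep anti-diagonals d = i + j. Every cell on one anti-diagonal is
    # independent of the others on it, so the inner loop could run in parallel.
    for d in range(2, rows + cols - 1):
        for i in range(max(1, d - cols + 1), min(rows, d)):
            j = d - i
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                               # start a new local alignment
                h[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,               # gap in b
                h[i][j - 1] + gap,               # gap in a
            )
            best = max(best, h[i][j])
    return best
```

In this software form the inner loop is still sequential; a hardware mapping would typically assign one processing element per cell of the anti-diagonal.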
Dividing an application between a conventional processor and an acceleration card with FPGA chips has proved to be a suitable way to accelerate computationally intensive tasks. In such applications, the designer usually has to implement an interconnection between the components placed in the FPGA and the host system bus. This task is often complicated by the differing requirements of user components for throughput, read latency, DMA transfers, and so on. The objective of this work is to present a new approach to implementing interconnection systems and to let the designer focus on developing the target application. The proposed interconnection system is based on a tree topology. The system eliminates the sensitivity of wide buses to distance, supports the connection of components with different throughput requirements, and supports a split-transaction model, among other features. The proposed system is implemented and evaluated on Virtex-5 chips.
The SMILE project accelerates scientific and industrial applications by means of a cluster of low-cost FPGA boards. With this approach, the intensive calculation tasks are accelerated using the FPGA logic, while the communication patterns of the applications remain unchanged through the use of a message-passing library over Linux. This paper explains the cluster architecture: the SMILE nodes and the high-speed communication network developed for the FPGA RocketIO interfaces. A SystemC model developed to simulate the cluster is also detailed. To show the potential of the SMILE proposal, a parallel Content-Based Information Retrieval application has been developed and compared with an HP cluster architecture in terms of response time and power consumption.
In recent years the financial world has seen an increasing demand for faster risk simulations, driven by growth in client portfolios. Traditionally, many financial models employ Monte Carlo simulation, which can take excessively long to compute in software. This paper describes a hardware implementation of Collateralized Debt Obligation (CDO) pricing using the One-Factor Gaussian Copula (OFGC) model. We explore the precision requirements and the resulting resource utilization for each number representation. Our results show that our hardware implementation, mapped onto a Xilinx XC5VSX50T, is over 63 times faster than a software implementation running on a 3.4 GHz Intel Xeon processor.
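As a rough software illustration of the kind of computation being accelerated, the sketch below estimates the expected loss of one tranche under a one-factor Gaussian copula by Monte Carlo simulation. It is a textbook-style simplification and not the paper's hardware design: it assumes a homogeneous portfolio, a single time horizon, a fixed recovery rate, and no discounting. The function name and all parameters (`n_names`, `p_default`, `rho`, `attach`, `detach`, `recovery`) are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def expected_tranche_loss(n_names, p_default, rho, attach, detach,
                          recovery=0.4, n_paths=5000, seed=0):
    """Monte Carlo estimate of expected tranche loss, as a fraction of
    total portfolio notional, under a one-factor Gaussian copula.

    Simplifying assumptions: equal notionals, a common default probability,
    one horizon, fixed recovery, and no discounting.
    """
    rng = random.Random(seed)
    # A name defaults when its latent variable falls below this threshold.
    threshold = NormalDist().inv_cdf(p_default)
    loss_per_default = (1.0 - recovery) / n_names
    w_market, w_idio = sqrt(rho), sqrt(1.0 - rho)
    total = 0.0
    for _ in range(n_paths):
        market = rng.gauss(0.0, 1.0)  # systematic factor shared by all names
        defaults = sum(
            1 for _ in range(n_names)
            if w_market * market + w_idio * rng.gauss(0.0, 1.0) < threshold
        )
        portfolio_loss = defaults * loss_per_default
        # The tranche absorbs losses between its attachment and detachment points.
        total += min(max(portfolio_loss - attach, 0.0), detach - attach)
    return total / n_paths
```

For a tranche spanning the whole portfolio (`attach=0.0`, `detach=1.0`), the estimate should approach `p_default * (1 - recovery)`, which gives a simple sanity check on the simulation.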
We propose a variation-aware post-fabrication optimization scheme for FPGAs. Variation-aware optimization usually incurs a large measurement cost. The proposed scheme achieves a constant optimization cost for any circuit configuration. We use delay detectors embedded in clustered CLBs to choose the fastest paths among multiple candidates. The delay detectors enable simultaneous measurement of critical-path candidates by partitioning all critical paths into segments. The number of measurements needed to choose the fastest paths on all critical paths depends not on the configuration but on the FPGA architecture. We confirm that a simple heuristic algorithm can find a measurement order whose cost is near the lower bound and almost constant regardless of the circuit configuration.
A geometric programming framework is proposed in this paper to automate exploration of the design space consisting of data reuse (buffering) exploitation and loop-level parallelization, in the context of FPGA-targeted hardware compilation. We expose the dependence between data reuse and data-level parallelization and explore both problems under the on-chip memory constraint for performance-optimal designs within a single optimization step. Results from applying this framework to several real benchmarks demonstrate that, given different constraints on on-chip memory utilization, the framework automatically determines the corresponding performance-optimal designs, and that performance improvements of up to 4.7 times are achieved compared with the method that first explores data reuse and then performs parallelization.
In the context of FPGAs, system downgrade consists of preventing the update of the hardware configuration or replaying an old bitstream. The objective can be to preclude a system designer from fixing security vulnerabilities in a design. Such an attack can be performed over a network when the FPGA-based system is remotely updated, or on the bus between the configuration memory and the FPGA chip at power-up. Several security schemes providing encryption and integrity checking of the bitstream have been proposed in the literature. However, as we show in this paper, they do not detect the replay of old FPGA configurations; hence they give adversaries the opportunity to downgrade the system. We thus propose a new architecture that, in addition to ensuring bitstream confidentiality and integrity, precludes replay of old bitstreams. We show that the hardware cost of this architecture is negligible.
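The abstract does not describe the proposed architecture, so the sketch below only illustrates the general problem with a generic countermeasure: binding a version number into the authentication tag and rejecting any version that is not newer than the one already installed. This is an assumed, simplified scheme and not the paper's design. The names `make_tag` and `UpdateGate`, the 8-byte version encoding, and the use of HMAC-SHA256 are all hypothetical choices; encryption of the bitstream is omitted.

```python
import hashlib
import hmac


def make_tag(key: bytes, version: int, bitstream: bytes) -> bytes:
    """Hypothetical vendor-side helper: authenticate version and bitstream together."""
    message = version.to_bytes(8, "big") + bitstream
    return hmac.new(key, message, hashlib.sha256).digest()


class UpdateGate:
    """Hypothetical device-side check: integrity first, then freshness."""

    def __init__(self, key: bytes, installed_version: int = 0):
        self._key = key
        # Assumed to live in tamper-resistant non-volatile storage on a real device.
        self.installed_version = installed_version

    def accept(self, version: int, bitstream: bytes, tag: bytes) -> bool:
        expected = make_tag(self._key, version, bitstream)
        if not hmac.compare_digest(expected, tag):
            return False  # corrupted or forged bitstream
        if version <= self.installed_version:
            return False  # authentic but stale: a replayed old configuration
        self.installed_version = version
        return True
```

An integrity check alone would accept the replayed bitstream, because its tag is still valid; the version comparison is what rejects it in this sketch.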
In recent years, LDPC codes have become increasingly popular because of their near-Shannon-limit error-correcting performance. Structured code classes that ease decoder design have already been standardized for DVB-S2, IEEE WiMax 802.16e, and WiFi. In this paper we introduce a flexible decoder architecture that can decode any structured or unstructured LDPC code using identical hardware. Furthermore, we present a mapping algorithm that "compiles" the parity-check matrix of the desired LDPC code. This concept allows the decoder controller to be adapted to different LDPC codes without requiring a new synthesis run. We implemented the proposed decoder on a XILINX XC4LX160 FPGA and give bit error rates to verify the design and the mapping algorithm. In contrast to previously presented flexible implementations, our design can decode LDPC codes with codeword lengths 30 times longer, up to N = 65,000.