ISBN: 9781424419609 (print)
In recent years the financial world has seen an increasing demand for faster risk simulations, driven by growth in client portfolios. Traditionally, many financial models employ Monte-Carlo simulation, which can take excessively long to compute in software. This paper describes a hardware implementation of Collateralized Debt Obligation (CDO) pricing using the One-Factor Gaussian Copula (OFGC) model. We explore the precision requirements and the resulting resource utilization for each number representation. Our results show that our hardware implementation mapped onto a Xilinx XC5VSX50T is over 63 times faster than a software implementation running on a 3.4 GHz Intel Xeon processor.
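A minimal software sketch of the OFGC Monte-Carlo step described above: a single common factor and independent idiosyncratic factors drive correlated defaults, and the expected tranche loss is estimated by averaging over paths. All parameter values (obligor count, correlation, default probability, recovery, attachment points) are illustrative assumptions, not figures from the paper.

```python
# Hedged sketch of One-Factor Gaussian Copula (OFGC) tranche pricing by Monte Carlo.
import numpy as np
from statistics import NormalDist

def ofgc_tranche_loss(n_paths=100_000, n_obligors=125, rho=0.3,
                      default_prob=0.02, recovery=0.4,
                      attach=0.03, detach=0.07, seed=0):
    rng = np.random.default_rng(seed)
    threshold = NormalDist().inv_cdf(default_prob)        # default barrier Phi^-1(p)
    m = rng.standard_normal((n_paths, 1))                 # common market factor
    z = rng.standard_normal((n_paths, n_obligors))        # idiosyncratic factors
    latent = np.sqrt(rho) * m + np.sqrt(1.0 - rho) * z    # copula latent variables
    defaults = latent < threshold                         # default indicators per obligor
    loss = (1.0 - recovery) * defaults.mean(axis=1)       # portfolio loss fraction per path
    tranche = np.clip(loss - attach, 0.0, detach - attach) / (detach - attach)
    return tranche.mean()                                  # expected tranche loss fraction

print(ofgc_tranche_loss())
```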
ISBN: 9782839918442 (print)
Sparse Matrix Vector multiplication (SpMV) is an important kernel in many scientific applications. In this work we propose an architecture and an automated customisation method to detect and optimise the architecture for block diagonal sparse matrices. We evaluate the proposed approach in the context of the spectral/hp Finite Element Method, using the local matrix assembly approach. This problem leads to a large sparse system of linear equations with a block diagonal matrix, which is typically solved using an iterative method such as the Preconditioned Conjugate Gradient. The efficiency of the proposed architecture, combined with the effectiveness of the proposed customisation method, reduces BRAM resource utilisation by as much as 10 times, while achieving identical throughput to existing state-of-the-art designs and requiring minimal development effort from the end user. In the context of the Finite Element Method, our approach enables the solution of larger problems than previously possible, extending the applicability of FPGAs to more demanding HPC problems.
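A minimal sketch of the structure the architecture exploits: because the only non-zeros lie in dense diagonal blocks, y = Ax decomposes into small independent matrix-vector products, which is what makes the customisation and BRAM savings possible. Block sizes and values are illustrative assumptions.

```python
# Hedged sketch of SpMV for a block diagonal matrix.
import numpy as np

def block_diag_spmv(blocks, x):
    """y = A x where A is block diagonal and `blocks` holds its dense diagonal blocks."""
    y = np.empty_like(x)
    offset = 0
    for blk in blocks:
        b = blk.shape[0]
        y[offset:offset + b] = blk @ x[offset:offset + b]   # one small independent mat-vec
        offset += b
    return y

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((4, 4)) for _ in range(3)]    # three 4x4 diagonal blocks
x = rng.standard_normal(12)

# Check against the explicit (mostly zero) matrix.
dense = np.zeros((12, 12))
for i, blk in enumerate(blocks):
    dense[4 * i:4 * i + 4, 4 * i:4 * i + 4] = blk
assert np.allclose(block_diag_spmv(blocks, x), dense @ x)
```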
ISBN: 9781424438914 (print)
The capacity of FPGAs has grown significantly, leading to increased complexity of designs targeting these chips. Traditional FPGA design methodology using HDLs is no longer sufficient, and new methodologies are being sought. An attractive possibility is to use streaming languages. Streaming languages group data into streams, which are processed by computational nodes called kernels. They are suitable for implementation on FPGAs because they expose parallelism, which can be exploited by implementing the application in FPGA logic. Designers can express their designs in a streaming language and target FPGAs without needing a detailed understanding of digital logic design. In this paper we show how the Brook streaming language can be used to simplify design for FPGAs, while providing reasonable performance compared to other methodologies. We show that the throughput of streaming applications can be increased through automatic kernel replication. Using our compiler, the FPGA designer can trade off FPGA area and performance by changing the amount of kernel replication. We describe the details of our compiler and present the performance and area of a set of benchmarks. We found that throughput scales well with increased replication for most applications.
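The kernel replication idea can be sketched in software terms: the same kernel body is instantiated several times and stream elements are distributed across the instances, trading resources for throughput. The kernel body and replication factor below are illustrative assumptions, not output of the Brook compiler.

```python
# Conceptual sketch of kernel replication for a streaming workload.
from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    # stand-in for a Brook kernel body operating on one stream element
    return x * x + 1

def run_replicated(stream, replication):
    # distribute stream elements across `replication` kernel instances,
    # analogous to the compiler instantiating the kernel multiple times in logic
    with ThreadPoolExecutor(max_workers=replication) as pool:
        return list(pool.map(kernel, stream))

print(run_replicated(range(16), replication=4))
```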
ISBN: 9781424419609 (print)
Today, quasi-Monte Carlo (QMC) methods are widely used in finance to price derivative securities. the QMC approach is popular because for many types of derivatives it yields an estimate of the price, to a given accuracy, faster than other competitive approaches, like Monte Carlo (MC) methods. the calculation of the large number of underlying asset pathways consumes a significant portion of the overall run-time and energy of modem QMC derivative pricing simulations. therefore, we present an FPGA-based accelerator for the calculation of asset pathways suitable for use in the QMC pricing of several types of derivative securities. Although this implementation uses constructs (recursive algorithms and double-precision floating point) not normally associated with successful FPGA computing, we demonstrate performance in excess of 50x that of a 3 GHz multi-core processor.
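A minimal sketch of QMC asset-path generation of the kind described: a low-discrepancy point (here a Halton point, one prime base per time step) is mapped through the inverse normal CDF to drive a geometric Brownian motion path. The sequence construction and model parameters are illustrative assumptions rather than the paper's exact path generator.

```python
# Hedged sketch of quasi-Monte Carlo path generation for one asset path.
import math
from statistics import NormalDist

def van_der_corput(n, base):
    """n-th element of the base-b van der Corput sequence in [0, 1)."""
    q, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def qmc_gbm_path(point_index, s0=100.0, mu=0.05, sigma=0.2, dt=1 / 252):
    bases = [2, 3, 5, 7, 11, 13, 17, 19]          # one prime base per time step (Halton point)
    inv_cdf = NormalDist().inv_cdf
    s, path = s0, [s0]
    for b in bases:
        u = van_der_corput(point_index, b)        # low-discrepancy uniform for this dimension
        z = inv_cdf(min(max(u, 1e-12), 1 - 1e-12))  # Gaussian increment
        s *= math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)
        path.append(s)
    return path

print(qmc_gbm_path(point_index=1))
```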
ISBN: 9781467381239 (print)
Packet classification is a kernel application performed at network routers. Many classification engines are optimized for prefix and exact match, while a range-to-prefix translation can lead to rule set expansion. Under a limited power budget, it is challenging to achieve high classification throughput. In this paper, we present a high-performance and power-efficient packet classification engine on FPGA. We construct a modular Processing Element (PE); each PE compares a stride of the input packet header against a stride of a range boundary. We concatenate multiple PEs into a systolic array. Efficient power optimization techniques, including self-enabled power gating and entropy-based scheduling, are explored on our architecture. Experimental results show that, for 4K 15-field rule sets, our prototype on a state-of-the-art FPGA can achieve 250 Million Packets Per Second (MPPS) throughput. Using the proposed power optimization techniques, our classification engine consumes only 30% of the power without sacrificing throughput.
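The stride-based range match performed by the PEs can be sketched as follows: the header field and both range boundaries are split into fixed-width strides and compared most-significant stride first, with a running comparison state standing in for the result passed along the systolic chain. Field and stride widths here are illustrative assumptions.

```python
# Hedged sketch of stride-by-stride range matching for one header field.
def strides(value, width=16, stride=4):
    """Split a `width`-bit value into `stride`-bit chunks, most significant first."""
    n = width // stride
    mask = (1 << stride) - 1
    return [(value >> (stride * (n - 1 - i))) & mask for i in range(n)]

def range_match(header, low, high, width=16, stride=4):
    """Check low <= header <= high one stride at a time, MSB stride first."""
    low_state = high_state = "eq"          # running comparison state carried PE to PE
    for h, l, u in zip(strides(header, width, stride),
                       strides(low, width, stride),
                       strides(high, width, stride)):
        if low_state == "eq":
            low_state = "eq" if h == l else ("gt" if h > l else "lt")
        if high_state == "eq":
            high_state = "eq" if h == u else ("lt" if h < u else "gt")
    return low_state in ("eq", "gt") and high_state in ("eq", "lt")

assert range_match(0x1234, 0x1000, 0x1FFF)
assert not range_match(0x2345, 0x1000, 0x1FFF)
```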
ISBN: 9781424438914 (print)
This paper presents an analytical model that relates FPGA architectural parameters to the expected speed of an FPGA implementation. More precisely, the model relates the lookup-table size, cluster size, and number of inputs per cluster to the depth of the circuit after technology mapping and after clustering. Comparison to experimental results with large MCNC circuits shows that our models are accurate. We show how the models can be used in FPGA architectural investigations to complement the more usual experimental approach.
ISBN: 9781424438914 (print)
This paper presents a new Single Event Upset (SEU), Multiple Bit Upset (MBU) and Single Hardware Error (SHE) mitigation strategy for use in Virtex-4 FPGAs. This strategy aims to increase not only the effectiveness of traditional Triple Module Redundancy (TMR), but also the overall system availability. Frame readback with ECC detection and frame scrubbing are combined in a dynamically reconfigurable TMR architecture, designed under both spatial and implementation diversification premises. Moreover, since the strategy works in the device's bitstream domain, the basis of the Virtex-4 FPGA bitstream definition is also presented.
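For reference, the TMR masking that the strategy extends can be sketched as a bitwise majority vote over three replica outputs. The replica values and injected upset below are illustrative; the paper's ECC readback and frame scrubbing of the configuration bitstream are not modelled here.

```python
# Minimal sketch of Triple Module Redundancy (TMR) majority voting.
def tmr_vote(a, b, c):
    """Bitwise majority of three redundant results: a single faulty replica is masked."""
    return (a & b) | (a & c) | (b & c)

golden = 0b1011_0110
upset = golden ^ 0b0000_1000            # single-bit upset in one replica
assert tmr_vote(golden, golden, upset) == golden
```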
ISBN: 9781424419609 (print)
A geometric programming framework is proposed in this paper to automate exploration of the design space consisting of data reuse (buffering) exploitation and loop-level parallelization, in the context of FPGA-targeted hardware compilation. We expose the dependence between data reuse and data-level parallelization and explore both problems under the on-chip memory constraint for performance-optimal designs within a single optimization step. Results from applying this framework to several real benchmarks demonstrate that, given different constraints on on-chip memory utilization, the corresponding performance-optimal designs are automatically determined by the framework, and performance improvements of up to 4.7 times have been achieved compared with the method that first explores data reuse and then performs parallelization.
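The coupling between data reuse and parallelization can be illustrated with a toy exhaustive search standing in for the paper's geometric program: buffer size and parallelism are chosen jointly under one on-chip memory budget, rather than one after the other. The cost model and all numbers below are illustrative assumptions.

```python
# Conceptual sketch only: joint choice of reuse buffer size and parallelism under a memory budget.
def best_design(mem_budget_words, buffer_options, parallel_options,
                words_per_engine=256, cycles_base=1_000_000):
    best = None
    for buf in buffer_options:            # data-reuse buffer size (words)
        for p in parallel_options:        # number of parallel compute engines
            mem = buf + p * words_per_engine
            if mem > mem_budget_words:
                continue                  # violates the on-chip memory constraint
            # Larger reuse buffers cut off-chip traffic; more engines cut compute time
            # (illustrative cost model, not the paper's).
            cycles = cycles_base // p + cycles_base // max(buf, 1)
            if best is None or cycles < best[0]:
                best = (cycles, buf, p)
    return best

print(best_design(16_384, [256, 1024, 4096], [1, 2, 4, 8]))
```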
ISBN: 9782839918442 (print)
Recent advances in FPGA technology and the proliferation of High Level Synthesis (HLS) tools make it possible to implement complex System on Chip (SoC) designs that realize complete applications in a single FPGA device. To be able to exploit the large performance vs. area search space of such modern FPGA-based SoCs, system architects must have the appropriate performance analysis tools to evaluate, preferably at runtime, the computational requirements and the data flow of such a system and to determine potential performance bottlenecks when running realistic workloads. In this paper we introduce SoCLog, a framework that automatically enhances the platform architecture with additional hardware components used to generate activity logs when the base SoC architecture executes an application. This real-time information can be analyzed by the system designer to expose performance bottlenecks not only in aggregate, but also at clock-cycle granularity, revealing potential design inefficiencies when there are bursts of activity in the system. We evaluate our framework with a number of SoC designs to show that such logging information can be valuable for the design of a complex SoC in a modern FPGA, with minimal area overhead.
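As a sketch of how cycle-granularity activity logs of the kind SoCLog produces might be post-processed, the snippet below finds the most active component per time window, exposing bursty behaviour that aggregate counters would hide. The event format and component names are illustrative assumptions, not SoCLog's actual log format.

```python
# Illustrative post-processing of (cycle, component) activity events.
from collections import defaultdict

def busiest_windows(events, window=100):
    """Return, for each window of `window` cycles, the most active component and its event count."""
    per_window = defaultdict(lambda: defaultdict(int))
    for cycle, comp in events:
        per_window[cycle // window][comp] += 1
    return {w * window: max(counts.items(), key=lambda kv: kv[1])
            for w, counts in sorted(per_window.items())}

events = [(c, "dma") for c in range(0, 300, 3)] + [(c, "accel") for c in range(200, 260)]
print(busiest_windows(events))   # the burst of "accel" activity shows up only in the last window
```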
ISBN: 9782839918442 (print)
Random-sampling-based path planning algorithms have shown high efficiency in robotics, navigation and related fields. The Rapidly-Exploring Random Tree (RRT) is the typical method and works well in a variety of applications. Due to the sub-optimality of the original RRT, the more recent algorithm known as RRT* significantly improves the optimality of the solution by adding a "cost review" procedure. However, the original RRT suffers from a bottleneck of complicated iterations, and this becomes worse in RRT*. This paper presents a hardware architecture for RRT* that fully exploits the parallel potential of the algorithm. Unlike the sequential execution in software, the "exploration" and "review" steps are identified as independent processes and executed in parallel. For the complex operation of inserting vertices, a pipelined Kd-tree constructor is designed to quickly rebuild the tree when a new vertex is generated. Furthermore, to speed up near-neighbour and nearest-neighbour searching, the vertices are stored in separate Kd-trees so that the searches can be carried out concurrently in each tree. This work explores a feasible and power-efficient RRT* hardware architecture on FPGAs, compared to a PC implementation.
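The "cost review" (rewiring) step that the hardware parallelizes can be sketched as follows: after a new vertex is inserted, nearby vertices are re-parented through it whenever that lowers their path cost. The naive linear neighbour search below stands in for the paper's Kd-tree units, and the tree data is illustrative.

```python
# Hedged sketch of the RRT* rewiring ("cost review") step on a tiny 2D tree.
import math

def rewire(vertices, parents, costs, new_idx, radius):
    """Re-parent any vertex within `radius` of the new vertex if that lowers its cost."""
    nx, ny = vertices[new_idx]
    for i, (x, y) in enumerate(vertices):
        if i == new_idx:
            continue
        d = math.hypot(x - nx, y - ny)       # neighbour search done naively here
        if d <= radius and costs[new_idx] + d < costs[i]:
            parents[i] = new_idx             # route vertex i through the new vertex
            costs[i] = costs[new_idx] + d

# Tiny tree: root at the origin plus two existing vertices, then one new vertex.
vertices = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
parents = [None, 0, 1]
costs = [0.0, 1.0, 1.0 + math.hypot(1.0, 0.5)]
vertices.append((1.5, 0.5)); parents.append(0); costs.append(math.hypot(1.5, 0.5))
rewire(vertices, parents, costs, new_idx=3, radius=1.0)
print(parents, costs)   # vertex 2 is now re-parented through the new vertex at lower cost
```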