Multi-point distributed random variables whose moments match those of a Gaussian random variable up to a certain order play an important role in Monte Carlo simulations of weak approximations of stochastic differential equations. In applications such as finance, where "real time" execution is required, there is a strong need for highly efficient implementations. In this paper, a fast and flexible dedicated hardware solution on a field-programmable gate array (FPGA) is presented. A comparative performance analysis between a software-only and the proposed hardware solution demonstrates that the FPGA solution is bottleneck-free, retains the flexibility of the software solution and significantly increases the computational efficiency.
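To make the moment-matching idea in the abstract above concrete, the following is a minimal software sketch. It assumes the standard textbook three-point variable (values -√3, 0, +√3 with probabilities 1/6, 2/3, 1/6), which matches the moments of N(0, 1) up to order 5; the abstract does not say which distributions the paper actually implements, and all function names here are hypothetical.

```python
import math
import random

# Assumed example: the standard three-point variable, not necessarily
# the distribution implemented in the paper.
VALUES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
PROBS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def three_point_moment(k):
    """Exact k-th moment of the three-point variable."""
    return sum(p * v ** k for v, p in zip(VALUES, PROBS))


def gaussian_moment(k):
    """k-th moment of N(0, 1): 0 for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    result = 1.0
    for j in range(k - 1, 0, -2):
        result *= j
    return result


def sample_three_point(rng):
    """Draw one sample from a single uniform random number."""
    u = rng.random()
    if u < 1.0 / 6.0:
        return VALUES[0]
    if u < 5.0 / 6.0:
        return VALUES[1]
    return VALUES[2]


if __name__ == "__main__":
    # Print the moments side by side for orders 0 through 6.
    for k in range(7):
        print(k, three_point_moment(k), gaussian_moment(k))
```

The sampler needs only one uniform draw and two comparisons per sample, which is the kind of operation that maps naturally onto dedicated hardware.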
ISBN (print): 9780897919784
To implement high-density and high-speed FPGA circuits, designers need tight control over the circuit implementation process. However, current design tools are unsuited for this purpose as they lack fast turnaround times, interactiveness, and integration. We present a system for the Xilinx XC6200 FPGA, which addresses these issues. It consists of a suite of tightly integrated tools for the XC6200 architecture centered around an architecture-independent tool framework. The system lets the designer easily intervene at various stages of the design process and features design cycle times (from an HDL specification to a complete layout) in the order of seconds.
ISBN (print): 1595932925
This paper examines the tradeoffs between flexibility, area, and power dissipation of programmable clock networks for field-programmable gate arrays (FPGAs). The paper begins by describing a parameterized clock network model that describes a broad range of programmable clock network architectures. Specifically, the model supports architectures with multiple local and global clock domains and varying amounts of flexibility at various levels of the clock network. Using the model, the architectural parameters that control the flexibility of the clock network are varied to determine the cost of this flexibility in terms of area and power dissipation. From these experiments, the study finds that area and power costs are highest for networks with flexibility close to the logic blocks. Furthermore, it finds that clock networks with local clock domains have little overhead and are significantly more efficient than clock networks without local clock domains for applications with multiple clocks. Copyright 2006 ACM.
This article is a concise literature review of the current state of the art in arithmetic for field-programmable gate arrays (FPGAs), including studies, implementation techniques, operators, and structures, in various area-time tradeoffs. It covers the integer operations of addition/subtraction, multiplication, squaring, division, and square root, in parallel, and in both serial modes (least-significant digit first, and online). Many people, including researchers in the fields of computer arithmetic, parallel computing, and digital signal and image processing, system-on-a-programmable-chip (SoPC) designers, and others who need to implement special-purpose arithmetic circuits on FPGAs, might find such a review useful, either as an introduction to the topic, as a knowledge update, or for reference.
ISBN (print): 9781605584102
Via-programmable gate arrays (VPGAs) offer a middle ground between application-specific integrated circuits and field-programmable gate arrays in terms of flexibility, manufacturing, speed, power and area. In this paper, we present a VPGA logic cell, the complementary universal logic gate (CULG), which can be used to implement both sequential and combinatorial elements. Its performance is compared with a number of other designs, including transmission gate, differential cascode voltage switch with pass gate, and standard cell. The CULG is found to have comparable delay product and process variation sensitivity to the other designs while offering the lowest power consumption. Copyright 2009 ACM.
ISBN (print): 9780897919784
Architects of programmable logic devices (PLDs) face several challenges when optimizing a new device family for low manufacturing cost. When given an aggressive die-size goal, functional blocks that seem otherwise insignificant become targets for area reduction. Once low die cost is achieved, it is seen that testing and packaging costs must be considered. Interactions among these three cost contributors pose trade-offs that prevent independent optimization. This paper discusses solutions discovered by the architects optimizing the Altera FLEX 6000 architecture.
ISBN (print): 9781450361378
This paper outlines the Network-on-Chip (NoC) on Xilinx's next generation Versal (TM) architecture. It is a hardened NoC that is present in Xilinx's next-generation 7nm architecture devices. These devices include many other new hardened features that make up the Adaptable Computing Acceleration Platform (ACAP) devices. There is a trend in FPGA devices of hardening many commonly used components such as processors, memory controllers and other IO controllers. The next generation of Xilinx devices take this a step further by providing a device-global memory mapped NoC which connects these components and the fabric in an integrated fashion. The NoC unifies communication between the processor system, FPGA fabric, memory subsystem and other hardened accelerator functions. This paper gives an overview of the Versal architecture NoC. It also motivates some of the specific characteristics of the architecture. We show how hardening the NoC lets users quickly implement high performance system level interconnect.
ISBN (print): 9781450356145
QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance in software programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources, and in combination with distributed memories, as well as the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping to parallel structures with inter-vector connectivity of a new QRD algorithm. Based on a Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the previously best known implementations. Mapped to an Intel Arria 10 device, we achieve 80 µs for a 256×256 single-precision real matrix, for a 417 GFLOP equivalent. This corresponds to a 95% sustained-to-peak ratio, for the portion of the device used for this work.
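For readers unfamiliar with the baseline named in the abstract above, here is a minimal sketch of textbook Modified Gram-Schmidt QR in pure Python. It illustrates only the standard MGS loop order and its data dependencies; it does not reproduce the paper's reorganized loop structure or its FPGA mapping, and the function name `mgs_qr` is a hypothetical choice for this example. It assumes a real matrix with full column rank.

```python
import math


def mgs_qr(a):
    """Textbook Modified Gram-Schmidt QR (assumes full column rank).

    a is a list of m rows with n entries each. Returns (q, r), where q is
    m x n with orthonormal columns and r is n x n upper triangular.
    """
    m, n = len(a), len(a[0])
    # Work column by column: v[j] holds the (progressively updated) column j.
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Normalize the current column to obtain q_j.
        r[j][j] = math.sqrt(sum(x * x for x in v[j]))
        qj = [x / r[j][j] for x in v[j]]
        q_cols.append(qj)
        # Remove the q_j component from every remaining column. Each of
        # these updates depends on q_j, which is the dependency chain
        # that limits parallelism across successive columns.
        for k in range(j + 1, n):
            r[j][k] = sum(qx * vx for qx, vx in zip(qj, v[k]))
            v[k] = [vx - r[j][k] * qx for vx, qx in zip(v[k], qj)]
    # Convert the list of columns back to a list of rows.
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


if __name__ == "__main__":
    q, r = mgs_qr([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
    for row in r:
        print(row)
```

The inner loop over `k` is independent across columns for a fixed `j`, so it is the natural place for parallel hardware; the outer loop over `j` is sequential.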
ISBN (print): 9781450326711
Latency insensitive communication offers many potential benefits for FPGA designs, including easier timing closure by enabling automatic pipelining, and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency insensitive communication building blocks, quantifies their overheads in terms of area and frequency, and provides guidance to designers on how to generate high-speed and area-efficient latency insensitive systems.