ISBN: 9780897919784 (print)
Architects of programmable logic devices (PLDs) face several challenges when optimizing a new device family for low manufacturing cost. When given an aggressive die-size goal, functional blocks that seem otherwise insignificant become targets for area reduction. Once low die cost is achieved, testing and packaging costs must also be considered. Interactions among these three cost contributors pose trade-offs that prevent independent optimization. This paper discusses solutions discovered by the architects optimizing the Altera FLEX 6000 architecture.
ISBN: 9781450361378 (print)
This paper outlines the Network-on-Chip (NoC) in Xilinx's next-generation Versal™ architecture. It is a hardened NoC that is present in Xilinx's next-generation 7 nm architecture devices. These devices include many other new hardened features that make up the Adaptable Computing Acceleration Platform (ACAP) devices. There is a trend in FPGA devices of hardening many commonly used components such as processors, memory controllers and other IO controllers. The next generation of Xilinx devices takes this a step further by providing a device-global, memory-mapped NoC which connects these components and the fabric in an integrated fashion. The NoC unifies communication between the processor system, FPGA fabric, memory subsystem and other hardened accelerator functions. This paper gives an overview of the Versal architecture NoC. It also motivates some of the specific characteristics of the architecture. We show how hardening the NoC lets users quickly implement high-performance system-level interconnect.
ISBN: 9781450356145 (print)
QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance in software programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources, and in combination with distributed memories, as well as the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping to parallel structures with inter-vector connectivity of a new QRD algorithm. Based on a Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the previously best known implementations. Mapped to an Intel Arria 10 device, we achieve 80 µs for a 256×256 single-precision real matrix, for a 417 GFLOP equivalent. This corresponds to a 95% sustained-to-peak ratio, for the portion of the device used for this work.
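For orientation, the following is a minimal sketch of the textbook Modified Gram-Schmidt factorization that the abstract names as its starting point. It is not the reorganized-loop algorithm of the paper, whose details the abstract does not give. The function name `mgs_qr`, the list-of-rows matrix layout, and the assumption of full column rank are illustrative choices, not taken from the source.

```python
import math


def mgs_qr(a):
    """Textbook Modified Gram-Schmidt QR (illustrative, not the paper's variant).

    `a` is an m x n matrix as a list of rows, assumed to have full column rank.
    Returns (q, r): q is m x n with orthonormal columns, r is n x n upper
    triangular, and q @ r reproduces a up to rounding error.
    """
    m, n = len(a), len(a[0])
    # Work on a column-wise copy so each column can be updated in place.
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Normalize the current column to obtain the k-th orthonormal vector.
        r[k][k] = math.sqrt(sum(x * x for x in v[k]))
        qk = [x / r[k][k] for x in v[k]]
        q_cols.append(qk)
        # Immediately remove the q_k component from every remaining column.
        # Using the already-updated columns is what distinguishes MGS from
        # classical Gram-Schmidt.
        for j in range(k + 1, n):
            r[k][j] = sum(qi * vi for qi, vi in zip(qk, v[j]))
            v[j] = [vi - r[k][j] * qi for qi, vi in zip(qk, v[j])]
    # Convert the orthonormal columns back to a list of rows.
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r
```

The inner loop over `j` has no dependency between iterations, which is the kind of column-level parallelism a vector-structured hardware mapping can exploit; the dependency that remains is the sequential loop over `k`.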
ISBN: 9781450326711 (print)
Latency-insensitive communication offers many potential benefits for FPGA designs, including easier timing closure by enabling automatic pipelining, and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency-insensitive communication building blocks, quantifies their overheads in terms of area and frequency, and provides guidance to designers on how to generate high-speed and area-efficient latency-insensitive systems.
ISBN: 9781450343541 (print)
Increasing computational power enables various new applications that were previously runtime-prohibitive. FPGAs are one such source of computational power, offering both reconfigurability and energy efficiency. In this paper, we demonstrate the feasibility of eyeglasses-free displays through FPGA acceleration. Specifically, we propose several techniques to accelerate the sparse matrix-vector multiplication and the L-BFGS iterative optimization algorithm with the consideration of the characteristics of FPGAs. The experimental results show that we reach a 12.78X overall speedup of the eyeglasses-free display application.
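As a reference point for the kernel being accelerated, the sketch below shows a plain software sparse matrix-vector multiplication over the common compressed sparse row (CSR) layout. The abstract does not say which storage format or acceleration techniques the authors use, so the CSR layout, the function name `csr_spmv`, and its parameter names are assumptions made only for illustration.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form (illustrative).

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits row i inside `values`
    x       : dense input vector
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of this row contribute to the dot product.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[2, 0, 1],
#      [0, 3, 0],
#      [0, 0, 4]]
values = [2.0, 1.0, 3.0, 4.0]
col_idx = [0, 2, 1, 2]
row_ptr = [0, 2, 3, 4]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [5.0, 6.0, 12.0]
```

The indirect read `x[col_idx[k]]` is the irregular memory access that typically makes this kernel hard to pipeline, which is one reason it is a natural target for a custom datapath.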
Locality exploitation is essential to asymptotic energy minimization for gate array netlist evaluation. Naive implementations that ignore locality, including flat crossbars and simple processors based on monolithic me...
ISBN: 9780897919784 (print)
While reconfigurable computing promises to deliver incomparable performance, it is still a marginal technology due to the high cost of developing and upgrading applications. Hardware virtualization can be used to significantly reduce both these costs. In this paper we describe the benefits of hardware virtualization, and show how it can be achieved using a combination of pipeline reconfiguration and run-time scheduling of both configuration streams and data streams. The result is PipeRench, an architecture that supports robust compilation and provides forward compatibility. Our preliminary performance analysis predicts that PipeRench will outperform commercial FPGAs and DSPs in both overall performance and in performance per mm².
A fundamental feature of Dynamically Reconfigurable FPGAs (DRFPGAs) is that the logic and interconnect are time-multiplexed. Thus, for a circuit to be implemented on a DRFPGA, it needs to be partitioned such that each subcircuit can be executed at a different time. In this paper, the partitioning of sequential circuits for execution on a DRFPGA is studied. To determine how to correctly partition a sequential circuit, and what the costs of doing so are, we propose a new gate-level model that handles time-multiplexed computation. We also introduce an enhanced force-directed scheduling (FDS) algorithm to partition sequential circuits that finds a correct partition with low logic and communication costs, under the assumption that maximum performance is desired. We use our algorithm to partition seven large ISCAS'89 sequential benchmark circuits. The experimental results show that the enhanced FDS reduces communication costs by 27.5% with only a 1.1% increase in the gate cost compared to traditional FDS.
ISBN: 9781605584102 (print)
We present in this paper the first reported FPGA implementation of the Position Specific Iterated BLAST (PSI-BLAST) algorithm. The latter is a heuristic biological sequence alignment algorithm that is widely used in the bioinformatics and computational biology world in order to detect weak homologs. The architecture of our FPGA implementation is parameterized in terms of sequence lengths, scoring matrix, gap penalties and cut-off and threshold values. It is composed of various blocks, each of which performs one step of the algorithm in parallel. This results in high-performance implementations, which easily outperform equivalent software implementations by one order of magnitude or more. Furthermore, the core was captured in an FPGA-platform-independent language, namely the Handel-C language, to which no specific resource inference or placement constraints were applied. This makes our core portable across different FPGA families and architectures. Copyright 2009 ACM.
ISBN: 1595932925 (print)
The aim of this paper is to propose a real-time reconfigurable (RTR) micro-FPGA using a new non-volatile memory. Magnetic tunneling junctions (MTJs) used in magnetic random access memories (MRAMs) are compatible with classical CMOS processes. Moreover, the remanent property of such a memory could limit the configuration time and power consumption required at each power-up of the die. Nevertheless, each configuration memory point has to be readable independently of the others, which is why the approach differs from the classical memory-array one. Copyright 2006 ACM.