ISBN: 9781450311557 (print)
Gorilla is a methodology for generating FPGA-based solutions that is especially well suited to data-parallel applications with fine-grain irregularity. Irregularity simultaneously destroys performance and increases power consumption on many data-parallel processors such as general-purpose graphics processing units (GPGPUs). Gorilla achieves high performance and low power through the use of FPGA-tailored parallelization techniques and application-specific hardwired accelerators, processing engines, and communication mechanisms. Automatic compilation from a stylized C language and templates that define the hardware structure, coupled with the intrinsic flexibility of FPGAs, provides high performance, low power, and programmability. Gorilla's capabilities are demonstrated through the generation of a family of core-router network processors processing up to 100Gbps (200MPPS for 64B packets) supporting any mix of IPv4, IPv6, and Multi-Protocol Label Switching (MPLS) packets on a single FPGA with off-chip IP lookup tables. A 40Gbps version of that network processor was run with an embedded test rig on a Xilinx Virtex-6 FPGA, verifying its performance and correctness. Its measured power consumption is comparable to that of full-custom commercial network processors. In addition, it is demonstrated how Gorilla can be used to generate merged virtual routers, saving FPGA resources.
This tutorial describes tools for efficiently implementing floating-point applications on FPGAs. We present both the SDK for OpenCL and DSP Builder Advanced Blockset and show that they can be effectively used to imple...
ISBN: 9781581134520 (print)
Recent years have seen a tremendous increase in the capacities and capabilities of field-programmable gate arrays (FPGAs). Much of this dramatic improvement has been the result of changes to the FPGAs' internal architectures. New architectural proposals are routinely generated in both academia and industry. For FPGAs to continue to grow, it is important that these new architectural ideas are fairly and accurately evaluated, so that worthy ideas can be included in future chips. Typically, this evaluation is done using experimentation. However, the use of experimentation is dangerous, since it requires making assumptions regarding the tools and architecture of the device in question. If these assumptions are not accurate, the conclusions from the experiments may not be meaningful. In this paper, we investigate the sensitivity of FPGA architectural conclusions to experimental variations. To make our study concrete, we evaluate the sensitivity of four previously published and well-known FPGA architectural results: lookup-table size, switch block topology, cluster size, and memory size. It is shown that these experiments are significantly affected by the assumptions, tools, and techniques used in the experiments.
ISBN: 9781450338561 (print)
Bitwidth optimization of FPGA datapaths can save hardware resources by choosing the fewest bits required for each datapath variable to achieve a desired quality of result. However, it is an NP-hard problem that requires unacceptably long runtimes when using sequential CPU-based heuristics. We show how to parallelize the key steps of bitwidth optimization on the GPU by performing a fast brute-force search over a carefully constrained search space. We develop a high-level synthesis methodology suitable for rapid prototyping of bitwidth-annotated RTL code generation using gcc's GIMPLE backend. For range analysis, we perform parallel evaluation of sub-intervals to provide tighter bounds compared to ordinary interval arithmetic. For bitwidth allocation, we enumerate the different bitwidth combinations in parallel by assigning each combination to a GPU thread. We demonstrate up to 10-1000x speedups for range analysis and 50-200x speedups for bitwidth allocation when comparing an NVIDIA K20 GPU implementation to an Intel Core i5-4570 CPU while maintaining identical solution quality across various benchmarks. This allows us to generate tailor-made RTL with minimum bitwidths in hundreds of milliseconds instead of hundreds of minutes when starting from high-level C descriptions of dataflow computations.
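The sub-interval idea behind the range analysis can be illustrated with a small sequential sketch. It is not the paper's GPU implementation: the example function f(x) = x * (16 - x), the helper names, and the choice of four sub-intervals are all assumptions made here for illustration. Ordinary interval arithmetic treats the two occurrences of x as independent and overestimates the range; evaluating each sub-interval separately and taking the union gives a tighter bound, which in turn needs fewer integer bits.

```python
import math

def mul_interval(a, b):
    """Product of two intervals given as (lo, hi) tuples."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))

def sub_interval(a, b):
    """Difference a - b of two intervals."""
    return (a[0] - b[1], a[1] - b[0])

def f_range(x):
    """Interval evaluation of the illustrative function f(x) = x * (16 - x)."""
    return mul_interval(x, sub_interval((16.0, 16.0), x))

def split_range(x, pieces):
    """Evaluate f on equal sub-intervals of x and return the union of the bounds.

    Each sub-interval is independent, so on a GPU each one could go to its own
    thread; this sketch simply loops over them.
    """
    lo, hi = x
    step = (hi - lo) / pieces
    bounds = [f_range((lo + i * step, lo + (i + 1) * step)) for i in range(pieces)]
    return (min(b[0] for b in bounds), max(b[1] for b in bounds))

def unsigned_integer_bits(bound):
    """Integer bits needed to hold the upper bound (assumes a non-negative range)."""
    return int(math.floor(bound[1])).bit_length()

plain = f_range((0.0, 16.0))         # ordinary interval arithmetic
tight = split_range((0.0, 16.0), 4)  # union over four sub-intervals
print(plain, unsigned_integer_bits(plain))
print(tight, unsigned_integer_bits(tight))
```

The true maximum of this function on [0, 16] is 64, so even the split bound is conservative; using more sub-intervals would tighten it further at the cost of more evaluations.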
ISBN: 9781605584102 (print)
Packet classification is an important operation for applications such as routers, firewalls, and intrusion detection systems. Many algorithms and hardware architectures for packet classification have been created, but none of them can compete with the speed of TCAMs in the worst case. We propose a new hardware-based algorithm for packet classification. The solution is based on problem decomposition and is aimed at the highest network speeds. A unique property of the algorithm is its constant time complexity in terms of external memory accesses: it performs exactly two external memory accesses to classify a packet. Using an FPGA and one commodity SRAM chip, a throughput of 150 million packets per second can be achieved. This corresponds to a throughput of 100 Gbps for the shortest packets. Further performance scaling is possible with more or faster SRAM chips. Copyright 2009 ACM.
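The relation between 150 million packets per second and 100 Gbps can be checked with a short calculation. The abstract does not say what "shortest packets" means; the sketch below assumes minimum-size 64-byte Ethernet frames plus the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, and inter-frame gap). The function and constant names are hypothetical.

```python
# Assumption: "shortest packets" are minimum-size Ethernet frames.
ETHERNET_MIN_FRAME_BYTES = 64
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
ETHERNET_WIRE_OVERHEAD_BYTES = 20

def line_rate_gbps(packets_per_second, frame_bytes=ETHERNET_MIN_FRAME_BYTES):
    """Line rate occupied by back-to-back frames of the given size."""
    bits_on_wire = (frame_bytes + ETHERNET_WIRE_OVERHEAD_BYTES) * 8
    return packets_per_second * bits_on_wire / 1e9

def memory_accesses_per_second(packets_per_second, accesses_per_packet=2):
    """External memory access rate, at two accesses per classified packet."""
    return packets_per_second * accesses_per_packet

print(line_rate_gbps(150e6))              # about 100.8 Gbps under these assumptions
print(memory_accesses_per_second(150e6))  # 300 million accesses per second
```

Under these assumptions the SRAM would have to sustain 300 million accesses per second, which is consistent with the remark that more or faster SRAM chips allow further scaling.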
Voltage noise not only detracts from reliability and performance, but has been used to attack system security. Most systems are completely unaware of fluctuations occurring on nanosecond time scales. This paper quanti...
ISBN: 9781450333153 (print)
Each generation of FPGA architecture benefits from optimizations around its technology node and target usage. In this paper, we discuss some of the changes made to the CLB for Xilinx's 20nm UltraScale product family. We motivate those changes and demonstrate better results than previous CLB architectures on a variety of metrics. We show that, in demanding scenarios, logic placed in an UltraScale device requires 16% less wirelength than 7-series. Designs mapped to UltraScale devices also require fewer logic tiles. In this paper, we demonstrate the utilization benefits of the UltraScale CLB attributed to certain CLB enhancements. The enhancements described herein result in an average packing improvement of 3% for the example design suite. We also show that the UltraScale architecture handles aggressive, tighter packing more gracefully than previous FPGA generations. These significant reductions in wirelength and CLB counts translate directly into power, performance, and ease-of-use benefits.
ISBN: 9780897918015 (print)
High-capacity FPGAs present device architects with a variety of problems. The most obvious of these is interconnect capacity. Others include interconnect performance, clock distribution, and IO capacity. This paper describes these problems and the solutions chosen for them in the Xilinx XC4000EX family architecture.
ISBN: 9781450394178 (print)
In database query processing, aggregation is an operator by which data with a common property is grouped and expressed in a summary form. Early aggregation is a popular method for improving the performance of the aggregation operator. In this paper, we study early aggregation algorithms in the context of query processing acceleration in database systems on FPGAs. The comparative study leads us to set-associative caches with a low inter-reference recency set (LIRS) replacement policy. They show both great performance and modest implementation complexity compared to some of the most prominent early aggregation algorithms. We also present a novel application-specific architecture for implementing set-associative caches. Benchmarks of our implementation show speedups of up to 3x for end-to-end aggregation compared to a state-of-the-art FPGA-based query engine.
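A software sketch may help show how a set-associative cache performs early aggregation. It is only an illustration under stated assumptions: it uses plain LRU replacement instead of the LIRS policy the paper selects, it computes a SUM aggregate, and the class name, parameters, and input rows are hypothetical. Rows that hit in the cache are folded into a running partial sum; on a conflict the oldest entry of the set is evicted as a partial aggregate for a later final aggregation step.

```python
from collections import OrderedDict

class SetAssociativeAggregator:
    """Early SUM aggregation in a small set-associative cache (LRU, not LIRS)."""

    def __init__(self, num_sets=4, ways=2):
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.ways = ways
        self.evicted = []  # partial aggregates passed on to the final stage

    def add(self, key, value):
        cache_set = self.sets[hash(key) % len(self.sets)]
        if key in cache_set:
            cache_set[key] += value      # hit: aggregate in place
            cache_set.move_to_end(key)   # mark as most recently used
            return
        if len(cache_set) >= self.ways:  # miss in a full set: evict the LRU entry
            self.evicted.append(cache_set.popitem(last=False))
        cache_set[key] = value

    def flush(self):
        """Emit everything still cached and return all partial aggregates."""
        for cache_set in self.sets:
            while cache_set:
                self.evicted.append(cache_set.popitem(last=False))
        partials, self.evicted = self.evicted, []
        return partials

def final_aggregate(partials):
    """Merge partial aggregates that share a key."""
    totals = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return totals

rows = [(1, 10), (2, 5), (1, 7), (3, 1), (5, 2), (1, 3)]
aggregator = SetAssociativeAggregator(num_sets=2, ways=1)
for key, value in rows:
    aggregator.add(key, value)
partials = aggregator.flush()
totals = final_aggregate(partials)
print(len(rows), len(partials), totals)
```

The cache only reduces the number of tuples reaching the final stage; correctness still depends on the final merge, because a key that was evicted and later reinserted appears in more than one partial aggregate.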
ISBN: 9781581133417 (print)
Wave-steering is a new design methodology that realizes high-throughput circuits by embedding layout-friendly synthesized structures in silicon. In the wave-steering design methodology, circuits inherently utilize latches. Inside the synthesized structures they are used for signal skewing, and on the interconnects they guarantee the correct arrival times at the inputs. Recently, we proposed a novel high-throughput FPGA architecture based on the wave-steering design principle to handle throughput-intensive applications. Our previous work focused mainly on the Logic Block (LB) design. In this paper we discuss a pipelined interconnect scheme to support the strict timing requirements that are necessitated by the wave-steered design style. We characterize designs that best fit the new architecture and show that as technology scales down towards deep submicron (DSM), this FPGA fabric shows increasing throughput performance.