Although high-level synthesis improves fpga productivity by enabling designers to use high-level code, the resulting performance is often significantly worse than register-transfer-level designs. One cause of such lim...
详细信息
High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry...
详细信息
ISBN:
(纸本)9781450318877
High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity. In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communica
There are large numbers of high-level algorithms consisting of multiple loop nests in image compression, pattern recognition and digital signal processing. fpga provides a convenient and flexible solution to speed up ...
详细信息
This work describes a compact fpga hardware architecture for computing modular multiplications over GF(p) using the Montgomery method, suitable for public key cryptography for embedded or mobile systems. The multiplie...
详细信息
作者:
Madden, LiamXilinx
Inc. 2100 Logic Drive San Jose CA 95124 United States
Since the advent of integrated circuit technology in 1958, the industry has focused primarily on monolithic integration. Unfortunately, due to physical and economic issues, the vast majority of high performance analog...
详细信息
The computational complexity of disparity estimation algorithms and the need of large size and bandwidth for the external and internal memory make the real-time processing of disparity estimation challenging, especial...
详细信息
fieldprogrammablegatearrays (fpga) are extensively used for rapid prototyping in embedded system applications. While hardware acceleration can be done via specialized processors like a Graphical Processing Unit (GP...
详细信息
ISBN:
(纸本)9781450318877
fieldprogrammablegatearrays (fpga) are extensively used for rapid prototyping in embedded system applications. While hardware acceleration can be done via specialized processors like a Graphical Processing Unit (GPU), they can also be accomplished with fpgas for more specialized scenarios. GPUs essentially consist of massively parallel cores and have high memory bandwidth; fpgas, on the other hand, provide flexibility in terms of customizable I/O and computational resources. In this paper, we explore the usage of GPUs and fpgas as cryptographic co-processors in streaming dataflow systems with huge rate of data inhalation. Two classic lightweight encryption algorithms, Tiny Encryption Algorithm (TEA) and Extended Tiny Encryption Algorithm (XTEA), are targeted for implementation on GPUs and fpgas. The GPU implementations of TEA and XTEA in this study depict a maximum speedup of 13x over CPU based implementation. The pipelined fpga implementation is able to realize a throughput of 6-9x more than the GPU for small plaintext sizes.
暂无评论