ISBN: 9781424403127 (print)
Emerging embedded System-on-Chip (SoC) platforms are increasingly becoming multiprocessor architectures. Advances in FPGA technology make implementing such architectures on a single chip feasible and very appealing. Although FPGA technology is well developed by companies such as Xilinx and Altera, the concepts and tool support needed to build multiprocessor systems on a single FPGA are still not mature enough. As a consequence, system designers experience significant difficulties in (1) designing multiprocessor systems on FPGAs in a short amount of time and (2) programming such systems to satisfy the performance needs of the applications executed on them. In this paper we present our concept for multiprocessor system design, programming, and implementation, which addresses both problems. We have implemented the concept in a tool called ESPAM, which is briefly introduced as well. We also present results obtained by applying our concept and the ESPAM tool to automatically generate multiprocessor systems that execute a real-life application, namely a Motion-JPEG encoder.
ISBN: 9781424403127 (print)
This paper introduces a secure FPGA implementation of a coprocessor for public key cryptography. It supports Elliptic Curve Cryptography (ECC) as well as the older RSA standard. With adequate key lengths, RSA and ECC are assumed to be secure from an algorithmic point of view. An implementation of these algorithms, however, should also guarantee side-channel security. This requirement causes not only an inevitable performance degradation but also an area increase. We overcome these drawbacks by fitting the public key architecture and algorithms into a coprocessor that optimally exploits the dedicated features of a Spartan XC3S4000. Although this is a very low-cost FPGA, the performance results of our implementation meet the requirements of a broad range of high-end applications.
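To illustrate why side-channel protection costs performance, the sketch below shows a Montgomery ladder for modular exponentiation, a commonly used countermeasure that performs one multiplication and one squaring for every exponent bit regardless of its value. This is an assumption chosen for illustration: the abstract does not say which countermeasures the coprocessor uses, and the function name is hypothetical. Python integers are not constant-time, so the code shows only the operation pattern, not a secure implementation.

```python
def montgomery_ladder_pow(base, exponent, modulus):
    """Compute base**exponent mod modulus with a Montgomery ladder.

    Illustrative only: every exponent bit triggers exactly one multiply
    and one square, so the operation sequence does not depend on the key
    bits. Python big integers are NOT constant-time.
    """
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    r0, r1 = 1 % modulus, base % modulus  # invariant: r1 == r0 * base (mod modulus)
    for bit in bin(exponent)[2:]:         # most significant bit first
        if bit == "1":
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
        else:
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
    return r0
```

A plain square-and-multiply loop skips the multiplication for zero bits, which is faster but leaks the exponent through timing or power; the ladder gives up that saving, which is one source of the performance degradation the abstract mentions.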
ISBN: 9781424403127 (print)
This paper describes an implementation of a parallel and pipelined watershed algorithm on an FPGA. In the algorithm, the pixels of a given image are repeatedly scanned from top-left to bottom-right, and then from bottom-right to top-left. Because of these simple memory accesses, N pixels of the image can be processed in parallel by reading N lines at the same time. However, N is limited by the number of external memory banks that store the image data. To achieve high performance on an FPGA with a limited number of external memory banks, our implementation (1) divides the image into K regions, (2) caches several of them on the FPGA, (3) applies the watershed algorithm to those regions, and (4) loads the next (or previous) region onto the FPGA during the computation to hide the loading time. In our current implementation on an XC2V6000, up to 32 pixels can be processed in parallel. The processing time for 512 x 512 pixel images is about 3 to 4 ms, which is fast enough for real-time applications.
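The alternating scan pattern described above can be sketched in software as follows. This is a minimal sequential sketch under stated assumptions, not the paper's parallel pipelined design: it uses a simple "downhill" watershed in which each pixel adopts the label of its lowest strictly lower 4-neighbour, it treats every pixel without a lower neighbour (including each plateau pixel) as its own minimum, and the function name is hypothetical.

```python
def watershed_by_scanning(image):
    """Label catchment basins using repeated forward and backward raster scans.

    image: non-empty list of equal-length rows of comparable values.
    Returns a same-shaped grid of positive integer labels.
    """
    rows, cols = len(image), len(image[0])

    def downhill(r, c):
        # Lowest strictly lower 4-neighbour of (r, c), or None at a minimum.
        best, best_val = None, image[r][c]
        for dr, dc in ((-1, 0), (0, -1), (0, 1), (1, 0)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] < best_val:
                best, best_val = (nr, nc), image[nr][nc]
        return best

    labels = [[0] * cols for _ in range(rows)]  # 0 means "not yet labelled"
    target = {}
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            t = downhill(r, c)
            if t is None:
                labels[r][c] = next_label       # each minimum seeds a basin
                next_label += 1
            else:
                target[(r, c)] = t

    forward = [(r, c) for r in range(rows) for c in range(cols)]
    changed = True
    while changed:
        changed = False
        # Top-left to bottom-right, then bottom-right to top-left.
        for order in (forward, reversed(forward)):
            for r, c in order:
                if labels[r][c] == 0:
                    tr, tc = target[(r, c)]
                    if labels[tr][tc] != 0:
                        labels[r][c] = labels[tr][tc]
                        changed = True
    return labels
```

Because every scan touches pixels in strict raster order, the access pattern is regular, which is what makes reading several lines at once attractive in hardware.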
ISBN: 9781424403127 (print)
Due to their increasing resource densities, field-programmable gate arrays (FPGAs) have become capable of efficiently implementing large-scale scientific applications involving floating-point computations. In this paper FPGAs are compared to a high-end microprocessor with respect to sustained performance on a popular floating-point CPU benchmark, namely LINPACK 1000. A set of translation and optimization steps has been applied to transform a sequential C description of the LINPACK benchmark, based on a monolithic memory model, into a parallel Handel-C description that utilizes the multiple memory resources available on a realistic reconfigurable computing platform. The experimental results show that the latest generation of FPGAs, programmed using Handel-C, can achieve a sustained floating-point performance up to 6 times greater than the microprocessor while operating at a clock frequency that is 60 times lower. The transformations are applied in a way that could be generalized, allowing efficient compilation approaches for mapping high-level descriptions onto FPGAs.
ISBN: 9781424403127 (print)
FPGAs have reached densities that can implement floating-point applications, but floating-point operations still require a large amount of FPGA resources. One major component of IEEE-compliant floating-point computations is the variable-length shifter. Such shifters account for over 30% of a double-precision floating-point adder and 25% of a double-precision multiplier. This paper introduces two alternatives for implementing these shifters. One alternative is a coarse-grained approach: embedding variable-length shifters in the FPGA fabric. These units provide significant area savings with a modest clock rate improvement over existing architectures. The other alternative is a fine-grained approach: adding a 4:1 multiplexer inside the slices, in parallel with the LUTs. While providing more modest area savings, these multiplexers provide a significant boost in clock rate with a small impact on the FPGA fabric.
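As a rough software model of why 4:1 multiplexers suit variable-length shifting, the sketch below builds a logical right shifter from radix-4 stages, each consuming two bits of the shift amount. It is an illustration only: the function names, the 64-bit default width, and the word-level modelling (hardware would use one multiplexer per bit per stage) are assumptions, and it does not describe the paper's slice design.

```python
def mux4(select, inputs):
    """4:1 multiplexer: pick one of four inputs with a 2-bit select."""
    return inputs[select & 0b11]


def shift_right_mux4(value, amount, width=64):
    """Logical right shift of a width-bit word using radix-4 mux stages.

    Stage s shifts by 0, 1, 2 or 3 times 4**s, selected by bits
    (2s+1, 2s) of the shift amount.
    """
    if not 0 <= amount < width:
        raise ValueError("shift amount out of range")
    value &= (1 << width) - 1
    step = 1
    while step < width:
        candidates = [value >> (k * step) for k in range(4)]
        value = mux4(amount, candidates)  # low two bits choose the candidate
        amount >>= 2                      # next stage uses the next two bits
        step *= 4
    return value
```

With a 64-bit word this takes three mux levels (shifts by multiples of 1, 4, and 16), whereas a shifter built from 2:1 multiplexers would need six levels.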
ISBN: 9781424403127 (print)
Interconnect delays are becoming an increasingly significant part of the critical path delay for circuits implemented in FPGAs. Pipelined interconnects have been proposed to address this problem: long-distance routes are pipelined using registers available in the configurable interconnect architecture. Unfortunately, pipelined interconnects are much harder to route than simple interconnects. QuickRoute is a fast, heuristic router for pipelined interconnects based on PathFinder. While its performance scales well with circuit size, it requires O(N^2) space and in practice can only be used for circuits with up to about 10,000 nodes. This paper describes an efficient solution to this space problem based on arithmetic coding, a technique widely used in data compression. We show that this reduces the space complexity to O(N log N) while only slightly affecting performance. This result will allow pipelined routing to be used even for very large FPGA architectures. Experiments show that memory usage is reduced by 90% even for our relatively small coarse-grained benchmark circuits.
ISBN: 9781424403127 (print)
Stochastic simulation of biochemical systems has become one of the major approaches to studying life processes as systems, yet running such simulations is a computational challenge because of their vast calculation cost. This paper presents the implementation and evaluation of a stochastic simulation algorithm (SSA) called the "First Reaction Method" on an FPGA-based biochemical simulator. It achieves high throughput by (1) continuously feeding data into deeply pipelined floating-point arithmetic units and (2) distributing multiple simulators for parallel execution. In an evaluation on an FPGA-based simulation platform called ReCSiP2, the simulator outperforms execution on a 2.80 GHz Xeon by a factor of approximately 80, even for large-scale biochemical systems.
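For readers unfamiliar with the First Reaction Method, the sketch below shows its core loop in software: draw a tentative firing time for every reaction from an exponential distribution with that reaction's propensity as the rate, fire the reaction with the smallest time, and repeat. It is a minimal sequential sketch, not the paper's pipelined hardware; the function names and the representation of a reaction as a (propensity function, state-change dictionary) pair are assumptions made for illustration.

```python
import math
import random


def first_reaction_step(state, reactions, rng):
    """Return (tau, index) of the next reaction, or None if none can fire."""
    best_tau, best_idx = math.inf, None
    for idx, (propensity, _change) in enumerate(reactions):
        a = propensity(state)
        if a <= 0.0:
            continue  # this reaction cannot fire in the current state
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        tau = -math.log(1.0 - rng.random()) / a
        if tau < best_tau:
            best_tau, best_idx = tau, idx
    return None if best_idx is None else (best_tau, best_idx)


def simulate(state, reactions, t_end, seed=0):
    """Run the First Reaction Method until t_end or until nothing can fire."""
    rng = random.Random(seed)
    state = dict(state)  # work on a copy
    t = 0.0
    while True:
        step = first_reaction_step(state, reactions, rng)
        if step is None:
            break
        tau, idx = step
        if t + tau > t_end:
            break
        t += tau
        for species, delta in reactions[idx][1].items():
            state[species] += delta
    return t, state
```

For example, a single conversion A -> B with rate constant 0.5 could be written as `[(lambda s: 0.5 * s["A"], {"A": -1, "B": 1})]`. The per-reaction time draws are independent of each other, which is what makes the method a natural fit for deep pipelining.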
ISBN: 9781424403127 (print)
In this paper we present a novel design for an efficient FPGA architecture of the Fast Walsh Transform (FWT) for hardware implementation of pattern analysis techniques such as projection kernel calculation and feature extraction. The proposed architecture is based on Distributed Arithmetic (DA) principles using the ROM ACcumulate (RAC) technique and sparse matrix factorisation. The implementation has been carried out using a hybrid design approach based on Celoxica Handel-C, which is used as a wrapper for highly optimised VHDL cores. The algorithm has been implemented and verified on the Xilinx Virtex-2000E FPGA. An evaluation based on maximum system frequency and chip area for different system parameters is also reported, and the design is shown to outperform existing work in all key performance measures. Additionally, a novel Functional Level Power Analysis and Modelling (FLPAM) methodology is proposed to enable high-level estimation of power consumption.
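As background for the transform itself, the sketch below gives the standard software butterfly form of the fast Walsh-Hadamard transform, which uses only additions and subtractions and corresponds to factorising the transform matrix into sparse stages. It is a reference sketch only: it assumes natural (Hadamard) ordering and no normalisation, the function name is hypothetical, and it does not model the paper's DA/RAC hardware architecture.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform, natural order, unnormalised.

    The input length must be a power of two. Returns a new list.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:                      # log2(n) butterfly stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out
```

Applying `fwht` twice returns the input scaled by its length, so the same routine serves as the inverse transform after dividing by `len(values)`.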
Modular systems implemented on field-programmable gate arrays (FPGAs) can benefit from being able to load and unload modules at run-time, a concept that is of much interest in the research community. Although dynamic partial reconfiguration is possible in Virtex and Spartan series FPGAs, the configuration architecture of these devices is not amenable to modular reconfiguration, a limitation which has relegated research to theoretical or compromised resource allocation models. Two methods for implementing modular reconfiguration in Virtex FPGAs are compared and contrasted. The first method offers simplicity and fast reconfiguration times, but limits the geometry and connectivity of the system. The second method, developed recently, enables modules to be allocated arbitrary areas of the FPGA, bridging the gap between theory and reality and unlocking the latent potential of dynamic reconfiguration. The cost of this advancement is increased reconfiguration time. The second method has been demonstrated in three applications, including the first reported implementation of modular reconfiguration in a Virtex-4 device.
ISBN: 9781424403127 (print)
Globally Asynchronous Locally Synchronous (GALS) is a paradigm for complexity management and re-use in large System-on-Chip (SoC) architectures. GALS is most often based on specific ASIC design components or special FPGA platforms with custom development tools. In this paper we present a multiprocessor GALS implementation on a standard commercial FPGA with standard development tools. The key building block is a novel, reliable RTL mixed-clock FIFO. A complete MPEG-4 video encoder with four processors is implemented as a proof of concept. The area overhead compared to a fully synchronous design is shown to be only 2% and the performance overhead 3%. This is negligible compared to the benefits: much better flexibility, independence from ASIC or FPGA vendors, and reduced design time. Furthermore, the mixed-clock interfaces allow easy re-use, since the RTL blocks do not need to be re-verified in design iterations.