ISBN: (print) 9781424410590
In this paper we present an automatic design flow for generating customized embedded FPGA (eFPGA) fabric and a domain-specific SoC+eFPGA architecture. This design flow encompasses both the eFPGA user and automatic layout generator perspectives. We discuss generic FPGA modeling based on the VPR tool, simulation and high-level models of reconfigurable components, and we present an innovative floorplanning approach for island-style FPGAs using rectilinear macros. Several system integration issues are highlighted. The layout of a real-life SoC with an embedded RTR FPGA for cryptographic applications, designed with this flow, is also presented.
ISBN: (print) 9781424410590
The current generations of FPGAs comprise many specialized hardware cores, such as embedded processors, multipliers, RAMs and FIFOs, along with the regular arrays of reconfigurable logic. On any FPGA device, these embedded cores are located at fixed locations only. This makes floorplanning for applications with heterogeneous components very difficult. Recently, some researchers have started looking into this problem of heterogeneous floorplanning on FPGAs. However, all of these works suffer from a fundamental flaw that affects the quality of solutions, leading to larger device areas or excessively high runtimes. In [1], we proposed a heterogeneous floorplanner for FPGAs, HPlan, which is highly efficient in finding floorplans for a variety of resources. In this paper, we extend the floorplanner to include an adaptive placer algorithm. We also perform experiments on the MCNC benchmarks for floorplans with random heterogeneous resource allocations. We observe that as the statistical variation in the heterogeneous resource allocations increases, the traditional floorplanner yields an increasing area for all the benchmarks, whereas the HPlan floorplanner does not. The proposed floorplanner thus provides an efficient way to handle floorplans with large variations in the heterogeneous resources.
ISBN: (print) 9781424410590
This paper describes a correlator that is optimized for the Xilinx Virtex-4 SX FPGA, and its application in the SKAMP radio telescope at the Molonglo Radio Observatory. The digital backend of the SKAMP telescope consists of more than 800 Virtex-4 FPGAs. Correlation is performed between each and every pairing of antenna inputs, so the SKAMP telescope, with its 384 inputs, has approximately 74,000 antenna correlations; with 100 MHz of input bandwidth from each antenna, this requires real-time processing of more than 7 tera complex multiply-accumulates per second. The correlation cell described takes advantage of the hard IP blocks found within the Virtex-4 FPGA to perform one 4+4-bit complex correlation per cycle at a clock rate exceeding 256 MHz. At the core of each cell is an efficient 4-bit signed complex multiplier, implemented using the 18-bit signed multiplier of the Virtex-4 DSP slice, and a short-term accumulator, implemented using the adjacent Block RAM. Nearly 30,000 correlation cells are instantiated across 192 Virtex-4 SX35 devices in order to process all the data from the SKAMP telescope.
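The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check, not the authors' sizing method: it assumes one complex sample per hertz of bandwidth per input, and that each correlation cell delivers exactly one complex multiply-accumulate per cycle at 256 MHz (the abstract says the clock rate exceeds 256 MHz). The function and constant names are illustrative.

```python
def pair_count(n_inputs: int) -> int:
    """Number of distinct pairings among n_inputs antenna inputs."""
    return n_inputs * (n_inputs - 1) // 2


N_INPUTS = 384                   # from the abstract
SAMPLE_RATE_HZ = 100_000_000     # assumption: 100 MHz bandwidth -> 1e8 complex samples/s
CELL_CLOCK_HZ = 256_000_000      # assumption: one CMAC per cell per cycle at 256 MHz

pairs = pair_count(N_INPUTS)                 # cross-correlations only
cmacs_per_second = pairs * SAMPLE_RATE_HZ    # complex multiply-accumulates per second

# Ceiling division: smallest number of cells that sustains the required rate.
min_cells = -(-cmacs_per_second // CELL_CLOCK_HZ)

print(f"pairings: {pairs}")
print(f"CMAC/s: {cmacs_per_second:.3e}")
print(f"cells needed at 256 MHz: {min_cells}")
```

Under these assumptions the pairing count, the aggregate multiply-accumulate rate and the cell count all land close to the rounded values in the abstract (approximately 74,000 correlations, more than 7 tera operations per second, nearly 30,000 cells).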
Configurable architectures offer the unique opportunity of customising the storage allocation to meet specific applications' needs. A compiler approach to map the arrays of a loop-based computation to internal memories of a configurable architecture, with the objective of minimising the overall execution time, is described. An algorithm that considers the data access patterns of the arrays along the critical path of the computation, as well as the available storage and memory bandwidth, is presented. Experimental results are presented which demonstrate the application of this approach to a set of kernel codes when targeting a field-programmable gate array. The results reveal that the proposed algorithm outperforms the naive and custom data layout techniques by an average of 33% and 15%, respectively, in terms of execution time, while taking into account the available hardware resources.
ISBN: (print) 9781424410590
In this paper, we investigate three different realizations of the same block from different points of view. Two of the realizations use embedded processors (a custom 16-bit RISC processor and a general soft-core processor), and the third uses Handel-C as an example of a synthesisable high-level abstraction language. The results show that the development time of the complete solution (HW and SW) is approximately the same for the Handel-C design and the design with the soft-core processor; the development time of the custom 16-bit RISC processor is about five times higher. Moreover, the throughput of the Handel-C design, measured as the number of bits processed per second, is the highest. The obtained frequency and occupied area of the Handel-C design depend on the complexity of the program used. However, the results are comparable to or even better than those of the embedded processors.
ISBN: (print) 9781424410590
Process variations in deep sub-micron technologies have created significant timing uncertainty. This generates the need for a new variability-aware physical synthesis tool for field-programmable gate arrays (FPGAs). Ideally, variability-aware tools should be able to perform both timing variability estimation during synthesis and timing variability analysis after synthesis. Statistical static timing analysis (SSTA) methods have been developed to perform timing variability analysis, but they are computationally expensive and not fast enough. We propose a fast and accurate interval-based method for timing variability estimation. This method uses correlation-aware affine intervals instead of probability density distributions to model timing uncertainties. Our model estimates the mean of the timing variation within an accuracy of 99.9% and with an average range looseness of -7.5% relative to the Monte Carlo (MC) model. Speed-ups of about 80x and 4900x are achieved compared with the Correlation Aware Canonical Timing (CACT) model and the MC model, respectively.
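To illustrate why correlation-aware affine intervals can give tighter ranges than plain intervals, here is a minimal sketch of generic affine arithmetic. It is not the model from the paper: the `Affine` class, the noise-symbol names and the delay values are all hypothetical. Each quantity is a center value plus coefficients on named noise symbols in [-1, 1]; quantities that share a symbol are correlated, so the shared part cancels when they are subtracted.

```python
from dataclasses import dataclass


@dataclass
class Affine:
    """Affine form: center + sum(coeff * eps), with each eps in [-1, 1]."""
    center: float
    terms: dict  # noise-symbol name -> coefficient

    def __add__(self, other):
        terms = dict(self.terms)
        for name, coeff in other.terms.items():
            terms[name] = terms.get(name, 0.0) + coeff
        return Affine(self.center + other.center, terms)

    def __sub__(self, other):
        terms = dict(self.terms)
        for name, coeff in other.terms.items():
            terms[name] = terms.get(name, 0.0) - coeff
        return Affine(self.center - other.center, terms)

    def bounds(self):
        radius = sum(abs(c) for c in self.terms.values())
        return (self.center - radius, self.center + radius)


# Hypothetical path delays: both share a die-level variation symbol "die"
# and each has its own independent local variation.
path_a = Affine(10.0, {"die": 1.0, "local_a": 0.5})
path_b = Affine(10.0, {"die": 1.0, "local_b": 0.5})

# Correlation-aware skew: the shared "die" term cancels.
skew = path_a - path_b

# Plain interval subtraction ignores the shared term.
a_lo, a_hi = path_a.bounds()
b_lo, b_hi = path_b.bounds()
naive_skew = (a_lo - b_hi, a_hi - b_lo)

print("affine skew bounds:", skew.bounds())
print("plain interval skew bounds:", naive_skew)
```

In this toy case the affine form bounds the skew three times more tightly than plain interval subtraction, because only the independent local terms remain.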
ISBN: (print) 9781424410590
With the increasing capacity of FPGAs following Moore's law, it is possible to build in a single FPGA a large system on chip (SoC) composed of several cores. Their performance depends strongly on their interconnection structure. Traditional and hierarchical buses are not suitable. Networks on Chip (NoCs), owing to characteristics such as scalability, flexibility and high bandwidth, have been proposed as a valid approach to meet communication requirements in SoCs. Most current NoCs use a mesh topology. With a mesh topology, the central channels are heavily loaded, which often leads to congestion in the center area of the mesh. The solution in such a situation is to add routers to the mesh or to use a torus topology which, thanks to the symmetry introduced by linking the routers on opposite edges, copes well with congestion at the cost of a small increase in resources. In this paper, we propose a scalable implementation of a NoC for FPGAs using a torus topology. We propose a router architecture, a routing algorithm and a solution to the problem introduced by the long wires in the torus topology.
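The effect of the wrap-around links can be seen in a simple hop-count comparison. The sketch below is not the routing algorithm proposed in the paper; it assumes minimal (shortest-path) routing on a 2-D grid, and the function names and the 4x4 grid size are chosen only for illustration. It also says nothing about the long-wire problem mentioned above.

```python
from itertools import product


def mesh_hops(src, dst):
    """Minimal hop count between two routers in a 2-D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Minimal hop count in a 2-D torus: each dimension may wrap around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


def average_hops(width, height, hops):
    """Average hop count over all ordered pairs of distinct routers."""
    nodes = list(product(range(width), range(height)))
    total = sum(hops(s, d) for s in nodes for d in nodes if s != d)
    return total / (len(nodes) * (len(nodes) - 1))


W = H = 4  # illustrative grid size
mesh_avg = average_hops(W, H, mesh_hops)
torus_avg = average_hops(W, H, lambda s, d: torus_hops(s, d, W, H))

print(f"corner to corner, mesh:  {mesh_hops((0, 0), (3, 3))} hops")
print(f"corner to corner, torus: {torus_hops((0, 0), (3, 3), W, H)} hops")
print(f"average hops, mesh:  {mesh_avg:.3f}")
print(f"average hops, torus: {torus_avg:.3f}")
```

In a torus every router sees the same distance profile, so traffic is no longer funnelled through the central channels in the way it is in a mesh.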
ISBN: (print) 9781424410590
This paper presents a high-speed implementation of a 2-D fixed-point Discrete Wavelet Transform (DWT) using the embedded DSP48 blocks available on a Xilinx Virtex-4 XC4VLX15-10 FPGA. The full transform uses just 10 DSP48 blocks, 3 block RAMs and 2,126 logic elements when synthesized using Xilinx ISE Version 8.2i, and can perform calculations at 197.2 MHz. The results clearly show that by using the DSP48 blocks, it is possible to build computationally efficient DWT algorithms that can operate at higher speeds and with lower overall logic resources than other FPGA solutions that have been reported previously.
ISBN: (print) 9781424410590
A significant challenge in designing algorithms for FPGA-based reconfigurable computers is the exposed, non-cached memory subsystem. In the absence of dedicated hardware to manage a cached memory hierarchy, the algorithm designer must explicitly allocate data within a collection of memory banks, and schedule access to the memories in the algorithm's datapaths. The physical location in memory affects the datapath schedule, yet data dependencies in the algorithm can suggest allocation strategies to increase instruction-level parallelism. In this work, we present three algorithms that automatically allocate arrays to memory banks and schedule datapaths that use those memories. Our algorithms allow the user to trade off optimal results against longer iterative analysis.
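The allocation half of this problem can be sketched with a simple greedy heuristic. This is not one of the three algorithms from the paper, whose details the abstract does not give. It assumes that two arrays "conflict" when the datapath would like to access them in the same cycle, and it places each array in the bank that holds the fewest conflicting arrays while respecting bank capacity. All names, sizes and conflicts below are hypothetical.

```python
from __future__ import annotations


def allocate(arrays: dict[str, int],
             conflicts: set[frozenset],
             n_banks: int,
             bank_capacity: int) -> dict[str, int]:
    """Greedily map arrays (name -> size in words) to memory banks."""
    free = [bank_capacity] * n_banks
    placement: dict[str, int] = {}

    # Place the largest arrays first; break ties by name for determinism.
    for name in sorted(arrays, key=lambda n: (-arrays[n], n)):
        size = arrays[name]

        def penalty(bank: int) -> int:
            # Number of already placed arrays in this bank that conflict.
            return sum(1 for other, b in placement.items()
                       if b == bank and frozenset((name, other)) in conflicts)

        candidates = [b for b in range(n_banks) if free[b] >= size]
        if not candidates:
            raise ValueError(f"array {name!r} does not fit in any bank")

        # Fewest conflicts first, then the emptiest bank, then lowest index.
        best = min(candidates, key=lambda b: (penalty(b), -free[b], b))
        placement[name] = best
        free[best] -= size

    return placement


# Hypothetical kernel: A and B are read together, as are B and C.
arrays = {"A": 512, "B": 512, "C": 256}
conflicts = {frozenset(("A", "B")), frozenset(("B", "C"))}
placement = allocate(arrays, conflicts, n_banks=2, bank_capacity=1024)
print(placement)
```

A greedy pass like this is fast but can miss the best placement, which is the kind of trade-off between result quality and analysis time that the abstract describes.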
ISBN: (print) 9781424410590
A biological organism's ability to sense and adapt to its environment is essential to its survival. Likewise, environmentally aware computing systems avail themselves of a longer operational life and a wider range of applications than traditional systems. In this paper, we propose a novel circuit design methodology that allows parameterizable hardware to self-regulate its temperature. We apply this methodology to an image recognition system on a Xilinx Virtex-4 FX100 field-programmable gate array (FPGA). The image recognition system sustains a safe operational temperature by automatically adjusting its frequency and output quality. The circuit sacrifices output performance and quality to lower its internal temperature as the ambient temperature increases, and can leverage cooler temperatures by increasing output performance and quality. Furthermore, the circuit will shut down if the ambient temperature becomes too hot for the device to function properly. A performance evaluation of our adaptive circuit under various thermal conditions shows up to a 4x increase in performance and a 2x increase in quality over a system without dynamic thermal control.
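The self-regulating behaviour described here can be pictured as a small control loop over discrete operating points. The sketch below is an assumption-laden illustration, not the authors' methodology: the operating points, temperature thresholds and hysteresis band are invented, and a real design would read an on-chip sensor and reconfigure clocks in hardware.

```python
# Hypothetical operating points: (clock frequency in MHz, output quality).
LEVELS = [(50, "low"), (100, "medium"), (200, "high")]


def next_level(level, temp_c, target_c=70.0, hysteresis_c=5.0, shutdown_c=95.0):
    """Return the next operating-point index, or None to shut down.

    All thresholds are illustrative assumptions.
    """
    if temp_c >= shutdown_c:
        return None                      # too hot to operate safely
    if temp_c > target_c and level > 0:
        return level - 1                 # trade performance/quality for heat
    if temp_c < target_c - hysteresis_c and level < len(LEVELS) - 1:
        return level + 1                 # headroom available: speed up
    return level                         # inside the hysteresis band: hold


# Walk through a hypothetical temperature trace.
level = 2
for temp in [60.0, 72.0, 78.0, 68.0, 62.0, 96.0]:
    level = next_level(level, temp)
    if level is None:
        print(f"{temp:5.1f} C -> shutdown")
        break
    freq, quality = LEVELS[level]
    print(f"{temp:5.1f} C -> {freq} MHz, {quality} quality")
```

The hysteresis band keeps the system from oscillating between two operating points when the temperature hovers near the target.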