ISBN: (Print) 9781595939340
Large-scale, direct-mapped FPGA computing systems are traditionally very difficult to debug due to the high level of parallelism and limited access to internal signal values. This poster describes our solution to this problem, in which the concepts of variables and process control are brought into the FPGA hardware domain. Declarations made in the design environment are translated into logic inserted automatically into the hardware implementation. Variables provide full read/write access to hardware signals during runtime, complete with dynamic assertion checking capable of automatically halting the system clock. System data is consistently cached via attached DRAM, providing a very deep history of variable sample values and the ability to "rewind" system state. Process execution can also be controlled by the user on a cycle-by-cycle basis, either manually or through the declaration of breakpoints. Assertion failures and breakpoints are accurate within the same cycle of detection, and are implemented using on-chip gated clock buffers. All debugging controls are provided by a remote graphical user interface, which also supports back-annotation in the input design for improved data visibility and comprehension. The complete hardware and software infrastructure of the debugger has already been fully implemented, with user trials and overhead measurements ongoing at the time of writing.
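The abstract's key mechanism is that assertion failures and breakpoints stop the design in the very cycle they are detected, via gated clock buffers. A minimal behavioural sketch of that control loop, written in Python rather than HDL and with every name (DebugController, watch, on_cycle, ...) invented here purely for illustration, might look like this:

    # Hypothetical software model of the debugger's clock-gating decision.
    # Each watched "variable" is a callable returning the current signal value;
    # each assertion is a predicate over those values.

    class DebugController:
        def __init__(self):
            self.watches = {}        # name -> signal getter
            self.assertions = []     # (name, predicate over one cycle's samples)
            self.breakpoints = set() # cycle numbers at which to halt
            self.history = []        # per-cycle samples (models the DRAM trace)
            self.clock_enable = True

        def watch(self, name, getter):
            self.watches[name] = getter

        def add_assertion(self, name, predicate):
            self.assertions.append((name, predicate))

        def on_cycle(self, cycle):
            """Called once per cycle; returns False to gate the clock."""
            sample = {n: g() for n, g in self.watches.items()}
            self.history.append(sample)            # deep trace enables "rewind"
            if cycle in self.breakpoints:
                self.clock_enable = False           # halt in the same cycle
            for name, pred in self.assertions:
                if not pred(sample):
                    print(f"assertion '{name}' failed at cycle {cycle}")
                    self.clock_enable = False
            return self.clock_enable

In hardware the same decision would drive a gated clock buffer (e.g. a BUFGCE primitive) rather than a boolean flag, but the control flow (sample, log, compare, gate) is the one the abstract describes.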
ISBN: (Print) 9781595939340
The problem of hardware-software codesign for embedded systems using configurable architectures has been studied extensively in the past decade. In this work we studied the feasibility of utilizing Commercial Off-The-Shelf (COTS) FPGA systems for codesign. We partitioned the implementation of a set of benchmark applications between hardware and software and studied the performance and resource consumption in the system. The results of the experiments demonstrated that the communication between the processor and the reconfigurable architecture is the major hurdle in codesign, especially when using COTS systems-on-chip. It is demonstrated that although implementing algorithms in hardware can lead to enormous speedup, the communication overhead for transferring data variables between the configurable architecture and the processor can destroy all of the achieved speedup. We especially showed that in COTS FPGAs this bottleneck is more restrictive because of the weak communication structure between different IPs. Furthermore, analyzing the experimental results, we propose a partitioning mechanism; the evaluation results show that the achieved speedup using the proposed partitioning mechanism is between 2 and 300, depending on the application's data dependency.
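The claim that communication overhead can cancel out hardware acceleration is easy to see with an Amdahl-style calculation. The sketch below uses made-up numbers (they are not measurements from this paper) purely to illustrate the effect:

    # Illustrative only: effective speedup of a hardware-accelerated kernel
    # once the processor<->fabric transfer time is charged against it.

    def effective_speedup(t_sw, t_hw, bytes_moved, bw_bytes_per_s):
        """t_sw: software runtime of the kernel (s)
           t_hw: hardware runtime of the kernel (s)
           bytes_moved: data shipped to/from the accelerator
           bw_bytes_per_s: sustained processor<->IP bandwidth"""
        t_comm = bytes_moved / bw_bytes_per_s if bytes_moved else 0.0
        return t_sw / (t_hw + t_comm)

    # A kernel that is 50x faster in fabric when transfers are free...
    print(effective_speedup(t_sw=1.0, t_hw=0.02, bytes_moved=0, bw_bytes_per_s=1))            # 50.0
    # ...but shipping 100 MB over a 100 MB/s on-chip link erases the gain.
    print(effective_speedup(t_sw=1.0, t_hw=0.02, bytes_moved=100e6, bw_bytes_per_s=100e6))    # ~0.98

The partitioning question then becomes one of keeping t_comm small relative to the saved compute time, which is exactly where the paper reports COTS interconnect to be the limiting factor.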
ISBN: (Print) 9781595939340
A decoder is a hardware module that expands an x-bit input into an n-bit output, where x << n. It can be viewed as producing a set S of subsets of an n-element set Zn. If this set S can be altered by the user, the decoder is said to be configurable. We propose a class of configurable decoders (called "mapping-unit" based decoders or simply MU-decoders) that facilitate efficient selection of elements in an FPGA (in general, in any chip). Conventional solutions for this selection use either (a) a fixed (non-reconfigurable) decoder that lacks the flexibility to generate many subsets quickly, or (b) a large look-up table (LUT), which is flexible but too expensive. The proposed class of MU-decoders has much of the flexibility of the large-LUT solution (also called a LUT decoder here) at the cost of the fixed-decoder solution. Specifically, we show that for any fixed gate cost, an MU-decoder can produce any set of subsets that the LUT decoder can; in addition, the MU-decoder can exploit any available structure in the application at hand to produce many more subsets than the LUT decoder. We illustrate this ability in the context of totally ordered sets of subsets.
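The "set of subsets" view of a decoder is concrete enough to sketch. Below, a configurable LUT decoder is simply a table of n-bit masks indexed by the x-bit input, and the totally ordered case (each subset containing the previous one) appears as nested prefix masks. The internals of the MU-decoder are not spelled out in the abstract, so nothing here should be read as its actual construction:

    # A decoder maps an x-bit input to one n-bit output word; the set of all
    # outputs it can produce is a set S of subsets of Zn = {0, ..., n-1}.

    def lut_decoder(table):
        """Fully flexible but costly: one stored n-bit mask per input code."""
        return lambda x: table[x]

    def subset_to_mask(subset, n):
        mask = 0
        for i in subset:
            mask |= 1 << i
        return mask

    n = 16
    # A totally ordered family of subsets: S0 < S1 < S2 < ... (prefixes of Zn).
    chain = [set(range(k)) for k in range(0, n + 1, 4)]   # sizes 0, 4, 8, 12, 16
    table = [subset_to_mask(s, n) for s in chain]
    dec = lut_decoder(table)
    print(format(dec(3), f'0{n}b'))   # 0000111111111111 -> subset {0..11}

Structure such as this total order is exactly what the LUT decoder cannot take advantage of, since it stores every mask explicitly; it is the kind of structure the MU-decoder is said to exploit at a fixed gate cost.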
ISBN: (Print) 9781595939340
With the increase in system complexity, designers are increasingly using IP blocks as a means for filling the designer productivity gap. This has given rise to system-level languages which connect IP blocks together. However, these languages have in general not been subject to formalisation. They are considered too trivial to justify the formalisation effort. Unfortunately, the lack of formality in these languages can give rise to errors that are not caught until late in the design cycle. We present a type system for static typing of such a system-level language. We argue that the proposed type system will eliminate an important class of errors currently permitted by existing system-level languages. A comparison is made against existing tools and we show that the type checker detects errors earlier in the design flow. This reduces synthesis iterations and decreases the time to market.
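The class of errors at stake is easy to picture: two IP blocks wired together with mismatched port widths or directions, which a purely structural connection language accepts and synthesis only rejects much later. A toy static check in Python (the paper's actual type system and target language are not reproduced here; Port and check_connection are invented names) conveys the idea:

    # Hypothetical port/connection model; a real system-level language would
    # carry richer types (clock domains, protocols, parameterised widths).

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Port:
        name: str
        width: int
        direction: str   # "in" or "out"

    def check_connection(src: Port, dst: Port):
        errors = []
        if src.direction != "out" or dst.direction != "in":
            errors.append(f"{src.name} -> {dst.name}: direction mismatch")
        if src.width != dst.width:
            errors.append(f"{src.name} -> {dst.name}: width {src.width} != {dst.width}")
        return errors

    fifo_out = Port("fifo.data_out", 32, "out")
    dsp_in   = Port("dsp.sample_in", 16, "in")
    print(check_connection(fifo_out, dsp_in))
    # ['fifo.data_out -> dsp.sample_in: width 32 != 16']

Catching this at elaboration time, rather than after a synthesis run, is the "earlier in the design flow" benefit the abstract claims.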
ISBN: (Print) 9781595939340
It has been shown that FPGAs could outperform high-end microprocessors on floating-point computations thanks to massive parallelism. However, most previous studies re-implement in the FPGA the operators present in a processor. This conservative approach is relatively straightforward, but it doesn't exploit the greater flexibility of the FPGA. We survey the many ways in which the FPGA implementation of a given floating-point computation can be not only faster, but also more accurate than its microprocessor counterpart. Techniques studied here include custom precision, mixing and matching fixed- and floating-point, specific accumulator design, dedicated architectures for coarser operators implemented as software in processors (such as elementary functions or Euclidean norms), operator specialization such as constant multiplication, and others. The FloPoCo project (http://***/LIP/Arenaire/Ware/FloPoCo/) aims at providing such non-standard operators. As a conclusion, current FPGA fabrics could be enhanced to improve floating-point performance. However, these enhancements should not take the form of hard FPU blocks as others have suggested. Instead, what is needed is smaller building blocks more generally useful to the implementation of floating-point operators, such as cascadable barrel shifters and leading-zero counters.
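One of the techniques listed, a dedicated accumulator wider than its summands, is easy to demonstrate in software. The sketch below is illustrative only (it is not FloPoCo's architecture): it sums many small single-precision values into (a) a single-precision running sum, where rounding error accumulates, and (b) a wide fixed-point register, which an FPGA designer is free to size to the problem:

    import numpy as np

    terms = [np.float32(0.1)] * 1_000_000

    # (a) Accumulate in the same precision as the summands (what a plain FPU does).
    acc_f32 = np.float32(0.0)
    for t in terms:
        acc_f32 = np.float32(acc_f32 + t)

    # (b) Accumulate in a wide fixed-point register, modelled here as an integer
    #     scaled by 2^-32; custom hardware can afford such a register.
    FRAC_BITS = 32
    acc_fx = 0
    for t in terms:
        acc_fx += round(float(t) * (1 << FRAC_BITS))
    result_fx = acc_fx / (1 << FRAC_BITS)

    print(acc_f32)     # visibly drifts away from the exact sum ~100000.00149
    print(result_fx)   # ~100000.00149, the exact sum of the float32 inputs

How far (a) drifts depends on the summation order and magnitudes; the point is only that on an FPGA the accumulator format is a design parameter rather than an inherited processor format.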
ISBN: (Print) 9781595939340
QR decomposition is used in many signal processing applications. We have implemented a systolic array QR decomposition on a Xilinx Virtex-5 FPGA using the Givens rotation algorithm. It uses a truly two-dimensional systolic array architecture, so latency scales well for large matrices. To accommodate the dynamic range of input data, floating-point arithmetic is chosen, using the Northeastern University Variable Precision Floating-Point (VFloat) library. We support any general floating-point format, including IEEE single precision. Our design uses straightforward floating-point divide and square root implementations, in contrast to prior work which uses special operations or formats such as CORDIC or the logarithmic number system (LNS). This makes our design more standard and portable to different systems, and thus easier to fit into a larger design. We support square, tall and short matrices. The input matrix size can be configured at compile time to virtually any size. Therefore, the design can be easily scaled to future larger FPGA devices, or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130 MHz for the IEEE single-precision floating-point format. A peak performance of 35 GFlops is achieved for a 12 by 12 matrix with this implementation.
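Givens-rotation QR lends itself to a systolic array because each rotation touches only two rows and can be computed and applied locally. A plain software reference of the same algorithm (standard textbook Givens QR in Python, not the paper's VFloat pipeline) looks like this:

    import numpy as np

    def givens_qr(A):
        """QR decomposition by Givens rotations: returns Q, R with A = Q @ R."""
        A = np.array(A, dtype=float)
        m, n = A.shape
        Q = np.eye(m)
        R = A.copy()
        for j in range(n):                    # zero out column j below the diagonal
            for i in range(m - 1, j, -1):
                a, b = R[i - 1, j], R[i, j]
                r = np.hypot(a, b)
                if r == 0.0:
                    continue
                c, s = a / r, b / r           # rotation annihilating R[i, j]
                G = np.array([[c, s], [-s, c]])
                R[i - 1:i + 1, :] = G @ R[i - 1:i + 1, :]
                Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T
        return Q, R

    A = np.random.rand(12, 12)
    Q, R = givens_qr(A)
    print(np.allclose(Q @ R, A), np.allclose(np.tril(R, -1), 0))   # True True

In the usual systolic formulation, boundary cells compute the (c, s) pairs and internal cells apply them to the rows streaming past, which is what lets latency scale gracefully as the matrix grows.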
ISBN: (Print) 9781595939340
This poster presents an in-depth analysis of the Xilinx bitstream file format. This theoretical analysis is backed by a simple and efficient implementation of a reverse-engineering tool for Xilinx bitstreams. The development process followed these lines. First, publicly available documentation from Xilinx was analyzed; then some custom assumptions about the bitstream format were made. This information allowed a suitable algorithm to be run on well-chosen bitstreams. The output of this automated analysis step is a database which relates raw bitstream data to low-level netlist elements. This database is subsequently used as input to an efficient bitstream compiler which can either generate a bitstream from a low-level (XDL) description of the netlist or, conversely, decompile any given bitstream to its low-level netlist elements. This work has been validated for the Spartan-3, Virtex-II, Virtex-4 and Virtex-5 FPGA lines from Xilinx. Decompiling a bitstream is very fast; it is two orders of magnitude faster than the reverse operation of compilation with Xilinx's bitgen. This work aims to raise awareness about security issues for users of FPGAs. It also makes custom compilation and low-level tinkering with bitstreams - à la JBits - possible.
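The abstract does not detail its analysis algorithm, but the generic way to build such a bit-to-netlist database is by differencing well-chosen bitstreams: generate designs that differ in exactly one netlist element and record which bit positions change. A minimal sketch of that idea, with all names hypothetical:

    # Generic bit-to-feature correlation: compare the bitstream of a design
    # containing one extra netlist feature against a baseline bitstream and
    # record the flipped bit positions. Treating the bitstream as a flat byte
    # string is a simplification for illustration.

    def diff_bits(base: bytes, variant: bytes):
        """Return the set of absolute bit positions that differ."""
        assert len(base) == len(variant)
        flipped = set()
        for i, (a, b) in enumerate(zip(base, variant)):
            x = a ^ b
            for bit in range(8):
                if x & (1 << bit):
                    flipped.add(8 * i + bit)
        return flipped

    def build_database(baseline: bytes, experiments):
        """experiments: iterable of (feature_name, bitstream) pairs."""
        return {feature: diff_bits(baseline, bs) for feature, bs in experiments}

A real tool must additionally handle frame addressing, headers and CRC words rather than a flat bit array, but with such a database in hand, decompilation reduces to pattern lookups over the target bitstream, which is consistent with the reported two-orders-of-magnitude speed advantage over the forward bitgen flow.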
ISBN: (Print) 9781595939340
Modern FPGAs can implement large, custom compute engines that are designed to exploit extreme amounts of parallel computation. Through parallelism, these systems achieve orders of magnitude higher performance than the fastest microprocessors. Building such custom compute engines with existing hardware design languages is too difficult and time-consuming. For this to become mainstream technology, the task of designing such parallel systems must be as simple as possible. Thus, high-level languages are needed which can specify a custom compute engine or be compiled to run on predesigned parallel systems. In this workshop, we will examine several approaches for specifying extremely parallel computations in high-level languages. These can be used to build parallel systems in FPGAs, or they can be used to specify parallel computations in other competing architectures. By examining several different approaches, one gains insight into the best approach for solving a given problem. Ideally, this will also inspire new approaches for designing with extreme parallelism.
ISBN: (Print) 9781595939340
This article presents a purely mathematical analysis based on generic models; the idea is to investigate the possibility of using tiling patterns other than the Manhattan grid in FPGAs. The goal of our research is to evolve FPGA architectures with advances in technology, and specifically to make better use of available interconnect layers. We propose a method to evaluate tiling patterns based on first principles (i.e., Rent's rule, Donath's result, and the equivalence of wire flux and wire length). We show that the use of tiling patterns formed with higher-order polygons can improve the speed and area performance of an FPGA. This gain is highly dependent on depopulation schemes and other parameters. However, for generic tiling patterns with crossbar switchboxes, there is a 22% gain in area for the hexagonal tiling pattern and a 30% gain in area for the octagonal tiling pattern. Moreover, the average interconnect length is around 15% shorter for hexagonal and 31% shorter for octagonal tiling compared to square tiling. We can expect a proportional increase in speed. We also present a comparative plot of total interconnect lengths for these tiling patterns and for hierarchical gate arrays. The physical realizability of these tiling patterns in CMOS remains to be investigated. We present a layout scheme for both hexagonal and octagonal FPGAs. To our knowledge, standard processes support 45° metal lines, whereas 60° lines can be etched using non-standard processes. We must keep in mind that in practice one must use some sort of depopulation and staggering scheme, and that these results provide only an idea of the gains that can be achieved. The actual interconnect structure is of course dependent on several factors (e.g., available interconnect layers, difficulty of fabrication, required speed/area, evolution of CMOS technology, etc.). Our future research direction will be to choose an efficient interconnect strategy based on this and previous research, as well as experimental results.
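For readers who want the cited "first principles" spelled out, the two standard statements are, in their commonly quoted form (a generic restatement, not this article's specific model):

    % Rent's rule: terminal count of a block of g logic cells
    T = t \, g^{p}, \qquad 0 < p < 1

    % Donath's result: asymptotic average interconnect length of an N-cell
    % 2D placement obeying Rent's rule with exponent p
    \bar{L} \;\propto\;
    \begin{cases}
      N^{\,p - 1/2}, & p > 1/2, \\
      \log N,        & p = 1/2, \\
      O(1),          & p < 1/2.
    \end{cases}

Here T is the number of external terminals of a block containing g cells, t is the average number of terminals per cell, and p is the Rent exponent. The article's evaluation of hexagonal and octagonal tilings builds on these relations together with the stated equivalence of wire flux and wire length.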