ISBN: 9781424403127 (print)
This paper presents Archlog, a language and framework for designing multiprocessor architectures in the logic programming domain. Our goal is to enable application developers in areas such as machine learning and cognitive robotics to produce high-performance designs for reconfigurable devices without detailed knowledge of hardware development. The Archlog framework provides a high level of abstraction, enabling rapid system generation while supporting high performance. We present the Archlog language and its library-based compilation framework, which makes use of a customisable logic programming processor. The system generates multiple designs with different trade-offs in the use of reconfigurable logic and embedded memories. An implementation of a multiprocessor for the machine learning system Progol on a 40 MHz XC2V6000 FPGA is 10 times faster than a 2 GHz Pentium 4 processor.
ISBN: 9781424410590 (print)
Field-programmable circuits now have a capacity that allows them to accelerate floating-point computing, but they are still missing core libraries for it. In particular, there is a need for an equivalent to the mathematical library (libm) that is available with every processor and provides implementations of standard elementary functions such as exponential, logarithm or sine. This is all the more important as FPGAs are able to outperform current processors for such elementary functions, for which no dedicated hardware exists in the processor. FPLibrary, freely available from ***/LIP/Arenaire/, is a first attempt to address this need for a mathematical library for FPGAs. This article demonstrates the implementation, in this library, of high-quality operators for floating-point sine and cosine functions up to single precision. Small size and high performance are obtained using a specific, hardware-oriented algorithm together with careful datapath optimisation and error analysis. Operators fully compatible with the standard software functions are presented first, followed by a study of several more cost-efficient variants.
ISBN: 9781424410590 (print)
In this paper, we propose a first step towards a time-predictable computer architecture for single-chip multiprocessing (CMP). CMP is the current trend in server and desktop systems, and it is even being considered for embedded real-time systems, where worst-case execution time (WCET) estimates are of primary importance. We address the problem of WCET analysis for several processing units accessing a shared resource (the main memory) with support from the hardware. We combine a time-predictable Java processor with a direct memory access (DMA) unit that has a regular access pattern (a VGA controller). We analyze and evaluate different arbitration schemes with respect to schedulability analysis and WCET analysis. We also implement the various combinations in an FPGA, which is an ideal platform for verifying the different concepts and evaluating the results by running applications with an industrial background on real hardware.
ISBN: 9781424438914 (print)
Restricted Boltzmann Machines (RBMs) - the building block for newly popular Deep Belief Networks (DBNs) - are a promising new tool for machine learning practitioners. However, future research in applications of DBNs is hampered by the considerable computation that training requires. In this paper, we describe a novel architecture and FPGA implementation that accelerates the training of general RBMs in a scalable manner, with the goal of producing a system that machine learning researchers can use to investigate ever-larger networks. Our design uses a highly efficient, fully-pipelined architecture based on 16-bit arithmetic for performing RBM training on an FPGA. We show that only 16-bit arithmetic precision is necessary, and we consequently use embedded hardware multiply-and-add (MADD) units. We present performance results to show that a speedup of 25-30X can be achieved over an optimized software implementation on a high-end CPU.
ISBN: 9781424438914 (print)
The capacity of FPGAs has grown significantly, leading to increased complexity of the designs targeting these chips. Traditional FPGA design methodology using HDLs is no longer sufficient, and new methodologies are being sought. An attractive possibility is to use streaming languages. Streaming languages group data into streams, which are processed by computational nodes called kernels. They are suitable for implementation in FPGAs because they expose parallelism, which can be exploited by implementing the application in FPGA logic. Designers can express their designs in a streaming language and target FPGAs without needing a detailed understanding of digital logic design. In this paper we show how the Brook streaming language can be used to simplify design for FPGAs while providing reasonable performance compared to other methodologies. We show that the throughput of streaming applications can be increased through automatic kernel replication. Using our compiler, the FPGA designer can trade off FPGA area against performance by changing the amount of kernel replication. We describe the details of our compiler and present the performance and area of a set of benchmarks. We found that throughput scales well with increased replication for most applications.
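The idea of kernel replication can be illustrated with a small software model. The sketch below is written in Python rather than Brook and is not the compiler described in the abstract; the function name `replicate_kernel`, the round-robin distribution, and the assumption that the kernel is stateless are all illustrative choices.

```python
def replicate_kernel(kernel, stream, replicas):
    """Model of kernel replication for a stateless kernel (hypothetical helper).

    The input stream is dealt round-robin to `replicas` copies of the kernel,
    each copy processes its own lane, and the results are merged back in the
    original element order.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")

    # Split: lane i receives elements i, i + replicas, i + 2*replicas, ...
    lanes = [stream[i::replicas] for i in range(replicas)]

    # Each lane would map to its own kernel instance in hardware;
    # here the copies simply run one after another.
    outputs = [[kernel(x) for x in lane] for lane in lanes]

    # Merge: put each lane's results back at the positions they came from.
    merged = [None] * len(stream)
    for i, lane_out in enumerate(outputs):
        merged[i::replicas] = lane_out
    return merged


squares = replicate_kernel(lambda x: x * x, list(range(10)), replicas=3)
```

In this model, more replicas mean more kernel instances (area) working on the stream at once (throughput), which is the trade-off the abstract refers to.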
ISBN: 9781538685174 (digital)
ISBN: 9781538685174 (print)
Fast implementations of ECC over GF(p) generally use specialized prime fields and hence depend on the structure of the prime. These implementations cannot be ported to generic curves that do not support such prime structures. Generic curves are often used in cryptographic applications such as pairings and post-quantum secure supersingular isogeny-based key exchange. In those cases, modular multiplication is performed by a Montgomery multiplier, which is slower than modular multiplication with specialized primes. This work aims to reduce the speed gap between Montgomery multiplication and modular multiplication in specialized prime fields by presenting an efficient FPGA implementation of a Montgomery multiplier using the redundant number system.
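For readers unfamiliar with Montgomery multiplication, the following minimal sketch shows the textbook reduction (REDC) for an arbitrary odd modulus. It uses ordinary Python integers, not the redundant number system or the FPGA datapath described in the abstract, and the function names and the choice R = 2^k are assumptions made for illustration. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n with R = 2**k > n."""
    if n % 2 == 0 or (1 << k) <= n:
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # satisfies n * n_prime == -1 (mod R)
    r2 = (r * r) % n                 # R^2 mod n, used to enter Montgomery form
    return n_prime, r2


def mont_mul(a_bar, b_bar, n, k, n_prime):
    """Return a_bar * b_bar * R^-1 mod n using only shifts, masks and multiplies."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask   # makes t + m*n divisible by R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u       # single conditional subtraction


def mod_mul(a, b, n, k):
    """Compute (a * b) mod n for 0 <= a, b < n through the Montgomery domain."""
    n_prime, r2 = montgomery_setup(n, k)
    a_bar = mont_mul(a, r2, n, k, n_prime)          # a * R mod n
    b_bar = mont_mul(b, r2, n, k, n_prime)          # b * R mod n
    c_bar = mont_mul(a_bar, b_bar, n, k, n_prime)   # a * b * R mod n
    return mont_mul(c_bar, 1, n, k, n_prime)        # leave Montgomery form
```

The modulus only has to be odd, which is why this approach works for generic primes; no special structure of the prime is exploited.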
ISBN: 9781424438914 (print)
As FPGA logic density continues to increase, new techniques are needed to store initial configuration data efficiently, maintain usability, and minimize cost. In this paper, a novel compression technique is presented for Xilinx Virtex partially reconfigurable FPGAs. The technique relies on constrained hardware design and layout combined with a few simple compression steps. It uses partial reconfiguration to separate a hardware design into two regions: a static region and a partial region. A bitstream containing only the static region is then compressed by removing empty frames. This bitstream is stored in non-volatile memory and used for initialization. The remaining logic is configured through partial reconfiguration over a communication network. By applying this technique, a high level of compression was achieved (almost 90% for the V4 LX25). The technique requires no extra decompression circuitry, and compression levels improve as device size increases.
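The frame-removal step can be pictured with a toy model. The sketch below does not parse a real Xilinx bitstream; it assumes, purely for illustration, that a configuration is a list of fixed-length frames, that an "empty" frame is all zero bytes, and that a skipped frame keeps its cleared state on the device. All function names are hypothetical.

```python
def compress_frames(frames):
    """Keep only frames with at least one non-zero byte, tagged with their index."""
    return [(index, frame) for index, frame in enumerate(frames) if any(frame)]


def restore_frames(kept, total_frames, frame_len):
    """Rebuild the full frame list; frames that were dropped stay all-zero."""
    frames = [bytes(frame_len) for _ in range(total_frames)]
    for index, frame in kept:
        frames[index] = frame
    return frames


def space_saved(frames, kept):
    """Fraction of frames removed (ignores the small per-frame index overhead)."""
    return 1 - len(kept) / len(frames)


# Toy configuration: 10 frames of 4 bytes, only two of them in use.
config = [bytes(4)] * 10
config[2] = b"\x01\x02\x03\x04"
config[7] = b"\x00\xff\x00\x00"

kept = compress_frames(config)
```

Because the kept frames are written to their original addresses and nothing is encoded, loading them needs no decoder, which is consistent with the abstract's remark that no extra decompression circuitry is required.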
ISBN: 9781479900046 (print)
Stencil-based algorithms are computationally intensive and are used in many scientific applications. The scalability of stencil algorithms in large-scale clusters is limited by data dependencies between distributed workloads. This paper proposes a scalable communication model that schedules communication operations based on available resources and algorithm properties. Experimental results from the Maxeler MPC-C500 computing system with four Virtex-6 SX475T FPGAs demonstrate linear speedup.
ISBN: 9781467381239 (print)
In modern verification environments, FPGA-based prototyping has become an important part of the overall verification flow. The ability to run real-time applications at more realistic speeds allows much higher coverage than traditional HDL logic simulators. The main disadvantage of FPGA prototyping is the inability to inspect and observe internal FPGA signals. There are currently two traditional solutions to this problem: the first uses embedded trace buffers to record a subset of internal signals, and the second captures a snapshot of the current FPGA state. Both techniques have benefits and shortcomings. In this paper, we present the idea of merging these two techniques into a new hybrid approach. Using this idea, we created a hybrid circuit and showed in our experiments that it preserves the advantages of both traditional approaches.
ISBN: 9781467381239 (print)
Implementing Dynamic Voltage and Frequency Scaling (DVFS) is a non-trivial task on FPGAs and requires knowledge about the feasible voltage and frequency (VF) ranges as a first step. The feasible VF ranges depend not only on the size of the critical path in the design but also on the inter- and intra-die variability of the FPGA die. Moreover, variations in the configuration of the FPGA strongly affect the feasible VF ranges. Therefore, it is crucial to characterise feasibility by studying the relationship between feasible VF regions and these sources of variability in FPGAs. In this paper we employ a self-checking multiplier which uses residue codes and DVFS, implemented on the programmable logic component of a Xilinx Zynq ZC702 device, as an error-detection circuit to study these feasible regions. Results show that, as expected, feasible VF ranges vary with FPGA configuration. More interestingly, significant variation of the feasible VF regions is found for different dies. These results highlight the necessity of dynamic self-testing as part of an adaptive DVFS implementation on FPGAs. Employing the techniques presented in this work enables the implementation of efficient adaptive on-line DVFS on programmable logic while ensuring reliability.
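The principle behind a residue-code self-checking multiplier can be shown in a few lines. This sketch is a software model only; the check modulus 3 and the function names are assumptions for illustration and are not taken from the paper. It relies on the identity (a * b) mod m = ((a mod m) * (b mod m)) mod m: a small side computation on the residues predicts what the residue of the full product must be.

```python
def residue_check(a, b, product, modulus=3):
    """Return True if `product` is consistent with a * b under the residue code."""
    expected = ((a % modulus) * (b % modulus)) % modulus
    return product % modulus == expected


def flip_bit(value, bit):
    """Model a single-bit fault, e.g. a timing error at an infeasible VF point."""
    return value ^ (1 << bit)


a, b = 1234, 5678
good = a * b
faulty = flip_bit(good, 5)

ok_result_passes = residue_check(a, b, good)
fault_is_detected = not residue_check(a, b, faulty)
```

With modulus 3, every single-bit flip is caught, because flipping bit i changes the product by plus or minus 2^i and no power of two is divisible by 3. Faults that change the product by a multiple of 3 go unnoticed, which is the usual limitation of a residue check.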