WalkSAT (WSAT) is a stochastic local search algorithm for Boolean Satisfiability (SAT) and Maximum Boolean Satisfiability (MaxSAT) problems, and its high inherent parallelism makes it well suited to hardware acceleration. Formal verification is one of the most important applications of SAT and MaxSAT; however, formal verification problems are significantly larger than on-chip memory, so most of the data have to be placed in off-chip DRAM. In this paper, we propose a method to hide the access delay by using on-chip memory banks as a variable-way associative cache memory. The size of the data blocks frequently fetched from DRAM varies considerably in the WSAT algorithm. The cache therefore holds a whole block when it is small enough, and only the head portion when it is large, to hide the DRAM access delay. With this cache memory, up to 60% of the DRAM access delay can be hidden, and performance improves by up to 26%.
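For reference, a minimal software sketch of the WSAT inner loop is shown below; it illustrates the algorithm being accelerated, not the paper's hardware architecture or cache organisation, and the noise parameter and greedy criterion are common choices rather than the exact variant used.

```python
import random

def walksat(clauses, n_vars, max_flips=100_000, noise=0.5):
    # clauses: list of clauses, each a list of DIMACS-style literals
    # (positive int = variable, negative int = negated variable)
    assign = [random.choice([False, True]) for _ in range(n_vars + 1)]

    def satisfied(clause):
        return any((lit > 0) == assign[abs(lit)] for lit in clause)

    def cost_after_flip(var):
        assign[var] = not assign[var]
        cost = sum(1 for c in clauses if not satisfied(c))
        assign[var] = not assign[var]
        return cost

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign                          # satisfying assignment found
        clause = random.choice(unsat)              # pick a random unsatisfied clause
        if random.random() < noise:
            var = abs(random.choice(clause))       # random-walk move
        else:
            # greedy move: flip the literal that leaves the fewest clauses unsatisfied
            var = min((abs(lit) for lit in clause), key=cost_after_flip)
        assign[var] = not assign[var]
    return None                                    # flip budget exhausted
```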
This paper presents new approaches for in-system, trace-based debug of High-Level Synthesis-generated hardware. These approaches include the use of Event Observability Ports (EOPs) that provide observability of source-level events in the final hardware. We also propose the use of small, independent trace buffers called Event Observability Buffers (EOBs) for tracing events through EOPs. EOBs include a data storage enable signal that allows cycle-by-cycle storage decisions to be made on an EOB-by-EOB basis. This approach causes the timing relationships of events captured in different trace buffers to be lost, and two methods are presented for recovering these relationships. Finally, we present a case study that demonstrates the feasibility and effectiveness of an EOB trace strategy.
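As a rough behavioural illustration of the storage-enable idea (class and signal names are assumptions for illustration, not the paper's RTL):

```python
class EventObservabilityBuffer:
    """Behavioural model of a small, independent trace buffer: on each clock
    cycle the buffer records the value on its event port only when its own
    storage-enable signal is asserted, so storage decisions are made per EOB."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []

    def clock(self, store_enable, event_value):
        if store_enable:
            if len(self.entries) == self.depth:
                self.entries.pop(0)        # circular behaviour: overwrite the oldest entry
            self.entries.append(event_value)
```

Because each buffer stores entries independently, the cross-buffer timing of events is not preserved, which is the relationship the paper's recovery methods restore.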
The transition from a neural network simulation model to its hardware representation is a complex process that touches on computation precision, performance, and effective architecture implementation. The presented neural processing accelerator involves neural network sectioning, precision reduction, and weight coefficient arrangement in order to increase efficiency and maximize FPGA hardware resource utilization. Particular attention has been devoted to ANN conversion methods designed for a system based on neural processing units, and to the redundant calculations and empty neurons that this process can generate. In addition, the paper benchmarks the FPGA-based Neural Processing Accelerator architecture on a real implementation of a pattern recognition neural network.
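A simple sketch of the precision-reduction step is shown below; the fixed-point format and rounding policy are illustrative assumptions, not the paper's exact conversion flow.

```python
import numpy as np

def quantize_weights(weights, frac_bits=8, total_bits=16):
    """Round floating-point weight coefficients to signed fixed-point values
    with `frac_bits` fractional bits, saturating at the representable range."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    codes = np.clip(np.round(np.asarray(weights) * scale), lo, hi).astype(np.int32)
    return codes, codes.astype(np.float64) / scale   # integer codes and their effective values
```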
This paper describes the system architecture and implementation results of a robust and flexible dual-frequency 2×2 array-processing GNSS receiver platform. A digital front-end FPGA pre-processes the incoming raw ADC data and implements interference mitigation methods in the time and frequency domains. An optional second FPGA card can be used to realize more sophisticated and computationally complex interference mitigation techniques. Finally, the data stream is processed on a baseband FPGA platform with spatial array processing techniques using a software-assisted hardware GNSS receiver approach. The interconnection of the FPGAs is realized with gigabit transceivers handling a constant raw data rate of 16.8 Gbit/s.
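One common frequency-domain mitigation technique is excision of narrowband interferers by blanking strong FFT bins; the sketch below illustrates that idea only and does not reflect the platform's actual algorithms or parameters.

```python
import numpy as np

def excise_narrowband(samples, threshold_factor=5.0):
    """Blank FFT bins whose magnitude greatly exceeds the median bin magnitude,
    then transform back; a simple model of frequency-domain interference excision."""
    spectrum = np.fft.fft(samples)
    magnitude = np.abs(spectrum)
    spectrum[magnitude > threshold_factor * np.median(magnitude)] = 0.0
    return np.fft.ifft(spectrum)
```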
FPGA block RAMs (BRAMs) offer speed advantages over LUT-based memory designs, but a BRAM has only one read and one write port. Designers must combine multiple BRAMs to create multi-port memory structures, which is harder than designing with LUT-based multi-port memories. Multi-port memory designs increase overall performance but come at an area cost. In this paper, we present a fully automated methodology that tailors our multi-port memory to a given application. We present the performance improvements and area tradeoffs on state-of-the-art string matching algorithms.
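A behavioural sketch of one standard way to obtain extra read ports from single-read banks, namely replicating the memory per read port, is given below; the paper's automated tailoring and its exact port composition are not modelled.

```python
class MultiReadPortMemory:
    """N-read, 1-write memory built from N single-read banks by replication:
    every write is broadcast to all replicas, each read port reads its own."""

    def __init__(self, depth, read_ports):
        self.replicas = [[0] * depth for _ in range(read_ports)]

    def write(self, addr, data):
        for bank in self.replicas:
            bank[addr] = data                 # broadcast the single write port

    def read(self, port, addr):
        return self.replicas[port][addr]      # each read port has a dedicated replica
```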
ZYNQ devices combine a dual-core ARM Cortex-A9 processor and an FPGA fabric in the same die but in different power domains. In this paper we investigate the run-time power scaling capabilities of these devices using off-the-shelf boards and propose accurate and fine-grained power control and monitoring techniques. The experimental results show that both software and hardware methods are possible, and the right choice yields different trade-offs in control and monitoring speed, measurement accuracy, power consumption, and area overhead. The results also demonstrate that significant power margins are available in the FPGA device across the possible voltage configurations. This can complement traditional voltage scaling techniques applied to the processor domain to obtain hybrid energy-proportional computing platforms.
This paper presents a novel method for estimating parameters of financial models with jump diffusions. It is a Particle Filter based Maximum Likelihood Estimation process, which uses particle streams to enable efficient evaluation of constraints and weights. We also provide a CPU-FPGA collaborative design for parameter estimation of the Stochastic Volatility with Correlated and Contemporaneous Jumps model as a case study. The result is evaluated against a CPU and a cloud computing platform. We show a 14 times speedup for the FPGA design compared with the CPU, and a similar speedup but better convergence compared with an alternative parallelisation scheme using Techila Middleware in a multi-CPU environment.
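The likelihood evaluation at the core of such a method can be sketched as a generic bootstrap particle filter; the SVCJ dynamics, the constraints, and the CPU-FPGA partitioning are deliberately left abstract here.

```python
import numpy as np

def particle_filter_loglik(observations, propagate, likelihood,
                           n_particles=1024, rng=None):
    """Estimate the log-likelihood of a parameter set with a bootstrap particle
    filter; `propagate` draws x_t given x_{t-1}, `likelihood` gives p(y_t | x_t)."""
    rng = rng or np.random.default_rng()
    particles = np.zeros(n_particles)                 # assumed initial latent state
    loglik = 0.0
    for y in observations:
        particles = propagate(particles, rng)         # state transition per particle
        weights = likelihood(y, particles)            # observation density per particle
        loglik += np.log(np.mean(weights) + 1e-300)   # accumulate marginal likelihood
        weights = weights / weights.sum()
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]                    # multinomial resampling
    return loglik
```

Maximum likelihood estimation then amounts to searching the parameter space for the setting that maximises this returned value.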
Robust real-time tracking is a requirement for many emerging applications. Many of these applications must track objects even as their appearance changes. Training classifiers online has become an effective approach for dealing with variability in object appearance: classifiers can learn and adapt to changes online at the cost of additional runtime computation. In this paper, we propose an FPGA-accelerated design of an online boosting algorithm that uses multiple classifiers to track and recover objects in real time. Our algorithm uses a novel method for training and comparing pose-specific classifiers along with adaptive tracking classifiers. Our FPGA-accelerated design is able to track at 60 frames per second while concurrently evaluating 11 classifiers. This represents a 30× speedup over a CPU-based software implementation. It also demonstrates state-of-the-art tracking accuracy on a standard set of videos.
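A minimal sketch of the online boosting idea follows: a pool of weak learners is updated with each labelled sample and combined according to its running error. The weak-learner form, pool size, and the pose-specific classifier design are assumptions for illustration, not the paper's method.

```python
import numpy as np

class OnlineBoostedClassifier:
    """Pool of threshold weak learners, each tracking its own running error;
    predictions weight confident weak learners more heavily (AdaBoost-style)."""

    def __init__(self, n_features, n_weak=32, rng=None):
        rng = rng or np.random.default_rng()
        self.feature = rng.integers(0, n_features, size=n_weak)  # feature index per weak learner
        self.threshold = rng.normal(size=n_weak)
        self.correct = np.ones(n_weak)
        self.wrong = np.ones(n_weak)

    def _weak_predict(self, x):
        return np.where(x[self.feature] > self.threshold, 1, -1)

    def update(self, x, label):              # label is +1 or -1
        pred = self._weak_predict(x)
        self.correct += (pred == label)
        self.wrong += (pred != label)

    def predict(self, x):
        error = self.wrong / (self.correct + self.wrong)
        alpha = np.log((1.0 - error) / error)  # low-error learners get high weight
        return np.sign(np.sum(alpha * self._weak_predict(x)))
```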
Industrial applications often require processing data with large dynamic ranges at low sample rates. As algorithms become more complex, handling the data ranges of the variables required for fixed-point implementations becomes time consuming and can also lead to inefficient designs. Floating-point solutions alleviate these limitations, trading automatic data range handling for a usually higher implementation cost. The adoption of floating-point solutions for this class of applications is therefore conditioned by area and performance requirements. In this paper we present a low-cost floating-point unit which can either be used standalone or be attached to a RISC microprocessor. The proposed unit targets modern, multiplier-based FPGAs; efficiently computes the costly operations ×, ÷, 1/x, √x and 1/√x; requires fewer than 700 LEs and four 9-bit multipliers on a Cyclone IV; and runs close to 150 MHz.
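Division, reciprocal, and inverse square root on multiplier-based FPGAs are commonly built from Newton-Raphson iterations that use only multiplies and adds; the software sketch below shows the iteration, with the linear and constant seeds standing in for the small lookup tables a hardware unit would use. The paper's exact micro-architecture is not reproduced.

```python
def reciprocal(m, iterations=3):
    """1/m for a mantissa m normalised to [0.5, 1), using only × and ±."""
    y = 48.0 / 17.0 - 32.0 / 17.0 * m     # linear seed valid on [0.5, 1)
    for _ in range(iterations):
        y = y * (2.0 - m * y)             # Newton step: error roughly squares each time
    return y

def inv_sqrt(m, iterations=3):
    """1/sqrt(m) for m normalised to [1, 4), using only × and ±."""
    y = 2.0 / 3.0                         # crude constant seed; a LUT seed converges faster
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * m * y * y)   # Newton step for the inverse square root
    return y

# Division and square root then follow from these primitives:
# a / b = a * reciprocal(b), and sqrt(m) = m * inv_sqrt(m), again using only multiplications.
```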
FPGAs are commonly used in high-performance computing applications, often in the form of streaming systems which exploit the parallelism of algorithms along pipelined kernels. While such applications have traditionally been designed at the Register Transfer Level (RTL), the increasing complexity in terms of FPGA resource usage, arithmetic logic, and dataflow is making RTL design times prohibitive. This necessitates high-level programming tools that transparently handle low-level aspects, thus simplifying the design process. Examples of high-level tools for building streaming systems include MaxCompiler by Maxeler Technologies and DSP Builder by Altera. We propose an interception layer which, when inserted into communication channels, transparently enhances their performance and capabilities without requiring modification of the streaming kernels or host code. We discuss specific channel enhancements: lossless compression to improve effective bandwidth, and error correction and fault tolerance to improve reliability. The interception layer is intended to add complex behaviour while maintaining the simplicity of the high-level abstraction when transmitting data via a channel.
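The channel-wrapping idea can be sketched as follows; the send/receive interface and the use of zlib are assumptions for illustration and are not MaxCompiler's or DSP Builder's actual APIs.

```python
import zlib

class CompressingChannel:
    """Interception layer: wraps an existing channel object exposing send() and
    receive() for byte payloads, and transparently compresses the stream so the
    kernels and host code on either side are unchanged."""

    def __init__(self, channel):
        self.channel = channel

    def send(self, payload: bytes):
        self.channel.send(zlib.compress(payload))      # fewer bytes cross the link

    def receive(self) -> bytes:
        return zlib.decompress(self.channel.receive())
```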