Heterogeneous Multiprocessor System-on-Chip (Ht-MPSoC) architectures represent a promising approach, as they allow a better performance/energy-consumption trade-off. In such systems, the processor instruction set is enhanced with application-specific custom instructions implemented on reconfigurable fabric, typically an FPGA. To increase area utilization and guarantee that application constraints are respected, we propose a new architecture in which Ht-MPSoC hardware accelerators are shared among different processors in an intelligent manner. In this paper, a Mixed Integer Linear Programming (MILP) model is proposed to systematically explore the complex design space of the different configurations.
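Not part of the paper — the flavour of the design space the MILP explores can be sketched with a tiny brute-force stand-in: each accelerator is either replicated privately per processor (more area, no contention) or shared (less area, contention latency), and we minimise area subject to a latency budget. All names and numbers below are hypothetical.

```python
from itertools import product

# Toy accelerator-sharing design-space exploration (illustrative only; the
# paper formulates this as a MILP, here we brute-force a tiny instance).
# Hypothetical data: accelerator name -> (area cost, base latency).
ACCELS = {"mul32": (120, 4), "crc": (80, 6)}
PROCS = ["P0", "P1", "P2"]
CONTENTION = 2          # extra cycles per additional processor sharing a copy
LATENCY_BUDGET = 10     # max latency any processor may see per accelerator

def evaluate(config):
    """config maps accelerator -> 'shared' or 'private'; returns (area, feasible)."""
    area = 0
    for name, mode in config.items():
        a, lat = ACCELS[name]
        if mode == "private":
            area += a * len(PROCS)                        # one copy per processor
            worst = lat
        else:
            area += a                                     # single shared copy
            worst = lat + CONTENTION * (len(PROCS) - 1)   # worst-case queueing
        if worst > LATENCY_BUDGET:
            return area, False
    return area, True

best = None
for modes in product(["shared", "private"], repeat=len(ACCELS)):
    config = dict(zip(ACCELS, modes))
    area, ok = evaluate(config)
    if ok and (best is None or area < best[0]):
        best = (area, config)

print(best)   # minimum-area feasible configuration
```

A real MILP replaces the enumeration with binary mapping variables and linear latency/area constraints, which is what makes larger instances tractable.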
The variety of applications for field-programmable gate arrays (FPGAs) is continuously growing, so it is important to address power consumption during operation. As the technology node shrinks, leakage power becomes increasingly critical in the overall power consumption of FPGAs. The configuration pre-fetching technique (loading configurations as early as possible), adopted to achieve high performance, is a major source of leakage waste, since regions holding reconfiguration information cannot be powered down during the gap between reconfiguration and execution. In this work, we present a heuristic approach to minimize leakage power consumption for two-dimensional reconfigurable FPGA architectures. The heuristic scheduler is based on list scheduling; it exploits a dynamic priority for sorting the tasks into schedule order and a cost function for cell allocation. A farthest-placement scheme is adopted for anti-fragmentation purposes. The cost function provides control to trade off leakage dissipation against schedule length.
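Not part of the paper — the list-scheduling skeleton described above can be sketched as follows. The priority rule, cost weights and leakage model here are hypothetical stand-ins for the paper's heuristic, meant only to show how a cost function steers cell allocation.

```python
from itertools import combinations

# Minimal list-scheduling sketch for a cell-based reconfigurable fabric.
# Each task: (id, release time, duration, number of cells needed).
CELLS = 4
tasks = [("t0", 0, 3, 2), ("t1", 0, 2, 1), ("t2", 1, 4, 2)]
ALPHA, BETA = 1.0, 0.5     # weight finish time vs. idle (leakage-prone) time

cell_free = [0] * CELLS    # time at which each cell next becomes free
schedule = {}

# Dynamic priority: shortest task first, ties broken by release time.
for tid, rel, dur, need in sorted(tasks, key=lambda t: (t[2], t[1])):
    best = None
    for picked in combinations(range(CELLS), need):
        start = max([rel] + [cell_free[c] for c in picked])
        idle = sum(start - cell_free[c] for c in picked)   # leakage proxy
        # Cost trades schedule length against idle cell time; the secondary
        # term prefers high-index cells, mimicking farthest placement.
        key = (ALPHA * start + BETA * idle, -sum(picked))
        if best is None or key < best[0]:
            best = (key, picked, start)
    _, picked, start = best
    for c in picked:
        cell_free[c] = start + dur
    schedule[tid] = (start, list(picked))

print(schedule)
```

Raising BETA penalises configured-but-idle time more heavily, shortening the window in which cells must stay powered at the cost of a longer schedule — the compromise the abstract describes.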
Floating-point computing with more than one TFLOP of peak performance is already a reality in recent Field-Programmable Gate Arrays (FPGAs). General-Purpose Graphics Processing Units (GPGPUs) and recent many-core CPUs have also taken advantage of recent technological innovations in integrated circuit (IC) design and have dramatically improved their peak performance. In this paper, we compare the trends of these computing architectures for high-performance computing and survey these platforms in the execution of algorithms belonging to different scientific application domains. Trends in peak performance, power consumption and sustained performance for particular applications show that the gap between FPGAs and both GPUs and many-core CPUs is widening, moving FPGAs away from high-performance computing with intensive floating-point calculations. FPGAs remain competitive for custom floating-point or fixed-point representations, for smaller input sizes of certain algorithms, for combinational logic problems and for parallel map-reduce problems.
WalkSAT (WSAT) is a stochastic local search algorithm for the Boolean Satisfiability (SAT) and Maximum Boolean Satisfiability (MaxSAT) problems, and it is very suitable for hardware acceleration because of its high inherent parallelism. Formal verification is one of the most important applications of SAT and MaxSAT; however, formal verification problems are significantly larger than the on-chip memory, and most of the data have to be placed in off-chip DRAM. In this paper, we propose a method to hide the access delay by using on-chip memory banks as a variable-way associative cache memory. The size of the data blocks that are frequently fetched from DRAM varies considerably in the WSAT algorithm. This cache memory aims to hold the whole block when it is small enough, and only its head portion when it is large, to hide the DRAM access delay. With this cache memory, up to 60% of the DRAM access delay can be hidden, and performance can be improved by up to 26%.
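Not part of the paper — for reference, the classic WalkSAT loop being accelerated looks like this (a plain software sketch; the paper's FPGA datapath and caching scheme are not modelled, and the flip score here is the simple "resulting unsatisfied-clause count" variant).

```python
import random

# Minimal WalkSAT sketch. CNF: list of clauses; a literal is +v or -v.
def walksat(clauses, n_vars, p=0.5, max_flips=10000, seed=0):
    rng = random.Random(seed)
    # Index 0 is a dummy so variables are 1-indexed.
    assign = [False] + [rng.random() < 0.5 for _ in range(n_vars)]

    def satisfied(clause):
        return any(assign[abs(l)] == (l > 0) for l in clause)

    def score(v):
        # Number of unsatisfied clauses if v were flipped (simple score;
        # true break-count restricts to currently satisfied clauses).
        assign[v] = not assign[v]
        s = sum(1 for c in clauses if not satisfied(c))
        assign[v] = not assign[v]
        return s

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign[1:]                    # satisfying assignment found
        clause = rng.choice(unsat)
        if rng.random() < p:                     # noisy random-walk move
            v = abs(rng.choice(clause))
        else:                                    # greedy move: minimise score
            v = min((abs(l) for l in clause), key=score)
        assign[v] = not assign[v]
    return None                                  # give up (MaxSAT: keep best)

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
cnf = [[1, 2], [-1, 3], [-2, -3]]
print(walksat(cnf, 3))
```

The inner loop — clause selection, per-variable scoring, flip — touches clause/variable data whose block sizes vary widely, which is exactly the access pattern the proposed variable-way cache targets.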
This paper presents new approaches for in-system, trace-based debug of High-Level Synthesis-generated hardware. These approaches include the use of Event Observability Ports (EOPs) that provide observability of source-level events in the final hardware. We also propose the use of small, independent trace buffers called Event Observability Buffers (EOBs) for tracing events through EOPs. EOBs include a data-storage enable signal that allows cycle-by-cycle storage decisions to be made on an EOB-by-EOB basis. This approach causes the timing relationships of events captured in different trace buffers to be lost, and two methods are presented for recovering these relationships. Finally, we present a case study that demonstrates the feasibility and effectiveness of an EOB trace strategy.
The transition from a neural network simulation model to its hardware representation is a complex process that touches on computation precision, performance, and efficient architecture implementation. The presented neural processing accelerator involves neural network sectioning, precision reduction, and weight-coefficient arrangement in order to increase efficiency and maximize the utilization of FPGA hardware resources. Particular attention has been devoted to ANN conversion methods designed for a system based on neural processing units, and to the redundant calculations and empty-neuron generation associated with this process. In addition, this paper describes a benchmark of the FPGA-based Neural Processing Accelerator architecture on a real implementation of a pattern-recognition neural network.
This paper describes the system architecture and implementation results of a robust and flexible dual-frequency 2×2 array-processing GNSS receiver platform. A digital front-end FPGA pre-processes the incoming raw ADC data and implements interference mitigation methods in the time and frequency domains. An optional second FPGA card can be used to realize more sophisticated and computationally complex interference mitigation techniques. Finally, the data stream is processed on a baseband FPGA platform with spatial array-processing techniques using a software-assisted hardware GNSS receiver approach. The interconnection of the FPGAs is realized using gigabit transceivers handling a constant raw data rate of 16.8 Gbit/s.
ZYNQ devices combine a dual-core ARM Cortex-A9 processor and an FPGA fabric in the same die but in different power domains. In this paper we investigate the run-time power scaling capabilities of these devices using off-the-shelf boards and propose accurate and fine-grained power control and monitoring techniques. The experimental results show that both software and hardware methods are possible and that the right selection can yield different results in terms of control and monitoring speed, measurement accuracy, power consumption, and area overhead. The results also demonstrate that significant power margins are available in the FPGA device, with different voltage configurations possible. This can be used to complement traditional voltage scaling techniques applied to the processor domain to obtain hybrid energy-proportional computing platforms.
FPGA block RAMs (BRAMs) offer speed advantages over LUT-based memory designs, but a BRAM has only one read and one write port. Designers need to combine multiple BRAMs in order to create multi-port memory structures, which is more difficult than designing with LUT-based multi-port memories. Multi-port memory designs increase overall performance but come at an area cost. In this paper, we present a fully automated methodology that tailors our multi-port memory to a given application. We present our performance improvements and area trade-offs on state-of-the-art string matching algorithms.
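Not part of the paper — one standard way to build multi-write-port memories from single-write BRAMs is bank replication plus a live-value table (LVT), sketched behaviourally below. The paper's generated memories may use a different construction; this only illustrates why multi-port designs multiply BRAM count.

```python
# Behavioural sketch of a 2-write-port memory built from single-write banks
# plus a live-value table (LVT). In hardware, each bank (and the LVT) is a
# BRAM, and each extra read port additionally replicates every bank.
class MultiPortMem:
    def __init__(self, depth):
        self.banks = [[0] * depth, [0] * depth]  # one bank per write port
        self.lvt = [0] * depth                   # which bank wrote last, per addr

    def write(self, port, addr, data):
        # Each write port owns its bank, so two writes never collide on a BRAM.
        self.banks[port][addr] = data
        self.lvt[addr] = port                    # record the freshest copy

    def read(self, addr):
        # A read consults the LVT to select the bank holding the live value.
        return self.banks[self.lvt[addr]][addr]

m = MultiPortMem(16)
m.write(0, 3, 11)   # write port 0 updates address 3
m.write(1, 3, 22)   # write port 1 overwrites address 3 in its own bank
print(m.read(3))    # -> 22: the LVT steers the read to bank 1
```

The area cost is roughly (write ports × read ports) bank copies plus the LVT, which is the replication overhead an application-tailored methodology tries to minimise.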
This paper presents a novel method for estimating the parameters of financial models with jump diffusions. It is a Particle Filter-based Maximum Likelihood Estimation process that uses particle streams to enable efficient evaluation of constraints and weights. We also provide a CPU-FPGA collaborative design for parameter estimation of the Stochastic Volatility with Correlated and Contemporaneous Jumps model as a case study. The result is evaluated by comparison with a CPU and a cloud computing platform. We show a 14-fold speedup for the FPGA design compared with the CPU, and a similar speedup but better convergence compared with an alternative parallelisation scheme using Techila Middleware in a multi-CPU environment.
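Not part of the paper — the core of particle-filter-based maximum-likelihood estimation is the filter's log-likelihood estimate, maximised over model parameters. The paper targets an SVCJ model on a CPU-FPGA system; the sketch below substitutes a toy AR(1)-plus-noise state-space model, with all parameter values illustrative.

```python
import math
import random

# Particle-filter log-likelihood for x_t = phi*x_{t-1} + e_t, y_t = x_t + v_t.
def log_likelihood(ys, phi, sigma_x=1.0, sigma_y=1.0, n=500, seed=1):
    rng = random.Random(seed)
    particles = [rng.gauss(0, 1) for _ in range(n)]
    ll = 0.0
    for y in ys:
        # Propagate each particle through the state equation.
        particles = [phi * x + rng.gauss(0, sigma_x) for x in particles]
        # Weight by the Gaussian observation density p(y | x).
        ws = [math.exp(-0.5 * ((y - x) / sigma_y) ** 2) for x in particles]
        mean_w = sum(ws) / (n * sigma_y * math.sqrt(2 * math.pi))
        ll += math.log(max(mean_w, 1e-300))      # guard against underflow
        # Multinomial resampling keeps the particle cloud on the data.
        particles = rng.choices(particles, weights=ws, k=n)
    return ll

# Simulate data with true phi = 0.8, then pick phi by a coarse grid search
# (a full MLE would use a proper optimiser over all model parameters).
rng = random.Random(0)
x, ys = 0.0, []
for _ in range(60):
    x = 0.8 * x + rng.gauss(0, 1.0)
    ys.append(x + rng.gauss(0, 1.0))

best_phi = max([0.0, 0.4, 0.8, 1.2], key=lambda p: log_likelihood(ys, p))
print(best_phi)
```

The propagate/weight/resample loop over many particles is embarrassingly parallel per particle, which is what makes streaming the particles through an FPGA datapath attractive.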