Dynamic workload orchestration is one of the main concerns when working with heterogeneous computing infrastructures in the edge-cloud continuum. In this context, FPGA-based computing nodes can take advantage of their improved flexibility, performance, and energy efficiency, provided that they use proper resource management strategies. In this regard, many state-of-the-art systems rely on proactive power management techniques and task scheduling decisions, which in turn require deep knowledge about the applications to be accelerated and the actual response of the target reconfigurable fabrics when executing them. While acquiring this knowledge at design time was more or less feasible in the past, when applications were mostly static task graphs that did not change at run time, the highly dynamic nature of current workloads in the edge-cloud continuum, where tasks can be deployed on any node at any time, has removed this possibility. As a result, being able to derive such information at run time to make informed decisions has become a must. This article presents an infrastructure to build incremental ML models that can be used to obtain run-time power consumption and performance estimations in FPGA-based reconfigurable multi-accelerator systems operating under dynamic workloads. The proposed infrastructure features a novel stop-and-restart resource-aware mechanism to monitor and control the model training and evaluation stages during normal system operation, enabling low-overhead model updates to account for either unexpected acceleration requests (i.e., tasks not previously considered by the models) or model drift (e.g., fabric degradation). Experimental results show that the proposed approach induces a maximum additional error of 3.66% compared to a continuous training alternative, while incurring only a 4.49% execution time overhead, compared to the 20.91% overhead induced by continuous training.
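The stop-and-restart idea described above can be illustrated with a minimal Python sketch. This is not the article's implementation: the class name, the per-task online linear regressors, the relative-error drift trigger, and the update budget are all assumptions made for illustration; they only show how training can be frozen during normal operation and briefly re-enabled on an unseen task or on drift.

```python
class IncrementalEstimator:
    """Hypothetical sketch of a stop-and-restart incremental model:
    per-task online linear regressors that are frozen during normal
    operation and re-enabled on drift or on an unseen task."""

    def __init__(self, n_features, lr=0.01, drift_threshold=0.25, budget=200):
        self.n_features = n_features
        self.lr = lr
        self.drift_threshold = drift_threshold  # relative-error restart trigger
        self.budget = budget                    # max SGD updates per (re)start
        self.models = {}                        # task id -> weights (+ bias)
        self.remaining = {}                     # task id -> remaining budget

    def _predict(self, w, x):
        return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]  # last weight = bias

    def predict(self, task, x):
        w = self.models.get(task)
        return self._predict(w, x) if w is not None else None

    def observe(self, task, x, y):
        """Called with a measured sample; trains only while budget remains."""
        if task not in self.models:                  # unseen acceleration request
            self.models[task] = [0.0] * (self.n_features + 1)
            self.remaining[task] = self.budget       # start training
        w = self.models[task]
        err = self._predict(w, x) - y
        if self.remaining[task] == 0:                # model is frozen (stopped)
            if abs(err) > self.drift_threshold * max(abs(y), 1e-9):
                self.remaining[task] = self.budget   # drift detected -> restart
            else:
                return                               # low-overhead path: no update
        for i in range(self.n_features):             # one SGD step
            w[i] -= self.lr * err * x[i]
        w[-1] -= self.lr * err
        self.remaining[task] -= 1
```

Once the budget is exhausted, each observation costs only one prediction and a comparison, which is where the low run-time overhead would come from in such a scheme.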
The end of Moore's law and Dennard scaling emphasizes the need for application-specific computing architectures to achieve high resource and energy efficiency and real-time performance. The concept of a silicon compiler remains an enduring aspiration for design-time reduction. To generate hardware implementations at the register-transfer level from behavioral descriptions, design automation tools must address challenging and interdependent problems, including allocation, scheduling, and binding. Additionally, manual intervention by the user is necessary to balance the resources-vs.-performance tradeoff via, for example, function inlining or loop unrolling/pipelining. Existing approaches typically solve these problems sequentially, compromising optimality in favor of simplicity and runtime. Here we show how to model the whole model-based design flow as one holistic integer linear programming (ILP) formulation, aiming to consistently derive the optimal microarchitecture for any given application. Incorporating clock gating minimizes the number of useless operations with negligible resource overhead (if any), while always guaranteeing optimal throughput. The unified nature of the proposed ILP model enables implementations unmatched by state-of-the-art approaches in terms of resource efficiency and measured power consumption. These results facilitate a streamlined design flow for highly optimized embedded systems in the context of model-based design.
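The joint objective such an ILP encodes, namely minimum-latency scheduling of a dataflow graph under allocation (resource) constraints, can be made concrete with a toy example. The sketch below is not the paper's formulation and uses no ILP solver; it brute-forces all schedules of a hypothetical three-operation graph (v = a*b + c*d) with one multiplier and one adder, which is exactly the search space an ILP would explore symbolically.

```python
from itertools import product

# Toy dataflow graph: v = (a*b) + (c*d); every operation takes one cycle.
OPS = ["m1", "m2", "add"]
KIND = {"m1": "mul", "m2": "mul", "add": "add"}
DEPS = {"m1": [], "m2": [], "add": ["m1", "m2"]}
RES = {"mul": 1, "add": 1}   # allocation: one multiplier, one adder
HORIZON = 5                  # max start cycle considered

def feasible(start):
    # Dependency constraints: a consumer starts after its producers finish.
    for op, preds in DEPS.items():
        for p in preds:
            if start[p] + 1 > start[op]:
                return False
    # Resource constraints: per cycle, no more units busy than allocated.
    for cycle in range(HORIZON):
        used = {}
        for op, s in start.items():
            if s == cycle:
                used[KIND[op]] = used.get(KIND[op], 0) + 1
        if any(used.get(k, 0) > RES[k] for k in RES):
            return False
    return True

def optimal_schedule():
    """Exhaustively minimize makespan -- the ILP objective in miniature."""
    best, best_len = None, None
    for starts in product(range(HORIZON), repeat=len(OPS)):
        start = dict(zip(OPS, starts))
        if feasible(start):
            makespan = max(s + 1 for s in start.values())
            if best_len is None or makespan < best_len:
                best, best_len = start, makespan
    return best, best_len
```

With a single multiplier, the two multiplications must serialize, so the optimum is three cycles; a real ILP model adds binding and microarchitectural decisions to the same constraint system instead of enumerating.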
Runtime Reconfiguration (RTR) has traditionally been utilized as a means for exploiting the flexibility of High-Performance Reconfigurable Computers (HPRCs). However, the RTR feature comes with the cost of high config...
Reconfigurable architectures are quickly gaining in popularity due to their flexibility and ability to provide high energy efficiency. However, reconfigurable systems allow for a huge design space. Iterative design space exploration (DSE) is often required to achieve good Pareto points with respect to some combination of performance, area, and/or energy. DSE tools depend on information about hardware characteristics in these respects. These characteristics can be obtained from hardware synthesis and netlist simulation, but this is very time-consuming. Therefore, architecture models are common. This work introduces CGRA-EAM (Coarse-Grained Reconfigurable Architecture - Energy & Area Model), an energy and area estimation framework for coarse-grained reconfigurable architectures. The model is evaluated for the Blocks CGRA. The results demonstrate that the mean absolute percentage error is 15.5% and 2.1% for energy and area, respectively, while the model achieves a speedup of close to three orders of magnitude compared to synthesis.
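Analytical models of this kind typically combine per-primitive cost coefficients with instance counts (for area) and activity counts (for energy). The sketch below illustrates that structure only; the coefficient values, primitive names, and function names are invented for illustration and are not the calibrated CGRA-EAM values for the Blocks CGRA.

```python
# Hypothetical per-primitive coefficients (illustrative numbers only,
# not the calibrated Blocks CGRA values from the paper).
AREA_UM2 = {"alu": 1200.0, "mul": 5400.0, "rf": 2100.0, "lsu": 3300.0}
ENERGY_PJ = {"alu": 1.1, "mul": 4.8, "rf": 0.9, "lsu": 3.5}

def estimate_area(instances):
    """Area model: sum of per-instance primitive areas (um^2)."""
    return sum(AREA_UM2[kind] * count for kind, count in instances.items())

def estimate_energy(activity):
    """Energy model: per-operation energy times activation counts (pJ)."""
    return sum(ENERGY_PJ[kind] * ops for kind, ops in activity.items())

def mape(predicted, reference):
    """Mean absolute percentage error for a single point, in percent."""
    return abs(predicted - reference) / reference * 100.0
```

Evaluating such closed-form sums takes microseconds, which is where the roughly three-orders-of-magnitude speedup over synthesis and netlist simulation comes from; the accuracy then hinges entirely on how the coefficients are calibrated.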
This article presents a new method for Monte Carlo (MC) option pricing using field-programmable gate arrays (FPGAs), which uses a discrete-space random walk over a binomial lattice rather than the continuous-space walks used by existing approaches. The underlying hypothesis is that the discrete-space walk will significantly reduce the area needed for each MC engine, and that the resulting increase in parallelisation and raw performance outweighs any accuracy losses introduced by the discretisation. Experimental results support this hypothesis, showing that for a given MC simulation size there is no significant loss in accuracy when using a discrete-space model for path-dependent exotic financial options. Analysis of the binomial simulation model shows that only limited-precision fixed-point arithmetic is needed, and that pairs of MC kernels are able to share RAM resources. Under realistic constraints on pricing problems, the size of a discrete-space MC engine can be kept to 370 flip-flops and 233 lookup tables, allowing up to 3,000 variance-reduced MC cores in one FPGA. The combination of a highly parallelisable architecture and model-specific optimisations means that the binomial pricing technique allows for a 50x improvement in throughput compared to existing FPGA approaches, without any reduction in accuracy.
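The discrete-space walk can be sketched in software to show why it is so hardware-friendly: on a Cox-Ross-Rubinstein (CRR) binomial lattice, a path is fully described by its number of up-moves, so each step needs only a random bit comparison rather than continuous-space arithmetic. This Python sketch is a reference model under assumed CRR parameters, not the article's FPGA engine; the European call payoff and the function names are choices made here for illustration (the article targets path-dependent exotics).

```python
import math, random

def binomial_mc_call(s0, strike, r, sigma, t, steps, paths, rng):
    """Discrete-space MC: each path is a walk on a CRR binomial lattice,
    so only the count of up-moves matters (fixed-point friendly)."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))      # up factor
    d = 1.0 / u                              # down factor
    p = (math.exp(r * dt) - d) / (u - d)     # risk-neutral up probability
    disc = math.exp(-r * t)
    total = 0.0
    for _ in range(paths):
        ups = sum(1 for _ in range(steps) if rng.random() < p)
        s_t = s0 * (u ** ups) * (d ** (steps - ups))
        total += max(s_t - strike, 0.0)
    return disc * total / paths

def binomial_exact_call(s0, strike, r, sigma, t, steps):
    """Exact expectation over the same lattice, to check the MC estimate."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)
    disc = math.exp(-r * t)
    price = 0.0
    for ups in range(steps + 1):
        prob = math.comb(steps, ups) * p ** ups * (1 - p) ** (steps - ups)
        s_t = s0 * (u ** ups) * (d ** (steps - ups))
        price += prob * max(s_t - strike, 0.0)
    return disc * price
```

In hardware, the per-step work reduces to one random-bit generation and a counter increment, which is consistent with the very small flip-flop and lookup-table footprint reported above.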
Reconfigurable hardware is a promising technology for implementing firewalls, routing mechanisms, and new protocols for evolving high-performance network systems. This work presents a novel deterministic approach for a Range-enhanced Reconfigurable Packet Classification Engine based on the number of rules on FPGAs. The proposed framework uses a RAM-based ternary match to represent the prefix and the range prefix, together with efficient rule reordering for priority selection, to obtain both best-match and multi-match results in the same architecture. The framework exhibits 3.2 Mbits of LUT-RAM-based ternary content addressable memory (TCAM) to hold a maximum of 31.3 K 104-bit rules at 520 MPPS. LUT-RAM, along with BRAM, provides 4 Mbits of TCAM space to implement 38.5 K 104-bit rules while sustaining a throughput of 400 MPPS on a Virtex-7 FPGA. The complete architecture offers scalability, better resource utilization (a minimum of 50%), representation of an inverse prefix with a single entry, range expansion with a single rule, retrieval of both best-match and multi-match results, and determination of the required number of FPGA resources for a particular dataset.
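To see why "range expansion with a single rule" is a notable property, it helps to look at what conventional TCAMs must do instead: a port range such as [1, 14] cannot be stored as one ternary entry and is split into multiple prefixes. The sketch below is the classic range-to-prefix expansion algorithm, shown for contrast; it is not the engine proposed in this work, and the function name and (value, prefix-length) encoding are choices made here for illustration.

```python
def range_to_prefixes(lo, hi, width):
    """Classic range-to-prefix expansion: split the integer range
    [lo, hi] over `width`-bit values into a minimal list of
    (value, prefix_length) ternary entries."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo that fits in [lo, hi].
        size = lo & -lo if lo else 1 << width
        while size > hi - lo + 1:
            size >>= 1
        prefix_len = width - size.bit_length() + 1  # fixed bits in the entry
        prefixes.append((lo, prefix_len))
        lo += size
    return prefixes
```

For a w-bit field this expansion can cost up to 2w - 2 entries per range (e.g. [1, 14] over 4 bits already needs six), so an architecture that represents a range with a single rule avoids a multiplicative blow-up of TCAM capacity.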