As an emerging field of machine learning, deep learning represents a powerful technique to solve complex learning problems. However, the size of the networks becomes increasingly large due to the demands of the practi...
ISBN: 9798331530075 (digital)
ISBN: 9798331530082 (print)
Multi-Agent Reinforcement Learning (MARL) is an emerging technology that has seen success in many AI applications. Multi-Actor-Attention-Critic (MAAC) is a state-of-the-art MARL algorithm that uses a Multi-Head Attention (MHA) mechanism to learn messages communicated among agents during the training process. Current implementations of MAAC using CPU and CPU-GPU platforms lack fine-grained parallelism among agents, sequentially executing each stage of the training loop, and their performance suffers from costly data movement involved in MHA communication learning. In this work, we develop the first high-throughput accelerator for MARL with attention-based communication on a CPU-FPGA heterogeneous system. We alleviate the limitations of existing implementations through a combination of data- and pipeline-parallel modules in our accelerator design and enable fine-grained system scheduling for exploiting concurrency among heterogeneous resources. Our design increases the overall system throughput by $4.6 \times$ and $4.1 \times$ compared to CPU and CPU-GPU implementations, respectively.
ISBN: 9798350344196 (print)
With increasingly complex workloads and the end of Dennard scaling, the need for heterogeneous computing is becoming apparent. SoC FPGAs (System-on-Chip Field-Programmable Gate Arrays) are a promising solution to this need. They combine the versatility of CPUs with the reconfigurability and high performance of FPGAs. SoC FPGAs have received a great deal of attention in recent years from both academia and chipmakers. However, the task of hardware-software codesign posed by these platforms remains challenging. This is partly due to the lack of vendor-neutral abstractions for building and evaluating designs. Chisel is a promising hardware construction language based on the idea of writing hardware generators. In this paper, we present fpga-tidbits, an open-source, vendor-neutral Chisel library for rapid prototyping of accelerators for SoC FPGAs.
With the deployment of FPGAs in a data center, there is the opportunity to build large multi-FPGA applications. In this paper, we design a partitioner to address the problem of efficiently assigning the various tasks of a large multi-FPGA application to individual network-connected FPGAs according to constraints that consider resource usage, communication bandwidth, and communication latency. By using simulated annealing, we can modify the cost function as new objectives and constraints are determined. We build on the Galapagos multi-FPGA platform by introducing a multi-die shell to extend Galapagos to more recent FPGA boards, and we design the partitioner to work on any collection of single- and multi-die FPGAs. Finally, we evaluate the new shell and partitioner using micro-benchmarks and analyze the partitioning of a real-world multi-FPGA application, a Transformer model.
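To make the simulated-annealing idea concrete, here is a minimal sketch, not the partitioner from the paper. It assumes a hypothetical cost function with only two terms (resource overflow per FPGA and bandwidth crossing FPGA boundaries); latency is left out, and the names `cost`, `anneal`, `w_res`, and `w_comm` are illustrative. New objectives would be added as further terms in `cost`.

```python
import math
import random

def cost(assign, tasks, links, capacity, w_res=10.0, w_comm=1.0):
    """Hypothetical cost: over-capacity penalty plus traffic cut between FPGAs."""
    used = [0] * len(capacity)
    for task, fpga in enumerate(assign):
        used[fpga] += tasks[task]
    overflow = sum(max(0, u - c) for u, c in zip(used, capacity))
    cut = sum(bw for a, b, bw in links if assign[a] != assign[b])
    return w_res * overflow + w_comm * cut

def anneal(tasks, links, capacity, steps=5000, t0=5.0, alpha=0.999, seed=0):
    """Move one task at a time; accept worse moves with probability exp(-delta/T)."""
    rng = random.Random(seed)
    n_fpga = len(capacity)
    cur = [rng.randrange(n_fpga) for _ in tasks]
    cur_cost = cost(cur, tasks, links, capacity)
    best, best_cost = cur[:], cur_cost
    temp = t0
    for _ in range(steps):
        task = rng.randrange(len(tasks))
        old = cur[task]
        cur[task] = rng.randrange(n_fpga)
        new_cost = cost(cur, tasks, links, capacity)
        delta = new_cost - cur_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = cur[:], cur_cost
        else:
            cur[task] = old  # reject the move
        temp *= alpha  # geometric cooling schedule
    return best, best_cost

# Four tasks of 4 resource units each, two FPGAs with 8 units each.
tasks = [4, 4, 4, 4]
capacity = [8, 8]
links = [(0, 1, 10), (2, 3, 10), (1, 2, 1)]  # (task_a, task_b, bandwidth)
best, best_cost = anneal(tasks, links, capacity)
```

Keeping the cost function separate from the annealing loop is what lets objectives change without touching the search itself.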
ISBN: 9798331530075 (digital)
ISBN: 9798331530082 (print)
The HLS toolchain effectively reduces the design complexity of FPGA hardware accelerators. However, in scenarios involving the multi-objective optimization of large-scale HLS designs, determining the knob configurations of Pareto design points remains a challenging task for designers. Our work re-evaluates the key factors affecting the efficiency of multi-objective design space exploration in HLS design and proposes an efficient framework named FlexWalker. It uses the upper confidence bound algorithm to organize various heterogeneous regression models for predicting the quality of HLS designs with different knob configurations in the design space, and it introduces a probability sampling algorithm and an elastic Pareto frontier to counteract the negative impact of regression model errors. Experimental results show that our work can stably eliminate over 90% of non-Pareto frontier design points in the tested HLS design space, effectively enhancing the efficiency of multi-objective design space exploration.
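As an illustration of how an upper confidence bound rule can arbitrate among several predictors, the following is a sketch of the textbook UCB1 rule, not FlexWalker's actual procedure. It assumes each regression model is treated as a bandit arm and that a reward in [0, 1] is available after each use; the function names and the fixed `accuracies` are hypothetical.

```python
import math

def ucb1_select(counts, rewards, c=math.sqrt(2)):
    """Return the arm with the highest UCB1 score; untried arms are played first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    total = sum(counts)
    scores = [
        rewards[i] / counts[i] + c * math.sqrt(math.log(total) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def run(accuracies, rounds=200):
    """Simulate model selection with a fixed, deterministic reward per model."""
    counts = [0] * len(accuracies)
    rewards = [0.0] * len(accuracies)
    for _ in range(rounds):
        i = ucb1_select(counts, rewards)
        counts[i] += 1
        rewards[i] += accuracies[i]  # illustrative reward, e.g. prediction quality
    return counts

counts = run([0.9, 0.1, 0.5])
```

The exploration bonus shrinks as a model is used more often, so weaker models are still sampled occasionally but the best-performing one receives most of the queries.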
ISBN: 9783031429200; 9783031429217 (print)
Many papers have proposed executing real-time tasks on FPGA hardware. Most of these works do not demonstrate fully working systems and either rest on unrealistic assumptions about the placement, reconfigurability, and connectivity of hardware tasks to memory and peripherals, or lack an efficient schedulability test that guarantees that real-time constraints are met. In this paper, we present a practical way of executing a set of periodic real-time tasks under static priority assignment on a platform FPGA comprising a processing system and programmable logic. The platform FPGA is operated under the ReconOS64 architecture and operating system layer, which enables practical realization. The hardware tasks follow a 3-phase task model with memory-in, execution, and memory-out phases. All memory phases compete for shared memory, which forms a resource that must be accessed in a mutually exclusive manner. While our task and system models are relatively simple, as they map each hardware task to a separate region in the programmable logic, they lead to an efficient schedulability test covering memory accesses. We present our task and ReconOS64 system models, describe the runtime scheduler, and derive a corresponding schedulability test.
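For readers unfamiliar with schedulability tests, the sketch below shows the classical response-time analysis for periodic tasks under static priorities on a single shared resource. It is the textbook test, not the one derived in the paper: the 3-phase model is not reproduced, deadlines are assumed equal to periods, and the `blocking` term is a hypothetical stand-in for time spent waiting on a lower-priority access.

```python
import math

def response_time(i, tasks, blocking=0):
    """Worst-case response time of task i, or None if it misses its deadline.

    tasks: list of (wcet, period) pairs sorted by priority, index 0 highest.
    The deadline of each task is assumed to equal its period.
    """
    wcet, period = tasks[i]
    r = wcet + blocking
    while r <= period:
        # Interference from every higher-priority task released within r.
        interference = sum(math.ceil(r / t_j) * c_j for c_j, t_j in tasks[:i])
        nxt = wcet + blocking + interference
        if nxt == r:
            return r  # fixed point reached
        r = nxt
    return None

tasks = [(1, 4), (2, 6), (3, 12)]
results = [response_time(i, tasks) for i in range(len(tasks))]
```

A task set passes this test when every task yields a response time rather than `None`.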
LSTM, a recurrent neural network (RNN) well-suited for sequential data tasks, often incurs computational expenses during training and deployment, particularly with large and intricate models. The enhancement of LSTM n...
In addressing the issue of gradually attenuating amplitude of underwater acoustic signals received by hydrophones, this paper presents a feedforward digital automatic gain control (AGC) system based on a table lookup ...
ISBN: 9798331522445 (digital)
ISBN: 9798331522452 (print)
Modern Field-Programmable Gate Arrays (FPGAs) are highly versatile, with reconfigurable logic functionality that allows designers to create custom designs. Unlike traditional fixed-function integrated circuits, FPGAs can be reconfigured multiple times to perform different tasks, making them ideal for various applications. An Embedded FPGA (eFPGA) takes the concept of an FPGA and integrates it as an IP core within a larger System-on-Chip (SoC) or Application-Specific Integrated Circuit (ASIC). The proposed work employs the Non-dominated Sorting Genetic Algorithm-II (NSGA-II) to evolve efficient, high-performing eFPGA architectures with reduced critical path delay and power, defined by a sufficient number of Configurable Logic Blocks (CLBs), custom Digital Signal Processors (DSPs), and Block RAMs (BRAMs) for the target workload under consideration. As a result, this work presents a well-balanced eFPGA architecture layout obtained for four workloads equipped with different numbers of heterogeneous and configurable logic blocks. The results showed a significant improvement of 44.62% in hardware parameters.
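The ranking step at the heart of NSGA-II can be sketched as follows. This is a simple quadratic version of non-dominated sorting rather than the bookkeeping-based fast variant, and it is not the authors' implementation; the objective pairs are hypothetical (delay, power) values to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Split candidate designs into successive Pareto fronts (lists of indices)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Hypothetical (critical path delay, power) pairs for five candidate architectures.
designs = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
fronts = non_dominated_sort(designs)
```

Candidates in earlier fronts are preferred during selection, which is how the algorithm trades off delay against power without collapsing them into one score.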
ISBN: 9798350352047; 9798350352030 (print)
This paper introduces the problem of learning to place logic blocks in Field-Programmable Gate Arrays (FPGAs) and a preliminary learning-based method. In contrast to previous FPGA placement algorithms, we depart from heuristic search and instead employ Deep Reinforcement Learning (DRL) for the placement task with the objective of minimizing wirelength. To facilitate the agent's decision-making, we design unique state representations that include the chipboard observations and interconnections between different blocks. Additionally, we propose the decomposition training paradigm to address the large search space and sparse rewards of the placement problem by dividing the full problem into small subtasks and solving each subtask with DRL. Experiments demonstrate the effectiveness of the decomposition paradigm on FPGA placement tasks.
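A wirelength objective of this kind is commonly estimated with half-perimeter wirelength (HPWL). The abstract does not say which estimate is used, so the sketch below is an assumption, and the block names, coordinates, and nets are hypothetical.

```python
def hpwl(placement, nets):
    """Sum over nets of the half-perimeter of the bounding box of their blocks."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Hypothetical placement: block name -> (x, y) grid position.
placement = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
nets = [["a", "b"], ["a", "b", "c"], ["c"]]
wirelength = hpwl(placement, nets)
```

In a reinforcement learning setup, the negative of such an estimate could serve as the reward once all blocks are placed, which is also why the reward signal is sparse.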