ISBN:
(Print) 9798400704185
FPGAs are not just computational workhorses -- they are virtual laboratories for innovation -- a glimpse into our spatial and temporal computing future. No other technology allows you to break computations down into their underlying logical atoms so fully and then re-link them back together so fruitfully. This is an incredibly powerful capability, not just for achieving performance, but for analyzing, improving, securing, and understanding the nature of computations in a deep way. It can surface vulnerabilities hidden deep in a design, help us uncover computational truths about the brain, and unlock the latent potential in emerging technologies. In this talk we will dig into the logical representation of computations, the practical power of working at this lowest level, some of the most interesting and pressing problems we might wish to solve in this way, and exciting possible futures for the FPGA in helping to realize those solutions.
ISBN:
(Print) 9798400713965
Charged particle trajectory reconstruction is a critical task in high-energy physics (HEP), particularly for collision analysis in the Large Hadron Collider (LHC). In the LHC, the Level-1 Trigger (L1T) system must perform trajectory reconstruction with ultralow latency and very high throughput. Graph Neural Network (GNN)-based trajectory reconstruction on FPGAs has shown promising performance. However, the existing FPGA-based implementations are incomplete, supporting only the GNN processing stage and lacking the graph construction and track building parts of the task. In this paper, we present HiGTR, a high-performance FPGA implementation for complete GNN-based trajectory reconstruction, integrating graph construction, GNN processing, and track building. The existing GNN FPGA design is enhanced, achieving a 52.3% reduction in latency. The graph construction and track building algorithms are refined to significantly reduce computational complexity and enhance their suitability for FPGA implementation. The integrated HiGTR utilizes data streaming and multi-granularity pipelining with the flexibility to support varying data volumes. Implemented on an AMD Xilinx VU9P FPGA, HiGTR achieves a speedup of 65,204× compared to the previous software-based algorithm. In addition, HiGTR meets the stringent performance requirements of the L1T system in the HL-LHC upgrade project, including a throughput of 2.22 MHz and a latency of 4 microseconds. These results highlight its strong potential for practical deployment in the HL-LHC upgrade project.
ISBN:
(Print) 9798400704185
Advancements in design automation technologies, such as high-level synthesis (HLS), have raised the input abstraction level and made the design entry process for FPGAs more friendly to software programmers. In contrast, the backend compilation process for implementing designs on FPGAs is considerably more lengthy compared to software compilation. While software code compilation may take just a few seconds, FPGA compilation times can often span from several minutes to hours due to the complexity of the underlying toolchain and the ever-growing device capacity. In this paper, we present DynaRapid, a very fast compilation tool that generates fully legal, placed-and-routed designs for commercial FPGAs in a matter of seconds. We leverage the inherently modular nature of dataflow circuits created by the HLS tool Dynamatic and combine it with the implementation manipulation capabilities provided by RapidWright. Our approach accelerates the C-to-FPGA implementation process by up to 33× with only a 20% degradation in operating frequency compared to a conventional commercial off-the-shelf implementation flow.
ISBN:
(Print) 9798400713965
The growing interest in autonomous driving technologies requires the creation of efficient real-time systems to understand road scenes. Semantic segmentation, an essential task in computer vision, is crucial in this scenario and has become a viable solution for real-time applications largely due to deep learning models. In terms of hardware for systems with real-time constraints, embedded GPUs are a straightforward solution and provide an easy deployment. At the same time, FPGAs have proven to be more efficient for embedded machine vision tasks, especially in terms of power consumption. However, implementing large and complex models on resource-constrained devices such as FPGAs while reaching a high frame rate and a low latency and preserving the semantic segmentation performance is a major challenge, due to the required trade-off between resources and accuracy. Consequently, the choice of model is not straightforward and requires joint consideration of implementation complexity and accuracy on the task. This work aims to demonstrate that, with appropriate training and FPGA-oriented redesign, a low-complexity neural network can match the performance of more complex state-of-the-art models. We present our methodology to implement a quantized efficient real-time semantic segmentation model inspired by ENet on an AMD ZU19EG FPGA as a pipelined dataflow architecture capable of reaching 226 FPS at 4.2 ms latency with 70.33% mIoU on the Cityscapes dataset. We improve the segmentation performance of ENet by retraining with better hyperparameters and extensive data augmentation, achieving a 12.2% increase in mIoU on Cityscapes, and perform INT4 quantization with minimal accuracy degradation. To achieve a better trade-off between accuracy and complexity, we explore the neural network design space by evaluating the influence of selected variations of the model architecture and selecting the most efficient with regard to FPGA implementation. We implement the model as a pipelined dataflow architecture.
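As a concrete illustration of the INT4 step described in this abstract, here is a minimal symmetric per-tensor quantization sketch in Python. The scale choice (mapping the largest weight magnitude to 7) and the rounding are a common baseline, not necessarily the paper's exact scheme, and the example weights are invented.

```python
# Symmetric per-tensor INT4 quantization sketch: floats are mapped to
# integers in [-8, 7] with one shared scale. Illustrative only -- the
# paper's actual quantization scheme may use per-channel scales or
# quantization-aware training.

def quantize_int4(weights):
    """Map floats to integers in [-8, 7] with a shared scale; return (q, scale)."""
    scale = max(abs(w) for w in weights) / 7.0   # largest magnitude maps to 7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT4 codes."""
    return [v * scale for v in q]

w = [0.42, -0.9, 0.07, 0.63]     # toy weight tensor (assumed)
q, s = quantize_int4(w)
```

Dequantizing `q` with the stored scale recovers each weight to within half a quantization step, which is the "minimal accuracy degradation" trade the abstract refers to.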
ISBN:
(Print) 9781450394178
Field-programmable gate arrays (FPGAs) in space applications come with the drawback of radiation effects, which inevitably will occur in devices of small process size. This also applies to the electronics of the Bose Einstein Condensate and Cold Atom Laboratory (BECCAL) apparatus, which is planned to operate on the International Space Station for several years. A total of more than 100 FPGAs distributed in the setup will be used for high-precision control of specialized sensors and actuators at nanosecond scale. Due to the large number of devices in BECCAL, commercial off-the-shelf (COTS) FPGAs are used which are not radiation hardened. In this work, we detect and mitigate radiation effects in an application-specific COTS-FPGA-based communication network. To that end, redundancy is integrated into the design while the firmware is optimized to stay within the FPGA's resource constraints. A redundant integrity checker module is developed which can notify preceding network devices about data and configuration bit errors. The firmware is evaluated by injecting faults into data and configuration registers in simulation and real hardware. The FPGA resource usage of the firmware is cut down by more than half, enabling the use of double modular redundancy for the switching fabric. Together with the triple modular redundancy protected integrity checker, this combination fully prevents silent data corruption in the design, as shown in simulations and by injecting faults in hardware using the Intel Fault Injection FPGA IP Core, while staying within the resource limits of a COTS FPGA.
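The majority voting underlying the triple modular redundancy (TMR) protection mentioned above can be sketched in a few lines. This is an illustrative software model of a bitwise voter with a disagreement flag, not the BECCAL firmware itself; the real design votes in hardware and routes the flag to the integrity checker.

```python
# Bitwise majority voter as used in triple modular redundancy (TMR):
# three copies of a register are voted bit-by-bit, and a disagreement
# flag reports that one copy was upset. Illustrative model only.

def tmr_vote(a: int, b: int, c: int):
    """Return (voted value, error_detected) for three redundant copies."""
    voted = (a & b) | (a & c) | (b & c)   # per-bit majority
    error = not (a == b == c)             # any mismatch flags an upset
    return voted, error

# A single-event upset flips one bit in copy b; the vote masks it and
# the error flag still reports the disagreement.
value = 0b1011
v, err = tmr_vote(value, value ^ 0b0100, value)
```

Double modular redundancy, used for the switching fabric in the abstract, can only detect (not mask) such an upset, which is why the error notification path matters.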
ISBN:
(Print) 9781450370998
Since the inception of FPGAs over two decades ago, the micro-architectures and macro-architectures of FPGAs across all FPGA vendors have been converging strongly, to the point that comparable FPGAs from the main FPGA vendors had virtually the same use models and the same programming models. User designs were getting easier to port from one vendor to the other with every generation. Recent developments from different FPGA vendors targeting the most advanced semiconductor technology nodes are an abrupt and disruptive break from this trend, especially at the macro-architectural level.
ISBN:
(Print) 9798400713965
FPGAs are a compelling substrate for supporting machine learning inference. Tools such as High-Level Synthesis and hls4ml can shorten the development cycle for deploying ML algorithms on FPGAs, but can struggle to handle the large on-chip storage needed for many of these models. In particular, the high BRAM usage found in many of these flows can cause place-and-route failures during synthesis. In this paper we propose using a simulated-annealing-based flow to perform BRAM-aware quantization. This approach trades off inference accuracy with BRAM usage to provide a high-quality inference engine that still meets on-chip resource constraints. We demonstrate this flow for Transformer-based machine learning algorithms, which include Flash Attention in a stream-based dataflow architecture. Our system imposes minimal accuracy drops, yet can reduce BRAM usage by 20%-50% and improve power efficiency by 264%-812% compared to existing Transformer-based accelerators on FPGAs.
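The BRAM-aware search can be pictured as a standard simulated-annealing loop over per-layer bitwidths. The cost model, layer sizes, and BRAM budget below are invented for illustration; the paper's actual objective uses measured inference accuracy rather than the stand-in penalty used here.

```python
# Toy simulated-annealing search over per-layer bitwidths, sketching the
# BRAM-aware quantization idea: the cost trades a stand-in accuracy
# penalty against an on-chip storage budget. All constants are assumed.
import math, random

random.seed(0)
layer_params = [4096, 16384, 8192]    # weights per layer (illustrative)
BRAM_BITS = 18 * 1024                 # capacity of one 18Kb block RAM

def brams_used(bits):
    """Blocks needed if each layer's weights are packed at its bitwidth."""
    return sum(math.ceil(p * b / BRAM_BITS) for p, b in zip(layer_params, bits))

def cost(bits, bram_budget=10):
    acc_penalty = sum((8 - b) ** 2 for b in bits)   # stand-in accuracy loss
    over = max(0, brams_used(bits) - bram_budget)   # resource violation
    return acc_penalty + 100 * over

state, best, T = [8, 8, 8], [8, 8, 8], 10.0
for step in range(2000):
    cand = state[:]
    i = random.randrange(len(cand))
    cand[i] = min(8, max(2, cand[i] + random.choice((-1, 1))))
    d = cost(cand) - cost(state)
    if d < 0 or random.random() < math.exp(-d / T):  # Metropolis accept
        state = cand
        if cost(state) < cost(best):
            best = state[:]
    T *= 0.999                                       # geometric cooling
```

The annealer accepts occasional uphill moves early on (high `T`) so it can escape bitwidth assignments that are locally accurate but blow the BRAM budget, which is exactly the accuracy-versus-resources trade the abstract describes.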
ISBN:
(Print) 9798400704185
Path planning is a critical task in autonomous driving systems, with quadratic programming being the most time-consuming component. Solving quadratic programming problems using a CPU not only takes a long time but can also lead to high power consumption and costs. In this work, we propose an FPGA-based acceleration method for quadratic-programming-based path planning problems. Our approach leverages an operator splitting solver for quadratic programs (OSQP) and employs the preconditioned conjugate gradient (PCG) method for solving linear equations, which proves to be more scalable and hardware-friendly than the original direct method. We propose optimizations for better memory management, and boost throughput and reduce execution time through task-level and operator-level parallelism with hardware pipelining. Our FPGA-based implementation achieves up to a 1.8× speedup and 3.2× power reduction compared with an Intel i5 CPU, and a 3.1× speedup compared with an ARM Cortex-A57.
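The PCG step at the heart of this linear-equation solve can be sketched as follows. This minimal Python version uses a Jacobi (diagonal) preconditioner on a toy symmetric positive-definite system; the preconditioner choice and the tiny matrix are assumptions for illustration, while the paper's implementation pipelines these same vector operations in hardware.

```python
# Minimal preconditioned conjugate gradient (PCG) for A x = b, with a
# Jacobi (diagonal) preconditioner. Each iteration is one sparse
# matrix-vector product plus a few dot products and vector updates --
# the regular, streamable structure that makes PCG hardware-friendly.

def pcg(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b for symmetric positive-definite A (list of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                   # residual r = b - A x, with x = 0
    M_inv = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner
    z = [M_inv[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [M_inv[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

x = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Unlike a direct factorization, the iteration touches `A` only through matrix-vector products, which is what makes it scale to the larger problem sizes the abstract mentions.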
ISBN:
(Print) 9798400704185
Graph Convolutional Networks (GCNs) are state-of-the-art deep learning models for representation learning on graphs. However, the efficient training of GCNs is hampered by constraints in memory capacity and bandwidth, compounded by the irregular data flow that results in communication bottlenecks. To address these challenges, we propose a message-passing architecture that leverages NUMA-based memory access properties and employs a parallel multicast routing algorithm based on a 4-D hypercube network within the accelerator for efficient message passing in graphs. Additionally, we have re-engineered the backpropagation algorithm specific to GCNs within our proposed accelerator. This redesign strategically mitigates the memory demands prevalent during the training phase and diminishes the computational overhead associated with the transposition of extensive matrices. Compared to the state-of-the-art HP-GNN architecture, we achieve a performance improvement of 1.03× to 1.81×.
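For intuition about the 4-D hypercube network mentioned above: routing in a hypercube reduces to correcting the bits in which the source and destination node labels differ, one dimension per hop. The sketch below shows plain dimension-order (e-cube) unicast routing on 4-bit node IDs; the paper's parallel multicast algorithm builds on this structure but is not reproduced here.

```python
# Dimension-order (e-cube) routing in a 4-D hypercube. Each of the 16
# nodes has a 4-bit label; neighbors differ in exactly one bit, and a
# route flips the differing bits low dimension first. Illustrative only.

def hypercube_route(src: int, dst: int, dim: int = 4):
    """Return the node sequence from src to dst, correcting bits low-to-high."""
    path = [src]
    cur = src
    diff = src ^ dst                 # bits that still differ
    for d in range(dim):
        if diff & (1 << d):
            cur ^= (1 << d)          # one hop along dimension d
            path.append(cur)
    return path

route = hypercube_route(0b0000, 0b1011)
# hop count equals the Hamming distance between src and dst
```

Because any node is at most `dim` hops from any other, the network keeps the worst-case message latency logarithmic in the node count, which is what makes it attractive for irregular graph traffic.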
ISBN:
(Print) 9798400704185
While FPGAs have been investigated for accelerating computing workloads in academia for many decades, industry started adopting FPGAs as an accelerator only in the last decade, and even those deployments have been fairly limited. This talk describes my journey over 15 years of building and deploying FPGA-accelerated solutions. In my career, I have led the integration of FPGAs with CPUs in a variety of ways, each seeking the right mix of efficiency, capability, and cost. The first phase of the journey begins at Intel in 2008, building the first solutions with socketed FPGAs attached to Xeon CPUs using FSB and QPI coherent buses. These coherently attached FPGAs enabled various new use cases beyond the PCIe-attached FPGA accelerators, leading to projections of wide-scale deployment of FPGAs in the data center. Intel's acquisition of Altera in 2015 kicked off the second phase, where we integrated an FPGA die with a Xeon die to produce a multi-chip package that dropped into a Xeon socket. This was introduced in the market targeting communication workloads. We formed Megh Computing in 2017 to enable an FPGA-accelerated edge platform for analytics workloads in the enterprise, which resulted in solutions for video surveillance. The talk will discuss the various challenges and breakthroughs in deploying FPGA accelerators from the edge to the data center, where I see the market headed in the near future, and the promising directions and tough problems the FPGA research community needs to tackle to realize the potential of reconfigurable computing.