ISBN (print): 9798331530075
The proceedings contain 45 papers. The topics discussed include: CFEACT: a CGRA-based framework enabling agile CNN and transformer accelerator design; FlexWalker: an efficient multi-objective design space exploration framework for HLS design; fast switching activity estimation for HLS-produced dataflow circuits; KIT: kernel isotropic transformation of bilateral filters for image denoising on FPGA; LORA: a latency-oriented recurrent architecture for GPT model on multi-FPGA platform with communication optimization; StencilStream: a SYCL-based stencil simulation framework targeting FPGAs; SoGraph: a state-aware architecture for out-of-memory graph processing on HBM-equipped FPGAs; and a high-performance routing engine for large-scale FPGAs.
The biennial "Computer Sciences and Information Technologies" conference, held in Yerevan in September 2023, was the 14th in this series. The event is of regional interest and provides a networking ecosystem for scientists in areas such as mathematical logic and logical reasoning, discrete mathematics, pattern recognition and cognitive sciences, and their applications. The 17 selected papers from this conference, included in this special issue of the PRIA journal, provide up-to-date research results and research topics that make a substantial contribution to the field known as "logical-combinatorial pattern recognition". Results from algebra and mathematical logic help in the structural and semantic areas of pattern recognition; graph-theoretical investigations help with clustering and image analysis; the other selected papers and results are devoted directly to semantic pattern recognition, to the cognitive sciences in general, or they provide application platforms where different artificial intelligence technologies are converging.
ISBN (print): 9798331530082; 9798331530075
Understanding how FPGAs age and how to control that aging is crucial for ensuring the reliability and security of FPGAs in critical applications. Due to the proprietary nature of commercial FPGAs, it can be challenging to validate aging models on real silicon, and most previous work has relied on circuit simulations to study the effects of FPGA aging. In this work, we leverage low-level placement and routing APIs provided by RapidWright to create a series of stressor and characterization circuits that allow us to measure the effects of aging on individual LUTs and routing resources in a 28nm FPGA. We demonstrate how these techniques allow fine-grained control of the relative aging of different FPGA resources, even to the point of aging individual paths within a single LUT. Several different aging experiments are demonstrated, and in a cumulative test, we show how different signal and LUT configurations can influence the aging rate by over 2x.
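The fine-grained aging measurement described above ultimately reduces to comparing path delays before and after stress. As a minimal software illustration (not code from the paper; the resource name and delay values below are hypothetical), the per-resource degradation can be computed as:

```python
def aging_degradation(fresh_ns, aged_ns):
    """Percent slowdown per resource, from delay measurements taken by the
    characterization circuits before and after the stress phase.
    Both arguments map a resource identifier to a delay in nanoseconds."""
    return {res: 100.0 * (aged_ns[res] - fresh_ns[res]) / fresh_ns[res]
            for res in fresh_ns}
```

For example, a LUT path measured at 1.00 ns fresh and 1.07 ns after stress shows 7% degradation; comparing such figures across differently-stressed resources gives the relative aging rates the paper reports.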
ISBN (print): 9798331530082; 9798331530075
Neural Radiance Fields (NeRF) have great potential applications in 3D spatial modeling, scene reconstruction, and related fields such as AR/VR. However, the compute-intensive and memory-intensive characteristics of NeRF present significant obstacles for real-time applications. Therefore, this work presents CFSA, a synergistic CPU-FPGA hardware accelerator that enhances the deployment of NeRF in resource-constrained environments. A multilayer perceptron (MLP) processor has been designed on an FPGA with 89% utilization for the multi-resolution hash encoding NeRF algorithm. Additionally, a grid occupancy and early termination computation unit has been created, which reduces computational effort by over 90%. An overall pipeline scheduling scheme for the CPU and FPGA has also been designed. The experimental results show that the accelerator can render synthetic data at 7.1 FPS on the Xilinx VC709 board at 100 MHz. Compared to ICARUS, a specialized architecture for NeRF-based rendering, there is a 338.1x speed improvement and an 8.4x increase in energy efficiency. Compared to the GTX 1080 Ti GPU, there is a 1.3x speed improvement and a 20.2x increase in energy efficiency.
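To make the role of the grid occupancy and early termination unit concrete, here is a minimal software sketch of the two pruning mechanisms along one ray; the grid resolution, transmittance threshold, and MLP interface are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def march_ray(sample_pts, occupancy_grid, query_mlp, trans_eps=1e-3, dt=0.01):
    """Accumulate color along a ray, skipping samples in unoccupied voxels
    (occupancy-grid pruning) and stopping once transmittance drops below
    trans_eps (early termination). Both cuts avoid expensive MLP queries."""
    res = occupancy_grid.shape[0]
    transmittance, color, evals = 1.0, np.zeros(3), 0
    for p in sample_pts:                  # p is a 3-vector in [0, 1)^3
        voxel = tuple((p * res).astype(int).clip(0, res - 1))
        if not occupancy_grid[voxel]:
            continue                      # empty space: skip the MLP entirely
        sigma, rgb = query_mlp(p)         # the expensive MLP evaluation
        evals += 1
        alpha = 1.0 - np.exp(-sigma * dt)
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
        if transmittance < trans_eps:     # ray already opaque: stop early
            break
    return color, evals
```

With a mostly-empty grid and a dense surface, only a handful of MLP evaluations survive out of the full sample count, which is the source of the >90% computation reduction the abstract reports.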
ISBN (print): 9798331530082; 9798331530075
Advancements in design automation technologies, such as high-level synthesis (HLS), have raised the input abstraction level and made the design entry process for FPGAs more friendly to software programmers. In contrast, the backend compilation process for implementing designs on FPGAs is considerably more lengthy compared to software compilation: while software code compilation may take just a few seconds, FPGA compilation times can often span from several minutes to hours due to the complexity of the underlying toolchain and ever-growing device capacities. In this paper, we present DynaRapid, a fast compilation tool that generates, in a matter of seconds, fully legal placed-and-routed designs for commercial FPGAs. Elastic circuits created by the HLS tool Dynamatic are made exclusively of a limited number of reusable components; we exploit this fact to create a library of placed and routed building blocks, and then stitch together instances of them as needed through RapidWright. Our approach accelerates the C-to-FPGA implementation process by a geomean of 20x with only 10% degradation in operating frequency compared to a conventional commercial off-the-shelf implementation flow.
ISBN (print): 9798331530082; 9798331530075
Convolutional Neural Networks (CNNs) combine large amounts of parallelizable computation with frequent memory access. Field-Programmable Gate Arrays (FPGAs) can achieve low latency and high throughput CNN inference by implementing dataflow accelerators that pipeline layer-specific hardware to implement an entire network. By implementing a different processing element for each CNN layer, these layer-pipelined accelerators can achieve high compute density, but having all layers processing in parallel requires high memory bandwidth. Traditionally, this has been satisfied by storing all weights on chip, but this is infeasible for the largest CNNs, which are often those most in need of acceleration. In this work, we augment a state-of-the-art dataflow accelerator (HPIPE) to leverage both High-Bandwidth Memory (HBM) and on-chip storage, enabling high-performance layer-pipelined dataflow acceleration of large CNNs. Based on profiling results of HBM's latency and throughput against expected address patterns, we develop an algorithm to choose which weight buffers should be moved off chip and how deep the on-chip FIFOs to HBM should be to minimize compute unit stalling. We integrate the new hardware generation within the HPIPE domain-specific CNN compiler and demonstrate good bandwidth efficiency against theoretical limits. Compared to the best prior work, we obtain speed-ups of at least 19.4x, 5.1x, and 10.5x on ResNet-18, ResNet-50, and VGG-16 respectively.
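The buffer-placement decision described above can be illustrated with a simple greedy heuristic: spill the least bandwidth-critical weight buffers to HBM until the remainder fits on chip. This is only a sketch of the idea under assumed inputs, not HPIPE's actual algorithm, which additionally sizes the on-chip FIFOs from the HBM profiling data:

```python
def pick_offchip_buffers(layers, onchip_budget_bytes):
    """Greedy sketch: keep bandwidth-hungry weights on chip and spill the
    layers with the lowest bandwidth-demand-per-byte ratio to HBM until the
    remaining weights fit in the on-chip budget.
    `layers` is a list of (name, weight_bytes, bytes_per_cycle_demand)."""
    total = sum(w for _, w, _ in layers)
    offchip = []
    # Spill the least bandwidth-critical bytes first.
    for name, w, bw in sorted(layers, key=lambda t: t[2] / t[1]):
        if total <= onchip_budget_bytes:
            break
        offchip.append(name)
        total -= w
    return offchip
```

For instance, with three layers of 100, 50, and 200 bytes and a 150-byte budget, the two layers with the lowest per-byte bandwidth demand are moved to HBM and the bandwidth-hungry one stays on chip.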
This paper mainly focuses on getting acquainted with various applications of Programmable Logic Controllers (PLCs) used in industries. This paper compares and contrasts several research publications and projects on appl...
ISBN (print): 9798331530082; 9798331530075
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks. Among the most computationally intensive operations in a neural network (NN) is the dot product between the feature and weight vectors. Thus, some previous FPGA acceleration works have proposed mapping neurons with quantized inputs and outputs directly to lookup tables (LUTs) for hardware implementation. In these works, the boundaries of the neurons coincide with the boundaries of the LUTs. We propose relaxing these boundaries and mapping entire sub-networks to a single LUT. As the sub-networks are absorbed within the LUT, the NN topology and precision within a partition do not affect the size of the lookup tables generated. Therefore, we utilize fully connected layers with floating-point precision inside each partition, which benefit from being universal function approximators, but with rigid sparsity and quantization enforced between partitions, where the NN topology becomes exposed to the circuit topology. Although cheap to implement, this approach can lead to very deep NNs, and so to tackle challenges like vanishing gradients, we also introduce skip connections inside the partitions. The resulting methodology can be seen as training DNNs with a specific FPGA hardware-inspired sparsity pattern that allows them to be mapped to much shallower circuit-level networks, thereby significantly improving latency. We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task of digit classification using MNIST. Our approach allows for greater function expressivity within the LUTs compared to existing work, leading to up to 4.3x lower latency NNs for the same accuracy.
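The core idea of absorbing a sub-network into a LUT can be sketched by exhaustive enumeration: whatever its internal width, depth, or floating-point precision, a K-input sub-network with a binarized output reduces to a 2^K-entry truth table. A minimal illustration of that enumeration, with a hypothetical sub-network interface and a 0.5 binarization threshold assumed for the example:

```python
import itertools
import numpy as np

def subnetwork_to_lut(forward, n_inputs):
    """Enumerate every binary input pattern of a trained sub-network and
    record its binarized output. The resulting truth table is what a K-input
    LUT stores, so the sub-network's internal topology and precision never
    affect the LUT's size, only its contents."""
    table = {}
    for bits in itertools.product((0.0, 1.0), repeat=n_inputs):
        y = forward(np.array(bits))        # float-precision sub-network
        table[bits] = int(y >= 0.5)        # binarize at the LUT boundary
    return table
```

A two-input sub-network trained to compute XOR, for example, yields the four-entry XOR truth table regardless of how many hidden neurons it contains.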
ISBN (print): 9781728199023
The proceedings contain 64 papers. The topics discussed include: a deep-learning framework for predicting congestion during FPGA placement; lightweight side-channel protection using dynamic clock randomization; executing ARMv8 loop traces on reconfigurable accelerator via binary translation framework; precise pointer analysis in high-level synthesis; LFTSM: lightweight and fully testable SEU mitigation system for Xilinx processor-based SoCs; a high throughput MobileNetV2 FPGA implementation based on a flexible architecture for depthwise separable convolution; hardware acceleration of Monte-Carlo sampling for energy efficient robust robot manipulation; and automated design of FPGAs facilitated by cycle-free routing.
ISBN (print): 9798331530082; 9798331530075
Nowadays, heterogeneous architectures are widely used to overcome the ongoing demand for increased computing performance at the edge, such as the pre-processing of raw antenna data in radio telescope systems. The Versal Adaptive SoC is a novel heterogeneous architecture that includes programmable logic (PL), a Processing System, and Artificial Intelligence Engines (AIEs) interconnected by a programmable Network-on-Chip. In this work, we explore the AIEs to evaluate their capabilities for real-time signal processing in radio telescope systems. We focus on the implementation of a Polyphase Filter Bank (PFB), which is a representative signal processing operation that consists of Finite Impulse Response (FIR) filters and the Fast Fourier Transform (FFT) algorithm. We analyzed the performance of the AIEs with regard to the requirements of LOFAR, the world's largest low-frequency radio telescope. By means of the roofline model, we reveal that the AIEs theoretically meet the LOFAR computational requirements, but the vendor-provided FIR library implementation did not reach the required performance. Therefore, we explore several optimization strategies for the FIR implementation on the AIEs and analyze the communication options between the PL and the AIE array. Finally, we have developed an efficient PFB implementation that requires only 12 AIEs. A prototype on a VC1902 device achieves a throughput of 437 MSPS, which is more than sufficient for a single antenna polarization of the LOFAR system. This work is open source and publicly available at: https://***/rd/acap
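The roofline analysis mentioned above bounds attainable throughput by the lesser of peak compute and memory bandwidth times arithmetic intensity. A generic sketch of that bound; the peak figures used in the example are placeholders, not the actual AIE or LOFAR numbers:

```python
def roofline_attainable(intensity_flops_per_byte, peak_gflops, peak_mem_gbps):
    """Attainable GFLOP/s under the roofline model: memory-bound below the
    ridge point (peak_gflops / peak_mem_gbps flops/byte), compute-bound
    above it."""
    return min(peak_gflops, intensity_flops_per_byte * peak_mem_gbps)
```

Kernels such as FIR filters sit at a fixed arithmetic intensity, so plotting them against this bound shows immediately whether a shortfall (like the vendor FIR library's) stems from the hardware ceiling or from the implementation.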