ISBN: (print) 9781450391597
High-Level Synthesis (HLS) tools simplify and speed up design development for FPGAs (Field-Programmable Gate Arrays) without requiring familiarity with Hardware Description Languages (HDLs) or the Register Transfer Level (RTL) design flow. However, it is not straightforward to trace and link source code to the synthesized hardware design. The traditional RTL-based development flow, by contrast, provides a fine-grained performance profile through waveforms. With the same level of visibility in HLS designs, designers could identify performance bottlenecks and reach the target performance by iteratively fine-tuning the source code. Although HLS development tools provide low-level waveforms, interpreting them in terms of source code variables is a challenging and tedious task. To address this gap, we propose to demonstrate an automated profiler tool, HLS_Profiler, that provides a cycle-accurate performance profile of the source code.
ISBN: (digital) 9798331530075
ISBN: (print) 9798331530082
Large Language Models (LLMs) have been widely deployed in data centers to provide various services, among which the most representative is the Generative Pre-trained Transformer (GPT). The GPT model has heavy memory and computing overhead, and its inference process has two stages with distinct computing characteristics: Prefill and Decode. Constructing a platform for deploying GPT in data centers from existing GPUs and FPGA accelerators is challenging because it requires more effective synchronization schemes or structures with higher computational intensity. This paper proposes LORA, a low-latency end-to-end GPT acceleration platform utilizing multiple FPGAs. First, we optimize the synchronization timing of the GPT model to reduce the computation and communication overhead. Second, we devise efficient synchronization steps for specific layers of the GPT model that overlap part of the computation and communication delay to reduce the latency of our platform. Finally, we deploy recurrent structures on each FPGA to accelerate the different stages of the GPT model. Implemented on Xilinx Alveo U280 FPGAs, LORA achieves an average $11.1 \times$ speedup over NVIDIA V100 GPUs on the modern GPT-2 model. Compared to the existing multi-FPGA accelerator appliance, LORA shows performance improvements of up to $4 \times$ and $2.7 \times$ in the Prefill and Decode stages, respectively.
To address the spectrum aliasing caused by the non-uniform and non-stationary characteristics of rotating machinery vibration signals in the field of dynamic signal analysis, this paper proposes a time-domain stretchi...
The article discusses the use of Galois field (GF) multipliers for cryptographic data protection based on elliptic curves. It recommends using extended Galois fields with characteristic d > 2 for digital signat...
In biomedical photonics, a programmable, high-intensity precision current source that can generate pulses of variable frequency is necessary to drive devices such as LEDs or LED lasers. Such a source was b...
ISBN: (print) 9781665409674
Programmable Logic Controllers (PLCs) are the most widely used digital systems in the manufacturing industry, but there is little support for testing such systems. Despite the recommendations of the IEC 61131-3 standard, testing is mainly done manually or not at all. Recent successful attempts at a testing framework for PLCs include proposals close to object orientation. This work presents a test generation approach using such a testing system. Via our Advanced POU Testing (APTest) Framework, written in a native IEC 61131-3-compliant language, we demonstrate the automatic generation and execution of unit tests for existing software units. We introduce the software, discuss its features, and demonstrate its use.
Traditionally, partial reconfiguration of FPGAs involves replacing defined regions of the design, entirely replacing the logic and losing the state within that region. However, because configuration frame reloads are typically glitch-free, wires and logic can safely be added and removed at runtime without losing state, potentially even without stopping the clock! This could even be extended into an “edit and continue” mode where register positions and unchanged logic are preserved and only changed logic cones are replaced, enabling small design changes to be made to a live system with only a brief pause and no loss of state.
ISBN: (digital) 9798331522445
ISBN: (print) 9798331522452
Decimal operands expressed in binary-coded decimal (BCD) are often a convenient data format in embedded systems and human-centric applications. For optimized arithmetic on hardware that processes binary arithmetic conveniently, input BCD operands should first be converted to binary. This paper explores the versatility of FPGA-specific logic primitives, namely the dual-output Look-Up Tables (LUTs) and the carry chain, for the efficient realization of parallel and pipelined BCD-to-binary conversion architectures that optimize throughput and resource utilization. Primitive instantiation, which involves direct configuration of FPGA logic primitives, was adopted to ensure design optimization. The utilization ratio of the configured FPGA primitives was increased to extract the best performance. We have investigated multiple implementation variants for the addition tree, which serves as a crucial functional logic block in the converter design. Experimental results demonstrate a good trade-off between speed and area for our proposed architectures when compared with existing designs.
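As a rough software illustration of the conversion this abstract describes, the sketch below weights each BCD digit by its power of ten and sums the partial terms pairwise, in the manner of an addition tree. It assumes packed BCD (4 bits per digit); the function name `bcd_to_binary` and its parameters are hypothetical, and it does not model the LUT and carry-chain mapping used in the paper.

```python
def bcd_to_binary(bcd: int, digits: int) -> int:
    """Convert a packed BCD value (4 bits per decimal digit) to an integer.

    Illustrative model only: names and structure are assumptions,
    not the architecture from the paper.
    """
    # Extract each 4-bit digit and weight it by its power of ten.
    terms = []
    for i in range(digits):
        nibble = (bcd >> (4 * i)) & 0xF
        if nibble > 9:
            raise ValueError(f"invalid BCD digit {nibble} at position {i}")
        terms.append(nibble * 10 ** i)

    # Reduce the weighted terms pairwise, level by level, like an adder tree.
    while len(terms) > 1:
        reduced = [terms[j] + terms[j + 1] for j in range(0, len(terms) - 1, 2)]
        if len(terms) % 2 == 1:
            reduced.append(terms[-1])  # odd term passes through to next level
        terms = reduced

    return terms[0] if terms else 0
```

For example, `bcd_to_binary(0x1234, 4)` treats the four nibbles as the decimal digits 1, 2, 3 and 4. In hardware, the multiplications by constant powers of ten would typically be folded into shifts and adds, which is where the addition tree dominates the cost.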
ISBN: (print) 9798331540661; 9798331540678
Focusing mainly on the Zynq FPGA board, this work targets energy-efficient green communication using FPGA devices. Excess-3 to binary coding, examined here in practical applications, is noted for its simplicity in representing decimal numbers in digital systems. The Zynq FPGA board consumes 1.364 W in total, with 1.236 W of dynamic power and 0.127 W of static power. Its 40.7 degrees C junction temperature and 44.3 degrees C thermal margin indicate effective thermal control. The design uses just 0.01% of the board's Look-Up Table (LUT) resources and 4.0% of its Input/Output (IO) capacity. These benchmarks show how well the Zynq FPGA board suits the design of energy-efficient circuits needed to address global energy crises. Particularly in green communication technologies, the Zynq FPGA board, with its economy and thermal control, offers potential for creating long-term digital solutions.
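For readers unfamiliar with the code mentioned above, excess-3 stores each decimal digit as the digit plus 3 in a 4-bit group, so decoding subtracts 3 from each group. The following is a minimal software sketch of that decoding; the function name `excess3_to_binary` and the packed multi-digit layout are assumptions for illustration and say nothing about the circuit evaluated in the paper.

```python
def excess3_to_binary(code: int, digits: int = 1) -> int:
    """Decode a packed excess-3 value (4 bits per decimal digit) to an integer.

    Hypothetical helper for illustration only.
    """
    value = 0
    # Walk from the most significant digit down to the least significant.
    for i in reversed(range(digits)):
        nibble = (code >> (4 * i)) & 0xF
        # Valid excess-3 groups are 0011 (digit 0) through 1100 (digit 9).
        if not 3 <= nibble <= 12:
            raise ValueError(f"invalid excess-3 group {nibble:04b}")
        value = value * 10 + (nibble - 3)
    return value
```

A single-digit call such as `excess3_to_binary(0b1100)` decodes the largest valid group, while `digits=2` decodes two packed groups into a two-digit decimal value.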
ISBN: (digital) 9798331530075
ISBN: (print) 9798331530082
Counting substrings of an arbitrary length $k$ ($k$-mers) is the single most time-consuming step of de novo genome sequencing. Sequencing machines generate large quantities of data (hundreds of GBs per genome), which must be processed using frequent memory accesses into data structures considerably larger than the available cache, leading to a memory-bound runtime. Because this bottleneck stems from the gap between processor and memory speed, it calls for alternative architectures. Recent FPGA devices equipped with on-chip High-Bandwidth Memory (HBM) enable custom architectures to employ high-capacity, high-bandwidth memory to address memory-bound tasks. This research investigates accelerating $k$-mer counting with one such device, the BittWare 520N-MX, a Stratix 10 FPGA with 16 GB of on-chip HBM2. The architecture was designed using Intel’s oneAPI, which enables high-level synthesis for FPGA programming. The accelerator architecture leverages the inherent parallelism in the algorithm via multiple parallel hash functions, partitions the data structure across multiple memory banks, and employs multiple independent parallel processing pipelines on the device to maximize throughput. The accelerator achieved $42.67\,\mathrm{M}$ $k$-mers per second, $2.80 \times$ more than the throughput-optimized CPU version and $4.31 \times$ more than the original CPU app. Despite this speedup, the architectures saw diminishing returns when attempting to incorporate additional HBM2 banks. OneAPI was able to achieve a speedup over the CPU using the FPGA and emerging memory, but it is somewhat limited in its current scaling capabilities.
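The idea of partitioning the counting data structure across memory banks can be sketched in software as follows. This is only an illustration under stated assumptions: the bank count, the use of CRC32 as the partitioning hash, and the names `count_kmers` and `merge_banks` are all hypothetical, and the paper's actual hash functions and oneAPI pipelines are not described in the abstract.

```python
import zlib
from collections import Counter


def count_kmers(sequence: str, k: int, num_banks: int = 4) -> list:
    """Count k-mers, routing each one to a bank chosen by a hash.

    Each bank is an independent Counter, standing in for one memory bank
    that a separate pipeline could update without contention.
    """
    banks = [Counter() for _ in range(num_banks)]
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # CRC32 is an assumed stand-in for the partitioning hash.
        bank = zlib.crc32(kmer.encode("ascii")) % num_banks
        banks[bank][kmer] += 1
    return banks


def merge_banks(banks: list) -> Counter:
    """Combine per-bank counts; each k-mer lives in exactly one bank."""
    total = Counter()
    for bank in banks:
        total.update(bank)
    return total
```

Because a given k-mer always hashes to the same bank, the per-bank counters never overlap, which is what lets the banks be updated independently and merged by simple concatenation of counts.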