ISBN:
(Digital) 9798350330991
(Print) 9798350331004
Computation of inner products is frequently used in machine learning (ML) algorithms, apart from signal processing and communication applications. Distributed arithmetic (DA) has been widely employed for area-time-efficient inner-product implementations. In conventional DA-based architectures, one of the vectors is constant and known a priori; hence, traditional DA architectures are not suitable when both vectors are variable. However, the inner product of a pair of variable vectors is a frequent operation in matrix multiplications of various forms and in convolutional neural networks. In this paper, we present a novel DA-based architecture for computing the inner product of variable vectors. To derive the proposed architecture, an inner product of any given length is decomposed into a set of short-length inner products, such that the full inner product can be computed by successive accumulation of the short-length results. We have designed a DA-based architecture for the computation of the short-length inner product of variable vectors and used it in successive clock cycles to compute the whole inner product by successive accumulation. Post-layout synthesis results using Cadence Innovus with a GPDK 90 nm technology library show that the proposed DA-based parallel architecture offers significant advantages in area-delay product and energy consumption over the bit-serial DA architecture.
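For context, the decomposition can be written in the textbook DA form (a sketch of the standard formulation, not equations quoted from the paper; unsigned B-bit operands are assumed for simplicity, with the MSB term negated in the two's-complement case). A length-N inner product is split into blocks of length P, and each short block is evaluated bit-plane by bit-plane:

\[
y=\sum_{i=0}^{N-1} a_i x_i=\sum_{k=0}^{N/P-1}\sum_{j=0}^{P-1} a_{kP+j}\,x_{kP+j},
\qquad
\sum_{j=0}^{P-1} a_j x_j=\sum_{b=0}^{B-1} 2^{b}\Bigl(\sum_{j=0}^{P-1} a_j\,x_{j,b}\Bigr),
\]

where x_{j,b} denotes bit b of x_j. Conventional DA precomputes the 2^P possible values of the inner bit-plane sum in a lookup table, which is only possible when the vector a is fixed; with both vectors variable, those partial sums must be formed on the fly, which is why restricting the DA unit to a small block length P and accumulating the block results over successive clock cycles is the key enabler.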
ISBN:
(Print) 9798350393613
In decentralized IoT ecosystems, four cryptographic algorithms (SHA256, BLAKE256, BLAKE2s, and ChaCha20) are principal to ensuring data integrity and confidentiality. However, existing cryptographic hardware is often limited to supporting a single algorithm and suffers from low performance, which falls short of the diverse requirements of these systems. To address these limitations, we introduce a reconfigurable crypto accelerator (RCA) that offers high flexibility, superior performance, and high hardware efficiency. The RCA includes three novel optimizations: a homogeneous multi-core architecture, a register-adder sharing approach, and a multi-level pipeline scheduler. The RCA was verified and implemented at the system-on-chip level on a ZCU102 FPGA. Real-time performance evaluation of the RCA while executing the various cryptographic algorithms demonstrates an energy efficiency of 94.3-160.4 Mbps/W, which is 3.1-10.5 times higher than that of modern CPUs. Experiments on several FPGAs show that the RCA offers higher flexibility while still outperforming previous works by 1.63-31.65 times in throughput and 1.04-2.76 times in area efficiency. Furthermore, in ASIC synthesis, the RCA exhibits exceptional throughput (48.79-92.16 Gbps), area efficiency (66.2-102.31 Gbps/mm²), and energy efficiency (186.22-287.8 Gbps/W), surpassing other related ASIC-based works.
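As background on why one homogeneous datapath can serve all four algorithms (our illustration, not the paper's RTL): BLAKE's G function was derived from ChaCha's quarter-round, and SHA256 likewise builds on 32-bit additions, rotations, and XORs, so registers and adders can naturally be shared across cores. A minimal C-style sketch of the two round primitives, following RFC 8439 and RFC 7693:

#include <stdint.h>

/* 32-bit rotations: the primitive shared by SHA256, BLAKE2s, and ChaCha20. */
static inline uint32_t rotl32(uint32_t v, int c) { return (v << c) | (v >> (32 - c)); }
static inline uint32_t rotr32(uint32_t v, int c) { return (v >> c) | (v << (32 - c)); }

/* ChaCha20 quarter-round (RFC 8439): add, xor, rotate only. */
static void chacha_qr(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d) {
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* BLAKE2s G function (RFC 7693): the same structure with message words
   x and y mixed in, which is what makes register-adder sharing natural. */
static void blake2s_g(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d,
                      uint32_t x, uint32_t y) {
    *a = *a + *b + x; *d = rotr32(*d ^ *a, 16);
    *c = *c + *d;     *b = rotr32(*b ^ *c, 12);
    *a = *a + *b + y; *d = rotr32(*d ^ *a, 8);
    *c = *c + *d;     *b = rotr32(*b ^ *c, 7);
}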
ISBN:
(Print) 9781665497473
The proceedings contain 148 papers. The topics discussed include: heterogeneous architecture for sparse data processing; combined application of approximate computing techniques in DNN hardware accelerators; highly efficient ALLTOALL and ALLTOALLV communication algorithms for GPU systems; implementing spatio-temporal graph convolutional networks on Graphcore IPUs; the best of many worlds: scheduling machine learning inference on CPU-GPU integrated architectures; online learning RTL synthesis for automated design space exploration; machine learning aided hardware resource estimation for FPGA DNN implementations; optimal schedules for high-level programming environments on FPGAs with constraint programming; on how to push efficient medical semantic segmentation to the edge: the SENECA approach; and exploiting high-bandwidth memory for FPGA-acceleration of inference on sum-product networks.
ISBN:
(Print) 9781728146713
The synthesis of thinned planar arrays of real radiating elements for 5G communication systems is addressed. A nature-inspired optimization strategy based on the Genetic Algorithm (GA) is employed to define the simplified array architecture, in order to reduce the number of transmit/receive modules and radio-frequency (RF) chains with respect to a fully populated array architecture. At each iteration of the GA-based optimization process, the array pattern is efficiently calculated by considering a limited set of representative embedded element patterns, which accounts for the mutual coupling phenomena and yields an accurate prediction of the real radiation performance. A representative numerical example is reported to validate the proposed approach.
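For concreteness, the pattern model typically evaluated inside such a GA loop (the standard thinned-array formulation, assumed here rather than quoted from the paper) combines a binary thinning vector with the embedded element patterns:

\[
F(\theta,\varphi)=\sum_{n=1}^{N} t_n\, e_n(\theta,\varphi)\,
e^{\,jk\left(x_n \sin\theta\cos\varphi + y_n \sin\theta\sin\varphi\right)},
\qquad t_n\in\{0,1\},
\]

where e_n is the embedded element pattern of the n-th element (approximated by a limited set of representative patterns in the proposed approach), (x_n, y_n) its position, and k = 2\pi/\lambda the wavenumber. The GA evolves the binary vector (t_1, ..., t_N) to minimize a cost defined on this pattern; the exact cost terms (e.g., sidelobe level and gain constraints) are an assumption here.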
ISBN:
(Print) 9781728114361
Since the advent of GPU computing, GPU hardware has evolved at a fast pace. Because application performance heavily depends on the latest hardware improvements, performance portability is extremely challenging for GPU application library developers. Portability becomes even more difficult when new low-level instructions are added to the ISA (e.g., warp shuffle instructions) or when the microarchitectural support for existing instructions is improved (e.g., atomic instructions). Library developers, besides re-tuning the code for new hardware features, deal with the performance portability issue by hand-writing multiple algorithm versions that leverage different instruction sets and microarchitectures. High-level programming frameworks and Domain Specific Languages (DSLs) do not typically support low-level instructions (e.g., warp shuffle and atomic instructions), so it is painful or even impossible for these programming systems to take advantage of the latest architectural improvements. In this work, we design a new set of high-level APIs and qualifiers, as well as specialized Abstract Syntax Tree (AST) transformations for high-level programming languages and DSLs. Our transformations enable warp shuffle instructions and atomic instructions (on global and shared memories) to be easily generated. We show a practical implementation of these transformations by building on Tangram, a high-level kernel synthesis framework. Using our new language and compiler extensions, we implement parallel reduction, a fundamental building block used in a wide range of algorithms. Parallel reduction is representative of the performance portability challenge, as its performance heavily depends on the latest hardware improvements. We compare our synthesized parallel reduction to another high-level programming framework and a hand-written high-performance library across three generations of GPU architectures, and show up to 7.8x speedup (2x on average) over hand-written code.
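To make the warp-shuffle case concrete, the following minimal CUDA sketch (our own illustration of the target code style, not Tangram's actual output) shows the kind of kernel such transformations aim to generate, using both warp shuffle and global-memory atomic instructions:

#include <cuda_runtime.h>

// Sum-reduce the values held by one warp using shuffle instructions,
// avoiding shared memory for the intra-warp phase. Lane 0 gets the total.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Grid-stride reduction; *out must be zero-initialized by the caller.
__global__ void reduceSum(const float *in, float *out, int n) {
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    sum = warpReduceSum(sum);  // every lane executes this, so the full mask is safe
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);   // one atomic per warp combines partial sums
}

On architectures without __shfl_down_sync, the intra-warp phase falls back to a shared-memory tree: exactly the per-generation code divergence these AST transformations are meant to hide from the library developer.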
ISBN:
(Print) 9783030172268
The proceedings contain 28 papers. The special focus in this conference is on Applied Reconfigurable Computing. The topics include: Proof-Carrying Hardware Versus the Stealthy Malicious LUT Hardware Trojan; Secure Local Configuration of Intellectual Property Without a Trusted Third Party; HiFlipVX: An Open Source High-Level Synthesis FPGA Library for Image Processing; Real-Time FPGA Implementation of Connected Component Labelling for a 4K Video Stream; A Scalable FPGA-Based Architecture for Depth Estimation in SLAM; Evaluating LULESH Kernels on OpenCL FPGA; The TaPaSCo Open-Source Toolflow for the Automated Composition of Task-Based Parallel Reconfigurable Computing Systems; Graph-Based Code Restructuring Targeting HLS for FPGAs; UltraSynth: Integration of a CGRA into a Control Engineering Environment; Exploiting Reconfigurable Vector Processing for Energy-Efficient Computation in 3D-Stacked Memories; Optimizing CNN-Based Hyperspectral Image Classification on FPGAs; Automatic Toolflow for VCGRA Generation to Enable CGRA Evaluation for Arithmetic Algorithms; ReM: A Reconfigurable Multipotent Cell for New Distributed Reconfigurable Architectures; Update or Invalidate: Influence of Coherence Protocols on Configurable HW Accelerators; Hybrid Prototyping for Manycore Design and Validation; Evaluation of FPGA Partitioning Schemes for Time and Space Sharing of Heterogeneous Tasks; Third Party CAD Tools for FPGA Design—A Survey of the Current Landscape; Filter-Wise Pruning Approach to FPGA Implementation of Fully Convolutional Network for Semantic Segmentation; Exploring Data Size to Run Convolutional Neural Networks in Low Density FPGAs; Faster Convolutional Neural Networks in Low Density FPGAs Using Block Pruning; Supporting Columnar In-memory Formats on FPGA: The Hardware Design of Fletcher for Apache Arrow; A Novel Encoder for TDCs.
ISBN:
(Print) 9783319788890
The proceedings contain 59 papers. The special focus in this conference is on Applied Reconfigurable Computing. The topics include: FPGA-based memory efficient shift-and algorithm for regular expression matching; Towards an optimized multi FPGA architecture with STDM network: a preliminary study; An FPGA/HMC-based accelerator for resolution proof checking; An efficient FPGA implementation of the big bang-big crunch optimization algorithm; ReneGENE-GI: empowering precision genomics with FPGAs on HPCs; FPGA-based parallel pattern matching; Embedded vision systems: a review of the literature; A survey of low power design techniques for last level caches; ISA-DTMR: selective protection in configurable heterogeneous multicores; Redundancy-reduced MobileNet acceleration on reconfigurable logic for ImageNet classification; Analyzing AXI streaming interface for hardware acceleration in AP-SoC under soft errors; High performance UDP/IP 40Gb Ethernet stack for FPGAs; Tackling wireless sensor network heterogeneity through novel reconfigurable gateway approach; A low-power FPGA-based architecture for microphone arrays in wireless sensor networks; A hybrid FPGA trojan detection technique based on combinatorial testing and on-chip sensing; HoneyWiN: novel honeycomb-based wireless NoC architecture in many-core era; Fast partial reconfiguration on SRAM-based FPGAs: a frame-driven routing approach; A dynamic partial reconfigurable overlay framework for Python; Runtime adaptive cache for the LEON3 processor; Exploiting partial reconfiguration on a dynamic coarse grained reconfigurable architecture; Accuracy to throughput trade-offs for reduced precision neural networks on reconfigurable logic; DIM-VEX: exploiting design time configurability and runtime reconfigurability; The use of HACP+SBT lossless compression in optimizing memory bandwidth requirement for hardware implementation of background modelling algorithms; A reconfigurable PID controller; High-level synthesis of software-defined MPSoCs.
Today, machine learning based on neural networks has become mainstream in many application domains. A small subset of machine learning algorithms, called Convolutional Neural Networks (CNNs), is considered state-of-the-art for many applications (e.g., video/audio classification). The main challenge in implementing CNNs in embedded systems is their large computation, memory, and bandwidth requirements. To meet these demands, dedicated hardware accelerators have been proposed. Since memory is the major cost in CNNs, recent accelerators focus on reducing memory accesses. In particular, they exploit data locality using tiling, layer merging, or intra/inter feature-map parallelism to reduce the memory footprint. However, they lack the flexibility to interleave or cascade these optimizations. Moreover, most of the existing accelerators do not exploit compression, which can simultaneously reduce memory requirements, increase throughput, and enhance energy efficiency. To tackle these limitations, we present a flexible accelerator called MOCHA. MOCHA has three features that differentiate it from the state-of-the-art: (i) the ability to compress inputs/kernels, (ii) the flexibility to interleave various optimizations, and (iii) the intelligence to automatically interleave and cascade the optimizations, depending on the dimensions of a specific CNN layer and the available resources. Post-layout synthesis results reveal that MOCHA provides up to 63% higher energy efficiency, up to 42% higher throughput, and up to 30% less storage compared to the next best accelerator, at the cost of 26-35% additional area.
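To ground the tiling point, here is a generic sketch (C-style; the function, data layout, and tile parameter T are ours, not MOCHA's actual dataflow) of how tiling the output loops of a direct convolution bounds the live input working set to one (T+K-1)x(T+K-1) patch per input channel instead of the whole feature map:

// Tiled direct convolution, stride 1, 'valid' padding.
// in: [C_in][H][W], w: [C_out][C_in][K][K], out: [C_out][H-K+1][W-K+1]
void conv_tiled(const float *in, const float *w, float *out,
                int C_in, int C_out, int H, int W, int K, int T) {
    int Ho = H - K + 1, Wo = W - K + 1;
    for (int ty = 0; ty < Ho; ty += T)
    for (int tx = 0; tx < Wo; tx += T)          // one output tile at a time
    for (int co = 0; co < C_out; ++co)
    for (int y = ty; y < ty + T && y < Ho; ++y)
    for (int x = tx; x < tx + T && x < Wo; ++x) {
        float acc = 0.0f;
        for (int ci = 0; ci < C_in; ++ci)       // accumulate over input channels
        for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
            acc += in[(ci * H + y + ky) * W + (x + kx)]
                 * w[((co * C_in + ci) * K + ky) * K + kx];
        out[(co * Ho + y) * Wo + x] = acc;
    }
}

Interleaving then amounts to choosing, per layer, how this tiling composes with layer merging and feature-map parallelism; automating that choice from the layer dimensions and available resources is the role of MOCHA's third feature.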