Three parallel sorting applications and two list output protocols for the first phase of an external sort execute on a fine-grained many-core processor array that contains no algorithm-specific hardware and acts as a co-processor, evaluated over a variety of array sizes. Results are generated using a cycle-accurate model based on measured data from a fabricated many-core chip, and simulated for different processor array sizes. The data show that the most energy-efficient first-phase many-core sort requires over 65x lower energy than the GNU C++ standard library sort (std::sort) performed on an Intel laptop-class processor and over 105x lower energy than a radix sort running on an Nvidia GPU. In addition, the highest-throughput first-phase many-core sort is over 9.8x faster than std::sort and over 14x faster than the radix sort. Both phases of a 10 GB external sort require 6.2x lower energy × time (energy-delay product) than std::sort and over 13x lower energy × time than the radix sort. © 2019 Elsevier Inc. All rights reserved.
ISBN: (print) 9781479983919
A multi-core event-driven parallel processor array design is presented. Using relatively simple 8-bit processing cores and a 2D mesh network topology, the architecture focuses on reducing the area occupation of a single processor core. A large number of these processor cores can be implemented on a single integrated chip to create a MIMD architecture capable of providing a powerful processing performance. Each processor core is an event-driven processor which can enter an idle mode when no data is changing locally. An 8 × 8 prototype processor array is implemented in a 65 nm CMOS process in 1,875 μm × 1,875 μm. This processor array is capable of performing 5.12 GOPS operating at 80 MHz with an average power consumption of 75.4 mW.
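The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below only recombines the numbers quoted above; the derived quantities (operations per core per cycle, energy per operation, die area) are not stated in the abstract, and the energy figure assumes the average power and the quoted GOPS refer to the same operating condition, which the abstract does not confirm.

```python
# Figures quoted in the abstract.
cores = 8 * 8                # 8 x 8 prototype array
throughput_ops = 5.12e9      # 5.12 GOPS
clock_hz = 80e6              # 80 MHz
power_w = 75.4e-3            # 75.4 mW average power
side_um = 1875.0             # die side length in micrometres

# Derived quantities (assumptions, not claims made by the abstract).
ops_per_core_per_cycle = throughput_ops / (cores * clock_hz)
energy_per_op_pj = power_w / throughput_ops * 1e12
area_mm2 = (side_um / 1000.0) ** 2
power_per_core_mw = power_w / cores * 1e3

print(f"ops per core per cycle: {ops_per_core_per_cycle:.2f}")
print(f"energy per op:          {energy_per_op_pj:.2f} pJ")
print(f"array area:             {area_mm2:.3f} mm^2")
print(f"power per core:         {power_per_core_mw:.3f} mW")
```

Under these assumptions the quoted throughput corresponds to one operation per core per clock cycle across all 64 cores.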
ISBN: (print) 9781479983926
A multi-core event-driven parallel processor array design is presented. Using relatively simple 8-bit processing cores and a 2D mesh network topology, the architecture focuses on reducing the area occupation of a single processor core. A large number of these processor cores can be implemented on a single integrated chip to create a MIMD architecture capable of providing a powerful processing performance. Each processor core is an event-driven processor which can enter an idle mode when no data is changing locally. An 8 × 8 prototype processor array is implemented in a 65 nm CMOS process in 1,875 μm × 1,875 μm. This processor array is capable of performing 5.12 GOPS operating at 80 MHz with an average power consumption of 75.4 mW.
ISBN: (print) 9798331530082; 9798331530075
Processor-in-Memory (PIM) overlays and alternative reconfigurable tile fabrics have been proposed to eliminate the von Neumann bottleneck and enable processing performance to scale with BRAM capacity. The performance of these FPGA-based PIM architectures has been limited by reduced maximum BRAM clock frequencies and by less-than-ideal scaling of processing elements with increased BRAM capacity. This paper presents IMAGine, an In-Memory Accelerated GEMV engine, a PIM-array accelerator that clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses are presented showing execution speedups over existing PIM-based GEMV engines on FPGAs and a 2.65x to 3.2x faster clock. An AMD Alveo U55 implementation is presented that achieves a system clock speed of 737 MHz, providing 64K bit-serial multiply-accumulate (MAC) units for GEMV operation. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.
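To make the idea of a bit-serial MAC concrete, here is a minimal software sketch of a shift-and-add multiply-accumulate that consumes one multiplier bit per step, with one hypothetical processing element per matrix row. It is an illustration only: the abstract does not describe IMAGine's actual datapath, and the function names, the unsigned operands, and the 8-bit default width are assumptions.

```python
def bit_serial_mac(acc: int, a: int, b: int, bits: int = 8) -> int:
    """Add a*b to acc, examining one bit of b per step (unsigned operands)."""
    for i in range(bits):
        if (b >> i) & 1:       # current multiplier bit
            acc += a << i      # add the correspondingly shifted multiplicand
    return acc


def gemv(matrix, vector, bits: int = 8):
    """Matrix-vector product; each row models one hypothetical MAC unit."""
    result = []
    for row in matrix:
        acc = 0
        for a, x in zip(row, vector):
            acc = bit_serial_mac(acc, a, x, bits)
        result.append(acc)
    return result


print(gemv([[1, 2], [3, 4]], [5, 6]))  # prints [17, 39]
```

In this model each MAC takes `bits` steps, so per-unit throughput falls as operand precision rises; a design of this kind relies on a very large number of units running in parallel to compensate.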
Adoption of IoT technology without considering its security implications may expose network systems to a variety of security breaches. In network systems, IoT edge devices are a major source of security risks. Implementing cryptographic algorithms on most IoT edge devices can be difficult due to their limited resources, so compact implementations of these algorithms are required. Because the field multiplication operation is at the heart of most cryptographic algorithms, its implementation has a significant impact on the entire cryptographic algorithm implementation. In this paper, we therefore propose a small hardware accelerator for performing field multiplication on edge devices. The hardware accelerator is primarily composed of a processor array with a regular structure and local interconnection among its processing elements. The main advantage of the proposed hardware structure is the ability to manage its area, delay, and consumed energy by choosing the appropriate word size l. We implemented the proposed structure using ASIC technology, and the obtained results show average area savings of 95.9% and average energy savings of 63.2%. These results indicate that the proposed hardware accelerator is suitable for resource-constrained IoT edge devices. © 2022 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://***/ licenses/by-nc-nd/4.0/).
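The role of the word size l can be illustrated with a word-serial field multiplier in software. The sketch below is not the authors' design: it assumes a binary field GF(2^m) in polynomial basis, processes the second operand l bits at a time from the most significant word down, and uses hypothetical function names. The example modulus is the AES polynomial x^8 + x^4 + x^3 + x + 1, chosen only because it has a well-known test vector.

```python
def clmul(a: int, d: int) -> int:
    """Carry-less (XOR) product of polynomial a and digit d over GF(2)."""
    result = 0
    while d:
        if d & 1:
            result ^= a
        a <<= 1
        d >>= 1
    return result


def reduce_poly(x: int, poly: int, m: int) -> int:
    """Reduce x modulo poly, where poly has degree m (bit m is set)."""
    while x.bit_length() > m:
        x ^= poly << (x.bit_length() - m - 1)
    return x


def gf2m_mul_word_serial(a: int, b: int, poly: int, m: int, l: int) -> int:
    """Multiply a*b in GF(2^m), consuming l bits of b per iteration."""
    n_words = -(-m // l)                # ceil(m / l) iterations
    mask = (1 << l) - 1
    acc = 0
    for w in reversed(range(n_words)):  # most significant word first
        digit = (b >> (w * l)) & mask
        acc = reduce_poly((acc << l) ^ clmul(a, digit), poly, m)
    return acc


# FIPS-197 example: {57} * {83} = {c1} in GF(2^8), here with l = 4.
print(hex(gf2m_mul_word_serial(0x57, 0x83, 0x11B, 8, 4)))  # prints 0xc1
```

The loop runs ceil(m / l) times, so a larger l means fewer iterations but a wider partial product per iteration. In hardware terms this is the kind of area versus delay trade-off that the abstract attributes to the choice of l.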
ISBN: (print) 9798350312058
The increasing density of distributed BRAMs diffused throughout modern Field Programmable Gate Arrays (FPGAs) is ideal for forming processor in/near memory architectures. This breaks the traditional von Neumann memory bottleneck, which limits concurrency and degrades energy efficiency. Ideally, processing density should scale linearly with BRAM capacity, and clock frequencies should be set by the read/write access times of the BRAM. In this paper, we present a PIM overlay that achieves these goals. We observe improvements of 2.25x in performance, 2x in logic resource utilization, and 17x in accumulation delay compared to prior published work.
ISBN: (print) 9798350393613
JPEG is a widely used image compression algorithm. A 31-core JPEG encoder and a scalable family of JPEG encoders were developed for a fine-grain many-core processor array and are measured on the 32 nm KiloCore chip. These implementations are compared against JPEG implementations on three high-level-language-programmable chip platforms: an Nvidia A100 GPU with an Intel Platinum 8168 (nvJPEG), a TI C66x embedded processor, and an Intel i9-9900 processor (libjpeg-turbo). In addition, an HDL-configurable Xilinx Zynq-7000 FPGA (VISENGI) was compared. All results are scaled to 32 nm CMOS, and throughput per area, energy per megapixel encoded, and the comprehensive energy × delay metrics are compared. The KiloCore designs achieve the lowest chip area and the highest normalized throughput per chip area. Among the chips that are programmable by a high-level programming language, the KiloCore achieves up to 4.33x, 31.4x, and 46.4x lower energy dissipation and up to 4.81x, 7,845x, and 14,510x lower energy × delay than the TI, Intel, and Nvidia chips, respectively.
The mesh-connected processor array is an extensively investigated architecture in parallel processing. Many studies have used reconfiguration algorithms to provide fault tolerance for faulty mesh-connected processor arrays. However, the subarrays generated by previous studies still contain long interconnections, which increase capacitance, power dissipation, and dynamic communication cost. First, a mathematical model is established for the array reconfiguration. Then, the proposed method treats the interconnections between PEs as a function of integer variables, which can be solved using effective integer programming techniques. Finally, an effective solver is called to find the optimal solution. Simulation results show that the proposed method can reduce the interconnection length of the array in the row and column directions simultaneously, thereby generating a subarray with the shortest interconnection length. On a 32 × 32 host array with a fault density of 30%, the total interconnection length of the subarray can be reduced by 8.36% compared with the state of the art, and the average interconnection length can be reduced by 39.30%, which is closer to the lower bound.
ISBN: (print) 9781665437592
This paper presents an overview and performance analysis of a software-programmable, domain-customizable System-on-Chip (SoC) overlay for low-latency inferencing of variable and low-precision Machine Learning (ML) networks targeting Internet-of-Things (IoT) edge devices. The SoC includes a 2-D processor array that can be customized at design time for FPGA logic families. The overlay resolves historic issues of poor designer productivity associated with traditional Field Programmable Gate Array (FPGA) design flows without the performance losses normally incurred by overlays. A standard Instruction Set Architecture (ISA) allows different ML networks to be quickly compiled and run on the overlay without the need to resynthesize. Performance results are presented that show the overlay achieves 1.3x-8.0x speedup over custom designs while still allowing rapid changes to ML algorithms on the FPGA through standard compilation.
ISBN: (print) 9781467302890
A low-power demonstration system using a SCAMP-3 vision chip to track and count multiple objects with unpredictable trajectories is presented. The system can track as many discrete objects as can fit into its visual field. The compact, self-contained hardware consists of a battery, an ARM Cortex-M3 co-processor, and the sensor/processor array device. The tracking algorithm is performed entirely by the processor array, and the complete system draws 7.3 mA during operation.