Search Results - Inner Mongolia University Library

Scalable energy-efficient parallel sorting on a fine-grained many-core processor array

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2020, Vol. 138, pp. 32-47

Authors: Stillmaker, Aaron; Bohnenstiehl, Brent; Stillmaker, Lucas; Baas, Bevan. Affiliations: Univ Calif Davis, Elect & Comp Engn Dept, One Shields Ave, Davis, CA 95616, USA; Calif State Univ, Elect & Comp Engn Dept, Fresno, 2320 E San Ramon Ave, Fresno, CA 93740, USA

Three parallel sorting applications and two list output protocols for the first phase of an external sort execute on a fine-grained many-core processor array that contains no algorithm-specific hardware, acting as a co-processor, with a variety of array sizes. Results are generated using a cycle-accurate model based on measured data from a fabricated many-core chip, and simulated for different processor array sizes. The data show that the most energy-efficient first-phase many-core sort requires over 65x lower energy than the GNU C++ standard library sort (std::sort) performed on an Intel laptop-class processor and over 105x lower energy than a radix sort running on an Nvidia GPU. In addition, the highest-throughput first-phase many-core sort is over 9.8x faster than the std::sort and over 14x faster than the radix sort. Both phases of a 10 GB external sort require 6.2x lower energy × time (energy-delay product) than the std::sort and over 13x lower energy × time than the radix sort. © 2019 Elsevier Inc. All rights reserved.

Keywords: Parallel processing; External sorting; Scalable sorting; Fine-grained many-core processor array
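As a rough illustration of the two-phase external sort that the abstract above refers to, the following Python sketch sorts fixed-size runs (the first phase) and then merges the sorted runs (the second phase). It is only a minimal in-memory model: the function names, the `run_size` parameter, and the use of Python lists in place of disk-resident runs are assumptions made for illustration and are not taken from the paper, which runs the first phase on a many-core processor array.

```python
import heapq


def first_phase(records, run_size):
    """Phase 1: split the input into runs of at most run_size items and sort each run.

    In a real external sort each run would be written back to storage;
    here the runs are simply kept as Python lists (an assumption).
    """
    if run_size < 1:
        raise ValueError("run_size must be at least 1")
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]


def second_phase(runs):
    """Phase 2: k-way merge of the already sorted runs into one sorted list."""
    return list(heapq.merge(*runs))


def external_sort(records, run_size):
    """Run both phases back to back."""
    return second_phase(first_phase(records, run_size))


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2]
    print(first_phase(data, 2))
    print(external_sort(data, 2))
```

The split mirrors why the abstract reports first-phase results separately: the first phase is an independent sort per run, so it can be offloaded to a co-processor, while the second phase is a merge over all runs.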


An Event-Driven Massively Parallel Fine-Grained Processor Array


IEEE International Symposium on Circuits and Systems (ISCAS)

Authors: Walsh, Declan; Dudek, Piotr. Affiliations: Univ Manchester, Sch Elect & Elect Engn, Manchester M13 9PL, Lancs, England

ISBN: (print) 9781479983919

A multi-core event-driven parallel processor array design is presented. Using relatively simple 8-bit processing cores and a 2D mesh network topology, the architecture focuses on reducing the area occupation of a single processor core. A large number of these processor cores can be implemented on a single integrated chip to create a MIMD architecture capable of providing a powerful processing performance. Each processor core is an event-driven processor which can enter an idle mode when no data is changing locally. An 8 × 8 prototype processor array is implemented in a 65 nm CMOS process in 1,875 μm × 1,875 μm. This processor array is capable of performing 5.12 GOPS operating at 80 MHz with an average power consumption of 75.4 mW.

Keywords: MIMD; many-core; event-driven; fine-grained; processor array; parallel processing; scalable
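The figures quoted in the abstract above can be cross-checked with simple arithmetic. The short Python sketch below uses only the numbers given there (8 × 8 cores, 80 MHz, 5.12 GOPS, 75.4 mW, 1,875 μm × 1,875 μm). The derived quantities (operations per core per cycle, energy per operation, area per core) are not stated in the source; they are computed here under the assumption that throughput, power, and area are shared evenly across the 64 cores, and the function names are hypothetical.

```python
def ops_per_core_per_cycle(total_ops_per_s, cores, clock_hz):
    """Average operations each core completes per clock cycle."""
    return total_ops_per_s / (cores * clock_hz)


def energy_per_op_pj(power_w, total_ops_per_s):
    """Average energy per operation, in picojoules."""
    return power_w / total_ops_per_s * 1e12


def area_per_core_mm2(side_um, cores):
    """Die area divided evenly over the cores, in square millimetres."""
    side_mm = side_um / 1000.0
    return side_mm * side_mm / cores


if __name__ == "__main__":
    cores = 8 * 8          # 8 x 8 prototype array
    clock_hz = 80e6        # 80 MHz
    throughput = 5.12e9    # 5.12 GOPS
    power_w = 75.4e-3      # 75.4 mW average
    side_um = 1875.0       # 1,875 um per side

    print(ops_per_core_per_cycle(throughput, cores, clock_hz))
    print(energy_per_op_pj(power_w, throughput))
    print(area_per_core_mm2(side_um, cores))
```

Since 64 cores × 80 MHz equals 5.12 × 10^9, the quoted throughput corresponds to one operation per core per cycle, and 75.4 mW divided by 5.12 GOPS works out to roughly 14.7 pJ per operation on average.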


An Event-Driven Massively Parallel Fine-Grained Processor Array


IEEE International Symposium on Circuits and Systems

Authors: Declan Walsh; Piotr Dudek. Affiliations: School of Electrical and Electronic Engineering, The University of Manchester, UK

ISBN: (print) 9781479983926

A multi-core event-driven parallel processor array design is presented. Using relatively simple 8-bit processing cores and a 2D mesh network topology, the architecture focuses on reducing the area occupation of a single processor core. A large number of these processor cores can be implemented on a single integrated chip to create a MIMD architecture capable of providing a powerful processing performance. Each processor core is an event-driven processor which can enter an idle mode when no data is changing locally. An 8 × 8 prototype processor array is implemented in a 65 nm CMOS process in 1,875 μm × 1,875 μm. This processor array is capable of performing 5.12 GOPS operating at 80 MHz with an average power consumption of 75.4 mW.

Keywords: MIMD; many-core; event-driven; fine-grained; processor array; parallel processing; scalable


IMAGine: An In-Memory Accelerated GEMV Engine Overlay


34th International Conference on Field-Programmable Logic and Applications (FPL)

Authors: Kabir, M. D. Arafat; Kamucheka, Tendayi; Fredricks, Nathaniel; Mandebi, Joel; Bakos, Jason; Huang, Miaoqing; Andrews, David. Affiliations: Univ Arkansas, Dept Elect Engn & Comp Sci, Fayetteville, AR 72701, USA; Univ South Carolina, Dept Comp Sci & Engn, Columbia, SC, USA; Adv Micro Devices Inc AMD, Santa Clara, CA, USA

ISBN: (print) 9798331530082; 9798331530075

Processor-in-Memory (PIM) overlays and alternative reconfigurable tile fabrics have been proposed to eliminate the von Neumann bottleneck and enable processing performance to scale with BRAM capacity. The performance of these FPGA-based PIM architectures has been limited by a reduction of the BRAMs' maximum clock frequencies and less-than-ideal scaling of processing elements with increased BRAM capacity. This paper presents IMAGine, an In-Memory Accelerated GEMV engine, a PIM-array accelerator that clocks at the maximum frequency of the BRAM and scales to 100% of the available BRAMs. Comparative analyses are presented showing execution speeds over existing PIM-based GEMV engines on FPGAs and achieving a 2.65x-3.2x faster clock. An AMD Alveo U55 implementation is presented that achieves a system clock speed of 737 MHz, providing 64K bit-serial multiply-accumulate (MAC) units for GEMV operation. This establishes IMAGine as the fastest PIM-based GEMV overlay, outperforming even the custom PIM-based FPGA accelerators reported to date. Additionally, it surpasses TPU v1-v2 and Alibaba Hanguang 800 in clock speed while offering an equal or greater number of MAC units.

Keywords: Processing-in-Memory; System Design; Block RAM; GEMV engine; processor array


Compact hardware accelerator for field multipliers suitable for use in ultra-low power IoT edge devices


ALEXANDRIA ENGINEERING JOURNAL, 2022, Vol. 61, Issue 12, pp. 13079-13087

Authors: Ibrahim, Atef; Gebali, Fayez. Affiliations: Prince Sattam Bin Abdulaziz Univ Alkharj, Coll Comp Engn & Sci, Comp Engn Dept, Al Kharj, Saudi Arabia; Univ Victoria, ECE Dept, Victoria, BC, Canada

Adoption of IoT technology without considering its security implications may expose network systems to a variety of security breaches. In network systems, IoT edge devices are a major source of security risks. Implementing cryptographic algorithms on most IoT edge devices can be difficult due to their limited resources. As a result, compact implementations of these algorithms on these devices are required. Because the field multiplication operation is at the heart of most cryptographic algorithms, its implementation has a significant impact on the entire cryptographic algorithm implementation. In this paper, we therefore propose a small hardware accelerator for performing field multiplication on edge devices. The hardware accelerator is primarily composed of a processor array with a regular structure and local interconnection among its processing elements. The main advantage of the proposed hardware structure is the ability to manage its area, delay, and consumed energy by choosing the appropriate word size l. We implemented the proposed structure using ASIC technology, and the results show average area savings of 95.9% and significant average energy savings of 63.2%. These results indicate that the proposed hardware accelerator is suitable for use in resource-constrained IoT edge devices. © 2022 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Alexandria University. This is an open access article under the CC BY-NC-ND license (http://***/ licenses/by-nc-nd/4.0/).

Keywords: Finite field multiplication; IoT security; Cryptography; IoT-edge devices; Parallel processing; processor array


Making BRAMs Compute: Creating Scalable Computational Memory Fabric Overlays


31st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Authors: Kabir, Md Arafat; Hollis, Joshua; Panahi, Atiyehsadat; Bakos, Jason; Huang, Miaoqing; Andrews, David. Affiliations: Univ Arkansas, Dept Comp Sci & Comp Engn, Fayetteville, AR 72701, USA; Univ South Carolina, Dept Comp Sci & Comp Engn, Columbia, SC, USA; Cadence Design Syst, Dept Comp Sci & Comp Engn, San Jose, CA, USA

ISBN: (print) 9798350312058

The increasing density of distributed BRAMs diffused throughout modern Field Programmable Gate Arrays (FPGAs) is ideal for forming processor in/near memory architectures. This breaks the traditional von Neumann memory bottleneck, which limits concurrency and degrades energy efficiency. Ideally, processing density should scale linearly with BRAM capacity, and clock frequencies should be set by the read/write access times of the BRAM. In this paper, we present a PIM overlay that achieves these goals. We observe an improvement of performance by 2.25x, logic resource utilization by 2x, and accumulation delay by 17x compared to prior published work.

Keywords: Bit-serial; Overlay; FPGA; Machine Learning; SIMD; processor array; Processing-in-Memory


A Scalable JPEG Encoder on a Many-Core Array


16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)

Authors: Abbott, Thomas; Baas, Bevan. Affiliations: Univ Calif Davis, ECE Dept, Davis, CA 95616, USA

ISBN: (print) 9798350393613

JPEG is a widely used image compression algorithm. A 31-core JPEG encoder and a scalable family of JPEG encoders were developed for a fine-grain many-core processor array and are measured on the 32 nm KiloCore chip. These implementations are compared against JPEG implementations on three high-level-language-programmable chip platforms: an Nvidia A100 GPU with an Intel Platinum 8168 (nvJPEG), a TI C66x embedded processor, and an Intel i9-9900 processor (libjpeg-turbo). In addition, an HDL-configurable Xilinx Zynq-7000 FPGA (VISENGI) was compared. All results are scaled to 32 nm CMOS, and throughput per area, energy per megapixel encoded, and the comprehensive energy × delay metrics are compared. The KiloCore designs achieve the lowest chip area and the highest normalized throughput per chip area. Among the chips that are programmable by a high-level programming language, the KiloCore achieves up to 4.33x, 31.4x, and 46.4x lower energy dissipation and up to 4.81x, 7,845x, and 14,510x lower energy × delay than the TI, Intel, and Nvidia chips, respectively.

Keywords: many-core; JPEG encoder; processor array


A mathematical programming method for constructing the shortest interconnection VLSI arrays


INTEGRATION-THE VLSI JOURNAL, 2021, Vol. 81, pp. 167-174

Authors: Ding, Hao; Qian, Junyan; Zhao, Lingzhong; Zhai, Zhongyi. Affiliations: Guilin Univ Elect Technol, Guangxi Key Lab Trusted Software, Guilin 541004, Peoples R China; Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China

The mesh-connected processor array is an extensively investigated architecture in parallel processing. Many studies have addressed the use of reconfiguration algorithms to provide fault tolerance for faulty mesh-connected processor arrays. However, the subarrays generated by previous studies still have large interconnection lengths, which increase capacitance, power dissipation, and dynamic communication cost. First, a mathematical model is established for the array reconfiguration. Then, the proposed method treats the interconnections between PEs as a function of different integer variables, which can be solved using effective integer programming techniques. Finally, an effective solver is called to find the optimal solution. Simulation results show that the proposed method can reduce the interconnection length of the array in the row and column directions simultaneously, thereby generating a subarray with the shortest interconnection length. On a 32 × 32 host array with a fault density of 30%, the total interconnection length of the subarray can be reduced by 8.36% compared with the state of the art, and the average interconnection length can be reduced by 39.30%, which is closer to the lower bound.

Keywords: processor array; Fault tolerance; Reconfiguration; Algorithm; Integer programming


A Customizable Domain-Specific Memory-Centric FPGA Overlay for Machine Learning Applications


31st International Conference on Field-Programmable Logic and Applications (FPL)

Authors: Panahi, Atiyehsadat; Balsalama, Suhail; Ishimwe, Ange-Thierry; Mbongue, Joel Mandebi; Andrews, David. Affiliations: Univ Arkansas, Dept Comp Sci & Comp Engn, Fayetteville, AR 72701, USA; Univ Florida, Dept Comp Sci & Comp Engn, Fayetteville, AR, USA

ISBN: (print) 9781665437592

This paper presents an overview and performance analysis of a software-programmable, domain-customizable System-on-Chip (SoC) overlay for low-latency inferencing of variable and low-precision Machine Learning (ML) networks targeting Internet-of-Things (IoT) edge devices. The SoC includes a 2-D processor array that can be customized at design time for FPGA logic families. The overlay resolves historic issues of poor designer productivity associated with traditional Field Programmable Gate Array (FPGA) design flows without the performance losses normally incurred by overlays. A standard Instruction Set Architecture (ISA) allows different ML networks to be quickly compiled and run on the overlay without the need to resynthesize. Performance results are presented that show the overlay achieves 1.3x-8.0x speedup over custom designs while still allowing rapid changes to ML algorithms on the FPGA through standard compilation.

Keywords: FPGA; overlay; processor array; machine learning; SIMD; bit-serial; fixed-point; MLP; CNN; LSTM; GRU


Low Power Multiple Object Tracking and Counting using a SCAMP Cellular Processor Array


13th International Workshop on Cellular Nanoscale Networks and their Applications (CNNA)

Authors: Barr, David R. W.; Carey, Stephen J.; Dudek, Piotr. Affiliations: Univ Manchester, Sch Elect & Elect Engn, Manchester M13 9PL, Lancs, England

ISBN: (print) 9781467302890

A low-power demonstration system using a SCAMP-3 vision chip to track and count multiple objects with unpredictable trajectories is presented. The system can track as many discrete objects as can fit into its visual field. The compact, self-contained hardware consists of a battery, an ARM Cortex-M3 co-processor, and the sensor/processor array device. The tracking algorithm is performed entirely by the processor array, and the complete system draws 7.3 mA during operation.

Keywords: Vision Chip; processor array; SIMD; Smart Sensors; Multiple Object Tracking
