Remarn: A Reconfigurable Multi-threaded Multi-core Accelerator for Recurrent Neural Networks

Authors: Que, Zhiqiang; Nakahara, Hiroki; Fan, Hongxiang; Li, He; Meng, Jiuxi; Tsoi, Kuen Hung; Niu, Xinyu; Nurvitadhi, Eriko; Luk, Wayne

Affiliations: Imperial Coll London, Exhibit Rd, London SW7 2BX, England; Tokyo Inst Technol, Ohokayama 1-21-2, Tokyo 1528550, Japan; Univ Cambridge, Cambridge CB2 1TN, England; Corerain Technol Ltd, 14F Changfu Jinmao Bldg (CFC), Shenzhen, Peoples R China; Intel Corp, Jones Farm Campus, Hillsboro, OR 97124, USA

Published in: ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS

Year/Volume/Issue: 2023, Vol. 16, No. 1

Pages: 1-26

Subject classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering and Science)]

Funding: United Kingdom EPSRC [EP/V028251/1, EP/L016796/1, EP/N031768/1, EP/P010040/1, EP/S030069/1]; Intel; Corerain

Keywords: Accelerator architecture; recurrent neural networks; multi-tenant execution

Abstract: This work introduces Remarn, a reconfigurable multi-threaded multi-core accelerator supporting both spatial and temporal co-execution of Recurrent Neural Network (RNN) inferences. It increases the processing capability and quality of service of cloud-based neural processing units (NPUs) by improving their hardware utilization and reducing design latency, with two innovations. First, a custom coarse-grained multi-threaded RNN/Long Short-Term Memory (LSTM) hardware architecture that switches tasks among threads when RNN computational engines encounter data hazards. Second, the partitioning of this hardware architecture into multiple full-fledged sub-accelerator cores, enabling spatial co-execution of multiple RNN/LSTM inferences. These innovations improve the exploitation of the available parallelism, increasing runtime hardware utilization and boosting design throughput. Evaluation results show that a dual-threaded quad-core Remarn NPU achieves 2.91 times higher performance while occupying only 5.0% more area than a single-threaded one on a Stratix 10 FPGA. Compared with a Tesla V100 GPU implementation, our design achieves 6.5 times better performance and 15.6 times higher power efficiency, showing that our approach contributes to high-performance and energy-efficient FPGA-based multi-RNN inference designs for datacenters.
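The abstract's first innovation, coarse-grained multithreading that switches threads on data hazards, can be illustrated with a minimal sketch. The Python below is a simplified cycle model, not the paper's actual design: Thread, run_engine, and PIPELINE_DEPTH are hypothetical names, and the hazard here stands in for the recurrent dependency on the previous time step's hidden state.

```python
from collections import deque
from dataclasses import dataclass

PIPELINE_DEPTH = 4  # hypothetical: cycles before h_{t-1} leaves the pipeline

@dataclass
class Thread:
    tid: int
    steps_left: int    # remaining RNN time steps in this inference
    ready_at: int = 0  # cycle at which the recurrent dependency resolves

def run_engine(threads):
    """One engine with coarse-grained multithreading: on a data hazard,
    switch to another ready thread instead of stalling."""
    queue = deque(threads)
    cycle = 0
    while queue:
        # If every thread is stalled on its hazard, the engine must idle
        # until the earliest dependency resolves.
        if all(cycle < t.ready_at for t in queue):
            cycle = min(t.ready_at for t in queue)
        thread = queue.popleft()
        if cycle < thread.ready_at:
            queue.append(thread)  # data hazard: switch tasks among threads
            continue
        # Issue one RNN time step; its hidden state becomes usable after
        # PIPELINE_DEPTH cycles, creating the next hazard for this thread.
        thread.steps_left -= 1
        thread.ready_at = cycle + PIPELINE_DEPTH
        cycle += 1
        if thread.steps_left > 0:
            queue.append(thread)
    return cycle

single = run_engine([Thread(0, 8)])
dual = run_engine([Thread(0, 8), Thread(1, 8)])
print(f"single-threaded: 8 steps in {single} cycles")
print(f"dual-threaded:  16 steps in {dual} cycles")
```

With one thread the engine idles for most of each hazard window; with two threads the stalls largely overlap with useful work from the other thread, which is the runtime-utilization gain the abstract describes. The second innovation, spatial co-execution, would correspond to running several such engines side by side as independent sub-accelerator cores.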
