ISBN (print): 9781728142074
This paper presents an energy-efficient, domain-specific manycore accelerator, referred to as the "CSCMAC" (Cyclic Sparsely Connected neural network manycore accelerator), which effectively maps and executes deep neural networks (DNNs) compressed with cyclic sparsely connected (CSC) architectures. CSC layers structurally compress and sparsify DNNs, reducing the memory footprint of fully connected (FC) layers from O(N²) to O(N log N) with respect to the number of layer nodes, and have been shown to be friendly to hardware implementation. We implement CSC layers for inference on a manycore unit, take advantage of their cyclic architecture, and show that their software implementation, even on a parallel-computing processor, is affordable. To further exploit their implementation simplicity, we propose customized instructions for the manycore that fuse frequently used sequences of machine code, and we evaluate the gains from this customization. Our experimental results, using LeNet-300-100 on MNIST and a multi-layer perceptron (MLP) on Physical Activity Monitoring, indicate that replacing FC layers with CSC layers achieves 46x and 6x compression, respectively, within a margin of 2% accuracy loss. A 64-cluster architecture of the CSCMAC is fully placed and routed in 65 nm TSMC CMOS technology. The layout of each cluster occupies an area of 0.73 mm² and consumes 230.2 mW at a 980 MHz clock frequency. The proposed CSCMAC achieves 1.48x higher throughput and 1.49x lower energy compared to its predecessor manycore (PENC). The CSCMAC also achieves 85x higher throughput and consumes 66.4x lower energy compared to a CPU implementation on the NVIDIA Jetson TX2 platform.
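The abstract does not give the CSC construction in detail, but the O(N log N) footprint suggests a cascade of sparse sub-layers with a fixed fan-in d and cyclic (modular) connectivity, in the spirit of a butterfly network. The following NumPy sketch is an illustration under that assumption; the function name csc_sublayer, the stride schedule, and the ReLU activation are hypothetical choices, not the paper's exact design:

```python
import numpy as np

def csc_sublayer(x, weights, stride):
    """One cyclic sparsely connected (CSC) sub-layer (illustrative sketch).

    x       : (N,) input activations
    weights : (N, d) the d nonzero weights of each output node
    stride  : cyclic offset between the d fan-in taps

    Output node i reads inputs (i + k*stride) % N for k = 0..d-1,
    so the sub-layer stores N*d weights instead of the N*N of an FC layer.
    """
    N, d = weights.shape
    taps = (np.arange(N)[:, None] + stride * np.arange(d)[None, :]) % N
    return np.maximum(0.0, (weights * x[taps]).sum(axis=1))  # ReLU

# Cascading log_d(N) sub-layers with strides 1, d, d^2, ... gives every
# output a path from every input (a butterfly-like pattern), with
# N * d * log_d(N) = O(N log N) total weights for constant fan-in d.
N, d = 256, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(N)
levels = int(round(np.log(N) / np.log(d)))  # 4 sub-layers for N=256, d=4
for level in range(levels):
    W = 0.1 * rng.standard_normal((N, d))
    x = csc_sublayer(x, W, stride=d**level)
print(x.shape)  # (256,)
```

The modular index arithmetic is also what makes the layer attractive for a manycore target: each core can compute a contiguous slice of outputs from a small, statically known window of cyclically offset inputs.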
ISBN (print): 9781479986705
Current high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a general-purpose host processor is coupled to a programmable manycore accelerator (PMCA). Such PMCAs typically leverage a hierarchical interconnect and distributed memory with non-uniform access (NUMA). Nested parallelism is a convenient programming abstraction for large-scale cc-NUMA systems, allowing multiple levels of fine-grained parallelism to be created hierarchically (and dynamically) whenever it is available. Existing implementations for cc-NUMA systems introduce large overheads for nested-parallelism management, which cannot be tolerated given the extremely fine-grained nature of embedded parallel workloads. In particular, creating a team of parallel threads has a cost that increases linearly with the number of threads, which is inherently non-scalable. This work presents a software cache of frequently used parallel team configurations to reduce parallel thread-creation overheads in PMCA systems. When a configuration is found in the cache, parallel team creation takes constant time, providing a scalable mechanism. We evaluated our support on the STMicroelectronics STHORM many-core. Compared to the state of the art, our solution shows that: i) the cost of parallel team creation is reduced by up to 67%; ii) the tangible effect on real ultra-fine-grained parallel kernels is a speedup of up to 80%.
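The abstract does not describe the cache's internals, but the core idea, memoizing a fully built team descriptor keyed by its configuration so a hit skips the per-thread setup loop, can be sketched as below. Everything here (TeamDescriptor, TeamCache, the key choice) is a hypothetical illustration in Python, not the STHORM or OpenMP runtime API, which would be implemented in C inside the runtime:

```python
class TeamDescriptor:
    """Hypothetical stand-in for a parallel-team descriptor."""
    def __init__(self, master_id, num_threads):
        self.master_id = master_id
        # Linear-cost setup: one slot of bookkeeping per worker thread.
        # This is the O(num_threads) work the cache amortizes away.
        self.workers = [{"tid": t, "barrier_slot": t}
                        for t in range(num_threads)]

class TeamCache:
    """Memoizes team descriptors by configuration."""
    def __init__(self):
        self._cache = {}

    def get_team(self, master_id, num_threads):
        key = (master_id, num_threads)   # the team "configuration"
        team = self._cache.get(key)
        if team is None:                 # miss: pay the linear cost once
            team = TeamDescriptor(master_id, num_threads)
            self._cache[key] = team
        return team                      # hit: constant-time reuse

cache = TeamCache()
t1 = cache.get_team(master_id=0, num_threads=8)  # cold: builds descriptor
t2 = cache.get_team(master_id=0, num_threads=8)  # warm: O(1) lookup
assert t1 is t2
```

Since embedded parallel regions tend to reopen the same few team shapes repeatedly, even a small cache of configurations can turn the dominant fork cost into a constant-time lookup, which matches the scalability argument of the paper.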