ISBN (digital): 9781728173832
ISBN (print): 9781728173832
Today's applications generate large amounts of data that need to be processed by learning algorithms. In practice, the majority of these data are not associated with any labels, and unsupervised learning methods, i.e., clustering algorithms, are the most commonly used for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing due to the large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration that supports a wide range of popular clustering algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way and enables in-place computation, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms over a wide range of large-scale datasets. Our evaluation shows that DUAL provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides a 58.8x speedup and 251.2x energy-efficiency improvement compared to the state-of-the-art solution running on a GPU.
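As a rough illustration of the encoding and distance metric described above, the sketch below maps points to binary high-dimensional vectors and clusters them with Hamming distance. The random-projection encoder, the bit-majority centroid update, and all names are assumptions for illustration, not DUAL's actual design or its in-memory implementation.

```python
import numpy as np

def hd_encode(X, dim=4096, seed=0):
    """Map real-valued points to binary high-dimensional vectors via random
    projection + sign (one common HD encoder; the paper's exact scheme may differ)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((X.shape[1], dim))
    return (X @ proj > 0).astype(np.uint8)            # binary hypervectors

def hamming(a, b):
    """Simplified distance metric: number of mismatching bits."""
    return np.count_nonzero(a != b)

def hd_kmeans(H, k=3, iters=10, seed=0):
    """k-means-style clustering in binary HD space: Hamming distance for
    assignment, bit-wise majority vote to update cluster centers."""
    rng = np.random.default_rng(seed)
    centers = H[rng.choice(len(H), k, replace=False)]
    for _ in range(iters):
        labels = np.array([np.argmin([hamming(h, c) for c in centers]) for h in H])
        for j in range(k):
            members = H[labels == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    return labels, centers

# Example: cluster random 2-D points after HD encoding
X = np.random.rand(100, 2)
labels, _ = hd_kmeans(hd_encode(X), k=3)
```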
Deep Neural Networks (DNNs) deployed on hardware face excessive computation costs due to their massive number of parameters. A typical training pipeline to mitigate over-parameterization is to pre-define a DNN structure with redundant learning units (filters and neurons) for the sake of high accuracy, and then prune the redundant units after training for efficient inference. We argue that it is sub-optimal to introduce redundancy into training only to remove it later for inference. Moreover, the fixed network structure results in poor adaptation to dynamic tasks such as lifelong learning. In contrast, structural plasticity plays an indispensable role in mammalian brains, enabling compact and accurate learning: throughout a lifetime, active connections are continuously created while those that are no longer important degenerate. Inspired by this observation, we propose a training scheme, continuous Growth and Pruning (CGaP), which starts training from a small network seed, continuously grows the network by adding important learning units, and finally prunes secondary ones for efficient inference. The inference model generated by CGaP is structurally sparse, largely decreasing inference power and latency when deployed on hardware platforms. With popular DNN structures on representative datasets, the efficacy of CGaP is benchmarked by both algorithmic simulation and architectural modeling on Field-Programmable Gate Arrays (FPGAs). For example, CGaP decreases the FLOPs, model size, DRAM access energy, and inference latency by 63.3%, 64.0%, 11.8%, and 40.2%, respectively, for ResNet-110 on CIFAR-10.
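The sketch below illustrates the grow-then-prune idea on a single toy layer: start from a small seed, periodically add units, and finally prune the least important ones. The L1-norm importance score, the noisy-copy growth rule, and all names are assumptions for illustration, not CGaP's actual saliency metric or training procedure.

```python
import numpy as np

class GrowPruneLayer:
    """Toy dense layer for the grow-then-prune idea: start small, periodically
    add units seeded near the most important existing ones, then prune the
    least important units for inference."""

    def __init__(self, in_dim, seed_units=4, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.W = self.rng.standard_normal((seed_units, in_dim)) * 0.1

    def importance(self):
        # Per-unit saliency: L1 norm of the unit's weights (an assumption).
        return np.abs(self.W).sum(axis=1)

    def grow(self, n_new=2, noise=0.01):
        """Add units as noisy copies of the currently most important ones."""
        top = self.W[np.argsort(self.importance())[-n_new:]]
        new = top + noise * self.rng.standard_normal(top.shape)
        self.W = np.vstack([self.W, new])

    def prune(self, keep_ratio=0.5):
        """Keep only the most important fraction of units for inference."""
        k = max(1, int(len(self.W) * keep_ratio))
        keep = np.argsort(self.importance())[-k:]
        self.W = self.W[keep]

layer = GrowPruneLayer(in_dim=8)
for _ in range(3):           # growth phase during training
    layer.grow(n_new=2)
layer.prune(keep_ratio=0.4)  # final pruning yields a structurally sparse model
print(layer.W.shape)
```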
ISBN (print): 9781450369381
This paper proposes a new fine-grained dynamic pruning technique for CNN inference, named channel gating, and presents an accelerator architecture that can effectively exploit the dynamic sparsity. Intuitively, channel gating identifies the regions in the feature map of each CNN layer that contribute less to the classification result and turns off a subset of channels for computing the activations in these less important regions. Unlike static network pruning, which removes redundant weights or neurons prior to inference, channel gating exploits dynamic sparsity specific to each input at run time and in a structured manner. To maximize compute savings while minimizing accuracy loss, channel gating learns the gating thresholds together with weights automatically through training. Experimental results show that the proposed approach can significantly speed up state-of-the-art networks with a marginal accuracy loss, and enable a trade-off between performance and accuracy. This paper also shows that channel gating can be supported with a small set of extensions to a CNN accelerator, and implements a prototype for quantized ResNet-18 models. The accelerator shows an average speedup of 2.3x for ImageNet when the theoretical FLOP reduction is 2.8x, indicating that the hardware can effectively exploit the dynamic sparsity exposed by channel gating.
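The sketch below illustrates the gating idea for a single output channel: a partial sum over a "base" subset of input channels is compared against a threshold at each spatial location, and the remaining channels are only accumulated where the gate fires. The scalar threshold and the channel split are assumptions for illustration; in the paper the thresholds are learned jointly with the weights and the savings are realized in hardware.

```python
import numpy as np

def channel_gated_layer(x, w_base, w_rest, threshold):
    """Illustrative channel gating for one output channel of a 1x1 conv.
    x: (C, H, W) input; w_base / w_rest: weights for the 'base' and 'rest'
    input-channel subsets; threshold: gating threshold (a plain scalar here,
    an assumed simplification of the trainable gate)."""
    c_base = len(w_base)
    partial = np.tensordot(w_base, x[:c_base], axes=1)   # (H, W) base partial sum
    gate = partial > threshold                           # keep only important regions
    out = partial.copy()
    if gate.any():
        rest = np.tensordot(w_rest, x[c_base:], axes=1)  # dense here for clarity;
        out[gate] += rest[gate]                          # hardware skips the gated-off work
    return np.maximum(out, 0.0)                          # ReLU

x = np.random.randn(16, 8, 8)
y = channel_gated_layer(x,
                        w_base=np.random.randn(4),
                        w_rest=np.random.randn(12),
                        threshold=0.0)
```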
ISBN (print): 9781450366694
Memory-augmented neural networks are receiving growing attention because they can make inferences using the history stored in memory. Among them, memory networks in particular are known for their strong reasoning power and their ability to learn from a larger number of inputs than other networks. As input datasets rapidly grow, the need for large-scale memory networks continues to rise. Such large-scale memory networks provide excellent reasoning power; however, current computer infrastructure cannot achieve scalable performance due to its limited system architecture. In this paper, we propose MnnFast, a novel system architecture for large-scale memory networks that achieves fast and scalable reasoning performance. We identify the performance problems of the current architecture through extensive bottleneck analysis. Our in-depth analysis indicates that the current architecture suffers from three major performance problems: high memory bandwidth consumption, heavy computation, and cache contention. To overcome these problems, we propose three novel optimizations. First, to reduce memory bandwidth consumption, we propose a new column-based algorithm with streaming, which minimizes the size of data spills and hides most of the off-chip memory access overhead. Second, to decrease the high computational overhead, we propose a zero-skipping optimization that bypasses a large amount of output computation. Lastly, to eliminate cache contention, we propose a dedicated embedding cache that efficiently caches the embedding matrix. Our evaluations show that MnnFast is significantly effective on various types of hardware: CPU, GPU, and FPGA. MnnFast improves overall throughput by up to 5.38x, 4.34x, and 2.01x on CPU, GPU, and FPGA, respectively. Moreover, compared to CPU-based MnnFast, our FPGA-based MnnFast achieves 6.54x higher energy efficiency.
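The sketch below illustrates the two software-visible optimizations on a single memory-network read: attention weights below a cutoff are skipped entirely (zero-skipping), and the output is accumulated in column blocks so each block of the output memory streams through once. The eps cutoff, block size, and names are assumptions for illustration, not MnnFast's exact algorithm.

```python
import numpy as np

def mnn_output_zero_skip(q, M, C, eps=1e-4, block=64):
    """One memory-network read with zero-skipping and column-blocked streaming.
    q: (d,) query; M: (n, d) input memory; C: (n, d) output memory."""
    scores = M @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # attention weights over memory slots
    active = np.nonzero(p > eps)[0]                # zero-skipping: drop negligible slots
    d = C.shape[1]
    out = np.zeros(d)
    for start in range(0, d, block):               # column-blocked accumulation over C
        cols = slice(start, min(start + block, d))
        out[cols] = p[active] @ C[active, cols]
    return out

q = np.random.randn(128)
M = np.random.randn(10000, 128)
C = np.random.randn(10000, 128)
o = mnn_output_zero_skip(q, M, C)
```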
ISBN (print): 9781450348928
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption. Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49x and 5x, and efficiency by up to 28x and 5x, respectively.
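As a software-level illustration of the "favor sequential accesses" argument, the sketch below replaces a random-probing hash join with a sort-merge join whose merge phase streams through both inputs in order. This is an assumed example of a sequential-access analytics operator, not the Mondrian Data Engine's actual operator set.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) tuples with sequential access patterns:
    sort each input once, then a single merge pass scans both in order,
    avoiding the random probes a hash join would issue."""
    left = sorted(left)
    right = sorted(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all matches for this key, still scanning sequentially.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            # Rewind the right cursor if the next left row shares the same key.
            j = j0 if i < len(left) and left[i][0] == lk else j
    return out

pairs = sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")])
# -> [(2, 'b', 'x'), (2, 'c', 'x')]
```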
ISBN (print): 9781467389471
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power. The previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on the compressed network, corresponding to 3 TOPS on the uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency.
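The sketch below illustrates the compressed sparse matrix-vector product with weight sharing and zero-activation skipping that EIE accelerates: the matrix is stored column-wise with small codebook indices, and columns whose input activation is zero are skipped. The plain CSC layout and all names are assumptions for illustration, not EIE's exact encoding or dataflow.

```python
import numpy as np

def eie_style_spmv(a, col_ptr, row_idx, w_idx, codebook):
    """Compressed sparse matrix-vector product with weight sharing.
      a        : (n_in,) input activations
      col_ptr  : (n_in+1,) start offset of each column's nonzeros (CSC-like)
      row_idx  : output-row index of each stored nonzero
      w_idx    : codebook index of each stored nonzero (weight sharing)
      codebook : small table of shared weight values"""
    n_out = row_idx.max() + 1 if len(row_idx) else 0
    b = np.zeros(n_out)
    for j, aj in enumerate(a):
        if aj == 0.0:                      # skip zero activations: no work for this column
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[k]] += codebook[w_idx[k]] * aj
    return b

# Tiny example: a 3x4 matrix with 3 nonzeros and a 2-entry codebook
codebook = np.array([0.5, -1.0])
col_ptr  = np.array([0, 1, 1, 3, 3])       # columns 1 and 3 are empty
row_idx  = np.array([0, 1, 2])
w_idx    = np.array([0, 1, 0])
a        = np.array([2.0, 5.0, 0.0, 1.0])  # column 2 is skipped (zero activation)
print(eie_style_spmv(a, col_ptr, row_idx, w_idx, codebook))   # -> [1. 0. 0.]
```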