ISBN (digital): 9781728173832
ISBN (print): 9781728173832
Today's applications generate large amounts of data that need to be processed by learning algorithms. In practice, the majority of these data are not associated with any labels, and unsupervised learning methods, i.e., clustering algorithms, are the most commonly used for data analysis. However, running clustering algorithms on traditional cores results in high energy consumption and slow processing due to the large amount of data movement between memory and processing units. In this paper, we propose DUAL, a Digital-based Unsupervised learning AcceLeration that supports a wide range of popular clustering algorithms on conventional crossbar memory. Instead of working with the original data, DUAL maps all data points into high-dimensional space, replacing complex clustering operations with memory-friendly operations. We accordingly design a PIM-based architecture that supports all essential operations in a highly parallel and scalable way and enables in-place computation, allowing data points to remain in memory. We have evaluated DUAL on several popular clustering algorithms over a wide range of large-scale datasets. Our evaluation shows that DUAL provides quality comparable to existing clustering algorithms while using a binary representation and a simplified distance metric. DUAL also provides a 58.8x speedup and 251.2x energy-efficiency improvement compared to the state-of-the-art solution running on a GPU.
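As a rough illustration of the encoding and distance metric described above, the sketch below maps points to binary high-dimensional vectors and clusters them with Hamming distance. The random-projection encoder, the bit-majority centroid update, and all names are assumptions for illustration, not DUAL's actual design or its in-memory implementation.

```python
import numpy as np

def hd_encode(X, dim=4096, seed=0):
    """Map real-valued points to binary high-dimensional vectors via random
    projection + sign (one common HD encoder; the paper's exact scheme may differ)."""
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((X.shape[1], dim))
    return (X @ proj > 0).astype(np.uint8)            # binary hypervectors

def hamming(a, b):
    """Simplified distance metric: number of mismatching bits."""
    return np.count_nonzero(a != b)

def hd_kmeans(H, k=3, iters=10, seed=0):
    """k-means-style clustering in binary HD space: Hamming distance for
    assignment, bit-wise majority vote to update cluster centers."""
    rng = np.random.default_rng(seed)
    centers = H[rng.choice(len(H), k, replace=False)]
    for _ in range(iters):
        labels = np.array([np.argmin([hamming(h, c) for c in centers]) for h in H])
        for j in range(k):
            members = H[labels == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    return labels, centers

# Example: cluster random 2-D points after HD encoding
X = np.random.rand(100, 2)
labels, _ = hd_kmeans(hd_encode(X), k=3)
```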
Deep Neural Networks (DNNs) deployed on hardware face excessive computation costs due to their massive number of parameters. A typical training pipeline to mitigate over-parameterization is to pre-define a DNN structure with redundant learning units (filters and neurons) for the sake of high accuracy, and then prune the redundant units after training for efficient inference. We argue that it is sub-optimal to introduce redundancy into training only to remove it later for inference. Moreover, the fixed network structure results in poor adaptation to dynamic tasks such as lifelong learning. In contrast, structural plasticity plays an indispensable role in mammalian brains, enabling compact and accurate learning: throughout a lifetime, active connections are continuously created while those that are no longer important degenerate. Inspired by this observation, we propose a training scheme, continuous Growth and Pruning (CGaP), which starts training from a small network seed, continuously grows the network by adding important learning units, and finally prunes secondary ones for efficient inference. The inference model generated by CGaP is structurally sparse, largely decreasing inference power and latency when deployed on hardware platforms. With popular DNN structures on representative datasets, the efficacy of CGaP is benchmarked by both algorithmic simulation and architectural modeling on Field-Programmable Gate Arrays (FPGAs). For example, CGaP decreases the FLOPs, model size, DRAM access energy, and inference latency by 63.3%, 64.0%, 11.8%, and 40.2%, respectively, for ResNet-110 on CIFAR-10.
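The sketch below illustrates the grow-then-prune idea on a single toy layer: start from a small seed, periodically add units, and finally prune the least important ones. The L1-norm importance score, the noisy-copy growth rule, and all names are assumptions for illustration, not CGaP's actual saliency metric or training procedure.

```python
import numpy as np

class GrowPruneLayer:
    """Toy dense layer for the grow-then-prune idea: start small, periodically
    add units seeded near the most important existing ones, then prune the
    least important units for inference."""

    def __init__(self, in_dim, seed_units=4, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.W = self.rng.standard_normal((seed_units, in_dim)) * 0.1

    def importance(self):
        # Per-unit saliency: L1 norm of the unit's weights (an assumption).
        return np.abs(self.W).sum(axis=1)

    def grow(self, n_new=2, noise=0.01):
        """Add units as noisy copies of the currently most important ones."""
        top = self.W[np.argsort(self.importance())[-n_new:]]
        new = top + noise * self.rng.standard_normal(top.shape)
        self.W = np.vstack([self.W, new])

    def prune(self, keep_ratio=0.5):
        """Keep only the most important fraction of units for inference."""
        k = max(1, int(len(self.W) * keep_ratio))
        keep = np.argsort(self.importance())[-k:]
        self.W = self.W[keep]

layer = GrowPruneLayer(in_dim=8)
for _ in range(3):           # growth phase during training
    layer.grow(n_new=2)
layer.prune(keep_ratio=0.4)  # final pruning yields a structurally sparse model
print(layer.W.shape)
```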
ISBN (print): 9781450369381
This paper proposes a new fine-grained dynamic pruning technique for CNN inference, named channel gating, and presents an accelerator architecture that can effectively exploit the dynamic sparsity. Intuitively, channel gating identifies the regions in the feature map of each CNN layer that contribute less to the classification result and turns off a subset of channels for computing the activations in these less important regions. Unlike static network pruning, which removes redundant weights or neurons prior to inference, channel gating exploits dynamic sparsity specific to each input at run time and in a structured manner. To maximize compute savings while minimizing accuracy loss, channel gating learns the gating thresholds together with weights automatically through training. Experimental results show that the proposed approach can significantly speed up state-of-the-art networks with a marginal accuracy loss, and enable a trade-off between performance and accuracy. This paper also shows that channel gating can be supported with a small set of extensions to a CNN accelerator, and implements a prototype for quantized ResNet-18 models. The accelerator shows an average speedup of 2.3x for ImageNet when the theoretical FLOP reduction is 2.8x, indicating that the hardware can effectively exploit the dynamic sparsity exposed by channel gating.
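The sketch below illustrates the gating idea for a single output channel: a partial sum over a "base" subset of input channels is compared against a threshold at each spatial location, and the remaining channels are only accumulated where the gate fires. The scalar threshold and the channel split are assumptions for illustration; in the paper the thresholds are learned jointly with the weights and the savings are realized in hardware.

```python
import numpy as np

def channel_gated_layer(x, w_base, w_rest, threshold):
    """Illustrative channel gating for one output channel of a 1x1 conv.
    x: (C, H, W) input; w_base / w_rest: weights for the 'base' and 'rest'
    input-channel subsets; threshold: gating threshold (a plain scalar here,
    an assumed simplification of the trainable gate)."""
    c_base = len(w_base)
    partial = np.tensordot(w_base, x[:c_base], axes=1)   # (H, W) base partial sum
    gate = partial > threshold                           # keep only important regions
    out = partial.copy()
    if gate.any():
        rest = np.tensordot(w_rest, x[c_base:], axes=1)  # dense here for clarity;
        out[gate] += rest[gate]                          # hardware skips the gated-off work
    return np.maximum(out, 0.0)                          # ReLU

x = np.random.randn(16, 8, 8)
y = channel_gated_layer(x,
                        w_base=np.random.randn(4),
                        w_rest=np.random.randn(12),
                        threshold=0.0)
```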
ISBN (print): 9781450366694
Memory-augmented neural networks are receiving growing attention because they can make inferences using the history stored in memory. Among them, memory networks in particular are known for their strong reasoning power and their ability to learn from a larger number of inputs than other networks. As input datasets rapidly grow, the need for large-scale memory networks continues to rise. Such large-scale memory networks provide excellent reasoning power; however, current computer infrastructure cannot achieve scalable performance due to its limited system architecture. In this paper, we propose MnnFast, a novel system architecture for large-scale memory networks that achieves fast and scalable reasoning performance. We identify the performance problems of the current architecture through extensive bottleneck analysis. Our in-depth analysis indicates that the current architecture suffers from three major performance problems: high memory bandwidth consumption, heavy computation, and cache contention. To overcome these problems, we propose three novel optimizations. First, to reduce memory bandwidth consumption, we propose a new column-based algorithm with streaming, which minimizes the size of data spills and hides most of the off-chip memory access overhead. Second, to decrease the high computational overhead, we propose a zero-skipping optimization that bypasses a large amount of output computation. Lastly, to eliminate cache contention, we propose a dedicated embedding cache that efficiently caches the embedding matrix. Our evaluations show that MnnFast is significantly effective on various types of hardware: CPU, GPU, and FPGA. MnnFast improves overall throughput by up to 5.38x, 4.34x, and 2.01x on CPU, GPU, and FPGA, respectively. Moreover, compared to CPU-based MnnFast, our FPGA-based MnnFast achieves 6.54x higher energy efficiency.
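The sketch below illustrates the two software-visible optimizations on a single memory-network read: attention weights below a cutoff are skipped entirely (zero-skipping), and the output is accumulated in column blocks so each block of the output memory streams through once. The eps cutoff, block size, and names are assumptions for illustration, not MnnFast's exact algorithm.

```python
import numpy as np

def mnn_output_zero_skip(q, M, C, eps=1e-4, block=64):
    """One memory-network read with zero-skipping and column-blocked streaming.
    q: (d,) query; M: (n, d) input memory; C: (n, d) output memory."""
    scores = M @ q
    p = np.exp(scores - scores.max())
    p /= p.sum()                                   # attention weights over memory slots
    active = np.nonzero(p > eps)[0]                # zero-skipping: drop negligible slots
    d = C.shape[1]
    out = np.zeros(d)
    for start in range(0, d, block):               # column-blocked accumulation over C
        cols = slice(start, min(start + block, d))
        out[cols] = p[active] @ C[active, cols]
    return out

q = np.random.randn(128)
M = np.random.randn(10000, 128)
C = np.random.randn(10000, 128)
o = mnn_output_zero_skip(q, M, C)
```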
ISBN (print): 9781450348928
The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve computing efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption. Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49x and 5x, and efficiency by up to 28x and 5x, respectively.
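As a software-level illustration of the "favor sequential accesses" argument, the sketch below replaces a random-probing hash join with a sort-merge join whose merge phase streams through both inputs in order. This is an assumed example of a sequential-access analytics operator, not the Mondrian Data Engine's actual operator set.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) tuples with sequential access patterns:
    sort each input once, then a single merge pass scans both in order,
    avoiding the random probes a hash join would issue."""
    left = sorted(left)
    right = sorted(right)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all matches for this key, still scanning sequentially.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            # Rewind the right cursor if the next left row shares the same key.
            j = j0 if i < len(left) and left[i][0] == lk else j
    return out

pairs = sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")])
# -> [(2, 'b', 'x'), (2, 'c', 'x')]
```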
ISBN (print): 9781467389471
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps with the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations and dominates the required power. The previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning redundant connections and having multiple connections share the same weight. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x; weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster than CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on the compressed network, corresponding to 3 TOPS on the uncompressed network, and processes the FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU, respectively. Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy efficiency, and area efficiency.
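The sketch below illustrates the compressed sparse matrix-vector product with weight sharing and zero-activation skipping that EIE accelerates: the matrix is stored column-wise with small codebook indices, and columns whose input activation is zero are skipped. The plain CSC layout and all names are assumptions for illustration, not EIE's exact encoding or dataflow.

```python
import numpy as np

def eie_style_spmv(a, col_ptr, row_idx, w_idx, codebook):
    """Compressed sparse matrix-vector product with weight sharing.
      a        : (n_in,) input activations
      col_ptr  : (n_in+1,) start offset of each column's nonzeros (CSC-like)
      row_idx  : output-row index of each stored nonzero
      w_idx    : codebook index of each stored nonzero (weight sharing)
      codebook : small table of shared weight values"""
    n_out = row_idx.max() + 1 if len(row_idx) else 0
    b = np.zeros(n_out)
    for j, aj in enumerate(a):
        if aj == 0.0:                      # skip zero activations: no work for this column
            continue
        for k in range(col_ptr[j], col_ptr[j + 1]):
            b[row_idx[k]] += codebook[w_idx[k]] * aj
    return b

# Tiny example: a 3x4 matrix with 3 nonzeros and a 2-entry codebook
codebook = np.array([0.5, -1.0])
col_ptr  = np.array([0, 1, 1, 3, 3])       # columns 1 and 3 are empty
row_idx  = np.array([0, 1, 2])
w_idx    = np.array([0, 1, 0])
a        = np.array([2.0, 5.0, 0.0, 1.0])  # column 2 is skipped (zero activation)
print(eie_style_spmv(a, col_ptr, row_idx, w_idx, codebook))   # -> [1. 0. 0.]
```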