
Refine Search Results

Document Type

  • 19 journal articles
  • 15 conference papers

Collection Scope

  • 34 electronic documents
  • 0 print holdings

Date Distribution

Subject Classification

  • 27 Engineering
    • 15 Computer Science and Technology...
    • 12 Software Engineering
    • 12 Bioengineering
    • 7 Information and Communication Engineering
    • 3 Control Science and Engineering
    • 3 Chemical Engineering and Technology
    • 3 Biomedical Engineering (grantable...
    • 2 Mechanical Engineering
    • 2 Power Engineering and Engineering Therm...
    • 1 Instrument Science and Technology
    • 1 Electronic Science and Technology (gr...
    • 1 Nuclear Science and Technology
    • 1 Safety Science and Engineering
  • 24 Science
    • 14 Mathematics
    • 12 Biology
    • 6 Statistics (grantable as Science,...
    • 4 Physics
    • 3 Chemistry
    • 2 Geophysics
    • 1 Systems Science
  • 5 Management
    • 5 Management Science and Engineering (gr...
  • 4 Law
    • 4 Sociology
  • 3 Medicine
    • 3 Basic Medicine (grantable as Medicine...
    • 3 Clinical Medicine
    • 3 Pharmacy (grantable as Medicine, Sci...
    • 1 Public Health and Preventive Medi...

Topics

  • 4 deep neural netw...
  • 2 parallel process...
  • 2 reinforcement le...
  • 2 deep learning
  • 2 magnetic resonan...
  • 2 functional align...
  • 2 fmri
  • 2 convolution
  • 2 program processo...
  • 1 covid-19
  • 1 data parallel
  • 1 conferences
  • 1 chemical activat...
  • 1 spatial
  • 1 trees (mathemati...
  • 1 computer archite...
  • 1 sorting
  • 1 data structures ...
  • 1 standard referen...
  • 1 neural networks

Institutions

  • 11 parallel computi...
  • 7 parallel computi...
  • 2 riken center for...
  • 2 department of co...
  • 2 parallel computi...
  • 2 center for space...
  • 2 department of as...
  • 2 division of sola...
  • 2 chugai pharmaceu...
  • 2 engineering mech...
  • 2 parallel computi...
  • 1 parallel computi...
  • 1 parallel computi...
  • 1 health ethics an...
  • 1 iit kharagpur
  • 1 stony brook univ...
  • 1 department of ph...
  • 1 parallel computi...
  • 1 university of tu...
  • 1 university of co...

Authors

  • 10 kaul bharat
  • 9 mudigere dheevat...
  • 9 das dipankar
  • 7 avancha sasikant...
  • 6 kundu abhisek
  • 6 mellempudi navee...
  • 6 dubey pradeep
  • 4 kalamkar dhiraj
  • 3 van halem irmhil...
  • 3 tithi jesmin jah...
  • 3 westerlund magnu...
  • 3 georganas evange...
  • 3 banerjee kunal
  • 3 vooturi dharma t...
  • 3 kararigas georgi...
  • 3 heinecke alexand...
  • 3 zicari roberto v...
  • 2 santara anirban
  • 2 benomar othman
  • 2 aasawat tanuj kr
Language

  • 34 English
Search criteria: Institution = "Parallel Computing Lab at Intel Labs"
34 records; showing 1-10
MIXED PRECISION TRAINING OF CONVOLUTIONAL NEURAL NETWORKS USING INTEGER OPERATIONS
6th International Conference on Learning Representations, ICLR 2018
Authors: Das, Dipankar; Mellempudi, Naveen; Mudigere, Dheevatsa; Kalamkar, Dhiraj; Avancha, Sasikanth; Banerjee, Kunal; Sridharan, Srinivas; Vaidyanathan, Karthik; Kaul, Bharat; Georganas, Evangelos; Heinecke, Alexander; Dubey, Pradeep; Corbal, Jesus; Shustrov, Nikita; Dubtsov, Roma; Fomenko, Evarist; Pirogov, Vadim. Affiliations: Parallel Computing Lab, Intel Labs, India; Parallel Computing Lab, Intel Labs, SC; Product Architecture Group, Intel, OR, United States; Software Services Group, Intel, OR, United States
The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low-precision floating-point operations, in particular FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand...
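This record (and its arXiv duplicate at the end of this page) trains CNNs with integer operations, with narrow-integer multiplies accumulating into a wider integer, by analogy with the FP16-into-FP32 pattern the abstract cites. A minimal sketch of such a quantized dot product follows; the symmetric int16 scaling is an illustrative assumption, not the paper's dynamic fixed point scheme:

```python
import numpy as np

def quantize_int16(x):
    # Symmetric linear quantization to int16; this scale choice is an
    # illustrative assumption, not the paper's exact format.
    scale = np.max(np.abs(x)) / 32767.0
    q = np.round(x / scale).astype(np.int16)
    return q, scale

def int_mixed_precision_dot(a, b):
    # int16 x int16 products are summed in a wider integer accumulator,
    # mirroring the FP16-multiply / FP32-accumulate pattern with integers.
    qa, sa = quantize_int16(a)
    qb, sb = quantize_int16(b)
    acc = np.dot(qa.astype(np.int64), qb.astype(np.int64))
    return float(acc) * sa * sb

x = np.random.randn(1024).astype(np.float32)
y = np.random.randn(1024).astype(np.float32)
print(np.dot(x, y), int_mixed_precision_dot(x, y))  # values should be close
```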
Distributed Hessian-free optimization for deep neural network
31st AAAI Conference on Artificial Intelligence, AAAI 2017
Authors: He, Xi; Mudigere, Dheevatsa; Smelyanskiy, Mikhail; Takáč, Martin. Affiliations: Industrial and Systems Engineering, Lehigh University, United States; Parallel Computing Lab, Intel Labs, India; Parallel Computing Lab, Intel Labs, SC, United States
Training a deep neural network is a high-dimensional and highly non-convex optimization problem. In this paper, we revisit the Hessian-free optimization method for deep networks with negative-curvature direction detection...
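Hessian-free (truncated Newton) methods never materialize the Hessian; they only need Hessian-vector products inside a conjugate-gradient solve. Below is a generic single-node sketch under that definition, using a finite-difference Hessian-vector product; the paper's distributed evaluation and negative-curvature handling are not reproduced here:

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-4):
    # Hessian-vector product via central differences of the gradient.
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def hessian_free_step(grad_fn, w, cg_iters=10, damping=1e-3):
    # Approximately solve (H + damping*I) p = -g with conjugate gradient.
    g = grad_fn(w)
    p = np.zeros_like(w)
    r = -g.copy()            # residual at p = 0
    d = r.copy()
    rs = r @ r
    for _ in range(cg_iters):
        Hd = hvp(grad_fn, w, d) + damping * d
        alpha = rs / (d @ Hd)
        p += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return p                 # update direction: w <- w + p

# Toy quadratic check: f(w) = 0.5 w^T A w - b^T w, so grad(w) = A w - b.
A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, -1.0])
w = np.zeros(2) + hessian_free_step(lambda w: A @ w - b, np.zeros(2), cg_iters=25)
print(w, np.linalg.solve(A, b))  # should roughly agree
```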
Parallelizing Julia with a Non-Invasive DSL
31st European Conference on Object-Oriented Programming, ECOOP 2017
Authors: Anderson, Todd A.; Liu, Hai; Kuper, Lindsey; Totoni, Ehsan; Vitek, Jan; Shpeisman, Tatiana. Affiliations: Parallel Computing Lab, Intel Labs, Chile; Northeastern University; Czech Technical University, Prague, Czech Republic
Computational scientists often prototype software using productivity languages that offer high-level programming abstractions. When higher performance is needed, they are obliged to rewrite their code in a lower-level...
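The Julia DSL itself is not shown in this record. As a hedged analogue of the "non-invasive" idea, annotating productivity-language code instead of rewriting it in a lower-level language, here is a Python sketch using numba, whose decorator-driven parallelization plays a similar role (numba is an assumption of this example, not part of the paper):

```python
import numpy as np
from numba import njit, prange  # third-party: pip install numba

# The decorator is the only change to the original NumPy-style code: the
# function body stays in the productivity language while the JIT compiles
# and parallelizes the loop.
@njit(parallel=True)
def saxpy(a, x, y):
    out = np.empty_like(x)
    for i in prange(x.shape[0]):  # prange marks the parallel loop
        out[i] = a * x[i] + y[i]
    return out

x = np.ones(1_000_000)
y = np.ones(1_000_000)
print(saxpy(2.0, x, y)[:3])  # [3. 3. 3.]
```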
Ternary neural networks with fine-grained quantization
arXiv, 2017
Authors: Mellempudi, Naveen; Kundu, Abhisek; Mudigere, Dheevatsa; Das, Dipankar; Kaul, Bharat; Dubey, Pradeep. Affiliations: Parallel Computing Lab, Intel Labs, Bangalore; Parallel Computing Lab, Intel Labs, Santa Clara, CA
We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full-precision models, while also constraining activations to 8 and 4 bits. Using this method, we demonstrate minimal loss in classifi...
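A rough sketch of ternarization with per-group scales follows. The grouping is in the spirit of FGQ, but the 0.7·mean|w| threshold below is the common ternary-weight-network heuristic, used here as a stand-in for the paper's exact scale estimation:

```python
import numpy as np

def ternarize_grouped(w, group_size=64):
    # Split the flattened weights into fixed-size groups, then map each
    # group to {-alpha, 0, +alpha} with its own scale alpha.
    g = w.reshape(-1, group_size)
    out = np.zeros_like(g)
    for i, row in enumerate(g):
        delta = 0.7 * np.mean(np.abs(row))      # heuristic threshold
        mask = np.abs(row) > delta
        alpha = np.abs(row[mask]).mean() if mask.any() else 0.0
        out[i] = alpha * np.sign(row) * mask
    return out.reshape(w.shape)

w = np.random.randn(4, 64).astype(np.float32)
wt = ternarize_grouped(w)
print(np.unique(np.round(wt[0], 4)))  # three levels per group: -a, 0, +a
```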
Mixed precision training with 8-bit floating point
arXiv, 2019
Authors: Mellempudi, Naveen; Srinivasan, Sudarshan; Das, Dipankar; Kaul, Bharat. Affiliations: Parallel Computing Lab, Intel Labs
Reduced-precision computation for deep neural networks is one of the key areas addressing the widening 'compute gap' driven by the exponential growth in model size. In recent years, deep learning training has l...
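To make the 8-bit floating-point idea concrete, here is a simulation sketch that rounds values to an E5M2-style 3-bit significand; exponent clamping and subnormals are deliberately ignored, and the format details are assumptions for illustration, not the paper's training recipe:

```python
import numpy as np

def round_to_e5m2_significand(x):
    # x = m * 2**e with 0.5 <= |m| < 1; keep 3 significand bits
    # (1 implicit + 2 explicit mantissa bits), as in an E5M2-style format.
    # Exponent-range limits and subnormals are ignored in this sketch.
    m, e = np.frexp(x)
    m = np.round(m * 8.0) / 8.0
    return np.ldexp(m, e)

x = np.float32([0.1, 1.0, 3.14159, -2.5e-3])
print(round_to_e5m2_significand(x))
```

Work in this line typically keeps master weights and certain reductions in higher precision, which is what makes such narrow formats viable for training.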
GrAPL 2022 Keynote Speaker: GraphBLAS Beyond Simple Graphs
IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)
Authors: Tim Mattson. Affiliations: Parallel Computing Lab, Intel Labs
AUTOSPARSE: TOWARDS AUTOMATED SPARSE TRAINING OF DEEP NEURAL NETWORKS
arXiv, 2023
Authors: Kundu, Abhisek; Mellempudi, Naveen K.; Vooturi, Dharma Teja; Kaul, Bharat; Dubey, Pradeep. Affiliations: Parallel Computing Lab, Intel Labs, India
Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore...
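A generic sketch of the "learnable threshold" ingredient follows: a sigmoid-relaxed magnitude mask whose threshold is itself a trained parameter. This is a common formulation used here for illustration, not AutoSparse's exact method; PyTorch and the sharpness constant are assumptions:

```python
import torch
import torch.nn as nn

class ThresholdPrunedLinear(nn.Module):
    # Linear layer whose weights are masked by a learnable magnitude
    # threshold; the sigmoid makes the mask (softly) differentiable so
    # the threshold can be trained jointly with the weights.
    def __init__(self, in_features, out_features, sharpness=50.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.threshold = nn.Parameter(torch.tensor(0.01))
        self.sharpness = sharpness

    def forward(self, x):
        mask = torch.sigmoid((self.weight.abs() - self.threshold) * self.sharpness)
        return x @ (self.weight * mask).t()

layer = ThresholdPrunedLinear(128, 64)
y = layer(torch.randn(8, 128))
print(y.shape)  # torch.Size([8, 64])
```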
PRACTICAL MASSIVELY PARALLEL MONTE-CARLO TREE SEARCH APPLIED TO MOLECULAR DESIGN
9th International Conference on Learning Representations, ICLR 2021
Authors: Yang, Xiufeng; Aasawat, Tanuj Kr; Yoshizoe, Kazuki. Affiliations: Chugai Pharmaceutical Co., Ltd., Japan; Parallel Computing Lab - India, Intel Labs, India; RIKEN Center for Advanced Intelligence Project, Japan
It is common practice to use large computational resources to train neural networks, as is well known from many examples such as reinforcement learning applications. However, while massively parallel computing is often used for...
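The molecular-design application cannot be reconstructed from this record, but the UCT core that such systems parallelize is standard. Here is a single-threaded skeleton under generic assumptions; the state API (expand_fn, rollout_fn) and the exploration constant are placeholders, not the paper's distributed algorithm:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, c=1.4):
    # Upper confidence bound used to pick a child during selection.
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts_iteration(root, expand_fn, rollout_fn):
    # Selection: descend while the current node is fully expanded.
    node = root
    while node.children and all(ch.visits > 0 for ch in node.children):
        node = max(node.children, key=ucb1)
    # Expansion: expand_fn returns the child states of a state (placeholder).
    if not node.children:
        node.children = [Node(s, node) for s in expand_fn(node.state)]
    if node.children:
        unvisited = [ch for ch in node.children if ch.visits == 0]
        node = random.choice(unvisited or node.children)
    # Simulation: rollout_fn scores a state (placeholder reward model).
    reward = rollout_fn(node.state)
    # Backpropagation: update statistics along the path to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent
```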
Mixed low-precision deep learning inference using dynamic fixed point
arXiv, 2017
Authors: Mellempudi, Naveen; Kundu, Abhisek; Das, Dipankar; Mudigere, Dheevatsa; Kaul, Bharat. Affiliations: Parallel Computing Lab, Intel Labs, Bangalore, India
We propose a cluster-based quantization method to convert pre-trained full-precision weights into ternary weights with minimal impact on accuracy. In addition, we also constrain the activations to 8 bits, thus enabl...
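Dynamic fixed point, named in this record's title, pairs an integer tensor with a shared exponent chosen from the tensor's observed range. A minimal round-trip sketch of that representation follows; the exponent-selection rule below is one simple choice, not necessarily the paper's:

```python
import numpy as np

def to_dfp(x, bits=8):
    # Pick the largest fractional length fl such that all values fit in a
    # signed `bits`-bit integer; 2**-fl is the shared exponent.
    # (int8 storage below assumes bits <= 8.)
    max_abs = float(np.max(np.abs(x)))
    assert max_abs > 0, "all-zero tensor: any exponent works"
    fl = int(np.floor(np.log2((2 ** (bits - 1) - 1) / max_abs)))
    q = np.round(x * 2.0 ** fl).astype(np.int8)
    return q, fl

def from_dfp(q, fl):
    return q.astype(np.float32) * 2.0 ** -fl

x = np.random.randn(6).astype(np.float32)
q, fl = to_dfp(x)
print(x, from_dfp(q, fl), fl)  # reconstruction error bounded by 2**-(fl+1)
```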
Mixed precision training of convolutional neural networks using integer operations
arXiv, 2018
Authors: Das, Dipankar; Mellempudi, Naveen; Mudigere, Dheevatsa; Kalamkar, Dhiraj; Avancha, Sasikanth; Banerjee, Kunal; Sridharan, Srinivas; Vaidyanathan, Karthik; Kaul, Bharat; Georganas, Evangelos; Heinecke, Alexander; Dubey, Pradeep; Corbal, Jesus; Shustrov, Nikita; Dubtsov, Roma; Fomenko, Evarist; Pirogov, Vadim. Affiliations: Parallel Computing Lab, Intel Labs, India; Parallel Computing Lab, Intel Labs, SC; Product Architecture Group, Intel, OR, United States; Software Services Group, Intel, OR, United States
The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low-precision floating-point operations, in particular FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand...