
Refine Search Results

Document Type

  • 19 journal articles
  • 16 conference papers

Collection Scope

  • 35 electronic documents
  • 0 print holdings

Date Distribution

Subject Classification

  • 24 Engineering
    • 17 Computer Science and Technology...
    • 13 Software Engineering
    • 8 Bioengineering
    • 4 Biomedical Engineering (degree awardable in...
    • 2 Electronic Science and Technology (degree awa...
    • 2 Information and Communication Engineering
    • 2 Control Science and Engineering
    • 2 Chemical Engineering and Technology
    • 1 Mechanical Engineering
    • 1 Instrument Science and Technology
    • 1 Power Engineering and Engineering Ther...
    • 1 Architecture
    • 1 Civil Engineering
    • 1 Aeronautics and Astronautics Science and Tech...
    • 1 Nuclear Science and Technology
  • 23 Science
    • 14 Mathematics
    • 8 Biology
    • 7 Statistics (degree awardable in Science...
    • 4 Physics
    • 3 Geophysics
    • 2 Chemistry
    • 1 Astronomy
    • 1 Geology
  • 7 Management
    • 6 Management Science and Engineering (degree awa...
  • 4 Law
    • 4 Sociology
  • 3 Medicine
    • 3 Basic Medicine (degree awardable in Medicine...
    • 3 Clinical Medicine
    • 3 Pharmacy (degree awardable in Medicine, Sci...
    • 1 Public Health and Preventive Medi...
  • 1 Economics
    • 1 Applied Economics

Topics

  • 3 篇 convolution
  • 3 篇 optimization
  • 3 篇 coprocessors
  • 3 篇 kernel
  • 2 篇 parallel process...
  • 2 篇 reinforcement le...
  • 2 篇 deep learning
  • 2 篇 deep neural netw...
  • 2 篇 vectors
  • 2 篇 costs
  • 2 篇 instruction sets
  • 1 篇 covid-19
  • 1 篇 polygenic predic...
  • 1 篇 chemical activat...
  • 1 篇 lattices
  • 1 篇 approximation al...
  • 1 篇 message systems
  • 1 篇 magnetic resonan...
  • 1 篇 computational fl...
  • 1 篇 trees (mathemati...

Institutions

  • 6 篇 parallel computi...
  • 6 篇 parallel computi...
  • 5 篇 parallel computi...
  • 2 篇 school of comput...
  • 2 篇 riken center for...
  • 2 篇 parallel computi...
  • 2 篇 center for space...
  • 2 篇 department of as...
  • 2 篇 division of sola...
  • 2 篇 chugai pharmaceu...
  • 2 篇 engineering mech...
  • 1 篇 parallel computi...
  • 1 篇 iit kharagpur
  • 1 篇 stony brook univ...
  • 1 篇 parallel computi...
  • 1 篇 university of tu...
  • 1 篇 university of co...
  • 1 篇 the ohio state u...
  • 1 篇 eindhoven univer...
  • 1 篇 product architec...

Authors

  • 8 篇 mudigere dheevat...
  • 8 篇 kaul bharat
  • 7 篇 das dipankar
  • 6 篇 banerjee kunal
  • 6 篇 dubey pradeep
  • 6 篇 avancha sasikant...
  • 5 篇 kundu abhisek
  • 5 篇 bharat kaul
  • 4 篇 sanchit misra
  • 4 篇 mellempudi navee...
  • 3 篇 kalamkar dhiraj
  • 3 篇 alexander heinec...
  • 3 篇 mikhail smelyans...
  • 3 篇 kiran pamnany
  • 3 篇 pradeep dubey
  • 2 篇 misra sanchit
  • 2 篇 santara anirban
  • 2 篇 benomar othman
  • 2 篇 aasawat tanuj kr
  • 2 篇 shustrov nikita

Language

  • 35 English

Search criteria: Institution = "Parallel Computing Lab - India"
35 records, showing 1-10
Distributed Hessian-free optimization for deep neural network
31st AAAI Conference on Artificial Intelligence, AAAI 2017
Authors: He, Xi; Mudigere, Dheevatsa; Smelyanskiy, Mikhail; Takáč, Martin. Affiliations: Industrial and Systems Engineering, Lehigh University, United States; Parallel Computing Lab, Intel Labs India; Parallel Computing Lab, Intel Labs SC, United States
Training a deep neural network is a high-dimensional and highly non-convex optimization problem. In this paper, we revisit the Hessian-free optimization method for deep networks with negative curvature direction detection...
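The core primitive behind Hessian-free optimization is the Hessian-vector product, which can be formed without ever materializing the Hessian. A minimal illustrative sketch (central finite differences on a toy quadratic; not the paper's distributed implementation):

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Finite-difference approximation of H @ v, where H is the Hessian
    of the loss at w: Hv ~ (grad(w + eps*v) - grad(w - eps*v)) / (2*eps).
    Hessian-free methods only ever need such products, never H itself."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

# Toy quadratic loss L(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda w: A @ w              # gradient of the quadratic is A w
v = np.array([1.0, 0.0])
hv = hessian_vector_product(grad, np.zeros(2), v)
# hv approximates A @ v = [3, 1]
```

A conjugate-gradient solver can then use this product to approximately solve H p = -g for the update direction p.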
Mixed Precision Training of Convolutional Neural Networks Using Integer Operations
6th International Conference on Learning Representations, ICLR 2018
Authors: Das, Dipankar; Mellempudi, Naveen; Mudigere, Dheevatsa; Kalamkar, Dhiraj; Avancha, Sasikanth; Banerjee, Kunal; Sridharan, Srinivas; Vaidyanathan, Karthik; Kaul, Bharat; Georganas, Evangelos; Heinecke, Alexander; Dubey, Pradeep; Corbal, Jesus; Shustrov, Nikita; Dubtsov, Roma; Fomenko, Evarist; Pirogov, Vadim. Affiliations: Parallel Computing Lab, Intel Labs India; Parallel Computing Lab, Intel Labs SC; Product Architecture Group, Intel, OR, United States; Software Services Group, Intel, OR, United States
The state-of-the-art (SOTA) for mixed precision training is dominated by variants of low-precision floating point operations, in particular FP16 accumulating into FP32 (Micikevicius et al., 2017). On the other hand...
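The integer-operation approach above relies on narrow multiplies feeding a wide accumulator. A simplified sketch of that pattern (int16 operands, int32/int64 accumulation, a shared power-of-two scale; illustrative only, not the paper's training scheme):

```python
import numpy as np

def int16_dot_int32_acc(a_fp, b_fp, scale=2**8):
    """Quantize two FP32 vectors to int16 with a shared power-of-two
    scale, multiply elementwise, and accumulate the products in a wide
    integer -- the narrow-multiply / wide-accumulate pattern integer
    training builds on."""
    a_q = np.clip(np.round(a_fp * scale), -32768, 32767).astype(np.int16)
    b_q = np.clip(np.round(b_fp * scale), -32768, 32767).astype(np.int16)
    # widen BEFORE multiplying so the products cannot overflow int16
    acc = np.sum(a_q.astype(np.int32) * b_q.astype(np.int32), dtype=np.int64)
    return acc / (scale * scale)    # rescale back to a float result

a = np.array([0.5, -1.25, 2.0], dtype=np.float32)
b = np.array([1.0, 0.5, -0.25], dtype=np.float32)
approx = int16_dot_int32_acc(a, b)
# the exact FP dot product is 0.5*1.0 - 1.25*0.5 - 2.0*0.25 = -0.625
```

Hardware instructions such as AVX-512 VNNI implement exactly this multiply-and-widen-accumulate step in one operation.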
Translation validation of loop and arithmetic transformations in the presence of recurrences
17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools and Theory for Embedded Systems, LCTES 2016
Authors: Banerjee, Kunal; Mandal, Chittaranjan; Sarkar, Dipankar. Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India; Intel Parallel Computing Lab, Bangalore, India
Compiler optimization of array-intensive programs involves extensive application of loop transformations and arithmetic transformations. Hence, translation validation of array-intensive programs requires manipulation...
BlackOut: Speeding up recurrent neural network language models with very large vocabularies
4th International Conference on Learning Representations, ICLR 2016
Authors: Ji, Shihao; Vishwanathan, S.V.N.; Satish, Nadathur; Anderson, Michael J.; Dubey, Pradeep. Affiliations: Parallel Computing Lab., Intel, India; Univ. of California, Santa Cruz, United States
We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million-word vocabularies. BlackOut is motivated by using a discriminative loss, and...
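The general idea behind sampling-based softmax approximations like BlackOut is to score only the target word plus a handful of sampled negatives instead of the full vocabulary. A generic sketch of that idea (BlackOut's actual weighted estimator differs in its sampling distribution and loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_output_scores(h, W, target, num_samples, vocab_size):
    """Compute output scores for the target word plus `num_samples`
    uniformly sampled negatives, instead of all `vocab_size` words.
    This reduces the per-step output cost from O(V) to O(k)."""
    negatives = rng.choice(vocab_size, size=num_samples, replace=False)
    negatives = negatives[negatives != target]   # drop accidental hits
    idx = np.concatenate(([target], negatives))  # target goes first
    logits = W[idx] @ h                          # only 1+k dot products
    return idx, logits

V, d = 10_000, 16                  # toy vocabulary and hidden size
W = rng.standard_normal((V, d))    # output embedding matrix
h = rng.standard_normal(d)         # current hidden state
idx, logits = sampled_output_scores(h, W, target=42, num_samples=20,
                                    vocab_size=V)
# training would apply a softmax/cross-entropy over just these scores
```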
Lattice QCD on Intel® Xeon Phi™ coprocessors
28th International Supercomputing Conference on Supercomputing, ISC 2013
Authors: Joó, Bálint; Kalamkar, Dhiraj D.; Vaidyanathan, Karthikeyan; Smelyanskiy, Mikhail; Pamnany, Kiran; Lee, Victor W.; Dubey, Pradeep; Watson III, William. Affiliations: Thomas Jefferson National Accelerator Facility, Newport News, VA, United States; Parallel Computing Lab., Intel Corporation, Bangalore, India; Parallel Computing Lab., Intel Corporation, Santa Clara, CA, United States
Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of importance in studies of...
Data-race detection: The missing piece for an end-to-end semantic equivalence checker for parallelizing transformations of array-intensive programs
3rd ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, ARRAY 2016
Authors: Banerjee, Kunal; Banerjee, Soumyadip; Sarkar, Santonu. Affiliations: Dept of Computer Sc and Engg, IIT Kharagpur, India; Dept of CSIS, BITS Pilani-Goa, India; Intel Parallel Computing Lab., Bangalore, India
The parallelizing transformation (hand-crafted or compiler-assisted) is error-prone, as it is often performed without verifying any semantic equivalence with the sequential counterpart. Even when the parallel program c...
Practical Massively Parallel Monte-Carlo Tree Search Applied to Molecular Design
9th International Conference on Learning Representations, ICLR 2021
Authors: Yang, Xiufeng; Aasawat, Tanuj Kr; Yoshizoe, Kazuki. Affiliations: Chugai Pharmaceutical Co. Ltd, Japan; Parallel Computing Lab - India, Intel Labs, India; RIKEN Center for Advanced Intelligence Project, Japan
It is common practice to use large computational resources to train neural networks, as is known from many examples such as reinforcement learning applications. However, while massively parallel computing is often used for...
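The paper's contribution is distributing MCTS; the sequential building block it parallelizes is the UCB-style child selection rule, sketched below with made-up visit statistics (not the paper's code or parallelization scheme):

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=1.4):
    """UCB1 score MCTS uses to pick which child to descend into:
    exploitation (mean value so far) plus an exploration bonus.
    Unvisited children score +inf so every action is tried once."""
    if child_visits == 0:
        return math.inf
    mean = child_value_sum / child_visits
    return mean + c * math.sqrt(math.log(parent_visits) / child_visits)

# Three children of a node that has been visited 100 times:
stats = [(30.0, 50), (12.0, 15), (2.0, 2)]   # (value_sum, visits)
scores = [ucb1(v, n, 100) for v, n in stats]
best = max(range(3), key=lambda i: scores[i])
# the rarely visited third child wins on its exploration bonus
```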
Ternary Residual Networks
arXiv, 2017
Authors: Kundu, Abhisek; Banerjee, Kunal; Mellempudi, Naveen; Mudigere, Dheevatsa; Das, Dipankar; Kaul, Bharat; Dubey, Pradeep. Affiliations: Parallel Computing Lab, Bangalore, India; Parallel Computing Lab, Santa Clara, CA, United States
Sub-8-bit representations of DNNs incur some discernible loss of accuracy despite rigorous (re)training at low precision. Such a loss of accuracy essentially makes them equivalent to a much shallower counterpart, diminis...
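Ternary networks constrain each weight to {-a, 0, +a}. A common threshold-based recipe for that conversion is sketched below (one standard formulation, not necessarily this paper's exact method):

```python
import numpy as np

def ternarize(w, t=0.7):
    """Threshold-based ternary quantization: weights below the threshold
    delta = t * mean(|w|) collapse to 0; the rest become +/- alpha,
    where alpha is the mean magnitude of the surviving weights."""
    delta = t * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0
    return alpha * np.sign(w) * mask

w = np.array([0.9, -0.05, 0.4, -0.8, 0.02])
wt = ternarize(w)
# small weights collapse to 0; large ones share a single magnitude alpha
```

Storing only the sign pattern plus one scalar per tensor is what makes sub-8-bit deployment attractive.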
AutoSparse: Towards Automated Sparse Training of Deep Neural Networks
arXiv, 2023
Authors: Kundu, Abhisek; Mellempudi, Naveen K.; Vooturi, Dharma Teja; Kaul, Bharat; Dubey, Pradeep. Affiliation: Parallel Computing Lab, Intel Labs India
Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore...
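Threshold-based pruning of the kind the abstract mentions zeroes out weights whose magnitude falls below a per-layer threshold. A minimal sketch with a fixed threshold (in AutoSparse-style methods the threshold itself is learned; that part is omitted here):

```python
import numpy as np

def prune_with_threshold(w, threshold):
    """Magnitude pruning: zero out weights with |w| below the threshold
    and return both the sparse weights and the binary mask.  In learnable
    -threshold schemes this threshold is a trained parameter per layer."""
    mask = (np.abs(w) >= threshold).astype(w.dtype)
    return w * mask, mask

w = np.array([0.30, -0.02, 0.15, -0.40, 0.01])
sparse_w, mask = prune_with_threshold(w, threshold=0.10)
sparsity = 1.0 - mask.mean()   # fraction of weights zeroed out
```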
Mixed low-precision deep learning inference using dynamic fixed point
arXiv, 2017
Authors: Mellempudi, Naveen; Kundu, Abhisek; Das, Dipankar; Mudigere, Dheevatsa; Kaul, Bharat. Affiliation: Parallel Computing Lab, Intel Labs, Bangalore, India
We propose a cluster-based quantization method to convert pre-trained full-precision weights into ternary weights with minimal impact on accuracy. In addition, we constrain the activations to 8 bits, thus enabl...
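Dynamic fixed point, the 8-bit activation format the abstract refers to, stores an integer mantissa per value plus one shared exponent per tensor. A sketch from the general definition (the paper's clustering step for weights is not shown):

```python
import numpy as np

def to_dynamic_fixed_point(x, bits=8):
    """Dynamic fixed point: integer mantissas with one shared power-of-two
    exponent per tensor, chosen so the largest magnitude still fits in
    `bits` (sign included).  Each value is represented as q * 2**exp."""
    qmax = 2 ** (bits - 1) - 1                              # 127 for 8 bits
    exp = int(np.ceil(np.log2(np.max(np.abs(x)) / qmax)))   # shared exponent
    q = np.clip(np.round(x / 2.0 ** exp), -qmax - 1, qmax).astype(np.int8)
    return q, exp

x = np.array([0.50, -0.25, 0.125, 0.8])
q, exp = to_dynamic_fixed_point(x)
recovered = q.astype(np.float64) * 2.0 ** exp   # dequantize
# quantization error is bounded by the step size 2**exp
```

Because the exponent is shared, the inner products reduce to pure integer arithmetic with a single final rescale.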