Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore...
This paper aims at high and portable performance for tensor computations across spatial (e.g., FPGAs) and vector architectures (e.g., GPUs). The state of the art usually addresses performance portability across vector a...
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
Identifying accessible chromatin regions is a fundamental problem in epigenomics, with ATAC-seq being a commonly used assay. The exponential rise in ATAC-seq experiments has made it critical to accelerate the processing of ATAC-seq data, which can have a low signal-to-noise ratio for various reasons, including low coverage or low cell count. To denoise and identify accessible chromatin regions from noisy ATAC-seq data, the use of deep learning on 1D data (with large filter sizes, long tensor widths, and/or dilation) has recently been proposed. Convolutions over 1D data consume a majority of the runtime in these methods. However, existing implementations of the 1D convolution layer for CPUs and GPUs fail to use the underlying architecture efficiently, especially in the case of large filter sizes, long tensor widths, and dilation. Here, we present ways to accelerate the end-to-end training performance of these deep learning-based methods. We evaluate our approach on the recently released AtacWorks toolkit using modern CPUs. Compared to AtacWorks running on an Nvidia DGX-1 box with 8 V100 GPUs, we achieve up to a 2.27× speedup using just 16 CPU sockets. To achieve this, we build an efficient 1D dilated convolution layer and demonstrate reduced-precision (BFloat16) training and nearly linear scaling from 1 to 16 sockets. Code Availability: https://***/intellabs/Trans-Omics-Acceleration-Library/tree/ATAC-Seq/applications/ATAC-Seq
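To make the computational bottleneck concrete, the following is a minimal, single-channel C++ sketch of the direct 1D dilated convolution that dominates the runtime in these models. The function and variable names are ours for illustration; the paper's kernel is a blocked, vectorized, multi-channel implementation, which this sketch omits.

// Minimal single-channel sketch (illustrative names) of the direct
// 1D dilated convolution whose cost dominates training in these models.
#include <cstddef>
#include <iostream>
#include <vector>

// out[i] = sum_k in[i + k * dilation] * filt[k], over the valid region.
std::vector<float> dilated_conv1d(const std::vector<float>& in,
                                  const std::vector<float>& filt,
                                  std::size_t dilation) {
    // Effective span (receptive field) of the dilated filter.
    const std::size_t span = (filt.size() - 1) * dilation + 1;
    const std::size_t n_out = in.size() >= span ? in.size() - span + 1 : 0;
    std::vector<float> out(n_out, 0.0f);
    for (std::size_t i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        // Dilation turns the filter taps into a strided gather over the
        // input, which is what makes naive implementations inefficient
        // for large filter sizes and dilation factors.
        for (std::size_t k = 0; k < filt.size(); ++k)
            acc += in[i + k * dilation] * filt[k];
        out[i] = acc;
    }
    return out;
}

int main() {
    std::vector<float> in{1, 2, 3, 4, 5, 6}, filt{1, 1};
    for (float v : dilated_conv1d(in, filt, 2)) std::cout << v << ' ';
    std::cout << '\n';  // prints: 4 6 8 10 (each output sums taps 2 apart)
}

The paper's layer additionally vectorizes across channels, uses BFloat16 to reduce memory traffic, and parallelizes across sockets, none of which appear in this sketch.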
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
The GraphBLAS are building blocks for constructing graph algorithms as linear algebra. They are defined mathematically with the goal that they would eventually map onto a variety of programming languages. Today they exist in C, C++, Python, MATLAB®, and Julia. In this paper, we describe the GraphBLAS for the Go programming language. A particularly interesting aspect of this work is that, using the concurrency features of the Go language, we aim to build a runtime system that uses the GraphBLAS nonblocking mode by default.
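As a rough illustration of what nonblocking mode means, the sketch below shows GraphBLAS-style operations that return a handle immediately and compute only when the result is materialized. The paper realizes this with Go's goroutines and channels; to keep a single language for the examples here, this sketch uses C++ deferred futures instead, and all names (lazy_scale, Vec) are hypothetical.

#include <future>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;

// A GraphBLAS-style operation that is recorded, not executed, when called:
// it returns a handle immediately and runs only when the result is pulled.
std::future<Vec> lazy_scale(std::shared_future<Vec> x, double alpha) {
    return std::async(std::launch::deferred, [x, alpha] {
        Vec y = x.get();                // evaluate the input on first demand
        for (double& v : y) v *= alpha; // the actual computation
        return y;
    });
}

int main() {
    std::shared_future<Vec> x =
        std::async(std::launch::deferred, [] { return Vec{1, 2, 3}; }).share();
    std::future<Vec> y = lazy_scale(x, 2.0);  // returns immediately
    Vec r = y.get();  // materialization triggers the whole deferred chain
    std::cout << r[0] << ' ' << r[1] << ' ' << r[2] << '\n';  // 2 4 6
}

Deferring execution this way gives a nonblocking runtime room to fuse or reorder operations before anything runs; Go's concurrency features would additionally let the deferred work proceed in parallel, which std::launch::deferred alone does not.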
Quantum computing represents a paradigm shift for computation, requiring an entirely new computer architecture. However, there is much that can be learned from traditional classical computer engineering. In this paper,...
It is common practice to use large computational resources to train neural networks, as is known from many examples such as reinforcement learning applications. However, while massively parallel computing is often used for...
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-l...
Interoperability between libraries is often hindered by incompatible data formats, which can necessitate creating new copies of data when transferring data back and forth between different libraries. This additional data movement incurs additional runtime costs, particularly for sparse applications, where the costs of data movement often dwarf compute costs. In this paper, we investigate interoperability in the context of the C++ GraphBLAS Specification, where C++ concepts allow GraphBLAS algorithms to accept any matrix type as long as it follows the matrix interface defined in the GraphBLAS matrix concept. We first develop non-owning, lazily evaluated adapted views for a number of external data structures, including two categories of graphs defined in the Northwest Graph Library (NWGraph) and traditional pointer-based CSR data structures. These adapted views fulfill the C++ GraphBLAS matrix concept, allowing them to be used inside GraphBLAS algorithms. We then evaluate the performance of these adapted views across two kernels, matrix reduction and sparse times dense matrix multiplication (SpMM); the performance achieved by a single generic implementation with these views largely matches that achieved when operating directly on the original data structures, with a slight performance loss in one case. We then propose a mechanism for automatically discovering the availability of these views, allowing algorithms to directly accept external data structures. We also discuss potential extensions to the C++ GraphBLAS specification that might eliminate the small performance dip observed for one of the views.
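To sketch the adapted-view idea in code: a C++20 concept describes a matrix interface, a non-owning view over pointer-based CSR arrays satisfies it without copying, and a single generic kernel then works for any conforming type. The concept and all names below are our illustration, not the actual C++ GraphBLAS Specification interface.

#include <concepts>
#include <cstddef>
#include <iostream>

// Stand-in visitor type used only to state the interface requirement.
struct EntryVisitor {
    void operator()(std::size_t, std::size_t, double) const {}
};

// Hypothetical stand-in for the GraphBLAS matrix concept: anything that
// reports its shape and can visit its stored (row, column, value) entries.
template <typename M>
concept MatrixLike = requires(const M& m, EntryVisitor v) {
    { m.rows() } -> std::convertible_to<std::size_t>;
    m.for_each(v);
};

// Non-owning, lazily evaluated view over traditional pointer-based CSR.
struct CsrView {
    const std::size_t* rowptr;
    const std::size_t* colind;
    const double* vals;
    std::size_t nrows;
    std::size_t rows() const { return nrows; }
    template <typename F>
    void for_each(F&& f) const {
        for (std::size_t i = 0; i < nrows; ++i)
            for (std::size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
                f(i, colind[j], vals[j]);  // no copy of the external data
    }
};

// One generic kernel (a matrix reduction) serves every conforming type.
double reduce_sum(const MatrixLike auto& m) {
    double s = 0.0;
    m.for_each([&](std::size_t, std::size_t, double v) { s += v; });
    return s;
}

int main() {
    // 2x2 matrix [[1, 0], [2, 3]] in CSR form, as another library owns it.
    std::size_t rowptr[] = {0, 1, 3}, colind[] = {0, 0, 1};
    double vals[] = {1.0, 2.0, 3.0};
    std::cout << reduce_sum(CsrView{rowptr, colind, vals, 2}) << '\n';  // 6
}

Because the view stores only pointers, passing it into a generic algorithm moves no data; automatically discovering such views, as the paper proposes, would let algorithms accept the external structures directly.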