Sparse training is emerging as a promising avenue for reducing the computational cost of training neural networks. Several recent studies have proposed pruning methods using learnable thresholds to efficiently explore...
This paper aims at high and portable performance for tensor computations across spatial (e.g., FPGAs) and vector architectures (e.g., GPUs). The state of the art usually addresses performance portability across vector a...
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
Identifying accessible chromatin regions is a fundamental problem in epigenomics, with ATAC-seq being a commonly used assay. The exponential rise in ATAC-seq experiments has made it critical to accelerate the processing of ATAC-seq data, which can have a low signal-to-noise ratio for various reasons, including low coverage or low cell count. To denoise and identify accessible chromatin regions from noisy ATAC-seq data, the use of deep learning on 1D data (with large filter sizes, long tensor widths, and/or dilation) has recently been proposed. Convolutions over 1D data consume a majority of the runtime in these methods. However, existing implementations of the 1D convolution layer for CPUs and GPUs fail to use the underlying architecture efficiently, especially in the case of large filter sizes, long tensor widths, and dilation. Here, we present ways to accelerate the end-to-end training performance of these deep learning-based methods. We evaluate our approach on the recently released AtacWorks toolkit using modern CPUs. Compared to AtacWorks running on an Nvidia DGX-1 box with 8 V100 GPUs, we achieve up to a 2.27× speedup using just 16 CPU sockets. To achieve this, we build an efficient 1D dilated convolution layer and demonstrate reduced-precision (BFloat16) training and nearly linear scaling from 1 to 16 sockets. Code Availability: https://***/intellabs/Trans-Omics-Acceleration-Library/tree/ATAC-Seq/applications/ATAC-Seq
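To make the computational bottleneck concrete, the following is a minimal, single-channel C++ sketch of the direct 1D dilated convolution that dominates the runtime in these models. The function and variable names are ours for illustration; the paper's kernel is a blocked, vectorized, multi-channel implementation, which this sketch omits.

// Minimal single-channel sketch (illustrative names) of the direct
// 1D dilated convolution whose cost dominates training in these models.
#include <cstddef>
#include <iostream>
#include <vector>

// out[i] = sum_k in[i + k * dilation] * filt[k], over the valid region.
std::vector<float> dilated_conv1d(const std::vector<float>& in,
                                  const std::vector<float>& filt,
                                  std::size_t dilation) {
    // Effective span (receptive field) of the dilated filter.
    const std::size_t span = (filt.size() - 1) * dilation + 1;
    const std::size_t n_out = in.size() >= span ? in.size() - span + 1 : 0;
    std::vector<float> out(n_out, 0.0f);
    for (std::size_t i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        // Dilation turns the filter taps into a strided gather over the
        // input, which is what makes naive implementations inefficient
        // for large filter sizes and dilation factors.
        for (std::size_t k = 0; k < filt.size(); ++k)
            acc += in[i + k * dilation] * filt[k];
        out[i] = acc;
    }
    return out;
}

int main() {
    std::vector<float> in{1, 2, 3, 4, 5, 6}, filt{1, 1};
    for (float v : dilated_conv1d(in, filt, 2)) std::cout << v << ' ';
    std::cout << '\n';  // prints: 4 6 8 10 (each output sums taps 2 apart)
}

The paper's layer additionally vectorizes across channels, uses BFloat16 to reduce memory traffic, and parallelizes across sockets, none of which appear in this sketch.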
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
The GraphBLAS are building blocks for constructing graph algorithms as linear algebra. They are defined mathematically with the goal that they would eventually map onto a variety of programming languages. Today they exist in C, C++, Python, MATLAB®, and Julia. In this paper, we describe the GraphBLAS for the Go programming language. A particularly interesting aspect of this work is that, using the concurrency features of the Go language, we aim to build a runtime system that uses the GraphBLAS nonblocking mode by default.
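As a rough illustration of what nonblocking mode means, the sketch below shows GraphBLAS-style operations that return a handle immediately and compute only when the result is materialized. The paper realizes this with Go's goroutines and channels; to keep a single language for the examples here, this sketch uses C++ deferred futures instead, and all names (lazy_scale, Vec) are hypothetical.

#include <future>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;

// A GraphBLAS-style operation that is recorded, not executed, when called:
// it returns a handle immediately and runs only when the result is pulled.
std::future<Vec> lazy_scale(std::shared_future<Vec> x, double alpha) {
    return std::async(std::launch::deferred, [x, alpha] {
        Vec y = x.get();                // evaluate the input on first demand
        for (double& v : y) v *= alpha; // the actual computation
        return y;
    });
}

int main() {
    std::shared_future<Vec> x =
        std::async(std::launch::deferred, [] { return Vec{1, 2, 3}; }).share();
    std::future<Vec> y = lazy_scale(x, 2.0);  // returns immediately
    Vec r = y.get();  // materialization triggers the whole deferred chain
    std::cout << r[0] << ' ' << r[1] << ' ' << r[2] << '\n';  // 2 4 6
}

Deferring execution this way gives a nonblocking runtime room to fuse or reorder operations before anything runs; Go's concurrency features would additionally let the deferred work proceed in parallel, which std::launch::deferred alone does not.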
Quantum computing represents a paradigm shift for computation, requiring an entirely new computer architecture. However, there is much that can be learned from traditional classical computer engineering. In this paper,...
It is common practice to use large computational resources to train neural networks, as is known from many examples such as reinforcement learning applications. However, while massively parallel computing is often used for...
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-l...
Interoperability between libraries is often hindered by incompatible data formats, which can necessitate creating new copies of data when transferring data back and forth between different libraries. This additional data movement incurs additional runtime costs, particularly for sparse applications, where the costs of data movement often dwarf compute costs. In this paper, we investigate interoperability in the context of the C++ GraphBLAS Specification, where C++ concepts allow GraphBLAS algorithms to accept any matrix type as long as it follows the matrix interface defined in the GraphBLAS matrix concept. We first develop non-owning, lazily evaluated adapted views for a number of external data structures, including two categories of graphs defined in the Northwest Graph Library (NWGraph) and traditional pointer-based CSR data structures. These adapted views fulfill the C++ GraphBLAS matrix concept, allowing them to be used inside GraphBLAS algorithms. We then evaluate the performance of these adapted views across two kernels, matrix reduction and sparse times dense matrix multiplication (SpMM); the performance achieved by a single generic implementation with these views largely matches that achieved when operating directly on the original data structures, with a slight performance loss in one case. We then propose a mechanism for automatically discovering the availability of these views, allowing algorithms to directly accept external data structures. We also discuss potential extensions to the C++ GraphBLAS specification that might eliminate the small performance dip observed for one of the views.
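To sketch the adapted-view idea in code: a C++20 concept describes a matrix interface, a non-owning view over pointer-based CSR arrays satisfies it without copying, and a single generic kernel then works for any conforming type. The concept and all names below are our illustration, not the actual C++ GraphBLAS Specification interface.

#include <concepts>
#include <cstddef>
#include <iostream>

// Stand-in visitor type used only to state the interface requirement.
struct EntryVisitor {
    void operator()(std::size_t, std::size_t, double) const {}
};

// Hypothetical stand-in for the GraphBLAS matrix concept: anything that
// reports its shape and can visit its stored (row, column, value) entries.
template <typename M>
concept MatrixLike = requires(const M& m, EntryVisitor v) {
    { m.rows() } -> std::convertible_to<std::size_t>;
    m.for_each(v);
};

// Non-owning, lazily evaluated view over traditional pointer-based CSR.
struct CsrView {
    const std::size_t* rowptr;
    const std::size_t* colind;
    const double* vals;
    std::size_t nrows;
    std::size_t rows() const { return nrows; }
    template <typename F>
    void for_each(F&& f) const {
        for (std::size_t i = 0; i < nrows; ++i)
            for (std::size_t j = rowptr[i]; j < rowptr[i + 1]; ++j)
                f(i, colind[j], vals[j]);  // no copy of the external data
    }
};

// One generic kernel (a matrix reduction) serves every conforming type.
double reduce_sum(const MatrixLike auto& m) {
    double s = 0.0;
    m.for_each([&](std::size_t, std::size_t, double v) { s += v; });
    return s;
}

int main() {
    // 2x2 matrix [[1, 0], [2, 3]] in CSR form, as another library owns it.
    std::size_t rowptr[] = {0, 1, 3}, colind[] = {0, 0, 1};
    double vals[] = {1.0, 2.0, 3.0};
    std::cout << reduce_sum(CsrView{rowptr, colind, vals, 2}) << '\n';  // 6
}

Because the view stores only pointers, passing it into a generic algorithm moves no data; automatically discovering such views, as the paper proposes, would let algorithms accept the external structures directly.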