Search results - Inner Mongolia University Library

arXiv, 2022

Authors: Boureima, Ismael; Bhattarai, Manish; Eren, Maksim; Skau, Erik; Romero, Philip; Eidenbenz, Stephan; Alexandrov, Boian. Affiliations: Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, United States; Computer, Computational and Statistical Science Division, Los Alamos National Laboratory, Los Alamos, NM 87545, United States; HPC Division, Los Alamos National Laboratory, Los Alamos, NM 87545, United States

We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL) based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density $10^{-6}$. Copyright © 2022, The Authors. All rights reserved.

Keywords: parallel programming
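
The batching/tiling idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the classic multiplicative-update rule for NMF, runs on the CPU in pure Python, and the names `batched_nmf`, `batch_rows`, `matmul`, `transpose`, and `frobenius_error` are hypothetical. It only shows how the updates can be arranged so that a single row batch of the input matrix is touched at a time, which is the property an out-of-memory GPU version would rely on.

```python
import random

def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    """Frobenius norm of A - W H."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

def batched_nmf(A, k, batch_rows, iters=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF (A ~ W H) that reads A one row batch at a time."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        HHt = matmul(H, transpose(H))            # k x k, always small
        WtA = [[0.0] * n for _ in range(k)]      # accumulated over batches
        WtW = [[0.0] * k for _ in range(k)]      # accumulated over batches
        for start in range(0, m, batch_rows):
            rows = range(start, min(start + batch_rows, m))
            A_b = [A[i] for i in rows]           # the only part of A needed right now
            W_b = [W[i] for i in rows]
            num = matmul(A_b, transpose(H))      # batch x k
            den = matmul(W_b, HHt)               # batch x k
            W_b = [[W_b[r][j] * num[r][j] / (den[r][j] + eps) for j in range(k)]
                   for r in range(len(A_b))]
            for r, i in enumerate(rows):
                W[i] = W_b[r]
            Wt_b = transpose(W_b)
            part_A = matmul(Wt_b, A_b)           # this batch's share of W^T A
            part_W = matmul(Wt_b, W_b)           # this batch's share of W^T W
            for j in range(k):
                for c in range(n):
                    WtA[j][c] += part_A[j][c]
                for c in range(k):
                    WtW[j][c] += part_W[j][c]
        den_H = matmul(WtW, H)                   # k x n
        H = [[H[j][c] * WtA[j][c] / (den_H[j][c] + eps) for c in range(n)]
             for j in range(k)]
    return W, H
```

Because each row of W is updated independently once H is fixed, the batch size changes only the memory footprint, not the result (up to floating-point rounding in the accumulated sums).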

Pattern-based Autotuning of OpenMP Loops using Graph Neural Networks

Artificial Intelligence and Machine Learning for Scientific Applications (AI4S), IEEE/ACM International Workshop on

Authors: Akash Dutta; Jordi Alcaraz; Ali TehraniJamsaz; Anna Sikora; Eduardo Cesar; Ali Jannesari. Affiliations: Department of Computer Science, Iowa State University, Ames, Iowa, USA; OACISS, University of Oregon, Eugene, USA; CAOS Department, Universitat Autonoma de Barcelona, Spain

ISBN: 9781665462082 (print)

Stagnation of Moore's law has led to the increased adoption of parallel programming for enhancing performance of scientific applications. Frequently occurring code and design patterns in scientific applications are often used for transforming serial code to parallel. But, identifying these patterns is not easy. To this end, we propose using Graph Neural Networks for modeling code flow graphs to identify patterns in such parallel code. Additionally, identifying the runtime parameters for best performing parallel code is also challenging. We propose a pattern-guided deep learning based tuning approach, to help identify the best runtime parameters for OpenMP loops. Overall, we aim to identify commonly occurring patterns in parallel loops and use these patterns to guide auto-tuning efforts. We validate our hypothesis on 20 different applications from Polybench, and STREAM benchmark suites. This deep learning-based approach can identify the considered patterns with an overall accuracy of 91%. We validate the usefulness of using patterns for auto-tuning on tuning the number of threads, scheduling policies and chunk size on a single socket system, and the thread count and affinity on a multi-socket machine. Our approach achieves geometric mean speedups of $1.1\times$ and $4.7\times$ respectively over default OpenMP configurations, compared to brute-force speedups of $1.27\times$ and $4.93\times$ respectively.

Keywords: Semiconductor device modeling; Codes; Runtime; parallel programming; Sockets; Moore's Law; Instruction sets
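
The brute-force baseline and the geometric-mean speedup mentioned in the abstract above can be sketched as follows. This is an illustration only, not the paper's tuner: `brute_force_tune`, `geomean_speedup`, and the `measure` callback are hypothetical names, and `measure` is assumed to return the runtime of a loop for a given thread count, scheduling policy, and chunk size.

```python
from itertools import product
from statistics import geometric_mean

def brute_force_tune(measure, threads, schedules, chunks):
    """Time every (threads, schedule, chunk) combination and return the fastest.

    `measure(threads, schedule, chunk)` is an assumed callback that returns a
    runtime in seconds; in practice it would run the OpenMP loop under test.
    """
    timings = {cfg: measure(*cfg) for cfg in product(threads, schedules, chunks)}
    best = min(timings, key=timings.get)
    return best, timings[best]

def geomean_speedup(default_times, tuned_times):
    """Geometric mean of per-application speedups (default time / tuned time)."""
    return geometric_mean([d / t for d, t in zip(default_times, tuned_times)])
```

Exhaustive search costs one measurement per configuration, which is why a learned model that predicts a good configuration directly can be attractive even when it recovers slightly less speedup than brute force.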

Efficient and Eventually Consistent Collective Operations

arXiv, 2022

Authors: Iakymchuk, Roman; Faustino, Amândio; Emerson, Andrew; Barreto, João; Bartsch, Valeria; Rodrigues, Rodrigo; Monteiro, José C. Affiliations: Fraunhofer ITWM, Kaiserslautern 67663, Germany; Sorbonne Université, Paris 75252, France; Lisboa 1000-029, Portugal; CINECA, Casalecchio di Reno 40033, Italy

Collective operations are common features of parallel programming models that are frequently used in High-Performance Computing (HPC) and machine/deep learning (ML/DL) applications. In strong scaling scenarios, collective operations can negatively impact the overall application performance: with the increase in core count, the load per rank decreases, while the time spent in collective operations increases logarithmically. In this article, we propose a design for eventually consistent collectives suitable for ML/DL computations by reducing communication in Broadcast and Reduce, as well as by exploring the Stale Synchronous Parallel (SSP) synchronization model for the Allreduce collective. Moreover, we also enrich the GASPI ecosystem with frequently used classic/consistent collective operations, such as Allreduce for large messages and AlltoAll used in an HPC code. Our implementations show promising preliminary results with significant improvements, especially for Allreduce and AlltoAll, compared to the vendor-provided MPI alternatives. © 2022, CC BY.

Keywords: parallel programming
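
The bounded-staleness idea behind the Stale Synchronous Parallel model named in the abstract above can be modelled in a few lines. This is a toy, single-process sketch, not the GASPI implementation the authors describe: the class `StaleAllreduce` and its methods are hypothetical, and real ranks would communicate over a network rather than share a Python list. Each rank posts its contribution tagged with an iteration number, and a reduction at iteration `t` is allowed to use contributions that are at most `slack` iterations old.

```python
class StaleAllreduce:
    """Toy model of a sum-Allreduce that tolerates bounded staleness."""

    def __init__(self, n_ranks, slack):
        self.slack = slack
        # (iteration, value) most recently posted by each rank;
        # -inf marks a rank that has not posted anything yet.
        self.latest = [(float("-inf"), 0.0)] * n_ranks

    def post(self, rank, iteration, value):
        """Record the contribution of `rank` for `iteration`."""
        self.latest[rank] = (iteration, value)

    def try_reduce(self, iteration):
        """Return the sum of the latest contributions, or None if a rank lags too far.

        With slack == 0 every contribution must come from `iteration` itself,
        which corresponds to a fully synchronous (consistent) Allreduce.
        """
        oldest = min(it for it, _ in self.latest)
        if oldest < iteration - self.slack:
            return None  # the caller would have to wait for the straggler
        return sum(value for _, value in self.latest)
```

A fast rank can therefore keep computing with slightly outdated values from slower ranks, and only blocks once the gap exceeds the configured slack.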

Profile-Guided Parallel Task Extraction and Execution for Domain Specific Heterogeneous SoC

IEEE International Conference on Big Data and Cloud Computing (BdCloud)

Authors: Liangliang Chang; Joshua Mack; Benjamin Willis; Xing Chen; John Brunhaver; Ali Akoglu; Chaitali Chakrabarti. Affiliations: The School of Electrical, Computer and Energy Engineering, Arizona State University, USA; Electrical and Computer Engineering Department, The University of Arizona, USA

In this study, we introduce a methodology for automatically transforming user applications in the radar and communication domain written in C/C++ based on dynamic profiling to a parallel representation targeted for a heterogeneous SoC. We present our approach for instrumenting the user application binary during the compilation process with barrier synchronization primitives that enable the runtime system to schedule and execute independent tasks concurrently over the available compute resources. We demonstrate the capabilities of our integrated compile time and runtime flow through task-level parallel and functionally correct execution of real-life applications. We perform validation of our integrated system by executing four distinct applications, each carrying various degrees of task level parallelism, over the Xeon-based multi-core homogeneous processor. We use the proposed compilation and code transformation methodology to re-target each application for execution on a heterogeneous SoC composed of three ARM cores and one FFT accelerator that is emulated on the Xilinx Zynq UltraScale+ platform. We demonstrate our runtime's ability to process the application binary and dispatch independent tasks over the available compute resources of the emulated SoC on the Zynq FPGA based on three different scheduling heuristics. Finally, we demonstrate execution of each application individually with task level parallelism on the Zynq FPGA and execution of workload scenarios composed of multiple instances of the same application as well as a mixture of two distinct applications to demonstrate the ability to realize both application and task level parallel execution. Our integrated approach offers a path forward for application developers to take full advantage of the target SoC without requiring users to become hardware and parallel programming experts.

Keywords: Schedules; Runtime; Processor scheduling; parallel programming; parallel processing; Hardware; Radar applications

STMatch: accelerating graph pattern matching on GPU with stack-based loop optimizations

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Authors: Yihua Wei; Peng Jiang. Affiliation: University of Iowa

Graph pattern matching is a fundamental task in many graph analytics and graph mining applications. As an NP-hard problem, it is often a performance bottleneck in these applications. Previous work has proposed to use GPU to accelerate the computation. However, we find that the existing GPU solutions fail to show a performance advantage over the state-of-the-art CPU implementation due to their subgraph-centric design. This work proposes a novel stack-based graph pattern matching system on GPU that avoids the synchronization and memory consumption issues of the previous subgraph-centric systems. We also propose a two-level work-stealing and a loop-unrolling technique to improve the inter-warp and intra-warp GPU resource utilization of our system. The experiments show that our system significantly advances the state-of-the-art for graph pattern matching on GPU.

Keywords: backtracking; parallel programming
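
The contrast between stack-based and subgraph-centric matching in the abstract above can be made concrete with a small CPU sketch. This is not the STMatch GPU kernel: `count_matches` is a hypothetical name, the graphs are assumed to be given as lists of neighbor sets, and the pattern is assumed to be non-empty. The point is that backtracking with an explicit stack keeps one candidate iterator per pattern vertex instead of materializing every partial subgraph.

```python
def count_matches(data_adj, pattern_adj):
    """Count injective, edge-preserving mappings of a pattern into a data graph.

    `data_adj[v]` and `pattern_adj[u]` are sets of neighbor ids. Pattern
    vertices are matched in index order 0, 1, 2, ...
    """
    k = len(pattern_adj)
    n = len(data_adj)

    def candidates(mapping):
        level = len(mapping)  # pattern vertex to be matched next
        earlier = [u for u in pattern_adj[level] if u < level]
        return [v for v in range(n)
                if v not in mapping
                and all(mapping[u] in data_adj[v] for u in earlier)]

    count = 0
    mapping = []                               # mapping[u] = data vertex for pattern vertex u
    stack = [iter(candidates(mapping))]        # one candidate iterator per level
    while stack:
        v = next(stack[-1], None)
        if v is None:                          # level exhausted: backtrack
            stack.pop()
            if mapping:
                mapping.pop()
        elif len(mapping) == k - 1:            # last pattern vertex matched
            count += 1
        else:                                  # descend one level
            mapping.append(v)
            stack.append(iter(candidates(mapping)))
    return count
```

Memory use is bounded by the pattern size times the candidate list length, independent of how many partial matches exist. Note that the function counts mappings, so each triangle in the data graph is counted six times (once per automorphism of the triangle pattern).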

Distributed non-negative RESCAL with Automatic Model Selection for Exascale Data

arXiv, 2022

Authors: Bhattarai, Manish; Kharat, Namita; Skau, Erik; Nebgen, Benjamin; Djidjev, Hristo; Rajopadhye, Sanjay; Smith, James P.; Alexandrov, Boian. Affiliations: Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States; Computer, Computational and Statistical Science Division, Los Alamos National Laboratory, Los Alamos, NM 87544, United States; Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria; Department of Computer Science, CSU, Fort Collins, CO 80523, United States

With the boom in the development of computer hardware and software, social media, IoT platforms, and communications, there has been an exponential growth in the volume of data produced around the world. Among these data, relational datasets are growing in popularity as they provide unique insights regarding the evolution of communities and their interactions. Relational datasets are naturally non-negative, sparse, and extra-large. Relational data usually contain triples, (subject, relation, object), and are represented as graphs/multigraphs, called knowledge graphs, which need to be embedded into a low-dimensional dense vector space. Among various embedding models, RESCAL allows learning of relational data to extract the posterior distributions over the latent variables and to make predictions of missing relations. However, RESCAL is computationally demanding and requires a fast and distributed implementation to analyze extra-large real-world datasets. Here we introduce a distributed non-negative RESCAL algorithm for heterogeneous CPU/GPU architectures with automatic selection of the number of latent communities (model selection), called pyDRESCALk. We demonstrate the correctness of pyDRESCALk with real-world and large synthetic tensors, and the efficacy showing near-linear scaling that concurs with the theoretical complexities. Finally, pyDRESCALk determines the number of latent communities in an 11-terabyte dense and 9-exabyte sparse synthetic tensor. Copyright © 2022, The Authors. All rights reserved.

Keywords: parallel programming

Parallelization of Local Neighborhood Difference Pattern Feature Extraction using GPU

Knowledge Engineering and Communication Systems (ICKECS), International Conference on

Authors: Arisetty Sree Ashish; Ashwath Rao B. Affiliation: Department of Computer Science and Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India

One of the various techniques employed for image feature extraction is the Local Neighborhood Difference Pattern, also called LNDP. LNDP considers the relationship between neighbors of a central pixel with its adjacent pixels and transforms this mutual relationship of all the neighboring pixels into a binary pattern. It has proven to be a powerful and effective descriptor for texture analysis. A parallel implementation of LNDP using Compute Unified Device Architecture (CUDA) has been proposed in this paper. A speedup of about 1000 times has been achieved through a shared memory parallel implementation for large images. Thus, an efficacious and efficient implementation has resulted in an increased execution speed and reduced execution time.

Keywords: Knowledge engineering; parallel programming; Communication systems; Face recognition; Graphics processing units; Computer architecture; Transforms

Programming Bare-Metal Accelerators with Heterogeneous Threading Models: A Case Study of Matrix-3000

arXiv, 2022

Authors: Fang, Jianbin; Zhang, Peng; Huang, Chun; Tang, Tao; Lu, Kai; Wang, Ruibo; Wang, Zheng. Affiliations: School of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; School of Computing, University of Leeds, Leeds LS2 9JT, United Kingdom

As the hardware industry moves towards using specialized heterogeneous many-cores to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. This article shares our experience when developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing systems software to enable the programming of bare-metal accelerators. Our programming models have been deployed to the production environment of an exascale prototype system. © 2022, CC BY.

Keywords: parallel programming

Vectorizing sparse matrix computations with partially-strided codelets

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Authors: Kazem Cheshmi; Zachary Cetinic; Maryam Mehri Dehnavi. Affiliation: University of Toronto, Toronto, Canada

The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a sparse code. In this work, we propose a locality-based codelet mining (LCM) algorithm that efficiently searches for strided and partially strided regions in sparse matrix computations for vectorization. We also present a classification of partially strided codelets and a differentiation-based approach to generate codelets from memory accesses in the sparse computation. LCM is implemented as an inspector-executor framework called LCM I/E that generates vectorized code for the sparse matrix-vector multiplication (SpMV), sparse matrix times dense matrix (SpMM), and sparse triangular solver (SpTRSV). LCM I/E outperforms the MKL library with an average speedup of 1.67×, 4.1×, and 1.75× for SpMV, SpTRSV, and SpMM, respectively. It is also faster than the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel with an average speedup of 1.9×.

Keywords: parallel programming; vectorization; sparse matrix computation; polyhedral analysis
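
The distinction between strided and irregular regions that the abstract above relies on can be seen in a plain CSR sparse matrix-vector product. The sketch below is not the LCM algorithm or its inspector-executor: `spmv_csr` and `is_unit_strided` are hypothetical names, and the stride check is done naively per row at run time. It only shows why a row whose column indices are consecutive can read a contiguous slice of `x` (a plain vector load in compiled code), while any other row needs an indexed gather.

```python
def is_unit_strided(cols):
    """True if the column indices are consecutive (stride 1)."""
    return all(b - a == 1 for a, b in zip(cols, cols[1:]))

def spmv_csr(indptr, indices, data, x):
    """y = A x for a matrix A stored in CSR form (indptr, indices, data)."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        cols = indices[start:end]
        if cols and is_unit_strided(cols):
            # Strided region: x is read as one contiguous slice.
            xs = x[cols[0]:cols[0] + len(cols)]
            y[row] = sum((a * b for a, b in zip(data[start:end], xs)), 0.0)
        else:
            # Irregular region: each x element is fetched through an index (gather).
            y[row] = sum((data[p] * x[indices[p]] for p in range(start, end)), 0.0)
    return y
```

For example, the matrix `[[1, 2, 0], [0, 0, 3], [4, 0, 5]]` has CSR arrays `indptr = [0, 2, 3, 5]`, `indices = [0, 1, 2, 0, 2]`, `data = [1, 2, 3, 4, 5]`; its first two rows take the strided path and the third takes the gather path.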

Performance Implications of Thread Count on OS Level Factors in Multithreaded Applications

6th International Conference on Computing, Communication, Control and Automation, ICCUBEA 2022

Authors: Malave, Sachin; Shinde, Subhash. Affiliation: Lokmanya Tilak College of Engineering, Computer Department, New Mumbai, India

ISBN: 9781665484527 (print)

In high-performance computing, picking the right number of threads to gain a good speedup is important, as many OS-level parameters are influenced by even slight adjustments in thread count. These parameters are required by the operating system for process management and should not be ignored. They also contribute overhead to the running program, which can mount up quickly if not properly managed. Using too many threads in the system raises overheads, but using too few threads in the system significantly reduces performance. In this paper, the impact of page faults, CPU migrations, CPU utilisation, and context switching on execution time is investigated. The proposed work is simulated on a dual-socket Intel Xeon E5-2603 v3 using the well-known benchmark PARSEC 3.0. After studying performance parameters, simulation results reveal that running multithreaded programs with a correct number of threads can result in greater speedup and save overall system time. © 2022 IEEE.

Keywords: benchmarks; parallel programming; Speedup; Synchronisation; Threads
