Search Results - Inner Mongolia University Library

Workshop on Accelerator programming using Directives (WACCPD)

Authors: Aaron Jarmusch, Aaron Liu, Christian Munley, Daniel Horta, Vaidhyanathan Ravichandran, Joel Denny, Kyle Friedline, Sunita Chandrasekaran. Affiliations: University of Delaware; Oak Ridge National Laboratory.

ISBN: (print) 9781665490207

OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers’ implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers with an infrastructure for a more streamlined execution. This paper will cover the following aspects: (a) the new developments since the last publication on the testsuite, (b) outline the use of the infrastructure, (c) discuss tests that highlight our workflow process, (d) analyze the results from executing the testsuite on various systems, and (e) outline future developments.

Keywords: Analytical models; Program processors; parallel programming; Conferences; Reliability

The OpenMP Cluster programming Model

arXiv, 2022

Authors: Yviquel, Hervé; Pereira, Marcio; Francesquini, Emílio; Valarini, Guilherme; Leite, Gustavo; Rosso, Pedro; Ceccato, Rodrigo; Cusihualpa, Carla; Dias, Vitoria; Rigo, Sandro; Souza, Alan; Araujo, Guido. Affiliations: Institute of Computing, University of Campinas - Unicamp, Brazil; Federal University of Abc - Ufabc, Brazil; Petrobras, Brazil.

Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve that, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying the development process and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.

Keywords: parallel programming
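
The abstract above describes work being ordered by task dependencies. As a rough illustration of that idea only, the following Python sketch runs a small task graph on a thread pool, releasing each task once its dependencies have finished. It is not the OMPC or OpenMP API: `run_task_graph`, the task names, and the wave-by-wave scheduling are all assumptions made for this example, and nothing here is distributed across nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task_graph(tasks, deps, max_workers=4):
    """Run every callable in `tasks` after the tasks it depends on.

    tasks: name -> function taking a dict {dependency name: result}
    deps:  name -> list of dependency names (hypothetical format)
    """
    results = {}
    # Unmet dependencies per task; tasks with an empty set are ready.
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            ready = [name for name, unmet in remaining.items() if not unmet]
            if not ready:
                raise ValueError("dependency cycle or unknown dependency")
            # Tasks in the same wave are independent and may run concurrently.
            futures = {
                name: pool.submit(
                    tasks[name], {d: results[d] for d in deps.get(name, ())}
                )
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                del remaining[name]
            for unmet in remaining.values():
                unmet.difference_update(ready)
    return results


# Hypothetical four-task graph: load -> (square, double) -> total
tasks = {
    "load": lambda inputs: [1, 2, 3, 4],
    "square": lambda inputs: [x * x for x in inputs["load"]],
    "double": lambda inputs: [2 * x for x in inputs["load"]],
    "total": lambda inputs: sum(inputs["square"]) + sum(inputs["double"]),
}
deps = {"square": ["load"], "double": ["load"], "total": ["square", "double"]}
results = run_task_graph(tasks, deps)
print(results["total"])
```

In this sketch the programmer only states which task needs which result, and the scheduler decides what can run concurrently. A real runtime would likely start each task as soon as its own inputs are available instead of waiting for a whole wave.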

Designing an Independent Study to Create HPC Learning Experiences for Undergraduates

IEEE International Conference on High Performance Computing Workshops (HiPCW)

Author: Sandino Vargas-Pérez. Affiliation: Department of Computer Science, Kalamazoo College, Kalamazoo, MI, USA.

This paper aims to present a multi-tiered approach to designing learning experiences in HPC for undergraduate students that significantly reinforce comprehension of CS topics while working with new concepts in parallel and distributed computing. The paper will detail the experience of students working on the design, construction, and testing of a computing cluster, including budgeting, hardware purchase and setup, software installation and configuration, interconnection networks, communication, benchmarking, and running parallel code using MPI and OpenMP. The case study of building a relatively low-cost, small-scale computing cluster that can be used as a template for CS senior projects or independent studies also yielded an opportunity to involve students in the creation of teaching tools for parallel computing at many levels of the CS curriculum.

Keywords: Computer science; parallel programming; Education; Taxonomy; Hardware; Supercomputers; Software

Static Security Assessment of Large Power Systems Under N-1-1 Contingency

National Power Systems Conference (NPSC)

Authors: P S V Prabhakar, Ram Krishan, Deepak Reddy Pullaguram. Affiliation: Department of Electrical Engineering, NIT Warangal.

Contingency analysis (CA) is one of the critical tools of static security assessment (SSA). It is used to forecast the operating states of a power system under one or more outages of generators, transmission lines, transformers, etc. To perform SSA, repetitive load flow analyses are required to obtain the bus voltages, bus injections, and line flows for each possible outage. Repetitive load flow analysis demands substantial computational effort, which calls for efficient system modelling for faster load flow solutions, parallel programming, and high-performance computing (HPC). In this paper, an N-1-1 CA is analysed using fast decoupled load flow (FDLF) with a strategy of screening and ranking the catastrophic contingencies. This paper explores a computationally efficient method to analyze the severity and the ranking of N-1-1 contingencies for large power system SSA. The performance of the FDLF-based SSA method is demonstrated on two standard test systems, the IEEE 14-bus and 118-bus systems.

Keywords: Power transmission lines; parallel programming; High performance computing; Contingency management; Transformers; Computational efficiency; Security

Climbing the summit and pushing the frontier of mixed precision benchmarks at extreme scale

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Authors: Hao Lu, Michael Matheson, Vladyslav Oles, Austin Ellis, Wayne Joubert, Feiyi Wang. Affiliation: Oak Ridge National Laboratory.

The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed precision arithmetic, as opposed to the HPL benchmark, which measures double precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores optimization of mixed precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed for delivering portable performance on both AMD and NVIDIA GPUs at extreme scale.

Keywords: high performance computing; exascale computing; parallel programming; linear algebra
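
To make "mixed precision arithmetic" in the abstract above concrete, here is a minimal Python sketch of one common pattern, iterative refinement: a linear system is solved in low precision, then the solution is corrected using residuals computed in double precision. This is an illustration under stated assumptions, not the benchmark's or the paper's implementation. The helper names (`f32`, `lu_solve_f32`, `refine`) are hypothetical, float32 is emulated with `struct`, and for brevity the elimination is repeated for every correction where a real solver would reuse the low-precision factors.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_solve_f32(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting,
    rounding every intermediate result to float32."""
    n = len(A)
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(factor * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def residual(A, x, b):
    """Compute b - A x in full float64 precision."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iterations=5):
    """Low-precision solve followed by double-precision refinement."""
    x = lu_solve_f32(A, b)
    for _ in range(iterations):
        r = residual(A, x, b)      # accurate residual (float64)
        d = lu_solve_f32(A, r)     # cheap low-precision correction
        x = [xi + di for xi, di in zip(x, d)]
    return x


# Small, well-conditioned example system chosen for illustration.
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [0.1, 0.2, 0.3]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
x_low = lu_solve_f32(A, b)
x_ref = refine(A, b)
err_low = max(abs(u - v) for u, v in zip(x_low, x_true))
err_ref = max(abs(u - v) for u, v in zip(x_ref, x_true))
print(err_low, err_ref)
```

The expensive elimination runs entirely in the cheaper format, while the few double-precision residual computations recover accuracy that the low-precision solve alone cannot reach.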

Algorithmic Species: A Classification of Affine Loop Nests for parallel programming

ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2013, Vol. 9, No. 4, pp. 40-40

Authors: Nugteren, Cedric; Custers, Pieter; Corporaal, Henk. Affiliation: Eindhoven Univ Technol, Dept Elect Engn, NL-5600 MB Eindhoven, Netherlands.

Code generation and programming have become ever more challenging over the last decade due to the shift towards parallel processing. Emerging processor architectures such as multi-cores and GPUs increasingly exploit parallelism, requiring programmers and compilers to deal with aspects such as threading, concurrency, synchronization, and complex memory partitioning. We advocate that programmers and compilers can greatly benefit from a structured classification of program code. Such a classification can help programmers to find opportunities for parallelization, reason about their code, and interact with other programmers. Similarly, parallelising compilers and source-to-source compilers can take threading and optimization decisions based on the same classification. In this work, we introduce algorithmic species, a classification of affine loop nests based on the polyhedral model and targeted for both automatic and manual use. Individual classes capture information such as the structure of parallelism and the data reuse. To make the classification applicable for manual use, a basic vocabulary forms the base for the creation of a set of intuitive classes. To demonstrate the use of algorithmic species, we identify 115 classes in a benchmark set. Additionally, we demonstrate the suitability of algorithmic species for automated uses by showing a tool to automatically extract species from program code, a species-based source-to-source compiler, and a species-based performance prediction model.

Keywords: Performance; Languages; Algorithm classification; parallel programming; polyhedral model

SpDISTAL: compiling distributed sparse tensor computations

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Authors: Rohan Yadav, Alex Aiken, Fredrik Kjolstad. Affiliation: Stanford University.

We introduce SpDISTAL, a compiler for sparse tensor algebra that targets distributed systems. SpDISTAL combines separate descriptions of tensor algebra expressions, sparse data structures, data distribution, and computation distribution. Thus, it enables distributed execution of sparse tensor algebra expressions with a wide variety of sparse data structures and data distributions. SpDISTAL is implemented as a C++ library that targets a distributed task-based runtime system and can generate code for nodes with both multi-core CPUs and multiple GPUs. SpDISTAL generates distributed code that achieves performance competitive with hand-written distributed functions for specific sparse tensor algebra expressions and that outperforms general interpretation-based systems by one to two orders of magnitude.

Keywords: parallel programming; programming; computer science

Distributed Out-of-Memory SVD on CPU/GPU Architectures

arXiv, 2022

Authors: Boureima, Ismael; Bhattarai, Manish; Eren, Maksim E.; Solovyev, Nick; Djidjev, Hristo; Alexandrov, Boian S. Affiliations: Theoretical Division, LANL, Los Alamos, United States; Information Systems, LANL, Los Alamos, U.S. and IICT, Sofia, Bulgaria.

We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of SVD have been proposed, with most only estimating the singular values, as the estimation of the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which is a truncated singular values and singular vectors estimation method. Memory utilization bottlenecks in the power method used to decompose a matrix A are typically associated with the computation of the Gram matrix A^T A, which can be significant when A is large and dense, or when A is super-large and sparse. The proposed implementation is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of A^T A by using a batching strategy where the intermediate factors are computed block by block, and we hide I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-core SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix of size 128 PB with 1e-6 sparsity.

Keywords: parallel programming
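
The batching idea in the abstract above can be sketched on a single CPU in plain Python. The example below applies the power method to A^T A without ever forming that matrix: each row block A_b contributes A_b^T (A_b v) to the product, so only one block needs to be resident at a time. This is a sketch under assumptions, not the authors' implementation. The function names and the row-wise batching are hypothetical, there is no GPU, CUDA stream, or NCCL communication, and only the largest singular value and its right singular vector are estimated.

```python
import math


def gram_matvec_batched(A, v, batch_rows):
    """Return (A^T A) v, accumulated over row batches of A."""
    n = len(v)
    y = [0.0] * n
    for start in range(0, len(A), batch_rows):
        block = A[start:start + batch_rows]  # stand-in for one batch copy
        for row in block:
            t = sum(a * x for a, x in zip(row, v))  # one entry of A_b v
            for j in range(n):
                y[j] += row[j] * t                  # add A_b^T (A_b v)
    return y


def top_singular_pair(A, batch_rows=2, iterations=200):
    """Estimate the largest singular value of A and its right singular
    vector by power iteration on A^T A."""
    n = len(A[0])
    v = [1.0 / math.sqrt(n)] * n  # starting guess; must not be orthogonal
                                  # to the dominant singular vector
    for _ in range(iterations):
        y = gram_matvec_batched(A, v, batch_rows)
        norm = math.sqrt(sum(c * c for c in y))
        v = [c / norm for c in y]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(c * c for c in Av))
    return sigma, v


# Illustrative matrix whose singular values are 3 and 2.
A = [[3.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
sigma, v = top_singular_pair(A, batch_rows=2)
print(sigma, v)
```

Because each batch only adds to the running vector `y`, the peak memory of this product depends on the batch size and not on the full matrix, which is the property that makes an out-of-memory variant possible.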

Modelling the Earth's geomagnetic environment on Cray machines using PETSc and SLEPc

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2020, Vol. 32, No. 20, pp. e5660-e5660

Authors: Brown, Nick; Bainbridge, Brian; Beggan, Ciaran; Brown, William; Hamilton, Brian; Macmillan, Susan. Affiliations: Univ Edinburgh, Bayes Ctr, EPCC, Edinburgh EH8 9BT, Midlothian, Scotland; British Geol Survey, Lyell Ctr, Edinburgh, Midlothian, Scotland.

The British Geological Survey's global geomagnetic model, Model of the Earth's Magnetic Environment (MEME), is an important tool for calculating the strength and direction of the Earth's magnetic field, which is continually in flux. While the ability to collect data from ground-based observation sites and satellites has grown rapidly, the memory bound nature of the original code has proved a significant limitation on the size of the modelling problem required. In this paper, we describe work done replacing the bespoke, sequential, eigensolver with that of the PETSc/SLEPc package for solving the system of normal equations. Adopting PETSc/SLEPc also required fundamental changes in how we built and distributed the data structures, and as such, we describe an approach for building symmetric matrices that provides good load balance and avoids the need for close coordination between the processes or replication of work. We also study the memory bound nature of the code from an irregular memory accesses perspective and combine detailed profiling with software cache prefetching to significantly optimise this. Performance and scaling characteristics are explored on ARCHER, a Cray XC30, where we achieved a speed up for the solver of 294 times by replacing the model's bespoke approach with SLEPc.

Keywords: HPC; Model of the Earth's Magnetic Environment (MEME); MPI; parallel programming; PETSc; SLEPc; software prefetching

High-level and efficient structured stream parallelism for rust on multi-cores

JOURNAL OF COMPUTER LANGUAGES, 2021, Vol. 65

Authors: Pieper, Ricardo; Loff, Junior; Hoffmann, Renato B.; Griebler, Dalvan; Fernandes, Luiz G. Affiliations: Pontifical Catholic Univ Rio Grande Sul PUCRS, Sch Technol, BR-90619900 Porto Alegre, RS, Brazil; Tres De Maio Fac SETREM, Lab Adv Res Cloud Comp LARCC, BR-98910000 Tres De Maio, Brazil.

This work aims to contribute a structured parallel programming abstraction for Rust that provides ready-to-use parallel patterns, abstracting low-level and architecture-dependent details from application programmers. We focus on stream processing applications running on shared-memory multi-core architectures (e.g., video processing, compression, and others). Therefore, we provide a new high-level and efficient parallel programming abstraction for expressing stream parallelism, named Rust-SSP. We also created a new stream benchmark suite for Rust that represents real-world scenarios and has different application characteristics and workloads. Our benchmark suite is an initiative to assess existing parallelism abstractions for this domain, as parallel implementations using these abstractions were provided. The results revealed that Rust-SSP achieved up to 41.1% better performance than other solutions. In terms of programmability, the results revealed that Rust-SSP requires the smallest number of extra lines of code to enable stream parallelism.

Keywords: programming language; parallel programming; parallelism abstractions; Stream processing
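
Stream parallelism, as mentioned in the abstract above, is often expressed as a pipeline of stages connected by queues. The sketch below shows that general pattern in Python with one thread per stage. It does not reproduce the Rust-SSP API, which the abstract does not describe: `run_pipeline`, `stage`, and the queue capacity are hypothetical names and choices for this example, and Python threads show the structure of the pattern, not multi-core speedups.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and forward results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream downstream
            return
        outbox.put(fn(item))


def run_pipeline(source, fns, capacity=8):
    """Push items from `source` through one thread per function in `fns`.

    Bounded queues apply back-pressure so a fast stage cannot run
    arbitrarily far ahead of a slow one.
    """
    queues = [queue.Queue(maxsize=capacity) for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()

    def feed():
        for item in source:
            queues[0].put(item)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)

    feeder.join()
    for t in threads:
        t.join()
    return results


# Two-stage example: increment each item, then square it.
output = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x])
print(output)
```

With a single worker per stage, items leave the pipeline in the order they entered. Replicating a stage across several workers would need extra logic to restore ordering.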
