Details
ISBN:
(digital) 9781665497473
ISBN:
(print) 9781665497480
The HPC industry is inexorably moving towards an era of extremely heterogeneous architectures, with more devices configured on any given HPC platform and potentially more kinds of devices, some of them highly specialized. Writing a separate code suitable for each target system of a given HPC application is not practical. A better solution is to use directive-based parallel programming models such as OpenMP. OpenMP provides a number of options for offloading a piece of code to devices like GPUs. To select the best of these options during compilation, most modern compilers use analytical models to estimate the cost of executing the original code and the different offloading code variants. Building such an analytical model for compilers is a difficult task that necessitates a great deal of effort on the part of a compiler engineer. Recently, machine learning techniques have been successfully applied to build cost models for a variety of compiler optimization problems. In this paper, we present COMPOFF, a cost model that statically estimates the Cost of OpenMP OFFloading using a neural network model. We applied six different transformations to a parallel code of the Wilson Dslash operator to support GPU offloading, and we predicted their execution cost on different GPUs using COMPOFF at compile time. Our results show that this model can predict offloading costs with a root mean squared error of less than 0.5 seconds. Our preliminary findings indicate that this work will make it much easier and faster for scientists and compiler developers to port legacy HPC applications that use OpenMP to new heterogeneous computing environments.
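The offloading variants a cost model must choose between appear in OpenMP as different directive combinations on the same loop. As an illustrative sketch only (a generic SAXPY kernel, not the paper's Wilson Dslash transformations), two variants a compiler might compare look like this; compiled without offloading support, the pragmas are ignored and the loop simply runs on the host, which is itself one of the outcomes a cost model weighs:

```c
/* Two OpenMP offloading variants of the same hypothetical SAXPY kernel.
   A cost model such as COMPOFF would statically estimate which variant
   is cheaper on a given GPU. */
void saxpy_target(float a, const float *x, float *y, int n) {
    /* Variant 1: let the compiler pick the teams/threads launch shape. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

void saxpy_target_shaped(float a, const float *x, float *y, int n) {
    /* Variant 2: an explicit launch shape -- one of the knobs that
       makes offloading variants differ in execution cost. */
    #pragma omp target teams distribute parallel for num_teams(4) \
        thread_limit(64) map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both variants compute the same result; only their mapping onto the device differs, which is exactly the cost difference the model predicts.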
Details
ISBN:
(print) 9781665460224
In order to take advantage of the burgeoning diversity in processors at the frontier of supercomputing, the HPC community is migrating and improving codes to utilise heterogeneous nodes, where accelerators, principally GPUs, are highly prevalent in top-tier supercomputer designs. Programs therefore need to embrace at least some of the complexities of heterogeneous architectures. Parallel programming models have evolved to express heterogeneous paradigms whilst providing mechanisms for writing portable, performant programs. History shows that technologies first introduced at the frontier percolate down to local workhorse systems. However, we expect there will always be a mix of systems, some heterogeneous, but some remaining as homogeneous CPU systems. It is therefore important to ensure that codes adapted for heterogeneous systems continue to run efficiently on CPUs. In this study, we explore how well widely used heterogeneous programming models perform on CPU-only platforms, and survey the performance portability they offer on the latest CPU architectures.
Details
Quantum circuit simulation is critical for verifying quantum computers. Given the exponential complexity of the simulation, existing simulators use different architectures to accelerate it. However, due to the variety of both simulation methods and modern architectures, it is challenging to design a high-performance yet portable simulator. In this work, we propose UniQ, a unified programming model for multiple simulation methods on various hardware architectures. We provide a unified application abstraction to describe different applications, and a unified hierarchical hardware abstraction over different hardware. Based on these abstractions, UniQ can perform various circuit transformations without being aware of either concrete application or architecture details, and generate high-performance execution schedules on different platforms without much human effort. Evaluations on CPU, GPU, and Sunway platforms show that UniQ can accelerate quantum circuit simulation by up to 28.59× (4.47× on average) over state-of-the-art frameworks, and successfully scale to 399,360 cores on 1,024 nodes.
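The core kernel such simulators schedule is a state-vector update: applying a gate touches pairs of amplitudes whose indices differ in one bit. A minimal sketch of that arithmetic (a hand-rolled Hadamard update, not UniQ's abstraction or API) looks like this:

```c
#include <math.h>

/* Apply a single-qubit Hadamard to qubit q of an n-qubit register stored
   as 2^n complex amplitudes (separate real/imag arrays). Each amplitude
   pair (i, i | 2^q) is mixed by the 2x2 gate matrix (1/sqrt(2)) * [[1,1],[1,-1]].
   A toy example of the kernel shape a simulator must map onto hardware. */
void apply_h(double *re, double *im, int n, int q) {
    const double s = 1.0 / sqrt(2.0);
    long dim = 1L << n, stride = 1L << q;
    for (long i = 0; i < dim; ++i) {
        if (i & stride) continue;          /* visit each pair exactly once */
        long j = i | stride;
        double ar = re[i], ai = im[i], br = re[j], bi = im[j];
        re[i] = s * (ar + br); im[i] = s * (ai + bi);
        re[j] = s * (ar - br); im[j] = s * (ai - bi);
    }
}
```

The strided, data-parallel access pattern is why the same abstract circuit can be retargeted to CPUs, GPUs, or Sunway cores by changing only the execution schedule.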
Many organisations have a large network of connected computers, which at times may be idle. These could be used to run larger data processing problems were it not for the difficulty of organising and managing the depl...
Details
Details
Radio interferometry refers to the process of combining signals from multiple antennas to form an image of the radio source in the sky. Radio-astronomical signal processing using array telescopes is computationally challenging and poses strict performance and energy-efficiency requirements. The GMRT is one of the largest arrays with many antennas working at metre wavelengths. The ongoing developmental activities for the expansion of the GMRT (called the eGMRT) demand a manyfold increase in the computational cost and power budget, while providing an increased collecting area as well as field of view by building more antennas, each equipped with a phased array feed (PAF). Recent FPGAs provide higher FLOPS per watt, making them an energy-efficient hardware platform suitable for projects like the eGMRT that require a high compute-to-power ratio. However, the traditional programming model for FPGAs has been a primary drawback of using them for high-performance computing. Recent advances in parallel programming on FPGAs using the Open Computing Language (OpenCL) now allow FPGAs to be used as general-purpose accelerators like GPUs. The aim of this project is to design an energy-efficient multi-element correlator and beamformer on an FPGA accelerator card using OpenCL, and to explore the possibilities of using such systems for real-time, number-crunching tasks.
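At the heart of a correlator is a simple, heavily repeated operation: for each antenna pair, accumulate the product of one signal with the complex conjugate of the other. As a plain-C sketch of the arithmetic one OpenCL work-item would perform per baseline and channel (illustrative only, not the eGMRT pipeline), this is:

```c
/* Accumulate one visibility for an antenna pair: sum over time of
   s1[t] * conj(s2[t]), with complex samples stored as separate
   real/imag arrays. Uses (a+ib)(c-id) = (ac+bd) + i(bc-ad). */
void correlate(const float *re1, const float *im1,
               const float *re2, const float *im2,
               int nsamp, float *out_re, float *out_im) {
    float ar = 0.0f, ai = 0.0f;
    for (int t = 0; t < nsamp; ++t) {
        ar += re1[t] * re2[t] + im1[t] * im2[t];
        ai += im1[t] * re2[t] - re1[t] * im2[t];
    }
    *out_re = ar;
    *out_im = ai;
}
```

Because the operation is a fixed multiply-accumulate repeated over every baseline, channel, and time sample, it maps naturally onto FPGA pipelines, which is where the FLOPS-per-watt advantage comes from.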
Details
ISBN:
(digital) 9781665466363
ISBN:
(print) 9781665466370
This paper presents the use of various canonical forms of mathematical models for predictive control design. The article considers five canonical forms, including the Frobenius canonical form (serial programming) and the Jordan canonical form (parallel programming). The individual canonical forms are compared for the same controlled system and the same settings of the adjustable parameters of the predictive controller. The ITAE (integral of time-weighted absolute error) criterion is used for the comparison. The paper aims to determine which canonical form is most suitable for the selected system and which of them achieves the highest quality of the control process.
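The ITAE criterion itself is straightforward to evaluate from a sampled control-error signal: it is the integral of t·|e(t)| dt, which penalizes errors that persist late in the response. A minimal sketch, assuming a left Riemann-sum approximation over equally spaced samples (the paper's controlled system and controller settings are not reproduced here):

```c
#include <math.h>

/* ITAE = integral over time of t * |e(t)| dt, approximated by a
   left Riemann sum over n error samples taken every dt seconds.
   Sample k occurs at time t = k * dt. */
double itae(const double *e, int n, double dt) {
    double acc = 0.0;
    for (int k = 0; k < n; ++k)
        acc += (k * dt) * fabs(e[k]) * dt;
    return acc;
}
```

A lower ITAE value indicates a faster-settling, less oscillatory response, which is how the five canonical forms are ranked against each other.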
We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is base...
Details
Details
ISBN:
(print) 9781665462082
Stagnation of Moore's law has led to the increased adoption of parallel programming for enhancing the performance of scientific applications. Frequently occurring code and design patterns in scientific applications are often used for transforming serial code to parallel. But identifying these patterns is not easy. To this end, we propose using Graph Neural Networks to model code flow graphs and identify patterns in such parallel code. Additionally, identifying the runtime parameters for the best performing parallel code is also challenging. We propose a pattern-guided, deep learning based tuning approach to help identify the best runtime parameters for OpenMP loops. Overall, we aim to identify commonly occurring patterns in parallel loops and use these patterns to guide auto-tuning efforts. We validate our hypothesis on 20 different applications from the Polybench and STREAM benchmark suites. This deep learning based approach can identify the considered patterns with an overall accuracy of 91%. We validate the usefulness of patterns for auto-tuning by tuning the number of threads, scheduling policy, and chunk size on a single-socket system, and the thread count and affinity on a multi-socket machine. Our approach achieves geometric mean speedups of $1.1\times$ and $4.7\times$ respectively over default OpenMP configurations, compared to brute-force speedups of $1.27\times$ and $4.93\times$ respectively.
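The runtime parameters the tuner searches over appear in OpenMP as clauses on the loop directive. A generic reduction loop (not one of the Polybench kernels) with one candidate configuration might look like this; without OpenMP the pragma is ignored and the loop runs serially with the same result:

```c
/* A reduction loop with one candidate tuning configuration.
   The tuner's search space covers the thread count (via
   OMP_NUM_THREADS or num_threads), the scheduling policy
   (static / dynamic / guided), and the chunk size. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    /* schedule(dynamic, 256) is one point in the search space;
       a brute-force tuner would also try static and guided
       policies with other chunk sizes. */
    #pragma omp parallel for schedule(dynamic, 256) reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The pattern-guided approach replaces the brute-force sweep over these clauses with a prediction from the loop's detected pattern.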
Collective operations are common features of parallel programming models that are frequently used in High-Performance (HPC) and machine/ deep learning (ML/ DL) applications. In strong scaling scenarios, collective ope...
Details
Details
In this study, we introduce a methodology for automatically transforming user applications in the radar and communication domain, written in C/C++, into a parallel representation targeted at a heterogeneous SoC, based on dynamic profiling. We present our approach for instrumenting the user application binary during the compilation process with barrier synchronization primitives that enable the runtime system to schedule and execute independent tasks concurrently over the available compute resources. We demonstrate the capabilities of our integrated compile-time and runtime flow through task-level parallel and functionally correct execution of real-life applications. We validate our integrated system by executing four distinct applications, each carrying various degrees of task-level parallelism, on a Xeon-based multi-core homogeneous processor. We then use the proposed compilation and code transformation methodology to re-target each application for execution on a heterogeneous SoC composed of three ARM cores and one FFT accelerator, emulated on the Xilinx Zynq UltraScale+ platform. We demonstrate our runtime's ability to process the application binary and dispatch independent tasks over the available compute resources of the emulated SoC on the Zynq FPGA using three different scheduling heuristics. Finally, we demonstrate execution of each application individually with task-level parallelism on the Zynq FPGA, as well as execution of workload scenarios composed of multiple instances of the same application and mixtures of two distinct applications, to demonstrate the ability to realize both application-level and task-level parallel execution. Our integrated approach offers a path forward for application developers to take full advantage of the target SoC without requiring them to become hardware and parallel programming experts.
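The compile-time transformation described above — splitting a program into independent tasks joined by a synchronization point before a dependent stage — can be sketched with standard OpenMP tasking (plain OpenMP, not the paper's SoC runtime or its instrumentation; `pipeline` and its stages are hypothetical names). Without OpenMP the pragmas are ignored and the stages run serially with the same result:

```c
/* Two independent stages run as tasks, joined by a taskwait (the
   barrier-style synchronization) before a dependent combine stage. */
int pipeline(int x) {
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task shared(a)
        a = x * 2;            /* independent task 1 */

        #pragma omp task shared(b)
        b = x + 3;            /* independent task 2 */

        #pragma omp taskwait  /* join before the dependent stage */
    }
    return a + b;             /* combine stage */
}
```

On the heterogeneous SoC, the runtime's scheduling heuristics decide whether each such task lands on an ARM core or the FFT accelerator.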