By providing highly efficient one-sided communication with a globally shared memory space, the Partitioned Global Address Space (PGAS) model has become one of the most promising parallel computing models in high-performance computing (HPC). Meanwhile, FPGAs are gaining attention as an alternative compute platform for HPC systems, offering custom computing and design flexibility. However, unlike the traditional message passing interface, PGAS has not been explored on FPGAs. This paper proposes FSHMEM, a software/hardware framework that enables the PGAS programming model on FPGAs. We implement the core functions of the GASNet specification on the FPGA for native PGAS integration in hardware, while the programming interface is designed to be highly compatible with legacy software. Our experiments show that FSHMEM achieves a peak bandwidth of 3813 MB/s, more than 95% of the theoretical maximum, outperforming prior work by 9.5×. It records 0.35 µs and 0.59 µs latency for remote write and read operations, respectively. Finally, we conduct a case study on two Intel D5005 FPGA nodes integrating Intel's deep learning accelerator. The two-node system programmed with FSHMEM achieves 1.94× and 1.98× speedup for matrix multiplication and convolution, respectively, showing its scalability potential for HPC infrastructure.
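The one-sided put/get semantics at the heart of PGAS can be illustrated with a toy model. The sketch below is not the FSHMEM or GASNet API; it is a minimal, self-contained simulation of a symmetric heap in which every processing element (PE) owns a partition of a globally addressable space, and remote writes and reads complete without any action by the target PE (all names are hypothetical).

```python
# Illustrative sketch of PGAS one-sided semantics (not the FSHMEM API):
# each PE owns a partition of a globally addressable array; put/get move
# data without involving the remote CPU.

class SymmetricHeap:
    """Toy model: one buffer per PE, addressable from any PE."""
    def __init__(self, num_pes, words_per_pe):
        self.mem = [[0] * words_per_pe for _ in range(num_pes)]

    def put(self, target_pe, offset, values):
        # one-sided remote write: the target PE takes no action
        self.mem[target_pe][offset:offset + len(values)] = values

    def get(self, source_pe, offset, count):
        # one-sided remote read
        return self.mem[source_pe][offset:offset + count]

heap = SymmetricHeap(num_pes=2, words_per_pe=8)
heap.put(target_pe=1, offset=0, values=[7, 8, 9])   # PE 0 writes into PE 1
assert heap.get(source_pe=1, offset=0, count=3) == [7, 8, 9]
```

In a real PGAS runtime, the put would translate to an RDMA write over the interconnect; FSHMEM's contribution is implementing this path natively in FPGA hardware.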
ISBN (Print): 9781665490207
Template metaprogramming is gaining popularity as a high-level solution for achieving performance portability on heterogeneous computing resources. Kokkos is a representative approach that offers programmers high-level abstractions for generic programming while most of the device-specific code generation and optimizations are delegated to the compiler through template specializations. For this, Kokkos provides a set of device-specific code specializations in multiple back ends, such as CUDA and HIP. Unlike CUDA or HIP, OpenACC is a high-level and directive-based programming model. This descriptive model allows developers to insert hints (pragmas) into their code that help the compiler to parallelize the code. The compiler is responsible for the transformation of the code, which is completely transparent to the programmer. This paper presents an OpenACC back end for Kokkos: KokkACC. As an alternative to Kokkos’s existing device-specific back ends, KokkACC is a multi-architecture back end providing a high-productivity programming environment enabled by OpenACC’s high-level and descriptive programming model. Moreover, we have observed competitive performance; in some cases, KokkACC is faster (up to 9×) than NVIDIA’s CUDA back end and much faster than OpenMP’s GPU offloading back end. This work also includes implementation details and a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and three mini-apps (LULESH, miniFE and SNAP, a LAMMPS proxy mini-app).
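The back-end dispatch idea behind Kokkos and KokkACC can be sketched in miniature: user code calls a single generic `parallel_for`, and the execution space selects a back-end implementation. The sketch below uses hypothetical names; real Kokkos does this in C++ via template specialization, and KokkACC would lower the loop to OpenACC pragmas rather than the stand-in chunked loop shown here.

```python
# Hedged sketch of back-end dispatch (hypothetical names, not the Kokkos
# API): user code is back-end agnostic; the execution space picks the
# implementation, as Kokkos does via template specialization.

def serial_backend(n, body):
    for i in range(n):
        body(i)

def chunked_backend(n, body, chunk=4):
    # stand-in for an offloading back end that processes index blocks
    for start in range(0, n, chunk):
        for i in range(start, min(start + chunk, n)):
            body(i)

BACKENDS = {"Serial": serial_backend, "OpenACC": chunked_backend}

def parallel_for(n, body, execution_space="Serial"):
    # same user-visible call regardless of the underlying back end
    BACKENDS[execution_space](n, body)

# AXPY written once, run on either "back end"
x = [1.0] * 8
y = [2.0] * 8
parallel_for(8, lambda i: y.__setitem__(i, 3.0 * x[i] + y[i]), "OpenACC")
assert y == [5.0] * 8
```

The design point the paper exploits is exactly this separation: because user code never names the back end's internals, a new descriptive back end such as OpenACC can be slotted in without touching application source.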
We compare automatically and manually parallelized NAS Benchmarks in order to identify code sections that differ. We discuss opportunities for advancing automatic parallelizers. We find ten patterns that pose challeng...
ISBN (Print): 9781665475075
The conventional model of parallel programming today involves either copying data across cores (and then having to track its most recent value), or not copying and requiring deep software stacks to perform even the simplest operation on data that is “remote”, i.e., out of the range of loads and stores from the current core. As application requirements grow to larger data sets, with more irregular access to them, both conventional approaches start to exhibit severe scaling limitations. This paper reviews some growing evidence of the potential value of a new model of computation that skirts between the two: data does not move (i.e., is not copied), but computation instead moves to the data. Several different applications involving large sparse computations, streaming of data, and complex mixed mode operations have been coded for a novel platform where thread movement is handled invisibly by the hardware. The evidence to date indicates that parallel scaling for this paradigm can be significantly better than any mix of conventional models.
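The contrast between the two models can be made concrete with a toy reduction over distributed data. The sketch below uses illustrative names, not the platform's API: in the conventional model all remote data is copied to one place before computing; in the migrating model a small piece of computation is shipped to each node and only the result moves.

```python
# Toy contrast: copy data to the computation vs. move the computation to
# the data (names are illustrative, not the actual platform's API).

class Node:
    def __init__(self, data):
        self.data = data  # resident data; never copied off-node

    def run_here(self, fn):
        # "thread migration": the work executes where the data lives
        return fn(self.data)

nodes = [Node(list(range(i * 4, i * 4 + 4))) for i in range(3)]

# conventional model: gather (copy) all remote data, then reduce locally
copied = [v for n in nodes for v in n.data]
total_copy = sum(copied)

# migrating model: ship the reduction to each node, move only the result
total_migrate = sum(n.run_here(sum) for n in nodes)

assert total_copy == total_migrate == sum(range(12))
```

For sparse or irregular access patterns, the migrating model's advantage is that the bytes moved scale with the number of results, not with the size of the data touched.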
ISBN (Digital): 9781665497473
ISBN (Print): 9781665497480
Vector clocks are logical timestamps used in correctness tools to analyze the happened-before relation between events in parallel program executions. In particular, race detectors use them to find concurrent conflicting memory accesses, and replay tools use them to reproduce or find alternative execution paths. To record the happened-before relation with vector clocks, tool developers have to consider the different synchronization concepts of a programming model, e.g., barriers, locks, or message exchanges. Especially in distributed-memory programs, various concepts result in explicit and implicit synchronization between processes. Previously implemented vector clock exchanges are often specific to a single programming model, and a translation to other programming models is not trivial. Consequently, analyses relying on the vector clock exchange remain model-specific. This paper proposes an abstraction layer for on-the-fly vector clock exchanges for distributed-memory programs. Based on the programming models MPI, OpenSHMEM, and GASPI, we define common synchronization primitives and explain how model-specific procedures map to our model-agnostic abstraction layer. The exchange model is general enough also to support synchronization concepts of other parallel programming models. We present our implementation of the vector clock abstraction layer based on the Generic Tool Infrastructure with translators for MPI and OpenSHMEM. In an overhead study using the SPEC MPI 2007 benchmarks, the slowdown of the implemented vector clock exchange ranges from 1.1× to 12.6× for runs with up to 768 processes.
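The core vector-clock mechanics the abstraction layer must exchange can be sketched model-agnostically: each process advances its own component on a local event, and on any synchronization it merges the received clock component-wise. Two events are ordered by happened-before exactly when one clock is pointwise less-than-or-equal and they differ; events ordered in neither direction are concurrent, which is what a race detector looks for.

```python
# Minimal, model-agnostic vector-clock sketch of the happened-before
# relation that the abstraction layer exchanges between processes.

def tick(clock, pid):
    c = list(clock)
    c[pid] += 1          # local event: advance own component
    return c

def merge(local, received, pid):
    # on synchronization, take the component-wise maximum, then tick
    return tick([max(a, b) for a, b in zip(local, received)], pid)

def happened_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

# two processes: P0 sends after one event, P1 receives
p0 = tick([0, 0], 0)            # [1, 0]
p1 = merge([0, 0], p0, 1)       # [1, 1]
assert happened_before(p0, p1)
assert not happened_before(p1, p0)
# concurrent events are ordered in neither direction -> potential race
q = tick([0, 0], 1)             # [0, 1]
assert not happened_before(p0, q) and not happened_before(q, p0)
```

What differs per programming model is only where `merge` is invoked: at an MPI receive, an OpenSHMEM synchronization, or a GASPI notification; the paper's contribution is mapping those model-specific points onto one common set of primitives.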
ISBN (Digital): 9781665497473
ISBN (Print): 9781665497480
The HPC industry is inexorably moving towards an era of extremely heterogeneous architectures, with more devices configured on any given HPC platform and potentially more kinds of devices, some of them highly specialized. Writing a separate code suitable for each target system for a given HPC application is not practical. The better solution is to use directive-based parallel programming models such as OpenMP. OpenMP provides a number of options for offloading a piece of code to devices like GPUs. To select the best among these options during compilation, most modern compilers use analytical models to estimate the cost of executing the original code and the different offloading code variants. Building such an analytical model for compilers is a difficult task that necessitates a lot of effort on the part of a compiler engineer. Recently, machine learning techniques have been successfully applied to build cost models for a variety of compiler optimization problems. In this paper, we present COMPOFF, a cost model that statically estimates the Cost of OpenMP OFFloading using a neural network model. We used six different transformations on a parallel code of the Wilson Dslash Operator to support GPU offloading, and we predicted their cost of execution on different GPUs using COMPOFF at compile time. Our results show that this model can predict offloading costs with a root mean squared error of less than 0.5 seconds. Our preliminary findings indicate that this work will make it much easier and faster for scientists and compiler developers to port legacy HPC applications that use OpenMP to new heterogeneous computing environments.
ISBN (Print): 9781665460224
In order to take advantage of the burgeoning diversity in processors at the frontier of supercomputing, the HPC community is migrating and improving codes to utilise heterogeneous nodes, where accelerators, principally GPUs, are highly prevalent in top-tier supercomputer designs. Programs therefore need to embrace at least some of the complexities of heterogeneous architectures. Parallel programming models have evolved to express heterogeneous paradigms whilst providing mechanisms for writing portable, performant programs. History shows that technologies first introduced at the frontier percolate down to local workhorse systems. However, we expect there will always be a mix of systems, some heterogeneous, but some remaining as homogeneous CPU systems. Thus it is important to ensure codes adapted for heterogeneous systems continue to run efficiently on CPUs. In this study, we explore how well widely used heterogeneous programming models perform on CPU-only platforms, and survey the performance portability they offer on the latest CPU architectures.
Quantum circuit simulation is critical for verifying quantum computers. Given the exponential complexity of the simulation, existing simulators use different architectures to accelerate it. However, due to the variety of both simulation methods and modern architectures, it is challenging to design a high-performance yet portable simulator. In this work, we propose UniQ, a unified programming model for multiple simulation methods on various hardware architectures. We provide a unified application abstraction to describe different applications, and a unified hierarchical hardware abstraction over different hardware. Based on these abstractions, UniQ can perform various circuit transformations without being aware of either concrete application or architecture details, and generate high-performance execution schedules on different platforms without much human effort. Evaluations on CPU, GPU, and Sunway platforms show that UniQ can accelerate quantum circuit simulation by up to 28.59× (4.47× on average) over state-of-the-art frameworks, and successfully scale to 399,360 cores on 1,024 nodes.
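The dominant simulation method such frameworks accelerate, state-vector simulation, reduces to applying small unitary matrices to a vector of 2^n complex amplitudes. The sketch below is a minimal illustration of that kernel, not UniQ's implementation: a single-qubit gate touches amplitude pairs whose indices differ only in the target qubit's bit.

```python
# Minimal state-vector simulation sketch (the kernel that simulators
# like UniQ accelerate): a gate is a 2x2 matrix applied to amplitude
# pairs differing in the target qubit's bit.

import math

def apply_1q_gate(state, gate, target, num_qubits):
    """Apply a 2x2 gate to `target` qubit of a 2^n amplitude vector."""
    new = list(state)
    step = 1 << target
    for i in range(0, 1 << num_qubits, step << 1):
        for j in range(i, i + step):
            a0, a1 = state[j], state[j + step]
            new[j] = gate[0][0] * a0 + gate[0][1] * a1
            new[j + step] = gate[1][0] * a0 + gate[1][1] * a1
    return new

H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]

state = [1.0, 0.0]                    # one qubit in |0>
state = apply_1q_gate(state, H, target=0, num_qubits=1)
# Hadamard yields an equal superposition: |amplitude|^2 = 0.5 each
assert all(abs(abs(a) ** 2 - 0.5) < 1e-12 for a in state)
```

The exponential cost is visible directly: every gate application sweeps the full 2^n vector, which is why mapping this loop nest efficiently onto CPUs, GPUs, and Sunway cores is the hard scheduling problem.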
Many organisations have a large network of connected computers, which at times may be idle. These could be used to run larger data processing problems were it not for the difficulty of organising and managing the depl...
Radio interferometry refers to the process of combining signals from multiple antennas to form an image of the radio source in the sky. Radio-astronomical signal processing using array telescopes is computationally challenging and poses strict performance and energy-efficiency requirements. The GMRT is one of the largest arrays, with many antennas working at metre wavelengths. The ongoing developmental activities for expansion of the GMRT (called the eGMRT) demand a manyfold increase in the computational cost and power budget while providing an increased collecting area as well as field of view, by building more antennas each equipped with a phased array feed (PAF). Recent FPGAs provide higher FLOPS per watt, making them an energy-efficient hardware platform suitable for projects like the eGMRT that require a high compute-to-power ratio. However, the traditional programming model for FPGAs has been a primary drawback of using them for high-performance computing. The recent advancement of parallel programming on FPGAs using the Open Computing Language (OpenCL) allows FPGAs to be used as general-purpose accelerators like GPUs. The aim of this project is to design an energy-efficient multi-element correlator and beamformer on an FPGA accelerator card using OpenCL, and to explore the possibilities of using such systems for real-time, number-crunching tasks.
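The correlator's core operation (the "X-engine") can be sketched compactly: for each antenna pair, accumulate products of one antenna's complex voltage samples with the conjugate of the other's. The sketch below is illustrative only; a real GMRT-class correlator channelizes first and runs this per frequency channel at enormous rates, which is the workload targeted at the FPGA.

```python
# Sketch of the core correlator operation (an "X-engine"): for each
# antenna pair, accumulate sample products with the conjugate of the
# partner antenna. Names and shapes are illustrative.

def correlate(samples):
    """samples[a][t]: complex voltage of antenna a at time t.
    Returns visibilities[(i, j)] for all pairs i <= j."""
    n = len(samples)
    vis = {}
    for i in range(n):
        for j in range(i, n):
            vis[(i, j)] = sum(x * y.conjugate()
                              for x, y in zip(samples[i], samples[j]))
    return vis

# two antennas seeing the same tone, the second with a 90-degree phase lead
tone = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
shifted = [v * 1j for v in tone]
vis = correlate([tone, shifted])
assert vis[(0, 0)] == 4 + 0j        # autocorrelation = total power
assert vis[(0, 1)] == -4j           # cross term encodes the phase offset
```

The pairwise structure makes the compute cost grow quadratically with the number of antennas, which is why the eGMRT expansion drives the manyfold increase in compute and power budget noted above.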