Search Results - Inner Mongolia University Library

Parallel programming models for heterogeneous many-cores: a comprehensive survey

CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2020, Vol. 2, No. 4, pp. 382-400

Authors: Fang, Jianbin; Huang, Chun; Tang, Tao; Wang, Zheng. Affiliations: Natl Univ Def Technol Coll Comp Inst Comp Syst Changsha Peoples R China; Univ Leeds Sch Comp Leeds W Yorkshire England

Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedding systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high-performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey for parallel programming models for heterogeneous many-core architectures and review the compiling techniques of improving programmability and portability. We examine various software optimization techniques for minimizing the communicating overhead between heterogeneous computing devices. We provide a road map for a wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.

Keywords: Heterogeneous computing; Many-core architectures; parallel programming models

Optimization Techniques for GPU-Based Parallel Programming Models in High-Performance Computing

信息工程期刊（中英文版）, 2024, Vol. 12, No. 1, pp. 7-11

Authors: Shuntao Tang; Wei Chen. Affiliation: Xihua University

This study embarks on a comprehensive examination of optimization techniques within GPU-based parallel programming models, pivotal for advancing high-performance computing (HPC). Emphasizing the transition of GPUs from graphic-centric processors to versatile computing units, it delves into the nuanced optimization of memory access, thread management, algorithmic design, and data *** optimizations are critical for exploiting the parallel processing capabilities of GPUs, addressing both the theoretical frameworks and practical *** integrating advanced strategies such as memory coalescing, dynamic scheduling, and parallel algorithmic transformations, this research aims to significantly elevate computational efficiency and *** findings underscore the potential of optimized GPU programming to revolutionize computational tasks across various domains, highlighting a pathway towards achieving unparalleled processing power and efficiency in HPC *** paper not only contributes to the academic discourse on GPU optimization but also provides actionable insights for developers, fostering advancements in computational sciences and technology.

Keywords: Optimization Techniques; GPU-Based parallel programming models; High-Performance Computing

An empirical performance evaluation of SYCL on ARM multi-core processors

CCF TRANSACTIONS ON HIGH PERFORMANCE COMPUTING, 2025, Vol. 7, No. 1, pp. 1-16

Authors: Liang, Hanzheng; Deng, Chencheng; Zhang, Peng; Fang, Jianbin; Tang, Tao; Huang, Chun. Affiliation: Natl Univ Def Technol Coll Comp Sci & Technol Changsha 410073 Peoples R China

SYCL is a modern royalty-free heterogeneous programming specification maintained by the Khronos Group. Recently, it has become increasingly more prevalent and matured, leading to various assessments of its performance, portability, and programmability. While previous evaluations have mainly focused on X86 CPUs, NVIDIA GPUs, and AMD GPUs, how well SYCL performs on ARM multi-core CPUs is still unknown. In this paper, we evaluate three SYCL implementations (i.e., DPCPP, AdaptiveCPP, and MLIR-SYCL) on ARM multi-core CPUs, to uncover performance traps and offer optimization techniques. We use the SYCL-Bench benchmark suite to assess the performance of DPCPP, AdaptiveCPP, and MLIR-SYCL against their OpenMP counterparts. We also assess the compiler and runtime overhead to evaluate the usability and productivity of the SYCL implementations. Our empirical results demonstrate that these SYCL implementations can achieve satisfactory performance on ARM multi-core processors. Additionally, we highlight several key optimizations, such as NUMA management, which must be carefully addressed to enhance performance.

Keywords: parallel programming models; SYCL; ARM CPUs; Performance evaluation

Enhancing Kokkos with OpenACC

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2024, Vol. 38, No. 5, pp. 409-426

Authors: Valero-Lara, Pedro; Lee, Seyong; Gonzalez-Tallada, Marc; Denny, Joel; Teranishi, Keita; Vetter, Jeffrey S. Affiliations: Oak Ridge Natl Lab 1 Bethel Valley Rd Oak Ridge TN 37830 USA; Univ Politecn Cataluna Barcelona Spain

C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method.

Keywords: OpenACC; C plus plus metaprogramming; Kokkos; CUDA; OpenMP target; parallel programming models

***: MPI-Based Asynchronous Task Execution for Python

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2023, Vol. 34, No. 2, pp. 611-622

Authors: Rogowski, Marcin; Aseeri, Samar; Keyes, David; Dalcin, Lisandro. Affiliations: King Abdullah Univ Sci & Technol KAUST Extreme Comp Res Ctr ECRC Thuwal 239556900 Saudi Arabia; King Abdullah Univ Sci & Technol KAUST Comp Sci Program Thuwal 239556900 Saudi Arabia

We present ***, a lightweight, asynchronous task execution framework targeting the Python programming language and using the Message Passing Interface (MPI) for interprocess communication. *** follows the interface of the *** package from the Python standard library and can be used as its drop-in replacement, while allowing applications to scale over multiple compute nodes. We discuss the design, implementation, and feature set of *** and compare its performance to other solutions on both shared and distributed memory architectures. On a shared-memory system, we show *** to consistently outperform Python's *** with speedup ratios between 1.4X and 3.7X in throughput (tasks per second) and between 1.9X and 2.9X in bandwidth. On a Cray XC40 system, we compare *** to Dask - a well-known Python parallel computing package. Although we note more varied results, we show *** to outperform Dask in most scenarios.

Keywords: MPI; Python; parallelism; master-worker; parallel programming models; distributed computing; high performance computing; task execution; multiprocessing

Exploring Traditional and Emerging Parallel Programming Models using a Proxy Application

IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Karlin, Ian; Bhatele, Abhinav; Keasler, Jeff; Chamberlain, Bradford L.; Cohen, Jonathan; DeVito, Zachary; Haque, Riyaz; Laney, Dan; Luke, Edward; Wang, Felix; Richards, David; Schulz, Martin; Still, Charles H. Affiliations: Lawrence Livermore Natl Lab POB 808 Livermore CA 94551 USA; Cray Res Inc Washington DC 98164 USA; Stanford Univ Stanford CA 94305 USA; Univ Calif Los Angeles Los Angeles CA 90095 USA; Mississippi State Univ Mississippi State MS 39762 USA; Univ Illinois Urbana IL 61801 USA

ISBN: (print) 9780769549712

Parallel machines are becoming more complex with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we focus on programmer productivity, performance and ease of applying optimizations.

Keywords: parallel programming models; productivity; performance; co-design; proxy application

Demistifying HPC-Quantum integration: it's all about scheduling

Workshop on High Performance and Quantum Computing Integration (HPQCI)

Author: Viviani, Paolo. Affiliation: LINKS Fdn Turin Italy

ISBN: (print) 9798400706431

Recent research on the integration between HPC and quantum computer was mostly focused on the software stack and quantum circuit compilation aspects, neglecting critical issues like HPC resource allocation and job scheduling given the scarcity of QPUs, and disregarding the heterogeneity of current quantum technologies and their computational models (e.g., digital vs. analogue). This work would like to bring the attention to issues that are critical to achieve integration with operational HPC environments given the current status of quantum computers maturity and heterogeneity.

Keywords: quantum computing; HPC; parallel programming models; job scheduling; resource allocation

Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes

29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Psota, James; Solar-Lezama, Armando. Affiliation: MIT CSAIL Cambridge MA 02139 USA

ISBN: (print) 9798400704352

Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced with the ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and, (b) by leveraging *** lock-free data structures in shared memory to achieve high-performance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups up to 2.1x on the CoMD molecular dynamics and the miniAMR adaptive mesh *** applications scaling up to 4,096 cores.

Keywords: parallel programming models; distributed runtime systems; task-based parallelism; concurrent data structures; lock-free data structures

sKokkos: Enabling Kokkos with Transparent Device Selection on Heterogeneous Systems using OpenACC

7th International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia)

Authors: Valero-Lara, Pedro; Lee, Seyong; Denny, Joel; Teranishi, Keita; Vetter, Jeffrey S.; Gonzalez-Tallada, Marc. Affiliations: Oak Ridge Natl Lab Oak Ridge TN 37830 USA; Univ Politecn Cataluna Barcelona Spain

ISBN: (print) 9798400708893

This paper presents a new feature to enable Kokkos with transparent device selection. For application developers, it is not easy to identify which device is the most appropriate to use in a heterogeneous system, since this depends on the characteristics of both the application and the hardware. In Kokkos, a backend is associated with one specific programming model/hardware. Programmers decide which backend to use at compilation time. This new feature implemented on the OpenACC backend eliminates the burden of deciding which device to use, providing a highly productive programming solution for Kokkos applications. This work includes implementation details and a performance study conducted with a set of mini-benchmarks (i.e., AXPY and dot product), kernels (Lattice-Boltzmann method), and two mini-apps (LULESH and miniFE) on two heterogeneous systems with different hardware capabilities. This new Kokkos feature provides high accelerations of up to 35x thanks to automatic and transparent device selection.

Keywords: Kokkos; OpenACC; C plus plus Metaprogramming; Heterogeneous Systems; CPU; GPU; parallel programming models; Auto-tuning

Programming big data analysis: principles and solutions

JOURNAL OF BIG DATA, 2022, Vol. 9, No. 1, pp. 1-50

Authors: Belcastro, Loris; Cantini, Riccardo; Marozzo, Fabrizio; Orsino, Alessio; Talia, Domenico; Trunfio, Paolo. Affiliations: Univ Calabria Arcavacata Di Rende Italy; Dtok Lab Arcavacata Di Rende Italy

In the age of the Internet of Things and social media platforms, huge amounts of digital data are generated by and collected from many sources, including sensors, mobile devices, wearable trackers and security cameras. This data, commonly referred to as Big Data, is challenging current storage, processing, and analysis capabilities. New models, languages, systems and algorithms continue to be developed to effectively collect, store, analyze and learn from Big Data. Most of the recent surveys provide a global analysis of the tools that are used in the main phases of Big Data management (generation, acquisition, storage, querying and visualization of data). Differently, this work analyzes and reviews parallel and distributed paradigms, languages and systems used today to analyze and learn from Big Data on scalable computers. In particular, we provide an in-depth analysis of the properties of the main parallel programming paradigms (MapReduce, workflow, BSP, message passing, and SQL-like) and, through programming examples, we describe the most used systems for Big Data analysis (e.g., Hadoop, Spark, and Storm). Furthermore, we discuss and compare the different systems by highlighting the main features of each of them, their diffusion (community of developers and users) and the main advantages and disadvantages of using them to implement Big Data analysis applications. The final goal of this work is to help designers and developers in identifying and selecting the best/appropriate programming solution based on their skills, hardware availability, application domains and purposes, and also considering the support provided by the developer community.

Keywords: parallel programming models; programming systems; Big Data analysis; MapReduce; Workflow; Message Passing; Bulk Synchronous Parallel; SQL-like
