Search Results - Inner Mongolia University Library

JACC: Leveraging HPC Meta-programming and Performance Portability with the Just-in-Time and LLVM-based Julia Language

2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024

Authors: Valero-Lara, Pedro; Godoy, William F.; Mankad, Het; Teranishi, Keita; Vetter, Jeffrey S.; Blaschke, Johannes; Schanen, Michel. Affiliations: Oak Ridge National Laboratory, Oak Ridge, TN, United States; Lawrence Berkeley National Laboratory, Berkeley, CA, United States; Argonne National Laboratory, Lemont, IL, United States

ISBN: (print) 9798350355543

We present JACC (Julia for Accelerators), the first high-level and performance-portable model for the just-in-time and LLVM-based Julia language. JACC provides a unified and lightweight front end across different back ends available in Julia, enabling the same Julia code to run efficiently on many HPC CPU and GPU targets. We evaluated the performance of JACC for common HPC kernels as well as for the most computationally demanding kernels used in two applications: HPCCG, a supercomputing benchmark test for sparse domains, and HARVEY, a blood flow simulator to assist in the diagnosis and treatment of patients suffering from vascular diseases. We carried out the performance analysis on the most advanced US DOE supercomputers: Aurora, Frontier, and Perlmutter. Overall, we show that JACC has a negligible overhead versus vendor-specific solutions, reporting GPU speedups with no extra cost to programmability. © 2024 IEEE.

Keywords: GPU Acceleration; Julia; Metaprogramming; Performance Portability; programming productivity

Performance without Pain = Productivity: Data Layout and Collective Communication in UPC

ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 08)

Authors: Nishtala, Rajesh; Almasi, George; Cascaval, Calin. Affiliation: Univ Calif Berkeley, Div Comp Sci, Berkeley, CA 94720, USA

ISBN: (print) 9781595939609

The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces us to reexamine some of the fundamental ways that we approach the design and use of parallel languages and runtime systems. In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. As opposed to MPI's communicators, our interface allows sets of threads to be placed in teams instantly rather than explicitly constructing communicators, thus allowing for more dynamic team construction and manipulation. We motivate our ideas with three application kernels: Dense Matrix Multiplication, Dense Cholesky factorization and multidimensional Fourier transforms. We describe how the three aforementioned applications can be succinctly written in UPC, thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors on the BlueGene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine that achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources rather than the interfaces themselves limit the performance.

Keywords: Blue Gene; Collective Communication; PGAS; Parallel programming; programming productivity; UPC

A parallel programming assessment for stream processing applications on multi-core systems

COMPUTER STANDARDS & INTERFACES, 2023, Vol. 84, p. 1

Authors: Andrade, Gabriella; Griebler, Dalvan; Santos, Rodrigo; Fernandes, Luiz Gustavo. Affiliations: Pontifical Catholic Univ Rio Grande Sul PUCRS, Sch Technol, Parallel Applicat Modeling Grp GMAP, Porto Alegre, Brazil; Tres De Maio Fac Setrem, Lab Adv Res Cloud Comp LARCC, Tres De Maio, Brazil; Fed Univ State Rio De Janeiro UNIRIO, Dept Appl Informat, Rio De Janeiro, Brazil

Multi-core systems are found in nearly every computing device nowadays, and stream processing applications are becoming recurrent workloads, demanding parallelism to achieve the desired quality of service. As soon as data, tasks, or requests arrive, they must be computed, analyzed, or processed. Since building such applications is not a trivial task, the software industry must adopt parallel APIs (Application Programming Interfaces) that simplify the exploitation of parallelism in hardware to accelerate time-to-market. In recent years, research efforts in academia and industry have provided a set of parallel APIs, increasing productivity for software developers. However, few studies seek to prove the usability of these interfaces. In this work, we present a parallel programming assessment regarding the usability of parallel APIs for expressing parallelism in the stream processing application domain on multi-core systems. To this end, we conducted an empirical study with beginners in parallel application development. The study covered three parallel APIs, reporting several quantitative and qualitative indicators involving developers. Our contribution also comprises a parallel programming assessment methodology, which can be replicated in future assessments. This study revealed important insights such as recurrent compile-time and programming logic errors made by beginners in parallel programming, as well as the programming effort, challenges, and learning curve. Moreover, we collected the participants' opinions about their experience in this study to understand the results more deeply.

Keywords: Parallel software; Parallel computing systems; programming productivity; programming effort; Stream parallelism; programming usability

Collective Asynchronous Remote Invocation (CARI): A High-Level and Efficient Communication API for Irregular Applications

Procedia Computer Science, 2011, Vol. 4, pp. 26-35

Authors: Wakeel Ahmad; Bryan Carpenter; Aamir Shafi. Affiliations: School of Electrical Engineering and Computer Science, National University of Sciences and Technology, Islamabad 44000, Pakistan; School of Computing, University of Portsmouth, PO1 2UP, UK; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

The Message Passing Interface (MPI) standard continues to dominate the landscape of parallel computing as the de facto API for writing large-scale scientific applications. But critics argue that it is a low-level API and harder to use than shared memory approaches. This paper addresses the issue of programming productivity by proposing a high-level, easy-to-use, and efficient programming API that hides and segregates complex low-level message passing code from the application-specific code. Our proposed API is inspired by communication patterns found in Gadget-2, which is an MPI-based parallel production code for cosmological N-body and hydrodynamic simulations. In this paper, we analyze Gadget-2 with a view to understanding what high-level Single Program Multiple Data (SPMD) communication abstractions might be developed to replace the intricate use of MPI in such an irregular application, and to do so without compromising efficiency. Our analysis revealed that the use of low-level MPI primitives, bundled with the computation code, makes Gadget-2 difficult to understand and probably hard to maintain. In addition, we found that the original Gadget-2 code contains a small handful of complex and recurring patterns of message passing. We also noted that these complex patterns can be reorganized into a higher-level communication library with some modifications to the Gadget-2 code. We present the implementation and evaluation of one such message passing pattern (or schedule) that we term Collective Asynchronous Remote Invocation (CARI). As the name suggests, CARI is a collective variant of Remote Method Invocation (RMI), which is an attractive, high-level, and established paradigm in distributed systems programming. The CARI API might be implemented in several ways; we develop and evaluate two versions of this API on a compute cluster. The performance evaluation reveals that CARI versions of the Gadget-2 code perform as well as the original Gadget-2 code but the lev...

Keywords: SPMD Communication; programming productivity; CARI; Asynchronous CARI; Synchronous CARI

A Framework for Efficient Execution of Data Parallel Irregular Applications on Heterogeneous Systems

PARALLEL PROCESSING LETTERS, 2015, Vol. 25, No. 2, pp. 1550004-1550004

Authors: Ribeiro, Roberto; Barbosa, Joao; Santos, Luis Paulo

Exploiting the computing power of the diversity of resources available on heterogeneous systems is mandatory but a very challenging task. The diversity of architectures, execution models and programming tools, together with disjoint address spaces and different computing capabilities, raises a number of challenges that severely impact application performance and programming productivity. This problem is further compounded in the presence of data parallel irregular applications. This paper presents a framework that addresses development and execution of data parallel irregular applications in heterogeneous systems. A unified task-based programming and execution model is proposed, together with inter- and intra-device scheduling, which, coupled with a data management system, aims to achieve performance scalability across multiple devices while maintaining high programming productivity. Intra-device scheduling on wide SIMD/SIMT architectures resorts to consumer-producer kernels, which, by allowing dynamic generation and rescheduling of new work units, enable balancing irregular workloads and increase resource utilization. Results show that regular and irregular applications scale well with the number of devices, while requiring minimal programming effort. Consumer-producer kernels are able to sustain significant performance gains as long as the workload per basic work unit is enough to compensate for overheads associated with intra-device scheduling. This not being the case, consumer kernels can still be used for the irregular application. Comparisons with an alternative framework, StarPU, which targets regular workloads, consistently demonstrate significant speedups. This is, to the best of our knowledge, the first published integrated approach that successfully handles irregular workloads over heterogeneous systems.

Keywords: Heterogeneous systems; irregular applications; efficiency; programming productivity
