On any modern computer architecture today, parallelism comes with a modest cost, born from the creation and management of threads or tasks. Today, programmers battle this cost by manually optimizing/tuning their codes to minimize the cost of parallelism without harming its benefit, performance. This is a difficult battle: programmers must reason about architectural constant factors hidden behind layers of software abstractions, including thread schedulers and memory managers, and about the impact of these factors on performance, especially at scale. In languages that support higher-order functions, the battle becomes harder still: higher-order functions can make it difficult, if not impossible, to reason about the costs and benefits of parallelism. Motivated by these challenges and the numerous advantages of high-level languages, we believe that it has become essential to manage parallelism automatically so as to minimize its cost and maximize its benefit. This is a challenging problem, even when considered on a case-by-case, application-specific basis. But if a solution were possible, it could combine the many correctness benefits of high-level languages with performance, by managing parallelism without the programmer effort needed to ensure performance. This paper proposes techniques for such automatic management of parallelism by combining static (compilation) and run-time techniques. Specifically, we consider the Parallel ML language with task parallelism and describe a compiler pipeline that embeds "potential parallelism" directly into the call stack and avoids the cost of task creation by default. We then pair this compilation pipeline with a run-time system that dynamically converts potential parallelism into actual parallel tasks. Together, the compiler and run-time system guarantee that the cost of parallelism remains low without losing its benefit. We prove that our techniques have no asymptotic impact on the work and span of parallel programs and thus preserve their asymptotic properties.
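To make the idea of "potential parallelism" concrete, the following sketch is a loose conceptual illustration in C++ (the paper itself targets Parallel ML through a compiler pipeline, not a library): a fork point runs its two branches sequentially by default, and a hypothetical runtime policy, here a simple should_promote() check on idle workers, decides when to turn the second branch into a real task.

    #include <atomic>
    #include <cstdio>
    #include <future>
    #include <utility>

    // Hypothetical runtime policy: promote latent parallelism only when some
    // worker is idle, so the default sequential path pays no task-creation cost.
    std::atomic<int> idle_workers{0};

    bool should_promote() {
        return idle_workers.load(std::memory_order_relaxed) > 0;
    }

    // A fork point that keeps parallelism "potential" by default: f and g are
    // the two branches of a parallel pair.
    template <typename F, typename G>
    auto potential_par(F f, G g) {
        if (should_promote()) {
            auto right = std::async(std::launch::async, g);  // manifest a real task
            auto a = f();
            return std::make_pair(a, right.get());
        }
        auto a = f();  // default: two plain sequential calls
        auto b = g();
        return std::make_pair(a, b);
    }

    // Example: recursive calls stay sequential unless the runtime promotes them.
    long fib(long n) {
        if (n < 2) return n;
        auto [x, y] = potential_par([n] { return fib(n - 1); },
                                    [n] { return fib(n - 2); });
        return x + y;
    }

    int main() { std::printf("fib(30) = %ld\n", fib(30)); }

The point the sketch tries to capture is that the default path pays only a flag check rather than the cost of creating a task.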
Although atomicity plays a key role in operations on shared variables in parallel computation, atomicity in Python has not been treated in much detail by researchers. This study provides a novel approach that integrates CPU-based atomic C APIs into Python shared variables through the C Foreign Function Interface for Python (CFFI) on all major platforms, and utilises Cython to optimise calculation in CPython. Evidence shows that the resulting product, Shared Atomic Enterprise (SAE), can substantially accelerate operations on shared data types. These findings provide a solid evidence base for the wide use of Python atomic operations in parallel computation and concurrent programming.
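For readers unfamiliar with the underlying primitive, the sketch below shows, in C++ rather than the paper's Python/CFFI code, the kind of CPU-level atomic read-modify-write that such a binding exposes: a lock-free fetch-and-add on a counter shared by several threads.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        // A shared counter updated with a CPU atomic fetch-add; per the abstract,
        // SAE exposes primitives of this kind to Python shared variables via CFFI
        // (the Python-side API itself is not reproduced here).
        std::atomic<long> counter{0};

        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t)
            workers.emplace_back([&counter] {
                for (int i = 0; i < 100000; ++i)
                    counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& w : workers) w.join();

        // Without atomicity the final count would typically fall short of the
        // expected value because of lost updates.
        std::printf("count = %ld (expected %d)\n", counter.load(), 8 * 100000);
    }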
ISBN (print): 9798400708893
Work stealing is a well-known technique for dynamic load balancing; however, manually writing work-stealing protocols is error-prone. We can use the Tascell parallel programming language for the correct and portable implementation of work stealing; the implementation combines polling and adequate mutual exclusion. In Tascell, we can express on-demand concurrency for backtracking-based load balancing, where a worker performs a sequential computation with its own execution stack unless it is requested to spawn a task. To spawn a larger task by temporarily backtracking, nested functions can be used for legitimate execution-stack access. As nested functions for extended C languages, we can use GCC's heavyweight implementation with runtime code generation or lightweight implementations obtained by enhancing GCC; however, compiler-based implementations are poor in portability. In this study, we implement and evaluate more portable Tascell frameworks, called "Tascell/SC", by using transformation-based portable implementations of nested functions. In addition, we propose Tascell-inspired portable frameworks written in C++ only, called "Tascell++", that use lambda expressions in C++11 for legitimate execution-stack access.
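The role of lambda expressions in Tascell++ can be illustrated with a small, self-contained C++ sketch (this is not the actual Tascell or Tascell++ API): each recursion frame registers a lambda that captures the frame's loop variables by reference, so a steal request, detected by polling, can backtrack to an old frame and carve off a large chunk of its remaining work.

    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Polled flag standing in for a Tascell task request from an idle worker.
    std::atomic<bool> steal_request{false};

    // One "spawner" per active frame, each capturing that frame's loop variables
    // by reference -- the lambda-based stand-in for the legitimate execution-stack
    // access that Tascell obtains from nested functions.
    std::vector<std::function<void()>> spawners;

    void handle_steal_if_requested() {
        if (steal_request.exchange(false) && !spawners.empty())
            spawners.front()();  // backtrack to the oldest frame: spawn a large task
    }

    long long search(long long lo, long long hi) {  // toy sequential workload
        long long sum = 0;
        long long i = lo;
        // Register a spawner that can hand away the untouched half of this frame.
        spawners.push_back([&i, &hi] {
            long long mid = i + (hi - i) / 2;
            std::printf("spawning a task for range [%lld, %lld)\n", mid, hi);
            hi = mid;  // this frame keeps only the first half
        });
        for (; i < hi; ++i) {
            handle_steal_if_requested();  // polling, as in Tascell
            sum += i;                     // useful sequential work
        }
        spawners.pop_back();
        return sum;
    }

    int main() {
        steal_request = true;  // simulate one incoming steal request
        std::printf("local sum = %lld\n", search(0, 1000000));
    }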
This special issue aims to present new developments and advances in techniques for assessing the performance portability of high-performance computing applications. It contains revised and extended versions of selected p...
Heterogeneous programming models are becoming increasingly popular to support ever-evolving hardware architectures, especially new and emerging specialized accelerators optimized for specific tasks. While such programming models provide performance portability of existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect application performance due to the overheads of data transfer, synchronization, and kernel launch. In applications with one or two short-running kernels the overhead can be negligible, but it becomes noticeable when short-running kernels dominate the overall number of kernels in an application, as is the case in graph-based neural network models, where several small memory-bound nodes sit alongside a few large compute-bound nodes. To reduce the overhead, combining several kernels into a single, more optimized kernel is an active area of research. However, this task can be time-consuming and error-prone given the huge set of potential combinations. This can push programmers to seek a trade-off between (a) task-specific kernels with low overhead that are hard to maintain and (b) smaller modular kernels with higher overhead that are easier to maintain. While DSL-based approaches, such as those provided for machine learning frameworks, offer the possibility of such fusion, they are limited to a particular domain, exploit specific knowledge of that domain, and are consequently hard to port elsewhere. This study explores the feasibility of user-driven kernel fusion through an extension to the SYCL API that addresses the automation of kernel fusion. The proposed solution requires programmers to define the subgraph regions that are potentially suitable for fusion, without any modification to the kernel code or the function signature. We evaluate the performance benefit of our approach on common neural networks and study the performance improvement in detail.
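The benefit that kernel fusion targets can be shown with a plain C++ stand-in for device kernels (this is not the proposed SYCL extension, whose API is not reproduced here): two small memory-bound passes versus a single fused pass that launches once and streams the data through memory once.

    #include <cstdio>
    #include <vector>

    // Two small, memory-bound "kernels" applied separately: each pass launches
    // work and streams the whole array through memory once.
    void scale(std::vector<float>& x, float a) {
        for (auto& v : x) v *= a;
    }
    void bias(std::vector<float>& x, float b) {
        for (auto& v : x) v += b;
    }

    // The fused equivalent: one launch, a single trip through memory. The SYCL
    // extension described in the abstract aims to obtain this form automatically
    // once the programmer marks the region as fusible, without hand-writing the
    // combined kernel.
    void scale_bias_fused(std::vector<float>& x, float a, float b) {
        for (auto& v : x) v = v * a + b;
    }

    int main() {
        std::vector<float> data(1 << 20, 1.0f);
        scale(data, 2.0f);                  // unfused path: two kernels, two passes
        bias(data, 1.0f);
        scale_bias_fused(data, 2.0f, 1.0f); // fused path: one kernel, one pass
        std::printf("data[0] = %f\n", data[0]);
    }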
ISBN (print): 9781450383912
Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
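A deliberately simplified C++ sketch of the heartbeat idea follows (TPAL itself operates at the assembly level with kernel and hardware interrupt support; the timer thread, std::async tasks, and the 1024-iteration threshold below are stand-ins chosen for illustration): parallelism stays latent in the loop, and only on a heartbeat is a slice of the remaining work promoted into a real task, amortizing the promotion cost over the work done between beats.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <thread>
    #include <vector>

    // Heartbeat flag set at a fixed rate by a timer thread; this stands in for
    // the kernel/hardware interrupt mechanisms that TPAL builds on.
    std::atomic<bool> heartbeat{false};

    // Between beats the loop pays only one flag check per iteration; on a beat,
    // the second half of the remaining range is promoted into a real task.
    long long sum_range(long long lo, long long hi) {
        long long acc = 0;
        std::vector<std::future<long long>> promoted;
        for (long long i = lo; i < hi; ++i) {
            if (heartbeat.exchange(false) && hi - i > 1024) {  // assumed grain size
                long long mid = i + (hi - i) / 2;
                promoted.push_back(std::async(std::launch::async, sum_range, mid, hi));
                hi = mid;
            }
            acc += i;
        }
        for (auto& f : promoted) acc += f.get();
        return acc;
    }

    int main() {
        std::thread timer([] {
            for (int beat = 0; beat < 1000; ++beat) {
                std::this_thread::sleep_for(std::chrono::microseconds(100));
                heartbeat.store(true);
            }
        });
        long long total = sum_range(0, 50000000);
        timer.join();
        std::printf("total = %lld\n", total);
    }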
Micro-core architectures combine many low-memory, low-power computing cores together in a single package. These are attractive for use as accelerators, but due to limited on-chip memory and multiple levels of memory hierarchy, the way in which programmers offload kernels needs to be carefully considered. In this paper we use Python as a vehicle for exploring the semantics and abstractions of higher-level programming languages to support the offloading of computational kernels to these devices. By moving to a pass-by-reference model, along with leveraging memory kinds, we demonstrate the ability to easily and efficiently take advantage of multiple levels in the memory hierarchy, even ones that are not directly accessible to the micro-cores. Using a machine learning benchmark, we perform experiments on both Epiphany-III and MicroBlaze based micro-cores, demonstrating the ability to compute with data sets of arbitrarily large size. To provide context for our results, we explore the performance and power efficiency of these technologies, demonstrating that whilst the two micro-core technologies are competitive within their own embedded class of hardware, there is still a way to go to reach HPC-class GPUs.
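As a rough illustration of the chunked, pass-by-reference offloading style the abstract describes, the C++ sketch below (hypothetical names; the paper's interface is Python-based) streams an arbitrarily large host array through a small buffer standing in for core-local memory, with a MemKind tag indicating where staging would occur.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical memory kinds for a micro-core device; the names are
    // illustrative and not the paper's API.
    enum class MemKind { core_local, shared_dram, host };

    constexpr std::size_t kCoreLocalWords = 4096;  // assumed on-chip capacity

    void kernel(float* chunk, std::size_t n) {     // stand-in device kernel
        for (std::size_t i = 0; i < n; ++i) chunk[i] *= 2.0f;
    }

    // Stream an arbitrarily large host-resident array through a small buffer that
    // stands in for core-local memory: the kernel sees a reference to each chunk,
    // never a copy of the whole data set.
    void offload(std::vector<float>& data, MemKind staging) {
        (void)staging;  // a real runtime would choose the staging area by kind
        std::vector<float> scratch(kCoreLocalWords);
        for (std::size_t off = 0; off < data.size(); off += kCoreLocalWords) {
            std::size_t n = std::min(kCoreLocalWords, data.size() - off);
            std::copy(data.begin() + off, data.begin() + off + n, scratch.begin());
            kernel(scratch.data(), n);  // compute on the staged chunk
            std::copy(scratch.begin(), scratch.begin() + n, data.begin() + off);
        }
    }

    int main() {
        std::vector<float> data(1 << 20, 1.0f);  // far larger than core-local memory
        offload(data, MemKind::core_local);
        std::printf("data[0] = %f\n", data[0]);  // 2.0
    }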
ISBN (print): 9783030343569; 9783030343552
CAPI SNAP (Storage, Network, and Analytics Programming) is an open-source framework which enables C/C++ as well as FPGA programmers to quickly create FPGA-based accelerated computing that works on server host data, as well as data from storage, flash, Ethernet, or other connected resources. The SNAP framework is based on the IBM Coherent Accelerator Processor Interface (CAPI). From POWER8 with CAPI 1.0 to POWER9 with CAPI 2.0 and OpenCAPI, programmers have access to a very simple framework for developing accelerated applications using high-speed, very low-latency interfaces to access an external FPGA. With SNAP, no specific hardware skill is required to port or develop an application and then accelerate it. Moreover, a cloud environment is offered as a cost-effective, ready-to-use setting for both a first-time-right experience and deeper development, so that either can be achieved with very little investment.
ISBN (print): 9781728159874
This paper proposes priority- and weight-based steal strategies for an idle worker (thief) to select a victim worker in work-stealing frameworks. Typical work-stealing frameworks employ uniformly random victim selection. We implemented the proposed strategies on a work-stealing framework called Tascell; Tascell programmers can let each worker estimate and declare, as a real number, the amount of remaining work required to complete its current task, and the declared values are used as priorities or weights in the enhanced Tascell framework. To reduce the total task-division cost, the proposed strategies avoid stealing small tasks. With the priority-based strategy, a thief selects the victim that has the highest known priority at that point in time. With the weight-based, non-uniformly random strategy, a thief uses the relative weights of victim candidates as their selection probabilities. The proposed selection strategies outperformed uniformly random victim selection. Our evaluation uses a parallel implementation of the "highly serial" version of the Barnes-Hut force-calculation algorithm in a shared-memory environment and five benchmark programs in a distributed-memory environment.
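A minimal C++ sketch of the weight-based strategy follows (illustrative only, not Tascell's implementation): the declared remaining-work values become selection weights, workers below a threshold are excluded so that small tasks are not stolen, and the priority-based variant would simply pick the candidate with the largest declared value instead of sampling.

    #include <cstdio>
    #include <random>
    #include <vector>

    // Each worker declares an estimate of its remaining work (a real number),
    // as in the enhanced Tascell framework described above.
    struct Worker { double declared_remaining_work; };

    // Weight-based, non-uniformly random victim selection: a thief uses the
    // declared amounts as selection probabilities and skips workers whose
    // declared work is below a threshold, to avoid stealing small tasks.
    int pick_victim(const std::vector<Worker>& workers, int self,
                    double min_worth_stealing, std::mt19937& rng) {
        std::vector<double> weights(workers.size(), 0.0);
        for (std::size_t i = 0; i < workers.size(); ++i) {
            double w = workers[i].declared_remaining_work;
            if (static_cast<int>(i) != self && w >= min_worth_stealing)
                weights[i] = w;
            // A real implementation would fall back to uniform selection when
            // no candidate is worth stealing from.
        }
        std::discrete_distribution<int> choose(weights.begin(), weights.end());
        return choose(rng);  // index of the selected victim
    }

    int main() {
        std::vector<Worker> workers = {{0.1}, {5.0}, {2.5}, {40.0}};
        std::mt19937 rng(42);
        int thief = 0;
        for (int trial = 0; trial < 5; ++trial)
            std::printf("victim = %d\n", pick_victim(workers, thief, 1.0, rng));
    }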