Modern Python programs in high-performance computing call into compiled libraries and kernels for performance-critical tasks. However, effectively parallelizing these finer-grained, and often dynamic, kernels across modern heterogeneous platforms remains a challenge. This paper designs and optimizes a multi-threaded runtime for Python tasks on single-node multi-GPU systems, including tasks that use resources across multiple devices. We perform an experimental study that examines the impact of Python's Global Interpreter Lock (GIL) on runtime performance and the potential gains under a GIL-less PEP 703 future. This work explores tasks with variants for different device sets, introducing new programming abstractions and runtime mechanisms to simplify their management and enhance portability. Our experimental analysis, using task graphs from synthetic and real applications, shows at least a 3× (and up to 6×) performance improvement over its predecessor in scenarios with high GIL contention. Our implementation of multi-device tasks achieves 8× less overhead per task relative to a multi-process alternative using Ray.
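The "task variants" idea from this abstract lends itself to a small illustration. The sketch below shows one logical task registering implementations for different device sets, with a runtime picking a matching variant at launch time. The `Task` class, the `variant` decorator, and the device-name strings are all illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of tasks with per-device-set variants (hypothetical API).
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    variants: dict = field(default_factory=dict)  # device set -> callable

    def variant(self, *devices):
        """Register an implementation valid for the given device set."""
        def register(fn):
            self.variants[frozenset(devices)] = fn
            return fn
        return register

    def run(self, available):
        """Dispatch to the first registered variant whose devices are available."""
        for devs, fn in self.variants.items():
            if devs <= frozenset(available):
                return fn()
        raise RuntimeError(f"no variant of {self.name} fits {available}")

matmul = Task("matmul")

@matmul.variant("cpu")
def matmul_cpu():
    return "ran on CPU"

@matmul.variant("gpu:0", "gpu:1")
def matmul_2gpu():
    return "ran across two GPUs"

print(matmul.run({"gpu:0", "gpu:1"}))  # -> ran across two GPUs
print(matmul.run({"cpu"}))             # -> ran on CPU
```

A real runtime would also weigh data placement and device load when choosing among variants; the point here is only that one task name can carry multiple device-specific implementations, which is what makes the abstraction portable.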
Parallel programming with tasks - task parallel programming - is a promising approach to simplifying multithreaded programming in the chip multiprocessor (CMP) era. Tasks are used to describe independent units of work that can be assigned to threads at runtime in a way that is transparent to the programmer. Thus, the programmer can concentrate on identifying tasks and leave it to the runtime system to exploit the potential parallelism. Supporting the task abstraction on heterogeneous CMPs is more challenging than on conventional CMPs. In this article, we examine a lightweight task model and its implementation on the Cell processor, the most prominent heterogeneous CMP available today. Choosing a simple task model over a more complex one makes it possible to target fine-grained parallelism while still substantially improving programmability.
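The core contract of such a task model can be shown in a few lines: the programmer only identifies independent units of work, and a small runtime assigns them to worker threads. This is a generic sketch of that division of labor, not the Cell-specific implementation the article describes.

```python
# Generic lightweight task runtime: submit work units, workers pull them.
import queue
import threading

task_queue: "queue.Queue" = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        task = task_queue.get()
        if task is None:          # sentinel: no more work for this worker
            break
        out = task()              # run one independent unit of work
        with results_lock:
            results.append(out)

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# The programmer's only job: identify tasks and submit them.
for i in range(16):
    task_queue.put(lambda i=i: i * i)

for _ in workers:                 # one sentinel per worker
    task_queue.put(None)
for w in workers:
    w.join()

print(sorted(results))            # squares of 0..15, computed in parallel
```

On a heterogeneous CMP the hard part hidden by this interface is the assignment step: the runtime must also move code and data to the right kind of core, which is why a simple task model pays off in programmability.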
Task parallelism raises the level of abstraction in shared-memory parallel programming to simplify the development of complex applications. However, task-parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation - additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task-parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task-parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3× compared to the Intel OpenMP task scheduler.
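The locality-aware scheduling idea is easy to picture with a toy model: each NUMA domain gets its own task queue, tasks are enqueued on the domain holding their data, and a worker steals from another domain only when its local queue is empty. The two-domain setup and the `home_domain` tag below are illustrative assumptions, not the Qthreads API.

```python
# Toy locality-aware scheduler: per-NUMA-domain queues with steal fallback.
from collections import deque

NUM_DOMAINS = 2
queues = [deque() for _ in range(NUM_DOMAINS)]

def submit(task, home_domain):
    """Enqueue a task on the domain holding its data (e.g. first-touch)."""
    queues[home_domain].append(task)

def next_task(worker_domain):
    """Prefer local work; steal from a remote domain only to avoid idleness."""
    if queues[worker_domain]:
        return queues[worker_domain].popleft(), "local"
    for d in range(NUM_DOMAINS):          # fallback: remote steal
        if d != worker_domain and queues[d]:
            return queues[d].popleft(), "stolen"
    return None, "idle"

submit(lambda: "sum chunk 0", home_domain=0)
submit(lambda: "sum chunk 1", home_domain=1)

task, kind = next_task(worker_domain=0)
print(task(), "-", kind)   # -> sum chunk 0 - local: no remote memory access
```

Keeping tasks on their home domain is exactly what curbs work time inflation here: a stolen task still completes, but every load it issues crosses the NUMA interconnect and inflates its work time relative to the sequential baseline.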