Modern Python programs in high-performance computing call into compiled libraries and kernels for performance-critical tasks. However, effectively parallelizing these finer-grained, and often dynamic, kernels across modern heterogeneous platforms remains a challenge. This paper designs and optimizes a multi-threaded runtime for Python tasks on single-node multi-GPU systems, including tasks that use resources across multiple devices. We perform an experimental study that examines the impact of Python's Global Interpreter Lock (GIL) on runtime performance and the potential gains under a GIL-less PEP 703 future. This work explores tasks with variants for different device sets, introducing new programming abstractions and runtime mechanisms to simplify their management and enhance portability. Our experimental analysis, using task graphs from synthetic and real applications, shows at least a 3× (and up to 6×) performance improvement over its predecessor in scenarios with high GIL contention. Our implementation of multi-device tasks achieves 8× less overhead per task relative to a multi-process alternative using Ray.
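The "task variants" idea from this abstract lends itself to a small illustration. The sketch below shows one logical task registering implementations for different device sets, with a runtime picking a matching variant at launch time. The `Task` class, the `variant` decorator, and the device-name strings are all illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch of tasks with per-device-set variants (hypothetical API).
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    variants: dict = field(default_factory=dict)  # device set -> callable

    def variant(self, *devices):
        """Register an implementation valid for the given device set."""
        def register(fn):
            self.variants[frozenset(devices)] = fn
            return fn
        return register

    def run(self, available):
        """Dispatch to the first registered variant whose devices are available."""
        for devs, fn in self.variants.items():
            if devs <= frozenset(available):
                return fn()
        raise RuntimeError(f"no variant of {self.name} fits {available}")

matmul = Task("matmul")

@matmul.variant("cpu")
def matmul_cpu():
    return "ran on CPU"

@matmul.variant("gpu:0", "gpu:1")
def matmul_2gpu():
    return "ran across two GPUs"

print(matmul.run({"gpu:0", "gpu:1"}))  # -> ran across two GPUs
print(matmul.run({"cpu"}))             # -> ran on CPU
```

A real runtime would also weigh data placement and device load when choosing among variants; the point here is only that one task name can carry multiple device-specific implementations, which is what makes the abstraction portable.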
Parallel programming with tasks - task parallel programming - is a promising approach to simplifying multithreaded programming in the chip multiprocessor (CMP) era. Tasks are used to describe independent units of work that can be assigned to threads at runtime in a way that is transparent to the programmer. Thus, the programmer can concentrate on identifying tasks and leave it to the runtime system to exploit the potential parallelism. Supporting the task abstraction on heterogeneous CMPs is more challenging than on conventional CMPs. In this article, we examine a lightweight task model and its implementation on the Cell processor, the most prominent heterogeneous CMP available today. Choosing a simple task model over a more complex one makes it possible to target fine-grained parallelism while still substantially improving programmability.
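The core contract of such a task model can be shown in a few lines: the programmer only identifies independent units of work, and a small runtime assigns them to worker threads. This is a generic sketch of that division of labor, not the Cell-specific implementation the article describes.

```python
# Generic lightweight task runtime: submit work units, workers pull them.
import queue
import threading

task_queue: "queue.Queue" = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        task = task_queue.get()
        if task is None:          # sentinel: no more work for this worker
            break
        out = task()              # run one independent unit of work
        with results_lock:
            results.append(out)

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()

# The programmer's only job: identify tasks and submit them.
for i in range(16):
    task_queue.put(lambda i=i: i * i)

for _ in workers:                 # one sentinel per worker
    task_queue.put(None)
for w in workers:
    w.join()

print(sorted(results))            # squares of 0..15, computed in parallel
```

On a heterogeneous CMP the hard part hidden by this interface is the assignment step: the runtime must also move code and data to the right kind of core, which is why a simple task model pays off in programmability.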
Task parallelism raises the level of abstraction in shared-memory parallel programming to simplify the development of complex applications. However, task-parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation - additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task-parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task-parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3× compared to the Intel OpenMP task scheduler.
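The locality-aware scheduling idea is easy to picture with a toy model: each NUMA domain gets its own task queue, tasks are enqueued on the domain holding their data, and a worker steals from another domain only when its local queue is empty. The two-domain setup and the `home_domain` tag below are illustrative assumptions, not the Qthreads API.

```python
# Toy locality-aware scheduler: per-NUMA-domain queues with steal fallback.
from collections import deque

NUM_DOMAINS = 2
queues = [deque() for _ in range(NUM_DOMAINS)]

def submit(task, home_domain):
    """Enqueue a task on the domain holding its data (e.g. first-touch)."""
    queues[home_domain].append(task)

def next_task(worker_domain):
    """Prefer local work; steal from a remote domain only to avoid idleness."""
    if queues[worker_domain]:
        return queues[worker_domain].popleft(), "local"
    for d in range(NUM_DOMAINS):          # fallback: remote steal
        if d != worker_domain and queues[d]:
            return queues[d].popleft(), "stolen"
    return None, "idle"

submit(lambda: "sum chunk 0", home_domain=0)
submit(lambda: "sum chunk 1", home_domain=1)

task, kind = next_task(worker_domain=0)
print(task(), "-", kind)   # -> sum chunk 0 - local: no remote memory access
```

Keeping tasks on their home domain is exactly what curbs work time inflation here: a stolen task still completes, but every load it issues crosses the NUMA interconnect and inflates its work time relative to the sequential baseline.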