On any modern computer architecture today, parallelism comes with a modest cost, born from the creation and management of threads or tasks. Today, programmers battle this cost by manually optimizing/tuning their codes to minimize the cost of parallelism without harming its benefit, performance. This is a difficult battle: programmers must reason about architectural constant factors hidden behind layers of software abstractions, including thread schedulers and memory managers, and about the impact of these factors on performance, especially at scale. In languages that support higher-order functions, the battle becomes harder still: higher-order functions can make it difficult, if not impossible, to reason about the costs and benefits of parallelism. Motivated by these challenges and the numerous advantages of high-level languages, we believe that it has become essential to manage parallelism automatically so as to minimize its cost and maximize its benefit. This is a challenging problem, even when considered on a case-by-case, application-specific basis. But if a solution were possible, it could combine the many correctness benefits of high-level languages with performance, by managing parallelism without the programmer effort needed to ensure performance. This paper proposes techniques for such automatic management of parallelism by combining static (compilation) and run-time techniques. Specifically, we consider the Parallel ML language with task parallelism and describe a compiler pipeline that embeds "potential parallelism" directly into the call stack and avoids the cost of task creation by default. We then pair this compilation pipeline with a run-time system that dynamically converts potential parallelism into actual parallel tasks. Together, the compiler and run-time system guarantee that the cost of parallelism remains low without losing its benefit. We prove that our techniques have no asymptotic impact on the work and span of parallel programs and thus preserve their asymptotic properties.
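To make the idea of "potential parallelism" concrete, the following sketch is a loose conceptual illustration in C++ (the paper itself targets Parallel ML through a compiler pipeline, not a library): a fork point runs its two branches sequentially by default, and a hypothetical runtime policy, here a simple should_promote() check on idle workers, decides when to turn the second branch into a real task.

    #include <atomic>
    #include <cstdio>
    #include <future>
    #include <utility>

    // Hypothetical runtime policy: promote latent parallelism only when some
    // worker is idle, so the default sequential path pays no task-creation cost.
    std::atomic<int> idle_workers{0};

    bool should_promote() {
        return idle_workers.load(std::memory_order_relaxed) > 0;
    }

    // A fork point that keeps parallelism "potential" by default: f and g are
    // the two branches of a parallel pair.
    template <typename F, typename G>
    auto potential_par(F f, G g) {
        if (should_promote()) {
            auto right = std::async(std::launch::async, g);  // manifest a real task
            auto a = f();
            return std::make_pair(a, right.get());
        }
        auto a = f();  // default: two plain sequential calls
        auto b = g();
        return std::make_pair(a, b);
    }

    // Example: recursive calls stay sequential unless the runtime promotes them.
    long fib(long n) {
        if (n < 2) return n;
        auto [x, y] = potential_par([n] { return fib(n - 1); },
                                    [n] { return fib(n - 2); });
        return x + y;
    }

    int main() { std::printf("fib(30) = %ld\n", fib(30)); }

The point the sketch tries to capture is that the default path pays only a flag check rather than the cost of creating a task.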
Although atomicity plays a key role in operations on shared variables in parallel computation, atomicity in Python has not been treated in much detail by researchers. This study provides a novel approach that integrates CPU-based atomic C APIs into Python shared variables through the C Foreign Function Interface for Python (CFFI) on all major platforms, and utilises Cython to optimise calculation in CPython. Evidence shows that the resulting product, Shared Atomic Enterprise (SAE), can substantially accelerate operations on shared data types. These findings provide a solid evidence base for the wide use of Python atomic operations in parallel computation and concurrent programming.
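For readers unfamiliar with the underlying primitive, the sketch below shows, in C++ rather than the paper's Python/CFFI code, the kind of CPU-level atomic read-modify-write that such a binding exposes: a lock-free fetch-and-add on a counter shared by several threads.

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        // A shared counter updated with a CPU atomic fetch-add; per the abstract,
        // SAE exposes primitives of this kind to Python shared variables via CFFI
        // (the Python-side API itself is not reproduced here).
        std::atomic<long> counter{0};

        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t)
            workers.emplace_back([&counter] {
                for (int i = 0; i < 100000; ++i)
                    counter.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& w : workers) w.join();

        // Without atomicity the final count would typically fall short of the
        // expected value because of lost updates.
        std::printf("count = %ld (expected %d)\n", counter.load(), 8 * 100000);
    }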
ISBN (print): 9798400708893
Work stealing is a well-known technique for dynamic load balancing; however, manually writing work-stealing protocols is error-prone. We can use the Tascell parallel programming language for the correct and portable implementation of work stealing; the implementation combines polling and adequate mutual exclusion. In Tascell, we can express on-demand concurrency for backtracking-based load balancing, where a worker performs a sequential computation with its own execution stack unless it is requested to spawn a task. To spawn a larger task by temporarily backtracking, nested functions can be used for legitimate execution-stack access. As nested functions for extended C languages, we can use GCC's heavyweight implementation with runtime code generation or lightweight implementations obtained by enhancing GCC; however, compiler-based implementations are poor in portability. In this study, we implement and evaluate more portable Tascell frameworks, called "Tascell/SC", by using transformation-based portable implementations of nested functions. In addition, we propose Tascell-inspired portable frameworks written in C++ only, called "Tascell++", that use lambda expressions in C++11 for legitimate execution-stack access.
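The role of lambda expressions in Tascell++ can be illustrated with a small, self-contained C++ sketch (this is not the actual Tascell or Tascell++ API): each recursion frame registers a lambda that captures the frame's loop variables by reference, so a steal request, detected by polling, can backtrack to an old frame and carve off a large chunk of its remaining work.

    #include <atomic>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Polled flag standing in for a Tascell task request from an idle worker.
    std::atomic<bool> steal_request{false};

    // One "spawner" per active frame, each capturing that frame's loop variables
    // by reference -- the lambda-based stand-in for the legitimate execution-stack
    // access that Tascell obtains from nested functions.
    std::vector<std::function<void()>> spawners;

    void handle_steal_if_requested() {
        if (steal_request.exchange(false) && !spawners.empty())
            spawners.front()();  // backtrack to the oldest frame: spawn a large task
    }

    long long search(long long lo, long long hi) {  // toy sequential workload
        long long sum = 0;
        long long i = lo;
        // Register a spawner that can hand away the untouched half of this frame.
        spawners.push_back([&i, &hi] {
            long long mid = i + (hi - i) / 2;
            std::printf("spawning a task for range [%lld, %lld)\n", mid, hi);
            hi = mid;  // this frame keeps only the first half
        });
        for (; i < hi; ++i) {
            handle_steal_if_requested();  // polling, as in Tascell
            sum += i;                     // useful sequential work
        }
        spawners.pop_back();
        return sum;
    }

    int main() {
        steal_request = true;  // simulate one incoming steal request
        std::printf("local sum = %lld\n", search(0, 1000000));
    }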
This special issue aims to present new developments and advances in techniques for assessing the performance portability of high-performance computing applications. It contains revised and extended versions of selected p...
Heterogeneous programming models are becoming increasingly popular to support ever-evolving hardware architectures, especially new and emerging specialized accelerators optimized for specific tasks. While such programming models provide performance portability of existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect application performance due to the overheads of data transfer, synchronization, and kernel launch. In applications with one or two short-running kernels the overhead can be negligible, but it becomes noticeable when short-running kernels dominate the overall number of kernels in an application, as is the case in graph-based neural network models, where several small memory-bound nodes sit alongside a few large compute-bound nodes. To reduce the overhead, combining several kernels into a single, more optimized kernel is an active area of research. However, this task can be time-consuming and error-prone given the huge set of potential combinations. This can push programmers to seek a trade-off between (a) task-specific kernels with low overhead that are hard to maintain and (b) smaller modular kernels with higher overhead that are easier to maintain. While DSL-based approaches, such as those provided for machine learning frameworks, offer the possibility of such fusion, they are limited to a particular domain, exploit specific knowledge of that domain, and are consequently hard to port elsewhere. This study explores the feasibility of user-driven kernel fusion through an extension to the SYCL API that addresses the automation of kernel fusion. The proposed solution requires programmers to define the subgraph regions that are potentially suitable for fusion, without any modification to the kernel code or the function signature. We evaluate the performance benefit of our approach on common neural networks and study the performance improvement in detail.
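The benefit that kernel fusion targets can be shown with a plain C++ stand-in for device kernels (this is not the proposed SYCL extension, whose API is not reproduced here): two small memory-bound passes versus a single fused pass that launches once and streams the data through memory once.

    #include <cstdio>
    #include <vector>

    // Two small, memory-bound "kernels" applied separately: each pass launches
    // work and streams the whole array through memory once.
    void scale(std::vector<float>& x, float a) {
        for (auto& v : x) v *= a;
    }
    void bias(std::vector<float>& x, float b) {
        for (auto& v : x) v += b;
    }

    // The fused equivalent: one launch, a single trip through memory. The SYCL
    // extension described in the abstract aims to obtain this form automatically
    // once the programmer marks the region as fusible, without hand-writing the
    // combined kernel.
    void scale_bias_fused(std::vector<float>& x, float a, float b) {
        for (auto& v : x) v = v * a + b;
    }

    int main() {
        std::vector<float> data(1 << 20, 1.0f);
        scale(data, 2.0f);                  // unfused path: two kernels, two passes
        bias(data, 1.0f);
        scale_bias_fused(data, 2.0f, 1.0f); // fused path: one kernel, one pass
        std::printf("data[0] = %f\n", data[0]);
    }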
ISBN (print): 9781450383912
Achieving parallel performance and scalability involves making compromises between parallel and sequential computation. If not contained, the overheads of parallelism can easily outweigh its benefits, sometimes by orders of magnitude. Today, we expect programmers to implement this compromise by optimizing their code manually. This process is labor intensive, requires deep expertise, and reduces code quality. Recent work on heartbeat scheduling shows a promising approach that manifests the potentially vast amounts of available, latent parallelism, at a regular rate, based on even beats in time. The idea is to amortize the overheads of parallelism over the useful work performed between the beats. Heartbeat scheduling is promising in theory, but the reality is complicated: it has no known practical implementation. In this paper, we propose a practical approach to heartbeat scheduling that involves equipping the assembly language with a small set of primitives. These primitives leverage existing kernel and hardware support for interrupts to allow parallelism to remain latent, until a heartbeat, when it can be manifested with low cost. Our Task Parallel Assembly Language (TPAL) is a compact, RISC-like assembly language. We specify TPAL through an abstract machine and implement the abstract machine as compiler transformations for C/C++ code and a specialized run-time system. We present an evaluation on both the Linux and the Nautilus kernels, considering a range of heartbeat interrupt mechanisms. The evaluation shows that TPAL can dramatically reduce the overheads of parallelism without compromising scalability.
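A deliberately simplified C++ sketch of the heartbeat idea follows (TPAL itself operates at the assembly level with kernel and hardware interrupt support; the timer thread, std::async tasks, and the 1024-iteration threshold below are stand-ins chosen for illustration): parallelism stays latent in the loop, and only on a heartbeat is a slice of the remaining work promoted into a real task, amortizing the promotion cost over the work done between beats.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <thread>
    #include <vector>

    // Heartbeat flag set at a fixed rate by a timer thread; this stands in for
    // the kernel/hardware interrupt mechanisms that TPAL builds on.
    std::atomic<bool> heartbeat{false};

    // Between beats the loop pays only one flag check per iteration; on a beat,
    // the second half of the remaining range is promoted into a real task.
    long long sum_range(long long lo, long long hi) {
        long long acc = 0;
        std::vector<std::future<long long>> promoted;
        for (long long i = lo; i < hi; ++i) {
            if (heartbeat.exchange(false) && hi - i > 1024) {  // assumed grain size
                long long mid = i + (hi - i) / 2;
                promoted.push_back(std::async(std::launch::async, sum_range, mid, hi));
                hi = mid;
            }
            acc += i;
        }
        for (auto& f : promoted) acc += f.get();
        return acc;
    }

    int main() {
        std::thread timer([] {
            for (int beat = 0; beat < 1000; ++beat) {
                std::this_thread::sleep_for(std::chrono::microseconds(100));
                heartbeat.store(true);
            }
        });
        long long total = sum_range(0, 50000000);
        timer.join();
        std::printf("total = %lld\n", total);
    }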
Micro-core architectures combine many low-memory, low-power computing cores together in a single package. These are attractive for use as accelerators, but due to limited on-chip memory and multiple levels of memory hierarchy, the way in which programmers offload kernels needs to be carefully considered. In this paper we use Python as a vehicle for exploring the semantics and abstractions of higher-level programming languages to support the offloading of computational kernels to these devices. By moving to a pass-by-reference model, along with leveraging memory kinds, we demonstrate the ability to easily and efficiently take advantage of multiple levels in the memory hierarchy, even ones that are not directly accessible to the micro-cores. Using a machine learning benchmark, we perform experiments on both Epiphany-III and MicroBlaze based micro-cores, demonstrating the ability to compute with data sets of arbitrarily large size. To provide context for our results, we explore the performance and power efficiency of these technologies, demonstrating that whilst the two micro-core technologies are competitive within their own embedded class of hardware, there is still a way to go to reach HPC-class GPUs.
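As a rough illustration of the chunked, pass-by-reference offloading style the abstract describes, the C++ sketch below (hypothetical names; the paper's interface is Python-based) streams an arbitrarily large host array through a small buffer standing in for core-local memory, with a MemKind tag indicating where staging would occur.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical memory kinds for a micro-core device; the names are
    // illustrative and not the paper's API.
    enum class MemKind { core_local, shared_dram, host };

    constexpr std::size_t kCoreLocalWords = 4096;  // assumed on-chip capacity

    void kernel(float* chunk, std::size_t n) {     // stand-in device kernel
        for (std::size_t i = 0; i < n; ++i) chunk[i] *= 2.0f;
    }

    // Stream an arbitrarily large host-resident array through a small buffer that
    // stands in for core-local memory: the kernel sees a reference to each chunk,
    // never a copy of the whole data set.
    void offload(std::vector<float>& data, MemKind staging) {
        (void)staging;  // a real runtime would choose the staging area by kind
        std::vector<float> scratch(kCoreLocalWords);
        for (std::size_t off = 0; off < data.size(); off += kCoreLocalWords) {
            std::size_t n = std::min(kCoreLocalWords, data.size() - off);
            std::copy(data.begin() + off, data.begin() + off + n, scratch.begin());
            kernel(scratch.data(), n);  // compute on the staged chunk
            std::copy(scratch.begin(), scratch.begin() + n, data.begin() + off);
        }
    }

    int main() {
        std::vector<float> data(1 << 20, 1.0f);  // far larger than core-local memory
        offload(data, MemKind::core_local);
        std::printf("data[0] = %f\n", data[0]);  // 2.0
    }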
ISBN (print): 9783030343569; 9783030343552
CAPI SNAP (Storage, Network, and Analytics Programming) is an open-source framework which enables C/C++ as well as FPGA programmers to quickly create FPGA-based accelerated computing that works on server host data, as well as data from storage, flash, Ethernet, or other connected resources. The SNAP framework is based on the IBM Coherent Accelerator Processor Interface (CAPI). From POWER8 with CAPI 1.0 to POWER9 with CAPI 2.0 and OpenCAPI, programmers have access to a very simple framework for developing accelerated applications using high-speed, very low-latency interfaces to access an external FPGA. With SNAP, no specific hardware skill is required to port or develop an application and then accelerate it. Moreover, a cloud environment is offered as a cost-effective, ready-to-use setting for both a first-time-right experience and deeper development, so that either can be achieved with very little investment.
ISBN (print): 9781728159874
This paper proposes priority- and weight-based steal strategies for an idle worker (thief) to select a victim worker in work-stealing frameworks. Typical work-stealing frameworks employ uniformly random victim selection. We implemented the proposed strategies on a work-stealing framework called Tascell; Tascell programmers can let each worker estimate and declare, as a real number, the amount of remaining work required to complete its current task, and the declared values are used as priorities or weights in the enhanced Tascell framework. To reduce the total task-division cost, the proposed strategies avoid stealing small tasks. With the priority-based strategy, a thief selects the victim that has the highest known priority at that point in time. With the weight-based, non-uniformly random strategy, a thief uses the relative weights of victim candidates as their selection probabilities. The proposed selection strategies outperformed uniformly random victim selection. Our evaluation uses a parallel implementation of the "highly serial" version of the Barnes-Hut force-calculation algorithm in a shared-memory environment and five benchmark programs in a distributed-memory environment.
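A minimal C++ sketch of the weight-based strategy follows (illustrative only, not Tascell's implementation): the declared remaining-work values become selection weights, workers below a threshold are excluded so that small tasks are not stolen, and the priority-based variant would simply pick the candidate with the largest declared value instead of sampling.

    #include <cstdio>
    #include <random>
    #include <vector>

    // Each worker declares an estimate of its remaining work (a real number),
    // as in the enhanced Tascell framework described above.
    struct Worker { double declared_remaining_work; };

    // Weight-based, non-uniformly random victim selection: a thief uses the
    // declared amounts as selection probabilities and skips workers whose
    // declared work is below a threshold, to avoid stealing small tasks.
    int pick_victim(const std::vector<Worker>& workers, int self,
                    double min_worth_stealing, std::mt19937& rng) {
        std::vector<double> weights(workers.size(), 0.0);
        for (std::size_t i = 0; i < workers.size(); ++i) {
            double w = workers[i].declared_remaining_work;
            if (static_cast<int>(i) != self && w >= min_worth_stealing)
                weights[i] = w;
            // A real implementation would fall back to uniform selection when
            // no candidate is worth stealing from.
        }
        std::discrete_distribution<int> choose(weights.begin(), weights.end());
        return choose(rng);  // index of the selected victim
    }

    int main() {
        std::vector<Worker> workers = {{0.1}, {5.0}, {2.5}, {40.0}};
        std::mt19937 rng(42);
        int thief = 0;
        for (int trial = 0; trial < 5; ++trial)
            std::printf("victim = %d\n", pick_victim(workers, thief, 1.0, rng));
    }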