Several application programming interfaces (APIs) and frameworks have been proposed to simplify the development of general-purpose GPU (GPGPU) applications. GPGPU application development typically involves specific cu...
详细信息
Several application programming interfaces (APIs) and frameworks have been proposed to simplify the development of general-purpose GPU (GPGPU) applications. GPGPU application development typically involves specific customization for the target operating systems and hardware devices. The effort to port applications from one API to the other (or to develop multitarget applications) is complicated by the availability of a plethora of specifications, which in essence offers very similar underlying functionality. In this work we provide an in-depth study of six state-of-the-art GPGPU APIs. From these we derive a taxonomy of the common semantics and propose a unified specification. We describe a methodology to translate this unified specification into different target APIs. This simplifies cross-platform application development and provides a clean framework for benchmarking. Our proposed unified specification is called GPGPU unified specification and translation (GUST) and it captures common functionality found in compute-only APIs (e.g., compute unified device architecture and open computing language), in the compute pipeline of traditional graphic-oriented APIs (e.g., open graphic language and Direct3D11) and in last-generation bare-metal APIs (e.g., Vulkan and Direct3D12). The proposed translation methodology solves differences between specific APIs in a transparent manner, without hiding available tuning knobs for compute kernel optimizations and fostering best programming practices in a simple manner.
Performing modeling and visualization of task-based parallel algorithms is challenging. Libraries such as Intel Threading Building Blocks (TBB) and Microsoft's parallel Patterns Library provide high-level algorith...
详细信息
ISBN:
(纸本)9781424460229
Performing modeling and visualization of task-based parallel algorithms is challenging. Libraries such as Intel Threading Building Blocks (TBB) and Microsoft's parallel Patterns Library provide high-level algorithms that are implemented using low-level tasks. Current tools present performance at this lower level. Developers like to tune and debug at the same level as the coding abstraction, so in this paper we propose tools and a two step methodology that target this level of abstraction. In the first step, the system level metrics of utilization and overhead are collected to determine if performance is acceptable. If a problem is suspected, the second step of our methodology projects these metrics on to the algorithms contained in the application. Using these projections many common performance issues can be quickly diagnosed. We demonstrate our methodology using a prototype implementation that is integrated with the Intel Threading Building Blocks library. We show the flexibility of the approach by analyzing three applications, including a client-server benchmark that uses a parallel_for nested within a parallel pipeline.
Fault tolerance is an important requirement for long-running parallel programs. This paper presents a different approach to fault-tolerance support in message-passing parallel programs based on their structural and be...
详细信息
ISBN:
(纸本)9781424452910
Fault tolerance is an important requirement for long-running parallel programs. This paper presents a different approach to fault-tolerance support in message-passing parallel programs based on their structural and behavioral characteristics, commonly known as patterns. A classification of these patterns and their applicable fault-tolerance strategies is aimed to facilitate an application developer to incorporate appropriate fault-tolerance strategies to an application. Fault-tolerance strategies for two of the patterns are discussed, and one specific strategy is elaborated and analyzed. The presented strategies have been incorporated into a fault-tolerance support framework called FT-PAS. One objective of the framework is to separate the fault tolerance related details from an application developer's main objectives (separation-of-concerns). The paper presents the additional key features of the framework, and concludes with a discussion on current and future research directions.
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especi...
详细信息
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especially for SMP clusters. The tool set comprises the semi-automatic instrumentation of user programs, the monitoring of the cache behavior, the visualization of the measured data, and optimization techniques for improving the user program for better cache usage. As current hardware performance counters do not give sufficient user relevant information, new hardware monitors are designed that provide more detailed information about the cache utilization related to the data structures and code blocks in the user program. The expense of the hardware and software realization will be assessed to minimize the risk of a real implementation of the investigated monitors. The usefulness of the hardware monitors is evaluated by a cache simulator. (c) 2004 Published by Elsevier B.V.
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especi...
详细信息
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especially for SMP clusters. The tool set comprises the semi-automatic instrumentation of user programs, the monitoring of the cache behavior, the visualization of the measured data, and optimization techniques for improving the user program for better cache usage. As current hardware performance counters do not give sufficient user relevant information, new hardware monitors are designed that provide more detailed information about the cache utilization related to the data structures and code blocks in the user program. The expense of the hardware and software realization will be assessed to minimize the risk of a real implementation of the investigated monitors. The usefulness of the hardware monitors is evaluated by a cache simulator. (c) 2004 Published by Elsevier B.V.
The MPI-2 standard defines functions for Remote Memory Access (RMA) by allowing one process to specify all communication parameters both for the sending and the receiving side, which is also referred to as one-sided c...
详细信息
ISBN:
(纸本)354039110X
The MPI-2 standard defines functions for Remote Memory Access (RMA) by allowing one process to specify all communication parameters both for the sending and the receiving side, which is also referred to as one-sided communication. Having experienced parallelprogramming as a complex and error-prone task, we have developed the MPI correctness checking tool MARMOT covering the MPI-1.2 standard and are now aiming at extending it to support application developers also for the more frequently used parts of MPI-2 such as one-sided communication. In this paper we describe our tool, which is designed to check the correct usage of the MPI API automatically at run-time, and we also analyse to what extent it is possible to do so for RMA.
Tool Gear is a software infrastructure for developing performance analysis and other tools. Unlike existing integrated toolkits, which focus on providing a suite of capabilities, Tool Gear is designed to help toot dev...
详细信息
ISBN:
(纸本)1892512459
Tool Gear is a software infrastructure for developing performance analysis and other tools. Unlike existing integrated toolkits, which focus on providing a suite of capabilities, Tool Gear is designed to help toot developers create new tools quickly. It combines dynamic instrumentation capabilities with an efficient database and a sophisticated and extensible graphical user interface. This paper describes the design of Tool Gear and presents examples of tools that have been built with it.
This paper presents and analyzes two different strategies of heterogeneous distribution of computations solving dense linear algebra problems on heterogeneous networks of computers. The first strategy is based on hete...
详细信息
This paper presents and analyzes two different strategies of heterogeneous distribution of computations solving dense linear algebra problems on heterogeneous networks of computers. The first strategy is based on heterogeneous distribution of processes over processors and homogeneous block cyclic distribution of data over the processes. The second is based on homogeneous distribution of processes over processors and heterogeneous block cyclic distribution of data over the processes. Both strategies were implemented in the mpC language-a dedicated parallel extension of ANSI C for efficient and portable programming of heterogeneous networks of computers. The first strategy was implemented using calls to ScaLAPACK;the second strategy was implemented with calls to LAPACK and BLAS. Cholesky factorization on a heterogeneous network of workstations is used to demonstrate that the heterogeneous distributions have an advantage over the traditional homogeneous distribution (C) 2001 Academic Press.
High-performance computing (HPC) application developers can no longer afford the luxury of programming to just one platform. The usefulness and longevity of software now depend on portability as well as performance. T...
详细信息
High-performance computing (HPC) application developers can no longer afford the luxury of programming to just one platform. The usefulness and longevity of software now depend on portability as well as performance. This paper examines the appropriateness of existing tools in developing and tuning applications that must be both efficient and portable. A series of design tradeoffs have been made, with the result that today's tools are slanted toward one goal or the other but do not respond adequately to both. These tradeoffs are explored through a series of examples drawn from current tools. Suggestions are presented for how tools might better support the evolving needs of HPC programmers. (C) 2001 Elsevier Science B.V. All rights reserved.
Relative debugging is a powerful paradigm that lets us locate errors in programs that result from porting or rewriting code. The authors describe their experience using relative debugging to compare a program written ...
详细信息
Relative debugging is a powerful paradigm that lets us locate errors in programs that result from porting or rewriting code. The authors describe their experience using relative debugging to compare a program written in a sequential language with one that was ported to the data-parallel language ZPL.
暂无评论