This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especi...
详细信息
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especially for SMP clusters. The tool set comprises the semi-automatic instrumentation of user programs, the monitoring of the cache behavior, the visualization of the measured data, and optimization techniques for improving the user program for better cache usage. As current hardware performance counters do not give sufficient user relevant information, new hardware monitors are designed that provide more detailed information about the cache utilization related to the data structures and code blocks in the user program. The expense of the hardware and software realization will be assessed to minimize the risk of a real implementation of the investigated monitors. The usefulness of the hardware monitors is evaluated by a cache simulator. (c) 2004 Published by Elsevier B.V.
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especi...
详细信息
This paper describes the ideas and developments of the project EP-CACHE. Within this project new methods and tools are developed to improve the analysis and the optimization of programs for cache architectures, especially for SMP clusters. The tool set comprises the semi-automatic instrumentation of user programs, the monitoring of the cache behavior, the visualization of the measured data, and optimization techniques for improving the user program for better cache usage. As current hardware performance counters do not give sufficient user relevant information, new hardware monitors are designed that provide more detailed information about the cache utilization related to the data structures and code blocks in the user program. The expense of the hardware and software realization will be assessed to minimize the risk of a real implementation of the investigated monitors. The usefulness of the hardware monitors is evaluated by a cache simulator. (c) 2004 Published by Elsevier B.V.
Relative debugging is a powerful paradigm that lets us locate errors in programs that result from porting or rewriting code. The authors describe their experience using relative debugging to compare a program written ...
详细信息
Relative debugging is a powerful paradigm that lets us locate errors in programs that result from porting or rewriting code. The authors describe their experience using relative debugging to compare a program written in a sequential language with one that was ported to the data-parallel language ZPL.
The goal of this paper to identify and discuss the basic issues of and solutions to parallel processing on clusters of workstations (COWs). Firstly, identification and expressing parallelism in application programs ar...
详细信息
The goal of this paper to identify and discuss the basic issues of and solutions to parallel processing on clusters of workstations (COWs). Firstly, identification and expressing parallelism in application programs are discussed. The following approaches to finding and expressing parallelism are characterized: parallelprogramming languages, parallel programming tools, sequential programming supported by distributed shared memory (DSM), and parallelising compilers. Secondly, efficient management of available parallelism is discussed. As parallel execution requires an efficient management of processes and computational resources, a parallel execution environment proposed here is to be built based on a distributed operating system. This system, in order to allow parallel programs to achieve high performance and transparency, should provide services such as global scheduling, process migration, local and remote process creation, computation coordination, group communication and distributed shared memory. (C) 1999 Elsevier Science B.V. All rights reserved.
In this paper, we present several tools for analyzing parallel programs. The tools are built on top of a compiler infrastructure, which provides advanced capabilities for symbolic program analysis and manipulation. Th...
详细信息
In this paper, we present several tools for analyzing parallel programs. The tools are built on top of a compiler infrastructure, which provides advanced capabilities for symbolic program analysis and manipulation. The tools can display characteristics of a program and relate this information to data gathered from instrumented program runs and other performance analysis tools. They also support an interactive compilation scenario, giving the user feedback on how the compilation process performed and how to improve it. We will present case studies demonstrating the tool use. These include the characterization of an industrial application and the study of new compiler techniques and portable parallel languages. (C) 1998 Elsevier Science B.V. All rights reserved.
This paper presents and analyzes two different strategies of heterogeneous distribution of computations solving dense linear algebra problems on heterogeneous networks of computers. The first strategy is based on hete...
详细信息
This paper presents and analyzes two different strategies of heterogeneous distribution of computations solving dense linear algebra problems on heterogeneous networks of computers. The first strategy is based on heterogeneous distribution of processes over processors and homogeneous block cyclic distribution of data over the processes. The second is based on homogeneous distribution of processes over processors and heterogeneous block cyclic distribution of data over the processes. Both strategies were implemented in the mpC language-a dedicated parallel extension of ANSI C for efficient and portable programming of heterogeneous networks of computers. The first strategy was implemented using calls to ScaLAPACK;the second strategy was implemented with calls to LAPACK and BLAS. Cholesky factorization on a heterogeneous network of workstations is used to demonstrate that the heterogeneous distributions have an advantage over the traditional homogeneous distribution (C) 2001 Academic Press.
High-performance computing (HPC) application developers can no longer afford the luxury of programming to just one platform. The usefulness and longevity of software now depend on portability as well as performance. T...
详细信息
High-performance computing (HPC) application developers can no longer afford the luxury of programming to just one platform. The usefulness and longevity of software now depend on portability as well as performance. This paper examines the appropriateness of existing tools in developing and tuning applications that must be both efficient and portable. A series of design tradeoffs have been made, with the result that today's tools are slanted toward one goal or the other but do not respond adequately to both. These tradeoffs are explored through a series of examples drawn from current tools. Suggestions are presented for how tools might better support the evolving needs of HPC programmers. (C) 2001 Elsevier Science B.V. All rights reserved.
Several application programming interfaces (APIs) and frameworks have been proposed to simplify the development of general-purpose GPU (GPGPU) applications. GPGPU application development typically involves specific cu...
详细信息
Several application programming interfaces (APIs) and frameworks have been proposed to simplify the development of general-purpose GPU (GPGPU) applications. GPGPU application development typically involves specific customization for the target operating systems and hardware devices. The effort to port applications from one API to the other (or to develop multitarget applications) is complicated by the availability of a plethora of specifications, which in essence offers very similar underlying functionality. In this work we provide an in-depth study of six state-of-the-art GPGPU APIs. From these we derive a taxonomy of the common semantics and propose a unified specification. We describe a methodology to translate this unified specification into different target APIs. This simplifies cross-platform application development and provides a clean framework for benchmarking. Our proposed unified specification is called GPGPU unified specification and translation (GUST) and it captures common functionality found in compute-only APIs (e.g., compute unified device architecture and open computing language), in the compute pipeline of traditional graphic-oriented APIs (e.g., open graphic language and Direct3D11) and in last-generation bare-metal APIs (e.g., Vulkan and Direct3D12). The proposed translation methodology solves differences between specific APIs in a transparent manner, without hiding available tuning knobs for compute kernel optimizations and fostering best programming practices in a simple manner.
Fault tolerance is an important requirement for long-running parallel programs. This paper presents a different approach to fault-tolerance support in message-passing parallel programs based on their structural and be...
详细信息
ISBN:
(纸本)9781424452910
Fault tolerance is an important requirement for long-running parallel programs. This paper presents a different approach to fault-tolerance support in message-passing parallel programs based on their structural and behavioral characteristics, commonly known as patterns. A classification of these patterns and their applicable fault-tolerance strategies is aimed to facilitate an application developer to incorporate appropriate fault-tolerance strategies to an application. Fault-tolerance strategies for two of the patterns are discussed, and one specific strategy is elaborated and analyzed. The presented strategies have been incorporated into a fault-tolerance support framework called FT-PAS. One objective of the framework is to separate the fault tolerance related details from an application developer's main objectives (separation-of-concerns). The paper presents the additional key features of the framework, and concludes with a discussion on current and future research directions.
Performing modeling and visualization of task-based parallel algorithms is challenging. Libraries such as Intel Threading Building Blocks (TBB) and Microsoft's parallel Patterns Library provide high-level algorith...
详细信息
ISBN:
(纸本)9781424460229
Performing modeling and visualization of task-based parallel algorithms is challenging. Libraries such as Intel Threading Building Blocks (TBB) and Microsoft's parallel Patterns Library provide high-level algorithms that are implemented using low-level tasks. Current tools present performance at this lower level. Developers like to tune and debug at the same level as the coding abstraction, so in this paper we propose tools and a two step methodology that target this level of abstraction. In the first step, the system level metrics of utilization and overhead are collected to determine if performance is acceptable. If a problem is suspected, the second step of our methodology projects these metrics on to the algorithms contained in the application. Using these projections many common performance issues can be quickly diagnosed. We demonstrate our methodology using a prototype implementation that is integrated with the Intel Threading Building Blocks library. We show the flexibility of the approach by analyzing three applications, including a client-server benchmark that uses a parallel_for nested within a parallel pipeline.
暂无评论