In this paper, new constructs for synchronization in parallel programming languages are presented for shared memory multiprocessors. The motivation behind the design of these new constructs is to relieve programmers f...
ISBN (print): 9781595939746
We present an Adaptive Mesh Refinement benchmark for evaluating the programmability and performance of modern parallel programming languages. The benchmarks employed today by language development teams, originally designed for performance evaluation of computer architectures, do not fully capture the complexity of state-of-the-art computational software systems running on today's parallel machines or on emerging ones, from multi-cores to peta-scale High Productivity Computing Systems. This benchmark, extracted from a real application framework, challenges a programming language in both expressiveness and performance. It consists of an infrastructure for finite difference calculations on block-structured adaptive meshes and a solver for elliptic Partial Differential Equations built on this infrastructure. Adaptive Mesh Refinement algorithms are challenging to implement because of the irregularity introduced by local mesh refinement. We describe the challenges posed by this benchmark through two reference implementations (C++/Fortran/MPI and Titanium) and in the context of three programming models.
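To make the benchmark's building blocks concrete, the following is a minimal sketch (not taken from the benchmark; names and layout are illustrative) of one Jacobi relaxation sweep for the 2D Poisson equation on a single uniform block. An AMR framework applies kernels of exactly this shape block by block across refinement levels.

```cpp
#include <vector>

// One Jacobi sweep for the 2D Poisson equation -laplace(u) = f on one
// nx-by-ny block with grid spacing h. Writes u_new from u_old; boundary
// cells are left untouched (ghost-cell exchange between blocks is omitted).
void jacobi_sweep(std::vector<double>& u_new,
                  const std::vector<double>& u_old,
                  const std::vector<double>& f,
                  int nx, int ny, double h) {
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i)
            u_new[j * nx + i] = 0.25 * (u_old[j * nx + (i - 1)] +
                                        u_old[j * nx + (i + 1)] +
                                        u_old[(j - 1) * nx + i] +
                                        u_old[(j + 1) * nx + i] +
                                        h * h * f[j * nx + i]);
}
```

The irregularity the abstract refers to enters not in this smooth stencil but in the bookkeeping around it: filling ghost cells from neighbouring blocks that may sit at a different refinement level.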
This special issue aims to present new developments and advances in techniques for assessing the performance portability of high performance computing applications. It contains revised and extended versions of selected p...
This paper discusses a novel approach to implementing OpenMP on clusters. Traditional approaches rely on Software Distributed Shared Memory systems to handle shared data. We discuss these and then introduce an alternative approach that translates OpenMP to Global Arrays (GA), explaining the basic strategy. GA requires a data distribution. We do not expect the user to supply this; rather, we show how we perform data distribution and work distribution according to the user-supplied OpenMP static loop schedules. An inspector-executor strategy is employed for irregular applications in order to gather information on accesses to potentially non-local data, group non-local data transfers, and overlap communications with local computations. Furthermore, a new directive, INVARIANT, is proposed to provide information about the dynamic scope of data access patterns. This directive can help us generate efficient code for irregular applications using the inspector-executor approach. We also illustrate how to deal with some hard cases involving reshaping and strided accesses during the translation. Our experiments show promising results for the corresponding regular and irregular GA codes. (c) 2005 Elsevier B.V. All rights reserved.
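The inspector-executor idea can be sketched in a few lines. In this hypothetical single-array illustration, each process owns the block b[lo, hi) of a distributed array; the inspector scans the index array once and records the non-local accesses, and the resulting schedule is reused on every iteration for as long as the access pattern stays invariant, which is the property the proposed INVARIANT directive asserts. All names here are illustrative; in the actual translation the fetch step is a grouped Global Arrays get.

```cpp
#include <vector>

// Inspector for the irregular access a[i] = b[idx[i]], where this process
// owns b[lo, hi). Run once per dynamic scope in which idx is invariant.
struct Schedule {
    std::vector<int> remote;  // indices of b held by other processes
};

Schedule inspect(const std::vector<int>& idx, int lo, int hi) {
    Schedule s;
    for (int i : idx)
        if (i < lo || i >= hi)    // non-local: must be fetched
            s.remote.push_back(i);
    return s;
}

// Executor (not shown): fetch all elements in s.remote in one grouped
// transfer, then run the loop body on purely local data.
```

Grouping the transfers this way is also what lets the runtime overlap communication with whatever local computation is already runnable, as the abstract describes.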
We survey parallel programming models and languages using six criteria to assess their suitability for realistic portable parallel programming. We argue that an ideal model should be easy to program, should have a software development methodology, should be architecture-independent, should be easy to understand, should guarantee performance, and should provide accurate information about the cost of programs. These criteria reflect our belief that developments in parallelism must be driven by a parallel software industry based on portability and efficiency. We consider programming models in six categories, depending on the level of abstraction they provide. Those that are very abstract conceal even the presence of parallelism at the software level. Such models make software easy to build and port, but efficient and predictable performance is usually hard to achieve. At the other end of the spectrum, low-level models make all of the messy issues of parallel programming explicit (how many threads, how to place them, how to express communication, and how to schedule communication), so that software is hard to build and not very portable, but is usually efficient. Most recent models are near the center of this spectrum, exploring the best tradeoffs between expressiveness and performance. A few models have achieved both abstractness and efficiency. Both kinds of models raise the possibility of parallelism as part of the mainstream of computing.
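The spectrum the survey describes can be seen in miniature in the sketch below (not from the paper): the same reduction written against a very abstract model, where parallelism is implicit, and against a low-level one, where thread count, placement, and combination are all explicit.

```cpp
#include <cstddef>
#include <execution>
#include <numeric>
#include <thread>
#include <vector>

// Abstract end of the spectrum: parallelism is implicit, the code is easy
// to write and port, but its cost is opaque.
double sum_high(const std::vector<double>& v) {
    return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}

// Low-level end: thread count, work placement, and the final combination
// are all explicit (assumes nthreads >= 1).
double sum_low(const std::vector<double>& v, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> pool;
    std::size_t chunk = v.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = (t + 1 == nthreads) ? v.size() : lo + chunk;
        pool.emplace_back([&partial, &v, lo, hi, t] {
            for (std::size_t i = lo; i < hi; ++i) partial[t] += v[i];
        });
    }
    for (auto& th : pool) th.join();
    double s = 0.0;
    for (double p : partial) s += p;
    return s;
}
```

The first version is portable and hard to misuse but gives no cost information; the second exposes every messy issue, which is precisely the trade-off the six criteria probe.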
Many parallel algorithms are naturally expressed at a fine level of granularity, often finer than a MIMD parallel system can exploit efficiently. Most builders of parallel systems have looked to either the programmer or a parallelizing compiler to increase the granularity of such algorithms. In this paper, we explore a third approach to the granularity problem by analyzing two strategies for combining parallel tasks dynamically at runtime. We reject the simpler load-based inlining method, where tasks are combined based on dynamic load level, in favor of the safer and more robust lazy task creation method, where tasks are created only retroactively as processing resources become available. These strategies grew out of work on Mul-T [17], an efficient parallel implementation of Scheme, but could be used with other languages as well. We describe our Mul-T implementations of lazy task creation for two contrasting machines, and present performance statistics which show the method's effectiveness. Lazy task creation allows efficient execution of naturally expressed algorithms of a substantially finer grain than possible with previous parallel Lisp systems. Earlier versions of this paper appeared as [20] and [21].
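To make the contrast concrete, here is a minimal sketch (not from the paper; the load heuristic is illustrative) of load-based inlining, the simpler strategy the authors reject: the spawn-or-inline decision is made irrevocably at call time from the current load. Lazy task creation instead always starts the child inline but saves enough state that an idle processor can retroactively steal the continuation, so a load estimate that later proves wrong costs nothing.

```cpp
#include <atomic>
#include <future>
#include <thread>

// Crude load estimate: how many workers look free right now.
static std::atomic<int> spare_workers{
    static_cast<int>(std::thread::hardware_concurrency()) - 1};

long pfib(int n) {
    if (n < 2) return n;
    if (spare_workers.fetch_sub(1) > 0) {          // a worker looked free
        std::future<long> f = std::async(std::launch::async, pfib, n - 1);
        long b = pfib(n - 2);
        long a = f.get();
        spare_workers.fetch_add(1);
        return a + b;
    }
    spare_workers.fetch_add(1);
    // Irrevocably inlined: even if every other processor goes idle a
    // moment later, this call can no longer be parallelized -- the
    // weakness lazy task creation is designed to remove.
    return pfib(n - 1) + pfib(n - 2);
}
```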
This paper proposes extensions of sequential programming languages for parallel programming that have the following features: 1) Dynamic Structures: the process structure is dynamic; processes and variables can be created and deleted. 2) Paradigm Integration: the programming notation supports shared memory and message passing models. 3) Determinism: demonstrating that a program is deterministic (all executions with the same input produce the same output) is straightforward. Programs can be written so that compilers can verify that the programs are deterministic. Nondeterministic constructs can be introduced in a sequence of refinement steps to obtain greater efficiency if required. The ideas have been incorporated in an extension of Fortran, but the underlying sequential imperative language is not central to the ideas described here. A compiler for the Fortran extension, called Fortran M, is available by anonymous ftp from Argonne National Laboratory. Fortran M has been used for a variety of parallel applications.
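A sketch of the kind of construct that makes such determinism checkable: a typed channel with exactly one sender and one receiver, so the sequence of values received cannot depend on scheduling. This C++ rendering is purely illustrative; Fortran M's actual PROCESS/CHANNEL notation is richer.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Single-producer, single-consumer typed channel. With one writer per
// channel, the reader always observes the writer's program-order sequence,
// which is what makes the composed program deterministic.
template <typename T>
class Channel {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void send(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T receive() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};
```

Restricting each channel to one sender and one receiver is exactly the kind of structural property a compiler can verify, matching the paper's claim that determinism can be checked before any nondeterministic construct is introduced.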
In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. The polyphase filter is a standard tool in digital signal processing and as such a well-established algorithm. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards (Fermi, Kepler, Maxwell) and on the Intel Xeon CPU and Xeon Phi (Knights Corner) platforms. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of the L1/texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems; we also present results in terms of bandwidth (GB/s), compute (GFLOP/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.5x to 1.92x greater than our CPU implementation, but is still not sufficient to compete with the performance of the GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data. (C) 2016 Elsevier B.V. All rights reserved.
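For readers unfamiliar with the algorithm, the FIR front end of a polyphase filter bank reduces to a small kernel. The sketch below (illustrative names; the FFT stage that completes the filter bank is omitted) computes one output spectrum and makes the data-reuse opportunity visible: successive spectra advance the input window by only `channels` samples, so most of each window is reused, which is the reuse the GPU implementations capture through either the L1/texture cache or shared memory.

```cpp
// FIR stage of a polyphase filter bank: one output spectrum of `channels`
// samples from a window of `taps * channels` input samples. Each output
// channel c is a dot product of every channels-th input sample with the
// matching filter coefficients.
void ppf_fir(const float* in, const float* coeff, float* out,
             int channels, int taps) {
    for (int c = 0; c < channels; ++c) {
        float acc = 0.0f;
        for (int t = 0; t < taps; ++t)
            acc += in[t * channels + c] * coeff[t * channels + c];
        out[c] = acc;
    }
}
```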
The goal of this paper is to identify and discuss the basic issues of, and solutions to, parallel processing on clusters of workstations (COWs). Firstly, the identification and expression of parallelism in application programs are discussed. The following approaches to finding and expressing parallelism are characterized: parallel programming languages, parallel programming tools, sequential programming supported by distributed shared memory (DSM), and parallelising compilers. Secondly, efficient management of available parallelism is discussed. As parallel execution requires efficient management of processes and computational resources, the parallel execution environment proposed here is to be built on a distributed operating system. This system, in order to allow parallel programs to achieve high performance and transparency, should provide services such as global scheduling, process migration, local and remote process creation, computation coordination, group communication and distributed shared memory. (C) 1999 Elsevier Science B.V. All rights reserved.
We investigate the well-known Parallel Random Access Machine (PRAM) model of parallel computation as a practical parallel programming model. The two components of this project are a general-purpose PRAM programming language, called Fork95, and a library, called PAD, of fundamental, efficiently implemented parallel algorithms and data structures. We outline the main features of Fork95 as they apply to the implementation of PAD, and describe the implementation of library procedures for prefix-sums and sorting. The Fork95 compiler generates code for the SB-PRAM, a hardware emulation of the PRAM, which is currently being completed at the University of Saarbrücken. Both the language and the library can immediately be used with this machine. The project is, however, of independent interest. The programming environment can help the algorithm designer to evaluate the practicality of new parallel algorithms, and can furthermore be used as a tool for teaching and communication of parallel algorithms. (C) 1999 Elsevier Science B.V. All rights reserved.
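As an illustration of the style of primitive PAD provides, here is a minimal sketch (not from the library) of an inclusive prefix sum in the Hillis-Steele form: each of the ceil(log2 n) doubling steps is fully data-parallel, so on a PRAM with n processors the whole scan takes O(log n) time.

```cpp
#include <cstddef>
#include <vector>

// Inclusive prefix sum (Hillis-Steele). Each doubling step reads only the
// previous array, so all n element updates within a step are independent
// and could execute in lockstep on a PRAM.
std::vector<int> prefix_sums(std::vector<int> a) {
    std::vector<int> b(a.size());
    for (std::size_t d = 1; d < a.size(); d *= 2) {
        for (std::size_t i = 0; i < a.size(); ++i)  // parallel on a PRAM
            b[i] = (i >= d) ? a[i] + a[i - d] : a[i];
        a.swap(b);
    }
    return a;
}
```

This form does O(n log n) work; a work-optimal O(n) scan uses the standard up-sweep/down-sweep tree at the cost of a less direct lockstep formulation.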