Heterogeneous platforms composed of multiple different types of computing devices (such as CPUs, GPUs, and Intel MICs) have been widely used recently. However, most of parallel applications developed in such a heterog...
详细信息
Heterogeneous platforms composed of multiple different types of computing devices (such as CPUs, GPUs, and Intel MICs) have been widely used recently. However, most of parallel applications developed in such a heterogeneous platform usually only utilize a certain kind of computing device due to the lack of easy-to-use heterogeneous cooperative parallel programming models. To reduce the difficulty of heterogeneous cooperative parallel programming, a directive-based heterogeneous cooperative parallel programming framework called HeteroPP is proposed. HeteroPP provides an easier way for programmers to fully exploit multiple different types of computing devices to concurrently and cooperatively perform data-parallel applications on heterogeneous platforms. An extension to OpenMP directives and clauses is proposed to make it possible for programmers to easily offload a data-parallel compute kernel to multiple different types of computing devices. A source-to-source compiler is designed to help programmers to automatically generate multiple device-specific compute kernels that can be concurrently and cooperatively performed on heterogeneous platforms. Many experiments are conducted with 12 typical data-parallel applications implemented with HeteroPP on a heterogeneous CPU-GPU-MIC platform. The results show that HeteroPP not only greatly simplifies the heterogeneous cooperative parallel programming, but also can fully utilize the CPUs, GPU, and MIC to efficiently perform these applications.
Haskell is a modern, functional programming language with an interesting story to tell about parallelism: rather than using concurrent threads and locks, Haskell offers a variety of libraries that enable concise, high...
详细信息
Haskell is a modern, functional programming language with an interesting story to tell about parallelism: rather than using concurrent threads and locks, Haskell offers a variety of libraries that enable concise, high-level parallel programs with results that are guaranteed to be deterministic (independent of the number of cores and the scheduling being used).
New computational techniques for simulating a large array of wind turbines are highly needed to model modern electrical grid networks. In this paper, an implementation of a doubly fed induction generator wind turbine ...
详细信息
New computational techniques for simulating a large array of wind turbines are highly needed to model modern electrical grid networks. In this paper, an implementation of a doubly fed induction generator wind turbine model solver is proposed. This solver will run on an NVIDIA graphic processing unit, and it will be coded using the compute unified device architecture (CUDA). The implementation will integrate a linear time-invariant system represented by state-space matrices. It has been implemented a CUDA kernel capable of simulating many wind turbines in parallel with different wind profiles and using different configurations. Strategies such as optimizing memory access and overlapping data transfers with the kernel were used to obtain the results. The CUDA implementation reaches an occupancy of 95%, while simulating 500 wind turbines where each unit is subject to a different wind profile or using different configuration parameters.
Whilst there have been great advances in HPC hardware and software in recent years, the languages and models that we use to program these machines have remained much more static. This is not from a lack of effort, but...
详细信息
Whilst there have been great advances in HPC hardware and software in recent years, the languages and models that we use to program these machines have remained much more static. This is not from a lack of effort, but instead by virtue of the fact that the foundation that many programming languages are built on is not sufficient for the level of expressivity required for parallel work. The result is an implicit trade-off between programmability and performance which is made worse due to the fact that, whilst many scientific users are experts within their own fields, they are not HPC experts. Type oriented programming looks to address this by encoding the complexity of a language via the type system. Most of the language functionality is contained within a loosely coupled type library that can be flexibly used to control many aspects such as parallelism. Due to the high level nature of this approach there is much information available during compilation which can be used for optimisation and, in the absence of type information, the compiler can apply sensible default options thus supporting both the expert programmer and novice alike. We demonstrate that, at no performance or scalability penalty when running on up to 8196 cores of a Cray XE6 system, codes written in this type oriented manner provide improved programmability. The programmer is able to write simple, implicit parallel, HPC code at a high level and then explicitly tune by adding additional type information if required. (C) 2017 Elsevier Ltd. All rights reserved.
The efficient utilization of the available resources in modern parallel computing systems requires advanced parallel programming expertise. However, parallel programming is more difficult than sequential programming. ...
详细信息
The efficient utilization of the available resources in modern parallel computing systems requires advanced parallel programming expertise. However, parallel programming is more difficult than sequential programming. To alleviate the difficulties of parallel programming, high-level programming frameworks, such as OpenMP, have been proposed. Yet, there is evidence that novice parallel programmers make common mistakes that may lead to performance degradation or unexpected program behavior. In this paper, we present our cognitive parallel programming assistant (PAPA) that aims at educating and assisting novice parallel programmers to avoid common OpenMP mistakes. PAPA combines different IBM Watson services to provide a dialog-based interaction (through text and voice) for programmers. We use the Watson Conversation service to implement the dialog-based interaction, and the Speech-to-Text and Text-to-Speech services to enable the voice interaction. The Watson Natural Language Understanding and WordsAPl Synonyms services are used to train PAPA with OpenMP-related publications. We evaluate our approach using a user experience questionnaire with a number of novice parallel programmers at Linnaeus University. (C) 2018 Elsevier B.V. All rights reserved.
A system called Agora was designed and implemented that supports the development of multilanguage parallel applications for heterogeneous machines. Agora hinges on two ideas: the first one is that shared memory can be...
详细信息
A system called Agora was designed and implemented that supports the development of multilanguage parallel applications for heterogeneous machines. Agora hinges on two ideas: the first one is that shared memory can be a suitable abstraction to program concurrent, multilanguage modules running on heterogeneous machines. The second idea is that a shared memory abstraction can be efficiently supported across different computer architectures that are not connected by a physical shared memory, e.g., local area network workstations or ensemble machines. Agora has been in use for more than a year. The authors describe the Agora shared memory and its software implementation on both tightly and loosely coupled architectures. Measurements of the current implementation are also included.
Actor model-based design is actively researched for parallel embedded SW design since the model exposes the potential parallelism explicitly in an architecture-neutral form. In most actor-oriented models, actors are s...
详细信息
Actor model-based design is actively researched for parallel embedded SW design since the model exposes the potential parallelism explicitly in an architecture-neutral form. In most actor-oriented models, actors are self-contained and data channels are the only sharable object between actors, and they compose a system in a flat layer. In contrast, it is common to use shared library functions and construct vertically layered software for efficiency and modularity. To fill this gap between modeling and implementation, we propose a special actor, library task, with new types of ports: library master port and library slave port. It is a sharable and mappable object that defines a set of function interfaces inside. N:1 master-slave connection allows sharing a library task and the master-slave connection can specify vertically layered software and client-server applications naturally. To support the library task in our embedded software design environment, we develop an automatic mapping algorithm as well as an automatic code generator. The design environment with the library task is applied for two target platforms: IBM CELL Broadband Engine and an ARM-based multicore simulator. Preliminary experiments show that the special actor, or library task, extends the expression power of the previous actor model with efficiently generated codes.
Reports on the initial experimental evaluation of ROPE (reusability-oriented parallel programming environment), a software component reuse system. ROPE helps the designer find and understand components by using a new ...
详细信息
Reports on the initial experimental evaluation of ROPE (reusability-oriented parallel programming environment), a software component reuse system. ROPE helps the designer find and understand components by using a new classification method called structured relational classification. ROPE is part of a development environment for parallel programs which uses a declarative/hierarchical graphical programming interface. This interface allows use of components with different levels of abstraction, ranging from design units to actual code modules. ROPE supports reuse of all the component types defined in the development environment. Programs developed with the aid of ROPE were found to have error rates far less than those developed without ROPE.
This paper presents the implementation of multitasking functions of DYNIX Sequent computers on the UNIX operating system. The Sequent computers are shared memory multiprocessor computers running the DYNIX operating sy...
详细信息
This paper presents the implementation of multitasking functions of DYNIX Sequent computers on the UNIX operating system. The Sequent computers are shared memory multiprocessor computers running the DYNIX operating system. These functions support data and function partitioning. They let the user implement subprograms by the processors of a Sequent computer in parallel. The functions can synchronize, lock, and unlock data and program segments. As a result, the simulator allows the users to develop their multitasking programs on a uniprocessor computer such as a SUN workstation, and later port them to a Sequent computer. Further, the simulator adds a level of abstraction on top of UNIX for concurrent programming. The functions of the simulator allow the user to handle the communication and synchronization of the processes in a program at a higher level of abstraction, while concentrating on the design of multitasking algorithms. The simulator is applied to a parallel selection algorithm. (C) 1998 John Wiley & Sons, Ltd.
We discuss a programming methodology based on the use of multilinear algebra to design and implement parallel algorithms for linear computations. In particular, we review techniques for implementing expressions involv...
详细信息
We discuss a programming methodology based on the use of multilinear algebra to design and implement parallel algorithms for linear computations. In particular, we review techniques for implementing expressions involving the tensor product. We then show how the tensor product can be used to formulate Strassen's matrix multiplication algorithm. We report on our experience using this formulation and these techniques to implement a parallel version of Strassen's matrix multiplication algorithm on the Encore Multimax.
暂无评论