This article evaluates the scalability and productivity of six parallel programming models for heterogeneous architectures, and finds that task-based models using code and data annotations require the least programming effort while sustaining nearly the best performance. However, achieving this result requires both extending the programming models to control locality and granularity and ensuring proper interoperability with platform-specific optimizations.
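As a concrete illustration of the annotation style such task-based models rely on, here is a minimal sketch of our own (not code from the article), using OpenMP task dependences as a stand-in: the `depend` annotations let the runtime build a task graph and schedule each task as soon as its inputs are ready.

```cpp
#include <cstdio>

int main() {
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        // The runtime derives the task graph from the data annotations.
        #pragma omp task depend(out: a)
        a = 1;                       // producer task

        #pragma omp task depend(out: b)
        b = 2;                       // independent task, may run concurrently

        #pragma omp task depend(in: a, b)
        std::printf("%d\n", a + b);  // consumer runs once both producers finish
    }
    return 0;
}
```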
Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review the compiler techniques used to improve programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map for a wide variety of research areas. We conclude with a discussion of open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
ISBN (print): 9780769549712
Parallel machines are becoming more complex, with increasing core counts and more heterogeneous architectures. However, the commonly used parallel programming models, C/C++ with MPI and/or OpenMP, make it difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as automatic load balancing, overlap of computation and communication, message-driven execution, and implicit data layout optimizations. In this paper, we compare several implementations of LULESH, a proxy application for shock hydrodynamics, to determine strengths and weaknesses of different programming models for parallel computation. We focus on four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models. In evaluating these models, we focus on programmer productivity, performance, and ease of applying optimizations.
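For reference, the MPI+OpenMP hybrid pattern the traditional models above embody looks roughly like this (a toy sketch of ours, not LULESH code): MPI ranks split the domain, OpenMP threads split each rank's chunk, and a reduction combines the per-rank results.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_global = 1 << 20;
    const int n_local  = n_global / size;   // assumes size divides n_global
    std::vector<double> u(n_local, 1.0);

    #pragma omp parallel for                // thread-level parallelism
    for (int i = 0; i < n_local; ++i)
        u[i] *= 2.0;                        // local compute phase

    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; ++i) local += u[i];
    // process-level reduction across ranks
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```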
This study embarks on a comprehensive examination of optimization techniques within GPU-based parallel programming models, pivotal for advancing high-performance computing (HPC). Emphasizing the transition of GPUs from graphics-centric processors to versatile computing units, it delves into the nuanced optimization of memory access, thread management, algorithmic design, and data transfer. These optimizations are critical for exploiting the parallel processing capabilities of GPUs, addressing both the theoretical frameworks and practical implementations. By integrating advanced strategies such as memory coalescing, dynamic scheduling, and parallel algorithmic transformations, this research aims to significantly elevate computational efficiency and throughput. The findings underscore the potential of optimized GPU programming to revolutionize computational tasks across various domains, highlighting a pathway towards achieving unparalleled processing power and efficiency in HPC environments. This paper not only contributes to the academic discourse on GPU optimization but also provides actionable insights for developers, fostering advancements in computational sciences and technology.
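Memory coalescing, one of the optimizations examined, can be sketched as follows (our own illustration in SYCL-flavored C++, not code from the study; identifiers are hypothetical). Consecutive work-items should touch consecutive addresses so the hardware can merge their loads into few memory transactions.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);
    sycl::queue q;
    {
        sycl::buffer<float> buf(host.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor a{buf, h};   // read-write access to device data
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                // Coalesced: work-item i reads a[i]; neighboring work-items
                // read adjacent floats, so their loads merge into one transaction.
                a[i] = a[i] * 2.0f;
                // A strided pattern such as a[(i * 32) % n] would scatter the
                // same loads across many transactions and waste bandwidth.
            });
        });
    } // buffer destructor copies results back to 'host'
    return 0;
}
```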
In this paper, we introduce a middleware infrastructure that provides software services for developing and deploying high-performance parallel programming models and distributed applications on clusters and networked heterogeneous systems. This middleware infrastructure utilizes distributed agents residing on the participating machines and communicating with one another to perform the required functions. An intensive study of the parallel programming models in Java has helped identify the common requirements for a runtime support environment, which we used to define the middleware functionality. A Java-based prototype, based on this architecture, has been developed along with a Java Object-Passing Interface (JOPI) class library. Since this system is written completely in Java, it is portable and allows executing programs in parallel across multiple heterogeneous platforms. With the middleware infrastructure, users need not deal with the mechanisms of deploying and loading user classes on the heterogeneous system. Moreover, details of scheduling, controlling, monitoring, and executing user jobs are hidden, while the management of system resources is made transparent to the user. Such uniform services are essential for facilitating the development and deployment of scalable high-performance Java applications on clusters and heterogeneous systems. An initial deployment of a parallel Java programming model over a heterogeneous, distributed system shows good performance results. In addition, a framework for the agents' startup mechanism and organization is introduced to provide scalable deployment and communication among the agents.
SYCL is a modern royalty-free heterogeneous programming specification maintained by the Khronos Group. Recently, it has become increasingly prevalent and mature, leading to various assessments of its performance, portability, and programmability. While previous evaluations have mainly focused on x86 CPUs, NVIDIA GPUs, and AMD GPUs, how well SYCL performs on ARM multi-core CPUs is still unknown. In this paper, we evaluate three SYCL implementations (i.e., DPCPP, AdaptiveCPP, and MLIR-SYCL) on ARM multi-core CPUs to uncover performance traps and offer optimization techniques. We use the SYCL-Bench benchmark suite to assess the performance of DPCPP, AdaptiveCPP, and MLIR-SYCL against their OpenMP counterparts. We also assess the compiler and runtime overhead to evaluate the usability and productivity of the SYCL implementations. Our empirical results demonstrate that these SYCL implementations can achieve satisfactory performance on ARM multi-core processors. Additionally, we highlight several key optimizations, such as NUMA management, which must be carefully addressed to enhance performance.
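A minimal sketch of targeting a CPU device from SYCL 2020, the setting the paper evaluates (our own toy example, not SYCL-Bench code; the selector and USM calls are standard SYCL 2020, everything else is illustrative):

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Bind the queue to a CPU backend (whichever implementation is installed).
    sycl::queue q{sycl::cpu_selector_v};
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    constexpr size_t n = 1 << 22;
    float* data = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    // On a NUMA machine, first-touch placement and thread pinning (e.g. via
    // numactl or the backend's affinity settings) decide which node holds
    // 'data'; the paper flags NUMA management as a key optimization.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i] *= 2.0f;
    }).wait();

    sycl::free(data, q);
    return 0;
}
```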
Despite rapid growth in workstation and networking technologies, the workstation environment continues to pose challenging problems for shared processing. In this paper, we present a computational model and system for the generation of distributed applications in such an environment. The well-known RPC model is modified by a novel concept known as template attachment. A computation consists of a network of sequential procedures which have been encapsulated in templates. A small selection of templates is available from which a distributed application with the desired communication behavior can be rapidly built. The system generates all the required low-level code for correct synchronization, communication, and scheduling. This results in a system that is easy to use and flexible, and can provide a programmer with the desired amount of control in using idle processing power over a network of workstations. The practical feasibility of the model has been demonstrated by implementing it for Unix-based workstation environments.
Dataflow programming consists of developing a program by describing its sequential stages and the interactions between them. The runtime systems supporting this kind of programming are responsible for exploiting the parallelism by concurrently executing the different stages as soon as their dependencies are met. In this paper we introduce a new parallel programming model and framework based on the dataflow paradigm. It presents a new combination of features that allows programs to be easily mapped to shared or distributed memory, exploiting data locality and affinity to obtain the same performance as optimized coarse-grain MPI programs. These features include: it is a unified one-tier model that supports hybrid shared- and distributed-memory systems with the same abstractions; it can express activities arbitrarily linked, including non-nested cycles; it internally uses a distributed work-stealing mechanism to allow Multiple-Producer/Multiple-Consumer configurations; and it has a runtime mechanism for the reconfiguration of the dependences and communication channels which also allows the creation of task-to-task data affinities. We present an evaluation using examples of different classes of applications. Experimental results show that programs generated using this framework deliver good performance in hybrid distributed- and shared-memory environments, with a development effort similar to that of other dataflow programming models oriented to shared memory.
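The dependency-driven execution rule described above can be sketched with plain C++ futures (our own illustration, not the paper's framework or its API): each stage is an ordinary function, and a stage may start as soon as the values it consumes are ready.

```cpp
#include <future>
#include <vector>
#include <numeric>
#include <iostream>

int main() {
    // Stage 1: produce data (no dependencies, starts immediately).
    std::future<std::vector<int>> produced = std::async(std::launch::async, [] {
        std::vector<int> v(1000);
        std::iota(v.begin(), v.end(), 0);
        return v;
    });

    // Stage 2: transform; it blocks only on the output of stage 1, which is
    // exactly the "run when dependencies are met" rule the paper describes.
    std::future<long long> summed = std::async(std::launch::async,
        [p = std::move(produced)]() mutable {
            std::vector<int> v = p.get();
            return std::accumulate(v.begin(), v.end(), 0LL);
        });

    std::cout << summed.get() << "\n";  // prints 499500
    return 0;
}
```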
MATLAB(R) has emerged as one of the languages most commonly used by scientists and engineers for technical computing, with approximately one million users worldwide. The primary benefits of MATLAB are reduced code development time via high levels of abstraction (e.g., first-class multi-dimensional arrays and thousands of built-in functions), interpretive, interactive programming, and powerful mathematical graphics. The compute-intensive nature of technical computing means that many MATLAB users have codes that can significantly benefit from the increased performance offered by parallel computing. pMatlab provides this capability by implementing parallel global array semantics using standard operator overloading techniques. The core data structure in pMatlab is a distributed numerical array whose distribution onto multiple processors is specified with a "map" construct. Communication operations between distributed arrays are abstracted away from the user, and pMatlab transparently supports redistribution between any block-cyclic-overlapped distributions of up to four dimensions. pMatlab is built on top of the MatlabMPI communication library and runs on any combination of heterogeneous systems that support MATLAB, which includes Windows, Linux, MacOS X, and SunOS. This paper describes the overall design and architecture of the pMatlab implementation. Performance is validated by implementing the HPC Challenge benchmark suite and comparing pMatlab performance with that of the equivalent C+MPI codes. These results indicate that pMatlab can often achieve comparable performance to C+MPI, usually at one-tenth the code size. Finally, we present implementation data collected from a sample of real pMatlab applications drawn from the approximately one hundred users at MIT Lincoln Laboratory. These data indicate that users are typically able to go from a serial code to an efficient pMatlab code in about 3 hours while changing less than 1% of their code.
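The operator-overloading idea behind pMatlab's global array semantics can be sketched in C++ terms (our own toy, with hypothetical types, not pMatlab's actual implementation): a "map" fixes how an array is split across ranks, and overloaded operators keep user code looking serial while each rank works only on its local block.

```cpp
#include <vector>
#include <cstddef>

struct Map {                     // block distribution over 'nprocs' ranks
    int nprocs, rank;
    std::size_t local_len(std::size_t n) const { return n / nprocs; }
};

struct DistArray {
    Map map;
    std::vector<double> local;   // this rank's block only
    DistArray(std::size_t n, Map m) : map(m), local(m.local_len(n), 0.0) {}
};

// a + b: elementwise add on the local block; with identical maps no
// communication is needed, mirroring pMatlab's transparent operators.
DistArray operator+(const DistArray& a, const DistArray& b) {
    DistArray r(a.local.size() * a.map.nprocs, a.map);
    for (std::size_t i = 0; i < r.local.size(); ++i)
        r.local[i] = a.local[i] + b.local[i];
    return r;
}

int main() {
    Map m{4, 0};                 // toy: 4 ranks, this process is rank 0
    DistArray x(1000, m), y(1000, m);
    DistArray z = x + y;         // looks serial; touches only the local block
    return 0;
}
```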
We survey parallel programming models and languages using six criteria to assess their suitability for realistic portable parallel programming. We argue that an ideal model should be easy to program, should have a software development methodology, should be architecture-independent, should be easy to understand, should guarantee performance, and should provide accurate information about the cost of programs. These criteria reflect our belief that developments in parallelism must be driven by a parallel software industry based on portability and efficiency. We consider programming models in six categories, depending on the level of abstraction they provide. Those that are very abstract conceal even the presence of parallelism at the software level. Such models make software easy to build and port, but efficient and predictable performance is usually hard to achieve. At the other end of the spectrum, low-level models make all of the messy issues of parallel programming explicit (how many threads, how to place them, how to express communication, and how to schedule communication), so that software is hard to build and not very portable, but is usually efficient. Most recent models are near the center of this spectrum, exploring the best tradeoffs between expressiveness and performance. A few models have achieved both abstractness and efficiency. Both kinds of models raise the possibility of parallelism as part of the mainstream of computing.