ISBN: (Print) 9781509001859
The proceedings contain 14 papers. The topics discussed include: on the fence: an offload approach to ordering one-sided communication; caching puts and gets in a PGAS language runtime; impact of frequency scaling on one-sided remote memory accesses; implementing high-performance geometric multigrid solver with naturally grained messages; an evaluation of anticipated extensions for Fortran coarrays; preliminary implementation of Coarray Fortran translator based on Omni XcalableMP; using the Parallel Research Kernels to study PGAS models; PHLAME: hierarchical locality exploitation using the PGAS model; a compiler transformation to overlap communication with dependent computation; toward a data-centric profiler for PGAS applications; scaling HabaneroUPC++ on heterogeneous supercomputers; PySHMEM: a high-productivity OpenSHMEM interface for Python; and ISx: a scalable integer sort for co-design in the exascale era.
A subset of the Parallel Research Kernels (PRK), simplified parallel application patterns, is used to study the behavior of different runtimes implementing the PGAS programming model. The goal of this paper is to show that such an approach is practical and effective as we approach the exascale era. Our experimental results indicate that, for the kernels we selected, MPI with two-sided communication outperforms the PGAS runtimes SHMEM, UPC, Grappa, and MPI-3 with RMA extensions.
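The PRK are deliberately small application patterns rather than full applications. As a flavor of what the runtimes are compared on, here is a serial Python sketch of a radius-1 star stencil, one of the PRK patterns; the grid size and weight are illustrative, not taken from the paper:

```python
# Serial sketch of a PRK-style star stencil (radius 1).
# Grid size and weight below are illustrative choices.

def stencil_step(grid):
    """Apply a radius-1 star stencil to the interior of a square grid."""
    n = len(grid)
    out = [row[:] for row in grid]
    w = 0.25  # uniform weight for the four neighbors (illustrative)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = w * (grid[i - 1][j] + grid[i + 1][j] +
                             grid[i][j - 1] + grid[i][j + 1])
    return out

grid = [[float(i + j) for j in range(8)] for i in range(8)]
result = stencil_step(grid)
```

In a distributed version, each rank owns a tile of the grid and must obtain neighbor "halo" values, either via two-sided messages or via one-sided gets; that communication step is what the paper's MPI-versus-PGAS comparison exercises.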
We investigated a software cache for PGAS PUT and GET operations. The cache is implemented as a software write-back cache with dirty bits, local memory consistency operations, and programmer-guided prefetch. This cache supports programmer productivity while enabling communication aggregation and overlap. We evaluated an implementation of this cache for remote data within the Chapel programming language. The cache provides a 2x speedup for several distributed-memory application benchmarks written in Chapel across a variety of network configurations. In addition, we observed that improvements to compiler optimization did not remove the benefit of the cache.
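The core mechanism, a write-back cache whose dirty lines are flushed at a consistency point, can be sketched in a few lines of Python. This is a minimal model, not the Chapel implementation: a dict stands in for remote memory, and all names are illustrative.

```python
# Minimal sketch of a write-back cache for PGAS-style PUT/GET.
# `remote` (a dict) stands in for another locale's memory; names are illustrative.

class PutGetCache:
    def __init__(self, remote):
        self.remote = remote      # stands in for remote (another node's) memory
        self.lines = {}           # addr -> locally cached value
        self.dirty = set()        # addrs modified locally but not written back

    def get(self, addr):
        if addr not in self.lines:           # miss: fetch from remote once
            self.lines[addr] = self.remote[addr]
        return self.lines[addr]             # hit: no communication

    def put(self, addr, value):
        self.lines[addr] = value             # write locally only
        self.dirty.add(addr)                 # mark line for later write-back

    def fence(self):
        # Memory-consistency point: write back all dirty lines at once,
        # aggregating many small PUTs instead of issuing one per store.
        for addr in self.dirty:
            self.remote[addr] = self.lines[addr]
        self.dirty.clear()

remote = {0: 10, 1: 20}
cache = PutGetCache(remote)
cache.put(0, 99)          # buffered locally; remote[0] is still 10
cache.fence()             # write-back; remote[0] becomes 99
```

The aggregation at the fence is what lets many fine-grained PUTs travel as fewer, larger messages; the actual Chapel cache additionally overlaps these transfers with computation and supports programmer-guided prefetch.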
ISBN: (Print) 9781450300445
The Message Passing Interface (MPI) is one of the most widely used programming models for parallel computing. However, the amount of memory available to an MPI process is limited by the amount of local memory within a compute node. Partitioned global address space (PGAS) models such as Unified Parallel C (UPC) are growing in popularity because of their ability to provide a shared global address space that spans the memories of multiple compute nodes. However, taking advantage of UPC can require a large recoding effort for existing parallel applications. In this paper, we explore a new hybrid parallel programming model that combines MPI and UPC. This model allows MPI programmers incremental access to a greater amount of memory, enabling memory-constrained MPI codes to process larger data sets. In addition, the hybrid model offers UPC programmers an opportunity to create static UPC groups that are connected over MPI. As we demonstrate, the use of such groups can significantly improve the scalability of locality-constrained UPC codes. This paper presents a detailed description of the hybrid model and demonstrates its effectiveness in two applications: a random access benchmark and the Barnes-Hut cosmological simulation. Experimental results indicate that the hybrid model can greatly enhance performance; using hybrid UPC groups that span two cluster nodes, RA performance increases by a factor of 1.33, and using groups that span four cluster nodes, Barnes-Hut experiences a twofold speedup at the expense of a 2% increase in code size.
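The memory argument behind the hybrid model can be made concrete with a toy Python model: ranks are partitioned into static groups, every member of a group addresses one pooled shared heap (the UPC part), and groups exchange data by explicit messages (the MPI part). All names and sizes here are illustrative, not from the paper.

```python
# Toy model of the hybrid MPI+UPC grouping idea; sizes are illustrative.

PER_NODE_MEM = 4  # memory units each rank's node contributes

class HybridGroup:
    """A static UPC group: all member ranks address one shared heap."""
    def __init__(self, ranks):
        self.ranks = ranks
        self.shared_heap = {}                 # addressable by every member

    def addressable_memory(self):
        # Each member can address the pooled memory of the whole group.
        return PER_NODE_MEM * len(self.ranks)

def make_groups(n_ranks, group_size):
    """Partition ranks 0..n_ranks-1 into static groups of group_size."""
    return [HybridGroup(list(range(start, start + group_size)))
            for start in range(0, n_ranks, group_size)]

# 8 MPI ranks, with UPC groups spanning 2 nodes each:
groups = make_groups(8, 2)
mem_per_rank = groups[0].addressable_memory()   # 2 * PER_NODE_MEM
```

Under pure MPI each process addresses only its node's 4 units; grouping two nodes doubles the data set a memory-constrained code can hold, which is the "incremental access to a greater amount of memory" the abstract describes.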
Structured grid linear solvers often require manual packing and unpacking of communication data to achieve high performance. Performing this process efficiently is challenging, labor-intensive, and potentially error-prone. In this paper, we explore an alternative approach that communicates the data with naturally grained message sizes, without manual packing and unpacking. This approach is the distributed analogue of shared-memory programming, taking advantage of the global address space in PGAS languages to provide substantial programming ease. However, its performance may suffer from the large number of small messages. We investigate the runtime support required in the UPC++ library for this naturally grained version to close the performance gap between the two approaches and attain comparable performance at scale, using the High-Performance Geometric Multigrid (HPGMG-FV) benchmark as a driver.
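The trade-off between the two approaches can be sketched by counting messages for a ghost-zone exchange. In this Python model a list of message sizes stands in for the network, and the face size is an illustrative choice:

```python
# Sketch of the packing trade-off: one aggregated message vs. naturally
# grained messages for the same ghost face. The "network" is a list of
# message sizes; dimensions are illustrative.

def send(msg_log, payload):
    """Record one message of len(payload) elements on the mock network."""
    msg_log.append(len(payload))

def exchange_packed(face, msg_log):
    # Manual approach: copy the ghost face into one buffer, send once.
    buffer = [v for row in face for v in row]   # pack
    send(msg_log, buffer)

def exchange_fine_grained(face, msg_log):
    # Naturally grained approach: each row goes as its own small message,
    # as stores through a global address space would naturally generate.
    for row in face:
        send(msg_log, row)

face = [[1.0] * 16 for _ in range(16)]   # a 16x16 ghost face
packed, fine = [], []
exchange_packed(face, packed)            # 1 message of 256 elements
exchange_fine_grained(face, fine)        # 16 messages of 16 elements
```

Both variants move the same 256 elements, but the naturally grained version trades away packing code for 16x the message count; the runtime support studied in the paper aims to close the resulting performance gap inside UPC++.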
Partitioned global address space (PGAS) and one-sided communication models allow shared data to be transparently and asynchronously accessed by any process within a parallel computation. In order to ensure that updates are performed in the intended order, the programmer must either use potentially slower ordered communication or perform operations that order unordered communication, such as a fence in the OpenSHMEM model. Often, implementations of such ordering mechanisms require blocking until pending operations have completed remotely before allowing new operations to be issued. In this work, we present a new queuing technique for the implementation of one-sided communication ordering that is nonblocking and ensures asynchronous progress for pending communication operations. We describe an implementation of this approach that uses Portals triggered operations to offload queuing of communication operations across ordering boundaries. By eliminating blocking for ordered communication, this approach is able to provide automatic overlap of communication and computation. We demonstrate the benefit of this technique on several applications and measure performance improvements in the 10%-15% range from allowing computation to progress while ordered communication operations are pending.
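The queuing idea can be modeled in a few lines of Python: a fence records an ordering boundary and returns immediately, and puts issued after it are queued and released once earlier puts complete, instead of the caller blocking at the fence. This is a toy single-target model with illustrative names, not the Portals implementation.

```python
# Toy model of nonblocking ordering for one-sided puts: fence() returns
# immediately, and post-fence puts are deferred until earlier puts complete
# (mimicking queuing offloaded to triggered operations). Names illustrative.

class NonblockingOrderedChannel:
    def __init__(self):
        self.in_flight = 0      # puts issued before the current boundary
        self.deferred = []      # puts queued behind the boundary
        self.fenced = False
        self.delivered = []     # completion order observed at the target

    def put(self, op):
        if self.fenced and self.in_flight > 0:
            self.deferred.append(op)     # queue; do NOT block the caller
        else:
            self.in_flight += 1
            self.delivered.append(op)    # issued immediately

    def fence(self):
        self.fenced = True               # just records the boundary

    def on_complete(self):
        # Invoked as an earlier put completes remotely; once all have, the
        # queued puts are released in order -- the role triggered operations
        # play in the offloaded implementation.
        self.in_flight -= 1
        if self.in_flight == 0:
            self.fenced = False
            pending, self.deferred = self.deferred, []
            for op in pending:
                self.put(op)

ch = NonblockingOrderedChannel()
ch.put("A")
ch.fence()        # nonblocking: the caller keeps computing
ch.put("B")       # queued behind the fence, not yet delivered
ch.on_complete()  # "A" completes; "B" is released, preserving order
```

Because the caller never waits at the fence, computation proceeds while "A" is in flight; that overlap is the source of the 10%-15% improvements the abstract reports.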