ISBN (Print): 9781728141947
OpenSHMEM is one of the key programming models for High Performance Computing (HPC) applications with irregular communication patterns. In particular, it is useful for problems that cannot be decomposed easily, such as graph partitioning. The programming model supports Remote Memory Access (RMA), atomics, and collective operations. In this paper, we explore and evaluate the In-network Computing approach for accelerating the OpenSHMEM collective operations, particularly the barrier, broadcast, and reduction operations. To achieve acceleration, In-network Computing leverages hardware engines on the networking elements together with software that can efficiently use these capabilities. We explore the value of this approach for collective operations on InfiniBand Host Channel Adapters (HCAs) and switches. In particular, we focus on the recently introduced collective offload feature provided by the Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)(TM) capability, which accelerates the barrier and reduction operations; the multicast capability accelerates the broadcast collective operation. To leverage the hardware capabilities, we complement them with an effective software stack that includes the Hierarchical Collectives (HCOLL) library and the SHARP layer. Our evaluation on Oak Ridge National Laboratory (ORNL)'s Summit system, the fastest supercomputer on the June 2019 Top 500 list, shows that the hardware and software acceleration in the In-network Computing approach is key to achieving the performance and scalability required for collectives and applications. For a 5,120-process OpenSHMEM job, our results show that the barrier operation is 710% faster, broadcast is 370% faster, and the reduction operation is 10 times faster when compared with an implementation of collective operations with no acceleration. Further, experiments with a 2D-Heat kernel show that the In-network Computing approach is very effective for real-world applications.
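For contrast with the hardware offload described in this abstract, a purely software broadcast is typically a log-depth tree of point-to-point sends. The sketch below (illustrative only, not the paper's implementation) computes the send schedule of a binomial-tree broadcast among n ranks rooted at rank 0; each round doubles the set of ranks holding the data, which is the per-message latency that switch-level multicast removes.

```python
# Sketch of a binomial-tree broadcast schedule: the software baseline
# that In-network multicast replaces. Illustrative, not SHARP/HCOLL code.

def binomial_bcast_schedule(n):
    """Return per-round (sender, receiver) pairs for a root-0 broadcast."""
    rounds = []
    have = [0]                  # ranks that already hold the data
    dist = 1
    while dist < n:
        sends = [(r, r + dist) for r in have if r + dist < n]
        rounds.append(sends)
        have += [dst for _, dst in sends]
        dist *= 2
    return rounds

for i, sends in enumerate(binomial_bcast_schedule(6)):
    print(f"round {i}: {sends}")
# round 0: [(0, 1)]
# round 1: [(0, 2), (1, 3)]
# round 2: [(0, 4), (1, 5)]
```

The schedule has ceil(log2 n) rounds, so even an optimal software tree pays a latency proportional to log n in network hops, while an offloaded multicast delivers the payload in what is effectively a single traversal of the switch hierarchy.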
ISBN (Digital): 9783319062518
ISBN (Print): 9783319062501; 9783319062518
Concurrent Kleene algebras were introduced by Hoare, Moller, Struth and Wehrman in [HMSW09, HMSW09a, HMSW11] as idempotent bisemirings that satisfy a concurrency inequation and have a Kleene star for both sequential and concurrent composition. Kleene algebras with tests (KAT) were defined earlier by Kozen and Smith [KS97]. Concurrent Kleene algebras with tests (CKAT) combine these concepts and give a relatively simple algebraic model for reasoning about the operational semantics of concurrent programs. We generalize guarded strings to guarded series-parallel strings, or gsp-strings, to provide a concrete language model for CKAT. Combining nondeterministic guarded automata [Koz03] with the branching automata of Lodaya and Weil [LW00], one obtains a model for processing gsp-strings in parallel, and hence an operational interpretation for CKAT. For gsp-strings that are simply guarded strings, the model works like an ordinary nondeterministic guarded automaton. If the test algebra is assumed to be {0, 1}, the language model reduces to the regular sets of bounded-width sp-strings of Lodaya and Weil. Since the concurrent composition operator distributes over join, it can also be added to relation algebras with transitive closure to obtain the variety CRAT. We provide semantics for these algebras in the form of coalgebraic arrow frames expanded with concurrency.
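The guarded strings that the gsp-strings above generalize have a concrete, easily computed sequential composition: the fusion product, where two strings compose only when the ending atom of the first matches the starting atom of the second. A minimal sketch (for plain guarded strings only, not the series-parallel case):

```python
# Guarded strings as tuples alternating atoms and actions, starting and
# ending with an atom, e.g. ('a', 'p', 'b'). Sketch of the KAT language
# model's sequential composition; the CKAT gsp-string model adds
# parallel (series-parallel) structure on top of this.

def fuse(x, y):
    """Fusion product x ⋄ y: defined only when the last atom of x
    equals the first atom of y; the shared atom is written once."""
    if x[-1] != y[0]:
        return None  # undefined: atoms do not match
    return x + y[1:]

def fuse_sets(X, Y):
    """Lift fusion to sets of guarded strings, i.e. language composition."""
    return {z for x in X for y in Y if (z := fuse(x, y)) is not None}

X = {('a', 'p', 'b')}
Y = {('b', 'q', 'a'), ('c', 'q', 'a')}
print(fuse_sets(X, Y))  # only the string starting with atom 'b' fuses
```

This partiality of composition is exactly what lets guarded-string languages model tests: a program's run can only continue through a state (atom) that the next fragment also accepts.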
ISBN (Print): 9781450364393
Optimizing the performance of complex simulation codes with high computational demands, such as Octo-Tiger, is an ongoing challenge. Octo-Tiger is an astrophysics code that simulates the evolution of star systems based on the fast multipole method, using adaptive octrees as the central data structure. Octo-Tiger was implemented using high-level C++ libraries, specifically HPX and Vc, which allow its use on different hardware platforms. Recently, we have demonstrated excellent scalability in a distributed setting. In this paper, we study the node-level performance of Octo-Tiger on an Intel Knights Landing platform. We focus on Octo-Tiger's fast multipole method, as it is the computationally most demanding component. By using HPX and a futurization approach, we can efficiently traverse the adaptive octrees in parallel. At the core level, threads process sub-grids using multiple 1074-element stencils. In numerical experiments simulating the time evolution of a rotating star on an Intel Xeon Phi 7250 Knights Landing processor, Octo-Tiger shows good parallel efficiency. For the fast multipole algorithm, we achieved up to 408 GFLOPS, resulting in a speedup of 2x compared to a 24-core Skylake-SP platform, using the same high-level abstractions.
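The "futurization" idea mentioned above can be sketched with ordinary futures: each tree node's work becomes a task whose result depends on the futures of its children, so an adaptive tree is traversed in parallel without explicit barriers. This is a toy model using Python's standard library, not HPX; the tree here is a generic (value, children) structure standing in for an octree.

```python
# Sketch of futurized tree traversal: child subtrees are launched
# before the parent task, and the parent blocks only on its own
# children's futures. Toy model; not HPX. Note the pool must have
# enough workers for the tasks that block on children (4 suffices
# for this small tree).

from concurrent.futures import ThreadPoolExecutor

def traverse(pool, node):
    """Return a future for the subtree sum rooted at `node`.

    `node` is (value, [children]) -- a stand-in for an octree cell.
    """
    value, children = node
    child_futures = [traverse(pool, c) for c in children]  # fan out first
    return pool.submit(
        lambda: value + sum(f.result() for f in child_futures))

tree = (1, [(2, []), (3, [(4, [])])])
with ThreadPoolExecutor(max_workers=4) as pool:
    total = traverse(pool, tree).result()
print(total)  # 10
```

HPX avoids the worker-starvation caveat noted in the comment by suspending lightweight tasks rather than blocking OS threads, which is what makes futurization practical at scale.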
ISBN (Print): 9781450388160
Cutting-edge functionalities in embedded systems require the use of parallel architectures to meet their performance requirements. This imposes the introduction of a new layer in the software stacks of embedded systems: the parallel programming model. Unfortunately, the tools used to analyze embedded systems fall short of characterizing the performance of parallel applications at the parallel programming model level and correlating it with information about non-functional requirements such as real-time behavior, energy, memory usage, etc. HPC tools, like Extrae, are designed with that level of abstraction in mind, but their main focus is on performance evaluation. Overall, providing insightful information about the performance of parallel embedded applications at the parallel programming model level, and relating it to the non-functional requirements, is of paramount importance to fully exploit the performance capabilities of parallel embedded architectures. This paper contributes to the state of the art of analysis tools for embedded systems by: (1) analyzing the particular constraints of embedded systems compared to HPC systems (e.g., static setting, restricted memory, limited drivers) with respect to supporting HPC analysis tools; (2) porting Extrae, a powerful tracing tool from the HPC domain, to the GR740 platform, a SoC used in the space domain; and (3) augmenting Extrae with new features needed to correlate the parallel execution with the following non-functional requirements: energy, temperature, and memory usage. Finally, the paper demonstrates the usefulness of Extrae in characterizing OpenMP applications and their non-functional requirements, evaluating different aspects of the applications running on the GR740.
ISBN (Print): 9781450388429
We present our design and implementation of a runtime for the Space Consistency model. The Space Consistency model is a generalized form of the full-empty bit synchronization for distributed memory programming, where a memory region is associated with a counter that determines its consistency and readiness for consumption. The model allows for efficient implementation of point-to-point data transfers and collective communication primitives as well. We present the interface design, implementation details, and performance results on Cray XC systems. Our runtime adopts a reduced API design to provide low-overhead initiation and processing of communication primitives, enable threaded execution of runtime functions, and provide efficient pipelining, thus improving the computation-communication overlap. We show the performance benefits of using this runtime both at the microbenchmark level and in application settings.
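The counter idea behind the Space Consistency model described above can be sketched in a few lines: a memory region is paired with a counter, and a consumer treats the region as ready only once the counter reaches an expected value, generalizing a single full-empty bit to multiple incoming transfers. The names below (Region, deliver, wait_ready) are hypothetical; the actual runtime exposes a reduced C API on Cray XC systems.

```python
# Toy sketch of counter-based readiness, a generalized full-empty bit:
# each incoming transfer bumps the counter, and the consumer blocks
# until the expected number of fragments has landed.

import threading

class Region:
    def __init__(self, nbytes):
        self.buf = bytearray(nbytes)
        self.count = 0                      # generalized full-empty bit
        self.cv = threading.Condition()

    def deliver(self, offset, data):
        """One incoming transfer: write a fragment, bump the counter."""
        self.buf[offset:offset + len(data)] = data
        with self.cv:
            self.count += 1
            self.cv.notify_all()

    def wait_ready(self, expected):
        """Block until `expected` fragments have landed."""
        with self.cv:
            self.cv.wait_for(lambda: self.count >= expected)
        return bytes(self.buf)

region = Region(8)
# Two producers each deliver half of the region.
t1 = threading.Thread(target=region.deliver, args=(0, b"ABCD"))
t2 = threading.Thread(target=region.deliver, args=(4, b"EFGH"))
t1.start(); t2.start()
data = region.wait_ready(expected=2)   # consume only when both halves arrived
t1.join(); t2.join()
print(data)  # b'ABCDEFGH'
```

Because readiness is a count rather than a flag, the same mechanism covers both a single point-to-point transfer (expected=1) and the arrival of contributions from many peers, which is what makes collectives expressible in the same model.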
ISBN (Print): 9781467323703; 9781467323727
The use of ghost regions is a common feature of many distributed grid applications. A ghost region holds local read-only copies of remotely-held boundary data, which are exchanged and cached many times over the course of a computation. X10 is a modern parallel programming language intended to support productive development of distributed applications. X10 supports the "active message" paradigm, which combines data transfer and computation in one-sided communications. A central feature of X10 is the distributed array, which distributes array data across multiple places, providing standard read and write operations as well as powerful high-level operations. We used active messages to implement ghost region updates for X10 distributed arrays using two different update algorithms. Our implementation exploits multiple levels of parallelism and avoids global synchronization; it also supports split-phase ghost updates, which allow overlapping computation and communication. We compare the performance of these algorithms on two platforms: an Intel x86-64 cluster over QDR InfiniBand, and a Blue Gene/P system, using both stand-alone benchmarks and an example computational chemistry application code. Our results suggest that on a dynamically threaded architecture, a ghost region update using only pairwise synchronization exhibits superior scaling to an update that uses global collective synchronization.
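The ghost-region pattern described above can be illustrated with a single-process sketch of a 1-D block-distributed array: each "place" owns an interior block plus one read-only ghost cell on each side, refreshed from its neighbours before every stencil step. Real X10 ghost updates use one-sided active messages across places; this sketch just copies between Python lists under a periodic boundary assumption.

```python
# Toy 1-D ghost exchange: each block is [ghost_left, interior..., ghost_right].
# Only ghost cells (indices 0 and -1) are written; interior cells are
# only read, so the in-place update is safe.

def exchange_ghosts(blocks):
    """Refresh ghost cells from neighbouring blocks (periodic boundary)."""
    n = len(blocks)
    for i, b in enumerate(blocks):
        b[0] = blocks[(i - 1) % n][-2]   # left ghost <- neighbour's last interior
        b[-1] = blocks[(i + 1) % n][1]   # right ghost <- neighbour's first interior

# Three places, each owning two interior cells (ghosts initialised to 0).
blocks = [[0, 1, 2, 0], [0, 3, 4, 0], [0, 5, 6, 0]]
exchange_ghosts(blocks)
print(blocks[1])  # [2, 3, 4, 5]: ghosts now mirror the neighbours
```

After the exchange, each place can apply a stencil to its interior using only local reads, which is precisely why the update can be split-phase: the copy is initiated, interior-only work proceeds, and the ghost-dependent boundary work waits only for its own pairwise transfers.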
ISBN (Print): 9781450319225
Recently, graph computation has emerged as an important class of high-performance computing application whose characteristics differ markedly from those of traditional, compute-bound, kernels. Libraries such as BLAS, LAPACK, and others have been successful in codifying best practices in numerical computing. The data-driven nature of graph applications necessitates a more complex application stack incorporating runtime optimization. In this paper, we present a method of phrasing graph algorithms as collections of asynchronous, concurrently executing, concise code fragments which may be invoked both locally and in remote address spaces. A runtime layer performs a number of dynamic optimizations, including message coalescing, message combining, and software routing. Practical implementations and performance results are provided for a number of representative algorithms.
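Message coalescing, one of the dynamic runtime optimizations mentioned above, is straightforward to sketch: instead of sending each small active message immediately, the runtime buffers messages per destination and flushes a whole batch at once, amortizing per-message overhead. In the sketch below, `transport` stands in for the real network layer; the class and parameter names are illustrative, not the paper's API.

```python
# Sketch of per-destination message coalescing for fine-grained
# active messages: buffer until a batch fills, flush at phase ends.

class Coalescer:
    def __init__(self, transport, batch=4):
        self.transport = transport        # callable(dest, list_of_msgs)
        self.batch = batch
        self.pending = {}                 # dest -> buffered messages

    def send(self, dest, msg):
        buf = self.pending.setdefault(dest, [])
        buf.append(msg)
        if len(buf) >= self.batch:        # full batch: one wire message
            self.transport(dest, self.pending.pop(dest))

    def flush(self):
        """Push out all partial batches (e.g. at a phase boundary)."""
        for dest, buf in self.pending.items():
            self.transport(dest, buf)
        self.pending.clear()

sent = []
c = Coalescer(lambda d, msgs: sent.append((d, msgs)), batch=2)
for m in range(3):
    c.send(dest=7, msg=m)
c.flush()
print(sent)  # [(7, [0, 1]), (7, [2])]
```

Message combining and software routing extend the same buffer: combining merges messages addressed to the same remote object (e.g. summing increments), and routing forwards batches through intermediate nodes to keep the number of active destinations per node small.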
We argue that programming high-end stream-processing applications requires a form of coordination language that enables the designer to represent interactions between stream-processing functions asynchronously. We further argue that the level of abstraction that current programming tools engender should be drastically increased and present a coordination language and component technology that is suitable for that purpose. We demonstrate our approach on a real radar-data processing application from which we reuse all existing components and present speed-ups that we were able to achieve on contemporary multi-core hardware. (C) 2010 Published by Elsevier Ltd.
ISBN (Print): 9798400704352
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes, in the context of a mostly message passing interface enhanced with the ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and (b) by leveraging lock-free data structures in shared memory to achieve high-performance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups up to 2.1x on the CoMD molecular dynamics and miniAMR adaptive mesh refinement applications, scaling up to 4,096 cores.
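Mechanism (a) above, stealing work while waiting on a message, can be shown with a deterministic sketch: the "message" is simulated by a counter that becomes ready after a few polls, and between polls the rank drains tasks from a shared queue instead of spinning idle. All names (recv_with_stealing, message_ready_after) are illustrative, not Pure's API.

```python
# Deterministic sketch of stealing work while blocked on a receive:
# each failed poll of the (simulated) message executes one stolen task.

from collections import deque

def recv_with_stealing(message_ready_after, task_queue, results):
    """Poll for a message; between polls, execute stolen tasks."""
    polls = 0
    while polls < message_ready_after:     # message not yet arrived
        if task_queue:
            task = task_queue.popleft()    # steal one unit of work
            results.append(task())
        polls += 1
    return "payload"                       # message finally arrives

tasks = deque([lambda i=i: i * i for i in range(5)])
done = []
msg = recv_with_stealing(message_ready_after=3, task_queue=tasks, results=done)
print(msg, done)   # 'payload' [0, 1, 4] -- three tasks ran while waiting
print(len(tasks))  # 2 tasks left for other cores
```

In a real runtime the queue is a concurrent deque shared between ranks on a node, so the cycles a rank would burn polling the network are converted into application progress.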
ISBN (Print): 9781450315616
The data-triggered threads (DTT) programming and execution model can increase parallelism and eliminate redundant computation. However, the initial proposal requires significant architecture support, which prevents existing applications and architectures from taking advantage of this model. This work proposes a pure software solution that supports the DTT model without any hardware support. This research uses a prototype compiler and runtime libraries running on top of existing machines. Several enhancements to the initial software implementation are presented, which further improve the performance. The software runtime system improves the performance of serial C SPEC benchmarks by 15% on a Nehalem processor, but by over 7X on the full suite of single-thread applications. It is shown that the DTT model can work in conjunction with traditional parallelism. The DTT model provides up to 64X speedup over parallel applications exploiting traditional parallelism.
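The redundancy elimination at the heart of the DTT model can be sketched in a few lines: a dependent computation reruns only when a write actually changes the triggering data, so "silent" writes of the same value skip the computation entirely. The class and method names below are illustrative; the paper's system does this via a compiler and runtime, not a wrapper object.

```python
# Toy sketch of data-triggered recomputation: a write that leaves the
# value unchanged returns the cached result instead of re-triggering
# the dependent computation.

class Triggered:
    def __init__(self, compute):
        self.compute = compute
        self.value = None
        self.result = None
        self.runs = 0            # how many times the computation fired

    def write(self, value):
        if value == self.value:  # silent store: skip redundant work
            return self.result
        self.value = value
        self.runs += 1           # "trigger" the dependent computation
        self.result = self.compute(value)
        return self.result

cell = Triggered(lambda x: sum(range(x)))
print(cell.write(10))  # 45 -- computed
print(cell.write(10))  # 45 -- same value written: computation skipped
print(cell.write(11))  # 55 -- value changed: recomputed
print(cell.runs)       # 2
```

The hardware proposal detected such silent stores and spawned threads automatically; the software version described above gets the same effect with compiler-inserted checks and runtime support on stock processors.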